TRANSCRIPT
Why Smart Storage?
➢ Central control of critical data
➤ One central resource to fail over in disaster planning
➤ Banks, trading floors, airlines want zero downtime
➢ Smart storage is shared by all hosts & OSes
➤ Amortize the costs of high availability and disaster planning over all of your hosts
➤ Use different OSes for different jobs (UNIX for the web, IBM mainframes for data processing)
➤ Zero-time “transfer” from host to host when both are connected
➤ Enables cluster file systems
Data Center Storage Systems
➢ Change the way you think of storage
➤ Shared Connectivity Model
➤ “Magic” Disks
➤ Scales to new capacity
➤ Storage that runs for years at a time
➢ Symmetrix case study
➤ Symmetrix 8000 Architecture
➤ Symmetrix Applications
➢ Data center class operating systems
Traditional Model of Connectivity
➢ Direct Connect
➤ Disk attached directly to host
➤ Private - OS controls access and provides security
➤ Storage I/O traffic only
• Separate system used to support network I/O (networking, web browsing, NFS, etc)
Shared Models of Connectivity
➢ VMS Cluster
➤ Shared disk & partitions
➤ Same OS on each node
➤ Scales to dozens of nodes
➢ IBM Mainframes
➤ Shared disk & partitions
➤ Same OS on each node
➤ Handful of nodes
➢ Network Disks
➤ Shared disk/private partition
➤ Same OS
➤ Raw/block access via network
➤ Handful of nodes
New Models of Connectivity
➢ Every host in a data center could be connected to the same storage system
➤ Heterogeneous OS & data format (CKD & FBA)
➤ Management challenge: no central authority to provide access control
[Diagram: shared storage connected to hosts running IRIX, DGUX, FreeBSD, MVS, VMS, Linux, Solaris, HPUX, and NT]
Magic Disks
➢ Instant copy
➤ Devices, files or databases
➢ Remote data mirroring
➤ Metropolitan area
➤ 100’s of kilometers
➢ 1000’s of virtual disks
➤ Dynamic load balancing
➢ Behind-the-scenes backup
➤ No host involved
Scalable Storage Systems
➢ Current systems support
➤ 10’s of terabytes
➤ Dozens of SCSI, Fibre Channel, ESCON channels per host
➤ Highly available (years of run time)
➤ Online code upgrades
➤ Potentially 100’s of hosts connected to the same device
➢ Support for chaining storage boxes together locally or remotely
Longevity
➢ Data should be forever
➤ Storage needs to overcome network failures, power failures, blizzards, asteroid strikes …
➤ Some boxes have run for over 5 years without a reboot or halt of operations
➢ Storage features
➤ No single point of failure inside the box
➤ At least 2 connections to a host
➤ Online code upgrades and patches
➤ Call home on error, ability to fix field problems without disruptions
➤ Remote data mirroring for real disasters
Symmetrix Architecture
➢ 32 PowerPC 750-based “directors”
➢ Up to 32 GB of central “cache” for user data
➢ Support for SCSI, Fibre Channel, ESCON, …
➢ 384 drives (over 28 TB with 73 GB units)
Prefetch is Key
➢ Read hit gets RAM speed, read miss gets spindle speed
➢ What helps cached storage array performance?
➤ Contiguous allocation of files (extent-based file systems) preserves the logical-to-physical mapping
➤ Hints from the host could help prediction
➢ What might hurt performance?
➤ Clustering small, unrelated writes into contiguous blocks (foils prefetch on later read of the data)
➤ Truly random read I/Os
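The hit/miss economics above can be sketched with a toy cache. This is a hypothetical model: block numbering, prefetch depth, and the policy are illustrative, not actual Symmetrix cache internals.

```python
# Toy model of a cached storage array with sequential prefetch.
import random

class PrefetchCache:
    def __init__(self, prefetch_depth=4):
        self.cache = set()                  # block numbers currently cached
        self.prefetch_depth = prefetch_depth
        self.hits = 0                       # served at RAM speed
        self.misses = 0                     # served at spindle speed

    def read(self, block):
        if block in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache.add(block)
        # Prefetch the next few blocks, betting on sequential access.
        for b in range(block + 1, block + 1 + self.prefetch_depth):
            self.cache.add(b)

seq = PrefetchCache()
for block in range(100):                    # contiguous file: prefetch wins
    seq.read(block)

rnd = PrefetchCache()
random.seed(1)
for _ in range(100):                        # random I/O: prefetch is wasted
    rnd.read(random.randrange(10**6))

print("sequential:", seq.hits, seq.misses)  # almost all hits
print("random:    ", rnd.hits, rnd.misses)  # almost all misses
```

The sequential pass misses only the first block; the random pass almost never benefits from the staged blocks, which is exactly why truly random read I/Os hurt.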
Symmetrix Applications
➢ Instant copy
➤ TimeFinder
➢ Remote data copy
➤ SRDF (Symmetrix Remote Data Facility)
➢ Serverless backup and restore
➤ Fastrax
➢ Mainframe & UNIX data sharing
➤ IFS
Business Continuance Problem: “Normal” Daily Operations Cycle
[Diagram: the online day stops at 2 am for backup/DSS and resumes at 6 am — 4 hours of data inaccessibility, a “race to sunrise”]
➢ Creation and control of a copy of any active application volume
➢ Capability to allow the new copy to be used by another application or system
➤ Continuous availability of production data during backups, decision support, batch queries, DW loading, Year 2000 testing, application testing, etc.
➢ Ability to create multiple copies of a single application volume
➢ Non-disruptive re-synchronization when the second application is complete
TimeFinder
[Diagram: production application volumes mirrored to Business Continuance Volumes (BCVs); a BCV is a copy of real production data, used for backups, decision support, data warehousing, Euro conversion, etc.]
Business Continuance Volumes
➢ A Business Continuance Volume (BCV) is created and controlled at the logical volume level
➢ Physical drive sizes can be different, logical size must be identical
➢ Several ACTIVE copies of data at once per Symmetrix
Using TimeFinder
➢ Establish BCV
➢ Stop transactions to clear buffers
➢ Split BCV
➢ Start transactions
➢ Execute against BCVs
➢ Re-establish BCV
[Diagram: standard volume mirrors M1/M2 with an attached BCV]
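The steps above can be sketched as a small state machine. This is a hypothetical sketch: a volume is modeled as a dict of track → data, and a full copy stands in for the real track-level synchronization.

```python
# Hypothetical sketch of the TimeFinder BCV lifecycle.

class BCVPair:
    def __init__(self, std):
        self.std = std                 # standard (production) volume
        self.bcv = None                # business continuance volume
        self.state = "unattached"

    def establish(self):
        self.bcv = dict(self.std)      # synchronize STD -> BCV
        self.state = "established"     # BCV mirrors STD, hidden from hosts

    def split(self):
        assert self.state == "established"
        self.state = "split"           # frozen point-in-time copy, usable

    def reestablish(self):
        assert self.state == "split"
        self.bcv = dict(self.std)      # real arrays copy only changed tracks
        self.state = "established"

prod = {"track0": "sales-db"}
pair = BCVPair(prod)
pair.establish()                       # 1. establish BCV
# 2. stop transactions so host buffers are flushed (not modeled)
pair.split()                           # 3. split the BCV
prod["track0"] = "sales-db-v2"         # 4. transactions resume on production
snapshot = pair.bcv["track0"]          # 5. backup/DSS executes against BCV
pair.reestablish()                     # 6. re-establish for the next cycle
print(snapshot)                        # the point-in-time image survived
```

Note how production writes after the split do not disturb the snapshot the second application sees.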
Re-Establishing a BCV Pair
■ BCV pair “PROD” and “BCV” have been split
■ Tracks on “PROD” updated after split
■ Tracks on “BCV” updated after split
■ Symmetrix keeps a table of these “invalid” tracks after the split
■ At re-establish of the BCV pair, “invalid” tracks are written from “PROD” to “BCV”
■ Synch complete
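The invalid-track bookkeeping can be sketched as follows; it covers both sync directions (re-establish copies PROD over BCV, restore copies BCV over PROD). Data structures and track values are illustrative, not the real on-array tables.

```python
# Hypothetical sketch: after a split, writes on either side mark tracks
# "invalid" in a table; synchronization copies only those tracks, one way.

def sync_pair(prod, bcv, prod_dirty, bcv_dirty, direction):
    invalid = prod_dirty | bcv_dirty               # tracks that now differ
    src, dst = (prod, bcv) if direction == "reestablish" else (bcv, prod)
    for track in invalid:
        dst[track] = src[track]                    # copy invalid tracks only
    return sorted(invalid)

prod = {0: "p0", 1: "p1-new", 2: "p2"}             # track 1 updated on PROD
bcv  = {0: "p0", 1: "p1",     2: "b2-new"}         # track 2 updated on BCV
copied = sync_pair(prod, bcv, {1}, {2}, "reestablish")
print(copied, bcv == prod)                         # only tracks 1 and 2 move
```

Because only the invalid tracks are copied, a re-establish after a short split is far cheaper than the initial full establish.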
[Diagram: split BCV pair with updated tracks on both PROD and BCV; re-establishing the pair overwrites the invalid tracks on BCV with PROD’s data]
Restore a BCV Pair
■ BCV pair “PROD” and “BCV” have been split
■ Tracks on “PROD” updated after split
■ Tracks on “BCV” updated after split
■ Symmetrix keeps a table of these “invalid” tracks after the split
■ At restore of the BCV pair, “invalid” tracks are written from “BCV” to “PROD”
■ Synch complete
[Diagram: split BCV pair with updated tracks on both sides; restoring the pair overwrites the invalid tracks on PROD with BCV’s data]
Make as Many Copies as Needed
■ Establish BCV 1
■ Split BCV 1
■ Establish BCV 2
■ Split BCV 2
■ Establish BCV 3
[Diagram: production volume (M1/M2) with BCV 1 split at 4 PM, BCV 2 split at 5 PM, and BCV 3 established at 6 PM]
The Purpose of SRDF
➢ Local data copies are not enough
➢ Maximalist
➤ Provide a remote copy of the data that will be as usable after a disaster as the primary copy would have been
➢ Minimalist
➤ Provide a means for generating periodic physical backups of the data
Synchronous Data Mirroring
➢ Write is received from the host into the cache of the source
➢ I/O is transmitted to the cache of the target
➢ ACK is provided by the target back to the cache of the source
➢ Ending status is presented to the host
➢ Symmetrix systems destage writes to disk
➢ Useful for disaster recovery
Semi-Synchronous Mirroring
➢ An I/O write is received from the host/server into the cache of the source
➢ Ending status is presented to the host/server.
➢ I/O is transmitted to the cache of the target
➢ ACK is sent by the target back to the cache of the source
➢ Each Symmetrix system destages writes to disk
➢ Useful for adaptive copy
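The two orderings differ only in when the host sees ending status relative to the remote ACK; the steps come from the slides above, while the event-log strings are illustrative.

```python
# Sketch of the synchronous vs. semi-synchronous SRDF write orderings.

def synchronous_write(log):
    log.append("write into source cache")
    log.append("I/O transmitted to target cache")
    log.append("ACK from target to source")
    log.append("ending status to host")     # host waits for the remote ACK

def semi_synchronous_write(log):
    log.append("write into source cache")
    log.append("ending status to host")     # host continues immediately
    log.append("I/O transmitted to target cache")
    log.append("ACK from target to source")

sync_log, semi_log = [], []
synchronous_write(sync_log)
semi_synchronous_write(semi_log)
# Both modes destage cached writes to disk later; the trade-off is host
# write latency (synchronous) versus a window where the target lags
# behind the source (semi-synchronous).
print(sync_log.index("ending status to host"))   # last step
print(semi_log.index("ending status to host"))   # before the remote transfer
```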
Backup / Restore of Big Data
➢ Exploding amounts of data cause backups to run too long
➤ How long does it take you to back up 1 TB of data?
➤ Shrinking backup window and constant pressure for continuous application up-time
➢ Avoid using the production environment for backup
➤ No server CPU or I/O channels
➤ No involvement of the regular network
➢ Performance must scale to match customer growth
➢ Heterogeneous host support
Fastrax Overview
[Diagram: UNIX and Linux hosts at two locations run Fastrax-enabled backup/restore applications via SYMAPI; the Fastrax Data Engine moves data between Symmetrix volumes (STD1, STD2, BCV1, R1, R2, BCV2) and SCSI-attached tape libraries over Fibre Channel point-to-point links]
Fastrax Performance
➢ Performance scales with the number of data movers in the Fastrax box & the number of tape devices
➢ Restore runs as fast as backup
➢ No performance impact on host during restore or backup
InfoMover File System
➢ Transparent availability of MVS data to Unix hosts
➤ MVS datasets available as native Unix files
➢ Sharing a single copy of MVS datasets
➢ Uses MVS security and locking
➤ Standard MVS access methods for locking + security
IFS Implementation
[Diagram: MVS data on a Symmetrix with ESP is shared between an IBM MVS/OS390 mainframe (ESCON or parallel channel) and open-systems hosts — IBM AIX, HP HP-UX, Sun Solaris — over FWD/Ultra SCSI or Fibre Channel, with minimal network overhead: no data transfer over the network]
Symmetrix API Overview
➢ SYMAPI Core Library
➤ Used by “Thin” and Full Clients
➢ SYMAPI Mapping Library
➢ SYMCLI Command Line Interface
Symmetrix API’s
➢ SYMAPI provides the high-level functions
➤ Used by EMC’s ISV partners (Oracle, Veritas, etc) and by EMC applications
➢ SYMCLI is the “Command Line Interface” that invokes SYMAPI
➤ Used by end customers and some ISV applications
Basic Architecture
[Diagram: the Symmetrix Application Programming Interface (SymAPI) is called by the Symmetrix Command Line Interpreter (SymCli) and by other storage management applications; user access to the Solutions Enabler is via SymCli or a storage management application]
Client-Server Architecture
➢ Symapi Server runs on the host computer connected to the Symmetrix storage controller
➢ Symapi client runs on one or more host computers
[Diagram: storage management applications on client hosts use the SymAPI client (or a thin SymAPI client) to reach the SymAPI server on the server host, which links the SymAPI library]
SymAPI Components
[Diagram: component areas include Initialization, Discover and Update, Configuration, Gatekeepers, TimeFinder Functions, Device Groups, DeltaMark Functions, SRDF Functions, Statistics, Mapping Functions, Base Controls, Calypso Controls, Optimizer Controls, and InfoSharing]
Data Object Resolve
[Diagram: an RDBMS data file resolves through the file system, logical volume, and host physical device to the Symmetrix device and its extents]
File System Mapping
➢ File system mapping information includes:
➤ File system attributes and host physical location
➤ Directory attributes and contents
➤ File attributes and host physical extent information, including inode information and fragment size
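The mapping stack (data file → file system → logical volume → host device → Symmetrix device → extents) can be sketched as a resolver. Every class, device name, and block number here is hypothetical; this is not the real SYMAPI mapping library.

```python
# Hypothetical sketch of resolving an RDBMS data file down to array extents.
from dataclasses import dataclass

@dataclass
class Extent:
    symm_device: str      # Symmetrix logical device holding the extent
    start_block: int
    block_count: int

def resolve_data_file(path):
    # A real resolver would walk inode/extent maps, the LVM layout, and
    # the array configuration; canned values stand in for each layer.
    fs_extents = [(0, 1024), (4096, 2048)]        # (LV block, length)
    def lv_to_host(block):                        # logical volume -> host dev
        return "c0t1d2", block + 100
    def host_to_symm(dev, block):                 # host dev -> Symm device
        return "SYM-01A", block
    extents = []
    for start, count in fs_extents:
        dev, hb = lv_to_host(start)
        symm, sb = host_to_symm(dev, hb)
        extents.append(Extent(symm, sb, count))
    return extents

for e in resolve_data_file("/oradata/sales01.dbf"):
    print(e.symm_device, e.start_block, e.block_count)
```

Knowing the extent list at the array level is what lets features like TimeFinder and Fastrax operate on a database file without host involvement.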
Solaris & Sun Starfire
➢ Hardware
➤ Up to 62 I/O channels
➤ 64 CPUs
➤ 64 GB of RAM
➤ 60 TB of disk
➤ Supports multiple domains
➢ Starfire & Symmetrix
➤ ~20% use more than 32 I/O channels
➤ Most use 4 to 8 I/O channels per domain
➤ Oracle instance usually above 1 TB
HPUX & HP 9000 Superdome
➢ Hardware
➤ 192 I/O channels
➤ 64 CPU cards
➤ 128 GB RAM
➤ 1 PB of storage
➢ Superdome and Symm
➤ 16 LUNs per target
➤ Want us to support more than 4000 logical volumes!
Solaris and Fujitsu GP7000F M1000
➢ Hardware
➤ 6-48 I/O slots
➤ 4-32 CPUs
➤ Cross-bar switch
➤ 32 GB RAM
➤ 64-bit PCI bus
➤ Up to 70 TB of storage
Solaris and Fujitsu GP7000F M2000
➢ Hardware
➤ 12-192 I/O slots
➤ 8-128 CPUs
➤ Cross-bar switch
➤ 256 GB RAM
➤ 64-bit PCI bus
➤ Up to 70 TB of storage
AIX 5L & IBM RS/6000 SP
➢ Hardware
➤ Scales to 512 nodes (over 8000 CPUs)
➤ 32 TB RAM
➤ 473 TB internal storage capacity
➤ High-speed interconnect: 1 GB/sec per channel with SP Switch2
➤ Partitioned workloads
➤ Thousands of I/O channels
IBM RS/6000 pSeries 680 AIX 5L
➢ Hardware
➤ 24 CPUs
• 64-bit RS64 IV at 600 MHz
➤ 96 GB RAM
➤ 873.3 GB internal storage capacity
➤ 53 PCI slots
• 33 32-bit / 20 64-bit
Really Big Data
➢ IBM (Sequent) NUMA
➤ 16 NUMA “Quads”
• 4-way / 450 MHz CPUs
• 2 GB memory
• 4 x 100 MB/s FC-SW
➢ Oracle 8.1.5 with up to 42 TB (mirrored) DB
➢ EMC Symmetrix
➤ 20 small Symm 4’s
➤ 2 medium Symm 4’s
[Diagram: NUMA-Q 2000 Xeon quads, joined by the IQ-Link interconnect, connect through cascaded Fibre Channel switch pairs in four fabric domains to EMC disk]
Windows 2000 on IA32
➢ Usually lots of small (1U or 2U) boxes share a Symmetrix
➤ 4 to 8 I/O channels per box
➤ Qualified up to 1 TB per meta volume (although usually deployed with ½ TB or less)
➢ Management is a challenge
➢ Will 2000 on IA64 handle big data better?
Lots of Devices
➢ Customers can use hundreds of targets and LUNs (logical volumes)
➤ 128 SCSI devices per system is too few
➢ Better naming system to track lots of disks
➢ Persistence for “not ready” devices in the name space would help some of our features
➢ devfs solves some of this
➤ Rational naming scheme
➤ Potential for tons of disk devices (need SCSI driver work as well)
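To see why naming hundreds of disks gets awkward, consider the classic /dev/sd* scheme, which counts in bijective base-26 (sda … sdz, then sdaa, sdab, …). The sketch below is illustrative of the naming arithmetic, not of any particular driver's code.

```python
# Generate the nth SCSI disk name under the sda/sdb/.../sdz/sdaa scheme.
import string

def scsi_disk_name(index):
    """Map disk index 0, 1, 2, ... to sda, sdb, ..., sdz, sdaa, ..."""
    n = index + 1                       # bijective base-26: no zero digit
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = string.ascii_lowercase[rem] + letters
    return "sd" + letters

print(scsi_disk_name(0), scsi_disk_name(25),
      scsi_disk_name(26), scsi_disk_name(127))
```

Past the 26th disk the names stop being mnemonic, and nothing in the name survives a device being temporarily “not ready” — which is the persistence problem noted above.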
Support for Dynamic Data
➢ What happens when the LV changes under a running file system? Adding new logical volumes?
➤ Happens with TimeFinder, RDF, Fastrax
➤ Requires remounting, reloading drivers, rebooting?
➤ APIs can be used to give a “heads up” before events
➢ Must be able to invalidate
➤ Data, name, and attribute caches for individual files or logical volumes
➢ Support for dynamically loaded, layered drivers
➢ Dynamic allocation of devices
➤ Especially important for LUNs
➤ Add & remove devices as the Fibre Channel fabric changes
Keep it Open
➢ Open source is good for us
➤ We can fix it or support it if you don’t want to
➤ No need to reverse engineer some closed-source FS/LVM
➢ Leverage storage APIs
➤ Add hooks to Linux file systems, LVMs, sysadmin tools
➢ Make Linux manageable
➤ Good management tools are crucial in large data centers
New Technology Opportunities
➢ Linux can explore new technologies faster than most
➢ iSCSI
➤ SCSI over TCP for remote data copy?
➤ SCSI over TCP for host storage connection?
➤ High-speed/zero-copy TCP is important to storage here!
➢ InfiniBand
➤ Initially targeted at PCI replacement
➤ High-speed, high-performance cluster infrastructure for file systems, LVMs, etc
• Multiple gigabits/sec (2.5 Gb/sec up to 30 Gb/sec)
➤ Support for IB as a storage connection?
➢ Cluster file systems
Linux at EMC
➢ Full support for Linux in SymAPI, RDF, TimeFinder, etc
➢ Working with partners in the application space and the OS space to support Linux
➤ Oracle Open World demo of Oracle on Linux with over 20 Symms (could reach 1 PB of storage!)
[Diagram: EMC Symmetrix Enterprise Storage connected via EMC Connectrix Enterprise Fibre Channel switches, with centralized monitoring and management]
Our Problem: Code Builds
➢ Over 70 OS developers
➤ Each developer builds 15 variations of the OS
➤ Each variation compiles over a million lines of code
➤ A full build uses gigabytes of space, with 100k temporary files
➢ User sandboxes stored in home directories over NFS
➢ A full build took around 2 hours
➢ 2 users could build at once
Our Original Environment
➢ Software
➤ GNU tool chain
➤ CVS for source control
➤ Platform Computing’s Load Sharing Facility
➤ Solaris on build nodes
➢ Hardware
➤ EMC NFS server (Celerra) with EMC Symmetrix back end
➤ 26 Sun Ultra-2 (dual 300 MHz CPU) boxes
➤ FDDI ring used for interconnect
LSF Architecture
➢ Distributed process scheduling and remote execution
➤ No kernel modifications
➤ Prefers static placement for load balancing
➤ Applications need to link a special library
➤ License server controls cluster access
➢ Master node in cluster
➤ Manages load information
➤ Makes scheduling decisions for all nodes
➢ Uses modified GNU Make (lsmake)
MOSIX Architecture
➢ Provides transparent, dynamic migration
➤ Processes can migrate at any time
➤ No user intervention required
➤ A process thinks it is still running on its creation node
➢ Dynamic load balancing
➤ Uses a decentralized algorithm to continually level load in the cluster
➤ Based on the number of CPUs, speed of CPUs, RAM, etc
➢ Worked great for distributed builds in 1989
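The decentralized leveling idea can be shown with a toy simulation: each step, two random nodes compare loads normalized by CPU count and speed, and one process migrates toward the less-loaded node. All numbers are illustrative, and the real MOSIX algorithm also weighs RAM and other resources.

```python
# Toy pairwise load leveling in the MOSIX spirit (no central coordinator).
import random

def load(node):
    return node["procs"] / (node["cpus"] * node["speed"])

def balance_step(nodes, rng):
    a, b = rng.sample(nodes, 2)           # two random peers exchange info
    if load(a) > load(b) and a["procs"]:
        a["procs"] -= 1; b["procs"] += 1  # migrate one process a -> b
    elif load(b) > load(a) and b["procs"]:
        b["procs"] -= 1; a["procs"] += 1

nodes = [{"cpus": 2, "speed": 1.0, "procs": 30},
         {"cpus": 4, "speed": 1.0, "procs": 2},
         {"cpus": 2, "speed": 2.0, "procs": 4}]
rng = random.Random(0)
for _ in range(300):
    balance_step(nodes, rng)
print([n["procs"] for n in nodes])        # roughly proportional to capacity
```

After a few hundred pairwise steps the process counts settle roughly in proportion to each node's capacity, with no node ever holding a global view of the cluster.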
MOSIX Mechanism
➢ Each process has a unique home node (UHN)
➤ The UHN is the node of the process’s creation
➤ The process appears to be running at its UHN
➤ After migration it is invisible to others on its new node
➢ The UHN runs a deputy
➤ Encapsulates system state for the migrated process
➤ Acts as a proxy for some location-sensitive system calls after migration
➤ Significant performance hit for I/O over NFS, for example
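The deputy mechanism can be sketched as follows. This is a hypothetical model: the class names, the dict standing in for open files, and the single "read" call are illustrative, not MOSIX kernel structures.

```python
# Sketch: a migrated process ships location-sensitive system calls back
# to its unique home node (UHN), where the deputy executes them.

class Deputy:
    """Lives on the UHN; keeps the process's home-node context."""
    def __init__(self, open_files):
        self.open_files = open_files          # fd -> file contents (toy)

    def handle(self, syscall, *args):
        if syscall == "read":                 # executed at the home node
            (fd,) = args
            return self.open_files[fd]
        raise NotImplementedError(syscall)

class MigratedProcess:
    """Runs on the new node; location-sensitive calls go home."""
    def __init__(self, deputy):
        self.deputy = deputy

    def read(self, fd):
        # Every such call costs a network round trip to the UHN, which
        # is why I/O-heavy migrated workloads (e.g. over NFS) slow down.
        return self.deputy.handle("read", fd)

deputy = Deputy(open_files={3: b"build log kept on the home node"})
proc = MigratedProcess(deputy)
print(proc.read(3))
```

The round trip per call is the performance hit the last bullet describes, and avoiding the deputy entirely is what motivates the remote-execution and DFSA work later in the talk.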
MOSIX Migration
[Diagram: before migration, the local process runs at user level on its home node; after migration, the remote part runs on the new node while the deputy remains in the home node’s kernel, the two communicating through the link layer, with NFS I/O done via the home node]
MOSIX Enhancements
➢ MOSIX added static placement and remote execution
➤ Leverage the load-balancing infrastructure for placement decisions
➤ Avoid creation of deputies
➢ Lock remotely spawned processes down just in case
➢ Fix several NFS caching-related bugs
➢ Modify some of our makefile rules
MOSIX Remote Execution
[Diagram: remote execution variant of the migration figure — the spawned process runs on the remote node and each node performs its own NFS I/O, avoiding the deputy]
EMC MOSIX cluster
➢ EMC’s original MOSIX cluster
➤ Compute nodes changed from LSF to MOSIX
➤ Network changed from FDDI to 100-megabit Ethernet
➢ The MOSIX cluster immediately moved the bottleneck from the cluster to the network and I/O systems
➢ Performance was great, but we can do better!
Latest Hardware Changes
➢ Network upgrades
➤ New switch deployed
➤ Nodes to switch use 100-megabit Ethernet
➤ Switch to NFS server uses gigabit Ethernet
➢ NFS upgrades
➤ 50-gigabyte striped file systems per user (compared to 9-gigabyte non-striped file systems)
➤ Fast/wide differential SCSI between server and storage
➢ Cluster upgrades
➤ Added 28 more compute nodes
➤ Added 4 “submittal” nodes
EMC MOSIX Cluster
[Diagram: 52 VA Linux 3500s connect over 100-Mbit Ethernet to a Cisco 6506; gigabit Ethernet links the switch to 2 EMC Celerra NFS servers, which attach to 4 Symmetrix systems over SCSI]
Performance
➢ Running Red Hat 6.0 with the 2.2.10 kernel (MOSIX and NFS patches applied)
➢ Builds are now around 15-20 minutes (down from 1-1.5 hours)
➢ Over 35 concurrent builds at once
Build Submissions
[Diagram: Windows, Sun workstation, Linux, FreeBSD, and HP/UX clients on the 100-Mbit corporate network submit builds through dual-NIC “selfish” nodes to the 100-Mbit cluster network; compute nodes reach two NFS servers over gigabit Ethernet, backed by four Symmetrix systems over SCSI]
Cluster File System & MOSIX
[Diagram: the same cluster of 52 VA Linux 3500s with a Cisco 6506 and 1 EMC Celerra NFS server, but the compute nodes also reach the 4 Symmetrix systems directly over Fibre Channel through a Connectrix switch]
DFSA Overview
➢ DFSA provides the structure to allow migrated processes to always do local I/O
➢ MFS (MOSIX File System) created
➤ No caching per node, write-through
➤ Serverless - all nodes can export/import files
➤ Prototype for DFSA testing
➤ Works like non-caching NFS
DFSA Requirements
➢ One active inode/buffer in the cluster for each file
➢ Time-stamps are cluster-wide and increasing
➢ Some new FS operations
➤ Identify: encapsulate dentry info
➤ Compare: are two files the same?
➤ Create: produce a new file from SB/ID info
➢ Some new inode operations
➤ Checkpath: verify the path to the file is unique
➤ Dotdot: give the true parent directory
Information
➢ MOSIX:
➤ http://www.mosix.org/
➢ GFS:
➤ http://www.globalfilesystem.org/
➢ Migration information
➤ Process Migration, Milojicic et al. To appear in ACM Computing Surveys, 2000.
➤ Mobility: Processes, Computers and Agents, Milojicic, Douglis and Wheeler, ACM Press.