Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability
CETS2001 Session 1255
Wednesday, Sept. 12, 2:45 pm, 303B
Keith Parris


Page 1: Riding the Dot-Com Wave: A Case Study in Extreme VMScluster Scalability

CETS2001 Session 1255

Wednesday, Sept. 12, 2:45 pm, 303B

Keith Parris

Page 2: Topics

• Scale of workload and configuration growth

• Changes made to scale the cluster

• Challenges to extreme scalability and high availability

• Surprises along the way

• Lessons learned

Page 3: Hardware Growth

• 1981: 1 Alpha Microsystems PC

• 1983: 1 VAX 11/750

• 1993: 1 MicroVAX 4100

• 1996: 4 VAX 7700s, 1 8400

• 1997: 6 VAX 7700s, 2 VAX 7800s, 2 8400s

• 1999: 18 GS-140s

• 2001: 2 clusters; one with 12 GS-140s, the other with 3 GS-140s and 2 GS-160s

Page 4: Workload Growth Rate

• As measured in yearly peak transaction counts:

– 1996-1997: 2X

– 1997-1998: 2X

– 1998-1999: 2X

– 1999-2000: 3X

• We’ll focus on these years

Page 5: Scaling the Cluster: Memory

• Went from 1 GB to 20 GB of memory per system

Page 6: Scaling the Cluster: CPU

• Upgraded VAX 7700 nodes by adding CPUs

• Upgraded key nodes from VAX 7700 to VAX 7800 CPUs

• Ported the application from VAX to Alpha

• Went from 2-CPU 8400s to 6-CPU GS-140s, then added 12-CPU GS-160s

– From 200 MHz EV4 chips to 731 MHz EV67

Page 7: Scaling the Cluster: I/O

• Went from JBOD to RAID, and raised number of members in RAID arrays over time

• Added RMS global buffers (see the sketch after this list)

• Went from 3600 RPM magnetic disks to 5400 RPM, then 7200 RPM, then 10K RPM

• Put hot files onto large arrays of solid-state disks

• Upgraded from CMD controllers to HSJ40s; added writeback cache; upgraded to HSJ50s and doubled the number of controllers; upgraded to HSJ80s
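
For illustration of the global-buffer change, a minimal DCL sketch; the file name and buffer count are hypothetical, not values from the original slides:

    $ ! Enable RMS global buffers on a hot indexed file, so processes on a
    $ ! node share one cache of its buckets instead of private buffers
    $ SET FILE/GLOBAL_BUFFERS=500 DISK$DATA:[APP]HOTFILE.IDX
    $ ! Verify: DIRECTORY/FULL reports the file's global buffer count
    $ DIRECTORY/FULL DISK$DATA:[APP]HOTFILE.IDX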

Page 8: Scaling the Cluster: I/O

• Changed from shadowsets of controller-based stripesets to host-based RAID 0+1 arrays of controller-based mirrorsets

– Avoided any single controller becoming a bottleneck by spreading RAID array members across multiple controllers

– Provided faster time-to-repair for shadowset member failures

– Provided faster cross-site shadow copy times

Page 9: Shadowsets of Stripesets

• Volume shadowing thinks it sees large disks

• Shadow copies and merges occur sequentially across entire surface

• Failure of 1 member implies full shadow copy of stripeset to fix

[Diagram: a host-based shadowset whose two members are controller-based stripesets]

Page 10: Host-based RAID 0+1 Arrays

• Individual disks are combined first into shadowsets

• Host-based RAID software combines the shadowsets into a RAID 0 array

• Shadowset members can be spread across multiple controllers

[Diagram: three host-based shadowsets combined into a host-based RAID 0+1 array]

Page 11: Host-based RAID 0+1 Arrays

• Shadow copies and merges occur in parallel on all shadowsets at once

• Failure of 1 member requires full shadow copy of only that member to fix

[Diagram: three host-based shadowsets combined into a host-based RAID 0+1 array]
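
For illustration, a minimal DCL sketch of building the mirrored layer, with hypothetical allocation classes and device names; the RAID 0 striping on top was done by host-based RAID software whose exact commands the slides don't show:

    $ ! Mirrored layer: each shadowset pairs disks behind different
    $ ! controllers (different allocation classes), so no single
    $ ! controller is a bottleneck
    $ MOUNT/SYSTEM DSA101: /SHADOW=($1$DUA101:,$2$DUA101:) DATA01
    $ MOUNT/SYSTEM DSA102: /SHADOW=($1$DUA102:,$2$DUA102:) DATA02
    $ MOUNT/SYSTEM DSA103: /SHADOW=($1$DUA103:,$2$DUA103:) DATA03
    $ ! The host-based RAID software then stripes DSA101:-DSA103: into one
    $ ! RAID 0 array; a failed disk costs a copy of only that one shadowset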

Page 12: Scaling the Cluster: I/O & Locking

• Implemented Fast_Path on CI (see the sketch after this list)

• Tried Memory Channel and failed

– CPU 0 saturation in interrupt state occurred when lock traffic moved from CI (with Fast_Path) to MC (without Fast_Path)

• Went from 2 CI star couplers to 6

– Distributed lock traffic across the CIs
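
A minimal SYSGEN sketch of enabling Fast_Path, assuming the FAST_PATH system parameter of the OpenVMS Alpha versions of that era (not dynamic, so a reboot is needed):

    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> USE CURRENT
    SYSGEN> SET FAST_PATH 1   ! enable Fast_Path port drivers, letting CI port
    SYSGEN> WRITE CURRENT     ! interrupt work run on CPUs other than CPU 0;
    SYSGEN> EXIT              ! takes effect at the next reboot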

Page 13: Scaling the Cluster: Disaster Tolerance

• Implemented a Disaster-Tolerant Cluster

– Effectively doubled hardware: CPUs, I/O subsystems, memory

– Significantly improved availability

– But the relatively long inter-site distance added inter-site latency as a new potential factor in performance issues

Page 14: Scaling the Cluster: Datacenter Space

• Multi-site clustering and Volume Shadowing provided the opportunity to move to larger datacenters, without downtime, on 3 separate occasions

Page 15: Challenges

• Downtime cost $Millions per event

• Longer downtime meant even larger risk

– Had to resist initial pressure to favor quick fixes over any diagnostic efforts that would lengthen downtime

• e.g. crash dump files

Page 16: Challenges

• Network focus in application design rather than cluster focus

– Triggered by a history of adding node after node, connected by DECnet, rather than forming a VMScluster early on

– Important functions assigned to specific nodes

• Failover and load balancing problematic

– Systems had to boot/reboot in a specific order

Page 17: Challenges

• Web interface design provided quick time-to-market using screen scraping, but had a fragile 3-process chain with a link to Unix

Page 18: Challenges

• Fragile 3-process chain with link to Unix

– Failure of the Unix program, the TCP/IP link, or any of the 3 processes on VMS caused all 3 VMS processes to die, incurring:

• Process run-down and cleanup

• Process creation and image activations for 3 new processes to replace the 3 which just died

– Slowing response times could cause time-outs and initiate “meltdowns”

Page 19: Challenges

• Interactive system capacity requirements in an industry with historically batch-processing mentality:

– Can’t run CPUs to 100% with interactive users like you can with overnight batch jobs

Page 20: Challenges

• Adding solid-state disks

– Hard to isolate hot blocks

• Ended up moving entire hot RMS files to the SSD array (see the sketch below)
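
One way to confirm a file is hot before moving it (a minimal sketch; the file name is hypothetical):

    $ ! Enable RMS statistics collection on a suspect file
    $ SET FILE/STATISTICS DISK$DATA:[APP]HOTFILE.IDX
    $ ! Watch its RMS operation rates live
    $ MONITOR RMS/FILE=DISK$DATA:[APP]HOTFILE.IDX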

Page 21: Challenges

• Application program techniques which worked fine under low workloads failed at higher workloads

– Closing files for backups

– ‘Temporary’ infinite loop

Page 22: Challenges

• Standardization on Cisco network hardware and SNMP monitoring

– Even on GIGAswitch-based inter-site cluster link

Page 23: Challenges

• Constant pressure to port to Unix:

– Sun proponents continually told management:

• “We will be off the VAX in 6 months”

– Adversely affected VMS investments at critical times

• e.g. RZ28D disks, star couplers

Page 24: Surprises Along the Way

• As more Alpha nodes were added, lock tree remastering activity caused “pauses” of 10 to 50 seconds every 2-3 minutes

– Controlled with PE1=100 during the workday (see the sketch below)
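
A minimal SYSGEN sketch of the PE1 control; PE1 is dynamic, and a positive value limits remastering to lock trees smaller than that many locks (the 100 here is the value the slides cite):

    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> USE ACTIVE
    SYSGEN> SET PE1 100     ! remaster only lock trees with fewer than 100 locks
    SYSGEN> WRITE ACTIVE    ! dynamic parameter; takes effect immediately
    SYSGEN> EXIT
    $ ! Off-hours, PE1 can be set back to 0 to allow unrestricted remastering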

Page 25: Surprises Along the Way

• A shadowing patch c. 1997 changed the algorithm for selecting the disk for read operations, suddenly sending ½ of the read requests to the other site, 130 miles (4-5 milliseconds) farther away

– A subsequent patch kit allowed control of this behavior with the SHADOW_SYS_DISK SYSGEN parameter (see the sketch below)
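
A hedged sketch of using that control; the convention of setting bit 16 (%X10000) of SHADOW_SYS_DISK to bias reads toward local shadowset members matches multi-site-cluster practice of the era, but the exact value and whether it is dynamic should be taken from the patch kit's release notes:

    $ RUN SYS$SYSTEM:SYSGEN
    SYSGEN> USE CURRENT
    SYSGEN> SET SHADOW_SYS_DISK %X10000   ! assumed bit 16: prefer local members
    SYSGEN> WRITE CURRENT                 ! for reads; OR in bit 0 if the system
    SYSGEN> EXIT                          ! disk is shadowed; reboot to apply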

Page 26: Surprises Along the Way

• VMS may allow a lock master node to take on so much workload that CPU 0 ends up saturated in interrupt state

– Caused CLUEXIT bugchecks and performance anomalies

• With the help of VMS Source Listings and advice from VMS Engineering, wrote programs to spread lock mastership of the hot files across a set of several nodes, and held them there using PE1 (see the sketch below)
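
The remastering programs themselves aren't reproduced here; as a starting point, a minimal SDA sketch for seeing where resource trees are mastered (display fields vary by VMS version):

    $ ANALYZE/SYSTEM
    SDA> SHOW RESOURCE/ALL    ! resource blocks identify the master node (CSID)
    SDA> EXIT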

Page 27: Surprises Along the Way

• CI Load Sharing code never got ported from VAX to Alpha

• Nodes crashing and rebooting changed the assignments of which star couplers were used for lock traffic between pairs of nodes

– Caused unpredictable performance anomalies

• CSC and VMS Engineering came to the rescue with a program called MOVE_REMOTENODE_CONNECTIONS to set up a (static) load-balanced configuration (see the sketch below)
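
MOVE_REMOTENODE_CONNECTIONS isn't publicly documented; a minimal sketch of watching which star-coupler paths carry SCS traffic, using the standard SHOW CLUSTER utility:

    $ SHOW CLUSTER/CONTINUOUS
    Command> ADD CIRCUITS       ! per-circuit data: which local port/path is in use
    Command> ADD CONNECTIONS    ! SCS connections riding on those circuits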

Page 28: Surprises Along the Way

• As disks grew larger, default extent sizes and RMS bucket sizes grew as files were CONVERTed onto larger disks using the default optimize script

– Data transfer sizes gradually grew by a factor of 14X over 4 years

– Solid-state disks don’t benefit from increased RMS bucket sizes like magnetic disks do

• Fixed by manually selecting RMS bucket sizes for hot files on solid-state disks (see the sketch below)
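
A minimal DCL sketch of pinning a bucket size by hand (file names hypothetical):

    $ ! Capture the file's current attributes as an FDL description
    $ ANALYZE/RMS_FILE/FDL/OUTPUT=HOTFILE.FDL HOTFILE.IDX
    $ ! Edit the FDL to set BUCKET_SIZE explicitly rather than accepting
    $ ! the optimizer's disk-geometry-based default
    $ EDIT/FDL HOTFILE.FDL
    $ ! Rebuild the file with the hand-picked attributes
    $ CONVERT/FDL=HOTFILE.FDL HOTFILE.IDX HOTFILE_NEW.IDX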

Page 29: Challenges Left in VMScluster Scalability and High Availability

• Can’t enlarge disks or RAID arrays on-line

• Can’t re-pack RMS files on-line

• Can’t de-fragment open files (with DFO)

• Disks are getting lots bigger but not as much faster

– I/Os per second per gigabyte is actually falling

Page 30: Lessons Learned

• To provide good system performance, one must gain knowledge of application behavior

Page 31: Lessons Learned

• High-availability systems require:

– Best possible people to run them

– Best available vendor support:

• Remedial

• Engineering

Page 32: Lessons Learned

• Many problems can be avoided entirely (or at least deferred) by providing “reserve” computing capacity

• Avoids saturation conditions

– Avoids error paths and other seldom-exercised code paths

• Provides headroom for peak loads, and to accommodate rapid workload growth when procurement efforts have long lead times

Page 33: Lessons Learned

• Staff size must grow with workload growth and cluster size, but with VMS clusters it need not grow at nearly as high a rate

• Staff size went from 1 person to 8 people (plus vendor HW/SW support) with 24X workload growth

Page 34: Lessons Learned

• Visibility of system workload and system performance is key, to:

– Spot surges in workload

– Identify bottlenecks as each new one arises

• Provide quick turn-around of performance info into changes and optimizations

– Overnight, and even mid-day
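
A minimal DCL sketch of the standard visibility tools; MONITOR is part of OpenVMS, and the classes shown are illustrative choices for this kind of workload:

    $ MONITOR MODES                     ! processor-mode time; watch interrupt state
    $ MONITOR DLOCK                     ! distributed lock manager traffic
    $ MONITOR DISK/ITEM=QUEUE_LENGTH    ! disks where I/O is backing up
    $ MONITOR CLUSTER                   ! cluster-wide CPU, memory, and disk summary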

Page 35: Lessons Learned

• With present technology, some scheduled downtime will be needed:

– To optimize performance

– To do hardware upgrades & maintenance

• You’re going to have some downtime: do you want to schedule some, or just deal with it when it happens on its own?

– Scheduled downtime helps prevent or minimize unscheduled downtime

Page 36: Lessons Learned

• Despite the redundancy within a cluster, a VMS Cluster viewed as a whole can be a Single Point of Failure

– Solution: Use multiple clusters, with the ability to shift customer data quickly between them if needed

– Can hide scheduled downtime from users

Page 37: Lessons Learned

• It was impossible to optimize system performance by system tuning alone

– Deep knowledge of application program behavior had to be gained by:

• Code examination

• Constant discussions with development staff

• Observing system behavior under load

Page 38: Lessons Learned

• Application code improvements are often sorely needed, but their effect on performance can be hard to predict; they may actually hurt things, or make dramatic order-of-magnitude improvements

– They are also often found due to serendipity or sudden inspiration, so it’s also hard to plan them or to predict when they might occur

Page 39: Lessons Learned

• The effect of hardware upgrades is easier to predict: doubling the hardware will double the cost, and will generally provide close to double the performance

– Order-of-magnitude improvements are harder to obtain, and more expensive

Page 40: Success Factors

• Excellent people

• Best technology

• Quick procurement, preferably proactive

• Top-notch vendor support

– Services (CSC, MSE)

– VMS Engineering; Storage Engineering

Page 41: Speaker Contact Info

Keith Parris

E-mail: [email protected] or [email protected]

Web: http://www.geocities.com/keithparris/ and http://encompasserve.org/~kparris/

Integrity Computing, Inc.
2812 Preakness Way
Colorado Springs, CO 80916-4375
(719) 392-6696