-- Linux-HA Release 2 LWCE – SF – August, 2005

Linux-HA Release 2 - World-Class Open Source HA Software

Alan RobertsonProject Leader – Linux-HA project

[email protected]

IBM Linux Technology Center

Agenda

What is High-Availability (HA) Clustering?

What is the Linux-HA project?

Linux-HA applications and customers

Linux-HA Release 1 / Release 2 feature comparison

Release 2 details

DRBD – an important component

Thoughts about cluster security

What Is HA Clustering?

Putting together a group of computers which trust each other to provide a service even when system components fail

When one machine goes down, others take over its work

This involves IP address takeover, service takeover, etc.

New work comes to the remaining machines

Not primarily designed for high performance

High Availability Through Redundancy and Monitoring

Redundancy eliminates Single Points Of Failure (SPOF)

Monitoring determines when things need to change

Reduces cost of planned and unplanned outages by reducing MTTR (Mean Time To Repair)
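Why MTTR matters can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). A minimal sketch; the 4,000-hour MTBF and the two repair times below are made-up illustration values, not measurements of any real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure roughly every 4,000 hours of operation.
MTBF = 4000.0
manual_repair = availability(MTBF, mttr_hours=4.0)            # an admin restores service in ~4 h
automatic_failover = availability(MTBF, mttr_hours=1.0 / 60)  # the cluster fails over in ~1 min

print(f"manual repair:      {manual_repair:.4%}")
print(f"automatic failover: {automatic_failover:.4%}")
```

Only MTTR changes between the two cases, yet availability moves from about 99.9% to about 99.9996%: the failure rate is the same, the outages are just much shorter.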

Failover and Restart

Monitoring detects failures (hardware, network, applications)

Automatic recovery from failures (no human intervention)

Managed restart or failover to standby systems, components
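"Managed restart or failover" is a policy decision. The sketch below shows one hypothetical policy: restart the service in place a few times, then move it to a standby node. The class name and the restart threshold are assumptions for illustration, not Linux-HA settings.

```python
from dataclasses import dataclass


@dataclass
class MonitoredService:
    """Hypothetical restart-then-failover policy for one monitored service."""

    node: str
    standby_nodes: list[str]
    max_local_restarts: int = 3  # assumed threshold, not a Linux-HA default
    restarts: int = 0

    def on_monitor_failure(self) -> str:
        """Decide how to react when monitoring reports the service as failed."""
        if self.restarts < self.max_local_restarts:
            self.restarts += 1
            return f"restart on {self.node}"
        if not self.standby_nodes:
            return "no standby left: service down"
        # Local restarts did not help: move the service to the next standby.
        self.node = self.standby_nodes.pop(0)
        self.restarts = 0
        return f"fail over to {self.node}"


svc = MonitoredService(node="node01", standby_nodes=["node02"], max_local_restarts=2)
actions = [svc.on_monitor_failure() for _ in range(4)]
```

With two local restarts allowed, four consecutive failures produce two restarts on node01, a failover to node02, and then a restart on node02.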

The HA Continuum

Single node HA system (monitoring w/o redundancy)

Provides for application monitoring and restart

Easy, zero-cost entry point – HA system starts init scripts instead of /etc/init.d/rc (or equivalent)

Addresses Solaris / Linux functional gap

Multiple Virtual Machines – Single Physical machine

Adds OS crash protection, rolling upgrades of OS and application – good for security fixes, etc.

Many possibilities for interactions with virtual machines exist

Multiple Physical Machines (“normal” cluster)

Adds protection against hardware failures

Split-Site (“stretch”) Clusters

Adds protection against site-wide failures (power, air-conditioning, flood, fire)

What Can HA Clustering Do For You?

It cannot achieve 100% availability – nothing can.

HA clustering is designed to recover from single faults

It can make your outages very short

From about a second to a few minutes

It is like a magician's (illusionist's) trick:

When it goes well, the hand is faster than the eye

When it goes not-so-well, it can be reasonably visible

A good HA clustering system adds a “9” to your base availability

99->99.9, 99.9->99.99, 99.99->99.999, etc.

Complexity is the enemy of reliability!

Lies, Damn Lies, and Statistics

Counting nines – downtime allowed per year

99.9999%: about 30 seconds

99.999%: about 5 minutes

99.99%: about 52 minutes

99.9%: about 9 hours

99%: about 3.65 days
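The table follows from simple arithmetic: allowed downtime is (1 - availability) times one year, and "adding a 9" divides the unavailability by ten. A small sketch that regenerates the table (365-day year assumed):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # ignores leap years


def downtime_per_year(availability: float) -> float:
    """Downtime allowed per year, in seconds, at the given availability (0..1)."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def add_a_nine(availability: float) -> float:
    """'Adding a 9' divides the unavailability by ten, e.g. 0.99 -> 0.999."""
    return 1.0 - (1.0 - availability) / 10


for a in (0.999999, 0.99999, 0.9999, 0.999, 0.99):
    secs = downtime_per_year(a)
    print(f"{a * 100:g}%: {secs:9.0f} s = {secs / 60:8.1f} min = {secs / 86400:.2f} days")
```

The exact values are 31.5 s, 5.3 min, 52.6 min, 8.76 h and 3.65 days, which the slide rounds.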

The Desire for HA systems

Who wants low-availability systems?

Why are so few systems High-Availability?

Why isn't everything HA?

Cost

Complexity

How Does HA work?

Manage redundancy to improve service availability

Like a cluster-wide-super-init with monitoring

Even complex services are now “respawned”:

on node (computer) death

on “impairment” of nodes

on loss of connectivity

for services that aren't working (not necessarily stopped)

managing complex dependency relationships

Single Points of Failure (SPOFs)

A single point of failure is a component whose failure will cause near-immediate failure of an entire system or service

Good HA design adds redundancy to eliminate single points of failure

Non-Obvious SPOFs can require deep expertise to spot
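The SPOF definition can be checked mechanically once a service is modeled. This sketch assumes a deliberately simple, hypothetical model: the service needs every tier, and a tier keeps working as long as one of its redundant components is up. Tier and component names are invented.

```python
# Hypothetical model: tier name -> set of redundant components.
Service = dict[str, set[str]]


def service_up(service: Service, failed: set[str]) -> bool:
    """The service works if every tier still has at least one live component."""
    return all(components - failed for components in service.values())


def single_points_of_failure(service: Service) -> set[str]:
    """Components whose failure alone takes the whole service down."""
    everything = set().union(*service.values())
    return {c for c in everything if not service_up(service, {c})}


web_service = {
    "servers": {"node01", "node02"},
    "network switch": {"switch1"},
    "storage": {"san_a", "san_b"},
    "power": {"ups1"},
}
print(sorted(single_points_of_failure(web_service)))
```

Here the lone switch and the lone UPS come out as SPOFs. The hard part in practice is building an honest model, which is exactly where the non-obvious SPOFs hide.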

The “Three R's” of High-Availability

Redundancy

Redundancy

Redundancy

If this sounds redundant, that's probably appropriate...

Most SPOFs are eliminated by redundancy

HA Clustering is a good way of providing and managing redundancy

Redundant Data Access

Replicated

Copies of data are kept updated on more than one computer in the cluster

Shared

Typically Fibre Channel disk (SAN)

Sometimes shared SCSI

Back-end Storage (“Somebody Else's Problem”)

NFS, SMB

Back-end database

The Linux-HA Project

Linux-HA is the oldest high-availability project for Linux, with the largest associated community

The core piece of Linux-HA is called “Heartbeat”(though it does much more than heartbeat)

Linux-HA has been in production since 1999, and is currently in use on about ten thousand sites

Linux-HA also runs on FreeBSD and Solaris, and is being ported to OpenBSD and others

Linux-HA is shipped with every major Linux distribution except one.

Linux-HA Release 1 Applications

Database Servers (DB2, Oracle, MySQL, others)

Load Balancers

Web Servers

Custom Applications

Firewalls

Retail Point of Sale Solutions

Authentication

File Servers

Proxy Servers

Medical Imaging

Almost any type of server application you can think of – except SAP

Linux-HA customers

FedEx – Truck Location Tracking

BBC – Internet infrastructure

The Weather Channel (weather.com)

Sony (manufacturing)

ISO New England manages the power grid using 25 Linux-HA clusters

MAN Nutzfahrzeuge AG – truck manufacturing division of MAN AG

Karstadt and Circuit City each use Linux-HA and databases in several hundred stores

Citysavings Bank in Munich (infrastructure)

Bavarian Radio Station (Munich) coverage of 2002 Olympics in Salt Lake City

Emageon – medical imaging services

Incredimail bases their mail service on Linux-HA on IBM hardware

University of Toledo (US) – 20k student Computer Aided Instruction system

Linux-HA Release 1 capabilities

Supports 2-node clusters

Can use serial, UDP bcast, mcast, ucast communication

Fails over on node failure

Fails over on loss of IP connectivity

Capability for failing over on loss of SAN connectivity

Limited command line administrative tools to fail over, query current status, etc.

Active/Active or Active/Passive

Simple resource group dependency model

Requires external tool for resource (service) monitoring

SNMP monitoring
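Node-failure detection in Release 1 rests on heartbeat messages that stop arriving. The sketch below is a hypothetical timeout-based detector, loosely modeled on the keepalive, warntime and deadtime settings in Heartbeat's ha.cf; the class and its thresholds are illustrative, not Heartbeat's implementation. Timestamps are passed in so the logic can be tested without sockets or clocks.

```python
class HeartbeatMonitor:
    """Classify a peer by how long it has been silent."""

    def __init__(self, deadtime: float, warntime: float):
        self.deadtime = deadtime   # silence after which the node is declared dead
        self.warntime = warntime   # silence after which a late heartbeat is reported
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat from `node`, received over any medium."""
        self.last_seen[node] = now

    def status(self, node: str, now: float) -> str:
        # A node never heard from has infinite silence and counts as dead.
        silence = now - self.last_seen.get(node, float("-inf"))
        if silence >= self.deadtime:
            return "dead"
        if silence >= self.warntime:
            return "late"
        return "alive"


mon = HeartbeatMonitor(deadtime=30, warntime=10)
mon.heartbeat("node02", now=100.0)
```

After that heartbeat, node02 is "alive" at t=105, "late" at t=115 and "dead" at t=131. Choosing deadtime trades detection speed against false failovers on a briefly overloaded node.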

Linux-HA Release 2 capabilities

Built-in resource monitoring

Support for the OCF resource standard

Much Larger clusters supported (>= 8 nodes)

Sophisticated dependency model with rich constraint support (resources, groups, incarnations, master/slave) (needed for SAP)

XML-based resource configuration

Coming in 2.0.x:

Configuration and monitoring GUI

Support for GFS cluster filesystem

Multi-state (master/slave) resource support

Initially: no external IP or SAN connectivity monitoring

Linux-HA Release 1 Architecture

Linux-HA Release 2 Architecture (add TE and PE)

Resource Objects in Release 2

Release 2 supports “resource objects” which can be any of the following:

Primitive Resources

Resource Groups

Resource Clones – “n” resource objects

Multi-state resources

Classes of Resource Agents in R2 (resource primitives)

OCF – Open Cluster Framework – http://opencf.org/

Heartbeat – R1-style heartbeat resources

LSB – Standard LSB Init scripts

Stonith – Node Reset Capability

An OCF primitive object

<primitive id="WebIP" class="ocf" type="IPaddr" provider="heartbeat">
  <instance_attributes>
    <attributes>
      <nvpair name="ip" value="192.168.224.5"/>
    </attributes>
  </instance_attributes>
</primitive>

Attribute nvpairs are passed in environment to resource agent
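OCF resource agents receive each parameter as an environment variable named OCF_RESKEY_<name> and the action (start, stop, monitor, ...) as their first argument. A sketch that turns the primitive above into that environment and command line without running anything; the /usr/lib/ocf root is an assumed, commonly used install location.

```python
import xml.etree.ElementTree as ET

# The primitive from the slide, with straight quotes so it parses as XML.
PRIMITIVE_XML = """
<primitive id="WebIP" class="ocf" type="IPaddr" provider="heartbeat">
  <instance_attributes>
    <attributes>
      <nvpair name="ip" value="192.168.224.5"/>
    </attributes>
  </instance_attributes>
</primitive>
"""

OCF_ROOT = "/usr/lib/ocf"  # assumed install location; adjust for your system


def ocf_environment(primitive_xml: str, base_env: dict[str, str]) -> dict[str, str]:
    """Return base_env plus OCF_ROOT and one OCF_RESKEY_<name> per <nvpair>."""
    env = dict(base_env)
    env["OCF_ROOT"] = OCF_ROOT
    for nv in ET.fromstring(primitive_xml.strip()).iter("nvpair"):
        env["OCF_RESKEY_" + nv.get("name")] = nv.get("value")
    return env


def ocf_command(primitive_xml: str, action: str) -> list[str]:
    """Agent path (root/resource.d/provider/type) plus the action as argument."""
    prim = ET.fromstring(primitive_xml.strip())
    return [f"{OCF_ROOT}/resource.d/{prim.get('provider')}/{prim.get('type')}", action]
```

To actually invoke the agent one could pass these to `subprocess.run(ocf_command(PRIMITIVE_XML, "monitor"), env=ocf_environment(PRIMITIVE_XML, os.environ))`; for monitor, exit code 0 (OCF_SUCCESS) means running and 7 (OCF_NOT_RUNNING) means cleanly stopped.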

An LSB primitive resource object

(i.e., an init script)

<primitive id="samba-smb-rsc" class="lsb" type="smb">
  <instance_attributes>
    <attributes/>
  </instance_attributes>
</primitive>
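Because an LSB resource is just an init script, a natural health check is the exit code of its status action, whose meanings the LSB specification defines. A sketch assuming that approach; the script path is only an example, and the result is only as trustworthy as the script's LSB compliance.

```python
import subprocess

# Exit codes of "<init script> status" as defined by the LSB specification.
LSB_STATUS = {
    0: "running",
    1: "dead, pid file exists",
    2: "dead, lock file exists",
    3: "not running",
    4: "unknown",
}


def interpret_status(exit_code: int) -> str:
    """Map an LSB status exit code to a description; reserved codes count as unknown."""
    return LSB_STATUS.get(exit_code, "unknown")


def lsb_status(script: str = "/etc/init.d/smb") -> str:
    """Run '<script> status' and interpret its exit code (the path is only an example)."""
    result = subprocess.run([script, "status"], capture_output=True)
    return interpret_status(result.returncode)
```

An init script that returns 0 from status even when the daemon is gone would defeat this check entirely, which is why script compliance matters for HA use.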

Resource Groups

Resource Groups provide a shorthand for creating ordering and co-location dependencies

Each resource object in the group is declared to have linear start-after ordering relationships

Each resource object in the group is declared to have co-location dependencies on each other

This is an easy way of converting release 1 resource groups to release 2

<group id="webserver">
  <primitive/>
  <primitive/>
</group>
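A group is shorthand for constraints between its members. This sketch expands a member list into the implied start-after and co-location pairs; the tuple representation and the member ids are hypothetical, not the CIB's constraint syntax.

```python
def expand_group(members: list[str]) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Expand a resource group into (ordering, colocation) constraint pairs.

    ordering:   (first, then)     'then' starts after 'first' (and stops before it)
    colocation: (rsc, with_rsc)   'rsc' must run on the same node as 'with_rsc'
    Chaining neighbors is enough: co-location is transitive, so every member
    ends up on the same node.
    """
    ordering = list(zip(members, members[1:]))
    colocation = [(later, earlier) for earlier, later in ordering]
    return ordering, colocation


# Hypothetical member ids for the "webserver" group.
ordering, colocation = expand_group(["WebIP", "WebFS", "apache"])
```

Three members yield two ordering pairs and two co-location pairs, which is why groups keep Release 1 style configurations short.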

Resource Clones

Resource Clones allow one to have a resource object which runs multiple (“n”) times on the cluster

This is useful for managing

load balancing clusters where you want “n” of them to be slave servers

Cluster filesystem mount points

Cluster Alias IP addresses

Cloned resource object can be a primitive or a group

Multi-State (master/slave) Resources

(coming in approx. 2.0.1)

Normal resources can be in one of two stable states:

running

stopped

Multi-state resources can have more than two stable states. For example:

running-as-master

running-as-slave

stopped

This is ideal for modeling replication resources like DRBD
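The three stable states can be pictured as a small state machine. The transition table below is an assumption for illustration (start lands in slave, promote/demote switch roles, a master must be demoted before it stops); for DRBD, master would correspond to the Primary role and slave to Secondary.

```python
# Assumed transition table for a master/slave resource; action names are illustrative.
TRANSITIONS = {
    ("stopped", "start"): "running-as-slave",
    ("running-as-slave", "promote"): "running-as-master",
    ("running-as-master", "demote"): "running-as-slave",
    ("running-as-slave", "stop"): "stopped",
}


def transition(state: str, action: str) -> str:
    """Return the next stable state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a resource that is {state}") from None
```

Under this model, promoting a stopped resource is rejected: it must be started (as slave) first.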

Basic Dependencies in Release 2

Ordering Dependencies

start before (normally implies stop after)

start after (normally implies stop before)

Mandatory Co-location Dependencies

must be co-located with

cannot be co-located with
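Ordering dependencies form a directed graph, so a valid start order is a topological sort and the stop order is its reverse. A sketch using Python's standard graphlib, plus a checker for the two mandatory co-location rules; the resource names are hypothetical.

```python
from graphlib import TopologicalSorter


def start_and_stop_order(start_after: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """start_after maps each resource to the resources it must start after.

    The start order is any topological order of that graph; the stop order is
    its reverse ("start after" normally implies "stop before").
    Raises graphlib.CycleError if the ordering constraints contradict each other.
    """
    start = list(TopologicalSorter(start_after).static_order())
    return start, start[::-1]


def colocation_ok(placement: dict[str, str],
                  together: list[tuple[str, str]],
                  apart: list[tuple[str, str]]) -> bool:
    """Check a resource -> node placement against mandatory co-location rules."""
    return (all(placement[a] == placement[b] for a, b in together)
            and all(placement[a] != placement[b] for a, b in apart))


# Hypothetical resources: apache needs its filesystem and IP address first.
deps = {"WebFS": {"drbd0"}, "WebIP": set(), "apache": {"WebFS", "WebIP"}}
start, stop = start_and_stop_order(deps)
```

The exact start order is not unique (WebIP may come before or after drbd0), but every valid order starts drbd0 before WebFS and both WebFS and WebIP before apache.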

Resource Location Constraints

Mandatory Constraints:

Resource Objects can be constrained to run on any selected subset of nodes. Default depends on setting of symmetric_cluster.

Preferential Constraints:

Resource Objects can also be preferentially constrained to run on specified nodes by providing weightings for arbitrary logical conditions

The resource object is run on the node which has the highest weight (score)
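Mandatory and preferential location constraints combine into a simple choice: filter to the eligible nodes, then take the highest score. A sketch; the function is hypothetical, and the opt-in behavior shown for symmetric_cluster=False (only nodes some constraint scores are eligible) is an assumed simplification.

```python
from typing import Optional


def choose_node(nodes: list[str],
                scores: dict[str, float],
                allowed: Optional[set[str]] = None,
                symmetric_cluster: bool = True) -> Optional[str]:
    """Pick the node a single resource object should run on.

    nodes:             cluster members, in tie-breaking order
    scores:            node -> summed weight from preferential constraints
    allowed:           mandatory constraint; None means "no explicit subset"
    symmetric_cluster: if True, every node is eligible unless excluded;
                       if False, only nodes given a score are eligible
                       (assumed opt-in behavior)
    """
    eligible = [n for n in nodes
                if (allowed is None or n in allowed)
                and (symmetric_cluster or n in scores)]
    if not eligible:
        return None
    # max() keeps the first of several equally scored nodes.
    return max(eligible, key=lambda n: scores.get(n, 0))


nodes = ["node01", "node02", "node03"]
```

A preference for node02 wins when node02 is allowed; if a mandatory constraint excludes it, the resource falls back to the first remaining node.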

Advanced Constraints

Nodes can have arbitrary attributes associated with them in name=value form

Attributes have types: int, string, version

Constraint expressions can use these attributes as well as node names, etc., in largely arbitrary ways

Operators:

=, !=, <, >, <=, >=

defined(attrname), undefined(attrname),

colocated(resource id), not colocated(resource id)
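Attribute types matter because the same comparison gives different answers depending on type: "8" >= "10" is true as strings but false as integers, and 2.6.12 is newer than 2.6.9 as a version but "smaller" as a string. A sketch of typed evaluation for the comparison and defined/undefined operators; the function names are hypothetical, and colocated() is left out because it depends on current placement.

```python
import operator
from typing import Optional

COMPARISONS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def convert(value: str, attr_type: str):
    """Interpret an attribute value according to its declared type."""
    if attr_type == "int":
        return int(value)
    if attr_type == "version":
        # "2.6.12" -> (2, 6, 12), so 2.6.12 > 2.6.9 as a version number.
        return tuple(int(part) for part in value.split("."))
    return value  # "string": plain lexical comparison


def evaluate(node_attrs: dict[str, str], attr: str, op: str,
             value: Optional[str] = None, attr_type: str = "string") -> bool:
    """Evaluate one expression such as 'cpus >= 4' against a node's attributes."""
    if op == "defined":
        return attr in node_attrs
    if op == "undefined":
        return attr not in node_attrs
    if attr not in node_attrs:
        return False  # comparisons on a missing attribute are treated as false
    return COMPARISONS[op](convert(node_attrs[attr], attr_type),
                           convert(value, attr_type))


# Hypothetical node attributes.
attrs = {"#uname": "node01", "cpus": "8", "kernel": "2.6.12"}
```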

Advanced Constraints (cont'd)

Each constraint is associated with a particular resource and is evaluated in the context of a particular node.

A given constraint has a boolean predicate built from the expressions above, and is associated with a weight and a condition.

If the predicate is true, then the condition is used to compute the weight associated with locating the given resource on the given node.

Conditions are given weights, positive or negative. Additionally, there are special values for modeling must-have (and must-not-have) conditions:

+INFINITY

-INFINITY
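Per node, the weights of all rules whose predicates hold are combined, with the INFINITY values acting as absolutes. A sketch using floating-point infinity; the precedence rule (any -INFINITY forbids the node even against +INFINITY) is an assumption made here, and the "standby" attribute is a hypothetical example.

```python
import math

INFINITY = math.inf


def combine_scores(weights: list[float]) -> float:
    """Combine the weights of every rule whose predicate held on one node.

    Assumed semantics: any -INFINITY forbids the node outright (even against
    a +INFINITY); otherwise any +INFINITY makes the node mandatory; otherwise
    the finite weights are summed. Checking first avoids inf + -inf = nan.
    """
    if -INFINITY in weights:
        return -INFINITY
    if INFINITY in weights:
        return INFINITY
    return sum(weights)


def node_score(rules, node_attrs: dict[str, str]) -> float:
    """rules is a list of (predicate, weight); predicate(node_attrs) -> bool."""
    return combine_scores([w for predicate, w in rules if predicate(node_attrs)])


# Hypothetical rules: prefer node01, never run on a node marked standby.
rules = [
    (lambda a: a.get("#uname") == "node01", 100),
    (lambda a: a.get("standby") == "on", -INFINITY),
]
```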

rsc_location information

We prefer the webserver group to run on host node01

<rsc_location id="run_Webserver" group="webserver">
  <rule id="rule_webserver" score="100">
    <expression attribute="#uname" operation="eq" value="node01"/>
  </rule>
</rsc_location>
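To see what the constraint above does on each node, one can parse it and score the rule against a node's attributes (#uname being the node name). A sketch that handles only operation="eq" and assumes every expression in a rule must hold; it is not the real policy engine.

```python
import xml.etree.ElementTree as ET

RSC_LOCATION_XML = """
<rsc_location id="run_Webserver" group="webserver">
  <rule id="rule_webserver" score="100">
    <expression attribute="#uname" operation="eq" value="node01"/>
  </rule>
</rsc_location>
"""


def rule_scores(xml_text: str, node_attrs: dict[str, str]) -> int:
    """Sum the scores of rules whose expressions all hold on this node.

    Only operation="eq" is understood; any other operation counts as false.
    """
    total = 0
    for rule in ET.fromstring(xml_text.strip()).iter("rule"):
        if all(expr.get("operation") == "eq"
               and node_attrs.get(expr.get("attribute")) == expr.get("value")
               for expr in rule.findall("expression")):
            total += int(rule.get("score"))
    return total
```

node01 scores 100 and any other node scores 0, so the webserver group prefers node01 but, being only a preference, can still run elsewhere if node01 fails.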

DRBD – RAID1 over the LAN

DRBD is a block-level replication technology

Every time a block is written on the master side, it is copied over the LAN and written on the slave side

Typically, a dedicated replication link is used

It is extremely cost-effective – common with xSeries

Worst case, around 10% throughput loss

Recent versions have very fast “full” resync
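A toy model of "RAID1 over the LAN": every write goes to the local disk and to the peer, and while the peer is unreachable the written block numbers are remembered so only those blocks need resyncing later. The class is hypothetical and the dirty set is a simplified stand-in, not DRBD's actual data structures or protocol.

```python
class ReplicatedDisk:
    """Mirror block writes to a peer; track blocks written while disconnected."""

    def __init__(self, peer: dict[int, bytes]):
        self.local: dict[int, bytes] = {}
        self.peer = peer              # stands in for the disk on the other node
        self.peer_connected = True
        self.dirty: set[int] = set()  # blocks the peer has not seen yet

    def write(self, block: int, data: bytes) -> None:
        self.local[block] = data
        if self.peer_connected:
            self.peer[block] = data   # in reality sent over the replication link
        else:
            self.dirty.add(block)

    def reconnect(self) -> int:
        """Copy only the blocks written while disconnected; return how many."""
        self.peer_connected = True
        for block in self.dirty:
            self.peer[block] = self.local[block]
        resynced = len(self.dirty)
        self.dirty.clear()
        return resynced
```

Three writes to two distinct blocks during an outage cost only two block copies on reconnect, which is why tracking changed blocks makes resync much cheaper than copying the whole device.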

Security Considerations

Cluster: A computer whose backplane is the Internet

If this isn't scary, you don't understand...

You may think you have a secure cluster network

You're probably mistaken now

You will be in the future

Secure Networks are Difficult Because...

Security is not often well-understood by admins

Security is well-understood by “black hats”

Network security is easy to breach accidentally

Users bypass it

Hardware installers don't fully understand it

Most security breaches come from “trusted” staff

Staff turnover is often a big issue

Virus/Worm/P2P technologies will create new holes especially for Windows machines

Security Advice

Good HA software should be designed to assume insecure networks

Not all HA software assumes insecure networks

Good HA installation architects use dedicated (secure?) networks for intra-cluster HA communication

Crossover cables are reasonably secure – all else is suspect ;-)
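Designing for insecure networks means every cluster message must be authenticated and replays rejected. Heartbeat itself uses a shared key from its authkeys file (sha1, md5 or crc methods); this sketch instead swaps in HMAC-SHA256 from Python's standard library with a per-node sequence number, and is not Heartbeat's wire format. The key and message layout are hypothetical.

```python
import hashlib
import hmac

SHARED_KEY = b"change-me"  # hypothetical; distribute the real key out of band


def _signed_part(node: str, seq: int, payload: bytes) -> bytes:
    return f"{node}|{seq}|".encode() + payload


def make_packet(node: str, seq: int, payload: bytes, key: bytes = SHARED_KEY):
    """Sender side: attach an HMAC-SHA256 tag covering node, sequence and payload."""
    tag = hmac.new(key, _signed_part(node, seq, payload), hashlib.sha256).digest()
    return node, seq, payload, tag


class Receiver:
    """Accept only packets with a valid tag and a sequence number not seen before."""

    def __init__(self, key: bytes = SHARED_KEY):
        self.key = key
        self.last_seq: dict[str, int] = {}

    def accept(self, node: str, seq: int, payload: bytes, tag: bytes) -> bool:
        expected = hmac.new(self.key, _signed_part(node, seq, payload), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return False  # forged or corrupted
        if seq <= self.last_seq.get(node, -1):
            return False  # replayed or stale
        self.last_seq[node] = seq
        return True
```

A replayed packet and a packet signed with the wrong key are both rejected. A real implementation must also handle a sender restarting and resetting its sequence numbers.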

References

http://linux-ha.org/

http://linux-ha.org/download/

http://linux-ha.org/SuccessStories

http://linux-ha.org/Certifications

http://linux-ha.org/NewHeartbeatDesign

www.linux-mag.com/2003-11/availability_01.html

Legal Statements

IBM is a trademark of International Business Machines Corporation.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service marks of others.

This work represents the views of the author and does not necessarily reflect the views of the IBM Corporation.