Installation and Maintenance of Health IT Systems, Unit 9: Creating Fault-Tolerant Systems, Backups, and Decommissioning (Lecture b)


DESCRIPTION

The Health IT Workforce Curriculum was developed for U.S. community colleges to enhance workforce training programs in health information technology. The curriculum consists of 20 courses of 3 credits each. Each course includes instructor manuals, learning objectives, syllabi, video lectures with accompanying transcripts and slides, exercises, and assessments. The materials were authored by Columbia University, Duke University, Johns Hopkins University, Oregon Health & Science University, and University of Alabama at Birmingham. The project was funded by the U.S. Office of the National Coordinator for Health Information Technology. All of the course materials are available under a Creative Commons Attribution Noncommercial ShareAlike (CC BY-NC-SA) License (http://creativecommons.org/licenses/by-nc-sa/3.0/). The course description, learning objectives, author information, and other details may be found at http://www.merlot.org/merlot/viewPortfolio.htm?id=842513. The full collection may also be accessed at http://knowledge.amia.org/onc-ntdc.

TRANSCRIPT

  • Installation and Maintenance of Health IT Systems: Creating Fault-Tolerant Systems, Backups, and Decommissioning, Lecture b. This material (Comp8_Unit9b) was developed by Duke University, funded by the Department of Health and Human Services, Office of the National Coordinator for Health Information Technology under Award Number IU24OC000024.

    Installation and Maintenance of Health IT Systems Creating Fault-Tolerant Systems, Backups, and Decommissioning Lecture b

  • Creating Fault-Tolerant Systems, Backups, and Decommissioning: Learning Objectives
    - Define availability, reliability, redundancy, and fault tolerance (Lecture a)
    - Explain areas and outline rules for implementing fault-tolerant systems (Lecture a)
    - Perform risk assessment (Lecture a)
    - Follow best practice guidelines for common implementations (Lecture b)
    - Develop strategies for backup and restore of operating systems, applications, configuration settings, and databases (Lecture c)
    - Decommission systems and data (Lecture c)

    Health IT Workforce Curriculum Version 3.0/Spring 2012


  • Creating Fault Tolerance: Computer Hardware
    - Redundant and fault-tolerant hardware costs more
    - Computers are workstations and servers
    - Workstations need little fault tolerance: no critical data; used interchangeably
    - Servers need redundancy and fault tolerance:
      - Hot-swap hard drives
      - Hot-plug expansion cards
      - Error checking and correcting, hot-add memory
      - Redundant and hot-swap fans
      - Redundant power supply (PSU)
    - Multiple servers:
      - Clustered systems are complex but highly available
      - Mirrored servers are less complex but highly available
      - A hot spare is the simplest configuration but requires effort after failure

    (Tulloch, 2005)


  • Creating Fault Tolerance: Data Storage
    - Store data redundantly, so that single failures cause no loss
    - RAID (Redundant Array of Independent Disks) for hard drives:
      - RAID 0 provides no fault tolerance! Speed increase only
      - RAID 1 (disk mirroring): fast reading, simple, easy
      - RAID 5 (disk striping with distributed parity): increased speed and reliability with relatively few disks, complex; critical systems should include a hot spare
      - RAID 6 (disk striping with double distributed parity): increased speed and additional reliability with relatively few disks, similar to RAID 5 in complexity

    (Tulloch, 2005; RAID, 2012)


  • Example RAID Arrays (en:User:Cburnett, 2006)


  • Creating Fault Tolerance: Data Storage (cont'd)
    - Store data redundantly, so that single failures cause no loss
    - Distributed file system running over a network:
      - Distributed File System (DFS) for Windows, used with File Replication Service (FRS) to duplicate data
      - Others depend on platform: ZFS (Solaris), AFS (general UNIX), GFS (Red Hat)
    - SAN (Storage Area Network), NAS (Network Attached Storage): EMC and NetApp are large vendors
    - Cloud or hosted storage uses the Internet (let someone else worry about drives!): Dropbox, iCloud, Amazon S3, Windows Azure Storage

    (Tulloch, 2005)


  • Creating Fault Tolerance: Virtualization
    - Types of virtualization:
      - Storage virtualization (discussed previously)
      - Server virtualization: virtual machines (VMs)
    - Virtual machine = software emulation of a physical environment
    - A server running VMs is called a VM host; multiple VMs run on a single host
    - Advantages:
      - Easy upgrading and scalability
      - Simplified hardware management and fault tolerance
      - Easy to integrate existing systems and infrastructure
    - Disadvantages: a slight performance hit, and more systems down with a single failure
    - Some services (e.g., databases) are not perfectly suited for virtualization; best practices for each service are available from the service vendor
    - Infrastructure virtualization:
      - Everything accessed through remote interfaces
      - Contracted level of service is important to specify
      - Simple devices + Internet access = Infrastructure as a Service (IaaS)

    (Sanford, 2010)


  • Creating Fault Tolerance: Off-Site Hosting and Access
    - Hosted servers are similar to hosted storage, but can maintain an entire environment
      - Web server hosting is an early example
      - Virtual servers in the cloud
      - System hardware is extremely reliable and fault tolerant, backed by service guarantees
    - Ensure availability for servers with:
      - Redundancy and fault tolerance in network infrastructure: switches with Spanning Tree, routers with secondary or backup links
      - Multiple Internet connections: multihoming
      - Uninterruptible Power Supply (UPS) and backup power in key areas, e.g., server rooms, wiring closets, critical PCs


  • Creating Fault Tolerance: Software as a Service (SaaS)
    - SaaS: also known as Application Service Provider (ASP) or cloud provider
    - Benefits:
      - No local hardware admin costs (except network access)
      - Service contract guarantees very high fault tolerance
      - Accessible from PCs, tablets, potentially anything with a web browser
    - Drawbacks:
      - Cost grows as usage grows; not a fixed cost
      - Network access can fail; whose fault is it? The Internet access provider, the SaaS host provider, or the SaaS company or software?



  • Creating Fault-Tolerant Systems, Backups, and Decommissioning: Summary, Lecture b
    - Best practices for providing fault-tolerant computer hardware, data storage, virtualization, remote hosting, and network access


  • Creating Fault-Tolerant Systems, Backups, and Decommissioning: References, Lecture b

    References
    - RAID [cited 2012 January 31]. Retrieved from: http://en.wikipedia.org/wiki/RAID
    - Sanford, R. (April 2010). Electronic Health Records Need a Fail-Proof Foundation to Deliver on Quality, Economy Promises. Health News Digest. Available from: http://www.healthnewsdigest.com/news/Guest_Columnist_710/Electronic_Health_Records_Need_a_Fail-Proof_Foundation_to_Deliver_on_Quality_Economy_Promises_2_printer.shtml
    - Tulloch, M. (April 2005). Implementing Fault Tolerance on Windows Networks. Available from: http://www.windowsnetworking.com/articles_tutorials/Implementing-Fault-Tolerance-Windows-Networks.html

    Acknowledgement: The following reference generally informed the unit: Shackhow, T. et al. (June 2008). EHR Meltdown: How to Protect Your Patient Data. Fam Pract Manag, 15(6), A3-A8. Available from: http://www.aafp.org/fpm/2008/0600/pa3.html

    Images
    - Slide 5: RAID 0, RAID 1, RAID 5, RAID 6 [en:User:Cburnett]. c2006 [updated 2000 Jan 28; cited 2006 Feb 15]. Available from: http://commons.wikimedia.org/wiki/Redundant_array_of_independent_disks


    Welcome to Installation and Maintenance of Health IT Systems: Creating Fault-Tolerant Systems, Backups, and Decommissioning. This is Lecture b.

    This component, Installation and Maintenance of Health IT Systems, covers fundamentals of selection, installation, and maintenance of typical Electronic Health Records (EHR) systems.

    This unit, Creating Fault-Tolerant Systems, Backups, and Decommissioning, will discuss ensuring availability and resiliency through fault tolerance, data reliability through backup, and secure decommissioning of EHR systems.

    The objectives for this unit, Creating Fault-Tolerant Systems, Backups, and Decommissioning, are to:

    Define availability, reliability, redundancy, and fault tolerance; explain areas and outline rules for implementing fault-tolerant systems; perform risk assessment; follow best practice guidelines for common implementations; develop strategies for backup and restore of operating systems, applications, configuration settings, and databases; and decommission systems and data.

    As healthcare organizations adopt new technology to improve their efficiency, their dependence on that technology grows dramatically. What happens to all of these critical applications if a failure occurs? What about the integrity of caregivers' data in the event of a disaster?

    In lecture b, we will present guidelines for fault tolerance in commonly used systems.

    Let's begin discussing best practices for fault tolerance regarding several common technologies, beginning with computer hardware.

    Be aware that computers that implement fault-tolerant technologies cost more.

    Computer hardware can be divided into workstations (client computers, used directly) and servers (not directly accessible, used remotely).

    Workstation fault tolerance is less critical for the simple reason that workstations should not store critical data. That is, any critical information they contain or use should immediately be delivered over the network to the server. This means that a failed workstation loses no data, and access is still available by simply using another workstation. If you have several workstations in constant use, having a spare that is set up and available is wise. Additionally, each workstation can be less expensive because it need not use costly fault-tolerant subsystems.

    Server fault tolerance is much more important. Because servers store critical data, the availability of that data is paramount. In this case, it's wise to invest in fault-tolerant technologies and subsystems. Technologies that help the server remain powered up and accessible include:

    - Hot-swap hard drives, allowing replacement of failed drives or addition of new ones while the server is running and available;
    - Hot-plug expansion card slots, allowing replacement of failed expansion cards (like network or storage access cards) or the addition of new ones while the server is running and available;
    - Error checking and correcting memory, which automatically senses small errors that can occur in memory and eliminates them;
    - Hot-add memory, allowing for expansion of memory capacity without rebooting;
    - Redundant and hot-swap fans, to maintain server cooling as recommended and reduce component failure rates;
    - Redundant power supplies (PSUs), which for maximum utility should be plugged into different power sources. Connecting two power supplies to the same wall outlet protects against failure of one PSU, but not against a tripped circuit breaker for that line. Connecting to different circuits (or, for large organizations, to different power sources) is a best practice.

    Finally, as with workstations, multiple servers can be available, allowing workstations to connect to whichever is most available. Multiple live servers may be configured as a cluster (a complex environment, but high performance and availability), as mirrored servers (less complex but highly available), or as an online backup, a hot spare (the simplest multiserver configuration, but requiring some work in case of primary server failure).
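    A hot-spare arrangement like the one just described depends on monitoring: something must notice that the primary server is down and redirect clients to the spare. The Python sketch below shows the core decision logic only; the hostnames, port, and failure threshold are illustrative assumptions, not part of any specific product.

```python
# Minimal sketch of hot-spare failover logic: a monitor checks the primary
# and fails over to the spare after several consecutive missed health checks.
import socket

PRIMARY = ("primary.example.org", 5432)  # hypothetical hosts and port
SPARE = ("spare.example.org", 5432)
MISSED_LIMIT = 3  # consecutive failures before failing over

def is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def choose_server(check_history, limit=MISSED_LIMIT):
    """Direct clients to the spare once the last `limit` checks all failed."""
    if len(check_history) >= limit and not any(check_history[-limit:]):
        return SPARE
    return PRIMARY
```

    Even with such automation, the hot spare remains the simplest but not effortless option: after failover, an administrator still has to repair the primary and resynchronize its data.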

    Tulloch sums up hardware redundancy by saying, "Although expensive, these strategies can provide a simple solution to ensuring near 100% uptime in business-critical environments and the expense of their integration reflects only a fraction of the cost associated with lost production or potential errors when one of these critical components fails."

    The primary method for data storage fault tolerance is to have the same data on multiple hard drives, in case one fails. In its simplest form, it's as simple as copying a file into two locations, for instance to a hard drive and to a flash drive. However, this simple duplicate-copy method scales poorly, so best practices for data storage include technologies that automatically store data on multiple physical devices.

    The most common of these technologies is RAID, or Redundant Array of Independent Disks. This is a set of multiple drives that automatically distributes data so that even with the loss of any one drive, all data is still available. All RAID works by breaking data into small blocks that are stored on separate drives. Though there are several different types of RAID, the following are the most common.

    RAID 0, also known as data striping, provides no fault tolerance because there is no duplication of data blocks. Instead, blocks are distributed for a speed or capacity increase only: data can be read from or written to both drives simultaneously.
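    As a rough sketch (not any vendor's implementation), the RAID 0 block placement just described is a simple round-robin in Python. Reads and writes parallelize across the drives, but losing either drive loses part of every file:

```python
def stripe_blocks(blocks, n_drives):
    """Distribute data blocks round-robin across drives, RAID 0 style."""
    drives = [[] for _ in range(n_drives)]
    for i, block in enumerate(blocks):
        drives[i % n_drives].append(block)
    return drives

# Eight blocks of one file striped over two drives:
layout = stripe_blocks(["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"], 2)
# drive 0 holds A1, A3, A5, A7; drive 1 holds A2, A4, A6, A8
```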

    RAID 1 is known as mirroring, and provides fault tolerance by duplicating each data block on two or more drives. If one drive fails, all of the data blocks are still available on the other drive. There is a write performance penalty because all data must be written twice, but read speed can nearly double because data can be read from either drive. It's useful for simple systems and relatively easy to set up.

    RAID 5 is data striping with distributed parity, and requires a minimum of 3 drives, with standard configurations using 4 to 6 drives. Here, a series of data blocks is spread across all drives in a stripe, similar to RAID 0. However, for each stripe an additional parity block is created by the system and stored in the stripe (with the parity blocks of different stripes distributed across the drives). In the event of a drive failure, this parity block, along with all the remaining blocks in the stripe, is used to mathematically calculate the value of the block on the missing drive. In this way, even though each block is only stored once, all data is available even after failure of a single drive. Read speeds are much increased, almost linearly with the number of drives. The drawback to this system is twofold: first, there is a small performance penalty in writing blocks (as the parity block is calculated and saved along with the new block), and second, when a drive fails, each read of a missing block requires reading the entire stripe to re-create the missing piece. If a second drive were to fail in a RAID 5, or a drive failed during a rebuilding process, ALL data would be lost (barring expensive and time-consuming third-party data reconstruction). For this reason, RAID 5 is often used with a hot spare, which is a drive with no data, but online and ready to take the immediate place of a failed drive. In case of a drive failure, the array can immediately begin rebuilding the missing data onto the spare drive. This rebuilding can mean additional performance degradation until the data is rebuilt, but it is desirable.
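    The parity calculation just described can be made concrete. RAID 5 parity is the bitwise XOR of all data blocks in a stripe, and because XOR is its own inverse, any single missing block can be recomputed by XOR-ing the survivors with the parity block. This Python sketch illustrates the arithmetic; it is not a disk driver:

```python
from functools import reduce

def parity_block(data_blocks):
    """Compute a RAID 5-style parity block: byte-wise XOR of all blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*data_blocks))

def reconstruct(surviving_blocks, parity):
    """Recover the one missing data block from the survivors plus parity.

    XOR is its own inverse, so XOR-ing the parity with every surviving
    block leaves exactly the missing block.
    """
    return parity_block(list(surviving_blocks) + [parity])

# One stripe of three data blocks, as on a 4-drive RAID 5 (3 data + 1 parity).
stripe = [b"AAAA", b"BBBB", b"CCCC"]
p = parity_block(stripe)

# Simulate losing the drive holding the second block, then rebuild it.
recovered = reconstruct([stripe[0], stripe[2]], p)
assert recovered == stripe[1]
```

    This also shows why a second failure is fatal: with two blocks of a stripe missing, the single XOR equation no longer has a unique solution.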

    RAID 6 is very similar to RAID 5, differing by using one additional drive per array and two parity blocks per stripe instead of one. This means that a RAID 6 array can tolerate two drive failures and still retain data. The function is similar to a RAID 5 with a hot spare, but without the immediate performance degradation caused by rebuilding the failed data onto the spare drive.

    The images show the blocks of data and their placement in several simple examples of RAID 0, RAID 1, RAID 5, and RAID 6.

    Each cylinder represents a drive, with each block of data labeled, and different files shown in different colors. For instance, only one file is illustrated in each of the RAID 0 and RAID 1 images, spread over two drives. RAID 0 shows one file 8 blocks in size; RAID 1 shows one file four blocks in size, but with copies on disk 0 and disk 1.

    RAID 5 and RAID 6 images show files 3 blocks in size, distributed across multiple drives, but with one parity block for RAID 5 and two parity blocks for RAID 6.

    In all cases except RAID 0, you can see how the removal of any one disk (or any two disks for RAID 6) still leaves all the blocks necessary to reconstruct all data in the array.

    Aside from RAID, data can gain increased fault tolerance through several other means. Primary among these is to distribute the information across multiple servers. Distributed file systems use networks and multiple servers to implement fault tolerance and increased reliability.

    Distributed File System (DFS) is a Microsoft Windows Server technology that allows a single file system to be visible across multiple servers, with the File Replication Service providing the mirroring of the data in the background. Various other platforms have similar technologies, including ZFS from Sun and GFS from Red Hat.

    Storage area network (SAN) technology refers to a system where data is accessed over a separate high-speed network dedicated to providing fast and reliable access to data files, rather than supporting general network transmissions. SANs are expensive, but provide the greatest availability for data. Network attached storage (NAS) is also designed to store data files over a network, but uses the existing network. NAS devices are often just simplified servers that do nothing but data storage.

    Cloud storage refers to data stored through transmission over the Internet. Providers like Dropbox or iCloud are workstation oriented, automatically replicating data from the workstation to other workstations and to Internet-accessible servers. Alternatively, Amazon S3 or Windows Azure Storage provide programmatic access to data over the Internet. This gives applications designed for these systems the ability to access arbitrary amounts of storage without being concerned with details such as disk capacity or redundancy level. Availability concerns move away from drive failure toward network access and throughput. Cloud storage can be more expensive, with charges for stored data plus an additional cost for the amount of data transferred. For instance, if you have 10 GB of storage, you may pay a monthly fee for that storage, but also pay $1 per GB of data transferred; just uploading the data the first time might cost $10!
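    The pricing arithmetic above can be sketched as a simple calculation. The rates used here ($0.10 per GB-month stored, $1 per GB transferred) are hypothetical illustrations, not any provider's actual prices:

```python
def monthly_cloud_cost(stored_gb, transferred_gb,
                       storage_rate=0.10, transfer_rate=1.00):
    """Hypothetical cloud-storage bill: a per-GB monthly storage fee
    plus a per-GB charge for data transferred that month."""
    return stored_gb * storage_rate + transferred_gb * transfer_rate

# First month: upload all 10 GB once, so transfer charges apply to all of it.
first_month = monthly_cloud_cost(stored_gb=10, transferred_gb=10)   # $11
# A quiet later month: same 10 GB stored, but only 1 GB moved.
later_month = monthly_cloud_cost(stored_gb=10, transferred_gb=1)    # $2
```

    The point of the exercise is that transfer, not storage, often dominates the bill, which is why throughput and network access become the availability and budgeting concerns.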

    Virtualization is a technology that provides fault tolerance by divorcing a system from the hardware it runs on. This allows, for instance, a server to be migrated to a separate physical machine, allowing the first machine to be taken offline for maintenance or powered down. Depending on the type of virtualization offered, this can even be done while the server is still operating (live migration!).

    Distributed and cloud storage systems offer one type of virtualization: virtualized storage. To the system, the storage is as available as local storage, but without the dependence on local drives. Another server can connect to this same virtualized storage and see the same information; this allows for the server redundancy described in the earlier slide.

    Complete server systems can be virtualized, using a container called a virtual machine. A virtual machine is a software emulation of a complete physical environment: drives, memory, network interfaces, and whichever other devices are supported by the virtual machine environment. All of the necessary information about the virtual machine is stored in files. These files can run on one virtual machine server (called a virtual machine host), then be moved to a separate virtual machine host without the virtual machine being aware of the change. Indeed, this technology allows all the data and services on an existing physical server to be converted to a virtual machine. Then, if the new virtual machine host were to fail, the virtual machine data files could run from a new host, potentially one with faster performance! Additionally, a single physical host with sufficient resources can run multiple concurrent virtual machines.

    Though this might sound like an "all your eggs in one basket" approach, it simplifies fault tolerance: you are able to focus efforts on securing the availability of a few physical computers instead of many.

    The advantages include the ability to upgrade and scale easily, simplified hardware management and fault tolerance, and ease of integration into existing systems and infrastructure. The disadvantage is that if the virtual environment experiences a complete failure, a large number of systems are affected. Additionally, there is a small performance penalty for virtualized servers, commonly ranging from 0.5% to 10% over running on bare metal. Memory- or data-access-intensive services may not be good candidates for virtualization because of this. The service vendor can provide recommendations for best practices.

    Infrastructure virtualization is basically outsourcing the infrastructure of your system. This requires tight integration, and good confidence in your provider and the network connecting you. Infrastructure as a Service means letting all of the services be provided by a contractor; great care should be taken if you are doing this with critical business systems.

    Because virtualization disconnects systems from specific hardware, virtualized systems are no longer as strongly tied to location. That is, if your system can be moved from server to server at your office, it could potentially be moved from a server at your office to a server at a hosting provider. Just as a previous slide discussed hosted storage, hosted systems are possible. An entire computing environment can be housed and running on server computers that are maintained by a service provider specializing in hosting platforms.

    The most prevalent example of this in the past has been web server hosting: many organizations exist (and have since the mid-1990s) that allow a small organization to create a website. The website is just a collection of data files, and perhaps database lookups, that run on hardware provided by the web server host. For larger or more adept customers, these web hosting providers would provide a server that could be more than just a website; any application the customer created could be run on this reliable server, physically maintained by the hosting provider.

    Virtualization expanded the customer base for these services. Currently, hosting providers offer standardized virtual server platforms. Customers have a server that they can access and configure as they please, with all the physical maintenance and reliability issues handled (with service guarantees) by the provider. Rackspace, Amazon Elastic Compute Cloud, and Windows Azure are examples of large, reliable providers that offer a platform on which to build what you need.

    However, these providers handle only one part of the availability problem: they ensure the host is massively reliable. Availability issues can still arise in several other areas.

    One is network access problems. These network problems can be local: what if a network switch in the office closet fails? They can also be remote: what happens if the Internet service provider experiences problems? Best practices here include building a robust network architecture with redundant switches using the Spanning Tree Protocol, and routers using secondary network links. If Internet access is required to connect with the EHR, then consider multihoming: maintaining redundant Internet connections by connecting to two or more different Internet service providers.

    Another availability issue is electrical power. Best practices include connecting an Uninterruptible Power Supply, or UPS, to critical servers, workstations, and network access equipment. This is basically a battery that allows the device to continue running for some time, depending on the size of the battery. Be sure that the UPS is monitored. Monitoring is a small program running on the server or workstation that communicates with the UPS and knows when power is lost and how long the UPS will be able to keep providing power. This allows the server or workstation to perform an orderly shutdown. The alternative to an orderly shutdown (a crash) can potentially cause data loss, so a UPS without monitoring is almost useless: a server that crashes 10 minutes after the power cuts out is no better than one that crashes instantly. For systems that need to run longer than the 10 to 60 minutes that UPSs commonly provide, a power generator is added. Best practices for using UPSs and generators focus on the load that they must maintain. Too many systems on backup power will drain it faster, and could even cause the backup to fail. Be sure to consult with the vendor regarding how much power each of your critical systems requires. For large medical facilities, sufficient backup power generation capability is required for accreditation.

    Another type of off-site hosting is via providers of Software as a Service. These providers may also be known as Application Service Providers, or ASPs. They provide complete systems, but instead of installing locally, they maintain all the data and equipment off-site, and require a connection to the provider for access to system data. Some providers may require some custom software installed on client workstations to access data, but many are designed to be used through standard web browsers.

    These web-based EHR systems have a number of attractive features: no hardware maintenance or administration costs, because all servers are hosted remotely; contractually guaranteed service levels that assure availability; and accessibility from multiple devices, from PCs and tablets to, potentially, anything with a web browser.

    There are two primary drawbacks to these systems. The first is cost: an organization may pay per physician, or per number of patient records, or by the amount of data transferred, or all of these. As these systems scale, the monthly cost of the system rises as well. The second drawback is that the single point of failure (remote access) for the system is actually several points of failure, none of which are under the control of your organization. That is, to access a web-based EHR, your PC and network have to work properly, as well as the network's connection to the Internet. But the Internet service provider may experience problems, or the SaaS hosting provider, or the SaaS company itself. You will have contractual service agreements with each of these companies, but sorting out which of them is causing a problem can be challenging when a system is inaccessible.
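    The cost-scaling drawback can be quantified with a simple break-even comparison. All dollar figures below are invented for illustration, not real EHR prices:

```python
def saas_monthly(physicians, per_physician_fee=500):
    """Hypothetical SaaS EHR pricing: a flat monthly fee per physician."""
    return physicians * per_physician_fee

def self_hosted_monthly(physicians, fixed_cost=6000):
    """Hypothetical self-hosted cost: mostly fixed (hardware amortization,
    admin staff), roughly independent of practice size."""
    return fixed_cost

# A small practice favors SaaS; a large one has passed the break-even point.
small = (saas_monthly(4), self_hosted_monthly(4))    # SaaS $2,000 vs $6,000
large = (saas_monthly(20), self_hosted_monthly(20))  # SaaS $10,000 vs $6,000
```

    Under these assumed rates the break-even point is 12 physicians; the general lesson is simply that SaaS trades a low entry cost for a bill that grows with usage.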

    This concludes Lecture b of Creating Fault-Tolerant Systems, Backups, and Decommissioning.

    Best practices were discussed for providing fault-tolerant computer hardware, data storage, virtualization, remote hosting, and network access.

    No audio.