
ISTORE: A Platform for Scalable, Available, Maintainable Storage-Intensive Applications

Aaron Brown, David Oppenheimer, Jim Beck, Rich Martin, Randi Thomas, David Patterson, and Kathy Yelick

Computer Science Division
University of California, Berkeley

http://iram.cs.berkeley.edu/istore/

ISTORE Philosophy: SAM

• The ISTORE project is researching techniques for bringing scalability, availability, and maintainability (SAM) to large server systems
• ISTORE vision: a self-testing HW/SW platform that automatically reacts to situations requiring an administrative response
  – brings self-maintenance to applications and storage
• ISTORE target: high-end servers for data-intensive infrastructure services
  – single-purpose systems managing large amounts of data for large numbers of active network users
  – e.g., TBs of data, 10,000s of requests/sec, millions of users

Motivation: Service Demands

• Emergence of a true information infrastructure
  – today: e-commerce, online database services, online backup, search engines, and web servers
  – tomorrow: more of the above (with ever-growing datasets), plus thin-client/PDA infrastructure support
  – these services have different needs than traditionally fault-tolerant services (ATMs, telephone switches, ...)
    » rapid software evolution
    » unpredictable, wildly fluctuating demand and user base
    » often must incorporate low-cost, off-the-shelf HW and SW components

Service Demands (2)

• Infrastructure users expect “always-on” service and constant quality of service
  – infrastructure must provide scalable fault-tolerance and performance-tolerance
    » to a rapidly growing and evolving application base
  – failures and slowdowns have major business impact
    » e.g., recent eBay, E*Trade, Schwab outages

The Need for 24x7 Availability

• Today’s widely deployed systems can’t provide 24x7 fault- and performance-tolerance
  – they rely on manual administration
    » static data and application partitioning
    » human detection of and response to most anomalous behaviors and changes in system environment
  – human administrators are too expensive, too slow, too prone to mistakes
    » Jim Gray reports 42% of Tandem failures due to administrator error (in 1985)
• Tomorrow’s ever-growing infrastructure systems need to be self-maintaining
  – self-maintaining systems anticipate problems and handle them as they arise, automatically

Self-Maintaining Systems

• Self-maintaining systems require:
  – a robust platform that provides online self-testing of its hardware and software
  – easy incremental scalability when existing resources stop providing desired quality of service
  – rapid detection of anomalous behavior and changes in system environment
    » failures, load spikes, changing access patterns, ...
  – fast and flexible reaction to detected conditions
  – flexible specification of conditions that trigger adaptation
• Systems deployed on the ISTORE platform will be self-maintaining

Target Application Model

• Scalable applications for data storage and access
  – e.g., bottom (data) tier of three-tier systems
• Desired properties:
  – ability to manage replicated/distributed state
    » including distribution of workload across replicas
  – ability to create and destroy replicas on the fly
  – persistence model that can tolerate node failure without loss of data
    » logging of writes, soft-state, etc.
  – ability to migrate service between nodes
    » e.g., checkpoint and restore, or kill and restart
  – built-in application self-testing

Target Application Model (2)

• What existing application architectures come close to fitting this model?
  – parallel shared-nothing DBMSs
    » IBM DB2, Teradata, Tandem SQL/MX
  – distributed server applications
    » Lotus Notes/Domino
    » traditional distributed filesystems/fileservers
  – cluster-aware applications (with small mods?)
    » LARD cluster web server (Rice)
    » Microsoft Cluster Server Phase 2 (?)
• What doesn’t fit?
  – simple 2-node “hot standby” failover clusters
    » Microsoft Cluster Server Phase 1

The ISTORE Approach

• Divides self-maintenance into two components:
  1) reactive self-maintenance: dynamic reaction to exceptional system events
    » self-diagnosing, self-monitoring hardware
    » software monitoring and problem detection
    » automatic reaction to detected problems
  2) proactive self-maintenance: continuous online self-testing and self-analysis
    » automatic characterization of system components
    » in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used

Reactive Self-Maintenance

• ISTORE defines a layered system model for monitoring and reaction:

[Figure: layered system model. Provided by the ISTORE runtime system: self-monitoring hardware, SW monitoring, problem detection, coordination of reaction. Provided by the application: reaction mechanisms. Other labels: Policies, ISTORE API.]

• ISTORE API defines the interface between the runtime system and the application’s reaction mechanisms
• Policies define the system’s monitoring, detection, and reaction behavior

• Hardware architecture: plug-and-play intelligent devices with integrated self-monitoring, diagnostics, and fault injection hardware
  – intelligence used to collect and filter monitoring data
  – diagnostics and fault injection enhance robustness
  – networked to create a scalable shared-nothing cluster

[Figure: intelligent disk “brick” (disk, CPU, memory, diagnostic processor, redundant NICs), x64, in an intelligent chassis (scalable redundant switching, power, environment monitoring).]

ISTORE-II Hardware Vision

• System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk
• Target for +5-7 years:
• 1999 IBM MicroDrive:
  – 1.7” x 1.4” x 0.2” (43 mm x 36 mm x 5 mm)
  – 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
• 2006 MicroDrive?
  – 9 GB, 50 MB/s (1.6X/yr capacity, 1.4X/yr BW)
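The 2006 projection follows from compounding the stated yearly growth rates over the seven years from 1999. A minimal sketch of that arithmetic; the 1999 figures and rates are from the slide, while the helper name `project` is only illustrative:

```python
# Compound-growth projection from the 1999 MicroDrive to 2006.
# Growth rates and 1999 figures come from the slide; names are illustrative.

def project(value, rate_per_year, years):
    """Return value after compounding at rate_per_year for `years` years."""
    return value * rate_per_year ** years

YEARS = 2006 - 1999  # 7 years

capacity_mb = project(340, 1.6, YEARS)    # 340 MB growing 1.6X per year
bandwidth_mbs = project(5, 1.4, YEARS)    # 5 MB/s growing 1.4X per year

print(f"2006 capacity  ~ {capacity_mb / 1000:.1f} GB")
print(f"2006 bandwidth ~ {bandwidth_mbs:.0f} MB/s")
```

The results land slightly above the slide's rounded 9 GB and 50 MB/s, so those figures read as conservative roundings of the compounded values.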

2006 ISTORE

• ISTORE node
  – Add 20% pad to MicroDrive size for packaging, connectors
  – Then double thickness to add IRAM
  – 2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm)
• Crossbar switches growing by Moore’s Law
  – 2x/1.5 yrs → 4X transistors/3 yrs
  – Crossbars grow by N^2 → 2X switch/3 yrs
  – 16 x 16 in 1999 → 64 x 64 in 2005
• ISTORE rack (19” x 33” x 84”) (480 mm x 840 mm x 2130 mm)
  – 1 tray (3” high) → 16 x 32 → 512 ISTORE nodes
  – 20 trays + switches + UPS → 10,240 ISTORE nodes(!)
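The node size, rack capacity, and crossbar size on this slide can be reproduced with a few lines of arithmetic. This is a sketch under the slide's own assumptions (20% pad, doubled thickness, switch size doubling every 3 years); the variable names are illustrative:

```python
# Packaging arithmetic for the projected 2006 ISTORE node and rack.
# All input figures come from the slide; variable names are illustrative.

microdrive_in = (1.7, 1.4, 0.2)          # length, width, thickness in inches

# Add a 20% pad in every dimension, then double the thickness for the IRAM.
padded = tuple(d * 1.2 for d in microdrive_in)
node_in = (padded[0], padded[1], padded[2] * 2)

nodes_per_tray = 16 * 32                 # one 3-inch-high tray
nodes_per_rack = 20 * nodes_per_tray     # 20 trays, plus switches and UPS

# Crossbar size doubles every 3 years, starting from 16 x 16 in 1999.
ports_2005 = 16 * 2 ** ((2005 - 1999) // 3)

print([round(d, 2) for d in node_in])
print(nodes_per_tray, nodes_per_rack, ports_2005)
```

The padded node comes out at about 2.04” x 1.68” x 0.48”, which the slide rounds to 2.0” x 1.7” x 0.5”.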

• Each node includes extra diagnostic support
  – diagnostic processor: independent hardware running monitoring and control software
    » monitors hardware and environmental state not normally visible to system software
    » control
      • reboot/power-cycle main CPU
      • inject simulated faults: power, bus transients, memory errors, network interface failure, ...
  – separate “diagnostic network” connects the diagnostic processors of each brick
    » provides independent network path to diagnostic CPU
      • works when brick CPU is powered off or has failed


• Software collects and filters monitoring data
  – hardware monitors device “health”, environmental conditions, and indicators that software is working
    » some information processed locally to provide fail-fast behavior when higher-level software is deemed potentially untrustworthy
    » most information passed on to software monitoring
  – software monitoring layer also collects higher-level performance data, access patterns, app. heartbeats

[Layered system model: SW monitoring]

• The data is collected in a virtual “database”
  – desired monitoring data is selected and aggregated by specifying “views” over the database
    » database schema + views hide differences in monitoring implementation on heterogeneous HW and SW
• Running example
  – If ambient temperature of a shelf is rising significantly faster than that of other shelves,
    » reduce power consumption on those nodes, then
    » if necessary, migrate non-redundant data replicas off some nodes on that shelf and shut them down
  – view: for each shelf, average temperature across all temperature sensors on that shelf

[Layered system model: SW monitoring]
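The per-shelf temperature view from the running example could look roughly like the following. This is a minimal sketch, assuming the virtual monitoring database can be read as rows of (shelf, sensor, temperature); the row layout, sample values, and function name are hypothetical and not taken from ISTORE itself:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows of the virtual monitoring "database":
# (shelf id, sensor id, temperature in degrees C).
readings = [
    ("shelf-1", "t0", 24.0), ("shelf-1", "t1", 26.0),
    ("shelf-2", "t0", 31.0), ("shelf-2", "t1", 33.0),
]

def shelf_temperature_view(rows):
    """View: for each shelf, the average over all of its temperature sensors."""
    by_shelf = defaultdict(list)
    for shelf, _sensor, temp_c in rows:
        by_shelf[shelf].append(temp_c)
    return {shelf: mean(temps) for shelf, temps in by_shelf.items()}

print(shelf_temperature_view(readings))
```

Because the view only depends on the row schema, the sensors behind each row can differ across heterogeneous hardware without changing the code that consumes the view.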

• Conditions requiring administrative response are detected by observing values and/or patterns in the monitoring data
  – triggers specify these patterns and invoke appropriate adaptation algorithms
    » input to a trigger is a view of the monitoring data
    » views and triggers can be specified separately to allow
      • easy selection of desired reaction algorithm
      • easy redefinition of conditions that invoke a particular reaction
• Running example
  – trigger: change in temperature of one shelf > 0 and more than twice the change in temperature of any other shelf, averaged over a one-minute period

[Layered system model: Problem detection]
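The trigger condition from the running example can be written as a predicate over a view. The sketch below assumes the view already supplies each shelf's temperature change averaged over the one-minute period; the function name and input format are hypothetical:

```python
def overheating_shelves(delta_by_shelf):
    """Return shelves whose temperature change is > 0 and more than twice
    the change of every other shelf.

    delta_by_shelf maps shelf id -> temperature change over the one-minute
    period, as produced by a (hypothetical) view over the monitoring data.
    """
    hot = []
    for shelf, delta in delta_by_shelf.items():
        others = [d for s, d in delta_by_shelf.items() if s != shelf]
        if delta > 0 and all(delta > 2 * d for d in others):
            hot.append(shelf)
    return hot

deltas = {"shelf-1": 0.2, "shelf-2": 1.5, "shelf-3": 0.4}
print(overheating_shelves(deltas))
```

Keeping the predicate separate from the view mirrors the slide's point that the condition can be redefined without touching how the data is collected. Note that with a single shelf the comparison against other shelves is vacuous, so a real trigger would need its own rule for that case.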

• Adaptation algorithms coordinate application-level reaction mechanisms
  – adaptation algorithms define a sequence of operations that address the anomaly detected by the associated trigger
  – adaptation algorithms call application-implemented mechanisms via a standard API
    » but are independent of application mechanism details
• Running example: coordination of reaction
  1) identify nodes with non-redundant data
  2) invoke application mechanism to migrate that data off n of those nodes
  3) reduce power consumption by those n nodes
  4) install trigger to monitor temperature change and shut down nodes if power reduction is ineffective
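The four coordination steps above can be sketched as an adaptation algorithm that only talks to the application through an interface. Everything named here (`RecordingApp`, its methods, the trigger name) is a hypothetical stand-in, since the slides do not define concrete API calls:

```python
class RecordingApp:
    """Stand-in for application and runtime mechanisms; it only records calls.
    Every method name is hypothetical, not part of a published ISTORE API."""

    def __init__(self, non_redundant):
        self.non_redundant = non_redundant   # node id -> holds non-redundant data?
        self.log = []

    def has_non_redundant_data(self, node):
        return self.non_redundant.get(node, False)

    def migrate_data_off(self, node):
        self.log.append(("migrate", node))

    def reduce_power(self, node):
        self.log.append(("reduce_power", node))

    def install_trigger(self, name, nodes):
        self.log.append(("install_trigger", name, tuple(nodes)))


def react_to_hot_shelf(app, shelf_nodes, n):
    """Coordinate the four steps of the running example for one shelf."""
    # 1) identify nodes with non-redundant data
    candidates = [node for node in shelf_nodes if app.has_non_redundant_data(node)]
    chosen = candidates[:n]
    # 2) migrate that data off n of those nodes
    for node in chosen:
        app.migrate_data_off(node)
    # 3) reduce power consumption on those n nodes
    for node in chosen:
        app.reduce_power(node)
    # 4) watch the temperature; shut down if power reduction is ineffective
    app.install_trigger("shutdown_if_still_heating", chosen)
    return chosen


demo = RecordingApp({"n1": True, "n2": False, "n3": True})
print(react_to_hot_shelf(demo, ["n1", "n2", "n3"], n=1))
print(demo.log)
```

The algorithm never inspects how migration or power reduction is done, which is the independence from application mechanism details that the slide describes.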


• ISTORE expects reaction mechanisms to be implemented by the application
  – these reaction mechanisms are application-specific
    » e.g., moving data requires knowledge of data semantics, consistency policies, ...
  – a research goal of ISTORE is to provide a standard API to these mechanisms
    » initially, try to leverage and extend existing mechanisms to avoid wholesale rewriting of applications
      • many data-intensive applications already support functionality similar to the needed mechanisms
    » eventually, generalize and extend API to encompass mechanisms and needs of future applications

[Layered system model: Reaction mechanisms]

• Programmer or administrator specifies policies to control the system’s adaptive behavior
  – the policy compiler turns a high-level declarative specification of desired behavior into the appropriate:
    » adaptation algorithms (that invoke application mechanisms through the ISTORE API)
    » triggers (to invoke the adaptation algorithms when the appropriate conditions are detected)
    » views (that enable monitoring needed by the triggers)
• Running example
  – policy: if ambient temperature of a shelf is rising significantly faster than that of other shelves, reduce power and prepare to shut down nodes

[Layered system model: Policies]
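To make the policy-to-view/trigger/algorithm relationship concrete, here is a toy sketch of a declarative policy and a "compiler" that derives the three artifacts from it. The slides do not specify a policy language, so the `Policy` fields, the string formats, and `compile_policy` are all assumptions made purely for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    """Declarative policy; the field names are illustrative assumptions."""
    metric: str        # what to monitor, e.g. "ambient_temperature"
    group_by: str      # aggregation unit, e.g. "shelf"
    condition: str     # when to react
    reaction: str      # which adaptation algorithm to run

def compile_policy(policy):
    """Toy 'policy compiler': derive the view, trigger, and adaptation
    algorithm that the policy implies."""
    view = f"avg({policy.metric}) per {policy.group_by}"
    return {
        "view": view,
        "trigger": {"input": view, "condition": policy.condition},
        "adaptation": policy.reaction,
    }

hot_shelf = Policy(
    metric="ambient_temperature",
    group_by="shelf",
    condition="rising significantly faster than other shelves",
    reaction="reduce_power_and_prepare_shutdown",
)
print(compile_policy(hot_shelf))
```

The point of the sketch is only the data flow: the trigger takes the view as input, and the adaptation algorithm is selected independently of both.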

Summary: Layered System Model

• Layered system model for monitoring and reaction provides reactive self-maintenance

[Figure: layered system model. Provided by the ISTORE runtime system: SW monitoring, problem detection. Provided by the application: reaction mechanisms. Other labels: Policies, ISTORE API.]

• Self-maintenance in ISTORE also consists of proactive, continuous self-testing and analysis

The ISTORE Approach

• Divides self-maintenance into two components:
  1) reactive self-maintenance: dynamic reaction to exceptional system events
    » self-diagnosing, self-monitoring hardware
    » software monitoring and problem detection
    » automatic reaction to detected problems
  2) proactive self-maintenance: continuous online self-testing and self-analysis
    » in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used
    » automatic characterization of system components

Continuous Online Self-Testing

• Self-maintaining systems should automatically carry out preventative maintenance
  – need aggressive in situ component testing via
    » fault injection: triggering hardware and software error handling paths to verify their integrity/existence
    » stress testing: pushing HW/SW components past normal operating parameters
    » scrubbing: periodic restoration of potentially “decaying” hardware or software state
• ISTORE periodically isolates nodes from the system and performs extensive self-tests
  – nodes can be easily isolated due to ISTORE’s built-in redundancy
    » even in a deployed, running system

Self-Testing: Hardware

• The goal of hardware self-testing is to detect flaky components and preserve data integrity
• Examples:
  – fault injection: power cycle disk to check for stiction
  – stress testing: run disk controller at 100% utilization to test behavior under load
  – scrubbing: read all disk sectors and rewrite any that suffer soft errors; “fire” disk if too many errors
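The scrubbing example can be sketched as a loop over sectors with an error budget. The callbacks and the threshold below are hypothetical stand-ins for device-specific operations; the slide does not say how many errors count as "too many":

```python
def scrub(read_sector, rewrite_sector, num_sectors, max_soft_errors):
    """Read every sector, rewrite those with soft errors, and report whether
    the disk should be 'fired' (retired) for exceeding the error budget.

    read_sector(i) returns True if sector i read cleanly, False on a soft
    error. Both callbacks and the threshold are hypothetical stand-ins.
    """
    soft_errors = 0
    for sector in range(num_sectors):
        if not read_sector(sector):
            soft_errors += 1
            rewrite_sector(sector)          # restore the decaying state
    return {"soft_errors": soft_errors, "fire_disk": soft_errors > max_soft_errors}

# Simulated disk: sectors 3 and 7 suffer soft errors.
bad = {3, 7}
rewritten = []
report = scrub(lambda s: s not in bad, rewritten.append,
               num_sectors=10, max_soft_errors=5)
print(report, rewritten)
```

In a deployed system the scrub would run on a node that has been isolated for self-testing, so the extra reads and rewrites do not interfere with live requests.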

Self-Testing: Software

• Software self-testing proactively identifies weaknesses in software before they cause a visible failure
  – helps prevent failure due to bugs that only appear in certain hardware/software configurations
  – helps identify bugs that occur when software is driven into an untested state only reachable in a live system
    » e.g., long uptimes, heavy load, unexpected requests
• Examples
  – fault injection (includes HW- and SW-induced faults that the SW is expected to handle): SCSI parity error, invalid return codes from operating system
  – stress testing: heavy load, pathological requests
  – scrubbing: restart/reboot long-running software

Online Self-Analysis

• Self-maintaining systems require knowledge of their components’ dynamic runtime behavior
  – current “plug-and-play” hardware approaches are not sufficient
    » need more than just discovery of new devices’ functional capabilities and supported APIs
  – also need dynamic component characterization

Characterizing HW/SW Behavior

• An ISTORE system may contain black-box components
  – heterogeneous hardware devices
  – application-supplied reaction mechanisms whose implementations are hidden
• To select and tune adaptation algorithms, the ISTORE system needs to understand the behavior of these components
  – in the context of a complex, live system
  – examples:
    » characterize performance of disks in system, use that data to select destination disks for replica creation
    » isolate two nodes, invoke replication from one to the other, monitor actions taken by application (e.g., how long it takes, how much data is moved)
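As an illustration of the first example, measured disk performance can drive the choice of replica destinations. This sketch assumes characterization yields one throughput number per disk and that "fastest first" is the selection rule; both are illustrative assumptions, as are the disk names and values:

```python
def pick_replica_destinations(measured_mb_per_s, count, exclude=()):
    """Pick the `count` fastest disks, by measured throughput, as replica
    destinations. Measurements are assumed to come from online
    characterization; the selection rule is an illustrative choice."""
    candidates = {d: bw for d, bw in measured_mb_per_s.items() if d not in exclude}
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:count]

# Hypothetical characterization results (MB/s) for four disks.
measured = {"disk-a": 4.8, "disk-b": 5.1, "disk-c": 3.2, "disk-d": 5.0}
print(pick_replica_destinations(measured, count=2, exclude={"disk-b"}))
```

The `exclude` argument covers the common case where the disk that already holds the data must not be chosen as its own replica target.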

Support for Application Self-tuning

• ISTORE’s characterization mechanisms can also help applications tune themselves
  – current systems require manual tuning to meet scalability and performance goals
    » especially true for shared-nothing systems in which computational and storage resources aren’t pooled
  – possible research direction is to expose characterization information to application via an extension of the ISTORE API
  – this would allow “aware” applications to automatically adapt their behavior based on system conditions

ISTORE API

• The ISTORE API defines interfaces for
  – adaptation algorithms to invoke application reaction mechanisms
    » e.g., migrate data, replicate data, checkpoint, shutdown, ...
  – applications to provide hints to the runtime system so it can optimize adaptation algorithms & data storage
    » e.g., application tags data whose unavailability can be temporarily tolerated
  – runtime system to invoke application self-testing and fault injection, and for application to report results
  – runtime system to inform application about current state of system, hardware capabilities, ...
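One way to picture these interfaces is as an abstract class that the application implements plus a small hint channel back to the runtime system. The slides list the operations but give no signatures, so every class, method name, and parameter below is an assumption for illustration only:

```python
from abc import ABC, abstractmethod

class ReactionMechanisms(ABC):
    """Mechanisms an application would implement for the runtime system.
    Method names and signatures are assumptions for illustration only."""

    @abstractmethod
    def migrate_data(self, src_node, dst_node):
        """Move data from src_node to dst_node."""

    @abstractmethod
    def replicate_data(self, src_node, dst_node):
        """Create a replica of src_node's data on dst_node."""

    @abstractmethod
    def checkpoint(self, node):
        """Save enough state to restore service on another node."""

    @abstractmethod
    def shutdown(self, node):
        """Stop the service on node."""

    @abstractmethod
    def run_self_test(self, node):
        """Run application self-tests and return True if they pass."""


class RuntimeHints:
    """Hints flowing from the application to the runtime system."""

    def __init__(self):
        self.tolerates_unavailability = set()

    def tag_tolerates_unavailability(self, data_id):
        # Application marks data whose unavailability can be temporarily tolerated.
        self.tolerates_unavailability.add(data_id)


hints = RuntimeHints()
hints.tag_tolerates_unavailability("thumbnail-cache")
print(sorted(hints.tolerates_unavailability))
```

Declaring the mechanisms as abstract methods means an adaptation algorithm can be written once against `ReactionMechanisms` and reused with any application that implements it.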

Summary

• ISTORE focuses on Scalability, Availability, and Maintainability for emerging data-intensive network applications
• ISTORE provides a platform for deploying self-maintaining systems that are up 24x7
• ISTORE will achieve self-maintenance via:
  – hardware platform with integrated diagnostic support
  – reactive self-maintenance: a layered, policy-driven runtime system that provides a framework for monitoring and reaction
  – proactive self-maintenance: support for continuous on-line self-testing and component characterization
  – and a standard API for interfacing applications to the runtime system
