
SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment

Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang Hu

Bing Bing Zhou

Cluster and Grid Computing Lab, Services Computing Technology and System Lab, Huazhong University of Science and Technology

Centre for Distributed and High Performance Computing Services

School of Information Technologies, University of Sydney

Introduction

Many applications need high availability
  Server downtime is very costly (1hr = $84,000~$108,000)
But numerous security vulnerabilities remain
  Fixing all bugs during testing is impossible
Virtualization technology brings new challenges
  There are more application instances on a single machine
  How to guarantee high availability?

Current Approaches & Limitations

Rx
  Change the execution environment
STEM
  Emulate the faulty function, and potentially others within a larger scope, to return error values
Failure-oblivious computing
  Manufacture values for out-of-bounds reads
  Discard out-of-bounds writes
Micro-reboot
  Software components are fail-stop and individually recoverable

Limitations
  Deterministic bugs are still there
  Require program redesign
  Narrow suitability: only a small number of applications or memory bugs
  ……

ASSURE better addresses these problems [ASPLOS’09]
SHelp can be considered an extension of ASSURE to a virtualized computing environment
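The failure-oblivious strategy listed above can be illustrated with a minimal sketch. The helper names fo_read and fo_write are hypothetical, and returning 0 as the manufactured value is an arbitrary assumption; real failure-oblivious systems insert such checks through the compiler rather than by hand.

```c
#include <stddef.h>

/* Hypothetical helper: bounds-checked read that manufactures a value
 * (0 here, an arbitrary choice) instead of faulting on a bad index. */
int fo_read(const int *arr, size_t len, size_t idx)
{
    if (idx >= len)
        return 0;               /* manufactured value */
    return arr[idx];
}

/* Hypothetical helper: bounds-checked write that silently discards an
 * out-of-bounds store. Returns 1 if stored, 0 if discarded. */
int fo_write(int *arr, size_t len, size_t idx, int value)
{
    if (idx >= len)
        return 0;               /* write discarded */
    arr[idx] = value;
    return 1;
}
```

The program keeps running in both cases, which is why deterministic bugs remain in place: the same invalid access recurs on every matching input.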

ASSURE Overview [ASPLOS’09]

Bypass the “faulty” functions
Rescue points
  Locations in the existing application code used to handle programmer-anticipated failures
Error virtualization
  Force a heuristic-based error return in a function
Quick recovery for future faults
  Take a checkpoint once the appropriate rescue point is called

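#include <string.h>

/* Example faulty function: rbuf overflows when buf holds 10 bytes or more. */
int bad(char *buf) {
    char rbuf[10];
    int i = 0;
    if (buf == NULL)
        return -1;              /* programmer-anticipated failure */
    while (i < strlen(buf)) {
        rbuf[i++] = *buf++;     /* no bounds check on rbuf */
    }
    return 0;
}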

[Figure: Execution Graph (input, foo(), bar(), bad(), other()) and Rescue Graph (input, foo(), bar()). Steps: walk the stack, create the rescue graph.]
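Error virtualization at a rescue point can be sketched as follows. This is only an illustration under stated assumptions: it swaps in setjmp/longjmp for the process checkpoint and rollback that ASSURE actually uses, it detects the overflow with an explicit length check instead of a fault sensor, and the names rescue_env, fault_detected and bad_checked are hypothetical.

```c
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

static jmp_buf rescue_env;          /* state saved at the rescue point */

/* Stand-in for a fault sensor: unwinds to the rescue point. */
static void fault_detected(void)
{
    longjmp(rescue_env, 1);
}

/* Variant of bad() in which the overflow is detected, not executed. */
static int bad_checked(const char *buf)
{
    char rbuf[10];
    size_t n = strlen(buf);
    if (n >= sizeof rbuf)
        fault_detected();           /* would have smashed the stack */
    memcpy(rbuf, buf, n + 1);
    return rbuf[0] == buf[0] ? 0 : -1;
}

/* Rescue point: bar() already returns -1 for an anticipated failure,
 * so the same value is forced when a fault occurs further down. */
int bar(const char *buf)
{
    if (buf == NULL)
        return -1;
    if (setjmp(rescue_env) != 0)
        return -1;                  /* forced error return after a fault */
    return bad_checked(buf);
}
```

A caller of bar() sees an oversized input as the same error it already handles for a NULL argument, so the faulty function is bypassed without new error-handling code.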

ASSURE Limitations [ASPLOS’09]

A potential problem arises when the appropriate rescue point is in the main procedure of an application

--main.c--
int main()
...
    if (!fork()) { /* this is the child process */
        while(1)
        {
            ...
            if(serveconnection(new_fd)==-1) break;
            ...

--protocol.c--
int serveconnection(int sockfd)
...
    char tempdata[8192], *ptr, *ptr2, *host_ptr1, *host_ptr2;
    char filename[255];
    ...
    while(!strstr(tempdata, "\r\n\r\n") && !strstr(tempdata, "\n\n"))
    {
        if((numbytes=recv(sockfd, tempdata+numbytes, 4096-numbytes, 0))==-1)
            return -1;
    }
    for(loop=0; loop<4096 && tempdata[loop]!='\n' && tempdata[loop]!='\r'; loop++)
        tempstring[loop] = tempdata[loop];
    ...
    ptr = strtok(tempstring, " ");
    ...
    Log("Connection from %s, request = \"GET %s\"", inet_ntoa(sa.sin_addr), ptr);
    ...
    strcat(filename, ptr);
    ...

--util.c--
void Log(char *format, ...)
...
    char temp[200], temp2[200], logfilename[255];
    ...
    va_start(ap, format); // format it all into temp
    vsprintf(temp, format, ap);
    ...

[Figure: data-flow steps (Define, Assignment, Use, Create, Copy) leading to Buffer Overflow A in memory region temp (200 bytes) and Buffer Overflow B in memory region filename (255 bytes), with candidate rescue points A and B.]

Rescue point B can survive faults

Two cases
  High overhead from frequent checkpointing
  No rescue point is appropriate

SHelp Main Idea

“Weighted” rescue points
  Assign weight values to rescue points
  When an appropriate rescue point is chosen, its associated weight value is incremented
  Once a fault is detected, the rescue point with the largest weight value is selected for testing first
Error handling information sharing among VMs
  A two-level storage hierarchy for rescue point management
    A global rescue point database in Dom0
    A rescue point cache in each DomU
  Weight values are updated between Dom0 and the DomUs to share error handling information
  The accumulated weight values in Dom0 provide a useful guideline for diagnosing serious bugs
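The weighted rescue points and the two-level hierarchy might be modelled roughly as below. The struct layout, the fixed capacity MAX_RP, and the function names are assumptions made for illustration; the slides do not describe SHelp's actual data structures or its Dom0/DomU transport.

```c
#include <stddef.h>

#define MAX_RP 8

/* Hypothetical record: one rescue point and its weight. */
struct rescue_point {
    int id;
    unsigned weight;
};

/* Hypothetical store, used for both the Dom0 database and a DomU cache. */
struct rp_store {
    struct rescue_point rp[MAX_RP];
    size_t count;
};

/* Return the index of the entry with the largest weight, or -1 if empty. */
int rp_select_heaviest(const struct rp_store *s)
{
    int best = -1;
    for (size_t i = 0; i < s->count; i++)
        if (best < 0 || s->rp[i].weight > s->rp[(size_t)best].weight)
            best = (int)i;
    return best;
}

/* A rescue point survived a fault: bump its weight in the database. */
void rp_reward(struct rp_store *db, int id)
{
    for (size_t i = 0; i < db->count; i++)
        if (db->rp[i].id == id)
            db->rp[i].weight++;
}

/* Refresh a DomU cache with the weights currently held in the database. */
void rp_refresh_cache(struct rp_store *cache, const struct rp_store *db)
{
    for (size_t i = 0; i < cache->count; i++)
        for (size_t j = 0; j < db->count; j++)
            if (cache->rp[i].id == db->rp[j].id)
                cache->rp[i].weight = db->rp[j].weight;
}
```

Because every DomU refreshes from the same database, a rescue point that proved useful in one VM moves to the front of the candidate order in the others.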

SHelp Architecture

Sensors for detecting software faults
Recovery and Test component for choosing the appropriate rescue point

[Figure: SHelp architecture. Hardware; VMM (Xen); Dom0 with Management, Report, and Rescue Point Database, reporting to programmers; each DomU with Control Unit, Rescue Point Cache, Checkpoint & Rollback, Recovery & Test, Sensors, and Application 1 ... Application n.]

SHelp Procedure

Determine candidate rescue points
Prioritize candidate rescue points and test them one by one
  Test the rescue point with the largest weight value first
Increment the corresponding weight values
Quick recovery for the same stack smashing bug

[Figure: SHelp procedure, steps ① to ⑥. Labels: Program execution, Inputs, Log, checkpoint, Fault detected, Rollback to previous checkpoints, Candidate Rescue Points, Rescue Point Cache, Rescue Point Database (Matched), Test, Survival, Select and Instrument Appropriate Rescue Point, Update Weight Value, Update, Bug-Rescue List, Send, Bug Report, Report Module, Dom0, Programmers.]
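The prioritize-and-test loop of the procedure can be sketched as follows. The candidate struct, the survive_fn callback (standing in for rollback, instrumentation and replay), and the two example callbacks are hypothetical; they exist only to make the ordering logic concrete.

```c
#include <stddef.h>

/* Hypothetical candidate entry: rescue point id, weight, tested flag. */
struct candidate {
    int id;
    unsigned weight;
    int tested;
};

/* Hypothetical callback: replays the fault with the given rescue point
 * and reports whether the application survives (nonzero) or not (0). */
typedef int (*survive_fn)(int rescue_point_id);

/* Test candidates from largest to smallest weight. On the first survivor,
 * increment its weight and return its id; return -1 if none survives. */
int recover(struct candidate *c, size_t n, survive_fn survives)
{
    for (size_t round = 0; round < n; round++) {
        struct candidate *best = NULL;
        for (size_t i = 0; i < n; i++)
            if (!c[i].tested && (best == NULL || c[i].weight > best->weight))
                best = &c[i];
        if (best == NULL)
            break;
        best->tested = 1;
        if (survives(best->id)) {
            best->weight++;         /* reward the appropriate rescue point */
            return best->id;
        }
    }
    return -1;                      /* nothing worked: report the bug */
}

/* Example callbacks for illustration only. */
int only_seven_survives(int id) { return id == 7; }
int nothing_survives(int id) { (void)id; return 0; }
```

With candidates weighted 4, 2 and 1, the loop tests the weight-4 candidate first and stops at the first survivor, so a lower-weight candidate is never replayed unnecessarily.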

Implementation Details

Updating the Rescue Point Cache
  At the application level -> LRU
  At the trace level of applications -> LFUM
    Consider the globally maximum weight value and the local hit rate for trace i
    [Equation: Flag(Trace_i), expressed in terms of k, w_max, and h(i)]
Updating Weight Values of Rescue Points
  Real-time updating for the RP database
  Periodic updating for the RP cache
Bug-Rescue List
  The stack is corrupted in a stack smashing bug
  Getting the trace requires replaying the program -> high overhead
  Record the appropriate rescue point related to the fault
  Choose it to probabilistically survive faults
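A Bug-Rescue List can be pictured as a small lookup table from a fault to the rescue point that survived it before. In the sketch below, the string signature used as the key, the capacity BRL_MAX, and all function names are assumptions; the slides do not say how SHelp identifies a recurring fault.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define BRL_MAX 16

/* Hypothetical entry: a fault signature mapped to the rescue point that
 * previously survived it. The signature format is an assumption. */
struct brl_entry {
    char signature[64];
    int rescue_point_id;
};

struct bug_rescue_list {
    struct brl_entry entry[BRL_MAX];
    size_t count;
};

/* Record (or overwrite) the rescue point for a fault.
 * Returns 0 on success, -1 if the list is full. */
int brl_record(struct bug_rescue_list *l, const char *sig, int rp_id)
{
    for (size_t i = 0; i < l->count; i++)
        if (strcmp(l->entry[i].signature, sig) == 0) {
            l->entry[i].rescue_point_id = rp_id;
            return 0;
        }
    if (l->count == BRL_MAX)
        return -1;
    struct brl_entry *e = &l->entry[l->count++];
    snprintf(e->signature, sizeof e->signature, "%s", sig);
    e->rescue_point_id = rp_id;
    return 0;
}

/* Look up the rescue point for a fault; -1 means no entry, so the
 * caller falls back to testing candidates. */
int brl_lookup(const struct bug_rescue_list *l, const char *sig)
{
    for (size_t i = 0; i < l->count; i++)
        if (strcmp(l->entry[i].signature, sig) == 0)
            return l->entry[i].rescue_point_id;
    return -1;
}
```

A hit skips the replay needed to rebuild the trace, which is the overhead the list is meant to avoid when a stack smashing bug recurs.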

Experimental Setup

Implementation
  Linux 2.6.18.8 kernel with BLCR and TCPCP checkpoint support
  Xen 3.2.0 and Dyninst 6.0
Platform
  Intel Xeon E6550, 4MB L2 cache, 1GB memory
  100Mbps Ethernet connection

Applications

Application      Version  Bug               Depth
Apache           2.0.49   Off-by-one        2
Apache           2.0.50   Heap overflow     2
Apache           2.0.59   NULL dereference  3
Light-HTTPd      0.1      Stack smashing    2
Light-HTTPd-dbz           Divide-by-zero    2
ATP-HTTPd        0.4b     Stack smashing    1
Null-HTTPd       0.5.0    Heap overflow     1
Null-HTTPd-df             Double free       3

Comparison between ASSURE and SHelp

Web server application Light-HTTPd
Select the function serveconnection as the appropriate rescue point
Throughput is only about 60KB/s in ASSURE

[Figure: throughput (MB/s) versus elapsed time (sec, 0 to 20) for SHelp and ASSURE.]

SHelp Recovery Performance

First-1: new faults occur
First-2: the same faults occur again in the local VM or in other VMs

[Figure: recovery time (s) for First-1 and First-2 across the web server applications, broken down into Test, Instrument, and Analysis.]

Benefits of the Bug-Rescue List

Subsequent: with Bug-Rescue List

[Figure: recovery time (s) for First-1, First-2, and Subsequent on ATP-HTTPd and Light-HTTPd, broken down into Test, Instrument, and Analysis.]

Checkpoint/Rollback Overhead Analysis

Lightweight checkpoint and rollback
Modified BLCR with TCPCP tool support

[Figure: checkpoint and rollback time (ms) for Null-HTTPd, ATP-HTTPd, Light-HTTPd, Apache 2.0.59, Apache 2.0.50, and Apache 2.0.49.]

Conclusions and Future Work

“Weighted” rescue points and the two-level storage hierarchy for rescue point management make the system more effective and efficient.

Future Work
  Integrate the COW mechanism into BLCR
  Evaluate the effectiveness of our system for more complex server and client applications

Thank you!

Questions?
