SHelp: Automatic Self-healing for Multiple Application Instances in a Virtual Machine Environment
Gang Chen, Hai Jin, Deqing Zou, Weizhong Qiang, Gang Hu
Bing Bing Zhou
Cluster and Grid Computing Lab, Services Computing Technology and System Lab, Huazhong University of Science and Technology
Centre for Distributed and High Performance Computing Services
School of Information Technologies, University of Sydney
Introduction
Many applications need high availability
  Server downtime is very costly (1 hr = $84,000~$108,000)
But there are still numerous security vulnerabilities
  Fixing all bugs in testing is impossible
Virtualization technology brings new challenges
  There are more application instances on a single machine
  How to guarantee high availability?
Current Approaches & Limitations
Rx
  Change execution environment
STEM
  Emulate function and potentially others within a larger scope to return error values
Failure-oblivious computing
  Manufacture values for “out of the bounds read”
  Discard “out of the bounds write”
Micro-reboot
  Software components are fail-stop and individually recoverable
Limitations
  Deterministic bugs are still there
  Require program redesign
  A narrow suitability for only a small number of applications or memory bugs
  ……
ASSURE addresses these problems better [ASPLOS’09]
SHelp can be considered as an extension of ASSURE to a virtualized computing environment
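As an illustration of the failure-oblivious idea listed above, here is a minimal C sketch. The `fo_buf` type and the `fo_read`/`fo_write` helpers are hypothetical names invented for this example, and the manufactured value (0) is an assumption; real failure-oblivious systems insert such checks through the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bounds-aware buffer; not taken from any of the cited systems. */
typedef struct {
    char  *data;
    size_t len;
} fo_buf;

/* Out-of-bounds read: manufacture a value (0 here, by assumption)
 * instead of faulting. */
char fo_read(const fo_buf *b, size_t i)
{
    assert(b != NULL && b->data != NULL);
    return (i < b->len) ? b->data[i] : 0;
}

/* Out-of-bounds write: discard it.
 * Returns 1 if the byte was stored, 0 if it was discarded. */
int fo_write(fo_buf *b, size_t i, char c)
{
    assert(b != NULL && b->data != NULL);
    if (i >= b->len)
        return 0;
    b->data[i] = c;
    return 1;
}
```

The program keeps running after an invalid access, which is why deterministic bugs remain in place under this approach.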
ASSURE Overview
Bypass the “faulty” functions
Rescue points
  Locations in the existing application code used to handle programmer-anticipated failures
Error virtualization
  Force a heuristic-based error return in a function
Quick recovery for future faults
  Take a checkpoint once the appropriate rescue point is called
ASPLOS’09
int bad(char* buf)
{
    char rbuf[10];
    int i = 0;
    if(buf == NULL)
        return -1;
    while(i < strlen(buf)) {
        rbuf[i++] = *buf++;   /* no bounds check: a long enough input overflows rbuf */
    }
    return 0;
}
[Figure: the stack is walked to create the rescue graph. Execution Graph nodes: input, foo(), bar(), bad(). Rescue Graph nodes: input, foo(), bar(), other().]
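Error virtualization at a rescue point can be sketched in plain C. This is only an analogy: `setjmp`/`longjmp` stand in for the checkpoint and rollback, and the explicit bounds check stands in for a fault sensor. The names `bad_guarded` and `rescue_ctx` are hypothetical, and the copy loop is simplified from the `bad()` example.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Stands in for the checkpoint taken when the rescue point is entered. */
static jmp_buf rescue_ctx;

/* Modeled on bad() from the slide. A hypothetical sensor detects the
 * overflow and triggers a rollback instead of corrupting the stack. */
int bad_guarded(const char *buf)
{
    char rbuf[10];
    size_t i = 0;

    if (buf == NULL)
        return -1;
    while (*buf != '\0') {
        if (i >= sizeof rbuf)
            longjmp(rescue_ctx, 1);   /* fault detected: roll back */
        assert(i < sizeof rbuf);      /* the sensor keeps the write in bounds */
        rbuf[i++] = *buf++;
    }
    (void)rbuf;                       /* the copy is not used further in this sketch */
    return 0;
}

/* Rescue point: callers of bar() already handle a -1 return. */
int bar(const char *input)
{
    if (setjmp(rescue_ctx) != 0)
        return -1;                    /* error virtualization: forced error return */
    return bad_guarded(input);
}
```

Because callers of `bar()` already check for -1, the forced return reuses error handling the programmer wrote, which is the point of choosing a rescue point.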
ASSURE Limitations
A potential problem when the appropriate rescue point is in the main procedure of an application
--main.c--
int main()
...
    if (!fork()) { /* this is the child process */
        while(1)
        {
...
            if(serveconnection(new_fd)==-1) break;
...
--protocol.c--
int serveconnection(int sockfd)
...
    char tempdata[8192], *ptr, *ptr2, *host_ptr1, *host_ptr2;
    char filename[255];
...
    while(!strstr(tempdata, "\r\n\r\n") && !strstr(tempdata, "\n\n"))
    {
        if((numbytes=recv(sockfd, tempdata+numbytes, 4096-numbytes, 0))==-1)
            return -1;
    }
    for(loop=0; loop<4096 && tempdata[loop]!='\n' && tempdata[loop]!='\r'; loop++)
        tempstring[loop] = tempdata[loop];
...
    ptr = strtok(tempstring, " ");
...
    Log("Connection from %s, request = \"GET %s\"", inet_ntoa(sa.sin_addr), ptr);
...
    strcat(filename, ptr);
...
--util.c--
void Log(char *format, ...)
...
    char temp[200], temp2[200], logfilename[255];
...
    va_start(ap, format); // format it all into temp
    vsprintf(temp, format, ap);
...
[Figure annotations on the code above]
Buffer Overflow A
  Memory region: name temp, size 200 bytes
  Steps: 1) Define, 2) Assignment, 3) Use, 4) Create, 5) Copy
Buffer Overflow B
  Memory region: name filename, size 255 bytes
  Steps: 1. Define, 2. Create, 3. Assignment, 4. Copy
Candidate rescue point A
Candidate rescue point B
Rescue point B can survive faults
Two cases
  High overhead from frequent checkpointing
  No rescue point is appropriate
SHelp Main Idea
“Weighted” rescue point
  Assign weight values to rescue points
  When an appropriate rescue point is chosen, its associated weight value is incremented
  Once a fault is detected, first test the rescue point with the largest weight value
Error handling information sharing in VMs
  A two-level storage hierarchy for rescue point management
    a global rescue point database in Dom0
    a rescue point cache in each DomU
  Weight values are updated between Dom0 and DomUs for error handling information sharing
  The accumulative effect of added weight values in Dom0 provides a useful guideline for diagnosis of serious bugs
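The two-level hierarchy can be pictured with a small C sketch. All names (`rp_database`, `rp_cache`, `MAX_RP`) and the fixed-size tables are assumptions made for illustration; in the real system the database lives in Dom0 and the caches in separate DomUs, so the copy below would be an inter-domain transfer.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RP 8   /* assumed small fixed table size, for illustration only */

typedef struct { unsigned weight[MAX_RP]; } rp_database;  /* global, in Dom0 */
typedef struct { unsigned weight[MAX_RP]; } rp_cache;     /* local, one per DomU */

/* A DomU found rescue point `id` appropriate:
 * the global weight is incremented right away. */
void rp_report_success(rp_database *db, size_t id)
{
    assert(db != NULL && id < MAX_RP);
    db->weight[id]++;
}

/* Periodic refresh: a DomU cache pulls the accumulated global weights. */
void rp_cache_refresh(rp_cache *cache, const rp_database *db)
{
    assert(cache != NULL && db != NULL);
    for (size_t i = 0; i < MAX_RP; i++)
        cache->weight[i] = db->weight[i];
}
```

A VM that has never seen a given fault still inherits the weights earned in other VMs once its cache is refreshed, which is how error handling information is shared.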
SHelp Architecture
Sensors for detecting software faults
Recovery and Test component for choosing the appropriate rescue point
[Figure: SHelp architecture on Hardware and VMM (Xen). Dom0: Management, Report, Rescue Point Database, with reports going to Programmers. Each DomU: Control Unit, Rescue Point Cache, Checkpoint & Rollback, Recovery & Test, Sensors, Application 1 ... Application n.]
SHelp Procedure
Determine candidate rescue points
Prioritize candidate rescue points and test them one by one
  First test the rescue point with the largest weight value
Increment the corresponding weight values
Quick recovery for the same stack smashing bug
[Figure: SHelp procedure flowchart with steps ① to ⑥. Labels: Inputs, Program execution, Log, checkpoint, Fault detected, Rollback to previous checkpoints, Candidate Rescue Points, Rescue Point Cache, Rescue Point Database (Matched), Test, Survival, Select and Instrument Appropriate Rescue Point, Update, Update Weight Value, Bug-Rescue List, Send, Report Module, Bug Report, Programmers, Dom0.]
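The prioritize-and-test part of the procedure might look like the following C sketch. The `survives_fn` hook is hypothetical: it stands for rolling back to a checkpoint, instrumenting the candidate, and replaying the logged inputs. The two sample hooks exist only to make the example runnable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *function;  /* function acting as the rescue point */
    unsigned    weight;    /* how often it has been the appropriate one */
} rescue_point;

/* Hypothetical test hook: nonzero if the application survives the fault
 * when an error return is forced at this rescue point. */
typedef int (*survives_fn)(const rescue_point *rp);

/* qsort comparator: larger weights first. */
static int by_weight_desc(const void *a, const void *b)
{
    const rescue_point *x = a, *y = b;
    return (x->weight < y->weight) - (x->weight > y->weight);
}

/* Tests candidates from heaviest to lightest. Returns the first one that
 * survives, after incrementing its weight, or NULL if none does
 * (the case in which a bug report would be sent). */
rescue_point *shelp_recover(rescue_point *cands, size_t n, survives_fn survives)
{
    assert(survives != NULL);
    qsort(cands, n, sizeof *cands, by_weight_desc);
    for (size_t i = 0; i < n; i++) {
        if (survives(&cands[i])) {
            cands[i].weight++;
            return &cands[i];
        }
    }
    return NULL;
}

/* Sample hooks for the example only. */
int only_serveconnection(const rescue_point *rp)
{
    return strcmp(rp->function, "serveconnection") == 0;
}

int never_survives(const rescue_point *rp)
{
    (void)rp;
    return 0;
}
```

Since the weight grows each time a rescue point proves appropriate, a recurring fault tends to be handled on the first test.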
Implementation Details
Updating the Rescue Point Cache
  At the application level -> LRU
  At the trace level of applications -> LFUM
    Consider the globally maximum weight value and the local hit rate for trace i
    $Flag(Trace_i) = k \cdot w_{\max} + h(i)$
Updating Weight Values of Rescue Points
  Real-time updating for the RP database
  Periodical updating for the RP cache
Bug-Rescue List
  The stack is corrupted in a stack smashing bug
  Getting the trace requires replaying the program -> high overhead
  Record the appropriate rescue point related to the fault
  Choose it to probabilistically survive faults
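A trace-level replacement decision could be sketched as follows. This assumes the flag combines the globally maximum weight and the local hit rate linearly, with a tunable coefficient k, and that the trace with the smallest flag is the one evicted. Both points are assumptions, as are the names `trace_entry`, `trace_flag` and `lfum_victim`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double w_max;     /* globally maximum weight among the trace's rescue points */
    double hit_rate;  /* local hit rate h(i) of the trace in this cache */
} trace_entry;

/* Assumed linear form of the flag; k is a tunable coefficient. */
double trace_flag(const trace_entry *t, double k)
{
    assert(t != NULL);
    return k * t->w_max + t->hit_rate;
}

/* Index of the trace to evict: the one with the smallest flag (assumption).
 * Ties go to the earliest entry. Requires n > 0. */
size_t lfum_victim(const trace_entry *traces, size_t n, double k)
{
    assert(traces != NULL && n > 0);
    size_t victim = 0;
    for (size_t i = 1; i < n; i++)
        if (trace_flag(&traces[i], k) < trace_flag(&traces[victim], k))
            victim = i;
    return victim;
}
```

With k = 0 the policy reduces to evicting the least frequently hit trace; a larger k protects traces whose rescue points carry high global weights even when they are rarely hit locally.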
Experimental Setup
Implementation
  Linux 2.6.18.8 kernel with BLCR and TCPCP checkpoint support
  Xen 3.2.0 and Dyninst 6.0
Platform
  Intel Xeon E6550, 4MB L2 cache, 1GB memory
  100Mbps Ethernet connection
Applications
  Application      Version  Bug               Depth
  Apache           2.0.49   Off-by-one        2
  Apache           2.0.50   Heap overflow     2
  Apache           2.0.59   NULL dereference  3
  Light-HTTPd      0.1      Stack smashing    2
  Light-HTTPd-dbz  0.1      Divide-by-zero    2
  ATP-HTTPd        0.4b     Stack smashing    1
  Null-HTTPd       0.5.0    Heap overflow     1
  Null-HTTPd-df    0.5.0    Double free       3
Comparison between ASSURE and SHelp
Web server application Light-HTTPd
Select the function serveconnection as the appropriate rescue point
Throughput is only about 60KB/s in ASSURE
[Figure: throughput (MB/s) versus elapsed time (sec, 0 to 20), one panel for SHelp and one for ASSURE.]
SHelp Recovery Performance
First-1: new faults occur
First-2: same faults occur again in the local VM or in other VMs
[Figure: recovery time (s, 0 to 24) per web server application for First-1 and First-2, broken down into Test, Instrument and Analysis.]
Benefits of the Bug-Rescue List
Subsequent: with the Bug-Rescue List
[Figure: recovery time (s, 0 to 24) for ATP-HTTPd and Light-HTTPd under First-1, First-2 and Subsequent, broken down into Test, Instrument and Analysis.]
Checkpoint/Rollback Overhead Analysis
Lightweight checkpoint and roll-back
  Modified BLCR with TCPCP tool support
[Figure: checkpoint and rollback time (ms, 0 to 60) per web server application: Null-HTTPd, ATP-HTTPd, Light-HTTPd, Apache 2.0.59, Apache 2.0.50, Apache 2.0.49.]
Conclusions and Future Work
“Weighted” rescue points and two-level storage hierarchy for rescue point management make the system perform more effectively and efficiently.
Future Work
  Integrate the COW mechanism in BLCR
  Evaluate the effectiveness of our system for more complex server and client applications