Bin Xin, Patrick Eugster, Xiangyu Zhang
Dept. of Computer Science, Purdue University
{xinb, peugster, xyzhang}@cs.purdue.usc

Lightweight Task Graph Inference for Distributed Applications

Jinlin Yang
Center for Software Excellence, Microsoft Corp.
[email protected]

2010 29th IEEE International Symposium on Reliable Distributed Systems



Introduction

• New challenges to reliability as applications move to the cloud
    • Distinct corporate entities manage the infrastructure and own the deployed application

• Application developers do not have access to lower-level debugging information in case of failures/faults
    • They depend on application output or app-level custom logs for diagnosis

• Goal: Describe a high-level structural view of a distributed program execution to facilitate easy “after the fact” diagnosis.


Contributions

• Define an abstraction for representing distributed executions – “Tasks”

• A lightweight approach to generate “Task Graphs” from the application event logs.

• A declarative formulation, in Prolog, of the rules to generate Task Graphs.

• Demonstrate the use of Task Graphs to help understand distributed executions, including anomaly detection.


Relevance to SmartGrid and CiC

• Extensions
    • Fault detection by real-time log processing (CEP?)
        • The patterns for CEP can be defined by the application developer
        • OR can be auto-generated using code augmentation and static code analysis.
    • On fault detection, the task graph can be used to decide “recovery” mechanisms (other than a naïve restart-process strategy)

• Shortcomings
    • Does not explicitly consider the “Data Repository”
        • Considered only as one of the ‘tasks’.
    • Not sure how it handles transactions


Definitions

Event: The execution of an operation that sends (or receives) data or a signal to (or from) a different thread/process. Events are the smallest building blocks.
Signaling Event: The operation of sending.
Acting Event: The operation of receiving.

Happens Before (a → b): A partial ordering of events, where a is the sender's signaling event and b is the acting event of the receiver that acts on that signal.

Task: Autonomous computation within a thread between two “acting” events, [Astart, Aend). A task contains exactly one acting event and zero or more signaling events.

Task Graph: A DAG whose nodes are tasks and edges represent Happens Before relations
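The Task and Task Graph definitions can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper formulates this in Prolog); the names Event, Task, split_into_tasks and task_graph_edges are hypothetical, and the sketch assumes events arrive in per-thread program order and that happens-before pairs have already been computed.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: int
    thread: str   # thread that executed the event
    kind: str     # "acting" (receive) or "signaling" (send)


@dataclass
class Task:
    thread: str
    acting: int                                    # the single acting event that opens the task
    signaling: list = field(default_factory=list)  # zero or more signaling events inside it


def split_into_tasks(events):
    """Cut each thread's event sequence into tasks [A_start, A_end)."""
    tasks = []
    current = {}  # thread -> task currently open on that thread
    for ev in events:  # assumption: events are listed in per-thread program order
        if ev.kind == "acting":
            task = Task(ev.thread, ev.event_id)
            tasks.append(task)
            current[ev.thread] = task
        elif ev.thread in current:
            current[ev.thread].signaling.append(ev.event_id)
        # signaling events seen before any acting event on a thread are skipped here
    return tasks


def task_graph_edges(tasks, hb_pairs):
    """Lift event-level pairs (signaling_id, acting_id) to edges between task indices."""
    owner = {}
    for index, task in enumerate(tasks):
        owner[task.acting] = index
        for sig in task.signaling:
            owner[sig] = index
    return sorted({(owner[s], owner[a]) for s, a in hb_pairs if s in owner and a in owner})
```

Each task is opened by its acting event and closed implicitly when the next acting event on the same thread arrives, which matches the half-open interval in the definition.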

A Request: A pair of signaling and acting events, where the signaling event is originating from outside the System.

A Reply: A pair of signaling and acting events, where the Acting event is triggered outside the System.

E2E service Graph:


Example


System Setup

• Uses HDFS as the example application on the cloud

• HDFS logs are not sufficient/standardized
    • Instrumentation is added with a tool called “AspectJ”
    • AspectJ lets the developer insert code based on specific “rules” during compilation

• Each event is logged as a 7-field tuple
    • (EventID, ProcID, threadID, SourceLocation, Type, Tag, Value)
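As a rough illustration of the 7-field tuple, the Python sketch below models one log record and parses it from text. Only the field names come from the slide; the comma-separated, parenthesized line format, the sample values, and the names LogEvent and parse_event are assumptions made for this example.

```python
from typing import NamedTuple


class LogEvent(NamedTuple):
    event_id: int
    proc_id: str
    thread_id: str
    source_location: str
    event_type: str
    tag: str
    value: str


def parse_event(line: str) -> LogEvent:
    """Parse a line such as '(17, dn1, t4, DataNode.java:120, signaling, socket, 512)'.

    The textual format is hypothetical; the slides only name the seven fields.
    """
    # Split on at most six commas so that the last field (Value) may contain commas.
    parts = [p.strip() for p in line.strip().strip("()").split(",", 6)]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")
    return LogEvent(int(parts[0]), *parts[1:])
```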


Constructing Task Graphs (Prolog formulation) - I

Events

A “Fact” to parse and store all events

An entry for hb is made only if the Rules on the right are true for events X & Y
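The actual Prolog rules appear only in a figure that is not reproduced in this transcript. The Python sketch below therefore shows just the general shape of such a rule under assumed conditions: X is a signaling event, Y is an acting event, they run on different threads, and they share the same Tag and Value. The dictionary keys and the function names hb and hb_pairs are hypothetical.

```python
def hb(x, y):
    """Return True if event x happens before event y under the assumed conditions."""
    return (
        x["type"] == "signaling"
        and y["type"] == "acting"
        and (x["proc"], x["thread"]) != (y["proc"], y["thread"])
        and x["tag"] == y["tag"]
        and x["value"] == y["value"]
    )


def hb_pairs(events):
    """Collect every (x_id, y_id) pair for which the rule holds."""
    return [(x["id"], y["id"]) for x in events for y in events if hb(x, y)]
```

Like a Prolog rule, hb is a conjunction of conditions over a pair of stored events; hb_pairs plays the role of querying for all solutions.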


Constructing Task Graphs (Prolog formulation) - II

Tasks


Issues & Solutions - I

False positives caused by common sync objects

Problem: A notion of “time” is required, but global clocks or vector clocks are expensive and complex.

Proposed Solution: Heuristic: use the order of events in the event logs.
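One way to read the log-order heuristic is sketched below in Python. The slide only says that the order of events in the log is used; pairing each acting event with the most recent preceding signaling event on the same sync object is an assumption of this sketch, as are the tuple layout and the name prune_by_log_order.

```python
def prune_by_log_order(log):
    """Pair acting events with signaling events using log order alone.

    log: list of (event_id, kind, sync_object) in the order the entries
    appear in one event log. Returns a list of (signaling_id, acting_id).
    """
    last_signal = {}  # sync object -> id of the latest signaling event seen so far
    pairs = []
    for event_id, kind, obj in log:
        if kind == "signaling":
            last_signal[obj] = event_id
        elif kind == "acting" and obj in last_signal:
            # Assumption: only the closest earlier signal on this object is the cause.
            pairs.append((last_signal[obj], event_id))
    return pairs
```

Matching on the sync object alone would pair every signaling event with every acting event on that object; using log order removes pairs in which the signal is logged after the action or is superseded by a later signal.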


Issues & Solutions - II

False positives caused by communication

Problem: Multiple writes on the same socket.

Proposed Solution: Heuristic: use the “packet size” and the total bytes received so far to decide which write to associate with which reads.
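The packet-size heuristic can be sketched as byte-range bookkeeping. The Python below is an illustration, not the authors' code: it assumes the writes and reads on one socket are known in stream order with positive byte counts, and the name match_writes_to_reads is hypothetical. A read is associated with every write whose byte range it overlaps.

```python
def match_writes_to_reads(writes, reads):
    """Associate writes with reads on one socket by cumulative byte offsets.

    writes, reads: lists of (event_id, num_bytes) in stream order.
    Returns a list of (write_id, read_id) pairs.
    """
    pairs = []
    w_idx = 0    # index of the first write not yet fully consumed
    w_start = 0  # stream offset at which writes[w_idx] begins
    r_start = 0  # total bytes received so far
    for read_id, r_len in reads:
        r_end = r_start + r_len
        # Skip writes that end at or before the first byte of this read.
        while w_idx < len(writes) and w_start + writes[w_idx][1] <= r_start:
            w_start += writes[w_idx][1]
            w_idx += 1
        # Pair the read with every write overlapping [r_start, r_end).
        i, offset = w_idx, w_start
        while i < len(writes) and offset < r_end:
            pairs.append((writes[i][0], read_id))
            offset += writes[i][1]
            i += 1
        r_start = r_end
    return pairs
```

For example, a 100-byte write followed by a 50-byte write, read back as 60, 40 and 50 bytes, associates the first two reads with the first write and the third read with the second write.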


Issues & Solutions - III

False negatives caused by guarded waits

Problem: Multiple waiting threads are notified, and the lock condition is updated before the current thread executes. Hence a condition check is required after waking up.

Proposed Solution: Manually update such cases: remove the augmented code within the loop and add a marker just after the loop.


Evaluation - I

Performance Impact

Runtime:
• 22.2% increase in binary size
• 38% increase in execution time

TaskGraph building using Prolog:


Evaluation – II (Demo)

To help a new HDFS developer analyze HDFS execution
