VERNIER
Virtualized Execution Realizing Network Infrastructures Enhancing Reliability
VERNIER Project Team
DARPA Application Communities Kickoff Meeting
July 7, 2006
Application Communities Kickoff, July 7, 2006
Outline
• Background
• Project Overview
  – Objectives
  – Project Scope
  – Research Challenges
  – Breakthrough Capabilities
  – Expected Results
• Team: Key Personnel and Roles
• Technical Approach
• Scenario Exemplars
• Project Plan: Schedule and Milestones
• Experimentation and Evaluation
• Technology Transition Plan
Background
• Commercial-off-the-shelf (COTS) software
  – Large organizations, including DoD, have become dependent on it
  – Yet, most COTS software is not dependable enough for critical applications
    • Security breaches
    • Misconfiguration
    • Bugs
• Large, homogeneous COTS deployments, such as those in DoD, accentuate the risk, since many users
  – Experience the same failures caused by the same vulnerabilities, configuration errors, and bugs
  – Suffer the same costly, adverse consequences
• Alternatives, such as government-funded development of high-assurance systems present significant barriers in
  – Cost
  – Functionality
  – Performance
VERNIER Project Objectives
• Develop new technologies to deliver the benefits of scaling techniques to large application communities
  – Provide enhanced survivability to the DoD computing infrastructure
  – Enhance the cost, functionality, and performance advantages of COTS computing environments
  – Investigate and develop new technologies aimed at enabling communities of systems running similar, widely available COTS software to perform more robustly in the face of attacks and software faults
• Deliver a demonstrated, functioning, transition-ready system that implements these new AC survivability technologies
  – Technical approach: Augmented virtual machine monitor
  – Commercial transition partner: VMware, Inc.
Project Scope
• Collaborative detection and diagnosis of failures
• Collaborative response to failures
• Advanced situational awareness capabilities
  – Collective understanding of community state
  – Predictive capability: Early warning of potential future problems
• Key goal: turn the size and homogeneity of the user community into an advantage by converting scattered deployments of vulnerable COTS systems into cohesive, survivable application communities that detect, diagnose, and recover from their own failures
• What COTS?
  – Microsoft Windows, IE, Office suite, and the like
Research Challenges
• Extracting behavioral models from binary programs
  – Breakthrough novel techniques required
  – Quasi-static state analysis for black-box binaries
• Scaled information sharing
  – Networked application communities sharing knowledge about the software they run
• Intelligent, comprehensive recovery
• Predictive situational awareness
  – Automatic, easy-to-understand gauges
Breakthrough Capabilities
Expected Results and Impact
• COTS Product (VMware) with breakthrough capabilities for application communities
• Scalability to 100K nodes running augmented VMware and custom Vernier software
• Automatic collaborative failure diagnosis and recovery
• Survivable robust system
• Community-aware solution
VERNIER Team
• SRI International, Menlo Park, CA
  – Patrick Lincoln, Principal Investigator
  – Steve Dawson, Project manager; integration
  – Linda Briesemeister, Knowledge sharing; collaborative response
  – Hassen Saidi, Learning-based diagnosis; code analysis; situation awareness
• Stanford University
  – John Mitchell, Stanford PI; code analysis; host-based detection and response
  – Dan Boneh, Knowledge sharing protocols
  – Mendel Rosenblum, VMM infrastructure; collaborative response; transition liaison
• Palo Alto Research Center (PARC)
  – Jim Thornton, PARC PI; configuration monitoring and response; situation awareness
  – Dirk Balfanz, Community response management
  – Glenn Durfee, Configuration monitoring and response; situation awareness
• Technology transition partner: VMware, Inc.
John Boyd’s OODA Loop
Note how orientation shapes observation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
[Figure: Boyd's OODA loop, with the stages Observe, Orient, Decide, Act. Labels shown: Outside Information, Unfolding Circumstances, Unfolding Interaction With Environment, Observations, Feed Forward, Cultural Traditions, Genetic Heritage, New Information, Previous Experience, Analyses & Synthesis, Decision (Hypothesis), Action (Test), Implicit Guidance & Control, Feedback.]
Defense and the National Interest, http://www.d-n-i.net, 2001
VERNIER Technical Approach
Notional Host System Architecture
An Abstraction-Based Diagnosis Capability for VERNIER
Hassen Saidi, SRI
Objectives
Based on the general principle: “much of security amounts to making sure that an application does what it is supposed to do … and nothing else!”
• Build models of application behavior (what the application is supposed to do).
• Monitor application behavior and report malfunctions and unintended behaviors (deviations from the modeled behavior).
• Use the recorded execution traces as raw data for a set of abstraction-based diagnosis engines (why did the deviation from good, intended behavior occur, to the extent that we can answer that question well).
• Share the state of alerts and diagnoses among the nodes of the community (sharing the bad news, but also the good news!).
• Aggregate the diagnosis outputs and the alerts into a situation awareness gauge.
[Figure: Notional host system architecture. A VM kernel with a dynamic VMM hosts the COTS application VM (App OS, App 1, App 2, ...) and the VERNIER OS base. VERNIER components shown: Monitoring and Control (app & OS execution, configuration, network traffic), Quasi-Static Code Analysis (app binaries), Configuration Analysis, Network Traffic Analysis (runtime data), Learning-Based Diagnosis, Collaborative Response, Situation Awareness Gauge & UI, and the Secure Knowledge Sharing Network. A vertical axis labeled "Increased application community survivability" orders the capability levels: safe execution; local diagnosis, local response; collaborative diagnosis, collaborative response; global situation awareness.]
Approach
We combine a set of well-known and well-established techniques:
• Build increasingly accurate models of application behavior:
  – Static analysis combined with predicate abstraction to build Dyck and CFG models used for static analysis-based intrusion detection
• Implement mechanisms for monitoring sequences of states and actions of an application for the following purposes:
  – Check if a known bad sequence is executed (signature-based!)
  – Check for previously unknown variations of known bad sequences (correlation!)
  – Find root causes for unexpected malfunctions and malicious exploits (diagnosis)
• Diagnosis is performed using techniques borrowed from
  – Delta debugging (root-cause diagnosis)
  – Anomaly detection (correlation)
• The situation awareness gauge is implemented as a platform-independent web interface
Monitoring-Based Diagnosis
• We combine these techniques into two phases:
  – Monitoring: Applications are monitored, and sequences of executions along with configurations are stored.
  – Diagnosis: Differences between good runs and bad runs are the first clues used for diagnosis
• Traces of executions are sequences of:
  – System calls
  – Method calls
  – Changes in configurations
• The more information is stored, the better the chance that malfunctions and malicious behaviors are properly diagnosed.
Quasi-static binary analysis and predicate abstraction-based intrusion detection
• Use static analysis to recover the control flow graph (CFG) of the application.
  – CFG generated by compilers for source code.
  – Recover class hierarchy for object code of OO applications.
• Build a pushdown system, a model that represents an over-approximation of the sequences of method and system calls of the application.
  – Deal with context sensitivity to match exit calls to return locations.
• Use predicate abstraction and data flow analysis to refine the pushdown system and obtain a more accurate model.
  – Improve the knowledge about arguments to monitored calls.
Better Models and Better Monitoring
We are interested not just in detecting intrusions, but also in generating high-level explanations of why an application deviates from its intended behavior.
• CFG and Dyck models are all over-approximations of the application’s behavior (potential attacks are only discovered when the application’s behavior deviates from the model).
• We will use the runs of the application to generate under-approximations of the application’s behavior!
• Alternatively, every model representing an over-approximation has a dual that represents an under-approximation (over- and under-approximations don’t have to be the same type of model!).
• We will combine over- and under-approximations to reduce the risk of missing possible attacks.
• We will refine the over- and under-approximations to improve the application model.
Combining over and under approximations
• Over-approximation (constructed by static analysis)
• Under-approximation (constructed from runs)
• Behavior within the under-approximation is safe
• Behavior outside the over-approximation is unsafe
• Behavior in between is suspicious and is a source of diagnosis
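The three regions described on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the VERNIER implementation: both approximations are represented here as plain sets of call sequences (the slides describe the over-approximation as a CFG or pushdown model), and the names `classify`, `under`, and `over` as well as the example traces are hypothetical.

```python
from typing import FrozenSet, Tuple

Trace = Tuple[str, ...]

def classify(trace: Trace, under: FrozenSet[Trace], over: FrozenSet[Trace]) -> str:
    """Classify a trace against an under- and an over-approximation.

    Assumes `under` is a subset of `over`: every observed good run is
    also permitted by the statically derived model.
    """
    if trace in under:
        return "safe"        # matches behavior observed in good runs
    if trace not in over:
        return "unsafe"      # impossible according to the static model
    return "suspicious"      # statically possible, never observed: diagnose

# Hypothetical data: call sequences of a tiny application.
over = frozenset({
    ("open", "read", "close"),
    ("open", "write", "close"),
    ("open", "close"),
})
under = frozenset({("open", "read", "close")})

for t in [("open", "read", "close"),
          ("open", "write", "close"),
          ("open", "exec", "close")]:
    print(t, "->", classify(t, under, over))
```

Refining either approximation shrinks the "suspicious" band, which is the region the diagnosis engines would have to examine.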
What if we don’t have a model of the application?
• We can monitor the application as a black box and intercept system calls:
  – Learn a model of good behaviors
  – Learn a model of bad behaviors
• Anomalies are differences between good and bad behaviors
• Borrow from delta-debugging techniques to find root causes of misbehaviors
Analyzing Differences between runs
• There are many differences between execution traces:
  – Could consider arbitrary lengths of different sub-sequences
  – Differences of length k should be considered, where k is defined depending on the application, the size of the collected data, and the sensitivity of the analysis
[Figure: two runs over the events a, b, c, and d]
Delta Differences k=2
[Figure: the two runs over the events a, b, c, and d]
good run: a b, b b, b c, c b, b d
bad run: a b, b b, b c, c b, b d
Both sequences have the same set of 2-event sequences. This means that k needs to be increased and that k=2 is too abstract a way of distinguishing the two sequences.
Delta Differences k=3
[Figure: the two runs over the events a, b, c, and d]
good run: a b b, b b c, b c b, c b c, c b d
bad run: a b c, c b b, b c b, c b c, b b d
Sequences in red are those that appear only in the failing sequence. Sequences in blue appear only in the successful sequence.
Only in the bad run (red): a b c, c b b, b b d
Common to both runs: b c b, c b c
Only in the good run (blue): a b b, b b c, c b d
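The k=2 and k=3 comparisons can be reproduced with a short Python sketch. The two concrete runs below are an assumption: they were chosen because their sets of k-event sequences match the ones listed on the slides, not because the slides give them in this form. The helper names `kgrams` and `delta` are hypothetical.

```python
from typing import List, Set, Tuple

KGram = Tuple[str, ...]

def kgrams(run: List[str], k: int) -> Set[KGram]:
    """Return the set of all contiguous length-k subsequences of a run."""
    return {tuple(run[i:i + k]) for i in range(len(run) - k + 1)}

def delta(good: List[str], bad: List[str], k: int):
    """Split the k-grams into (only in good, common, only in bad)."""
    g, b = kgrams(good, k), kgrams(bad, k)
    return g - b, g & b, b - g

# Assumed runs, consistent with the k-gram sets shown on the slides.
good_run = "a b b c b c b d".split()
bad_run = "a b c b c b b d".split()

for k in (2, 3):
    only_good, common, only_bad = delta(good_run, bad_run, k)
    print(f"k={k}: only good={sorted(only_good)}, only bad={sorted(only_bad)}")
```

With k=2 both difference sets are empty, so the runs are indistinguishable; with k=3 each run contributes three sequences that the other lacks.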
Diagnosis
• One of the 6 sequences that are not common to the two runs is the source of the problem: which one? We can rank the sequences in order of importance based on:
  – Application-specific criteria: use distance to common sequences for every application-specific origin of a sequence (e.g., process identity or user identity)
  – Application-independent criteria: use distance to common sequences
  – Use distance to common sequences or known bad sequences by ignoring order of execution of calls
• Increasing k provides a better explanation, but generates a large number of sequences.
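One way to read "distance to common sequences" is sketched below in Python. The slides do not define the distance or the direction of the ranking, so both are assumptions here: the sketch uses Hamming distance for the order-sensitive case, a multiset difference for the order-insensitive case, and ranks sequences that are farther from every common sequence first. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Seq = Tuple[str, ...]

def hamming(x: Seq, y: Seq) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def unordered(x: Seq, y: Seq) -> int:
    """Order-insensitive distance: events of x with no counterpart in y."""
    return sum((Counter(x) - Counter(y)).values())

def rank(suspects: Iterable[Seq], common: Iterable[Seq],
         dist: Callable[[Seq, Seq], int] = hamming) -> List[Tuple[int, Seq]]:
    """Sort suspects by decreasing distance to their nearest common sequence."""
    common = list(common)
    scored = [(min(dist(s, c) for c in common), s) for s in suspects]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# The k=3 sequences that appear only in the bad run, and the common ones.
only_bad = [("a", "b", "c"), ("c", "b", "b"), ("b", "b", "d")]
common = [("b", "c", "b"), ("c", "b", "c")]

print(rank(only_bad, common))             # order-sensitive ranking
print(rank(only_bad, common, unordered))  # ignoring the order of calls
```

An application-specific variant would apply the same ranking separately per origin, for example per process or per user.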
More abstraction
• There are more good runs than bad ones! We need to compare the bad runs to the union of the good runs: a union of good runs that each contribute a single sequence can cancel out the one bad run that contains all of those sequences.
• Use average-sequence-weight ranking
Situation Awareness Gauge
• Implemented as a platform-independent web interface (e.g., Ruby on Rails)
  – Content is defined by the database’s content: attacks, failures, diagnoses, etc.
  – Gauges are simple displays of the number of attacks and failures and of various parameters
  – Provides a user with the possibility of initiating responses and diagnosis activities in other nodes via the database
Configuration-based Detection, Diagnosis, Recovery, and Situational Awareness
Jim Thornton, PARC
Importance of Configuration
• Static configuration state highly correlated with system behavior
  – Many attacks/bugs/errors introduced by way of a substantive change to configuration
“A central problem in system administration is the construction of a secure and scalable scheme for maintaining configuration integrity of a computer system over the short term, while allowing configuration to evolve gradually over the long term” (Mark Burgess, author of cfengine)
AC Opportunity
• Leverage scale of population to learn what are bad states in configuration space
[Figure: adaptability plotted against reliability, with a region marked "Want to be here"]
• Today: Every configuration change is an uncontrolled experiment
• AC Future: Configuration changes managed as controlled, reversible trials
Live Monitoring of Configuration State
1. State analysis
  • Comparative diagnosis
  • Vulnerability assessment
  • Clustering similar nodes and contextualizing observations
2. Detect change events
  • Cluster low-level changes into transactions
  • Log events for problem detection, mitigation and user interaction
  • Share events in real-time for situational awareness
3. Active learning
  • Automated experiments to isolate root causes
  • Managed testing of official changes like patch installation
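As an illustration of clustering low-level changes into transactions, the Python sketch below groups change events that occur close together in time. The time-gap heuristic, the two-second threshold, the `ChangeEvent` fields, and the example registry keys and paths are all assumptions made for this example; the slides do not say how VERNIER forms transactions.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ChangeEvent:
    timestamp: float  # seconds since an arbitrary start
    kind: str         # e.g. "registry" or "filesystem"
    target: str       # key or path that changed

def cluster_into_transactions(events: List[ChangeEvent],
                              max_gap: float = 2.0) -> List[List[ChangeEvent]]:
    """Group events so that consecutive events in a group are at most
    max_gap seconds apart (a simple time-window heuristic)."""
    transactions: List[List[ChangeEvent]] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if transactions and ev.timestamp - transactions[-1][-1].timestamp <= max_gap:
            transactions[-1].append(ev)   # continues the current burst
        else:
            transactions.append([ev])     # starts a new transaction
    return transactions

# Hypothetical event stream: an install burst, then an unrelated change.
events = [
    ChangeEvent(0.0, "filesystem", r"C:\Program Files\Example\app.exe"),
    ChangeEvent(0.5, "registry", r"HKLM\Software\Example\Version"),
    ChangeEvent(1.2, "registry", r"HKLM\Software\Example\Path"),
    ChangeEvent(60.0, "registry", r"HKCU\Software\Other\Setting"),
]

for tx in cluster_into_transactions(events):
    print(len(tx), [e.target for e in tx])
```

Each resulting transaction can then be logged, shared, or reversed as a unit rather than as many unrelated low-level edits.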
Live Control of Configuration State
• Modification for Reversibility and Experimentation
  – Coarse-grained: VM rollback
  – Medium-grained: Installer/Uninstaller activation
  – Fine-grained: Direct manipulation of low-level state elements
• Prevention
  – In-progress detection of changes
  – Interruption of change sequence
  – Reversal of partial effects
Identifying Badness
• Objective Deterministic Criteria
  – Rootkit detection from structural features
  – Published attack signatures
• Objective Heuristic Criteria
  – Performance outside of normal parameters
• Subjective End-User Report
  – Dialog with user to gather info, e.g. temporal data for failure appearance
• Administrative Policy
  – Rules specified by administrators within community
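Local Components
[Figure: Local components. On the VMM (VM Kernel) run an App VM (App OS, App 1, App 2, Agent), an Experimental VM (App OS, App 1, App 2, Agent), and the VERNIER VM (VERNIER OS Base, VERNIER Monitor/Control with Console (UI), Comm, and Diag). The application VMs are labeled COTS. The Comm component connects to the Community. Interfaces are numbered 1, 2, and 3.]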
Key Interfaces
1. VERNIER-Agent (TCP/IP, XML?)
  • Registry change events
  • Filesystem change events
  • Install events
  • Manipulate registry
  • Manipulate filesystem
  • Control System Restore
2. VERNIER-VMM (?)
  • Suspend
  • Resume
  • Checkpoint
  • Revert
  • Clone
  • Reset
  • Lock memory
  • Process events
  • Read memory
  • Read/write disk
3. VERNIER-Community (?)
  • Cluster management
  • Experience reports
    – Unknown
    – Prevalent
    – Known Bad
    – Presumed Good
  • State exchange
  • Experiment request/response
Local Functions
[Figure: Local functions. Components shown: Config Change Detector, Network Event Detector, Behavior Event Detector, Network Tap, Agent, Inside VMM, Event Stream, Analysis & Diagnosis (Configuration Analysis, Behavior Analysis, Traffic Analysis), Response Controller, Firewall, Communication Manager, Community, Console, and Local DB.]
• Local DB
  – Local condition detail
  – Event logs
  – Labeled condition signatures
  – State snapshots
  – Experimental data
Adapting and Extending Host-based, Run-time Win32 Bot Detection for VERNIER
Liz Stinson, Stanford
Overview
• Background on Stanford’s botnet research
• Plans for adapting and extending this work for application to VERNIER
• Network-based approaches:
  – Filtering (protocol, port, host, content-based)
  – Look for traffic patterns (e.g. DynDNS, Dagon)
  – Hard (encrypt traffic, permute to look like ‘normal’ traffic, …); botwriters control the arena.
• Host-based approaches:
  – Ours: Have more info at host level. Since the bot is controlled externally, use this meta-level behavioral signature as basis of detection
• Exploit botnet characteristic: ongoing command and control
Our approach
• Look at the syscalls made by a program
  – In particular at certain of their args: our sinks
• Possible sources for these sinks:
  – local: { mouse, keyboard, file I/O, … }
  – remote: { network I/O }
• An instance of external control occurs when data from a remote source reaches a sink
• Works surprisingly well: for all bots tested (ago, dsnx, evil, g-sys, sd, spy), every command that exhibited external control was detected
Big picture
Design
Two modes
• Cause-and-effect semantics:
  – Tight relationship between receipt of some data over network and subsequent use of some portion of that data in a sink
• Correlative semantics: looser relationship
  – Use of some data that is the same as some data received over the network
  – Why necessary?
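The correlative mode can be illustrated with a small Python sketch: remember what arrived over the network, and flag a sink argument when it shares a long enough byte run with any received buffer. This is not BotSwat's algorithm. The longest-common-substring test, the 4-byte threshold, the class and method names, and the sample IRC command are all assumptions made for this example.

```python
from typing import List

def longest_common_substring(a: bytes, b: bytes) -> int:
    """Length of the longest contiguous byte run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

class CorrelativeDetector:
    """Flags sink arguments that match data previously received remotely."""

    def __init__(self, min_match: int = 4) -> None:
        self.min_match = min_match        # assumed threshold, in bytes
        self.received: List[bytes] = []

    def on_network_receive(self, data: bytes) -> None:
        self.received.append(data)        # remote source

    def on_sink(self, syscall: str, arg: bytes) -> bool:
        """Return True if arg correlates with any received buffer."""
        return any(longest_common_substring(arg, buf) >= self.min_match
                   for buf in self.received)

det = CorrelativeDetector()
det.on_network_receive(b"PRIVMSG #chan :.download http://example.invalid/x.exe c:\\x.exe")
print(det.on_sink("connect", b"example.invalid"))   # argument came from the network
print(det.on_sink("CreateFile", b"C:\\notes.txt"))  # argument unrelated to the network
```

Unlike cause-and-effect semantics, this check does not track how the data travelled from the receive buffer to the sink, only that the same bytes appear in both places.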
Behaviors: ideally disjoint; at lowest level in call stack
Results
• Looked at 6 bots: agobot, dsnxbot, evilbot, g-sysbot, sdbot, spybot
  – At least 4 have totally independent code bases
  – g-sys non-trivially extends sd
  – Spybot borrows only the SYN flood implementation from sd
• Wide variation in implementation
• Every command that exhibited external control was detected; almost every instance of external control was flagged (3 false negatives)
Correlative semantics
• Why necessary
• Why bots with C library functions statically linked in ≈ unconstrained OOB copies
• In general almost as good as cause-and-effect semantics (static vs. dynamic link)
  – Exceptions: commands that format received parameters (e.g. via sprintf)
Comparison
Benign program testing
• Tested against some benign programs that interact with the network
  – Firefox, mIRC, Unreal IRCd
• 3 contextual false positives
  – IRCd: sent on X heard on Y
  – Firefox: dereferencing embedded links
• Artificial false positives: quite a few
  – mIRC: DCC capabilities
  – Firefox: saving contents to a file, …
False positives
a) Contextual false positives (not present in bots)
  – External control heuristic correctly detected, but these actions under these circumstances are widely accepted as non-malicious
b) Artificial false positives (not present in bots)
  – Definition of external control implies no user input agreeing to a particular behavior, but we don’t track “explicitly clean” data (that received via keyboard, mouse)
c) Spurious false positives
  – Any other incorrect flagging of external control
Our mechanism: review
• Single behavioral meta-signature detects wide variety of behaviors on majority of Win32 bots
  – Resilient to differences in implementation
• Resilient in face of unconstrained OOB copies
• Resilient to encryption, with some constraints
• Resilient to changes in command-and-control protocol (e.g. from IRC to HTTP) and parameters (e.g. for rendezvous point)
Plans for VERNIER
• (1) Reimplement BotSwat
  – Using correlative semantics
  – With improved statistical analysis comparing contents of buffers received over the network to arguments of selected syscalls
  – Probably as an entirely kernel-space implementation
  – May leverage some Livewire support to confirm integrity of BotSwat and its components
  – May also leverage Livewire support to enable better resilience to bot use of private encryption functions
    • Using its “watch memory range X (and let me know when it changes)” functionality
Plans for VERNIER
• (2) Confirm BotSwat works at detecting back-door programs
– Obtain various samples of these programs
– Determine whether additional syscalls might need to be hooked in order to provide better coverage of the functionality exported by these programs
Plans for VERNIER
• (3) Feasibility of simple approach to detecting keyloggers
  – If it is the case that the API call to insert self into the call chain for receiving keyboard input (for an arbitrary window, not owned by the calling process) eventually traps to a system call, then this is a simple extension to BotSwat (a new syscall to hook)
  – Otherwise, we need to provide a user-space component to achieve this
  – Any process that signs itself up to receive keyboard input not destined for that process is suspect
  – Can extend this paradigm to trap calls to read another process’s memory
    • Win32 API has “ReadProcessMemory” function call that enables one process to read another process’s memory contents (under certain circumstances)
Plans for VERNIER
• (4) Leverage Virtual Machine Introspection (VMI) IDS technology to
  – Confirm integrity of kernel component of BotSwat
  – Confirm integrity of keyboard/mouse drivers (to ensure that no process is able to obtain keyboard/mouse input via replacing the relevant kernel-mode device drivers)
  – Possibly also augment BotSwat’s resilience to target programs’ use of private encryption functions, and the like
Plans for VERNIER
• (5) Botnet mitigation: “whistleblower”
  – Once some bot B is detected on some host machine via BotSwat, obtain from B (programmatically) the C&C parameters in order to prevent C&C traffic for that botnet from entering or leaving the DoD network
    • Basically, push out firewall filter
  – Also push sample of bot executable to anti-malware scanner so that it can generate a signature for this malware executable
Plans for VERNIER
• (6) Botnet R&D
  – After detecting a bot and pushing out filters, we would like to be able to “poke” that bot (programmatically) in a controlled environment
    • Get it to generate variants of some exploit where those variants could be used as input to an automated vulnerability signature generator
    • Bot would then be operating effectively as a flow classifier
    • Especially for zero-day exploits (or others that do not already have a NIDS signature)
  – Requires learning the command used by the bot to generate such scan/spread packets as well as learning how to gain control of the bot
  – Note: this is not attempting to solve the problem of automated vulnerability signature generation, but simply to get the bot to act as a flow classifier
Plans for VERNIER
• (7) Setting the stage: generating a version of the bot that will not trip anti-malware signature scanners
  – From Christodorescu/Jha (“Testing Malware Detectors”), we have techniques for performing source-code-level obfuscations, including variable renaming and encapsulating/encrypting portions of the source code
  – Christodorescu/Jha showed that the major anti-virus scanners performed very poorly in response to encapsulation using hex encoding
Knowledge Sharing in VERNIER
Patrick Lincoln, SRI
Knowledge Sharing
• Need: Communication is the core concept of a community
  – Application communities rely on ability to share knowledge: Reliable, Efficient, Authentic, Secure
• Approach: two-tier peer-to-peer platform
  – Tuple space (à la Linda)
  – Considering JXTA, jxtaSpaces implementation of tuple spaces
  – Two-tier for better scalability
    • If needed, hypercube hashtable index (à la Obreiter and Graf)
• Benefits: Reliable, efficient (local) knowledge sharing
• Competition: Other possible methods for knowledge sharing include explicit messaging, centralized database, and statically indexed knowledge structures.
  – Other approaches lack scalability, are unreliable, and can be difficult to secure
Knowledge Sharing Levels
• Lower level (within a cluster)
  – Tuple space (à la Linda (Gelernter))
  – Simple queries
    • (*, name, *) returns records regarding ‘name’
  – Concurrent access and update
• Higher level (supernodes)
  – Nodes aggregate knowledge of an entire cluster
  – Use abstraction to summarize current situation
  – Application-level multicast to push out summaries
  – Supernode pushes all summary updates into local tuple space
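The lower-level query style, `(*, name, *)`, can be illustrated with a minimal in-memory tuple space in Python. This is a toy sketch, not JXTA or jxtaSpaces: the class, the use of `None` as the wildcard, and the three-field record layout (node, application, status) are assumptions for the example. The `out` and `in_` names follow Linda's classic operations; `rd_all` is a convenience added here.

```python
import threading
from typing import Any, List, Optional, Tuple

WILDCARD = None  # plays the role of "*" in a query pattern

class TupleSpace:
    """A tiny thread-safe tuple space with wildcard matching."""

    def __init__(self) -> None:
        self._tuples: List[Tuple[Any, ...]] = []
        self._lock = threading.Lock()  # allows concurrent access and update

    @staticmethod
    def _matches(pattern: Tuple[Any, ...], tup: Tuple[Any, ...]) -> bool:
        return len(pattern) == len(tup) and all(
            p is WILDCARD or p == t for p, t in zip(pattern, tup))

    def out(self, tup: Tuple[Any, ...]) -> None:
        """Publish a tuple into the space."""
        with self._lock:
            self._tuples.append(tup)

    def rd_all(self, pattern: Tuple[Any, ...]) -> List[Tuple[Any, ...]]:
        """Return all matching tuples without removing them."""
        with self._lock:
            return [t for t in self._tuples if self._matches(pattern, t)]

    def in_(self, pattern: Tuple[Any, ...]) -> Optional[Tuple[Any, ...]]:
        """Remove and return the first matching tuple, or None if none match."""
        with self._lock:
            for i, t in enumerate(self._tuples):
                if self._matches(pattern, t):
                    return self._tuples.pop(i)
        return None

space = TupleSpace()
space.out(("node-17", "iexplore.exe", "crash"))
space.out(("node-42", "winword.exe", "ok"))
space.out(("node-42", "iexplore.exe", "ok"))

# (*, name, *): every record regarding 'iexplore.exe'
print(space.rd_all((WILDCARD, "iexplore.exe", WILDCARD)))
```

A supernode could run the same kind of query over its cluster's space to build the summaries it multicasts to other supernodes.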
Group Communication
• Group communication is key
  – For higher level, certain usual assumptions
    • Reliable delivery
    • Ordered message delivery
• Spread (www.spread.org) as a basis for implementation of group communication
  – Building on Secure Spread and Progress Software’s (progress.com) more secure, reliable, scalable variants of Spread
Group Communication Security and Privacy: Secrecy and Authenticity
• Security and privacy are critical aspects of VERNIER
• Must authenticate reports and ensure correctness
• Confidentiality of reports
  – Protecting user privacy (my files, my keystrokes)
  – Protect aspects of applications
  – Protect configuration information
  – Protect vulnerability detection information
• Community members send status reports to local supernode
• Reports propagated throughout network
Group Communication Security
• Defense against:
  – Network attacks sending forged messages to supernodes
    + PKI
  – Compromised community member sending false reports
    + Statistical anomaly detection (e.g., EMERALD)
    + Virtualization: any report generated within a compromised virtual machine must be consistent with what is observed outside the virtualization layer
Group Communication Security
• Secure audit logs
  – Secure log of all P2P status reports
  – Enable post-mortem analysis on detected attacks
  – Cryptographic protection of log (Boneh, Waters)
• Sanitizing status reports
  – Status reports reveal private information
  – Special encryption enabling read only by credentialed members and search (as in search over encrypted database) by community
• Mitigating denial of service attacks on supernodes
  – Re-election of supernodes when under attack
• Securing configuration update messages
  – PKI authenticating legitimate reports from community members
VERNIER Scenarios, Schedule, and Plans
Steve Dawson, SRI
Example Scenarios / Use Cases
• Browser crash: demonstrate both local crash recovery from a nonmalicious failure and proactive community avoidance of the same failure
  – Simple case: repeatable Web browser crash occurs when visiting a particular URL
    • Local diagnosis: launch one or more copies of the VM, rolled back to a known good state; play back step-by-step, observe that visiting the URL always causes the crash
    • Local response: quarantine the URL
    • Collaborative diagnosis: problem reported to the community; other installations attempt to replicate the problem, correlate observed behavior with relevant configuration details, discover that the problem occurs only for browser version X or earlier
    • Collaborative response: recommend community-wide upgrade
  – More complex variations could involve situations in which the circumstances leading to the browser crash involve multiple steps or interactions with other software
Example Scenarios / Use Cases (2)
• Phishing scenario: show how VERNIER can mitigate threats even when the attack is unknown and requires (unwitting) human participation
  – Cleverly constructed e-mail induces some key individuals to run a malicious program that subsequently interferes with their ability to send and/or receive e-mail
  – Local diagnosis: detect and correlate the installation actions of the unknown program; separately, affected users report difficulty with e-mail; VERNIER runs an experiment with a checkpointed VM to determine possible association with newly installed program
  – Local response: malicious program automatically removed (possibly by reverting to checkpointed VM)
  – Collaborative diagnosis: VERNIER instances share information about the installed program even before users report a problem; community observes use of unknown software, raising level of suspicion
  – Collaborative response: warning to community against activity leading to installation of malicious program
Example Scenarios / Use Cases (3)
• Patching scenario: demonstrate mitigation of nonmalicious threats such as new software bugs
• Variation on the phishing scenario, where installation of a seemingly beneficial software patch has unintended side effects or introduces a new bug not observed previously
Schedule and Milestones
Infrastructure
VMM recovery/rollback
VMM enforcement mech
Diagnosis
Config management
Quasi-static analysis
Learning-based diagnosis
Sharing protocols
Response
Comm response mgmt
App. integration
Awareness
Situation awareness gauge
System Integration
Unit testing & integration
Scalability dev/testing
Delivery & Transition
Software packaging
Tech transition planning
Testing & Evaluation
Sys testing & metrics
Red teaming support
Management
Status Reports
Integration Milestones
PI Meetings & Demos
Red Teaming
Software & Doc Delivery
[Gantt chart: the tasks above are scheduled by quarter, Q1 through Q10, across CY'06, CY'07, and CY'08, in Phase 1 and Phase 2]
Experimentation and Evaluation
• Project testbed
  – Cluster of 300 virtual hosts
    • 30 server-class physical hosts
    • 10 virtual nodes per server
  – Housing and cluster configuration yet to be determined
    • Single cluster in one location?
    • Three clusters, one at each participant site? [Current plan]
• Software
  – Host OS: Linux
  – Guest (community) OS: Microsoft Windows
  – Applications: IE browser (possibly others); MS Office
• Simulations and scalability
  – Financially infeasible to scale to thousands of nodes
  – Plan is to use hybrid simulation to test scalability
    • Real (live) nodes provide actual data
    • Simulated nodes use synthesized data generated by perturbing data collected from real clusters’ supernodes
Success Criteria
• Metrics and targets (team-defined)
  – False positives (FP) / False negatives (FN)
    • Phase 1: FP < 10%, FN < 20%
    • Phase 2: FP < 1%, FN < 2% (order of magnitude improvement)
  – Percent loss of network availability
    • Phase 1: At most 20% per node, with at most 80% over any 500ms interval
    • Phase 2: At most 5% per node, with at most 20% over any 500ms interval
  – Average time to recovery
    • Phase 1: Assuming a fix exists (not a FN), at most 30 minutes to recover the entire community
    • Phase 2: At most 10 minutes
  – Average network and computational overhead
    • No more than 30% slowdown for applications
    • No more than 100 KB/s average VERNIER-induced network traffic per node
  – Percent accuracy of prediction
    • Phase 1: Effects of problems predicted within 15 minutes of onset; set of nodes wrongly predicted (either way) differs by no more than 40% of actual
    • Phase 2: Prediction within 5 minutes; predicted set differs by no more than 20%
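The FP/FN targets can be checked mechanically once detection outcomes are counted. The Python sketch below is an illustration only: the slide does not say how the percentages are normalized, so it assumes the conventional false positive rate FP/(FP+TN) and false negative rate FN/(FN+TP). The names `Targets`, `rates`, and `meets`, and the example counts, are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Targets:
    max_fp_rate: float  # exclusive upper bound on the false positive rate
    max_fn_rate: float  # exclusive upper bound on the false negative rate

PHASE_1 = Targets(max_fp_rate=0.10, max_fn_rate=0.20)  # FP < 10%, FN < 20%
PHASE_2 = Targets(max_fp_rate=0.01, max_fn_rate=0.02)  # FP < 1%,  FN < 2%

def rates(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate).

    Assumed definitions: FPR = FP / (FP + TN), FNR = FN / (FN + TP).
    """
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    fn_rate = fn / (fn + tp) if fn + tp else 0.0
    return fp_rate, fn_rate

def meets(targets: Targets, tp: int, fp: int, tn: int, fn: int) -> bool:
    """True if both rates are strictly below the phase targets."""
    fp_rate, fn_rate = rates(tp, fp, tn, fn)
    return fp_rate < targets.max_fp_rate and fn_rate < targets.max_fn_rate

# Hypothetical evaluation run: 100 real problems, 100 benign events.
counts = dict(tp=90, fp=5, tn=95, fn=10)
print(rates(**counts))
print("Phase 1 met:", meets(PHASE_1, **counts))
print("Phase 2 met:", meets(PHASE_2, **counts))
```

With these example counts the rates are 5% and 10%, which satisfies the Phase 1 targets but not the tenfold stricter Phase 2 targets.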
Technology Transition
• Ultimate goal of VERNIER is a COTS solution
• Transition partner: VMware, Inc.
  – Supporting VERNIER initially by providing VMware licenses for the testbed
  – May provide limited technical assistance in developing necessary VERNIER-to-VMM APIs (details currently under discussion)
  – Have agreed to define their own success criteria for the technology
    • Functionality, performance, cost, and other relevant goals that, if met, would lead VMware to pursue further development and integration of VERNIER technology into the VMware product line
    • Initial response suggests general agreement with the metrics we’ve already proposed (may want to tweak the numbers a bit), plus
      – Breadth of operating system support
      – Breadth of application support
Next Steps
• VERNIER team workshop
  – Full day (at least)
  – Brainstorming and detailed planning
  – Target date: week of July 17
• Continue discussions with VMware on success criteria, etc.