A System for Detecting Anomalies in Data Streams for Emergency Response Applications
Alec Pawling
University of Notre Dame
October 2, 2007
Outline
Overview
Proposed Research
WIPER
Detection and Alert System
Online Anomaly Detection via Clustering
Link Sampling and Anomalous Link Detection
Real-Time Data Source
Conclusion
Overview
Proposed Research
Fast, online anomaly detection in streaming sensor data
Non-relational data
Relational data
Real-time data aggregation and distribution to various system components
Motivation
Wireless Phone-based Emergency Response System (WIPER)
Wireless Phone-Based Emergency Response System (WIPER)
Emergency Response System
Provide decision support to emergency response managers
Cell phones as sensors
Outline
Overview
Detection and Alert System
Online Anomaly Detection via Clustering
Problem Definition
Related Work
An Online Hybrid Clustering Algorithm
Datasets
Experimental Setup
Results
Proposed Research
Link Sampling and Anomalous Link Detection
Real-Time Data Source
Conclusion
Problem Definition
Problem:
How can we detect anomalies in streaming cell phone transaction data?
Challenges:
Lots of data
Limited time for detecting anomalies
Related Work
Proximity Based Anomaly Detection
Makes no assumptions about data distribution
Anomalous points are far from other points (specific definitions vary from application to application)
Computationally expensive
Clustering can be used to reduce computational complexity
Related Work
Approaches to Data Clustering [Jain, Murty, and Flynn, 1999]:
Hierarchical Clustering
Iteratively split/merge clusters
Computationally expensive
Partitional Clustering
Divides the data into disjoint subsets
Relatively efficient
Assumes prior knowledge of the number of clusters; prone to converging to local optima
Incremental Clustering
Consider examples one at a time; update clusters
Efficient
Related Work
Leader Algorithm [Hartigan, 1975]
For each data example
Locate the closest cluster center.
If the distance between the example and the cluster center is less than a user-defined threshold
Add the example to the cluster.
Otherwise, create a new cluster centered at the example.
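The steps of the leader algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the work: the function name `leader_cluster`, the Euclidean metric, and the choice to keep each center fixed at the example that created it are all assumptions.

```python
import math


def leader_cluster(examples, threshold):
    """Assign each example to its closest center, or start a new cluster.

    Returns (centers, labels). Centers stay fixed at the example that
    created them (an assumption; the slide does not say if centers move).
    """
    centers = []
    labels = []
    for x in examples:
        best_index, best_dist = None, math.inf
        for i, c in enumerate(centers):
            d = math.dist(x, c)  # Euclidean distance (assumed metric)
            if d < best_dist:
                best_index, best_dist = i, d
        if best_index is not None and best_dist < threshold:
            labels.append(best_index)       # close enough: join the cluster
        else:
            centers.append(tuple(x))        # too far: become a new leader
            labels.append(len(centers) - 1)
    return centers, labels
```

Each example is visited once, which is what makes the method incremental; the result depends on the order in which examples arrive.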
Related Work
Hybrid Clustering: combination of two clustering algorithms
Cheu et al. 2004: Use partitional algorithms to reduce data set for hierarchical algorithms
Chipman and Tibshirani 2006: Combine bottom-up algorithms with top-down algorithms
An Online Hybrid Clustering Algorithm
For each example $\vec{x}$:
Find the closest cluster $C_i$
Let $\vec{\mu}_i$ be the centroid of $C_i$
Let $\vec{\sigma}_i$ be the standard deviations of the features of $C_i$
If $d(\vec{x}, \vec{\mu}_i) < l\,|\vec{\sigma}_i|$, add $\vec{x}$ to $C_i$
Otherwise, add $\vec{x}$ to the set of unclustered examples
If there are $km$ examples in the unclustered set:
Cluster the unclustered examples using k-means
For each cluster with $m$ or more examples:
Accept the cluster
For each cluster with fewer than $m$ examples:
Return its examples to the unclustered set
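The procedure above can be sketched as follows. This is an illustrative reading of the slide, not the author's code: the names `kmeans`, `Cluster`, and `hybrid_cluster` are hypothetical, the distance is assumed Euclidean, $|\vec{\sigma}_i|$ is taken to be the Euclidean norm of the per-feature standard deviations, and k-means is seeded with the first k points to keep the sketch deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means, seeded with the first k points (an assumption)."""
    centers = [list(p) for p in points[:k]]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # leave the center of an empty group where it was
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return groups


class Cluster:
    def __init__(self, members):
        self.members = list(members)

    def centroid(self):
        n = len(self.members)
        return [sum(col) / n for col in zip(*self.members)]

    def sigma_norm(self):
        """Euclidean norm of the per-feature standard deviations."""
        mu, n = self.centroid(), len(self.members)
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(zip(*self.members), mu)]
        return math.sqrt(sum(variances))


def hybrid_cluster(stream, k, m, l):
    clusters, unclustered = [], []
    for x in stream:
        if clusters:
            nearest = min(clusters, key=lambda c: math.dist(x, c.centroid()))
            if math.dist(x, nearest.centroid()) < l * nearest.sigma_norm():
                nearest.members.append(x)
                continue
        unclustered.append(x)
        if len(unclustered) >= k * m:
            groups = kmeans(unclustered, k)
            unclustered = []
            for g in groups:
                if len(g) >= m:
                    clusters.append(Cluster(g))   # accept the cluster
                else:
                    unclustered.extend(g)         # too small: try again later
    return clusters, unclustered
```

The sketch recomputes centroids and deviations from the stored members for clarity; a streaming version would presumably keep running sums instead. Examples left in the unclustered set at any point are the natural anomaly candidates.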
Experimental Setup
Dataset:
Real world data:
12 days of cell phone network transaction data
Discretized into 1-minute intervals
18721 examples
Feature vector:
Timestamp: hour and minute
Number of times each service is used in the interval
5 services
Evaluation:
Compare hybrid algorithm to 1-NN anomaly detection
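The slides do not define the 1-NN baseline precisely. One common reading, sketched here as an assumption, scores each example by the distance to its nearest other example and treats the highest-scoring examples as outliers. The function names and the Euclidean metric are illustrative.

```python
import math


def nearest_neighbor_distances(examples):
    """Distance from each example to its nearest other example (brute force)."""
    scores = []
    for i, x in enumerate(examples):
        scores.append(min(math.dist(x, y)
                          for j, y in enumerate(examples) if j != i))
    return scores


def top_outliers(examples, n):
    """Indices of the n examples farthest from their nearest neighbor."""
    scores = nearest_neighbor_distances(examples)
    return sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)[:n]
```

The brute-force search is quadratic in the number of examples, which illustrates why proximity-based detection is described as computationally expensive.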
Results
Figure: Distribution of distances between outliers and their nearest neighbor. (Series: Full, Trial 2, Trial 5, Trial 8; axis: pairwise distances, 0 to 2500.)
Proposed Research
New first-level clustering algorithm:
Deterministic, hierarchical
Additional analysis of clusters:
Movement of clusters
Rate at which examples are added to clusters
Outline
Overview
Detection and Alert System
Online Anomaly Detection via Clustering
Link Sampling and Anomalous Link Detection
Problem Definition
Related Work
Datasets
Implementation Details
Experimental Setup
Results
Conclusions
Proposed Research
Real-Time Data Source
Conclusion
Problem Definition
Problem:
How does sampling a graph (network) affect our ability to identify anomalous edges (links)?
Challenges:
Large graphs
Limited time
Limited memory
Related Work
Sampling Networks
“Subnets of Scale-Free Networks are not Scale-Free” [Stumpf et al., 2005]
Sampling a network often changes its characteristics in predictable ways [Lee et al., 2006]
Sampling from Streams
Sliding window: only contains most recent items in the stream
Uniform sample [Vitter, 1985]: all items in the stream have equal probability of being retained by the sample
Biased sample [Aggarwal, 2006]: compromise between sliding window and uniform sample
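The three stream-sampling strategies can be contrasted with a short sketch. The uniform sampler is the classic reservoir scheme (Algorithm R). The biased sampler is a simplified scheme in the spirit of biased reservoir sampling and should not be read as the exact method of the cited paper. Function names and the `rng` parameter are illustrative.

```python
import random
from collections import deque


def sliding_window(stream, size):
    """Keep only the most recent `size` items."""
    window = deque(maxlen=size)
    for item in stream:
        window.append(item)
    return list(window)


def uniform_reservoir(stream, size, rng=random):
    """Algorithm R: every item seen so far is retained with equal probability."""
    sample = []
    for t, item in enumerate(stream):
        if t < size:
            sample.append(item)
        else:
            j = rng.randint(0, t)  # inclusive on both ends
            if j < size:
                sample[j] = item
    return sample


def biased_reservoir(stream, size, rng=random):
    """Always admit the new item. It evicts a random resident with probability
    equal to the fraction of the reservoir already filled, so older items
    fade out gradually rather than being cut off as in a sliding window."""
    sample = []
    for item in stream:
        if rng.random() < len(sample) / size:
            sample[rng.randrange(len(sample))] = item
        else:
            sample.append(item)
    return sample
```

All three keep at most `size` items in memory, which is the property that matters when the full transaction stream cannot be stored.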
Related Work
Anomalous Link Detection [Rattigan and Jensen, 2005]
Goal: Identify “surprising” edges in a graph
Methods from link prediction literature [Liben-Nowell and Kleinberg, 2003]
For each edge, (u, v), in the graph, compute the proximity of u and v
Anomalous links have a proximity below some threshold
Two general approaches:
Neighborhood based methods
Path based methods
Related Work
Neighborhood based methods. Let $\Gamma(u)$ be the set of vertices that are connected to u by an edge
Common neighbors: the number of neighbors shared by u and v
$|\Gamma(u) \cap \Gamma(v)|$
Jaccard’s coefficient: the probability that a neighbor of u or v is a neighbor of both u and v
$\frac{|\Gamma(u) \cap \Gamma(v)|}{|\Gamma(u) \cup \Gamma(v)|}$
Path based method
Rooted PageRank: the probability that a random walk starting at u will reach v if the walk fails at each step with some probability
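The two neighborhood-based measures reduce to set operations on adjacency sets. The sketch below assumes an undirected graph given as a list of (u, v) pairs; the helper names and the `anomalous_links` thresholding function are illustrative, not taken from the slides.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Map each vertex to the set of its neighbors (undirected, assumed)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def anomalous_links(edges, score, threshold):
    """Edges whose endpoint proximity falls below the threshold."""
    adj = build_adjacency(edges)
    return [(u, v) for u, v in edges if score(adj, u, v) < threshold]
```

For example, in a triangle a-b-c with a pendant edge c-d, the edge (c, d) has no common neighbors and is the one flagged at a threshold of 1.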
Datasets
Cell phone network: transactions initiated by members of a single service provider
SMS: one day of text message transactions
Phone: one day of call transactions
Enron: snapshot of Enron email server. Contains emails to and from @enron.com addresses, May 10, 1999 to January 31, 2002

              vertices    transactions  edges
SMS (1 day)   2,350,793   3,339,708     1,597,818
Call (1 day)  6,261,633   8,019,290     5,243,128
Enron         25,854      1,033,638     201,243
Implementation Details
Implementation is straightforward for common neighbors and Jaccard’s coefficient
Rooted PageRank is typically determined using the stationary distribution of a Markov Chain
Stationary distribution is computed by repeated matrix multiplications
Matrices for the SMS and call datasets are too large to store in main memory
We use a series of random walks to approximate rooted PageRank
Bound the walk length using a geometric distribution
Total number of random walks is based on the average degree of the graph
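A Monte Carlo approximation along these lines might look like the sketch below. It follows the definition given earlier in the slides (the probability that a walk from u reaches v when the walk fails at each step with some probability), so stopping with a fixed probability per step yields a geometrically distributed walk length. The failure probability of 0.15 and the default of 1000 walks are placeholder values; the slides do not give the actual rule that ties the number of walks to the average degree.

```python
import random


def rooted_pagerank_estimate(adj, u, v, fail_prob=0.15, walks=1000, rng=random):
    """Fraction of random walks from u that visit v before failing.

    `adj` maps each vertex to a list of its neighbors.
    """
    hits = 0
    for _ in range(walks):
        node = u
        while True:
            if rng.random() < fail_prob:  # walk fails; length is geometric
                break
            neighbors = adj.get(node)
            if not neighbors:             # dead end: nowhere to go
                break
            node = rng.choice(neighbors)
            if node == v:
                hits += 1
                break
    return hits / walks
```

Only the adjacency lists need to be in memory, which avoids storing the full transition matrix.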
Experimental Setup
Three sampling methods: sliding window, uniform sampling, and biased sampling
Three anomalous link detection methods: common neighbors, Jaccard’s coefficient, and rooted PageRank
Sample sizes range from 10% to 90% of the transactions
Evaluate using Spearman’s rank correlation
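Spearman's rank correlation is the Pearson correlation of the rank vectors. The sketch below assumes it is applied to two lists of proximity scores for the same edges, presumably one computed on the full graph and one on a sample; the slides do not spell out the pairing. Tied values receive the average of their rank positions.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the rank vectors (ties handled by averaging)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A value near 1 means the sample orders the edges almost as the full data does. The sketch does not guard against a constant input, for which the correlation is undefined.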
Results
Figure: Rank correlations for call dataset. Left: Jaccard’s coefficient. Right: rooted PageRank. (Both panels: x-axis, fraction of data set, 0 to 1; y-axis, rank correlation, 0.4 to 1; series: uniform sample, biased sample, sliding window.)
Observations
Rooted PageRank performs better on smaller samples
Rooted PageRank is computationally expensive
Better to use Jaccard’s coefficient with larger samples.
Proposed Research
Extract and analyze city-level subgraphs
Investigate changes in Jaccard’s coefficient distribution over time
Outline
Overview
Detection and Alert System
Online Anomaly Detection via Clustering
Link Sampling and Anomalous Link Detection
Real-Time Data Source
Overview
Prototype Implementation
Experimental Setup
Results
Conclusions
Proposed Research
Conclusion
Overview
Motivation:
Use existing cell phone network as a sensor network
Advantages:
Cheap deployment
Disadvantages:
No control over the network
Goal:
Receive transaction data from the cellular service provider
Summarize and distribute data to clients (DSS, DAS, SPS) in real time
Overview
Incoming data:
Time at which service was initiated
The network service used
Anonymized values indicating people involved in using the service
Towers involved in providing the service
Outgoing data:
Stream of interval summaries
Each item in the stream consists of
A timestamp indicating the end of the interval
A vector containing the number of times each service was used in the interval
Clients specify interval length
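The outgoing stream can be illustrated with a small generator. This is a sketch under stated assumptions, not the prototype (which was written in Ruby): transactions are reduced to (time, service) pairs with integer timestamps, they arrive in time order, and the function name and the fixed `services` ordering are hypothetical.

```python
from collections import Counter


def interval_summaries(transactions, interval, services):
    """Yield (interval_end, counts) for an in-order stream of (time, service)."""
    current_end = None
    counts = Counter()
    for time, service in transactions:
        if current_end is None:
            # First interval is the one containing the first transaction.
            current_end = (time // interval + 1) * interval
        while time >= current_end:
            # Close finished intervals, including empty ones.
            yield current_end, [counts[s] for s in services]
            counts = Counter()
            current_end += interval
        counts[service] += 1
    if current_end is not None:
        yield current_end, [counts[s] for s in services]
```

Each client would run this with its own `interval`, matching the statement that clients specify the interval length.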
Prototype Implementation
Ruby:
Interpreted language
Web-services support
Multi-threading support with large priority space
Assumption:
Data from service provider arrives in order
Periodic Task Model:
Periodic tasks: send data to clients
For each client: a task executes at the end of every interval
Deadline is the end of the next interval
Aperiodic tasks: maintain interval summaries
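The timing of the periodic tasks can be made concrete with a small sketch. It only models release times and deadlines as described above and does not reproduce the prototype's scheduler; both function names are illustrative.

```python
def periodic_jobs(period, count):
    """(release, deadline) pairs: each job is released at the end of an
    interval and is due by the end of the next one."""
    return [(k * period, (k + 1) * period) for k in range(1, count + 1)]


def missed_deadline_rate(jobs, finish_times):
    """Fraction of jobs whose finish time is after the deadline."""
    missed = sum(1 for (_, deadline), done in zip(jobs, finish_times)
                 if done > deadline)
    return missed / len(jobs)
```

With a period of 2 time units, the first three jobs are released at 2, 4, and 6 and are due at 4, 6, and 8; a job that finishes at time 7 against a deadline of 6 counts as missed.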
Experimental Setup
Setup:
2 to 24 clients
Task periods of 0.05, 0.06, 0.07, 0.08, 0.09 seconds
Constant transaction streams: 100 transactions/second
Four evaluation measures:
the rate of missed deadlines
the rate of skipped tasks
the average delay for the periodic tasks
the correctness of the data source output
Results
Observations:
System fails (incorrect output) with a low utilization (≈ 0.26)
In many cases, tasks were released after their deadline and skipped
Conclusion:
Periodic task model is too inflexible for this system
Proposed Research
Use rate-based execution model [Jeffay and Goddard, 1999]
Parameterize with:
Maximum expected aperiodic task rate
Desired aperiodic task response time
When aperiodic task rate exceeds maximum expected rate:
Deadlines shift, response time decays
Remove assumption that transaction stream arrives in order
Sporadic tasks with dynamic release times to distribute summaries to clients
Minimize data loss, minimize delay
Outline
Overview
Detection and Alert System
Online Anomaly Detection via Clustering
Link Sampling and Anomalous Link Detection
Real-Time Data Source
Conclusion
Summary of Proposed Research
Proposed Schedule
Summary of Proposed Research
Detection and Alert System
Online Anomaly Detection via Clustering
Extend hybrid clustering algorithm into a streaming algorithm
Link Sampling and Anomalous Link Detection
Identify feasible methods for reducing graph data for online analysis
Identify graph features that can be quickly computed and allow the identification of anomalous behavior in graphs
Real-Time Data Source
Develop a real-time system for distributing summaries of streaming transaction data to clients
Handle out-of-order data arrival dynamically
Online minimization of dropped data and propagation delay
Published Papers
Online Anomaly Detection via Clustering:
Proceedings of the North American Association for Computational Social and Organization Science, 2006. (Best student paper.)
Computational and Mathematical Organization Theory. To appear.
Proposed Schedule
Detection and Alert System
Online Anomaly Detection via Clustering
New conference paper early in 2008
New journal paper late in 2008
Anomaly Detection in Graphs
Conference submission (SIAM) in October 2007
Additional conference submission early in 2008
Journal submissions in late 2008 or early 2009
Real-Time Data Source:
Conference submission in mid 2008 (describing a completely redesigned and rebuilt system)
Journal submission in early 2009
Dissertation Defense: March 2009
Acknowledgments
The material presented here is based in part upon work supported by the National Science Foundation, the DDDAS Program, under grant No. CNS-050348.
The committee:
Dr. Chaudhary
Dr. Chawla
Dr. Poellabauer
The outside chair:
Dr. Hachen
My advisor:
Dr. Madey
Questions?