4/24/09 - KSU
Spatiotemporal Stream Mining Using EMM
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841
Completely Data Driven Model
No assumptions about data
We only know the general format of the data
THE DATA WILL TELL US WHAT THE MODEL SHOULD LOOK LIKE!
WARNING
Motivation
A growing number of applications generate streams of data.
Computer network monitoring data
Call detail records in telecommunications (Cisco VoIP 2003)
Highway transportation traffic data (MnDot 2005)
Online web purchase log records (JCPenney 2003, Travelocity 2005)
Sensor network data (Ouse, Derwent 2002)
Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.
EMM Build
Input observation vectors:
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
…
[Figure: the EMM graph after each input vector, with nodes N1, N2, N3 and arcs labeled with transition probabilities such as 1/3, 2/3, 1/2, and 1/1.]
Spatiotemporal Stream Mining Using EMM
Spatiotemporal Stream Data
EMM vs MM vs other dynamic MM techniques
EMM Overview
EMM Applications
Spatiotemporal Environment
Observations arriving in a stream
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
Vt = <S1t, S2t, ..., Snt>

      V1    V2    …    Vq
S1    S11   S12   …    S1q
S2    S21   S22   …    S2q
…     …     …     …    …
Sn    Sn1   Sn2   …    Snq
Time →
Data Stream Modeling Requirements
Single pass: Each record is examined at most once
Bounded storage: Limited memory for storing synopsis
Real-time: Per record processing time must be low
Summarization (synopsis) of data
Use data, NOT SAMPLE
Temporal and Spatial
Dynamic
Continuous (infinite stream)
Learn
Forget
Sublinear growth rate - Clustering
MM
A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
S = {N1, N2, …, Nm}, and
A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and
each arc, Lij = <Ni,Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
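As a minimal sketch of the definition above, the transition probabilities Pij = P(Nj | Ni) can be estimated from an observed sequence of states by counting how often each state follows another. The function name and the sample sequence are illustrative assumptions and do not come from the slides.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next = j | current = i) from one observed state sequence."""
    arc_counts = defaultdict(Counter)  # arc_counts[i][j] = times i was followed by j
    for current, nxt in zip(states, states[1:]):
        arc_counts[current][nxt] += 1
    probs = {}
    for i, followers in arc_counts.items():
        total = sum(followers.values())  # number of transitions leaving state i
        probs[i] = {j: count / total for j, count in followers.items()}
    return probs


# Hypothetical sequence of visited states.
sequence = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = transition_probabilities(sequence)
```

Here N1 is followed by N2 twice and by N3 once, so the arc N1 to N2 is labeled 2/3 and the arc N1 to N3 is labeled 1/3.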
Problem with Markov Chains
The required structure of the MC may not be certain at the model construction time.
As the real world being modeled by the MC changes, so should the structure of the MC.
Not scalable: the model grows linearly with the number of events.
Our solution:
Extensible Markov Model (EMM)
Cluster real world events
Allow Markov chain to grow and shrink dynamically
Extensible Markov Model (EMM)
Time Varying Discrete First Order Markov Model
Nodes (Vertices) are clusters of real world observations.
Learning continues during application phase.
Learning:
Transition probabilities between nodes
Node labels (centroid/medoid of cluster)
Nodes are added and removed as data arrives
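The learning steps listed above can be sketched as a small incremental model. This is a toy illustration under stated assumptions, not the authors' implementation: the class name, the Euclidean distance test, the threshold value of 2.0, and the running-mean centroid update are all choices made here for the example. The input vectors are the ones shown on the EMM Build slide, but the sketch does not claim to reproduce the graph drawn there.

```python
import math


class EMM:
    """Toy Extensible Markov Model: nodes are clusters, arcs carry counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed: max Euclidean distance to join a node
        self.centroids = []         # node label: running mean of assigned vectors
        self.node_counts = []       # how many observations each node absorbed
        self.arc_counts = {}        # (i, j) -> number of observed transitions i -> j
        self.current = None         # index of the node matched by the last input

    def _nearest(self, vector):
        best, best_dist = None, math.inf
        for idx, centroid in enumerate(self.centroids):
            dist = math.dist(vector, centroid)
            if dist < best_dist:
                best, best_dist = idx, dist
        return best, best_dist

    def build(self, vector):
        """Process one observation: match or create a node, then update the arc."""
        node, dist = self._nearest(vector)
        if node is None or dist > self.threshold:
            # No existing cluster is close enough: grow the model by one node.
            self.centroids.append(list(vector))
            self.node_counts.append(1)
            node = len(self.centroids) - 1
        else:
            # Absorb the observation and move the centroid to the new mean.
            self.node_counts[node] += 1
            n = self.node_counts[node]
            self.centroids[node] = [c + (x - c) / n
                                    for c, x in zip(self.centroids[node], vector)]
        if self.current is not None:
            arc = (self.current, node)
            self.arc_counts[arc] = self.arc_counts.get(arc, 0) + 1
        self.current = node
        return node

    def transition_probability(self, i, j):
        """Count of arc i -> j divided by the count of all arcs leaving i."""
        leaving = sum(c for (src, _), c in self.arc_counts.items() if src == i)
        return self.arc_counts.get((i, j), 0) / leaving if leaving else 0.0


stream = [
    (18, 10, 3, 3, 1, 0, 0),
    (17, 10, 2, 3, 1, 0, 0),
    (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 0, 0, 0),
    (18, 10, 3, 3, 1, 1, 0),
]
emm = EMM(threshold=2.0)  # hypothetical threshold chosen for this example
path = [emm.build(v) for v in stream]
```

Each observation is examined once and only centroids and counts are stored, which matches the single-pass and bounded-storage requirements. A smaller threshold would create more nodes for the same stream.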
Related Work
Splitting Nodes in HMMs
Create new states by splitting an existing state
M.J. Black and Y. Yacoob, “Recognizing facial expressions in image sequences using local parameterized models of image motion,” Int. Journal of Computer Vision, 25(1), 1997, 23-48.
Dynamic Markov Modeling
States and transitions are cloned
G. V. Cormack, R. N. S. Horspool, “Data compression using dynamic Markov Modeling,” The Computer Journal, Vol. 30, No. 6, 1987.
Augmented Markov Model (AMM)
Creates new states if the input data has never been seen in the model, and transition probabilities are adjusted
Dani Goldberg, Maja J Mataric, “Coordinating mobile robot group behavior using a model of interaction dynamics,” Proceedings, the Third International Conference on Autonomous Agents (Agents ’99), Seattle, Washington
EMM vs AMM
Our proposed EMM model is similar to AMM, but is more flexible:
EMM continues to learn during the application phase.
The EMM is a generic incremental model whose nodes can have any kind of representatives.
State matching is determined using a clustering technique.
EMM allows not only the creation of new nodes, but also the deletion (or merging) of existing nodes. This allows the EMM model to “forget” old information which may not be relevant in the future. It also allows the EMM to adapt to any main memory constraints for large scale datasets.
EMM performs one scan of data and therefore is suitable for online data processing.
EMM Operations
Input: EMM
Output: EMM’
EMM Build – Modify/add nodes/arcs based on input observations
EMM Prune – Remove nodes/arcs
EMM Merge – Combine multiple EMM nodes
EMM Split – Split a node into multiple nodes
EMM Age – Modify relative weights of old versus new observations
EMM Combine – Merge multiple EMMs by merging specific states and transitions.
      Loc_1  Loc_2  Loc_3  Loc_4  Loc_5  Loc_6  Loc_7
 1       20     50    100     30     25      4     10
 2       20     80     50     20     10     10     10
 3       40     30     75     20     30     20     25
 4       15     60     30     30     10     10     15
 5       40     15     25     10     35     40      9
 6        5      5     40     35     10      5      4
 7        0     35     55      2      1      3      5
 8       20     60     30     11     20     15     10
 9       45     40     15     18     20     20     15
10       15     20     40     40     10     10     14
11        5     45     55     10     10     15      0
12       10     30     10      4     15     15     10
Example from rEMM (R Package Available)
Courtesy Mike Hahsler
EMM Prune
[Figure: an EMM with nodes N1, N2, N3, N5, N6 and arcs labeled with transition probabilities, shown before and after the operation “Delete N2”; the pruned model keeps N1, N3, N5, N6.]
EMM Advantages
Dynamic
Adaptable
Use of clustering
Learns rare event
Sublinear Growth Rate
Creation/evaluation quasi-real time
Distributed / Hierarchical extensions
Overlap Learning and Testing
EMM Applications
Predict – Forecast future state values.
Evaluate (Score) – Assess degree of model compliance. Find the probability that a new observation belongs to the same class of data modeled by the given EMM.
Analyze – Report model characteristics concerning EMM.
Visualize – Draw graph
Probe – Report specific detailed information about a state (if available)
EMM Results
Predicting Flooding
Ouse and Derwent – River flow data from England
http://www.nercwallingford.ac.uk/ih/nrfa/index.html
Rare Event Detection
VoIP Traffic Data obtained at Cisco Systems
Minnesota Traffic Data
Classification
DNA/RNA Sequence Analysis
Derwent River (UK)
[Figure: map of the Derwent River with labeled locations 28043, 28011, 28048, 28010, 28023, 28117.]
Sublinear Growth Rate
[Figure: number of states in the model (0 to 800) versus number of input data (total 1574), one curve per threshold: 0.994, 0.995, 0.996, 0.997, 0.998.]

Number of states by data set, similarity measure (Sim), and threshold:
                   Threshold
Data     Sim       0.99  0.992  0.994  0.996  0.998
Derwent  Jaccard    156    190    268    389    667
Derwent  Dice        72     92    123    191    389
Derwent  Cosine      11     14     19     31     61
Derwent  Overlap      2      2      3      3      4
Ouse     Jaccard     56     66     81    105    162
Ouse     Dice        40     43     52     66    105
Ouse     Cosine       6      8     10     13     24
Ouse     Overlap      1      1      1      1      1
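The slides name four similarity measures (Jaccard, Dice, Cosine, Overlap) but do not give their formulas. The sketch below assumes the common real-valued generalizations based on dot products; the function names and the two sample vectors (taken from the EMM Build slide) are illustrative. It does not handle all-zero vectors.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def jaccard(x, y):
    # Extended Jaccard (Tanimoto): x.y / (|x|^2 + |y|^2 - x.y)
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))


def dice(x, y):
    # 2 x.y / (|x|^2 + |y|^2)
    return 2 * dot(x, y) / (dot(x, x) + dot(y, y))


def cosine(x, y):
    # x.y / (|x| |y|)
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def overlap(x, y):
    # x.y / min(|x|^2, |y|^2); may exceed 1 for real-valued vectors
    return dot(x, y) / min(dot(x, x), dot(y, y))


a = (18, 10, 3, 3, 1, 0, 0)
b = (17, 10, 2, 3, 1, 0, 0)
scores = [f(a, b) for f in (jaccard, dice, cosine, overlap)]
```

Under these definitions, for non-negative vectors the scores are ordered Jaccard ≤ Dice ≤ Cosine ≤ Overlap. If a new state is created whenever the best similarity falls below the threshold, that ordering is consistent with the table, where Jaccard produces the most states and Overlap the fewest at every threshold.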
Prediction Error Rates
Normalized Absolute Ratio Error (NARE)
$$\mathrm{NARE} = \frac{\sum_{t=1}^{N} |O(t) - P(t)|}{\sum_{t=1}^{N} O(t)}$$
Root Mean Square (RMS)
$$\mathrm{RMS} = \sqrt{\frac{\sum_{t=1}^{N} (O(t) - P(t))^2}{N}}$$
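A minimal sketch of the two error measures, assuming O(t) is the observed value and P(t) the predicted value at time t, and that both series have the same length N. The function names and the sample numbers are made up for illustration and are not results from the talk.

```python
import math


def nare(observed, predicted):
    """Sum of absolute errors divided by the sum of observed values."""
    total_error = sum(abs(o - p) for o, p in zip(observed, predicted))
    return total_error / sum(observed)


def rms(observed, predicted):
    """Square root of the mean squared error over N time points."""
    n = len(observed)
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / n)


# Hypothetical water levels in meters.
observed = [2.0, 4.0, 4.0]
predicted = [1.0, 4.0, 6.0]
nare_value = nare(observed, predicted)  # (1 + 0 + 2) / 10
rms_value = rms(observed, predicted)    # sqrt((1 + 0 + 4) / 3)
```

NARE scales the total error by the total observed value, so it is unit-free; RMS stays in the units of the data and penalizes large individual errors more heavily.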
EMM Performance – Prediction (Ouse)
Model           NARE       RMS       No of States
RLF             0.321423   1.5389
EMM Th=0.95     0.068443   0.43774   20
EMM Th=0.99     0.046379   0.4496    56
EMM Th=0.995    0.055184   0.57785   92
EMM Water Level Prediction – Ouse Data
[Figure: water level (m), 0 to 8, versus input time series, with three curves: RLF Prediction, EMM Prediction, Observed.]
Rare Event
Rare - Anomalous – Surprising
Out of the ordinary
Not outlier detection
No knowledge of data distribution
Data is not static
Must take temporal and spatial values into account
May be interested in sequence of events
Ex: Snow in upstate New York is not rare
Snow in upstate New York in June is rare
Rare events may change over time
Rare Event Examples
The amount of traffic through a site in a particular time interval is extremely high or low.
The type of traffic (i.e. source IP addresses or destination addresses) is unusual.
Current traffic behavior is unusual based on recent previous traffic behavior.
Unusual behavior at several sites.
Rare Event Detection Applications
Intrusion Detection
Fraud
Flooding
Unusual automobile/network traffic
Our Approach
By learning what is normal, the model can predict what is not
Normal is based on likelihood of occurrence
Use EMM to build model of behavior
We view a rare event as:
Unusual event
Transition between event states which does not frequently occur.
Base rare event detection on determining events or transitions between events that do not frequently occur.
Continue learning
EMMRare
EMMRare algorithm indicates if the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
The frequency of the node at time t+1 is below this threshold
The updated transition probability of the MC transition from node at time t to the node at t+1 is below the threshold
Determining Rare
Occurrence Frequency (OFc) of a node Nc:
$$OF_c = \frac{CN_c}{\sum_i CN_i}$$
Normalized Transition Probability (NTPmn), from one state, Nm, to another, Nn:
$$NTP_{mn} = \frac{CL_{mn}}{\sum_i CN_i}$$
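As a small sketch of these two quantities, assume CNi is the number of observations absorbed by node Ni and CLmn is the number of observed transitions from Nm to Nn. The dictionary layout, function names, and counts below are hypothetical.

```python
def occurrence_frequency(node_counts, c):
    """OFc: count of node c divided by the total count over all nodes."""
    return node_counts[c] / sum(node_counts.values())


def normalized_transition_probability(node_counts, arc_counts, m, n):
    """NTPmn: count of arc m -> n divided by the total count over all nodes."""
    return arc_counts.get((m, n), 0) / sum(node_counts.values())


# Hypothetical counts for a three-node EMM.
node_counts = {"N1": 6, "N2": 3, "N3": 1}
arc_counts = {("N1", "N2"): 2, ("N2", "N1"): 2, ("N1", "N3"): 1}

of_n3 = occurrence_frequency(node_counts, "N3")
ntp_n1_n2 = normalized_transition_probability(node_counts, arc_counts, "N1", "N2")
```

Both measures share the same denominator, so a transition out of a frequently visited node and one out of a rarely visited node are compared on the same scale.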
EMMRare
Given:
• Rule#1: CNi <= thCN
• Rule#2: CLij <= thCL
• Rule#3: OFc <= thOF
• Rule#4: NTPmn <= thNTP
Input:
Gt: EMM at time t
i: Current state at time t
R = {R1, R2, …, RN}: A set of rules
Output:
At: Boolean alarm at time t
Algorithm:
$$A_t = \begin{cases} 1 & R_i = \text{True} \\ 0 & R_i = \text{False} \end{cases}$$
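The rule check can be sketched as follows. The slide does not spell out how several rules combine; this sketch assumes the alarm is raised when any rule holds, in line with the earlier statement that an event is rare if either condition occurs. The function signature and the threshold values are hypothetical.

```python
def emm_rare(cn_i, cl_ij, of_c, ntp_mn, th_cn, th_cl, th_of, th_ntp):
    """Return the alarm At: 1 if any rule fires, otherwise 0."""
    rules = [
        cn_i <= th_cn,     # Rule#1: node count at or below its threshold
        cl_ij <= th_cl,    # Rule#2: arc count at or below its threshold
        of_c <= th_of,     # Rule#3: occurrence frequency at or below its threshold
        ntp_mn <= th_ntp,  # Rule#4: normalized transition probability at or below its threshold
    ]
    return 1 if any(rules) else 0


# Hypothetical thresholds.
thresholds = dict(th_cn=1, th_cl=1, th_of=0.05, th_ntp=0.02)

common = emm_rare(cn_i=40, cl_ij=12, of_c=0.40, ntp_mn=0.12, **thresholds)
rare = emm_rare(cn_i=40, cl_ij=1, of_c=0.40, ntp_mn=0.01, **thresholds)
```

In the second call the node itself is common, but the transition into it has been seen only once, so Rule#2 and Rule#4 fire and the alarm is 1.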
Temporal Heat Map
Also called Temporal Chaos Game Representation (TCGR)
Temporal Heat Map (THM) is a visualization technique for streaming data derived from multiple sensors.
It is a two dimensional structure similar to an infinite table.
Each row of the table is associated with one sensor value.
Each column of the table is associated with a point in time.
Each cell within the THM is a color representation of the sensor value
Colors normalized (in our examples)
0 – White
0.5 – Blue
1.0 – Red
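One possible way to turn a normalized value into a cell color is shown below. The slide only fixes three anchor colors; the linear interpolation between them, the clamping, and the function name are assumptions made for this sketch.

```python
def thm_color(value):
    """Map a normalized value in [0, 1] to an (R, G, B) tuple of 0-255 ints."""
    v = min(max(value, 0.0), 1.0)  # clamp out-of-range inputs
    if v <= 0.5:
        f = v / 0.5  # 0 at white, 1 at blue
        fade = round(255 * (1 - f))
        return (fade, fade, 255)
    f = (v - 0.5) / 0.5  # 0 at blue, 1 at red
    return (round(255 * f), 0, round(255 * (1 - f)))


white = thm_color(0.0)
blue = thm_color(0.5)
red = thm_color(1.0)
```

Applying the function to every cell of a sensors-by-time matrix gives one THM column per time point.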
Cisco – Internal VoIP Traffic Data
• Time →
• Values →
• Complete Stream: CiscoEMM.png
• VoIP traffic data was provided by Cisco Systems and represents logged VoIP traffic in their Richardson, Texas facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17 11:29:11 2003.
Rare Event Detection
Weekdays Weekend
Minnesota DOT Traffic Data
Detected unusual weekend traffic pattern
TCGR Example
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
             A    C    G    T
Pos 0-8      2    3    3    1
Pos 1-9      1    3    3    2
…
Pos 34-42    2    4    2    1

             A    C    G    T
Pos 0-8      0.4  0.6  0.6  0.2
Pos 1-9      0.2  0.6  0.6  0.4
…
Pos 34-42    0.4  0.8  0.4  0.2
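The moving-window counts above can be reproduced with a few lines of Python. The function names are illustrative. The normalized table is consistent with dividing each count by 5; the slide does not say how that divisor was chosen, so the sketch takes it as a parameter.

```python
def window_counts(sequence, window=9, alphabet="acgt"):
    """Count each symbol of `alphabet` in every window of `window` positions."""
    seq = sequence.lower()
    return [
        [seq[start:start + window].count(symbol) for symbol in alphabet]
        for start in range(len(seq) - window + 1)
    ]


def normalize(counts, divisor):
    """Scale raw counts by a fixed divisor (5 reproduces the slide's values)."""
    return [[c / divisor for c in row] for row in counts]


sequence = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
counts = window_counts(sequence)
scaled = normalize(counts, divisor=5)
```

The sequence has 43 positions, so a 9-position window yields 35 count vectors (windows 0 through 34), each of which can serve as one input observation to an EMM.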
TCGR Example (cont’d)
Window 0: Pos 0-8      acgtgcacg
Window 1: Pos 1-9      cgtgcacgt
Window 17: Pos 17-25   tccggaacc
Window 18: Pos 18-26   ccggaacca
Window 34: Pos 34-42   ccacgtcga
[Figure: TCGR for these windows over the symbols A, C, G, T.]
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure: TCGR panels for All Mature, Mus musculus, Homo sapiens, and C. elegans, with labeled patterns ACG, CGC, GCG, UCG.]
Research Approach
1. Represent potential miRNA sequence with TCGR sequence of count vectors
2. Create EMM using count vectors for known miRNA (miRNA stem loops, miRNA targets)
3. Predict unknown sequence to be miRNA (miRNA stem loop, miRNA target) based on normalized product of transition probabilities along clustering path in EMM
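Step 3 can be sketched as a scoring function over the path of nodes that a sequence's count vectors are matched to. The slides do not define "normalized product"; this sketch assumes it means the geometric mean of the transition probabilities along the path, so that scores of paths with different lengths are comparable. The function name and the probabilities are hypothetical.

```python
import math


def path_score(path, transition_prob):
    """Geometric mean of the transition probabilities along a node path."""
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    log_sum = 0.0
    for src, dst in steps:
        p = transition_prob.get((src, dst), 0.0)
        if p == 0.0:
            return 0.0  # an unseen transition makes the whole product zero
        log_sum += math.log(p)
    return math.exp(log_sum / len(steps))


# Hypothetical transition probabilities of a trained EMM.
probs = {(0, 1): 0.5, (1, 2): 0.5, (2, 0): 1.0}

seen = path_score([0, 1, 2, 0], probs)  # cube root of 0.5 * 0.5 * 1.0
unseen = path_score([0, 2, 1], probs)   # uses arcs the model never observed
```

A sequence would then be predicted to belong to the modeled class when its score exceeds a chosen cutoff; how that cutoff is set is not covered by the slides.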
Related Work 1
Predicted occurrence of pre-miRNA segments from a set of hairpin sequences
No assumptions about biological function or conservation across species.
Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not.
Sensitivity of 93.3%
Specificity of 88.1%
1 C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
Preliminary Test Data 1
Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
Positive Test: This dataset contains 30 pre-miRNAs.
Negative Test: This dataset contains 1000 randomly chosen sequences from coding regions.
1 C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
References
1) Margaret H. Dunham, Nathaniel Ayewah, Zhigang Li, Kathryn Bean, and Jie Huang, “Spatiotemporal Prediction Using Data Mining Tools,” Chapter XI in Spatial Databases: Technologies, Techniques and Trends, Yannis Manolopoulos, Apostolos N. Papadopoulos and Michael Gr. Vassilakopoulos, Editors, 2005, Idea Group Publishing, pp 251-271.
2) Margaret H. Dunham, Yu Meng, and Jie Huang, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
3) Yu Meng, Margaret Dunham, Marco Marchetti, and Jie Huang, “Rare Event Detection in a Spatiotemporal Environment,” Proceedings of the IEEE Conference on Granular Computing, May 2006, pp 629-634.
4) Yu Meng and Margaret H. Dunham, “Online Mining of Risk Level of Traffic Anomalies with User's Feedbacks,” Proceedings of the IEEE Conference on Granular Computing, May 2006, pp 176-181.
5) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
6) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
7) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.
8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” accepted to appear, LDM conference, 2009.