Anomaly Detection Using Projective Markov Models


DESCRIPTION

Presented at the 2009 CDC, Shanghai. Anomaly Detection Using Projective Markov Models in a Distributed Sensor Network. Sean Meyn, Amit Surana, Yiqing Lin, and Satish Narayanan. https://netfiles.uiuc.edu/meyn/www/spm_files/Mismatch/Mismatch.html

TRANSCRIPT

Anomaly Detection Using Projective Markov Models in a Distributed Sensor Network

Sean Meyn, Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois

Joint work with Amit Surana, Yiqing Lin, and Satish Narayanan, United Technologies Research Center

Acknowledgements: Research supported by United Technologies Research Center and the National Science Foundation, CCF 07-29031

Outline

• Detection in a Sensor Network

• Multiple Models for Distributed Detection

• Application to Building Security

I. Detection in a Sensor Network

Problem Statement

Detect anomalous behavior based on:

• A large number of heterogeneous sensors
• A large heterogeneous region
• Partial information regarding anomalous behavior
• Complex behavior with or without anomaly

Interest at UTRC: building monitoring for security and energy efficiency.

Challenges and Resolution

1. Model for anomalous behavior
2. Complexity of normal behavior
3. Lack of coordinated action, or communication constraints
4. High variance associated with detection

Approach: Projective Markov models address 1-3. Parameterized models address 4.

II. Multiple Models for Distributed Detection

[Figure: divergence neighborhoods Qη(π0) and Qβ∗(π1) around the two marginals in the space of distributions]

Binary Hypothesis Testing - Geometric View

For ease of explanation only: the classical i.i.d. setting. For starters: binary hypothesis testing.

Z is an i.i.d. sequence on a finite state space:

• π0: marginal under the model of normal behavior
• π1: marginal under the model of anomalous behavior

Optimal test: under Neyman-Pearson or Bayesian criteria, the log-likelihood ratio test is optimal. With log-likelihood ratio L = log(dπ1/dπ0),

φ(Z_1^T) = I{ (1/T) Σ_{t=1}^T L(Z(t)) ≥ τ }

Geometry: the test is defined by a separating hyperplane in the space of distributions,

{ µ : ∫ L(z) µ(dz) = τ }

together with the divergence neighborhood Qη(π0) = { µ : D(µ‖π0) < η }.

[Figure: neighborhoods Qη(π0) and Qβ∗(π1) separated by the hyperplane]

LLR test: declare an anomaly if the empirical distribution lies outside of the lower half space, where the empirical distribution is

Γ_T(z) := (1/T) Σ_{t=1}^T I{ Z(t) = z },  z ∈ Z
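As a concrete illustration, the LLR test above reduces to comparing a sample average of L against the threshold. A minimal numerical sketch (Python/NumPy); the two marginals, the 4-letter alphabet, and the threshold are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical marginals on a 4-letter alphabet (illustration only)
pi0 = np.array([0.4, 0.3, 0.2, 0.1])   # normal behavior
pi1 = np.array([0.1, 0.2, 0.3, 0.4])   # anomalous behavior

L = np.log(pi1 / pi0)                  # log-likelihood ratio L = log(dpi1/dpi0)

def llr_test(Z, tau):
    """Declare anomaly iff (1/T) sum_t L(Z(t)) >= tau."""
    return L[Z].mean() >= tau

T = 1000
Z_normal = rng.choice(4, size=T, p=pi0)
Z_anom = rng.choice(4, size=T, p=pi1)
print(llr_test(Z_normal, tau=0.0), llr_test(Z_anom, tau=0.0))
```

With τ = 0 the test compares the empirical mean of L against zero: under π0 it concentrates near −D(π0‖π1) < 0, and under π1 near D(π1‖π0) > 0.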

Universal Detection

Anomalous behavior is not modeled. An alarm is sounded if the empirical distribution lies outside a divergence neighborhood of π0 [Hoeffding, 1965]:

φ(Z_1^T) = I{ Γ_T ∉ Qη(π0) } = I{ D(Γ_T‖π0) ≥ η }

Good news: for large T, performance approaches the optimality of the LLR test.

Bad news: for finite T, the variance of the statistic D(Γ_T‖π0) grows linearly with the size of the observation alphabet Z. [Unnikrishnan, Huang, Meyn, Surana, Veeravalli, 2009] [Clarke and Barron, 1990]
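The Hoeffding test can be sketched directly from the formula above; a minimal example (Python/NumPy), with the nominal marginal, alphabet, and threshold η invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

pi0 = np.array([0.4, 0.3, 0.2, 0.1])    # model of normal behavior (invented)

def kl(mu, pi):
    """D(mu || pi) on a finite alphabet, with the 0 log 0 = 0 convention."""
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

def hoeffding_test(Z, eta):
    """Alarm iff the empirical distribution leaves the ball Q_eta(pi0)."""
    gamma = np.bincount(Z, minlength=len(pi0)) / len(Z)
    return kl(gamma, pi0) >= eta

T = 2000
Z_normal = rng.choice(4, size=T, p=pi0)
Z_anom = rng.choice(4, size=T, p=[0.1, 0.2, 0.3, 0.4])
print(hoeffding_test(Z_normal, eta=0.01), hoeffding_test(Z_anom, eta=0.01))
```

Note that only π0 enters the test: the anomalous distribution is never modeled, which is what makes the test universal.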

Universal Detection - Multiple Models

Suppose we extract features of the observations, Z_i(t) = ϕ_i(Z(t)), 1 ≤ i ≤ n, with Z_i(t) ∼ π0^i, selected based on:

• Constraints, such as sensor locations
• Prior knowledge regarding anomalous behavior
• Variance reduction

Optimal combination of features for optimal detection:

φ(Z_1^T) = I{ Γ_i(T) ∉ Qη(π0^i) for some i } = I{ Γ_T ∉ ∩_i Qη(π0^i) }

Geometry: the intersection of the neighborhoods Qη(π0^1), Qη(π0^2), ... forms the Safe Region; an alarm is sounded when the empirical distribution exits it.

[Figure: Safe Region formed by the intersection of the divergence neighborhoods Qη(π0^i)]
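A sketch of the multiple-model test (Python/NumPy). The two features here are invented for illustration (the low and high bits of a 4-letter observation); each gets its own divergence-ball test, and an alarm fires if any feature exits its ball:

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(mu, pi):
    mask = mu > 0
    return float(np.sum(mu[mask] * np.log(mu[mask] / pi[mask])))

# Two hypothetical features of a 4-letter observation: its low bit and high bit
features = [lambda z: z % 2, lambda z: z // 2]
pi0 = np.array([0.4, 0.3, 0.2, 0.1])                     # invented normal marginal
# Marginal of each feature under pi0
pi0_feats = [np.array([pi0[0] + pi0[2], pi0[1] + pi0[3]]),   # low bit
             np.array([pi0[0] + pi0[1], pi0[2] + pi0[3]])]   # high bit

def multi_model_test(Z, eta):
    """Alarm iff Gamma_i(T) leaves Q_eta(pi0^i) for some feature i."""
    for phi, pi0i in zip(features, pi0_feats):
        Zi = np.array([phi(z) for z in Z])
        gamma_i = np.bincount(Zi, minlength=len(pi0i)) / len(Zi)
        if kl(gamma_i, pi0i) >= eta:
            return True
    return False

T = 2000
Z_normal = rng.choice(4, size=T, p=pi0)
Z_anom = rng.choice(4, size=T, p=[0.1, 0.2, 0.3, 0.4])
print(multi_model_test(Z_normal, eta=0.01), multi_model_test(Z_anom, eta=0.01))
```

Because each feature lives on a small alphabet, each per-feature statistic has far lower variance than the full Hoeffding statistic, at the price of only detecting anomalies visible through some feature.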

Markov Models

KL divergence is replaced by the relative entropy rate:

J(Q‖P) = lim_{n→∞} (1/n) D(γ^(n)‖π^(n)) = D(γ^(2)‖π^(2)) − D(γ‖π)

where γ^(n) and π^(n) are the distributions of (Z(1), ..., Z(n)), assumed Markovian with transition matrices Q and P. The limit and the two-marginal expression are equal under the Markov assumption.
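The identity above can be checked numerically; a small sketch (Python/NumPy) with two invented 2-state transition matrices:

```python
import numpy as np

def stationary(M):
    """Stationary distribution: leading left eigenvector of the transition matrix."""
    vals, vecs = np.linalg.eig(M.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

def kl(a, b):
    """D(a || b) for strictly positive arrays of matching shape."""
    return float(np.sum(a * np.log(a / b)))

def entropy_rate_div(Q, P):
    """Relative entropy rate J(Q||P) = sum_x gamma(x) D(Q(x,.) || P(x,.))."""
    gamma = stationary(Q)
    return float(np.sum(gamma[:, None] * Q * np.log(Q / P)))

# Two invented 2-state transition matrices (Q: observed, P: nominal model)
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
P = np.array([[0.7, 0.3], [0.3, 0.7]])

# The slide's identity: J(Q||P) = D(gamma^(2) || pi^(2)) - D(gamma || pi)
gamma, pi = stationary(Q), stationary(P)
gamma2, pi2 = gamma[:, None] * Q, pi[:, None] * P   # bivariate distributions
assert abs(entropy_rate_div(Q, P) - (kl(gamma2, pi2) - kl(gamma, pi))) < 1e-10
print(round(entropy_rate_div(Q, P), 4))
```

The identity holds because D(γ^(2)‖π^(2)) splits as D(γ‖π) plus the γ-weighted divergence between the transition rows.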


Markov Models

Local models? Two approaches:

Option A: Shannon-Mori-Zwanzig projection,

P(x, y) := π^(2)(x, y) / π(x)

Option B: Parameterized models,

π_θ^(2)(x, y) = e^{θᵀψ(x,y)},  θ ∈ Rᵐ

with θ chosen using maximum-likelihood estimation.

Advantage of Option B: variance grows with the dimension m, not with the cardinality of the observation space.
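Option B can be sketched as an exponential-family fit of the bivariate distribution; a minimal illustration (Python/NumPy) in which the alphabet size, the feature map ψ, and the "true" θ are all invented, and the family is explicitly normalized:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical feature map psi on pairs of a 3-letter alphabet, m = 2 features
N, m = 3, 2
psi = rng.normal(size=(N, N, m))            # psi(x, y) in R^m (invented)

def model(theta):
    """Normalized exponential family: pi_theta^(2)(x,y) ∝ exp(theta^T psi(x,y))."""
    w = np.exp(psi @ theta)
    return w / w.sum()

# Sample pairs from the model at a "true" theta, then refit theta by ML
theta_true = np.array([1.0, -0.5])
counts = rng.multinomial(5000, model(theta_true).ravel()).reshape(N, N)
gamma2 = counts / counts.sum()              # bivariate empirical distribution

theta = np.zeros(m)
for _ in range(10000):
    # Log-likelihood gradient: E_empirical[psi] - E_theta[psi]
    grad = np.tensordot(gamma2 - model(theta), psi, axes=([0, 1], [0, 1]))
    theta += 0.05 * grad
print(np.round(theta, 2))
```

Only the m-dimensional moment vector of ψ is estimated from data, which is the sense in which the variance scales with m rather than with the alphabet size.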

III. Application to Building Security

Building Testbed at UTRC

Eleven Markov models for occupancy, based on eleven zones.

Option A: Empirical Markov model
Option B: Queueing model [Smith and Towsley, 1981]

[Figure: floor plan marking video cameras and zones]


Experiment Architecture

[Figure: floor plan showing the numbered zones and sensor placement]

Scenarios: capture a range of unusual traffic patterns in a building.

1. Convergence: Numerous occupants converge to a single zone
2. Divergence: Numerous occupants leave a single zone
3. Idleness: Numerous occupants converge to a single zone
4. Loitering: Numerous occupants converge to a single zone
5. High occupancy: Higher than normal occupancy in combined zones

Typical ROC Curves

Test statistic: based on a moving window of length δ0.

Empirical: R(t) = J(Γ^(2)_{δ0,t} ‖ π^(2)), where Γ^(2)_{δ0,t} is the bivariate empirical distribution over the window.

[Figure: ROC curves and detection delay for the empirical and semi-empirical models]

Semi-empirical: the likelihood ratio using the ML estimate,

R(t) = (1/δ0) Σ_{k=t−δ0+1}^{t} log ℓ_{t,δ0}(k),  ℓ_{t,δ0}(k) := P_{t,δ0}(Z(k), Z(k+1)) / P(Z(k), Z(k+1))

where P_{t,δ0} is the transition matrix estimated by maximum likelihood over the window.
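The windowed empirical statistic can be sketched as follows (Python/NumPy). The 2-state occupancy chain, the anomalous dynamics, and the window length are all invented for illustration; the statistic is the relative entropy rate of the windowed bivariate empirical distribution against the nominal model:

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented 2-state occupancy chain: P = nominal model, Q = anomalous dynamics
P = np.array([[0.95, 0.05], [0.10, 0.90]])
Q = np.array([[0.70, 0.30], [0.30, 0.70]])

def sample(T, M, x0=0):
    """Sample a path of length T from transition matrix M."""
    Z = [x0]
    for _ in range(T - 1):
        Z.append(rng.choice(2, p=M[Z[-1]]))
    return np.array(Z)

def window_stat(Z, t, delta, P):
    """R(t): divergence rate of the windowed bivariate empirical distribution
    from the model, with the 0 log 0 = 0 convention."""
    pairs = np.zeros((2, 2))
    for k in range(t - delta, t):
        pairs[Z[k], Z[k + 1]] += 1.0
    gamma2 = pairs / pairs.sum()
    Qhat = gamma2 / np.maximum(gamma2.sum(axis=1, keepdims=True), 1e-12)
    mask = gamma2 > 0
    return float(np.sum(gamma2[mask] * np.log(Qhat[mask] / P[mask])))

# Normal operation followed by an anomalous episode
Z = np.concatenate([sample(500, P), sample(300, Q)])
delta = 200
print(window_stat(Z, 400, delta, P), window_stat(Z, 799, delta, P))
```

During normal operation the statistic stays near zero; once the window covers the anomalous segment it rises toward the relative entropy rate J(Q‖P).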

Centralized Detection

Test statistic: maximum of the per-zone statistics. Anomalous episode: convergence to zone 6.

[Figure: empirical and semi-empirical statistics over time]

Delay is similar using either test; the semi-empirical statistic is far more discriminating.
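The centralized fusion rule is simply the maximum of the per-zone statistics against a common threshold; a tiny sketch with invented values:

```python
# Centralized test: fuse the per-zone statistics by taking their maximum
# and comparing to a common threshold (all values below are invented).
zone_stats = {"zone4": 0.8, "zone5": 1.1, "zone6": 7.4}
threshold = 5.0
alarm = max(zone_stats.values()) >= threshold
print(alarm)
```

The zone whose statistic crosses the threshold also localizes the anomaly, which is what the figure's convergence-to-zone-6 episode illustrates.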

Decentralized Detection: Divergence

Anomalous episode: divergence from zone 5.

[Figure: per-zone statistics (zones 4, 5, 6) for the empirical and semi-empirical tests]

Delay is similar using either test; many false alarms from the empirical statistic.

Decentralized Detection: Occupancy

Anomalous episode: 10% higher occupancy in zones 5 and 6.

[Figure: per-zone statistics (zones 5, 6) for the empirical and semi-empirical tests]

Missed detection using the empirical statistic. Empirical statistic clairvoyant?

Conclusions

Contributions:

• Feasibility of an anomaly detection framework using projected Markov models
• Advantages of semi-empirical Markov models

Current research:

• Feature selection for distributed detection
• Active learning - e.g., query for additional data
• Diagnosis
• Response

References

[1,3,4,5,6,8] Geometry
[2,3,4,6,8] Universal detection
[4,7] Variance in detection and parameter estimation

[1] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3:146–158, 1975.

[2] O. Zeitouni and M. Gutman. On universal hypotheses testing via large deviations. IEEE Trans. Inform. Theory, 37(2):285–290, 1991.

[3] C. Pandit and S. P. Meyn. Worst-case large-deviations with application to queueing and information theory. Stoch. Proc. Applns., 116(5):724–756, May 2006.

[4] J. Unnikrishnan, D. Huang, S. Meyn, A. Surana, and V. Veeravalli. Universal and composite hypothesis testing via mismatched divergence. CoRR, abs/0909.2234, 2009. Submitted for publication, IEEE Trans. Inform. Theory.

[5] S. Borade and L. Zheng. I-projection and the geometry of error exponents. In Proceedings of the Forty-Fourth Annual Allerton Conference on Communication, Control, and Computing, Sept. 27-29, 2006, UIUC, Illinois, USA, 2006.

[6] E. Abbe, M. Medard, S. Meyn, and L. Zheng. Finding the best mismatched detector for channel coding and hypothesis testing. Information Theory and Applications Workshop, pages 284–288, Jan. 29-Feb. 2, 2007.

[7] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3):453–471, 1990.

[8] D. Huang, J. Unnikrishnan, S. Meyn, V. Veeravalli, and A. Surana. Statistical SVMs for robust detection, supervised learning, and universal classification. In Proceedings of the Information Theory Workshop on Networking and Information Theory, Volos, Greece, 2009.
