Post on 22-Nov-2020


Anomaly Detection in Scikit-Learn and new tools from Multivariate Extreme Value Theory

Nicolas Goix

Supervision:

Detecting Anomalies with Multivariate Extremes: Stéphan Clémençon and Anne Sabourin

Contributions to Scikit-Learn: Alexandre Gramfort

LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay


Anomaly Detection and Scikit-Learn

1 Anomaly Detection and Scikit-Learn

2 Multivariate EVT & Representation of Extremes

3 Estimation

4 Experiments


Anomaly Detection (AD)

What is Anomaly Detection?

"Finding patterns in the data that do not conform to expected behavior"

Huge number of applications: network intrusions, credit card fraud detection, insurance, finance, military surveillance, ...


Machine Learning context

Different kinds of Anomaly Detection

• Supervised AD
  - Labels available for both normal data and anomalies
  - Similar to rare class mining

• Semi-supervised AD (Novelty Detection)
  - Only normal data available to train
  - The algorithm learns on normal data only

• Unsupervised AD (Outlier Detection)
  - No labels, training set = normal + abnormal data
  - Assumption: anomalies are very rare


Important literature in Anomaly Detection:

• Statistical AD techniques
  - fit a statistical model for normal behavior
  - ex: EllipticEnvelope

• Density-based
  - ex: Local Outlier Factor (LOF) and variants (COF, ODIN, LOCI)

• Support estimation
  - OneClassSVM
  - MV-set estimate

• High-dimensional techniques
  - Spectral Techniques
  - Random Forest
  - Isolation Forest
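Four of the estimators named in this list ship with scikit-learn and share the same outlier-detection interface. The sketch below is only an illustration: the toy data, the hyperparameters and the variable names are assumptions, not settings from the talk, and it presumes a reasonably recent scikit-learn (0.20 or later for `fit_predict` on all four).

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Toy data (assumption): a Gaussian blob plus a few uniformly scattered points.
rng = np.random.RandomState(0)
X_normal = rng.normal(size=(200, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# One representative per family listed above; hyperparameters are illustrative.
detectors = {
    "EllipticEnvelope": EllipticEnvelope(random_state=0),   # statistical
    "LocalOutlierFactor": LocalOutlierFactor(n_neighbors=20),  # density-based
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),     # support estimation
    "IsolationForest": IsolationForest(random_state=0),     # isolation-based
}

# fit_predict returns +1 for inliers and -1 for outliers.
labels = {name: det.fit_predict(X) for name, det in detectors.items()}
for name, y in labels.items():
    print(name, "flagged", int((y == -1).sum()), "points")
```

All four return labels in {-1, +1}, so they can be swapped in and out of the same evaluation loop.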


Isolation Forest:


IsolationForest.fit(X)

IsolationForest
Inputs: X, n_estimators, max_samples

Output: Forest with:

• # trees = n_estimators
• sub-sampling size = max_samples
• maximal depth max_depth = int(log2(max_samples))

Complexity: O(n_estimators · max_samples · log(max_samples))

default: n_estimators=100, max_samples=256
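The structure of the fitted forest can be inspected directly. This is a minimal sketch on synthetic data (an assumption); it relies on the fitted attributes `estimators_` and `max_samples_` of scikit-learn's `IsolationForest` and on the `tree_.max_depth` attribute of each fitted tree.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training set (assumption): 1000 points, 3 features.
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 3))

forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)

n_trees = len(forest.estimators_)          # number of trees actually built
subsample_size = forest.max_samples_       # sub-sampling size used per tree
deepest = max(tree.tree_.max_depth for tree in forest.estimators_)
depth_cap = int(np.log2(256))              # depth limit stated on the slide
print(n_trees, subsample_size, deepest, depth_cap)
```

Because each tree only sees `max_samples` points and is cut at a logarithmic depth, the training cost does not grow with the size of the full dataset.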


IsolationForest.predict(X)

Finding the depth in each tree

depth(Tree, X):
    # - Finds the depth level of the leaf node
    #   for each sample x in X.
    # - Add average path length (n_samples_in_leaf)
    #   if x not isolated

$\mathrm{score}(x, n) = 2^{-E(\mathrm{depth}(x)) / c(n)}$

Complexity: O(n_samples · n_estimators · log(max_samples))
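The normalizing term c(n) is not spelled out on the slide. The sketch below assumes the definition from the Isolation Forest paper listed in the references (Liu, Ting, Zhou, 2008): the average path length of an unsuccessful search in a binary search tree, with the harmonic number approximated by ln(i) plus Euler's constant. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): average depth of an unsuccessful BST search among n points
    (definition assumed from Liu et al., 2008)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """score(x, n) = 2 ** (-E[depth(x)] / c(n)); close to 1 means anomalous."""
    return 2.0 ** (-mean_depth / average_path_length(n))

n = 256
for depth in (2.0, average_path_length(n), 2 * average_path_length(n)):
    print(round(depth, 2), round(anomaly_score(depth, n), 3))
```

A point isolated after very few splits gets a score near 1, while a point whose average depth equals c(n) gets exactly 0.5.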


Examples
• code example:

>>> from sklearn.ensemble import IsolationForest
>>> IF = IsolationForest()
>>> IF.fit(X_train)     # build the trees
>>> IF.predict(X_test)  # find the average depth

• plotting decision function:


Figure: decision function plot (n_samples_normal = 150, n_samples_outliers = 50)
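A decision-function plot of this kind can be prepared by evaluating the fitted forest on a grid. The sketch below reuses the sample sizes shown on the slide; the data distribution, the grid bounds and all variable names are assumptions, and the plotting call itself is left as a comment so the block runs without matplotlib.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
n_samples_normal, n_samples_outliers = 150, 50

# Assumed toy data: a tight cluster of normal points plus uniform outliers.
X_normal = 0.3 * rng.randn(n_samples_normal, 2) + 2.0
X_outliers = rng.uniform(low=-4, high=4, size=(n_samples_outliers, 2))
X_train = np.vstack([X_normal, X_outliers])

clf = IsolationForest(random_state=0)
clf.fit(X_train)                 # build the trees
y_pred = clf.predict(X_train)    # +1 for inliers, -1 for outliers

# Evaluate the decision function on a 50 x 50 grid (higher = more normal).
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# With matplotlib available, the surface could be drawn with, for example:
#   plt.contourf(xx, yy, Z); plt.scatter(X_train[:, 0], X_train[:, 1])
center_value = clf.decision_function([[2.0, 2.0]])[0]
corner_value = clf.decision_function([[-5.0, -5.0]])[0]
print(center_value, corner_value)
```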


Multivariate EVT & Representation of Extremes


General idea of our work

• Extreme observations play a special role when dealing with outlying data.

• But no algorithm has specific treatment for such multivariate extreme observations.

• Our goal: Provide a method which can improve performance of standard AD algorithms by combining them with a multivariate extreme analysis of the dependence structure.


Goal:

$X = (X_1, \ldots, X_d)$

Find the groups of features which can be large together

ex: $\{X_1, X_2\}$, $\{X_3, X_6, X_7\}$, $\{X_2, X_4, X_{10}, X_{11}\}$

⇔ Characterize the extreme dependence structure

Anomalies = points which violate this structure


Framework

• Context
  - Random vector $X = (X_1, \ldots, X_d)$
  - Margins: $X_j \sim F_j$ ($F_j$ continuous)

• Preliminary step: standardization of each marginal
  - Standard Pareto: $V_j = \frac{1}{1 - F_j(X_j)}$, so that $P(V_j \ge x) = \frac{1}{x}$ for $x \ge 1$
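The standardization can be checked numerically when the margin is known. In this sketch the margin is assumed to be a unit exponential (any continuous F_j would do), and the helper names are hypothetical.

```python
import math
import random

random.seed(0)

def exponential_cdf(x):
    """F(x) = 1 - exp(-x): the (assumed) known continuous margin."""
    return 1.0 - math.exp(-x)

def to_standard_pareto(x, cdf):
    """V = 1 / (1 - F(X)); V is standard Pareto when F is the true margin."""
    return 1.0 / (1.0 - cdf(x))

n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
v = [to_standard_pareto(x, exponential_cdf) for x in samples]

# Compare the empirical tail P(V >= x) with the target 1 / x.
for threshold in (2.0, 10.0, 50.0):
    empirical = sum(vj >= threshold for vj in v) / n
    print(threshold, empirical, 1.0 / threshold)

tail_10 = sum(vj >= 10.0 for vj in v) / n
```

After this step every feature lives on the same scale, so "large" means the same thing for all coordinates.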


Problem

Joint extremes: what is the distribution of $V$ above large thresholds?

$P(V \in A)$? ($A$ 'far from the origin').


Fundamental hypothesis and consequences

• Standard assumption: let $A$ be an extreme region,

$P[V \in tA] \simeq t^{-1} P[V \in A]$ (radial homogeneity)

• Formally, regular variation (after standardization): for $0 \notin A$,

$t\, P[V \in tA] \xrightarrow[t \to \infty]{} \mu(A)$, $\mu$: exponent measure

Necessarily: $\mu(tA) = t^{-1} \mu(A)$

• ⇒ angular measure on the sphere $S_{d-1}$: $\Phi(B) = \mu\{tB,\ t \ge 1\}$
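The homogeneity of the exponent measure follows in one line from the limit that defines it. In the short derivation below, the fixed scaling factor s > 0 and the radius r are notation introduced here for illustration; the second line applies the same property to the set used to define the angular measure.

```latex
\begin{align*}
\mu(sA) &= \lim_{t\to\infty} t\,P[V \in t s A]
         = s^{-1} \lim_{t\to\infty} (ts)\,P[V \in (ts) A]
         = s^{-1}\,\mu(A), \\
\mu\{tB,\ t \ge r\} &= \mu\bigl(r\,\{tB,\ t \ge 1\}\bigr)
         = r^{-1}\,\mu\{tB,\ t \ge 1\}
         = r^{-1}\,\Phi(B).
\end{align*}
```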


General model in multivariate EVT

Model for excesses

Intuitively: $P[V \in A] \simeq \mu(A)$. For a large $r > 0$ and a region $B$ on the unit sphere:

$P\left[\|V\| > r,\ \frac{V}{\|V\|} \in B\right] \sim \frac{1}{r}\Phi(B) = \mu\{tB,\ t \ge r\}$, $r \to \infty$

⇒ $\Phi$ (or $\mu$) rules the joint distribution of extremes (if margins are known).


Angular distribution

• $\Phi$ rules the joint distribution of extremes
  - Asymptotic dependence: $(V_1, V_2)$ may be large together.
  vs
  - Asymptotic independence: only $V_1$ or $V_2$ may be large.
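The two regimes can be contrasted with a small simulation. This is a sketch under stated assumptions: standard Pareto margins, the sup norm as radius, and two extreme toy cases (independent coordinates versus identical coordinates); the threshold `r`, the tolerance `eps` and the helper name are hypothetical choices.

```python
import random

random.seed(1)
n, r, eps = 200_000, 100.0, 0.1

def joint_extreme_fraction(pairs):
    """Among points with sup norm above r, the fraction whose smaller
    coordinate is still at least eps times the larger one."""
    extremes = [(a, b) for a, b in pairs if max(a, b) > r]
    both_large = sum(min(a, b) > eps * max(a, b) for a, b in extremes)
    return both_large / len(extremes)

# Asymptotic independence: two independent standard Pareto coordinates.
independent = [(random.paretovariate(1.0), random.paretovariate(1.0))
               for _ in range(n)]

# Asymptotic dependence (extreme case): the two coordinates are identical.
shared = [random.paretovariate(1.0) for _ in range(n)]
dependent = [(z, z) for z in shared]

frac_independent = joint_extreme_fraction(independent)
frac_dependent = joint_extreme_fraction(dependent)
print(frac_independent, frac_dependent)
```

In the independent case almost all extreme points sit close to an axis, which is the situation where the angular measure concentrates on the edges of the sphere.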


General Case

• Sub-cones: $C_\alpha = \{\|v\| \ge 1,\ v_i > 0\ (i \in \alpha),\ v_j = 0\ (j \notin \alpha)\}$

• Corresponding sub-spheres: $\{\Omega_\alpha,\ \alpha \subset \{1, \ldots, d\}\}$ ($\Omega_\alpha = C_\alpha \cap S_{d-1}$)


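Representation of extreme data

• Natural decomposition of the angular measure:

$\Phi = \sum_{\alpha \subset \{1, \ldots, d\}} \Phi_\alpha$ with $\Phi_\alpha = \Phi|_{\Omega_\alpha} \leftrightarrow \mu|_{C_\alpha}$

• ⇒ yields a representation

$\mathcal{M} = \{\Phi(\Omega_\alpha) : \emptyset \ne \alpha \subset \{1, \ldots, d\}\} = \{\mu(C_\alpha) : \emptyset \ne \alpha \subset \{1, \ldots, d\}\}$

• Assumption: $\frac{d\mu|_{C_\alpha}}{dv_\alpha} = O(1)$.

• Remark: Representation $\mathcal{M}$ is linear (after non-linear transform of the data $X \to V$).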
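The representation has one entry per non-empty subset of features, so its size is 2^d − 1. The sketch below enumerates the subsets for a small d and prints the count for a few dimensions (54 is the feature count of the forestcover dataset used in the experiments); the helper name is hypothetical.

```python
from itertools import combinations

def nonempty_subsets(d):
    """Yield every non-empty subset alpha of {1, ..., d} as a tuple."""
    features = range(1, d + 1)
    for size in range(1, d + 1):
        for alpha in combinations(features, size):
            yield alpha

subsets_d3 = list(nonempty_subsets(3))
print(subsets_d3)

# The number of sub-cones grows exponentially with the dimension.
for d in (3, 10, 54):
    print(d, 2 ** d - 1)
```

This growth is why only a sparse representation, with mass on a small number of subsets, is usable in practice.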


Sparse Representation ?

Full pattern: anything may happen. Sparse pattern: $V_1$ not large if $V_2$ or $V_3$ large.


Estimation


Problem: M is an asymptotic representation

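$\mathcal{M} = \{\Phi(\Omega_\alpha),\ \alpha\} = \{\mu(C_\alpha),\ \alpha\}$

is the restriction of an asymptotic measure

$\mu(A) = \lim_{t \to \infty} t\, P[V \in tA]$

to a representative class of sets $\{C_\alpha,\ \alpha\}$, but only the central sub-cone has positive Lebesgue measure!

⇒ Cannot just do, for large $t$:

$\Phi(\Omega_\alpha) = \mu(C_\alpha) \simeq t\, P[V \in tC_\alpha]$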


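Solution

Fix $\varepsilon > 0$. Assign data $\varepsilon$-close to an edge to that edge.

$\Omega_\alpha \to \Omega_\alpha^\varepsilon = \{v \in S_{d-1} : v_i > \varepsilon\ (i \in \alpha),\ v_j \le \varepsilon\ (j \notin \alpha)\}$
$C_\alpha \to C_\alpha^\varepsilon = \{t\,\Omega_\alpha^\varepsilon,\ t \ge 1\}$

New partition of $S_{d-1}$, compatible with non-asymptotic data.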
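In code, assigning a point to its thickened sub-cone amounts to normalizing it and thresholding each coordinate at ε. The sketch below assumes the sup norm (the norm that appears in the scoring function later in the talk) and strictly positive coordinates; the function name and the example points are hypothetical.

```python
def assign_subset(v, eps):
    """Return the subset alpha (1-based feature indices) such that
    v / ||v||_inf lies in Omega_alpha^eps."""
    radius = max(v)  # sup norm; coordinates are positive after standardization
    return frozenset(j for j, vj in enumerate(v, start=1) if vj / radius > eps)

points = [
    (100.0, 2.0, 80.0),   # features 1 and 3 large together, feature 2 negligible
    (1.5, 300.0, 1.2),    # feature 2 large alone
    (50.0, 40.0, 45.0),   # all three features large together
]
assigned = [assign_subset(v, eps=0.1) for v in points]
for v, alpha in zip(points, assigned):
    print(v, sorted(alpha))
```

Every point gets exactly one subset, so the thickened regions form a partition, as stated above.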


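$V_i^j = \frac{1}{1 - \hat{F}_j(X_i^j)}$ with $\hat{F}_j(X_i^j) = \frac{\mathrm{rank}(X_i^j) - 1}{n}$

⇒ get a natural estimate of $\Phi(\Omega_\alpha)$:

$\hat{\Phi}(\Omega_\alpha) := \frac{n}{k} P_n\left(V \in \frac{n}{k} C_\alpha^\varepsilon\right)$ ($\frac{n}{k}$ large, $\varepsilon$ small)

⇒ we obtain $\hat{\mathcal{M}} := \{\hat{\Phi}(\Omega_\alpha),\ \alpha\}$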


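Theorem

There is an absolute constant $C > 0$ such that for any $n > 0$, $k > 0$, $0 < \varepsilon < 1$, $\delta > 0$ such that $0 < \delta < e^{-k}$, with probability at least $1 - \delta$,

$\|\hat{\mathcal{M}} - \mathcal{M}\|_\infty \le C d \left( \sqrt{\frac{1}{\varepsilon k} \log\frac{d}{\delta}} + M d \varepsilon \right) + \mathrm{bias}(\varepsilon, k, n)$

Comments:
• $C$: depends on $M$ = sup(density on subfaces)
• Existing literature (for spectral measure): Einmahl & Segers 09, Einmahl et al. 01; $d = 2$; asymptotic behaviour, rates in $1/\sqrt{k}$.

Here: $1/\sqrt{k} \to 1/\sqrt{\varepsilon k} + \varepsilon$. Price to pay for biasing our estimator with $\varepsilon$.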
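The trade-off in ε can be made concrete by minimizing the simplified rate 1/√(εk) + ε. This is only an illustration: it drops the constants C, M, d, the log term and the bias term of the theorem, so the resulting ε is not a recommended tuning value. Setting the derivative to zero gives ε* = (4k)^(-1/3); the sketch compares this with a brute-force grid search.

```python
import math

def rate(eps, k):
    """Simplified rate 1/sqrt(eps * k) + eps (all constants dropped)."""
    return 1.0 / math.sqrt(eps * k) + eps

def best_eps_closed_form(k):
    """Minimizer of the simplified rate: eps* = (4k) ** (-1/3)."""
    return (4.0 * k) ** (-1.0 / 3.0)

k = 1000
grid = [i / 10_000 for i in range(1, 10_000)]       # eps in (0, 1)
eps_grid = min(grid, key=lambda e: rate(e, k))      # brute-force minimizer
eps_star = best_eps_closed_form(k)

print(eps_grid, eps_star)
print(rate(eps_star, k), 1.0 / math.sqrt(k))        # achieved rate vs 1/sqrt(k)
```

The achieved rate is noticeably slower than 1/√k, which is the price for the ε-thickening mentioned in the comments.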


Theorem's proof

1 Maximal deviation on VC-class:

$\sup_{x \ge \varepsilon} |\mu_n - \mu|([x, \infty[) \le C d \sqrt{\frac{2}{k} \log\frac{d}{\delta}} + \mathrm{bias}(\varepsilon, k, n)$

Tools: Vapnik-Chervonenkis inequality adapted to small probability sets: bounds in $\sqrt{p}\sqrt{\frac{1}{n} \log\frac{1}{\delta}}$

On the VC class $\{[\frac{n}{k} x, \infty],\ x \ge \varepsilon\}$

2 Decompose error:

$|\mu_n(C_\alpha^\varepsilon) - \mu(C_\alpha)| \le \underbrace{|\mu_n - \mu|(C_\alpha^\varepsilon)}_{A} + \underbrace{|\mu(C_\alpha^\varepsilon) - \mu(C_\alpha)|}_{B}$

- $A$: first step.
- $B$: density on $C_\alpha^\varepsilon$ × Lebesgue: small


Algorithm

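DAMEX in $O(dn \log n)$

Input: parameters $\varepsilon > 0$, $k = k(n)$

1 Standardize via marginal rank-transformation: $V_i := \left(1 / (1 - \hat{F}_j(X_i^j))\right)_{j=1,\ldots,d}$.

2 Assign to each $V_i$ the cone $\frac{n}{k} C_\alpha^\varepsilon$ it belongs to.

3 $\Phi_n^{\alpha,\varepsilon} := \hat{\Phi}(\Omega_\alpha) = \frac{n}{k} P_n\left(V \in \frac{n}{k} C_\alpha^\varepsilon\right)$, the estimate of the $\alpha$-mass of $\Phi$.

Output: (sparse) representation of the dependence structure

$\hat{\mathcal{M}} := (\Phi_n^{\alpha,\varepsilon})_{\alpha \subset \{1, \ldots, d\},\ \Phi_n^{\alpha,\varepsilon} > \Phi_{\min}}$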
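The three steps can be sketched in a few lines of plain Python. This is not the authors' implementation: it assumes the sup norm, follows the cone definition given earlier in the talk (coordinates compared with ε times the norm), ignores ties in the ranks, and uses hypothetical names (`rank_transform`, `damex`, `phi_min`) and made-up toy data.

```python
import random
from collections import Counter

def rank_transform(column):
    """Step 1 for one feature: V = 1 / (1 - F_hat) with F_hat = (rank - 1) / n,
    which simplifies to V = n / (n - rank + 1)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])  # O(n log n) per feature
    v = [0.0] * n
    for rank_minus_one, i in enumerate(order):
        v[i] = n / (n - rank_minus_one)
    return v

def damex(X, k, eps, phi_min=0.0):
    """Return {alpha: estimated mass} for subsets with mass above phi_min."""
    n, d = len(X), len(X[0])
    columns = [rank_transform([row[j] for row in X]) for j in range(d)]
    radius = n / k
    counts = Counter()
    for v in zip(*columns):
        norm = max(v)  # sup norm (assumption)
        if norm >= radius:  # step 2: keep the extremes, find their cone
            alpha = frozenset(j + 1 for j in range(d) if v[j] > eps * norm)
            counts[alpha] += 1
    # Step 3: (n / k) * P_n(...) = (n / k) * (count / n) = count / k.
    return {alpha: c / k for alpha, c in counts.items() if c / k > phi_min}

random.seed(0)
half = 1000
# Toy data: features 1 and 2 are large together, feature 3 is large alone.
group_a = [(z, 2.0 * z, random.random())
           for z in (random.paretovariate(1.0) for _ in range(half))]
group_b = [(random.random(), random.random(), random.paretovariate(1.0))
           for _ in range(half)]
masses = damex(group_a + group_b, k=50, eps=0.1)
print({tuple(sorted(alpha)): mass for alpha, mass in masses.items()})
```

On this toy data only two of the seven possible subsets receive mass, which is the kind of sparse output the algorithm is designed to produce.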


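Application to Anomaly Detection

After standardization of marginals: $P[R > r,\ W \in B] \simeq \frac{1}{r}\Phi(B)$

→ scoring function = $\Phi_n^\varepsilon \times 1/r$:

$s_n(x) := \frac{1}{\|T(x)\|_\infty} \sum_\alpha \Phi_n^{\alpha,\varepsilon}\, \mathbb{1}_{T(x) \in C_\alpha^\varepsilon}$

where $T : X \mapsto V$ ($V_j = \frac{1}{1 - F_j(X_j)}$)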
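Since each point belongs to exactly one thickened cone, the sum in the scoring function reduces to a single lookup. The sketch below takes a point that is already standardized (the map T is assumed to have been applied) and a representation given as a dictionary; the function name, the masses and the test points are made-up illustrations.

```python
def damex_score(v, masses, eps):
    """s_n for a standardized point v: the mass of the cone containing v,
    divided by the sup norm of v. A low score means a more abnormal point."""
    norm = max(v)
    alpha = frozenset(j for j, vj in enumerate(v, start=1) if vj > eps * norm)
    return masses.get(alpha, 0.0) / norm

# Hypothetical learned representation: features {1, 2} are large together,
# feature {3} is large alone; every other subset carries no mass.
masses = {frozenset({1, 2}): 1.0, frozenset({3}): 1.0}

typical = damex_score((100.0, 90.0, 1.0), masses, eps=0.1)   # follows {1, 2}
unusual = damex_score((100.0, 1.0, 95.0), masses, eps=0.1)   # {1, 3}: unseen
print(typical, unusual)
```

The second point is as extreme as the first, but it is large in a combination of features that carries no mass, so it gets the lowest possible score.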


Experiments


              number of samples   number of features
shuttle       85849               9
forestcover   286048              54
SA            976158              41
SF            699691              4
http          619052              3
smtp          95373               3

Table: Datasets characteristics


Figure: ROC and PR curve on SF dataset


Figure: ROC and PR curve on http dataset


Figure: ROC and PR curve on shuttle dataset


Figure: ROC and PR curve on forestcover dataset


Figure: ROC and PR curve on SA dataset


Figure: ROC and PR curve on smtp dataset
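ROC and PR curves of this kind are computed from ground-truth labels and a per-sample anomaly score. The sketch below uses four made-up labelled points rather than any of the datasets above. With scikit-learn detectors the decision function is higher for normal points, so its negative would typically serve as the anomaly score.

```python
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Made-up example: 1 marks an anomaly, higher score means more anomalous.
y_true = [0, 0, 1, 1]
anomaly_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, _ = roc_curve(y_true, anomaly_score)
precision, recall, _ = precision_recall_curve(y_true, anomaly_score)

roc_auc = roc_auc_score(y_true, anomaly_score)
avg_precision = average_precision_score(y_true, anomaly_score)
print(roc_auc, avg_precision)
```

Because anomalies are rare by assumption, the PR curve is usually the more informative of the two.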


Thank you !


Some references:

• Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey, 2009.

• J. H. J. Einmahl, J. Segers. Maximum empirical likelihood estimation of the spectral measure of an extreme-value distribution, 2009.

• J. H. J. Einmahl, Andrea Krajina, J. Segers. An M-estimator for tail dependence in arbitrary dimensions, 2012.

• N. Goix, A. Sabourin, S. Clémençon. Learning the dependence structure of rare events: a non-asymptotic study.

• L. de Haan, A. Ferreira. Extreme value theory, 2006.

• F. T. Liu, Kai Ming Ting, Zhi-Hua Zhou. Isolation forest, 2008.

• Y. Qi. Almost sure convergence of the stable tail empirical dependence function in multivariate extreme statistics, 1997.

• S. Resnick. Extreme Values, Regular Variation, Point Processes, 1987.

• S. J. Roberts. Novelty detection using extreme value statistics, Jun 1999.

• J. Segers. Max-stable models for multivariate extremes, 2012.
