anomaly/novelty detection with scikit-learn

28
Anomaly/Novelty Detection with scikit-learn Alexandre Gramfort Telecom ParisTech - CNRS LTCI [email protected] GitHub : @agramfort Twitter : @agramfort

Upload: agramfort

Post on 13-Feb-2017

1.653 views

Category:

Technology


5 download

TRANSCRIPT

Page 1: Anomaly/Novelty detection with scikit-learn

Anomaly/Novelty Detectionwith scikit-learn

�������

Alexandre GramfortTelecom ParisTech - CNRS LTCI

[email protected]

GitHub : @agramfort Twitter : @agramfort

Page 2: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

What’s the problem?

2

�������

Objective: Spot the red apple

Page 3: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

What’s the problem?

3

�������

“An outlier is an observation in a data set which appears to be inconsistent with the remainder of that set of data.”

Johnson 1992

“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”

Hawkins 1980

Outlier/Anomaly

Page 4: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

Types of AD

4

• Supervised AD

• Labels available for both normal data and anomalies

• Similar to rare class mining / imbalanced classification

• Semi-supervised AD (Novelty Detection)

• Only normal data available to train

• The algorithm learns on normal data only

• Unsupervised AD (Outlier Detection)

• no labels, training set = normal + abnormal data

• Assumption: anomalies are very rare

Page 5: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

Types of AD

5

• Supervised AD

• Labels available for both normal data and anomalies

• Similar to rare class mining / imbalanced classification

• Semi-supervised AD (Novelty Detection)

• Only normal data available to train

• The algorithm learns on normal data only

• Unsupervised AD (Outlier Detection)

• no labels, training set = normal + abnormal data

• Assumption: anomalies are very rare

Page 6: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

ML Taxonomy

6

Machine Learning

SupervisedUnsupervised

Regression Classif.

“Prediction”

�������

Clustering Dim. Red. AnomalyNovelty

Detection

Page 7: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

Applications

7

• Fraud detection

• Network intrusion

• Finance

• Insurance

• Maintenance

• Medicine (unusual symptoms)

• Measurement errors (from sensors)

Any application where looking at unusual observations is

relevant

Page 8: Anomaly/Novelty detection with scikit-learn
Page 9: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

Big picture

9

Look for samples that are in low density regions, isolated

Look for a region of the space that is small in volume but

contains most of the samples

Page 10: Anomaly/Novelty detection with scikit-learn

Density based approach (KDE, Gaussian Ellipse, GMM)

Kernel methods

Nearest neighbors

Trees / Partitioning

Page 11: Anomaly/Novelty detection with scikit-learn

Novelty detection via density estimates�������

>>> from sklearn.mixture import GaussianMixture>>> gmm = GaussianMixture(n_components=1).fit(X)>>> log_dens = gmm.score_samples(X_plot)>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)

Low density regions

Page 12: Anomaly/Novelty detection with scikit-learn

Novelty detection via density estimates�������

>>> from sklearn.mixture import GaussianMixture>>> gmm = GaussianMixture(n_components=2).fit(X)>>> log_dens = gmm.score_samples(X_plot)>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)

Page 13: Anomaly/Novelty detection with scikit-learn

Novelty detection via density estimates�������

>>> from sklearn.neighbors import KernelDensity>>> kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(X)>>> log_dens = kde.score_samples(X_plot)>>> plt.fill(X_plot[:, 0], np.exp(log_dens), fc='#ffaf00', alpha=0.7)

Page 14: Anomaly/Novelty detection with scikit-learn

Kernel methods

Nearest neighbors

Trees / Partitioning

Page 15: Anomaly/Novelty detection with scikit-learn

Our toy dataset�������

Page 16: Anomaly/Novelty detection with scikit-learn

Kernel Approach�������

http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html

>>> est = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)

Page 17: Anomaly/Novelty detection with scikit-learn

Nearest neighbors

Trees / Partitioning

Page 18: Anomaly/Novelty detection with scikit-learn

Nearest Neighbors (NN) Approach�������

https://github.com/scikit-learn/scikit-learn/pull/5279

>>> est = LocalOutlierFactor(n_neighbors=5)

Page 19: Anomaly/Novelty detection with scikit-learn

Local Outlier Factor (LOF)�������

https://en.wikipedia.org/wiki/Local_outlier_factor

Page 20: Anomaly/Novelty detection with scikit-learn

Trees / Partitioning

Page 21: Anomaly/Novelty detection with scikit-learn

Partitioning / Tree Approach�������

http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.IsolationForest.html

>>> est = IsolationForest(n_estimators=100)

Page 22: Anomaly/Novelty detection with scikit-learn

Isolation Forest�������

An anomaly can be isolated with a very shallow random tree

Page 23: Anomaly/Novelty detection with scikit-learn
Page 24: Anomaly/Novelty detection with scikit-learn

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.htmlNetwork intrusions

https://github.com/scikit-learn/scikit-learn/blob/master/benchmarks/bench_isolation_forest.py

Page 25: Anomaly/Novelty detection with scikit-learn

Alexandre Gramfort Anomaly detection with scikit-learn

Some caveats

25

• How to set model hyperparameters?

• How to evaluate performance in the unsupervised setup?

• In any AD method there is a notion of metric/similarity

between samples, e.g. Euclidian distance. Unclear how to define

it (think continuous, categorical features etc.)

Page 26: Anomaly/Novelty detection with scikit-learn

http://scikit-learn.org/stable/modules/outlier_detection.html

Page 27: Anomaly/Novelty detection with scikit-learn
Page 28: Anomaly/Novelty detection with scikit-learn

Alexandre [email protected]:

GitHub : @agramfort Twitter : @agramfort

Questions?

1 position to work on Scikit-Learn and Scipy stack available !

Thanks @ngoix & @albertthomas88 for the work