BigML Late Summer 2014 Webinar - Anomaly Detection!

BigML Inc

Today’s Webinar

• Speaker:

• Poul Petersen, CIO

• Moderator:

• Andrew Shikiar, VP Business Development

• Enter questions into chat box – we’ll answer some via text; others at the end of the session

• For direct follow-up, email us at info@bigml.com

Agenda

• What’s New

• Anomaly Detection

• Coming Soon

• Questions

Model Clusters

Spicy  Body  Nutty
5.1    3.5   1.4
5.7    2.6   3.5
6.7    2.5   5.8
…      …     …

Spicy  Body  Nutty  In Cluster 5?
5.1    3.5   1.4    TRUE
5.7    2.6   3.5    FALSE
6.7    2.5   5.8    TRUE
…      …     …      …

Use models to discover rules that describe clusters
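
The relabeling step on this slide can be sketched in a few lines of Python. This is only an illustration, not BigML's implementation: the helper name `label_cluster_membership` and the cluster ids in `assignments` are hypothetical, chosen so that the result reproduces the TRUE/FALSE column from the slide.

```python
def label_cluster_membership(rows, assignments, target_cluster):
    """Append a boolean 'in target cluster?' field to every row."""
    return [row + [cluster == target_cluster]
            for row, cluster in zip(rows, assignments)]


# Flavor scores (Spicy, Body, Nutty) taken from the slide.
rows = [[5.1, 3.5, 1.4], [5.7, 2.6, 3.5], [6.7, 2.5, 5.8]]
# Hypothetical cluster ids; only membership in cluster 5 matters here.
assignments = [5, 2, 5]

labeled = label_cluster_membership(rows, assignments, target_cluster=5)
for row in labeled:
    print(row)
```

The new boolean field can then serve as the objective of a supervised model, whose rules describe what separates cluster 5 from the rest.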

Model Clusters

• Dataset of 86 whiskies

• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile, then discover rules that distinguish the clusters from each other.

Missing Splits

Real World Data … is messy

• Define missing tokens: N/A, Null, etc.

• Filter out missing values

• Add a new feature to replace missing values

• Default numeric values in cluster

• Proportional prediction for missing input data

• Allow splits on missing values
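
Three of the options above (defining missing tokens, filtering out missing values, and adding a feature that records missingness) can be sketched in plain Python. This is a rough illustration under stated assumptions, not how BigML implements them: the token list, the helper names, and the default fill value of 0.0 are all hypothetical.

```python
# Assumed set of strings that should be read as "missing".
MISSING_TOKENS = {"", "N/A", "NA", "Null", "null"}


def parse_value(raw):
    """Return a float, or None when the raw text is a missing token."""
    text = raw.strip()
    return None if text in MISSING_TOKENS else float(text)


def filter_missing(rows):
    """Keep only the rows that have no missing values."""
    return [row for row in rows if None not in row]


def add_missing_flag(rows, field, default=0.0):
    """Fill one field with a default and append an 'was missing' feature."""
    flagged = []
    for row in rows:
        was_missing = row[field] is None
        filled = list(row)
        if was_missing:
            filled[field] = default
        flagged.append(filled + [was_missing])
    return flagged


raw_rows = [["5.1", "3.5"], ["N/A", "2.6"], ["6.7", "Null"]]
rows = [[parse_value(value) for value in raw] for raw in raw_rows]

print(filter_missing(rows))
print(add_missing_flag(rows, field=0))
```

Filtering is the simplest choice but discards data; the flag keeps every row and lets a model learn whether missingness itself is informative.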

Online Predictions

• Single predictions

• Computed in real-time using browser JS

• JS will be open sourced

• Available for models, ensembles, and clusters

Fast(er) Ensembles

Old                      New                      Savings
n * [ F + T + M + S ]    F + T + n * [ M + S ]    ( n - 1 ) * [ F + T ]

Fetch Dataset: “F” secs

Transform Dataset: “T” secs

Model Dataset: “M” secs

Store Model: “S” secs

Number of Models: “n”

Insight: if the dataset fits in memory, we can perform the fetch and transform steps once and model quickly in memory
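
The three formulas above can be checked with a short calculation. The timings used here (30 s fetch, 20 s transform, 5 s model, 1 s store, 10 models) are made-up numbers for illustration only, not measurements from BigML.

```python
def old_time(n, F, T, M, S):
    """Old approach: every model repeats fetch, transform, model, store."""
    return n * (F + T + M + S)


def new_time(n, F, T, M, S):
    """New approach: fetch and transform once, then model and store n times."""
    return F + T + n * (M + S)


def savings(n, F, T):
    """Time saved: the (n - 1) repeated fetch and transform steps."""
    return (n - 1) * (F + T)


# Hypothetical timings in seconds.
n, F, T, M, S = 10, 30, 20, 5, 1

print(old_time(n, F, T, M, S))   # 560
print(new_time(n, F, T, M, S))   # 110
print(savings(n, F, T))          # 450
```

The savings grow with both the ensemble size and the cost of fetching and transforming the dataset, and the difference between the old and new totals always equals the savings formula.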

Anomaly Detection

An unsupervised algorithm to find unusual data quickly and easily

Learning Tasks

Cluster (Unsupervised Learning)
Provide: unlabeled data
Learning Task: group data by similarity

Anomalies (Unsupervised Learning)
Provide: unlabeled data
Learning Task: rank data by dissimilarity

Trees (Supervised Learning)
Provide: labeled data
Learning Task: be able to predict label

Learning Tasks

sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …

Inputs “X”: sepal length, sepal width, petal length, petal width. “Y”: species.

Learning Task: Find function “f” such that: f(X)≈Y

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Find “k” clusters such that the data in each cluster is self similar

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.

Anomalies

Isolation Forest:

Grow a random decision tree until each instance is in its own leaf

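• “easy” to isolate: few splits needed, shallow leaf (likely an anomaly)

• “hard” to isolate: many splits needed, deep leaf (likely normal)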

Now repeat the process several times and use the average depth to compute an anomaly score from 0 (similar) to 1 (dissimilar)
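
The procedure described on this slide can be sketched in plain Python. This is a minimal illustration for numeric data, not BigML's implementation. The slide does not say how average depth is mapped to a score, so the sketch assumes the normalization from the original Isolation Forest paper, score = 2^(-average depth / c(n)). All function names here are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def expected_depth(n):
    """Normalizer c(n): average path length of an unsuccessful
    binary-search-tree lookup over n instances (assumed from the paper)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def grow(rows, members, depth, depths, rng):
    """Split `members` at random until each instance sits alone in a leaf."""
    # Only fields that still vary inside this node can separate instances.
    fields = [f for f in range(len(rows[0]))
              if len({rows[i][f] for i in members}) > 1]
    if len(members) <= 1 or not fields:  # isolated, or exact duplicates
        for i in members:
            depths[i] = depth
        return
    field = rng.choice(fields)
    values = [rows[i][field] for i in members]
    low, high = min(values), max(values)
    while True:  # redraw in the rare case the cut leaves one side empty
        cut = rng.uniform(low, high)
        left = [i for i in members if rows[i][field] < cut]
        right = [i for i in members if rows[i][field] >= cut]
        if left and right:
            break
    grow(rows, left, depth + 1, depths, rng)
    grow(rows, right, depth + 1, depths, rng)


def anomaly_scores(rows, n_trees=100, seed=0):
    """Average each instance's leaf depth over many random trees and
    map it to a score where shallower (easier to isolate) means higher."""
    n = len(rows)
    if n < 2:
        return [0.0] * n
    rng = random.Random(seed)
    total = [0.0] * n
    for _ in range(n_trees):
        depths = [0] * n
        grow(rows, list(range(n)), 0, depths, rng)
        for i in range(n):
            total[i] += depths[i]
    norm = expected_depth(n)
    return [2.0 ** (-(t / n_trees) / norm) for t in total]


# Synthetic data: 30 points around the origin plus one obvious outlier.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(30)]
rows.append([12.0, 12.0])

scores = anomaly_scores(rows)
most_anomalous = scores.index(max(scores))
print(most_anomalous, round(scores[most_anomalous], 3))
```

Growing every tree on the full dataset follows the slide's description. Practical implementations usually grow each tree on a small random subsample and cap its depth, which keeps the cost low on large datasets.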

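Workflow

[Diagram: parallel workflows for Clusters and Anomalies.]

Clusters: DATASET -> cluster -> CLUSTER; INSTANCE + CLUSTER -> centroid -> CENTROID; DATASET + CLUSTER -> batchcentroid

Anomalies: DATASET -> anomaly -> ANOMALY; INSTANCE + ANOMALY -> anomalyscore -> ANOMALYSCORE; DATASET + ANOMALY -> batchanomalyscore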

Use Cases

• Unusual instance discovery

• Intrusion Detection

• Fraud

• Identify Incorrect Data

• Remove Outliers

• Model Competence / Input Data Drift
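
As one concrete case from the list above, removing outliers amounts to dropping the instances whose anomaly score is too high. The sketch below is illustrative only: the scores are made-up values, and the 0.6 cutoff and the helper name `remove_outliers` are assumptions, not BigML defaults.

```python
def remove_outliers(rows, scores, threshold=0.6):
    """Keep only the rows whose anomaly score is below the threshold."""
    return [row for row, score in zip(rows, scores) if score < threshold]


# Iris-style rows; the last one is deliberately odd.
rows = [[5.1, 3.5, 1.4, 0.2], [5.7, 2.6, 3.5, 1.0], [50.0, 0.1, 9.9, 7.0]]
# Hypothetical anomaly scores in the 0 (similar) to 1 (dissimilar) range.
scores = [0.35, 0.41, 0.82]

cleaned = remove_outliers(rows, scores)
print(cleaned)
```

The same filter, inverted, gives the instances worth inspecting for fraud, intrusion, or incorrect data.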

Anomalies

• High dimensions - 10,000 fields

• Mixed data:

• numerical: 3.4

• categorical: red, green, blue

• date time: 2014-05-14T12:34:56

• unstructured text: “The quick brown fox…”

• Computing anomaly score for new data

• Using anomaly detectors programmatically

Coming

Coming Soon

• Config panel for anomaly detection

• Project Management

• In-memory sample server

• Dynamic scatterplots

Get Started Today!

FEEDBACK: info@bigml.com

TWITTER: @bigmlcom

RESOURCES: Join us for future webinars & hangouts
