BigML Late Summer 2014 Webinar - Anomaly Detection!

BigML Inc

Upload: bigml

Post on 28-Nov-2014

DESCRIPTION

This webinar showcases BigML's Summer Release, including a demonstration of how to quickly detect anomalous data with BigML. The Summer Release is headlined by Anomaly Detection, which can help automate a number of predictive tasks for fraud detection, security, quality control, diagnoses and more. Also included in the release (and demonstrated in the webinar) is support for model clusters, missing splits, client-side predictions and more. For more information, visit: http://wp.me/p234d6-1RX

TRANSCRIPT

Page 1: BigML Late Summer 2014 Webinar - Anomaly Detection!

BigML Inc

Page 2: BigML Late Summer 2014 Webinar - Anomaly Detection!

Today’s Webinar

• Speaker:

• Poul Petersen, CIO

• Moderator:

• Andrew Shikiar, VP Business Development

• Enter questions into chat box – we’ll answer some via text; others at the end of the session

• For direct follow-up, email us at [email protected]

Page 3: BigML Late Summer 2014 Webinar - Anomaly Detection!

Agenda

• What’s New

• Anomaly Detection

• Coming Soon

• Questions

Page 4: BigML Late Summer 2014 Webinar - Anomaly Detection!

Model Clusters

[Figure: cluster diagram with clusters numbered 1 to 7]

Spicy  Body  Nutty
5.1    3.5   1.4
5.7    2.6   3.5
6.7    2.5   5.8
…      …     …

Spicy  Body  Nutty  In Cluster 5?
5.1    3.5   1.4    TRUE
5.7    2.6   3.5    FALSE
6.7    2.5   5.8    TRUE
…      …     …      …

Use models to discover rules that describe clusters

Page 5: BigML Late Summer 2014 Webinar - Anomaly Detection!

Model Clusters

• Dataset of 86 whiskies

• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile, then discover rules that distinguish the clusters from each other.
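The idea of describing a cluster with rules can be sketched in a few lines of Python. This is only an illustration, not BigML's implementation: the flavor scores and cluster assignments below are made up, and a single-threshold rule search (`best_single_rule`, a hypothetical helper) stands in for the full decision tree model that would normally be trained on the "In Cluster 5?" column.

```python
from typing import List, Sequence, Tuple


def in_cluster_labels(assignments: Sequence[int], target: int) -> List[bool]:
    """Turn cluster assignments into a binary 'In cluster <target>?' column."""
    return [a == target for a in assignments]


def best_single_rule(rows, labels, fields) -> Tuple[str, str, float, float]:
    """Find the one-field threshold rule that best reproduces the labels.

    Returns (field, operator, threshold, accuracy).
    """
    best = ("", "", 0.0, -1.0)
    for j, name in enumerate(fields):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for op in ("<=", ">"):
                predicted = [(row[j] <= threshold) == (op == "<=") for row in rows]
                accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(rows)
                if accuracy > best[3]:
                    best = (name, op, threshold, accuracy)
    return best


# Hypothetical data: flavor scores on the 0 to 4 scale and made-up cluster ids.
fields = ["Spicy", "Body", "Nutty"]
rows = [(1, 4, 0), (0, 3, 1), (3, 1, 2), (4, 2, 1), (0, 0, 4)]
assignments = [5, 5, 2, 2, 7]

labels = in_cluster_labels(assignments, target=5)
rule = best_single_rule(rows, labels, fields)
print("In cluster 5 when %s %s %s (accuracy %.2f)" % rule)
```

A real model would combine several such splits into a tree, so that a cluster can be described by a conjunction of conditions rather than a single one.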

Page 6: BigML Late Summer 2014 Webinar - Anomaly Detection!

Missing Splits

Real World Data … is messy

• Define missing tokens: N/A, Null, etc

• Filter out missing values

• Add a new feature to replace missing values

• Default numeric values in cluster

• Proportional prediction for missing input data

• Allow splits on missing values
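Two of the options above, defining missing tokens and allowing splits on missing values, can be sketched as follows. This is a minimal illustration under stated assumptions: the token set, `parse_value`, and `split_rows` are hypothetical names, and the rule "missing values follow one chosen branch" is one plausible reading of a split on missing values rather than a description of BigML's internals.

```python
from typing import List, Optional, Tuple

# Assumed missing tokens; the slide names "N/A" and "Null" as examples.
MISSING_TOKENS = {"", "N/A", "Null"}


def parse_value(raw: str) -> Optional[float]:
    """Map a raw CSV cell to a float, or to None if it is a missing token."""
    cell = raw.strip()
    return None if cell in MISSING_TOKENS else float(cell)


def split_rows(
    values: List[Optional[float]], threshold: float, missing_goes_left: bool
) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Apply the split 'value <= threshold', routing missing values explicitly."""
    left, right = [], []
    for value in values:
        if value is None:
            (left if missing_goes_left else right).append(value)
        elif value <= threshold:
            left.append(value)
        else:
            right.append(value)
    return left, right


values = [parse_value(cell) for cell in ["3.5", "N/A", "1.0", "Null", "7.2"]]
left, right = split_rows(values, threshold=4.0, missing_goes_left=True)
print(left, right)
```

The point of routing missing values through the split, instead of filtering them out, is that "this field is absent" can itself carry predictive information.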

Page 7: BigML Late Summer 2014 Webinar - Anomaly Detection!

Online Predictions

• Single predictions

• Computed in real-time using browser JS

• JS will be open sourced

• Available for models, ensembles, and clusters

Page 8: BigML Late Summer 2014 Webinar - Anomaly Detection!

Fast(er) Ensembles

Old:     n * [ F + T + M + S ]
New:     F + T + n * [ M + S ]
Savings: ( n - 1 ) * [ F + T ]

Fetch Dataset: “F” secs
Transform Dataset: “T” secs
Model Dataset: “M” secs
Store Model: “S” secs
Number of Models: “n”

[Figure: Time versus Number of Models “n”]

Insight: if the dataset fits in memory, we can perform the fetch and transform steps once and model quickly in memory
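The timing formulas on this slide are easy to check numerically. The sketch below simply evaluates them; the function names and the sample timings are illustrative assumptions, not measurements.

```python
def old_time(n: int, f: float, t: float, m: float, s: float) -> float:
    """Every model repeats fetch, transform, model, and store."""
    return n * (f + t + m + s)


def new_time(n: int, f: float, t: float, m: float, s: float) -> float:
    """Fetch and transform once, then model and store n times in memory."""
    return f + t + n * (m + s)


def savings(n: int, f: float, t: float) -> float:
    """Time saved: the (n - 1) fetch and transform passes that are skipped."""
    return (n - 1) * (f + t)


# Hypothetical timings in seconds for a 10-model ensemble.
n, f, t, m, s = 10, 5, 3, 2, 1
print(old_time(n, f, t, m, s), new_time(n, f, t, m, s), savings(n, f, t))
```

Because the savings term grows linearly with n, the benefit is largest for big ensembles over datasets that are slow to fetch and transform.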

Page 9: BigML Late Summer 2014 Webinar - Anomaly Detection!

Anomaly Detection

An unsupervised algorithm to find unusual data quickly and easily

Page 10: BigML Late Summer 2014 Webinar - Anomaly Detection!

Learning Tasks

Cluster (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity

Anomalies (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: rank data by dissimilarity

Trees (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict label

Page 11: BigML Late Summer 2014 Webinar - Anomaly Detection!

Learning Tasks

sepal length  sepal width  petal length  petal width  species
5.1           3.5          1.4           0.2          setosa
5.7           2.6          3.5           1.0          versicolor
6.7           2.5          5.8           1.8          virginica
…             …            …             …            …

Inputs “X”: the four measurement columns. “Y”: species.

Learning Task: Find function “f” such that: f(X)≈Y

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Find “k” clusters such that the data in each cluster is self similar

sepal length  sepal width  petal length  petal width
5.1           3.5          1.4           0.2
5.7           2.6          3.5           1.0
6.7           2.5          5.8           1.8
…             …            …             …

Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.

Page 12: BigML Late Summer 2014 Webinar - Anomaly Detection!

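Anomalies

Isolation Forest:

Grow a random decision tree until each instance is in its own leaf

[Figure: random decision tree with a Depth axis; instances isolated near the root are “easy” to isolate, instances isolated deep in the tree are “hard” to isolate]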

Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
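The procedure described on this slide can be sketched in plain Python. This is a minimal illustration for small numeric datasets, not BigML's implementation. The slide does not say how average depth is mapped to a score between 0 and 1; the sketch assumes the normalization used in the original Isolation Forest formulation, score = 2^(-average depth / c(n)), where c(n) approximates the expected depth for n instances. All function names and the sample data are hypothetical.

```python
import math
import random


def grow_tree(rows, rng):
    """Split on a random field and threshold until no rows can be separated."""
    splittable = [j for j in range(len(rows[0]))
                  if min(r[j] for r in rows) < max(r[j] for r in rows)]
    if not splittable:
        return None  # leaf: a single instance, or only identical instances
    j = rng.choice(splittable)
    lo = min(r[j] for r in rows)
    hi = max(r[j] for r in rows)
    threshold = rng.uniform(lo, hi)
    left = [r for r in rows if r[j] < threshold]
    right = [r for r in rows if r[j] >= threshold]
    if not left or not right:
        return grow_tree(rows, rng)  # degenerate threshold, draw again
    return (j, threshold, grow_tree(left, rng), grow_tree(right, rng))


def depth_of(point, tree):
    """Count the splits needed to reach the leaf that isolates the point."""
    depth = 0
    while tree is not None:
        j, threshold, left, right = tree
        tree = left if point[j] < threshold else right
        depth += 1
    return depth


def expected_depth(n):
    """Assumed normalizer c(n); requires n >= 2."""
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def anomaly_scores(rows, n_trees=100, seed=0):
    """Average the depth over many random trees and map it into (0, 1]."""
    rng = random.Random(seed)
    trees = [grow_tree(rows, rng) for _ in range(n_trees)]
    norm = expected_depth(len(rows))
    scores = []
    for point in rows:
        average = sum(depth_of(point, tree) for tree in trees) / n_trees
        scores.append(2 ** (-average / norm))
    return scores


# Hypothetical data: eight similar points and one obvious outlier (last row).
data = [(0.1, 0.2), (0.3, 0.9), (0.5, 0.4), (0.7, 0.6), (0.2, 0.7),
        (0.9, 0.1), (0.6, 0.8), (0.4, 0.3), (10.0, 10.0)]
scores = anomaly_scores(data)
most_anomalous = data[scores.index(max(scores))]
print(most_anomalous)
```

The outlier is usually separated by the very first split, so its average depth is small and its score is pushed toward 1, while points inside the dense group need several splits and score lower.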

Page 13: BigML Late Summer 2014 Webinar - Anomaly Detection!

Workflow

[Figure: side-by-side workflows for Clusters and Anomalies. Resources shown: cluster, centroid, batchcentroid; anomaly, anomalyscore, batchanomalyscore. Diagram elements: CSV, DATASET, INSTANCE, CLUSTER, CENTROID, ANOMALY, ANOMALY SCORE.]

Page 14: BigML Late Summer 2014 Webinar - Anomaly Detection!

Use Cases

• Unusual instance discovery

• Intrusion Detection

• Fraud

• Identify Incorrect Data

• Remove Outliers

• Model Competence / Input Data Drift
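Two of these use cases, unusual instance discovery and outlier removal, reduce to simple operations once every instance has an anomaly score. The sketch below assumes the scores are already computed; the helper names, the sample scores, and the 0.6 cutoff are hypothetical choices for illustration only.

```python
from typing import List, Sequence, Tuple


def top_anomalies(rows: Sequence, scores: Sequence[float], k: int) -> List[Tuple]:
    """Unusual instance discovery: the k highest-scoring (row, score) pairs."""
    ranked = sorted(zip(rows, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def remove_outliers(rows: Sequence, scores: Sequence[float], threshold: float) -> List:
    """Outlier removal: keep only rows scoring below the chosen threshold."""
    return [row for row, score in zip(rows, scores) if score < threshold]


# Hypothetical instance ids with made-up anomaly scores in [0, 1].
ids = ["a", "b", "c", "d"]
example_scores = [0.42, 0.81, 0.39, 0.55]

suspects = top_anomalies(ids, example_scores, k=1)
cleaned = remove_outliers(ids, example_scores, threshold=0.6)
print(suspects, cleaned)
```

The right threshold depends on the data, so in practice it is worth inspecting the top-ranked instances before deciding where to cut.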

Page 15: BigML Late Summer 2014 Webinar - Anomaly Detection!

Anomalies

• High dimensions - 10,000 fields

• Mixed data:

• numerical: 3.4

• categorical: red, green, blue

• date time: 2014-05-14T12:34:56

• unstructured text: “The quick brown fox…”

• Computing anomaly score for new data

• Using anomaly detectors programmatically

Coming

Page 16: BigML Late Summer 2014 Webinar - Anomaly Detection!

Coming Soon

• Config panel for anomaly detection

• Project Management

• In-memory sample server

• Dynamic scatterplots

Page 17: BigML Late Summer 2014 Webinar - Anomaly Detection!

Coming Soon

Page 18: BigML Late Summer 2014 Webinar - Anomaly Detection!

FEEDBACK

@bigmlcom TWITTER

[email protected]

Get Started Today!

RESOURCES

Join us for future webinars & hangouts