BigML Late Summer 2014 Webinar - Anomaly Detection!

Post on 28-Nov-2014

DESCRIPTION

This webinar showcases BigML's Summer Release, including a demonstration of how to quickly detect anomalous data with BigML. The release is headlined by Anomaly Detection, which can help automate a number of predictive tasks for fraud detection, security, quality control, diagnosis and more. Also included in the release (and demonstrated in the webinar) is support for model clusters, missing splits, client-side predictions and more. For more information, visit: http://wp.me/p234d6-1RX

TRANSCRIPT

BigML Inc


Today’s Webinar

• Speaker: Poul Petersen, CIO

• Moderator: Andrew Shikiar, VP Business Development

• Enter questions into chat box – we’ll answer some via text; others at the end of the session

• For direct follow-up, email us at info@bigml.com


Agenda

• What’s New

• Anomaly Detection

• Coming Soon

• Questions


Model Clusters

[Figure: data points grouped into clusters numbered 1 to 7]

Spicy   Body   Nutty
5.1     3.5    1.4
5.7     2.6    3.5
6.7     2.5    5.8
…       …      …

In Cluster 5?

Spicy   Body   Nutty   In 5?
5.1     3.5    1.4     TRUE
5.7     2.6    3.5     FALSE
6.7     2.5    5.8     TRUE
…       …      …       …

Use models to discover rules that describe clusters


Model Clusters

• Dataset of 86 whiskies

• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile, then discover rules that distinguish the clusters from each other.
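
As a rough illustration of this goal, the sketch below labels each whisky by whether it falls in one chosen cluster, then searches for the single threshold rule that best matches that label. The flavor scores, the cluster assignments, and the helper names `label_cluster_membership` and `best_single_rule` are all made up for illustration. A one-rule search is a simplified stand-in for the models BigML would train, not BigML's implementation.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, float]
Rule = Tuple[str, str, float, float]  # (field, operator, threshold, accuracy)


def label_cluster_membership(assignments: List[int], target: int) -> List[bool]:
    """Turn cluster ids into a binary 'In cluster <target>?' column."""
    return [cluster_id == target for cluster_id in assignments]


def best_single_rule(rows: List[Row], labels: List[bool]) -> Optional[Rule]:
    """Find the rule 'field <= t' or 'field > t' that agrees most with the labels."""
    best: Optional[Rule] = None
    for field in rows[0]:
        values = sorted({row[field] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for op in ("<=", ">"):
                predicted = [
                    (row[field] <= threshold) if op == "<=" else (row[field] > threshold)
                    for row in rows
                ]
                accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
                if best is None or accuracy > best[3]:
                    best = (field, op, threshold, accuracy)
    return best


# Hypothetical flavor scores (0 to 4) and hypothetical cluster assignments.
whiskies = [
    {"Spicy": 0, "Body": 1, "Nutty": 2},
    {"Spicy": 1, "Body": 2, "Nutty": 1},
    {"Spicy": 3, "Body": 4, "Nutty": 1},
    {"Spicy": 4, "Body": 3, "Nutty": 2},
    {"Spicy": 1, "Body": 1, "Nutty": 4},
]
clusters = [1, 1, 5, 5, 2]

in_cluster_5 = label_cluster_membership(clusters, target=5)
print(best_single_rule(whiskies, in_cluster_5))
```

On this toy data the search settles on a rule over the "Spicy" score, which reads as a plain-language description of what sets cluster 5 apart.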


Missing Splits

Real World Data … is messy

[Figure: a split on “x?” with a “Missing:” branch]

• Define missing tokens: N/A, Null, etc

• Filter out missing values

• Add a new feature to replace missing values

• Default numeric values in cluster

• Proportional prediction for missing input data

• Allow splits on missing values
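
Two of the options above, defining missing tokens and allowing splits on missing values, can be sketched as follows. The token set and the function names `parse_numeric` and `goes_left` are assumptions chosen for illustration and do not reflect BigML's internal representation.

```python
from typing import Optional

# Hypothetical set of strings to treat as "missing" when reading raw data.
MISSING_TOKENS = {"", "N/A", "NA", "Null", "null", "None", "?"}


def parse_numeric(raw: str) -> Optional[float]:
    """Convert a raw cell to a float, or None if it is a missing token."""
    token = raw.strip()
    if token in MISSING_TOKENS:
        return None
    return float(token)


def goes_left(value: Optional[float], threshold: float, missing_goes_left: bool) -> bool:
    """Evaluate the split 'x <= threshold', routing missing values explicitly.

    Instead of dropping rows with a missing x, the split itself decides
    which branch they follow.
    """
    if value is None:
        return missing_goes_left
    return value <= threshold


cells = ["3.5", "N/A", " 1.2 ", "Null"]
values = [parse_numeric(cell) for cell in cells]
branches = [goes_left(v, threshold=2.0, missing_goes_left=True) for v in values]
print(values)
print(branches)
```

The design point is that "missing" becomes information the tree can use, rather than a reason to discard the row.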


Online Predictions

• Single predictions

• Computed in real-time using browser JS

• JS will be open sourced

• Available for models, ensembles, and clusters
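
To give a feel for what a client-side single prediction involves, here is a minimal sketch of walking a decision tree locally, with no round trip to a server. The webinar describes this being done in browser JS; Python is used here only for illustration. The tree layout, the field name, and the split value are hypothetical and are not BigML's model format.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt}

# Hypothetical tree: each child carries a (field, operator, value) predicate.
TREE = {
    "output": "setosa",
    "children": [
        {"predicate": ("petal length", "<=", 2.45), "output": "setosa"},
        {"predicate": ("petal length", ">", 2.45), "output": "versicolor"},
    ],
}


def predict(node: dict, inputs: dict):
    """Walk from the root to a leaf, following the first child whose predicate holds."""
    while node.get("children"):
        for child in node["children"]:
            field, op, value = child["predicate"]
            if OPS[op](inputs[field], value):
                node = child
                break
        else:
            # No predicate matched: stop and use the current node's output.
            break
    return node["output"]


print(predict(TREE, {"petal length": 1.4}))
print(predict(TREE, {"petal length": 3.5}))
```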


Fast(er) Ensembles

Old                      New                      Savings
n * [ F + T + M + S ]    F + T + n * [ M + S ]    ( n - 1 ) * [ F + T ]

• Fetch Dataset: “F” secs

• Transform Dataset: “T” secs

• Model Dataset: “M” secs

• Store Model: “S” secs

[Figure: Time versus Number of Models “n”]

Insight: if the dataset fits in memory, we can perform the fetch and transform steps once and model quickly in memory
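
The timing formulas from this slide can be checked with a few lines of arithmetic. The step durations below are invented numbers, used only to show that the old time minus the new time equals the stated savings.

```python
def old_time(n: int, fetch: float, transform: float, model: float, store: float) -> float:
    """Old approach: every model repeats fetch, transform, model, and store."""
    return n * (fetch + transform + model + store)


def new_time(n: int, fetch: float, transform: float, model: float, store: float) -> float:
    """New approach: fetch and transform once, then model and store n times."""
    return fetch + transform + n * (model + store)


def savings(n: int, fetch: float, transform: float) -> float:
    """Time saved: the (n - 1) fetch-and-transform passes that are skipped."""
    return (n - 1) * (fetch + transform)


# Hypothetical durations in seconds for a 10-model ensemble.
n, F, T, M, S = 10, 30.0, 20.0, 5.0, 1.0
print(old_time(n, F, T, M, S))   # 560.0
print(new_time(n, F, T, M, S))   # 110.0
print(savings(n, F, T))          # 450.0
```

With these made-up numbers the fetch and transform steps dominate, so nearly all of the original time is saved. When modeling itself is the slow step, the gain is proportionally smaller.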


Anomaly Detection

An unsupervised algorithm to find unusual data quickly and easily


Learning Tasks

• Cluster (Unsupervised Learning). Provide: unlabeled data. Learning Task: group data by similarity

• Anomalies (Unsupervised Learning). Provide: unlabeled data. Learning Task: rank data by dissimilarity

• Trees (Supervised Learning). Provide: labeled data. Learning Task: be able to predict label


Learning Tasks

sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …

Inputs “X”: sepal length, sepal width, petal length, petal width. “Y”: species.

Learning Task: Find function “f” such that: f(X)≈Y

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Assign value from 0 (similar) to 1 (dissimilar) to each instance.


Anomalies

Isolation Forest:

Grow a random decision tree until each instance is in its own leaf

[Figure: random tree with a Depth axis; instances in shallow leaves are “easy” to isolate, instances in deep leaves are “hard” to isolate]

Now repeat the process several times and use the average depth to compute the anomaly score: 0 (similar) -> 1 (dissimilar)
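
The procedure described on this slide can be sketched in plain Python. The normalization used to map average depth onto a score between 0 and 1 is taken from the original Isolation Forest paper (Liu, Ting and Zhou); the slide does not say how BigML computes its score, so treat that choice as an assumption. For simplicity every tree is grown on the full dataset, and all function names are illustrative.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def average_path_length(n: int) -> float:
    """Expected depth of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points: List[Point], rng: random.Random) -> dict:
    """Split on random fields and values until each instance is in its own leaf."""
    if len(points) <= 1:
        return {"size": len(points)}
    varying = [d for d in range(len(points[0]))
               if min(p[d] for p in points) < max(p[d] for p in points)]
    if not varying:
        return {"size": len(points)}  # identical instances cannot be separated
    field = rng.choice(varying)
    lo = min(p[field] for p in points)
    hi = max(p[field] for p in points)
    split = rng.uniform(lo, hi)
    while split <= lo:  # redraw so that both sides stay non-empty
        split = rng.uniform(lo, hi)
    return {
        "field": field,
        "split": split,
        "left": build_tree([p for p in points if p[field] < split], rng),
        "right": build_tree([p for p in points if p[field] >= split], rng),
    }


def depth_of(tree: dict, point: Point) -> float:
    """Depth at which the point ends up, adjusted for leaves holding duplicates."""
    depth = 0
    while "split" in tree:
        tree = tree["left"] if point[tree["field"]] < tree["split"] else tree["right"]
        depth += 1
    return depth + average_path_length(tree["size"])


def anomaly_scores(points: List[Point], n_trees: int = 100, seed: int = 0) -> List[float]:
    """Average depth over many random trees, mapped to a score between 0 and 1."""
    if len(points) < 2:
        raise ValueError("need at least two instances")
    rng = random.Random(seed)
    trees = [build_tree(points, rng) for _ in range(n_trees)]
    norm = average_path_length(len(points))
    scores = []
    for point in points:
        mean_depth = sum(depth_of(tree, point) for tree in trees) / n_trees
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# Toy data: a tight grid of 20 points plus one far-away instance.
data = [(x * 0.1, y * 0.1) for x in range(5) for y in range(4)] + [(10.0, 10.0)]
scores = anomaly_scores(data, n_trees=50, seed=1)
print(max(range(len(data)), key=lambda i: scores[i]))  # index of the most anomalous instance
```

The far-away instance is usually split off within the first cut or two, so its average depth is small and its score is the highest, while the tightly packed grid points need many more cuts.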

Workflow

Clusters:

• DATASET -> CLUSTER (cluster)

• CLUSTER + INSTANCE -> CENTROID (centroid)

• CLUSTER + DATASET -> CSV (batchcentroid)

Anomalies:

• DATASET -> ANOMALY (anomaly)

• ANOMALY + INSTANCE -> ANOMALY SCORE (anomalyscore)

• ANOMALY + DATASET -> CSV (batchanomalyscore)


Use Cases

• Unusual instance discovery

• Intrusion Detection

• Fraud

• Identify Incorrect Data

• Remove Outliers

• Model Competence / Input Data Drift


Anomalies

• High dimensions - 10,000 fields

• Mixed data:

• numerical: 3.4

• categorical: red, green, blue

• date time: 2014-05-14T12:34:56

• unstructured text: “The quick brown fox…”

• Computing anomaly score for new data

• Using anomaly detectors programmatically

Coming


Coming Soon

• Config panel for anomaly detection

• Project Management

• In-memory sample server

• Dynamic scatterplots


FEEDBACK

• Twitter: @bigmlcom

• info@bigml.com

RESOURCES

• Join us for future webinars & hangouts

Get Started Today!
