BigML Spring 2014 Webinar - Clustering!

BigML Inc

Upload: bigml

Post on 21-Jan-2015


DESCRIPTION

This webinar showcases BigML’s Spring Release, including our brand new Clustering Algorithm as well as new online dataset creation and transformation capabilities. During the webinar we explain clustering in general, and then walk through the following use cases to show how you can use BigML’s clusters to gain greater understanding from your data:

- Customer segmentation
- Item discovery
- Data summarization / compression
- Collaborative filtering / recommender
- Active learning

The webinar features demos that use both BigML’s acclaimed user interface and our underlying API (through an iPython notebook, which can be accessed here: http://nbviewer.ipython.org/gist/petersen-poul/f3d7bce160241f293501).

TRANSCRIPT

Page 1: BigML Spring 2014 Webinar - Clustering!

BigML Inc

Page 2: BigML Spring 2014 Webinar - Clustering!

Today’s Webinar

• Speaker:

• Poul Petersen, CIO

• Moderator:

• Andrew Shikiar, VP Business Development

• Enter questions into the chat box – we’ll answer some via text and others at the end of the session

• For direct follow-up, email us at [email protected]

Page 3: BigML Spring 2014 Webinar - Clustering!

Clustering

BigML’s first unsupervised learning offering!

Page 4: BigML Spring 2014 Webinar - Clustering!

Trees vs Clusters

Trees (Supervised Learning)

• Provide: labeled data

• Learning Task: be able to predict label

Clusters (Unsupervised Learning)

• Provide: unlabeled data

• Learning Task: group data by similarity

Page 5: BigML Spring 2014 Webinar - Clustering!

Trees vs Clusters

Trees: Inputs “X” (the four measurements) and Label “Y” (species)

sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …

Learning Task: Find function “f” such that: f(X)≈Y

Clusters: Inputs only, no label

sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …

Learning Task: Find “k” clusters such that the data in each cluster is self-similar

Page 6: BigML Spring 2014 Webinar - Clustering!

Clustering Basics

K=3 centroids
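The idea of grouping points around K=3 centroids can be sketched with plain Lloyd's k-means. This is only an illustration: the slides do not say which algorithm BigML runs, the function name `kmeans` is made up here, and the last three rows are illustrative additions in the style of the iris rows shown on the earlier slide.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

rows = [
    (5.1, 3.5, 1.4, 0.2),  # from the slide
    (5.7, 2.6, 3.5, 1.0),  # from the slide
    (6.7, 2.5, 5.8, 1.8),  # from the slide
    (4.9, 3.0, 1.4, 0.2),  # illustrative extra row
    (5.5, 2.4, 3.8, 1.1),  # illustrative extra row
    (6.3, 2.9, 5.6, 1.8),  # illustrative extra row
]

centroids, assignment = kmeans(rows, k=3)
print(assignment)  # one cluster index per row
```

Note that no species column is passed in: the grouping comes only from the four measurements.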

Page 7: BigML Spring 2014 Webinar - Clustering!

Workflow

Supervised Learning

• DATASET → MODEL

• MODEL + INSTANCE → PREDICTION

• MODEL + DATASET → BATCH PREDICTION (CSV)

Unsupervised Learning

• DATASET → CLUSTER

• CLUSTER + INSTANCE → CENTROID

• CLUSTER + DATASET → BATCH CENTROID (CSV)
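On the unsupervised side of the workflow, a centroid plays the role a prediction plays for a model: given a cluster and a new instance, return the centroid the instance belongs to, and a batch centroid does the same for a whole dataset. A plain-Python analogue, assuming Euclidean distance over numeric fields, might look like the sketch below. It does not call the BigML API; the function names and centroid values are hypothetical.

```python
import math

# Hypothetical centroids, standing in for the output of a clustering step.
centroids = {
    "Cluster 0": (5.0, 3.3, 1.4, 0.2),
    "Cluster 1": (5.6, 2.5, 3.7, 1.1),
    "Cluster 2": (6.5, 2.7, 5.7, 1.8),
}

def nearest_centroid(instance, centroids):
    """CLUSTER + INSTANCE -> CENTROID: name of the closest centroid."""
    return min(centroids, key=lambda name: math.dist(instance, centroids[name]))

def batch_centroid(dataset, centroids):
    """CLUSTER + DATASET -> BATCH CENTROID: one centroid name per row."""
    return [nearest_centroid(row, centroids) for row in dataset]

new_rows = [(5.1, 3.5, 1.4, 0.2), (6.7, 2.5, 5.8, 1.8)]
print(batch_centroid(new_rows, centroids))
```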

Page 8: BigML Spring 2014 Webinar - Clustering!

Use Cases

• Customer segmentation

• Item discovery

• Data summarization / compression

• Collaborative filtering / recommender

• Active learning

Page 9: BigML Spring 2014 Webinar - Clustering!

Item Discovery

• Dataset of 86 whiskies

• Each whisky scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
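Once each whisky has a cluster, "discovering" similar items amounts to listing its cluster-mates, optionally ordered by how close their flavor profiles are. The sketch below assumes the clustering has already been done; the whisky names, the 12 scores per whisky, and the cluster assignments are all invented for illustration and are not taken from the webinar dataset.

```python
import math

# Invented flavor profiles: 12 scores per whisky, each from 0 to 4.
profiles = {
    "Whisky A": (2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2),
    "Whisky B": (2, 2, 3, 0, 0, 2, 1, 2, 2, 2, 2, 2),
    "Whisky C": (3, 2, 2, 0, 0, 2, 1, 2, 2, 1, 2, 1),
    "Whisky D": (4, 1, 4, 4, 0, 0, 2, 0, 1, 2, 1, 0),
}

# Invented cluster assignments, standing in for the output of a clustering step.
assignment = {"Whisky A": 0, "Whisky B": 0, "Whisky C": 0, "Whisky D": 1}

def similar_items(name, profiles, assignment):
    """Cluster-mates of `name`, closest flavor profile first."""
    mates = [
        other for other in profiles
        if other != name and assignment[other] == assignment[name]
    ]
    return sorted(mates, key=lambda other: math.dist(profiles[name], profiles[other]))

print(similar_items("Whisky A", profiles, assignment))
```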

Page 10: BigML Spring 2014 Webinar - Clustering!

Customer Segments

GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.

• Dataset of mobile game users.

• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases.

• Assumption: Usage correlates with LTV
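The segmentation goal can be sketched in two steps: compute the share of high-LTV users in each cluster, then list the lower-LTV users in the clusters where that share is high. The records, the LTV threshold, and the minimum share below are hypothetical placeholders, and the cluster field is assumed to come from an earlier clustering of the usage statistics.

```python
from collections import defaultdict

# Hypothetical users, already tagged with a cluster from the usage statistics.
users = [
    {"id": "u1", "cluster": 0, "ltv": 50.0},
    {"id": "u2", "cluster": 0, "ltv": 60.0},
    {"id": "u3", "cluster": 0, "ltv": 0.0},
    {"id": "u4", "cluster": 1, "ltv": 0.0},
    {"id": "u5", "cluster": 1, "ltv": 0.0},
    {"id": "u6", "cluster": 1, "ltv": 25.0},
]

def high_ltv_share(users, threshold):
    """Fraction of users in each cluster whose LTV is at or above the threshold."""
    totals, high = defaultdict(int), defaultdict(int)
    for user in users:
        totals[user["cluster"]] += 1
        if user["ltv"] >= threshold:
            high[user["cluster"]] += 1
    return {cluster: high[cluster] / totals[cluster] for cluster in totals}

def upsell_candidates(users, threshold, min_share):
    """Lower-LTV users that sit in clusters dominated by high-LTV users."""
    share = high_ltv_share(users, threshold)
    return [
        user["id"] for user in users
        if share[user["cluster"]] >= min_share and user["ltv"] < threshold
    ]

print(high_ltv_share(users, threshold=20.0))
print(upsell_candidates(users, threshold=20.0, min_share=0.5))
```

The candidates are only as good as the slide's stated assumption that usage correlates with LTV.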

Page 11: BigML Spring 2014 Webinar - Clustering!

Active Learning

GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.

• Dataset of diagnostic measurements of 768 patients.

• We want to test each patient for diabetes and label the dataset to build a model, but the test is expensive.*

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images that need to be labeled as cat/not-cat.
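Choosing which patients to test can be sketched as stratified sampling: group patient IDs by cluster, then draw a few from each group instead of sampling the whole dataset at random. The cluster assignments below are a placeholder (four made-up clusters over 768 IDs), and the sample size per cluster and the seed are arbitrary assumptions.

```python
import random
from collections import defaultdict

def sample_per_cluster(assignment, per_cluster, seed=0):
    """Pick up to `per_cluster` patient IDs from every cluster."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    groups = defaultdict(list)
    for patient_id, cluster in assignment.items():
        groups[cluster].append(patient_id)
    picks = []
    for cluster in sorted(groups):
        members = groups[cluster]
        picks.extend(rng.sample(members, min(per_cluster, len(members))))
    return picks

# Placeholder assignment: 768 patient IDs spread over 4 made-up clusters.
assignment = {patient_id: patient_id % 4 for patient_id in range(768)}

to_test = sample_per_cluster(assignment, per_cluster=5)
print(len(to_test), "patients selected for the expensive test")
```

Only the selected patients would get the expensive test; their labels can then be used to build a model.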

Page 12: BigML Spring 2014 Webinar - Clustering!

Clustering

• High dimensions - 10,000 fields

• Mixed data:

• numerical: 3.4

• categorical: red, green, blue

• date time: 2014-05-14T12:34:56

• unstructured text: “The quick brown fox…”

• Computing cluster membership for new data

• Using clusters programmatically
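To cluster mixed data, numeric and categorical fields have to contribute to one notion of similarity. One common way to do that is a Gower-style distance: numeric differences are scaled by the field's range, and categorical fields count as 0 when equal and 1 otherwise. The sketch below is an assumption for illustration, not a description of how BigML handles mixed fields; it ignores date-time and text fields, and the field names and ranges are hypothetical.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance between two records with the same fields.

    Fields listed in `numeric_ranges` are treated as numeric and scaled by
    their (low, high) range; every other field is treated as categorical.
    """
    total = 0.0
    for field in a:
        if field in numeric_ranges:
            low, high = numeric_ranges[field]
            total += abs(a[field] - b[field]) / (high - low)
        else:
            total += 0.0 if a[field] == b[field] else 1.0
    return total / len(a)

# Hypothetical records using the example values from the slide.
first = {"size": 3.4, "color": "red"}
second = {"size": 5.4, "color": "blue"}
ranges = {"size": (0.0, 10.0)}  # assumed value range for the numeric field

print(mixed_distance(first, second, ranges))
```

The same function could be used to assign a new record to its nearest centroid, which is one way to read "computing cluster membership for new data".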

Page 13: BigML Spring 2014 Webinar - Clustering!

Get Started Today!

• TWITTER: @bigmlcom

• FEEDBACK: [email protected]

• RESOURCES: Join us for future webinars & hangouts