BigML Spring 2014 Webinar - Clustering!
DESCRIPTION
This webinar showcases BigML’s Spring Release, including our brand new Clustering Algorithm as well as new online dataset creation and transformation capabilities. During the webinar we explain clustering in general, and then walk through the following use cases to show how you can use BigML’s clusters to gain greater understanding from your data:
- Customer segmentation
- Item discovery
- Data summarization / compression
- Collaborative filtering / recommender
- Active learning
The webinar features demos that use both BigML’s acclaimed user interface and our underlying API (through an iPython notebook, which can be accessed here: http://nbviewer.ipython.org/gist/petersen-poul/f3d7bce160241f293501).
TRANSCRIPT
BigML Inc
Today’s Webinar
• Speaker:
• Poul Petersen, CIO
• Moderator:
• Andrew Shikiar, VP Business Development
• Enter questions into chat box – we’ll answer some via text; others at the end of the session
• For direct follow-up, email us at [email protected]
Clustering
BigML’s first unsupervised learning offering!
Trees vs Clusters
Trees (Supervised Learning)
• Provide: labeled data
• Learning Task: be able to predict label
Clusters (Unsupervised Learning)
• Provide: unlabeled data
• Learning Task: group data by similarity
Trees vs Clusters

Trees: Inputs “X” (the four measurements) and Label “Y” (species)
sepal length   sepal width   petal length   petal width   species
5.1            3.5           1.4            0.2           setosa
5.7            2.6           3.5            1.0           versicolor
6.7            2.5           5.8            1.8           virginica
…              …             …              …             …
Learning Task: Find function “f” such that: f(X)≈Y

Clusters: no label
sepal length   sepal width   petal length   petal width
5.1            3.5           1.4            0.2
5.7            2.6           3.5            1.0
6.7            2.5           5.8            1.8
…              …             …              …
Learning Task: Find “k” clusters such that the data in each cluster is self-similar
Clustering Basics
[Figure: clustering example with K=3; the cluster centers are labeled “centroids”]
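To make the idea of K=3 and centroids concrete, here is a minimal sketch of plain Lloyd's k-means in Python. This is an illustration only: the slides do not say which algorithm BigML uses internally, the toy points are made up, and the starting centroids are passed in by hand so the result is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain Lloyd's k-means; k is the number of initial centroids.

    Returns (centroids, assignments), where assignments[i] is the
    index of the cluster that points[i] belongs to.
    """
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # nothing moved between clusters: converged
        assignments = new_assignments
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Made-up 2D points forming three obvious groups (not from the webinar).
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (20, 1)]
toy_centroids, toy_assignments = kmeans(toy_points, [(0, 0), (10, 10), (20, 0)])
print(toy_assignments)  # [0, 0, 1, 1, 2, 2]
print(toy_centroids)    # [[0.0, 0.5], [10.0, 10.5], [20.0, 0.5]]
```

Each centroid ends up at the mean of the points assigned to it, which is what makes the data in each cluster "self-similar" in the sense of the learning task.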
Workflow

Supervised Learning:
• model: DATASET → MODEL
• prediction: MODEL + INSTANCE → PREDICTION
• batch prediction: MODEL + DATASET → CSV

Unsupervised Learning:
• cluster: DATASET → CLUSTER
• centroid: CLUSTER + INSTANCE → CENTROID
• batch centroid: CLUSTER + DATASET → CSV
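The unsupervised side of this workflow (dataset to cluster, then a centroid for a single instance or a batch centroid for a whole dataset) could be scripted roughly as follows. This is a hedged sketch: it assumes an `api` object that behaves like `bigml.api.BigML` from the BigML Python bindings, and the method names used here should be checked against the bindings' documentation. The function name `cluster_workflow` and its arguments are hypothetical.

```python
def cluster_workflow(api, csv_path, instance):
    """Run CSV -> dataset -> cluster, then centroid and batch centroid.

    `api` is assumed to expose the BigML Python bindings' methods
    (create_source, create_dataset, create_cluster, create_centroid,
    create_batch_centroid, ok). `instance` is a dict of field values.
    """
    source = api.create_source(csv_path)
    api.ok(source)                      # wait until the source is ready
    dataset = api.create_dataset(source)
    api.ok(dataset)
    cluster = api.create_cluster(dataset)
    api.ok(cluster)
    # CLUSTER + INSTANCE -> CENTROID (membership of one new row)
    centroid = api.create_centroid(cluster, instance)
    # CLUSTER + DATASET -> batch centroid (membership of every row)
    batch_centroid = api.create_batch_centroid(cluster, dataset)
    api.ok(batch_centroid)
    return centroid, batch_centroid


# Possible usage (needs the `bigml` package and account credentials):
# from bigml.api import BigML
# api = BigML()  # reads BIGML_USERNAME and BIGML_API_KEY from the environment
# centroid, batch = cluster_workflow(api, "whiskies.csv", {"field name": 2})
```

Passing `api` in as a parameter keeps the sketch independent of any one account or environment, and makes it easy to try against a stub object first.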
Use Cases
• Customer segmentation
• Item discovery
• Data summarization / compression
• Collaborative filtering / recommender
• Active learning
Item Discovery
• Dataset of 86 whiskies
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.
GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.
Customer Segments
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV (lifetime value) based on in-game purchases
• Assumption: Usage correlates with LTV
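Once each user has a cluster assignment, the goal on this slide reduces to a small amount of bookkeeping. The sketch below is one way to do it under stated assumptions: the record layout (`id`, `cluster`, `ltv`), the LTV threshold, and the minimum share are all hypothetical choices, and the users are made up.

```python
from collections import defaultdict


def high_ltv_share_by_cluster(users, ltv_threshold):
    """Map cluster id -> fraction of its users with LTV >= threshold."""
    totals = defaultdict(int)
    high = defaultdict(int)
    for user in users:
        totals[user["cluster"]] += 1
        if user["ltv"] >= ltv_threshold:
            high[user["cluster"]] += 1
    return {cluster: high[cluster] / totals[cluster] for cluster in totals}


def upsell_candidates(users, ltv_threshold, min_share):
    """Ids of low-LTV users in clusters with a high share of high-LTV users."""
    share = high_ltv_share_by_cluster(users, ltv_threshold)
    return [
        user["id"]
        for user in users
        if share[user["cluster"]] >= min_share and user["ltv"] < ltv_threshold
    ]


# Made-up users; "cluster" would come from a clustering step.
toy_users = [
    {"id": "u1", "cluster": 0, "ltv": 50},
    {"id": "u2", "cluster": 0, "ltv": 40},
    {"id": "u3", "cluster": 0, "ltv": 0},
    {"id": "u4", "cluster": 1, "ltv": 0},
    {"id": "u5", "cluster": 1, "ltv": 30},
    {"id": "u6", "cluster": 1, "ltv": 0},
]
print(upsell_candidates(toy_users, ltv_threshold=20, min_share=0.5))  # ['u3']
```

Here cluster 0 has two of three users above the threshold, so its one remaining user is flagged; cluster 1 falls below the minimum share and is skipped.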
Active Learning
GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive*
*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
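The sampling step described here, testing a few patients from each cluster rather than a random draw from the whole dataset, can be sketched as below. The function name, the `per_cluster` budget, and the toy assignments are assumptions for illustration; the cluster assignments themselves would come from a clustering step.

```python
import random
from collections import defaultdict


def sample_per_cluster(assignments, per_cluster, seed=0):
    """Pick up to `per_cluster` row indices from every cluster.

    `assignments[i]` is the cluster id of row i. Returns a dict that
    maps each cluster id to the sorted row indices chosen for labeling.
    """
    rng = random.Random(seed)
    rows_by_cluster = defaultdict(list)
    for row_index, cluster_id in enumerate(assignments):
        rows_by_cluster[cluster_id].append(row_index)
    chosen = {}
    for cluster_id, rows in rows_by_cluster.items():
        count = min(per_cluster, len(rows))  # small clusters give what they have
        chosen[cluster_id] = sorted(rng.sample(rows, count))
    return chosen


# Made-up assignments for six rows across three clusters.
to_label = sample_per_cluster([0, 0, 0, 1, 1, 2], per_cluster=2)
print(to_label)
```

Every cluster contributes at least one row, so small but distinct groups of patients are not missed the way they can be with a purely random sample.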
Clustering
• High dimensions - 10,000 fields
• Mixed data:
• numerical: 3.4
• categorical: red, green, blue
• date time: 2014-05-14T12:34:56
• unstructured text: “The quick brown fox…”
• Computing cluster membership for new data
• Using clusters programmatically
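Computing cluster membership for a new row amounts to finding its closest centroid. The sketch below shows one simple way to do that for mixed numeric and categorical fields. The distance used (squared difference for numeric fields assumed to be pre-scaled, plus 1 for each categorical mismatch) is an assumption chosen for illustration and is not claimed to be the distance BigML uses; the field names and centroids are made up.

```python
def mixed_distance(instance, centroid, numeric_fields, categorical_fields):
    """Squared numeric differences plus 1 for each categorical mismatch."""
    total = 0.0
    for name in numeric_fields:
        total += (instance[name] - centroid[name]) ** 2
    for name in categorical_fields:
        total += 0.0 if instance[name] == centroid[name] else 1.0
    return total


def assign_centroid(instance, centroids, numeric_fields, categorical_fields):
    """Name of the centroid closest to the instance."""
    return min(
        centroids,
        key=lambda name: mixed_distance(
            instance, centroids[name], numeric_fields, categorical_fields
        ),
    )


# Made-up centroids; "size" is assumed to be scaled to the range 0..1.
toy_centroids = {
    "Cluster 0": {"size": 0.2, "color": "red"},
    "Cluster 1": {"size": 0.8, "color": "blue"},
}
new_row = {"size": 0.3, "color": "blue"}
print(assign_centroid(new_row, toy_centroids, ["size"], ["color"]))  # Cluster 1
```

In this example the color mismatch outweighs the small numeric gap, so the row lands in "Cluster 1" even though its size is closer to "Cluster 0".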
Get Started Today!
FEEDBACK
TWITTER: @bigmlcom
RESOURCES
Join us for future webinars & hangouts