building data products
1
Building Data Products Josh Wills, Senior Director of Data Science
About Me
2
3
What Do Data Scientists Do?
What I Think I Do
4
What Other People Think I Do
5
What I Actually Do
6
Data Science and Data Products
7
8
Thinking About Data Products
The Best Way To Find Insights
9
Build A Team
10
Measure Everything
11
Solve the Right Problem
12
13
Building Data Products with Hadoop
Hadoop as a Platform for Data Products
14
ETL, Data Science, and Machine Learning
15
Changing the Unit of Analysis
16
Machine Learning and You
17
The Five Questions
1. When should I use it?
2. What does the input look like?
3. What does the output look like?
4. How many parameters do I have to tune?
5. Why will it fail?
18
1. Collaborative Filtering
19
Collaborative Filtering (cont.)
1. To see things that are hidden.
2. <user_id>,<item_id>,<weight>
3. <item1>,<item2>,<score>
4. The distance metric and the weight calculations.
5. If the input data is too sparse.
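A minimal sketch of the input and output shapes listed above, in plain Python rather than on Hadoop. It assumes cosine similarity as the distance metric and uses the raw weights unchanged; the slides name neither, and the function name `item_similarities` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(ratings):
    """Turn (user_id, item_id, weight) triples into (item1, item2, score) triples.

    Cosine similarity is an assumed metric, chosen only for illustration.
    """
    by_user = defaultdict(dict)  # user_id -> {item_id: weight}
    for user_id, item_id, weight in ratings:
        by_user[user_id][item_id] = weight

    dots = defaultdict(float)   # (item1, item2) -> dot product over shared users
    norms = defaultdict(float)  # item_id -> sum of squared weights
    for items in by_user.values():
        for item_id, weight in items.items():
            norms[item_id] += weight * weight
        # Only items rated by the same user ever form a pair.
        for a, b in combinations(sorted(items), 2):
            dots[(a, b)] += items[a] * items[b]

    return [
        (a, b, dot / (sqrt(norms[a]) * sqrt(norms[b])))
        for (a, b), dot in sorted(dots.items())
        if norms[a] > 0 and norms[b] > 0
    ]


# Hypothetical sample input in the <user_id>,<item_id>,<weight> shape.
ratings = [
    ("u1", "A", 1.0), ("u1", "B", 1.0),
    ("u2", "A", 1.0), ("u2", "B", 1.0),
    ("u3", "A", 1.0), ("u3", "C", 1.0),
]
similarities = item_similarities(ratings)
```

In this sample no user rated both B and C, so that pair never appears in the output. That is the sparse-input failure mode from point 5 in miniature.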
20
Collaborative Filtering on Hadoop
21
2. K-Means Clustering
22
K-Means Clustering (cont.)
1. To find anomalous events.
2. Vectors of normally distributed values.
3. Cluster centroids.
4. The choice(s) of K.
5. The points aren’t even remotely normally distributed.
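A minimal sketch of how cluster centroids might be used to flag anomalous events, in plain Python rather than on Hadoop. Two things here are assumptions, not taken from the slides: the anomaly score is the distance to the nearest centroid, and the centroids start as the first K points so the run is deterministic. The function names and sample points are hypothetical.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm over a list of equal-length numeric tuples."""
    # Naive initialization (an assumption): the first k points.
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid is the mean of its members; an empty cluster keeps
        # its previous centroid.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return min(dist(point, c) for c in centroids)


# Hypothetical sample: two tight groups of points.
points = [
    (0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0),
    (1.0, 1.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
centroids = kmeans(points, k=2)
```

The result depends on both the choice of K and the starting centroids, so a real run would typically try several of each.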
23
K-Means on Hadoop
24
3. Random Forests
25
Random Forests (cont.)
1. To classify and predict.
2. A dependent variable and many independent variables.
3. Lots and lots of little trees.
4. The number of variables to consider at each level.
5. Too many independent variables.
26
Random Forests on Hadoop
• R’s randomForest and rhadoop tools
• Map: partition the input data among the reducers
• Reduce: fit the random forests to each partition
• Re-combine the resulting trees in the client
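The partition, fit, and re-combine steps above can be sketched locally in plain Python. This is a stand-in, not the R randomForest and rhadoop setup the slide refers to: the "trees" are one-split stumps with the threshold at the feature mean, the partitions are processed in an ordinary loop rather than by Hadoop reducers, and every function name and the sample data are hypothetical.

```python
import random
from collections import Counter


def map_partition(rows, n_partitions):
    """Map: deal (features, label) rows round-robin into partitions."""
    partitions = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        partitions[i % n_partitions].append(row)
    return partitions


def fit_stump(rows, feature):
    """Fit a one-split tree on one feature, splitting at the feature's mean."""
    threshold = sum(x[feature] for x, _ in rows) / len(rows)
    left = Counter(y for x, y in rows if x[feature] <= threshold)
    right = Counter(y for x, y in rows if x[feature] > threshold)
    overall = Counter(y for _, y in rows).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else overall
    right_label = right.most_common(1)[0][0] if right else overall
    return (feature, threshold, left_label, right_label)


def reduce_fit(rows, n_trees, rng):
    """Reduce: fit a small forest of stumps on one partition."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0][0]))   # one random feature per stump
        forest.append(fit_stump(sample, feature))
    return forest


def fit_partitioned_forest(rows, n_partitions=2, trees_per_partition=5, seed=0):
    """Client: run map and reduce locally, then concatenate the trees."""
    rng = random.Random(seed)
    forest = []
    for part in map_partition(rows, n_partitions):
        if part:  # skip empty partitions
            forest.extend(reduce_fit(part, trees_per_partition, rng))
    return forest


def predict(forest, x):
    """Majority vote across all stumps in the combined forest."""
    votes = Counter(
        left if x[feature] <= threshold else right
        for feature, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Hypothetical sample: two well-separated classes with two features each.
rows = [((i * 0.1, i * 0.1), "a") for i in range(10)] + \
       [((10 + i * 0.1, 10 + i * 0.1), "b") for i in range(10)]
forest = fit_partitioned_forest(rows)
```

Because each partition's forest is fitted independently, re-combining in the client is just list concatenation followed by a vote across all trees.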
27
The Art of Model Design
28
Caution: Mind the Gap
29
The Joy of Experiments
30
31
Introduction to Data Science: Building Recommender Systems http://university.cloudera.com/
Josh Wills, Director of Data Science, Cloudera @josh_wills
Thank you!