![Page 1: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/1.jpg)
Data Science & the Data Cycle
Girl Geeks Waterloo
May 2015
Jennifer Nguyen
![Page 2: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/2.jpg)
Why do we need data?
To inform our decisions
![Page 3: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/3.jpg)
Why do we need data?
Everyday examples
• To decide what to wear in the morning
![Page 4: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/4.jpg)
Why do we need data?
Everyday examples
• To book a flight
![Page 5: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/5.jpg)
Why do we need data?
Everyday examples
• To negotiate a salary
![Page 6: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/6.jpg)
Data Cycle
User Activity
Data Collection/Storage
Data Analytics
Improved User Experience
![Page 7: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/7.jpg)
How do we create data?
![Page 8: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/8.jpg)
How do we create data?
Source: DOMO
• Every interaction on any platform is collected
  • Clicks, mouse hovers, scrolls
• User-generated data
  • Photos, videos, “likes”
  • Purchases
• Direct feedback
• Registration information
![Page 9: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/9.jpg)
Data is collected and stored
• Data is collected and stored in large databases to be analyzed
Source: DOMO
![Page 10: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/10.jpg)
What is it used for?
“Not only are we doing more with data, data is doing more with us” – Jer Thorp, DOMO
![Page 11: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/11.jpg)
Data Analytics Spectrum
Source: Gartner
Business Intelligence
![Page 12: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/12.jpg)
Business Intelligence
a.k.a. reactive analytics
• Reactive analytics answers questions related to past/current events
• E.g., “How are we doing?”, “What went well?”
• Answers are used to inform business decisions
![Page 13: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/13.jpg)
Business Intelligence
a.k.a. reactive analytics
• Measuring web traffic
  • How did the volume of traffic change since yesterday/last week/last month/last year?
  • Where are users coming from?
  • Which topics resonated with readers?
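A minimal sketch of the first two questions, assuming daily page-view counts and a list of referrer labels are already available. The numbers, the referrer names, and the helper `percent_change` are hypothetical and only illustrate the kind of calculation involved.

```python
from collections import Counter
from datetime import date

# Hypothetical daily page views; real values would come from an analytics store.
daily_views = {
    date(2015, 5, 1): 1200,
    date(2015, 5, 2): 1500,
    date(2015, 5, 3): 900,
}

def percent_change(previous: int, current: int) -> float:
    """Relative change from previous to current, as a percentage."""
    return (current - previous) / previous * 100.0

# "How did the volume of traffic change since yesterday?"
days = sorted(daily_views)
changes = {
    today: percent_change(daily_views[yesterday], daily_views[today])
    for yesterday, today in zip(days, days[1:])
}
for day, change in changes.items():
    print(f"{day}: {change:+.1f}% vs. previous day")

# "Where are users coming from?" (hypothetical referrer labels, one per visit)
referrers = ["search", "social", "search", "direct", "search"]
top_sources = Counter(referrers).most_common()
print(top_sources[0])  # ('search', 3)
```

The same comparison works for weekly, monthly, or yearly windows by aggregating the counts before calling `percent_change`.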
![Page 14: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/14.jpg)
Data Analytics Spectrum
Source: Gartner
Data Mining
![Page 15: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/15.jpg)
Data Mining
Knowledge Discovery
• Used to find patterns, trends, and insights from data
• Techniques include
  • Anomaly detection
  • Association rule learning
  • Clustering
![Page 16: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/16.jpg)
Data Science Technique: Clustering
• How to segment users?
  • K-Means Clustering
• Clustering can be used to discover communities of users
• Other applications:
  • Find similar items, movies
Source: Stanford
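As a rough illustration of k-means for user segmentation, the sketch below alternates between assigning each user to the nearest centroid and moving each centroid to the mean of its cluster. The two features (sessions per week, minutes per session), the data points, and the hand-picked starting centroids are all assumptions made for the example; practical implementations usually choose starting centroids randomly or with a scheme such as k-means++.

```python
import math

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, centroids, max_iterations=100):
    """Plain k-means; stops once the centroids no longer move."""
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to its cluster mean (keep it if the cluster is empty).
        updated = [mean_point(c) if c else old for c, old in zip(clusters, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical users described by (sessions per week, minutes per session).
users = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]
centroids, clusters = kmeans(users, centroids=[users[0], users[3]])
print(clusters)  # two segments: light users and heavy users
```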
![Page 17: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/17.jpg)
Data Analytics Spectrum
Source: Gartner
Predictive Analytics
![Page 18: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/18.jpg)
Predictive Analytics
• Predictive analytics answers questions related to future events
• E.g.:
  • How likely is this student to drop out?
  • What would this reader like to read next?
  • Will a customer churn?
![Page 19: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/19.jpg)
Data Science Technique: Classification
• How to predict customer churn?
  • Logistic Regression
  • Decision Trees
  • Random Forests
• Results can be used to “save” a customer
Source: Stanford
[Figure: decision tree with nodes “Filed complaint?” and “<2 years of tenure?”, each with Yes/No branches]
![Page 20: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/20.jpg)
What is a decision tree?
[Figure: example decision tree with nodes “Filed complaint?” and “<2 years of tenure?”, each with Yes/No branches]
Root/internal nodes: split the data based on an attribute
Leaf nodes: output the prediction
Branches: binary decisions
![Page 21: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/21.jpg)
How to build a decision tree
• Given the following data set:
Example adapted from Michael S. Lewicki, Artificial Intelligence: Learning and Decision Trees, http://www.cs.cmu.edu/afs/cs/academic/class/15381-s07/www/slides/041007decisionTrees1.pdf

| <2 years of tenure? | Filed complaint? | Churned? |
|---|---|---|
| N | N | N |
| Y | N | Y |
| N | N | N |
| N | N | N |
| N | Y | Y |
| Y | N | N |
| N | Y | N |
| N | Y | Y |
| Y | N | N |
| Y | N | N |
![Page 22: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/22.jpg)
How to build a decision tree
• Goal:
  • Split the data in a way that achieves high classification accuracy
• Requires knowing which attributes to use and in which order
• Use a “greedy” algorithm:
  • Choose the attribute that gives the best split at each level of the tree
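One way to make “best split” concrete is sketched below on the ten-row churn data set from the previous slide. The slides do not say how a split is scored, so the criterion here is an assumption: the accuracy obtained if each child node predicts its own majority class. Entropy and Gini impurity are common alternatives in practice. The names `ROWS`, `ATTRIBUTES`, and `split_accuracy` are made up for the example.

```python
from collections import Counter

# Rows of the churn data set: (<2 years of tenure?, filed complaint?, churned?)
ROWS = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"), ("Y", "N", "N"), ("Y", "N", "N"),
]
ATTRIBUTES = {"tenure_under_2y": 0, "filed_complaint": 1}  # attribute name -> column index
LABEL = 2  # column index of the "churned?" label

def split_accuracy(rows, column):
    """Accuracy if each child of the split predicts its own majority class (assumed criterion)."""
    correct = 0
    for value in ("N", "Y"):
        labels = [row[LABEL] for row in rows if row[column] == value]
        if labels:
            correct += Counter(labels).most_common(1)[0][1]
    return correct / len(rows)

# Greedy choice: score every candidate attribute and keep the best one.
scores = {name: split_accuracy(ROWS, col) for name, col in ATTRIBUTES.items()}
best = max(scores, key=scores.get)
print(scores)  # {'tenure_under_2y': 0.7, 'filed_complaint': 0.8}
print(best)    # filed_complaint
```

Under this criterion the first split is on whether the customer filed a complaint, since it classifies 8 of the 10 rows correctly versus 7 for tenure.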
![Page 23: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/23.jpg)
Recursive Algorithm
1. Start with all data
2. Find the query that gives the best split
3. Create child nodes
4. Recurse until stopping criterion:
  • Node consists mostly of one dominant class; the share of that class is the node’s “purity”
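The recursive steps can be sketched as follows, again on the ten-row churn data set. Two details are assumptions rather than something the slides specify: splits are scored by how many rows the children's majority classes get right, and recursion stops when a node is pure or when no remaining attribute separates its rows. The nested-dictionary tree format and all function names are hypothetical.

```python
from collections import Counter

# Rows of the churn data set: (<2 years of tenure?, filed complaint?, churned?)
ROWS = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"), ("Y", "N", "N"), ("Y", "N", "N"),
]
ATTRIBUTES = {"tenure_under_2y": 0, "filed_complaint": 1}
LABEL = 2

def majority(rows):
    """Return (dominant class, purity) for the rows at a node."""
    label, count = Counter(row[LABEL] for row in rows).most_common(1)[0]
    return label, count / len(rows)

def split(rows, column):
    """Partition rows by their answer (N or Y) to one attribute."""
    return {value: [r for r in rows if r[column] == value] for value in ("N", "Y")}

def score(rows, column):
    """Rows classified correctly if each child predicts its majority class (assumed criterion)."""
    return sum(max(Counter(r[LABEL] for r in child).values())
               for child in split(rows, column).values() if child)

def build(rows, attributes):
    label, purity = majority(rows)                      # step 1: look at the data in this node
    usable = {name: col for name, col in attributes.items()
              if all(split(rows, col).values())}        # attributes that still separate the rows
    if purity == 1.0 or not usable:                     # step 4: stopping criterion (assumed)
        return {"predict": label, "purity": purity}
    best = max(usable, key=lambda name: score(rows, usable[name]))  # step 2: best query
    remaining = {n: c for n, c in usable.items() if n != best}
    children = split(rows, usable[best])                # step 3: create child nodes
    return {"query": best,
            "N": build(children["N"], remaining),
            "Y": build(children["Y"], remaining)}

tree = build(ROWS, ATTRIBUTES)
print(tree)
```

On this data set the sketch splits on the complaint attribute first and then splits the no-complaint branch on tenure, giving leaves with purities 3/3, 3/4, and 2/3.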
![Page 24: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/24.jpg)
Building a Decision Tree
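[Figure: data plotted on axes “Filed complaint?” (No/Yes) and “<2 years of tenure?” (No/Yes), alongside the tree after the first split]
Filed complaint? (3 Y, 7 N)
• No: 1 Y, 6 N (purity: 6/7)
• Yes: 2 Y, 1 N (purity: 2/3)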
![Page 25: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/25.jpg)
Building a Decision Tree
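[Figure: data plotted on axes “missed payment?” (No/Yes) and “<2 years at current job?” (No/Yes), alongside the tree after the second split]
Filed complaint? (3 Y, 7 N)
• No: 1 Y, 6 N, then split on <2 years of tenure?
  • No: 0 Y, 3 N (purity: 3/3)
  • Yes: 1 Y, 3 N (purity: 3/4)
• Yes: 2 Y, 1 N (purity: 2/3)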
![Page 26: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/26.jpg)
How to use a decision tree
• Given new data samples, how to predict if the individual will churn?
• Use recursive tree traversal algorithm to find the corresponding leaf node

| <2 years of tenure? | Filed complaint? | Churned? |
|---|---|---|
| N | N | ? |
| Y | N | ? |
| N | Y | ? |
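The traversal can be sketched as below. The tree is hard-coded from the preceding slides, with each leaf predicting the majority class of the counts shown there; the nested-dictionary format and the attribute names are assumptions made for the example.

```python
# Tree from the preceding slides; each leaf predicts the majority class of its counts.
TREE = {
    "query": "filed_complaint",
    "Y": {"predict": "Y"},          # leaf with 2 Y, 1 N
    "N": {
        "query": "tenure_under_2y",
        "N": {"predict": "N"},      # leaf with 0 Y, 3 N
        "Y": {"predict": "N"},      # leaf with 1 Y, 3 N
    },
}

def predict(node, sample):
    """Follow the branch matching the sample's answer until a leaf is reached."""
    if "predict" in node:
        return node["predict"]
    answer = sample[node["query"]]
    return predict(node[answer], sample)

# The three new samples from the table, in the same order.
samples = [
    {"tenure_under_2y": "N", "filed_complaint": "N"},
    {"tenure_under_2y": "Y", "filed_complaint": "N"},
    {"tenure_under_2y": "N", "filed_complaint": "Y"},
]
predictions = [predict(TREE, s) for s in samples]
print(predictions)  # ['N', 'N', 'Y']
```

With this tree, only the customer who filed a complaint is predicted to churn.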
[Figure: the decision tree from the previous slide, with nodes “Filed complaint?” and “<2 years of tenure?”]
![Page 27: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/27.jpg)
Data Analytics Spectrum
Source: Gartner
Machine Learning
![Page 28: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/28.jpg)
Data Science Technique: Collaborative Filtering
• Used in simple recommendation engines to recommend items to users
• Items can be news articles, movies, clothing, friends
![Page 29: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/29.jpg)
Data Mining Technique: Collaborative Filtering
![Page 30: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/30.jpg)
Data Mining Technique: Collaborative Filtering
• Goal is to fill in the empty cells with a prediction
• Items with positive predictions are recommended to users
• Predictions are influenced by ratings from other users
• The more similar two users are with respect to their ratings, the more they will influence the prediction
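A minimal user-based sketch of this idea follows. It assumes ratings of +1 (liked) and -1 (disliked), cosine similarity over the items two users have both rated, and a similarity-weighted average as the prediction; none of these choices is stated on the slides, and the users, items, and ratings are invented for illustration.

```python
import math

# Hypothetical ratings: +1 = liked, -1 = disliked; a missing key is an empty cell.
ratings = {
    "ana":   {"A": 1, "B": -1, "C": 1},
    "ben":   {"A": 1, "B": -1, "D": 1},
    "chloe": {"A": -1, "B": 1, "D": -1},
    "dev":   {"A": 1, "B": -1},
}

def similarity(u, v):
    """Cosine similarity over the items both users have rated (0 if there are none)."""
    shared = ratings[u].keys() & ratings[v].keys()
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted, total = 0.0, 0.0
    for other, theirs in ratings.items():
        if other != user and item in theirs:
            s = similarity(user, other)
            weighted += s * theirs[item]   # more similar users count for more
            total += abs(s)
    return weighted / total if total else 0.0

def recommend(user):
    """Unrated items whose predicted rating is positive."""
    all_items = {item for user_ratings in ratings.values() for item in user_ratings}
    unrated = sorted(all_items - set(ratings[user]))
    return [item for item in unrated if predict(user, item) > 0]

print(recommend("dev"))    # ['C', 'D']
print(recommend("chloe"))  # []
```

Here a negative similarity flips the other user's rating, so a user with opposite tastes still contributes information; that is one design choice among several.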
![Page 31: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/31.jpg)
Data Mining Technique: Collaborative Filtering
![Page 32: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/32.jpg)
Data Mining Technique: Collaborative Filtering
![Page 33: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/33.jpg)
Data Mining Technique: Collaborative Filtering
• Disadvantages of this method:
  • Users do not understand why some recommendations are made by the engine
  • Users may not receive recommendations they like because other users have not liked them
• Advanced recommendation methods exist to address these shortcomings
![Page 34: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/34.jpg)
How do users benefit from data science?
• Users get a more personalized experience that is tailored to their interests
  • E.g., Nest Thermostat, PC Plus
• Users save time by not having to sift through the vast sea of options
  • E.g., Netflix, online retailers, LinkedIn
![Page 35: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/35.jpg)
Key Takeaways
• Data science is part of an iterative process to continually improve the user experience
User Activity
Data Collection/Storage
Data Analytics
Improved User Experience
![Page 36: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/36.jpg)
Key Takeaways
• There are many flavours of data analytics
• Which one to use depends on the questions you want answered
![Page 37: Data Science & the Data Cycle](https://reader034.vdocument.in/reader034/viewer/2022050112/626cdbe15c6a254a9a07a411/html5/thumbnails/37.jpg)
Questions?