DECODING DATA SCIENCE
Matt Fornito
Director of Analytics
OpsVision Solutions: Big Data/Cloud Consulting Firm
@MattFornito
“Big Data is the simple yet seemingly revolutionary belief that data are valuable… I believe that ‘big’ actually means important.”
-Sean Patrick Murphy
BIG DATA
➤ There is a persistent assumption that every organization has ‘big data’ and needs solutions to manage it.
➤ From a more realistic perspective, big data is defined by two constraints: (1) storage and (2) memory.
➤ Big Data is not easily stored on a single hard drive (or a single computer with multiple hard drives).
➤ Big Data requires substantial memory to process, e.g. a dataset with 100,000,000 rows and 100 variables would likely count as big data because, for the most part, it cannot be analyzed on a machine with 4-32 GB of memory.
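The memory claim above can be checked with back-of-the-envelope arithmetic. The sketch below assumes every value is stored as an 8-byte float (an assumption for illustration; the slide does not state a data type) and ignores any overhead added by the analysis tool itself.

```python
# Rough in-memory size of the dataset described on the slide.
# Assumption (not from the slides): each value is an 8-byte float (float64).
ROWS = 100_000_000
COLUMNS = 100
BYTES_PER_VALUE = 8  # assumed float64

total_bytes = ROWS * COLUMNS * BYTES_PER_VALUE
total_gb = total_bytes / 1e9  # decimal gigabytes

print(f"Approximate size: {total_gb:.0f} GB")

# Compare against the 4-32 GB range mentioned on the slide.
for ram_gb in (4, 32):
    fits = total_gb <= ram_gb
    print(f"Fits in {ram_gb} GB of memory: {fits}")
```

Under that assumption the raw values alone come to about 80 GB, which is beyond the 4-32 GB range before counting any working copies made during analysis.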
BIG DATA
➤ Hadoop & AWS
➤ Small/Mid-size Organizations: AWS cheaper than any dedicated infrastructure
➤ Large Organizations: Can afford dedicated servers or choose cloud computing for highly-scalable solutions
“As big data and statistics engage with one another, it is critical to remember that the two fields are united by one common goal: to draw reliable conclusions from available data.”
-Kaiser Fung
WHAT IS DATA SCIENCE
➤ DATA SCIENCE is the utilization of data to solve problems
➤ Bonus points for novel, interesting, necessary, and complex problems
➤ A DATA SCIENTIST is a professional who uses the scientific method to liberate and create meaning from raw data
DATA SCIENCE SKILLS SUMMARY
➤ Programming
➤ Data Cleaning
➤ Feature Engineering
➤ Statistics
➤ Machine Learning
➤ Optimization
➤ Visualizations
➤ Communication
➤ Creativity, Curiosity, & Problem Solving
DATA CLEANING/WRANGLING
Approximately 80% of time and costs are related to cleaning up data and other quality issues
➤ Invalid
➤ Missing
➤ Duplicated
➤ Corrupted
➤ Inconsistent
DataFrame example of inconsistent values: ‘CO’ vs. ‘Colorado’
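A minimal sketch of how several of these issue types might be handled in a pandas DataFrame. The records, column names, and the `STATE_CODES` lookup are hypothetical, made up for illustration; they are not from the talk.

```python
import pandas as pd

# Hypothetical records (made up for illustration) showing each issue type.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "state": ["CO", "Colorado", "Colorado", None, "co "],
        "age": [34, 41, 41, 29, -5],
    }
)

# Inconsistent: map every spelling of a state onto one canonical code.
STATE_CODES = {"co": "CO", "colorado": "CO"}  # assumed lookup table
cleaned = raw.copy()
cleaned["state"] = cleaned["state"].str.strip().str.lower().map(STATE_CODES)

# Duplicated: keep the first copy of each fully identical row.
cleaned = cleaned.drop_duplicates()

# Missing: drop rows that have no usable state value.
cleaned = cleaned.dropna(subset=["state"])

# Invalid: keep only ages in a plausible range (0 to 120 is an assumed rule).
cleaned = cleaned[cleaned["age"].between(0, 120)]

print(cleaned)
```

Whether to drop, impute, or flag a bad row is a judgment call that depends on the problem; dropping is used here only to keep the sketch short.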
“If I had only one hour to save the world, I would spend fifty-five minutes defining the problem, and only five minutes finding the solution.”
-Albert Einstein (attributed)
FEATURE ENGINEERING: PARSING & NEW FEATURES
Date of Sale   Day   Month   Year   Day of Week   Days Since Sale
03/25/2014     25    3       2014   Tuesday       782
09/22/2015     22    9       2015   Tuesday       236
04/05/2016     5     4       2016   Tuesday       40
05/12/2016     12    5       2016   Thursday      3
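The parsing step can be sketched with Python's standard library. The slide does not state the "as of" date used for Days Since Sale; the sketch assumes 2016-05-15, which is consistent with the slide's Days Since Sale values. The function name `engineer_date_features` is hypothetical.

```python
from datetime import date, datetime

# Sale dates from the slide, in MM/DD/YYYY format.
sale_dates = ["03/25/2014", "09/22/2015", "04/05/2016", "05/12/2016"]

# Assumption: the slide does not state the "as of" date; 2016-05-15 is
# inferred because it is three days after the most recent sale.
REFERENCE_DATE = date(2016, 5, 15)


def engineer_date_features(text, reference=REFERENCE_DATE):
    """Split one date string into the derived features shown on the slide."""
    parsed = datetime.strptime(text, "%m/%d/%Y").date()
    return {
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
        "day_of_week": parsed.strftime("%A"),
        "days_since_sale": (reference - parsed).days,
    }


features = [engineer_date_features(d) for d in sale_dates]
for row in features:
    print(row)
```

One raw column becomes five model-ready features, each of which can capture a different pattern (seasonality by month, weekday effects, recency).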
STATISTICS
➤ Summary Statistics
➤ Probability/Combinatorics
➤ Distributions (e.g. Binomial, Uniform, Poisson, etc.)
➤ Linear Algebra
➤ Hypothesis Testing
➤ Calculus
➤ Graph Theory
➤ Bayesian Analysis
MACHINE LEARNING
➤ Machine Learning is the process of letting ‘machines’ do the heavy lifting
➤ More formally: it is the field of study that gives computers the ability to learn without being explicitly programmed.
➤ Two Paths: Supervised Learning and Unsupervised Learning
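The difference between the two paths can be illustrated with a small sketch in plain Python. The data are hypothetical and the algorithms (single-feature least squares and a one-dimensional k-means with k = 2) are chosen only as simple representatives of each path, not as methods named in the talk.

```python
# Supervised learning: labeled examples (x, y) are used to fit a model.
# Hypothetical data, made up for illustration: y is exactly 2*x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
# Ordinary least squares slope and intercept for a single feature.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x
print(f"supervised fit: y = {slope:.1f} * x + {intercept:.1f}")

# Unsupervised learning: no labels, the algorithm looks for structure.
# A tiny one-dimensional k-means with k = 2 on hypothetical values.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers = [min(points), max(points)]  # simple starting guess
for _ in range(10):
    clusters = [[], []]
    for p in points:
        # Assign each point to whichever center is closer.
        nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        clusters[nearest].append(p)
    # Move each center to the mean of its assigned points.
    centers = [sum(c) / len(c) for c in clusters]
print(f"unsupervised cluster centers: {centers}")
```

In the first half the known `ys` guide the fit; in the second half there is nothing to predict, and the grouping emerges from the values alone.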
DEEP LEARNING
➤ Deep Learning is a branch of Machine Learning, usually more advanced, that uses multiple processing layers, each composed of multiple data transformations.
➤ It is often applied to image, audio, video, and text data.
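The idea of stacked processing layers can be sketched in a few lines of plain Python. The weights below are hypothetical fixed numbers chosen for illustration; a real network learns them from data and has far more units and layers.

```python
# Each layer applies a linear transformation followed by a nonlinearity.
# Weights are hypothetical fixed numbers; a real network learns them.


def relu(values):
    """Nonlinearity: replace negative values with zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One layer: weighted sum of the inputs plus a bias, per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


x = [1.0, 2.0]  # raw input features

# Layer 1: two hidden units, followed by ReLU.
hidden = relu(dense(x, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]))

# Layer 2: one output unit that consumes the hidden representation.
output = dense(hidden, [[2.0, 1.0]], [0.5])

print(hidden, output)
```

Each layer consumes the previous layer's output, so the transformations compose into progressively more abstract representations of the input.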
STATISTICS & MACHINE LEARNING
Parsimony
[Figure labels: Linear Regression, Recurrent Neural Network]
Predictive Power vs. Interpretability
VISUALIZATIONS
Tell a story…
TOP DOWN COGNITIVE FRAMEWORK
➤ Holistic approach to problem solving
➤ Parse the whole into meaningful chunks
➤ Solve piece-by-piece
➤ Roll back up
REQUIRED SKILLS
➤ Strong statistics/probability/distributions/etc. background
➤ [Ideal] experience with Machine Learning
➤ Python and/or R
➤ SQL
➤ [Ideal] AWS and/or Hadoop
➤ Problem Solving skills & Asking the right questions
➤ Capable of explaining what was done and why at all levels
KEEPING AN EYE ON RECRUITER BEHAVIOR
➤ Using eye-tracking software, researchers found recruiters spend only 6 seconds reviewing a resume.
➤ 80% of time is spent looking at Education, Current/Previous Company & Current/Previous Title
➤ Takeaway: Getting a job at a renowned company OR with a data scientist title opens up a lot of doors.
DATA SCIENCE
[Figure labels: IT, Stat Software, Exploration, Visualizations, Cleaning, Modeling, Automation, Visualizations, Communication]
KEY FEATURES WHEN HIRING
➤ Cultural Fit
➤ Math/Statistics/Machine Learning knowledge
➤ Programming skills (HackerRank challenges/take-home assessments)
➤ One-day on-site/day-in-the-life
➤ Continuous Learning Assessment (e.g. What do you enjoy about Data Science?)
➤ Problem Solving (situational interview questions or past performance assessment)
“The impact of a data science team is dependent upon its ability to influence the adoption of its recommendations.”
-Elena Grewal & Riley Newman
THANK YOU
Matt Fornito
Director of Analytics
OpsVision Solutions: Big Data/Cloud Consulting Firm
@MattFornito BigDataUnicorn.com