decoding data science

DECODING DATA SCIENCE Matt Fornito Director of Analytics OpsVision Solutions: Big Data/Cloud Consulting Firm @MattFornito

Upload: matt-fornito

Post on 16-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

DECODING DATA SCIENCE

Matt Fornito

Director of Analytics OpsVision Solutions: Big Data/Cloud Consulting Firm

@MattFornito

BIG DATA

“Big Data is the simple yet seemingly revolutionary belief that data are valuable… I believe that ‘big’ actually means important.”

-Sean Patrick Murphy

BIG DATA

➤ There is a persistent assumption that every organization has ‘big data’ and needs solutions to manage it

➤ More realistically, a big data need comes down to two constraints: (1) storage and (2) memory.

➤ Big Data is not easily stored on a single hard drive (or a single computer with multiple hard drives).

➤ Big Data also exceeds the memory available for analysis. For example, a dataset of 100,000,000 rows and 100 variables is likely a big data need, because data science tools generally cannot process it with 4-32 GB of memory.
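The memory point can be checked with back-of-the-envelope arithmetic. The sketch below assumes a dense table of 8-byte numeric values with no overhead; the function name `dense_size_gb` and the byte width are illustrative assumptions, not figures from the talk.

```python
def dense_size_gb(rows, columns, bytes_per_value=8):
    """Approximate size in GB (10**9 bytes) of a dense numeric table.

    Assumes every cell is stored as one fixed-width value (8 bytes by default)
    and ignores index, string, and object overhead.
    """
    return rows * columns * bytes_per_value / 1e9


# The slide's example: 100,000,000 rows by 100 variables.
size = dense_size_gb(100_000_000, 100)

# Compare against a few memory sizes in the 4-32 GB range mentioned on the slide.
fits = {ram_gb: size <= ram_gb for ram_gb in (4, 16, 32)}

print(f"Estimated size: {size:.0f} GB")
for ram_gb, ok in fits.items():
    print(f"Fits in {ram_gb} GB of memory: {ok}")
```

Under these assumptions the table needs roughly 80 GB, which is more than any machine in the 4-32 GB range can hold at once.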

BIG DATA

➤ Hadoop & AWS

    ➤ Small/Mid-size Organizations: AWS cheaper than any dedicated infrastructure

    ➤ Large Organizations: Can afford dedicated servers or choose cloud computing for highly-scalable solutions

“As big data and statistics engage with one another, it is critical to remember that the two fields are united by one common goal: to draw reliable conclusions from available data.”

-Kaiser Fung

DIVE INTO DATA SCIENCE

“A data scientist is a person who is better at blah blah blah”

-Josh Wills

WHAT IS DATA SCIENCE

➤ DATA SCIENCE is the utilization of data to solve problems

    ➤ Bonus points for novel, interesting, necessary, and complex problems

➤ A DATA SCIENTIST is a professional who uses the scientific method to liberate and create meaning from raw data

DATA SCIENCE MARKET

Data Scientist is the #1 job of 2016 according to both Forbes and Glassdoor

DATA SCIENCE IS EASY! FALSE

T-MODEL TO SUCCESS

[Figure: T-shaped diagram. Horizontal bar: Breadth of Knowledge. Vertical bar: Depth of Expertise.]

DATA SCIENCE SKILLS SUMMARY

➤ Programming

➤ Data Cleaning

➤ Feature Engineering

➤ Statistics

➤ Machine Learning

➤ Optimization

➤ Visualizations

➤ Communication

➤ Creativity, Curiosity, & Problem Solving

PROGRAMMING

TOO MANY OPTIONS?

R VS. PYTHON

R: lm(y ~ x1 + x2 + x3, data=mydata)

Python (scikit-learn): linear_model.LinearRegression()

DATA CLEANING

DATA CLEANING/WRANGLING

Approximately 80% of time and costs are related to cleaning up data and other quality issues

➤ Invalid

➤ Missing

➤ Duplicated

➤ Corrupted

➤ Inconsistent

DataFrame example (inconsistent values): ‘CO’ vs. ‘Colorado’
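A minimal sketch of how the missing, duplicated, and inconsistent cases might be handled, using plain Python dictionaries rather than a DataFrame. The field names (`customer`, `state`), the `STATE_NAMES` lookup, and the choice to drop rows with a missing state are all assumptions made for illustration.

```python
# Hypothetical lookup for inconsistent state labels; extend as needed.
STATE_NAMES = {"CO": "Colorado"}


def clean(records):
    """Return de-duplicated records with consistent state names.

    Rows with a missing state are dropped here; imputing or flagging them
    would be an equally valid choice depending on the problem.
    """
    seen = set()
    cleaned = []
    for rec in records:
        state = (rec.get("state") or "").strip()
        if not state:  # missing value
            continue
        # Map abbreviations to full names; otherwise normalize capitalization.
        state = STATE_NAMES.get(state.upper(), state.title())
        key = (rec["customer"], state)
        if key in seen:  # duplicate once labels are consistent
            continue
        seen.add(key)
        cleaned.append({"customer": rec["customer"], "state": state})
    return cleaned


raw = [
    {"customer": "A", "state": "CO"},
    {"customer": "A", "state": "Colorado"},    # same row, inconsistent label
    {"customer": "B", "state": " colorado "},  # stray whitespace and casing
    {"customer": "C", "state": None},          # missing
]
cleaned = clean(raw)
print(cleaned)
```

Note that the second row only shows up as a duplicate after ‘CO’ and ‘Colorado’ are mapped to the same label, which is why consistency fixes usually come before de-duplication.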

“If I had only one hour to save the world, I would spend fifty-five minutes defining the problem, and only five minutes finding the solution.”

-Albert Einstein (attributed)

FEATURE ENGINEERING

FEATURE ENGINEERING: TRANSFORMATIONS

FEATURE ENGINEERING: PARSING & NEW FEATURES

Date of Sale   Day   Month   Year   Day of Week   Days Since Sale
03/25/2014     25    3       2014   Tuesday       782
09/22/2015     22    9       2015   Tuesday       236
04/05/2016     5     4       2016   Tuesday       40
05/12/2016     12    5       2016   Thursday      3
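The parsing step can be sketched with the standard library. The reference date of 2016-05-15 is an assumption: it is the date that reproduces the slide's "Days Since Sale" values. The function name `date_features` is likewise illustrative. Weekdays are computed from the calendar rather than copied from the slide.

```python
from datetime import date, datetime

# Assumed reference date, inferred from the slide's "Days Since Sale" column.
AS_OF = date(2016, 5, 15)

# Explicit names avoid locale-dependent formatting; index 0 is Monday,
# matching date.weekday().
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def date_features(text, as_of=AS_OF):
    """Split an MM/DD/YYYY string into calendar parts plus days elapsed."""
    d = datetime.strptime(text, "%m/%d/%Y").date()
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "day_of_week": WEEKDAYS[d.weekday()],
        "days_since_sale": (as_of - d).days,
    }


sales = ["03/25/2014", "09/22/2015", "04/05/2016", "05/12/2016"]
features = [date_features(s) for s in sales]

for sale, row in zip(sales, features):
    print(sale, row)
```

One raw column becomes five candidate features, each of which a model can use directly, whereas the original date string cannot be used as-is.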

STATISTICS

STATISTICS

➤ Summary Statistics

➤ Probability/Combinatorics

➤ Distributions (e.g. Binomial, Uniform, Poisson, etc.)

➤ Linear Algebra

➤ Hypothesis Testing

➤ Calculus

➤ Graph Theory

➤ Bayesian Analysis

MACHINE LEARNING

MACHINE LEARNING

➤ Machine Learning is the process of letting ‘machines’ do the heavy lifting

➤ More Formally: it’s defined as the field of study that gives computers the ability to learn without being explicitly programmed.

➤ Two Paths:

    ➤ Supervised Learning

    ➤ Unsupervised Learning
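The two paths can be contrasted with a toy sketch. In the supervised case the algorithm is given known answers (`ys`) to learn from; in the unsupervised case it only sees the points and must find structure on its own. Both functions (`fit_line`, `two_means`) and the data are made up for illustration, not taken from the talk, and real projects would use a library instead.

```python
from statistics import mean


def fit_line(xs, ys):
    """Supervised: learn slope and intercept from (x, y) pairs.

    The y values are the known answers (labels) the model learns from.
    """
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def two_means(points, iterations=10):
    """Unsupervised: split 1-D points into two clusters, with no labels given.

    Assumes at least two distinct values so neither cluster is ever empty.
    """
    lo, hi = min(points), max(points)  # initial cluster centers
    for _ in range(iterations):
        a = [p for p in points if abs(p - lo) <= abs(p - hi)]
        b = [p for p in points if abs(p - lo) > abs(p - hi)]
        lo, hi = mean(a), mean(b)      # move centers to the cluster means
    return sorted(a), sorted(b)


# Supervised: labeled examples generated from y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Unsupervised: the same kind of numbers, but no labels at all.
groups = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])

print(slope, intercept)
print(groups)
```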

DEEP LEARNING

➤ Deep Learning is a branch of Machine Learning, usually more advanced, that uses multiple processing layers, each composed of multiple data transformations.

➤ It is often applied to images, audio, video, and text data.

STATISTICS & MACHINE LEARNING

[Figure: spectrum of models from Linear Regression to Recurrent Neural Network, labeled Parsimony and Predictive Power vs. Interpretability.]

OPTIMIZATION

VISUALIZATIONS

COMMUNICATION & STORY TELLING

CREATIVITY, CURIOSITY, & PROBLEM SOLVING

“How do I do X in R/Python?”

-Everyone

TOP DOWN COGNITIVE FRAMEWORK

➤ Take a holistic approach to problem solving

➤ Parse the whole problem into meaningful chunks

➤ Solve piece-by-piece

➤ Roll back up

BREAK INTO THE FIELD

REQUIRED SKILLS

➤ Strong statistics/probability/distributions/etc. background

➤ [Ideal] experience with Machine Learning

➤ Python and/or R

➤ SQL

➤ [Ideal] AWS and/or Hadoop

➤ Problem Solving skills & Asking the right questions

➤ Able to explain what was done, and why, to audiences at all levels

PROGRAMMING SCHOOLS

ONLINE COURSES

DATA SCIENCE BOOTCAMPS

OPEN SOURCE DATA SCIENCE MASTERS

METACADEMY

MEASURING SUCCESS

KEEPING AN EYE ON RECRUITER BEHAVIOR

➤ Using eye-tracking software, researchers found recruiters spend only 6 seconds reviewing a resume.

➤ 80% of time is spent looking at Education, Current/Previous Company & Current/Previous Title

➤ Takeaway: Getting a job at a renowned company OR one with a data scientist title opens up a lot of doors.

JOB ROLES

DATA ARCHITECT

DATA ENGINEER

DATA SCIENTIST

ARCHITECTING & FLOW MODELS

FLOW MODEL

DATA ARCHITECTURE → DATA ENGINEERING → DATA SCIENCE → AUTOMATION

ARCHITECTING & ENGINEERING

Ingestion, Warehousing/Storage, Cleaning & Optimization

DATA SCIENCE, IT

Stat Software, Exploration, Visualizations, Cleaning, Modeling, Automation, Visualizations, Communication

BUILDING A TEAM

CROSS-INDUSTRY STANDARD PROCESS FOR DATA MINING (CRISP-DM)

KEY FEATURES WHEN HIRING

➤ Cultural Fit

➤ Math/Statistics/Machine Learning knowledge

➤ Programming skills (HackerRank challenges/take-home assessments)

➤ One-day on-site/day-in-the-life

➤ Continuous Learning Assessment (e.g. What do you enjoy about Data Science?)

➤ Problem Solving (situational interview questions or past performance assessment)

“The impact of a data science team is dependent upon its ability to influence the adoption of its recommendations.”

-Elena Grewal & Riley Newman

FINDING THE UNICORN

ALTERNATIVE APPROACH

THANK YOU

Matt Fornito

Director of Analytics OpsVision Solutions: Big Data/Cloud Consulting Firm

@MattFornito BigDataUnicorn.com