your data scientist hates you

Post on 16-Apr-2017


Your Data Scientist Hates You

Bradford Stephens

bradford@roboticprofit.com

ft. help from Nick Kypreos nick@roboticprofit.com

About Us

Data Infrastructure and Data Science at scale.

fact

71%¹ of Data Science projects fail

fact

63%² of Data Scientists quit in < 2 years

Why

• Data Scientists aren’t involved from the beginning

• No strategy

• Bad Data: more common than you think and untestable

• Pointlessness

Involvement

• Everything stems from this

• Goals need to be attainable

• Data needs to be accessible and formatted correctly

• You can’t conceive of what’s possible (or impossible)

Your Data Strategy

Your Data Strategy: Diagnostic

• Diagnostic: How did we get here?

• Understanding history and how your org drives decisions is key

• What will your org’s immune system allow?

• Infrastructure: what is currently in place and how did it happen?

• Goals: How do we drive revenue or KPIs?

Your Data Strategy: Roadmapping

• Roadmapping: What are we going to build?

• Data Architecture?

• Platform feasible?

• Who builds what when, for how much?

• How do we ensure a low-latency feedback loop? DS is highly iterative

Your Data Strategy: Development

• Platform: What’s our stack?

• Storage: Where does data come from and go to, and what are the latency/throughput requirements on storage?

• Processing: Where do we transform data? Batch? Real-time? Bounds?

• Collaboration: How do we share results, data, and APIs across the org? (always forgotten)

Bad Data

Data Science is Untestable

Data Science = Math + QA + CS + PM + Psionics

Untestable

• Data Scientists spend vast amounts of time fixing data

• …and you need to be OK with that

• Unit Testing doesn’t make sense in science

• Distribution fitting, etc.

• Can only test via simulation: a whole ‘nother process

• “Simple” things take weeks to verify
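The "test via simulation" point can be made concrete with a small sketch. It is only an illustration, not the presenters' method: it generates data from a distribution whose parameter is known, refits it, and reports how often the fit recovers the truth. The function names, the choice of an exponential distribution, and the 5% tolerance are all assumptions made for the example.

```python
import random
import statistics


def fit_exponential_rate(samples):
    """Maximum-likelihood estimate of an exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(samples)


def simulation_check(true_rate, n_samples=5000, n_trials=100,
                     tolerance=0.05, seed=0):
    """Simulate data with a known rate, refit it, and return the fraction
    of trials whose estimate is within `tolerance` (relative) of the truth."""
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    hits = 0
    for _ in range(n_trials):
        samples = [rng.expovariate(true_rate) for _ in range(n_samples)]
        estimate = fit_exponential_rate(samples)
        if abs(estimate - true_rate) / true_rate <= tolerance:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    # A recovery rate far below 1.0 would suggest the fitting code is wrong.
    print(simulation_check(true_rate=2.0))
```

Unlike a unit test with one hard-coded answer, this kind of check is statistical: it passes when the fit recovers the known parameter often enough, which is why it is a separate, slower process.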

Instrumentation

• Can you even verify your instrumentation?

• Are you collecting everything?

• Collecting the right thing?

• What if it works only 85% of the time?

• What if you systematically drop events at high enough traffic?

• What if someone comes into the site through a different channel from an acquisition 2 years ago?
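One hedged way to start verifying instrumentation is to compare what the tracking pipeline recorded against an independent count, such as server request logs, bucket by bucket. The sketch below assumes both counts are already available as dictionaries keyed by time bucket. The function names, the 95% threshold, and the sample numbers are hypothetical.

```python
def capture_rates(server_counts, tracked_counts):
    """Per-bucket ratio of tracked events to server-side requests."""
    rates = {}
    for bucket, served in server_counts.items():
        if served == 0:
            continue  # no traffic, nothing to compare
        rates[bucket] = tracked_counts.get(bucket, 0) / served
    return rates


def flag_drops(server_counts, tracked_counts, min_rate=0.95):
    """Return the buckets where tracking captured less than `min_rate`."""
    rates = capture_rates(server_counts, tracked_counts)
    return sorted(b for b, r in rates.items() if r < min_rate)


if __name__ == "__main__":
    # Hypothetical hourly counts: tracking degrades during the traffic peak.
    server = {"09:00": 1000, "12:00": 5000, "15:00": 1200}
    tracked = {"09:00": 990, "12:00": 4250, "15:00": 1190}
    print(flag_drops(server, tracked))
```

If the flagged buckets line up with the highest-traffic hours, that points to systematic drops under load rather than random loss.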

Software is Garbage

• Remember Hadoop?

• Spark?

• MLlib bugs for years

• Wrong math won’t fail unit tests

• GIGO

• JSON, weekly microversioning, schema entropy…

• This is why DS efforts are so slow to start w/o initial involvement

• Don’t build the One True Data Platform

• One of our customers had 30 DBs, including a critical out-of-license DB2 box
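Schema entropy in JSON feeds can at least be measured before anyone builds on top of them. The following is a minimal sketch, assuming newline-delimited JSON objects. It profiles which types each top-level field takes and how often the field appears. The function names and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def schema_profile(json_lines):
    """For each top-level field, collect the value types seen and the
    fraction of records in which the field is present."""
    types = defaultdict(set)
    present = defaultdict(int)
    total = 0
    for line in json_lines:
        record = json.loads(line)
        total += 1
        for key, value in record.items():
            types[key].add(type(value).__name__)
            present[key] += 1
    return {
        key: {"types": sorted(types[key]), "coverage": present[key] / total}
        for key in types
    }


def drifting_fields(profile):
    """Fields that change type between records or are sometimes missing."""
    return sorted(
        key for key, info in profile.items()
        if len(info["types"]) > 1 or info["coverage"] < 1.0
    )


if __name__ == "__main__":
    lines = [
        '{"user_id": 1, "ts": "2017-04-16"}',
        '{"user_id": "1", "ts": "2017-04-16", "channel": "email"}',
    ]
    print(drifting_fields(schema_profile(lines)))
```

A report like this will not fix the data, but it shows how much cleanup stands between the raw feed and any model.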

Pointlessness

Dashboards are not a strategy!

“Here’s some data, just tell us what’s interesting…”

“We didn’t think that was interesting, you’re bad at your job.”

Data Must be Treated like a Product

• Build a Data Products Team

• Engineers, PMs, Design. Data Science. Not just analysts.

• KPIs, Goals, Measurability, Backlogs

• Budget

• Freedom to Innovate

• Staff of diverse backgrounds

A Data Platform will touch every part of your org
