r meetup talk scaling data science with dgit

Scaling Data Science with dgit Dr. Venkata Pingali Founder, Scribble Data [email protected] https://github.com/pingali

Upload: venkata-pingali

Post on 11-Feb-2017


Page 1: R meetup talk   scaling data science with dgit

Scaling Data Science with dgit

Dr. Venkata Pingali
Founder, Scribble Data
[email protected]

https://github.com/pingali

Page 2: R meetup talk   scaling data science with dgit

Summary

1. Scaling impact of data science requires increasing trust and efficiency
   a. Trust requires auditability and reproducibility of results
   b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
   a. Python package, open source, MIT licence
   b. Familiar git interface with modifications
4. Call to collaborate

Page 3: R meetup talk   scaling data science with dgit

dgit - 1 min summary

Page 4: R meetup talk   scaling data science with dgit

dgit - git wrapper for datasets

1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
   a. Metadata generation and management
   b. Automatic scanning of working directory for changes
   c. Automatic validation and materialization
   d. Dependency tracking across repos
   e. Automatic audit trails with execution
   f. Pipeline support
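The "automatic scanning of working directory for changes" item can be illustrated with a minimal sketch. This is not dgit's actual implementation or API; the function names `snapshot` and `diff_snapshots` are hypothetical, and the approach (comparing SHA-256 content hashes between two scans) is only one plausible way to detect changed dataset files.

```python
import hashlib
import os


def snapshot(root):
    """Map each file's path (relative to root) to the SHA-256 of its content."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def diff_snapshots(old, new):
    """Classify paths as added, removed, or modified between two snapshots."""
    common = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in common if old[p] != new[p]),
    }
```

Taking a snapshot before and after an analysis step and diffing the two gives the list of files a tool would need to stage or re-validate. Hashing content rather than comparing timestamps means a file that was rewritten with identical bytes is not reported as modified.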

Page 5: R meetup talk   scaling data science with dgit

Growing Pains in Data Science

Page 6: R meetup talk   scaling data science with dgit

Anonymized Random Slide from an Actual Presentation

Implication: Large wasted spend, poor production design, baseline worsening

Page 7: R meetup talk   scaling data science with dgit

Decision-maker Questions

1. Where did the numbers come from? (Correctness, Lineage)
   a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
   a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
   a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
   a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
   a. What if scenarios, field experiments

Page 8: R meetup talk   scaling data science with dgit

Conceptual Process

[Process diagram. Roles: Biz, Analytics Team, Data Engg. Flow labels: Qtns, Context; Data Req; Datasets; Model Results; Story Telling]

All three roles could be in a single team!

Page 9: R meetup talk   scaling data science with dgit

Business Complexity is Discovered Over Time

Incomplete context (history, semantics)
Qtns not thought through
Continuous revisions

[Process diagram repeated from Page 8]

Page 10: R meetup talk   scaling data science with dgit

Imperfect Data Queries due to Limited Understanding

Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)

[Process diagram repeated from Page 8]

Page 11: R meetup talk   scaling data science with dgit

Weak process

Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions

[Process diagram repeated from Page 8]

Page 12: R meetup talk   scaling data science with dgit

Eagerness to Present Great Narratives

Wrong input dataset
Mistakes in pipeline
Excel/adhoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology

[Process diagram repeated from Page 8]

Page 13: R meetup talk   scaling data science with dgit

Process in Reality

[Process diagram repeated from Page 8]

Iterative
Expensive
Laborious

Page 14: R meetup talk   scaling data science with dgit

Actual Process

[Process diagram repeated from Page 8]

Iterative
Expensive
Laborious

http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/

"80% of ..companies strategic decision go haywire.. “flawed” data"

Page 15: R meetup talk   scaling data science with dgit

Desired State

1. Trusted
   a. Every model should be auditable to the last record and step ⬅
   b. Every model should be reproducible with zero human intervention ⬅
   c. Enables use and development of mathematical judgment
2. Scalable
   a. Highly automated through most of the lifecycle ⬅
   b. Continuous reduction in costs ⬅
   c. Grow sublinearly with questions, datasets, models
3. Robust
   a. Younger, inexperienced staff ⬅
   b. Weak processes

Page 16: R meetup talk   scaling data science with dgit

Process with Dataset Repository

[Diagram. Roles: Biz, Analytics Team, Data Engg. Labels: Server Side CI; Dataset Rules; Evaluation Rules; Dependencies; Materialized dataset; Materialize; Model Pipeline; Pipeline Execution; Slide Content URN; Context, Questions; Evaluation; Interpretation. Dataset versions: v1, v2, v3, v4, v5, v6]

Dataset as mutable object with memory

No emails/google docs

Continuous validation by third party (server)

Separate model development and evaluation
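To make the "Dataset Rules" and "continuous validation" ideas concrete, here is a minimal sketch of a rule check that a server-side job could run on every new dataset version. The rule format, the column names (`age`, `country`), and the `validate` function are all hypothetical illustrations, not dgit's rule syntax or API.

```python
import csv
import io

# Hypothetical rules: column name -> predicate applied to the raw string value.
RULES = {
    "age": lambda v: v.isdigit() and 0 <= int(v) <= 120,
    "country": lambda v: v != "",
}


def validate(csv_text, rules):
    """Return (line_number, column, value) for every value that breaks a rule."""
    failures = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for column, is_valid in rules.items():
            value = row.get(column) or ""  # a missing column is treated as empty
            if not is_valid(value):
                failures.append((line_number, column, value))
    return failures
```

Because the check returns the failing line and column rather than a single pass/fail flag, the result can be stored alongside the dataset version and traced back to individual records.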

Page 17: R meetup talk   scaling data science with dgit

dgit

Page 18: R meetup talk   scaling data science with dgit

Dgit Structure

[Architecture diagram. Labels: dgit CLI, dgitcore API, Repo Mgr, Git, Backend, S3, Metadata, Basic, Validator, Generator, Instrumentation, MySQL, S3, Regression, Content, Platform]

Page 19: R meetup talk   scaling data science with dgit

Demo Goals

1. Show end-to-end example (command line)
   a. Simple regression
2. Explain structure
3. Advanced features
   a. Validation (regression quality plugin)
   b. Generator (SQL)
   c. Pipeline (Dora)
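The slides do not show how the regression quality plugin works. As a rough illustration of what such a validation could look like, the sketch below fits a simple least-squares line and accepts the result only if R² clears a threshold. The function names and the `min_r2=0.8` default are assumptions made for this example, not taken from dgit.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def passes_quality_gate(xs, ys, min_r2=0.8):
    """Hypothetical gate: accept the regression only if R^2 >= min_r2."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(xs, ys, slope, intercept) >= min_r2
```

A gate like this is deliberately crude: it assumes at least two distinct x values and non-constant y, and a high R² alone does not establish that a model is sound. It only shows how a quality rule can be automated rather than checked by hand.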

Page 20: R meetup talk   scaling data science with dgit

Open Tasks

1. Dgit specific
   a. Cleanup and stabilization
      i. Python v2/3 compatibility
      ii. Plugins to do various tasks (anonymization, hive etc)
   b. Testing infrastructure
   c. Integration
      i. Windows and MacOS support
      ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science

Page 21: R meetup talk   scaling data science with dgit

Speaker

Dr. Venkata Pingali

Founder, Scribble Data
Former-VP Analytics, FourthLion

IIT(B) PhD (USC)

http://linkedin.com/in/pingali