R Meetup Talk: Scaling Data Science with dgit
Scaling Data Science with dgit
Dr. Venkata Pingali
Founder, Scribble Data
https://github.com/pingali
Summary
1. Scaling impact of data science requires increasing trust and efficiency
a. Trust requires auditability and reproducibility of results
b. Efficiency requires standardization and automation
2. Dataset is a fundamental abstraction of data science
3. dgit enables git-like management of datasets
a. Python package, open source, MIT licence
b. Familiar git interface with modifications
4. Call to collaborate
dgit - 1 min summary
dgit - git wrapper for datasets
1. Python package, MIT license
2. Application of git
3. Beyond git - “Understands” data
a. Metadata generation and management
b. Automatic scanning of working directory for changes
c. Automatic validation and materialization
d. Dependency tracking across repos
e. Automatic audit trails with execution
f. Pipeline support
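The "automatic scanning of working directory for changes" item can be illustrated with a small sketch. This is not dgit's implementation; the function names (`file_digest`, `snapshot`, `diff_snapshots`) are hypothetical and chosen only for illustration. It assumes a change is detected by comparing content hashes between two snapshots of a directory.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    sha = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large dataset files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()


def snapshot(root):
    """Map each file path (relative to root) to its content digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def diff_snapshots(old, new):
    """Classify files as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }
```

Taking a `snapshot` before and after an analysis step and passing both to `diff_snapshots` yields the list of files that a tool would need to record in the next dataset revision.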
Growing Pains in Data Science
Anonymized Random Slide from an Actual Presentation
Implication: Large wasted spend, poor production design, baseline worsening
Decision-maker Questions
1. Where did the numbers come from? (Correctness, Lineage)
a. Assumption, models, datasets
2. Is this an accident? Does it hold now? (Reproducibility, Retargetability)
a. Model, dataset, and question revisions
3. Can you get the results faster? (Efficiency)
a. Time, effort, cost
4. Can you also analyze X? (Extensibility)
a. Different dataset, question
5. Could we try X? (Dataset generation - synthetic and real)
a. What if scenarios, field experiments
Conceptual Process
[Diagram: Biz, Analytics Team, and Data Engg, with the labels Qtns/Context, Data Req, Datasets, Model Results, and Story Telling]
All three roles could be in a single team!
Business Complexity is Discovered Over Time
Incomplete context (history, semantics)
Qtns not thought through
Continuous revisions
Imperfect Data Queries due to Limited Understanding
Dependencies not specified
Wrong filters
Known outliers
Narrow specification (cubes)
Weak process
Lack of protocol (email/files)
Missing validation checks
No lineage
No revisions
Eagerness to Present Great Narratives
Wrong input dataset
Mistakes in pipeline
Excel/ad hoc transformations
Model evolution
Continuous revision of narratives
Missing interpretation integrity checks (e.g. other time windows)
Better methodology
Process in Reality
Iterative
Expensive
Laborious
Actual Process
Iterative
Expensive
Laborious
http://fortune.com/2016/02/05/why-big-data-isnt-paying-off-for-companies-yet/
"80% of ..companies strategic decision go haywire.. “flawed” data"
Desired State
1. Trusted
a. Every model should be auditable to the last record and step ⬅
b. Every model should be reproducible with zero human intervention ⬅
c. Enables use and development of mathematical judgment
2. Scalable
a. Highly automated through most of the lifecycle ⬅
b. Continuous reduction in costs ⬅
c. Grow sublinearly with questions, datasets, models
3. Robust
a. Younger, inexperienced staff ⬅
b. Weak processes
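The "auditable to the last record and step" goal can be made concrete with a minimal sketch of an audit trail. This is an illustration under stated assumptions, not how dgit records executions: the helper names (`digest`, `run_with_audit`) and the example steps are hypothetical, and records are assumed to be JSON-serializable dictionaries.

```python
import hashlib
import json


def digest(records):
    """Stable SHA-256 digest of a list of JSON-serializable records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_with_audit(records, steps):
    """Apply each (name, function) step in order and log its input and output."""
    trail = []
    for name, func in steps:
        before = digest(records)
        records = func(records)
        trail.append({
            "step": name,
            "input_digest": before,
            "output_digest": digest(records),
            "rows_out": len(records),
        })
    return records, trail


# Hypothetical pipeline steps, used only to exercise the sketch.
def drop_missing(rows):
    return [r for r in rows if r["spend"] is not None]


def add_margin(rows):
    return [dict(r, margin=r["revenue"] - r["spend"]) for r in rows]
```

Because each entry stores the digest of what went in and what came out, the output digest of one step must equal the input digest of the next, and rerunning the same steps on the same data reproduces the same trail.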
Process with Dataset Repository
[Diagram: Biz, Analytics Team, and Data Engg working against a dataset repository with Server Side CI. Labels: Dataset Rules, Evaluation Rules, Dependencies, Materialized dataset, Materialize, Model Pipeline, Pipeline Execution, Slide Content, URN, Context/Questions, Evaluation, Interpretation. Dataset versions are marked v1 through v6.]
Dataset as mutable object with memory
No emails/google docs
Continuous validation by third party (server)
Separate model development and evaluation
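The idea of continuous server-side validation against dataset rules can be sketched as follows. The rule names and the `validate` function are assumptions made for illustration; the talk does not specify how dgit expresses its rules.

```python
def validate(rows, rules):
    """Return (row_index, rule_name) for every rule that a row fails."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures.append((index, name))
    return failures


# Hypothetical dataset rules: each maps a rule name to a per-row predicate.
DATASET_RULES = {
    "has_customer_id": lambda row: bool(row.get("customer_id")),
    "amount_non_negative": lambda row: row.get("amount", 0) >= 0,
}
```

A server could run such checks on every pushed dataset revision and reject the revision when the list of failures is not empty, so that validation does not depend on the person who produced the data.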
dgit
Dgit Structure
[Diagram: dgit CLI and dgitcore API, with plugin types Repo Mgr (Git), Backend (S3), Validator, Generator, Instrumentation, and Metadata (Basic). Plugin labels shown: MySQL, S3, Regression, Content, Platform.]
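The structure above suggests a core API that dispatches to typed plugins. A minimal sketch of that pattern is shown below. The `PluginRegistry` class and its methods are hypothetical and are not the dgitcore API; they only illustrate how plugins might be grouped by type and looked up by name.

```python
class PluginRegistry:
    """Keeps plugins grouped by type, for example 'backend' or 'validator'."""

    def __init__(self):
        self._plugins = {}

    def register(self, kind, name, plugin):
        """Store a plugin object under its type and name."""
        self._plugins.setdefault(kind, {})[name] = plugin

    def get(self, kind, name):
        """Look up a plugin, raising LookupError if it is not registered."""
        try:
            return self._plugins[kind][name]
        except KeyError:
            raise LookupError(f"no {kind} plugin named {name!r}") from None

    def names(self, kind):
        """List the registered plugin names for one type, sorted."""
        return sorted(self._plugins.get(kind, {}))
```

With this shape, a command-line front end only needs to know the plugin type and name it wants, which keeps the core independent of any particular storage backend or validator.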
Demo Goals
1. Show end-to-end example (command line)
a. Simple regression
2. Explain structure
3. Advanced features
a. Validation (regression quality plugin)
b. Generator (SQL)
c. Pipeline (Dora)
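To make the "simple regression" and "regression quality" items concrete, here is a minimal sketch of a least-squares fit followed by a quality check. It is not the plugin shown in the demo: the function names and the use of R² with a 0.9 threshold are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def passes_quality_check(xs, ys, min_r2=0.9):
    """Hypothetical validation rule: accept the model only if R^2 >= min_r2."""
    slope, intercept = fit_line(xs, ys)
    return r_squared(xs, ys, slope, intercept) >= min_r2
```

A validator of this kind could run automatically whenever a dataset or model revision is committed, so a weak fit is flagged before the result reaches a presentation.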
Open Tasks
1. Dgit specific
a. Cleanup and stabilization
i. Python v2/3 compatibility
ii. Plugins to do various tasks (anonymization, hive etc)
b. Testing infrastructure
c. Integration
i. Windows and MacOS support
ii. Support for instabase/dat/other services
2. Ideas for new tools to reduce cost and complexity of data science
Speaker
Dr. Venkata Pingali
Founder, Scribble Data
Former VP Analytics, FourthLion
IIT(B) PhD (USC)
http://linkedin.com/in/pingali