The Last Mile: Challenges and Opportunities in Data Tools
Strata Santa Clara 2014
www.datapad.io
Wes McKinney
• Former quant @ AQR (a hedge fund)
• Creator of pandas
• Author of Python for Data Analysis — O’Reilly
• Founder and CEO of DataPad
@wesmckinn
• http://datapad.io
• New web-based visual analytics environment
• In private beta, join us!
• Hiring for engineering
Some Problems
• Business Analytics
• Statistics and ML
• ETL
• Data Visualization
• Workflows + Collaboration
Data toolchains
[Diagram; labels: Data Acquisition, ETL, SQL / Tidy Form, Code-based Env, UI-based Env, Data Slinging / Management, Analysis]
Data toolchains
[Diagram; labels: Data Acquisition, ETL, HDFS, Code-based Env, UI-based Env, Analytic DBMS, ETL, ETL?, Maybe]
Some Trends
• Columnar / analytic databases
• SQL-on-Hadoop
• Spark / Spark ecosystem
• New life in visual ETL / data prep
• Better data manipulation libraries
Crunching data with code
• R (+ data.table, dplyr)
• Python: pandas
• Data frames in Scala, F#, Julia, …
• Spark (Scala/Java)
Some Programmatic Tool Problems
• Awkward / slow DB interactions
• In-process memory management
• Reuse of intermediate results
• Execution speed
• Evaluation semantics
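To make the first bullet concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The `sales` table, its columns, and the function names are made up for illustration and do not come from the talk. It contrasts pulling every row into the analysis process with letting the database aggregate and return only the result.

```python
import sqlite3
from collections import defaultdict


def make_db():
    """Build a tiny in-memory database (hypothetical data, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    rows = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 7.5)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def totals_in_process(conn):
    """Fetch every row, then aggregate in Python: all data crosses the DB boundary."""
    totals = defaultdict(float)
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] += amount
    return dict(totals)


def totals_pushed_down(conn):
    """Let the database aggregate: only one row per region is transferred."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query).fetchall())


conn = make_db()
in_process = totals_in_process(conn)
pushed_down = totals_pushed_down(conn)
```

Both functions give the same totals. The difference is how much data moves between the database and the analysis process, which is where the "awkward / slow" cost tends to show up on large tables.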
dplyr (R library)
• By Hadley Wickham and Romain François
• Uniform R API, SQL and in-memory backends
• Describe complex data manipulation using “chaining”
dplyr (R library)
library(dplyr)
# %.% is the chaining operator from early dplyr releases; later versions replaced it with %>%
final <- crime.by.state %.%
  filter(State == "New York", Year == 2005) %.%
  arrange(desc(Count)) %.%
  select(Type.of.Crime, Count) %.%
  mutate(Proportion = Count / sum(Count)) %.%
  group_by(Type.of.Crime) %.%
  summarise(num.types = n(), counts = sum(Count))
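For comparison, the dplyr chain above can be approximated in pandas with method chaining. This is a rough sketch rather than an exact translation: the `crime_by_state` frame and its values are invented stand-ins for the slide's `crime.by.state` table, and the output column names are adapted to Python identifiers.

```python
import pandas as pd

# Hypothetical stand-in for the slide's crime.by.state table.
crime_by_state = pd.DataFrame({
    "State": ["New York", "New York", "New York", "Ohio"],
    "Year": [2005, 2005, 2004, 2005],
    "Type.of.Crime": ["Burglary", "Robbery", "Burglary", "Burglary"],
    "Count": [300, 100, 250, 80],
})

final = (
    crime_by_state
    # filter(State == "New York", Year == 2005)
    .loc[lambda df: (df["State"] == "New York") & (df["Year"] == 2005)]
    # arrange(desc(Count))
    .sort_values("Count", ascending=False)
    # select(Type.of.Crime, Count)
    .loc[:, ["Type.of.Crime", "Count"]]
    # mutate(Proportion = Count / sum(Count))
    .assign(Proportion=lambda df: df["Count"] / df["Count"].sum())
    # group_by(Type.of.Crime); sort=False keeps the order of first appearance
    .groupby("Type.of.Crime", sort=False)
    # summarise(num.types = n(), counts = sum(Count))
    .agg(num_types=("Count", "size"), counts=("Count", "sum"))
    .reset_index()
)
```

Each step returns a new DataFrame, so the chain reads top to bottom in the same order as the R version.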
Apache Spark
• Broad set of primitive data ops
• Distributed in-memory model scales naturally, high performance
• Build complex computation graphs for analytics
• Applications: Shark, GraphX, …
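The "computation graph" idea can be illustrated with a toy in plain Python. This is not Spark's API: `LazyDataset` is a hypothetical class written for this sketch, it runs in a single process, and it only handles a linear chain of steps. What it shows is the general pattern of recording transformations first and executing them only when a result is requested.

```python
class LazyDataset:
    """Toy lazy collection: transformations are recorded, not run, until collect()."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def map(self, fn):
        # Return a new dataset with one more recorded step; nothing executes here.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Replay the recorded steps over the source and materialize a list.
        items = iter(self._source)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records each time the map function actually runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(6)).map(square).filter(lambda v: v % 2 == 0)
ran_before_collect = len(calls)  # still 0: the chain has only been described
result = pipeline.collect()      # the recorded steps run here
```

Because the whole chain is known before anything runs, an engine built this way has room to plan, combine, or distribute steps, which a step-by-step eager library cannot do as easily.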
pandas (Python library)
• Broad traction
• Strong feature: time series analytics
• User-friendly API and community
• Being used in many unexpected ways
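As a small illustration of the time series bullet, the sketch below builds a date-indexed Series and applies two common operations: a rolling mean and resampling to a coarser frequency. The dates and values are made up for the example.

```python
import pandas as pd

# Six made-up daily observations indexed by date.
idx = pd.date_range("2014-01-01", periods=6, freq="D")
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=idx)

# Mean over a sliding 3-observation window (the first two entries are NaN).
rolling_mean = prices.rolling(window=3).mean()

# Downsample to 2-day bins and sum the values that fall in each bin.
two_day_totals = prices.resample("2D").sum()
```

The datetime index carries the alignment and binning logic, so neither operation needs explicit loops over dates.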
badger (DataPad internal)
• A high-performance in-memory analytics engine for DataPad
• Addresses many performance and memory management concerns in pandas
• May become an OSS project someday
Standardized machine learning toolkits
• scikit-learn
• PMML
• Mahout
• Cloudera ML
Enterprise data workflows
• Cascading (+ Scalding, Cascalog)
• Apache Crunch
• Pig
Analytic databases
• Powering visual analytics tools on big data
• Compressed columnar storage
• MPP / in-memory execution model
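A rough sketch of why "compressed columnar storage" suits analytics, in plain Python. The helper names and sample rows are invented for illustration, and real engines use far more elaborate encodings. Storing each column contiguously lets a query scan only the columns it needs, and repeated values in a column compress well with simple schemes such as run-length encoding.

```python
from itertools import groupby


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def rle_encode(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Made-up rows: (state, year, count).
rows = [("NY", 2005, 3), ("NY", 2005, 5), ("NY", 2006, 2), ("CA", 2006, 7)]
columns = to_columns(rows, ["state", "year", "count"])

encoded_state = rle_encode(columns["state"])  # repeated values collapse into runs
total = sum(columns["count"])                 # an aggregate touches only one column
```

Run-length encoding pays off most when the data is sorted or naturally clustered on the encoded column, since that produces long runs.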
Visual data tools
• Visual Analytics/BI gone mainstream
• New Data Prep products
• Drag-and-drop predictive analytics
• Proliferation of vertical SaaS solutions
Visual tool challenges
• Tend to be less flexible than code
• Multiple tools to get the job done
• Many still dependent on Excel
• Collaboration, versioning, provenance
Collaboration tools
• Discovery and reuse
• Cataloguing insights
• Analytics from ad-hoc to production
• Interesting projects: IPython Notebook, Shiny, Pivotal Chorus