Data Scientist Toolbox


Post on 27-Jan-2015







My presentation on how to get your job done as a data scientist!


  • 1. Data Scientist Toolbox Andrei Savu - 2013

  • 2. Me: Founder of; Organizer of Bucharest JUG; Passion for DevOps, Data Analysis; Connect with me on LinkedIn
  • 3. @ Axemblr: Service Deployment Orchestration; Infrastructure Automation (DevOps); Apache Hadoop On-Demand Appliance; Axemblr Provisionr
  • 4. (Big)Data in a nutshell: Business Intelligence / Research Evolved; Significant change in Decision Making; Enables new Products & Features; Enables new Business Models
  • 5. Data Scientist: Has a Business / Research oriented perspective; Knowledge of statistics & software engineering (AI, infrastructure); Ability to explore questions and formulate hypotheses to be tested
  • 6. Data Science Project: Focused on particular business goals; Based on a set of important questions; Result: Answers that support business decisions
  • 7. The Algorithm: Find *Important* Questions; Identify & Extract Data; Store & Sample; Analyse; Visualization; Create Pipelines; Automate & Deploy; Learn & Repeat!
  • 8. Start w/ Big Questions... answer them with (Big)Data: How can we understand & improve the conversion rate? How can we increase customer satisfaction? How can we find important mentions in social media?
  • 9. Identify Data Sources, OR add more probes / sensors as needed: Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
  • 10. Extract Data... to a medium that allows you to run arbitrary queries: Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
  • 11. Extract: Database dump tool, replicas or backups; External web services; Apache Sqoop (SQL-to-Hadoop); Implement pipelines / real-time streams; Write custom tools as needed
  • 12. Curate: Unfortunately Data is Messy
  • 13. Curate - Your Way: Use or develop tools / scripts; On large volumes there are no obvious choices; Custom ways of filtering & aggregating large streams (e.g. twitter, sensors); Reuse existing software components for data curation / validation
  • 14. DataWrangler: Interactive System for Data cleaning and transformation
  • 15. OpenRefine: formerly Google Refine
  • 16. Sample (time, etc.): As needed to support interactive exploration
  • 17. Why Sample? Interactive exploration to create and check assumptions, to create algorithms; Be careful with Statistical Significance; Sample Smart: By time, By location etc.
  • 18. Analyse Sample: This is where the fun begins
  • 19. Analyse Sample: Create models; Create algorithms; Check hypotheses; Faster feedback loops & Immediate Gratification
  • 20. Excel-like
  • 21. Python
  • 22. RStudio
  • 23.
  • 24. Analyse All: apply your results to the entire data set
  • 25. How to Analyse All? Easy on a single machine; Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.; Key: Leverage existing tools; Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
  • 26. Visualization: Communicate meaning w/ Graphics
  • 27.
  • 28. Automate & Deploy: Make it part of your internal dashboard
  • 29. Learn & Repeat: Answers most of the time generate new questions
  • 30. Thanks! Questions? Andrei Savu / [email protected] / @andreisavu
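
As a rough illustration of the "Sample Smart: By time" advice from the "Why Sample?" slide, here is a minimal Python sketch that keeps a small random subset of events from each hour, so that quiet periods are not drowned out by busy ones. It is not taken from the talk: the function name `sample_by_hour`, the event layout (a dict with a `ts` timestamp) and the example data are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_by_hour(events, per_hour, seed=0):
    """Keep at most `per_hour` randomly chosen events from each hour bucket.

    `events` is assumed to be an iterable of dicts with a datetime under "ts"
    (a hypothetical layout, not one prescribed by the slides).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for event in events:
        # Truncate the timestamp to the hour to form the bucket key.
        key = event["ts"].replace(minute=0, second=0, microsecond=0)
        buckets[key].append(event)

    sample = []
    for key in sorted(buckets):
        bucket = buckets[key]
        # Small buckets are kept whole; large ones are down-sampled.
        k = min(per_hour, len(bucket))
        sample.extend(rng.sample(bucket, k))
    return sample


# Made-up page-view events: 50 in the 10:00 hour, 3 in the 11:00 hour.
events = [{"ts": datetime(2013, 1, 1, 10, i), "page": "/home"} for i in range(50)]
events += [{"ts": datetime(2013, 1, 1, 11, i), "page": "/buy"} for i in range(3)]

subset = sample_by_hour(events, per_hour=5)
print(len(subset))
```

Capping each bucket rather than taking a flat percentage is one possible design choice; it keeps the sample small enough for interactive exploration, but the per-bucket counts are no longer proportional to traffic, which matters for the slide's warning about statistical significance.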