data scientist toolbox
Post on 27-Jan-2015
DESCRIPTION: My presentation at http://www.bigdata.ro/ on how to get your job done as a data scientist!
1. Data Scientist Toolbox: Andrei Savu - Axemblr.com, BigData.ro 2013
2. Me: Founder of Axemblr.com; Organizer of Bucharest JUG (bjug.ro); Passion for DevOps, Data Analysis; Connect with me on LinkedIn
3. @ Axemblr: Service Deployment Orchestration; Infrastructure Automation (DevOps); Apache Hadoop On-Demand Appliance; Axemblr Provisionr https://github.com/axemblr/axemblr-provisionr
4. (Big)Data in a nutshell: Business Intelligence / Research Evolved; Significant change in Decision Making; Enables new Products & Features; Enables new Business Models
5. Data Scientist: Has a Business / Research oriented perspective; Knowledge of statistics & software engineering (AI, infrastructure); Ability to explore questions and formulate hypotheses to be tested
6. Data Science Project: Focused on particular business goals; Based on a set of important questions; Result > Answers that support business decisions
7. The Algorithm: Find *Important* Questions; Identify & Extract Data; Store & Sample; Analyse; Visualization; Create Pipelines; Automate & Deploy; Learn & Repeat!
8. Start w/ Big Questions... answer them with (Big)Data: How can we understand & improve the conversion rate? How can we increase customer satisfaction? How can we find important mentions in social media?
9. Identify Data Sources, OR add more probes / sensors as needed: Google Analytics, Web server logs, Mixpanel, Custom application metrics, Mouse tracking, Facebook metrics etc.
10. Extract Data... to a medium that allows you to run arbitrary queries: Local filesystem, Databases, Hadoop, HBase, HDFS, Hive, Pig
11. Extract: Database dump tool, replicas or backups; External web services; Apache Sqoop (SQL-to-Hadoop); Implement pipelines / real-time streams; Write custom tools as needed
12. Curate: Unfortunately Data is Messy
13. Curate - Your Way: Use or develop tools / scripts; On large volumes there are no obvious choices; Custom ways of filtering & aggregating large streams (e.g. twitter, sensors); Reuse existing software components for data curation / validation
14. DataWrangler: Interactive System for Data cleaning and transformation http://vis.stanford.edu/wrangler/
15. Open Refine: Former Google Refine https://github.com/OpenRefine/OpenRefine
16. Sample (time, etc.): As needed to support interactive exploration
17. Why Sample? Interactive exploration to create and check assumptions, to create algorithms; Be careful with Statistical Significance; Sample Smart: By time, By location etc.
18. Analyse Sample: This is where the fun begins
19. Analyse Sample: Create models; Create algorithms; Check hypotheses; Faster feedback loops & Immediate Gratification
20. Excel-like
21. Python
22. RStudio
23. Gephi.org
24. Analyse All: apply your results to the entire data set
25. How to Analyse All? Easy on a single machine; Go distributed w/ Hadoop, MPI, Storm, Oracle Exa* etc.; Key: Leverage existing tools; Tools: sed, awk, SQL, RHadoop, Apache Hive, Pig, Cloudera Impala, MPI, Custom MR
26. Visualization: Communicate meaning w/ Graphics
27. http://selection.datavisualization.ch/
28. Automate & Deploy: Make it part of your internal dashboard
29. Learn & Repeat: Answers most of the time generate new questions
30. Thanks! Questions? Andrei Savu / [email protected] @andreisavu
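The "Sample Smart: By time" advice from slide 17 can be sketched in Python roughly as follows. This is a minimal illustration and not code from the talk: the function name `sample_by_hour`, the hourly bucket size, and the `(timestamp, payload)` record shape are all assumptions made for the example.

```python
import random
from collections import defaultdict
from datetime import datetime


def sample_by_hour(records, per_hour, seed=0):
    """Keep at most `per_hour` randomly chosen records from each hourly bucket.

    `records` is an iterable of (timestamp, payload) tuples (a hypothetical
    shape chosen for this sketch). Sampling per bucket keeps every hour
    represented, instead of letting busy hours dominate the sample.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for ts, payload in records:
        # Truncate the timestamp to the start of its hour to get the bucket key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((ts, payload))

    sample = []
    for hour in sorted(buckets):
        rows = buckets[hour]
        k = min(per_hour, len(rows))  # never ask for more rows than exist
        chosen = rng.sample(rows, k)
        sample.extend(sorted(chosen, key=lambda r: r[0]))
    return sample


# Usage: 3 hours of synthetic events, one every 10 minutes (6 per hour).
events = [
    (datetime(2013, 1, 1, h, m), f"event-{h}-{m}")
    for h in range(3)
    for m in range(0, 60, 10)
]
subset = sample_by_hour(events, per_hour=2)
print(len(subset))  # 3 hours x 2 records per hour = 6
```

Bucketing before sampling is one way to keep quiet periods visible in the sample. As the slide warns, any conclusion drawn from such a subset still needs a statistical significance check before it is applied to the full data set.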