![Page 1: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/1.jpg)
Open Source in Analytics
![Page 2: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/2.jpg)
Introduction
IIML, DCE
Founder, Decisionstats.com
Author, R for Business Analytics
![Page 3: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/3.jpg)
Brief History of Analytics
SAS and SPSS led from the 1970s to the early 2000s
SAS leads the market but is very expensive
IBM bought SPSS, but it is still not open source
R, Python and Hadoop challenged this
![Page 4: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/4.jpg)
Analytics Sub Components
Data Storage
Data Querying
Data Summarization
Data Visualization
Statistical Routines
![Page 6: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/6.jpg)
Analytics Sub Components
| Component | Proprietary | Open Source |
| --- | --- | --- |
| Data Storage | Oracle DBMS, SQL Server | MySQL, NoSQL, Hadoop |
| Data Querying | Business Objects, SAP | Pentaho, Jaspersoft |
| Data Summarization | SQL, SAS, Crystal Reports | Still SQL, Pig, Hive |
| Data Visualization | Tableau | R, Python, JavaScript |
| Statistical Routines | SAS, SPSS | R, Python, RapidMiner |

*Not an exhaustive list. Vendors provide a broader portfolio. For example purposes only.
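
To make the sub-components concrete, here is a minimal sketch in Python using only the standard library. SQLite stands in for the storage and querying layer (it is not one of the tools named on this slide), and the `sales` table, its columns, the function names, and the figures are all invented for illustration. Visualization is left out.

```python
import sqlite3
import statistics

# Hypothetical sample data: (region, revenue). The values are invented.
ROWS = [
    ("north", 120.0),
    ("north", 80.0),
    ("south", 200.0),
    ("south", 100.0),
    ("south", 150.0),
]


def build_store(rows):
    """Data storage: load rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()
    return conn


def query_region(conn, region):
    """Data querying: fetch the revenue values for one region."""
    cur = conn.execute(
        "SELECT revenue FROM sales WHERE region = ? ORDER BY rowid", (region,)
    )
    return [row[0] for row in cur.fetchall()]


def summarize(conn):
    """Data summarization: total and average revenue per region."""
    cur = conn.execute(
        "SELECT region, SUM(revenue), AVG(revenue) FROM sales "
        "GROUP BY region ORDER BY region"
    )
    return {region: (total, avg) for region, total, avg in cur.fetchall()}


def revenue_stdev(values):
    """Statistical routine: sample standard deviation of the values."""
    return statistics.stdev(values)


if __name__ == "__main__":
    conn = build_store(ROWS)
    print(query_region(conn, "south"))
    print(summarize(conn))
    print(revenue_stdev(query_region(conn, "south")))
```

In a real stack each function would likely be a separate tool: a database for storage, SQL or Hive for querying and summarization, and R or Python libraries for the statistics.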
![Page 7: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/7.jpg)
Analytics using Python
● pandas http://pandas.pydata.org/ High-performance, easy-to-use data structures and data analysis tools
● scikit-learn http://scikit-learn.org/stable/ Simple and efficient tools for data mining and data analysis, built on NumPy, SciPy, and matplotlib
● NumPy http://www.numpy.org/
● SciPy http://www.scipy.org/scipylib/index.html
● matplotlib http://matplotlib.org/
● statsmodels http://statsmodels.sourceforge.net/# A Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available
● IPython http://ipython.org/ Interactive computing
![Page 8: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/8.jpg)
Analytics using R
http://www.r-project.org/
● RStudio and Revolution Analytics
● sqldf https://code.google.com/p/sqldf/ and RODBC http://cran.r-project.org/web/packages/RODBC/index.html
● ggplot2 http://ggplot2.org/ and ggmap and shiny
● RHadoop et al https://github.com/RevolutionAnalytics/RHadoop
● car, stats, forecast, sna, tm
● rattle and R Commander (with plugins)
More at http://rforanalytics.wordpress.com/
![Page 9: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/9.jpg)
Analytics using R
http://www.revolutionanalytics.com/
![Page 10: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/10.jpg)
Analytics using R
http://www.revolutionanalytics.com/
![Page 11: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/11.jpg)
Analytics using R
<blatant self promotion>
http://www.amazon.com/R-Business-Analytics-A-Ohri/dp/1461443423
R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve of R.
</blatant self promotion>
![Page 12: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/12.jpg)
Analytics using Rapid Miner
Early adopter of open source analytics
Recently moved from Germany to the USA following a PE infusion
One of the first marketplaces for analytics extensions http://marketplace.rapid-i.com/UpdateServer/
One of the best GUIs: drag and drop using flows
![Page 13: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/13.jpg)
Analytics using Rapid Miner
![Page 14: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/14.jpg)
Analytics using Jaspersoft
OLAP
BIG DATA
(offered through cloud, mobile)
![Page 15: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/15.jpg)
Analytics using Pentaho
Basically Weka
Reporting as well
Complete BI and Analytics Stack
![Page 17: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/17.jpg)
Hadoop- evolving ecosystem
![Page 18: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/18.jpg)
Hadoop- evolving ecosystem
![Page 19: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/19.jpg)
Hadoop- evolving ecosystem
![Page 20: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/20.jpg)
R
http://www.r-project.org/
Open Source
Free
5000+ Packages
Growing Faster
>2 million users
RAM constraints??
![Page 21: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/21.jpg)
R
http://www.r-project.org/
Object Oriented
has GUI and IDE
has Commercial offerings
![Page 23: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/23.jpg)
R - Rattle - Data Mining GUI
http://www.r-project.org/
![Page 24: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/24.jpg)
R - R Commander
http://www.r-project.org/
![Page 25: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/25.jpg)
R - RStudio
![Page 26: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/26.jpg)
R - Revolution Analytics
Free for academics worldwide!
RevoScaleR package for Big Data
Recommended install: http://info.revolutionanalytics.com/free-academic.html
![Page 28: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/28.jpg)
R - Big Data Packages
http://cran.r-project.org/web/views/HighPerformanceComputing.html
● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data.
● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework.
● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon.
● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks.
● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of Histograms such as those created by Mappers in a MapReduce application.
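
The Divide and Recombine and Map/Reduce ideas mentioned above can be sketched in a few lines of plain Python. This is only an illustration of the pattern: it does not use Hadoop or any of the R packages listed, and the function names (`chunked`, `map_histogram`, `merge_histograms`, `histogram`) are made up for this example. Each "mapper" builds a partial histogram for its chunk, and the "reducer" adds the partial histograms together.

```python
from collections import Counter
from functools import reduce


def chunked(values, size):
    """Divide: split the data into fixed-size chunks, one per mapper."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_histogram(chunk, bin_width):
    """Map: count how many values of a chunk fall into each bin.

    A bin is identified by its lower edge, e.g. 20 for the range [20, 30).
    """
    return Counter((v // bin_width) * bin_width for v in chunk)


def merge_histograms(left, right):
    """Reduce: add two partial histograms bin by bin."""
    return left + right


def histogram(values, bin_width=10, chunk_size=3):
    """Recombine: merge the per-chunk histograms into one result."""
    partials = [map_histogram(c, bin_width) for c in chunked(values, chunk_size)]
    return dict(reduce(merge_histograms, partials, Counter()))


if __name__ == "__main__":
    print(histogram([1, 5, 12, 18, 25, 31, 39, 7]))
```

Because adding counts is associative, the result does not depend on how the data is split into chunks, which is what lets the map step run in parallel on a cluster.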
![Page 29: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/29.jpg)
Terrific Data Mining using R GUI
![Page 30: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/30.jpg)
Great Data Visualization using R GUI
![Page 31: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/31.jpg)
So many packages - CRAN Views to the rescue
http://cran.r-project.org/web/views/
Bayesian: Bayesian Inference
ChemPhys: Chemometrics and Computational Physics
ClinicalTrials: Clinical Trial Design, Monitoring, and Analysis
Cluster: Cluster Analysis & Finite Mixture Models
DifferentialEquations: Differential Equations
Distributions: Probability Distributions
Econometrics: Computational Econometrics
Environmetrics: Analysis of Ecological and Environmental Data
ExperimentalDesign: Design of Experiments (DoE) & Analysis of Experimental Data
Finance: Empirical Finance
Genetics: Statistical Genetics
Graphics: Graphic Displays & Dynamic Graphics & Graphic Devices & Visualization
HighPerformanceComputing: High-Performance and Parallel Computing with R
MachineLearning: Machine Learning & Statistical Learning
MedicalImaging: Medical Image Analysis
MetaAnalysis: Meta-Analysis
Multivariate: Multivariate Statistics
NaturalLanguageProcessing: Natural Language Processing
![Page 32: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/32.jpg)
So many packages - CRAN Views to the rescue
http://cran.r-project.org/web/views/
NumericalMathematics: Numerical Mathematics
OfficialStatistics: Official Statistics & Survey Methodology
Optimization: Optimization and Mathematical Programming
Pharmacokinetics: Analysis of Pharmacokinetic Data
Phylogenetics: Phylogenetics, Especially Comparative Methods
Psychometrics: Psychometric Models and Methods
ReproducibleResearch: Reproducible Research
Robust: Robust Statistical Methods
SocialSciences: Statistics for the Social Sciences
Spatial: Analysis of Spatial Data
SpatioTemporal: Handling and Analyzing Spatio-Temporal Data
Survival: Survival Analysis
TimeSeries: Time Series Analysis
WebTechnologies: Web Technologies and Services
gR: gRaphical Models in R
![Page 33: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/33.jpg)
R in the Browser
http://www.r-fiddle.org/#/
http://statace.com/
http://www.rstudio.com/ide/server/
![Page 34: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/34.jpg)
R - Hadoop Packages
https://github.com/RevolutionAnalytics/RHadoop/wiki
● plyrmr - higher level plyr-like data processing for structured data, powered by rmr
● rmr - functions providing Hadoop MapReduce functionality in R
● rhdfs - functions providing file management of the HDFS from within R
● rhbase - functions providing database management for the HBase distributed database from within R
http://amplab-extras.github.io/SparkR-pkg/
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
https://github.com/nexr/RHive
RHive is an R extension facilitating distributed computing via Hive queries. RHive allows easy usage of HQL (Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.
![Page 35: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/35.jpg)
R - Cloud Computing
http://cran.r-project.org/web/views/WebTechnologies.html
![Page 36: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/36.jpg)
R - Big Data Packages
http://cran.r-project.org/web/views/HighPerformanceComputing.html
Large memory and out-of-memory data
● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory.
● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.
● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.
● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table
● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.
● The speedglm package fits (generalised) linear models to large data.
● The biglars package by Seligman et al. can use ff to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression.
● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory.
● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
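
The "incremental computations" idea behind out-of-memory model fitting can be illustrated with a small Python sketch. It fits a simple linear regression y = a + b*x by keeping only a handful of running sums, so each chunk of data can be discarded after it has been processed. This is a teaching example under simplifying assumptions: it is not how biglm or any other package listed above is implemented, the function names are hypothetical, and plain running sums are less numerically robust than the methods a production package would use.

```python
def new_state():
    """Running sums needed to fit y = a + b*x by least squares."""
    return {"n": 0, "sx": 0.0, "sy": 0.0, "sxx": 0.0, "sxy": 0.0}


def update(state, chunk):
    """Fold one chunk of (x, y) pairs into the running sums.

    Only the sums are kept, so memory use does not grow with the data.
    """
    for x, y in chunk:
        state["n"] += 1
        state["sx"] += x
        state["sy"] += y
        state["sxx"] += x * x
        state["sxy"] += x * y
    return state


def coefficients(state):
    """Return (intercept, slope) from the accumulated sums."""
    n = state["n"]
    denom = n * state["sxx"] - state["sx"] ** 2
    if n < 2 or denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * state["sxy"] - state["sx"] * state["sy"]) / denom
    intercept = (state["sy"] - slope * state["sx"]) / n
    return intercept, slope


if __name__ == "__main__":
    # Two chunks arriving one after the other, e.g. read from a large file.
    chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
    state = new_state()
    for chunk in chunks:
        update(state, chunk)
    print(coefficients(state))
```

The same pattern carries over to a streaming or map/reduce setting, because the sums from separate chunks can simply be added together.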
![Page 37: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/37.jpg)
Data Scientist Tool Kit
● web scraping
● visualization
● machine learning
● data mining
● modeling
● sna
● social media analytics
● web analytics
● reproducible research
● TS forecasting
● spatial analysis
● data storage
● data querying
![Page 38: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/38.jpg)
Data Scientist Programming Skills
Java http://www.learnjavaonline.org/
Python http://www.codecademy.com/tracks/python
SQL http://www.w3schools.com/sql/
R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/
http://www.statmethods.net/
Hadoop http://hortonworks.com/hadoop-training/
Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh
![Page 39: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/39.jpg)
Other places to learn
MOOCs
1 https://www.edx.org/
2 https://www.coursera.org/
3 https://www.udacity.com/
4 https://www.udemy.com/
Books
Courses
Workshops
![Page 40: Open source analytics](https://reader033.vdocument.in/reader033/viewer/2022051110/54c6750c4a7959d4168b45af/html5/thumbnails/40.jpg)
Summary
Open source has greatly helped cut down the cost of software in analytics
The benefits of analytics continue to be many
Add Big Data, Cloud and MOOCs, and the total cost to geeks is much lower!