Data Analysis in R for Beginners


Post on 10-Jan-2017





  • Data Analysis in R for Beginners

    Alton Alexander

    Data Science Consultant

  • Why R?

    R is open source, like Python and unlike SAS

    Out of the box, R is a single-machine, in-memory statistical computing engine

    Download from

    Use an IDE: RStudio, Revolution Analytics (MSFT), Jupyter (IPython)

  • RStudio



  • Essential Learning Resources

    A new book for learning R

    Q: What have you tried and what works?

  • Topics

    Data ingestion

    Manipulation

    Summary and exploration

    Writing reports

    Interactive visualization and dashboarding

    Predictive modeling & forecasting

    Big data integrations

  • Demo

    Options data

    RStudio

  • Data ingestion

    Load data from files: read.csv()

    Load data from databases: library(RJDBC), library(RODBC)
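
    As a minimal sketch of file ingestion, the snippet below writes a tiny CSV to a temporary file and reads it back with base R's read.csv(). The column names and values are made up for illustration and are not the options data used in the demo. The RJDBC and RODBC routes need a database and driver, so they are not shown.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The columns and values here are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("symbol,strike,price",
             "AAA,100,2.5",
             "BBB,105,1.75"), csv_path)

# read.csv() returns a data frame.
# stringsAsFactors = FALSE keeps text columns as character vectors.
options_df <- read.csv(csv_path, stringsAsFactors = FALSE)

str(options_df)   # inspect the column types that were inferred
nrow(options_df)  # number of data rows read
```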

  • Data Structures and Manipulation

    Another major reason for using R

    Ability to work with data in data frames

    Like pandas in Python and data tables in SAS

    Reasons for doing data manipulation (munging):

    Feature extraction

    ETL

    Data cleansing

    Pivots, stack/unstack, aggregate, groupby, reshape
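
    A short sketch of these manipulations using only base R. The sales data frame and its columns are hypothetical, chosen to show a derived feature, a group-by aggregate, and a long-to-wide pivot.

```r
# Hypothetical data: revenue by region and quarter, in long format.
sales <- data.frame(
  region  = c("East", "East", "West", "West"),
  quarter = c("Q1", "Q2", "Q1", "Q2"),
  revenue = c(100, 120, 80, 95),
  stringsAsFactors = FALSE
)

# Feature extraction: derive a new column from an existing one.
sales$revenue_k <- sales$revenue / 1000

# Group by region and aggregate revenue with sum().
by_region <- aggregate(revenue ~ region, data = sales, FUN = sum)

# Pivot from long to wide: one row per region, one revenue column per quarter.
wide <- reshape(sales[, c("region", "quarter", "revenue")],
                idvar = "region", timevar = "quarter", direction = "wide")

by_region
wide
```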

  • Set Theory

    SQL joins and their results

    merge, sqldf in R
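
    The SQL join types map onto the all, all.x, and all.y arguments of base R's merge(). The two data frames below are hypothetical. sqldf is a separate package and is not shown here.

```r
# Hypothetical tables sharing an "id" key.
customers <- data.frame(id = c(1, 2, 3),
                        name = c("Ann", "Bob", "Cy"),
                        stringsAsFactors = FALSE)
orders <- data.frame(id = c(2, 3, 4),
                     amount = c(50, 75, 20))

# Inner join: only ids present in both tables.
inner <- merge(customers, orders, by = "id")

# Left join: every customer, with NA where there is no order.
left <- merge(customers, orders, by = "id", all.x = TRUE)

# Right join: every order, with NA where there is no customer.
right <- merge(customers, orders, by = "id", all.y = TRUE)

# Full outer join: every id from either table.
full <- merge(customers, orders, by = "id", all = TRUE)
```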


  • Summary and Exploration

    Powerful summary functions for programmatically quantifying datasets

    Functions include: summary(), hist(), levels(), aggregate()
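
    A quick sketch of these four functions. It uses the iris dataset that ships with R, which is an assumption for illustration and not the data from the talk. hist() is called with plot = FALSE so the bin counts can be inspected without a graphics device; drop that argument to draw the chart.

```r
# iris is a built-in dataset: 150 flowers, 4 measurements, 1 factor column.
data(iris)

# Per-column summary: quartiles for numeric columns, counts for factors.
summary(iris)

# Distinct categories of a factor column.
species_levels <- levels(iris$Species)

# Histogram bins and counts, computed without plotting.
h <- hist(iris$Sepal.Length, plot = FALSE)

# Mean sepal length per species.
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means
```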

  • Interactive Visualization and Dashboarding

    Shiny from RStudio

    Like Tableau

    Local and server options

    Much more customizable, more coding, no GUI or click to edit

    But you can bring in powerful libraries to build web apps comparatively fast

  • Predictive Modeling & Forecasting Examples

    Customer segmentation: unsupervised classification

    Marketing mix models: explain the coefficients

    Attribution modeling: supervised time series of events

    Multivariate testing: A/B tests with statistical significance, ANOVA

    Lead scoring: P2B models, topic of interest, propensity to buy, expected spend
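
    Two of these examples can be sketched with base R alone. The code below is an illustration under assumptions, not the approach used in the talk: it segments synthetic customers with kmeans(), one common unsupervised method, and checks a hypothetical A/B test with prop.test(). All numbers are invented.

```r
set.seed(42)  # make the synthetic data reproducible

# Hypothetical customers drawn from two well-separated groups.
customers <- data.frame(
  spend  = c(rnorm(50, mean = 20, sd = 5),  rnorm(50, mean = 80, sd = 5)),
  visits = c(rnorm(50, mean = 2, sd = 0.5), rnorm(50, mean = 10, sd = 0.5))
)

# Customer segmentation: k-means on standardized features.
# nstart = 10 tries several random starts and keeps the best fit.
seg <- kmeans(scale(customers), centers = 2, nstart = 10)
table(seg$cluster)  # customers per segment

# A/B test: compare conversion rates of two variants.
conversions <- c(A = 120, B = 150)    # hypothetical conversion counts
visitors    <- c(A = 1000, B = 1000)  # hypothetical visitor counts
ab <- prop.test(conversions, visitors)
ab$p.value  # a small p-value suggests the rates differ
```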

  • 5 Libraries for Machine Learning

    Allowing the machine to capture complexity:

    1. gbm [Gradient Boosting Machine]

    2. randomForest [Random Forest]

    3. e1071 [Support Vector Machines]

    Taking advantage of high-cardinality categorical or text data:

    4. glmnet [Lasso and Elastic-Net Regularized Generalized Linear Models]

    5. tau [Text Analysis Utilities]

  • Big Data Integration

    Single laptop is often sufficient: millions of rows on a 32 GB i7 laptop

    Scale using a larger server: often sufficient but has limitations (100s of GB)

    Clustered compute engine: algorithm considerations affect performance
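
    To judge whether a dataset fits in memory, a rough sketch is to measure one column with object.size() and scale up. The 5 million rows by 20 numeric columns below are hypothetical, and real usage is higher because R copies data during many operations.

```r
# Measure the footprint of one numeric column with a million values.
n_rows <- 1e6
x <- numeric(n_rows)
bytes_per_column <- as.numeric(object.size(x))
format(object.size(x), units = "MB")

# Back-of-the-envelope estimate: a double takes 8 bytes.
# Hypothetical table: 5 million rows by 20 numeric columns.
est_gb <- 5e6 * 20 * 8 / 1024^3
est_gb  # estimated size in GB, before any working copies
```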

  • RServer

    For datasets that don't fit in memory, or for convenience, there is a server option

    A shared compute engine

    Shares resources

    Think 100+ GB of RAM

  • Big Data Integration - Frameworks

    SparkR

    Revolution Analytics

    In-DB processing

    Applying lead score or segmentation model in real time

    Spark, Teradata, Vertica

  • Why R? In High Demand Nationally

  • Get Alton's FREE Reports!

    Go to

    Complete the survey including your email

    I'll email you the two reports:

    1. Anonymized Summary of the Survey

    2. LinkedIn Job Suggestions for a Utah Data Scientist
