
HSD Hochschule Düsseldorf

University of Applied Sciences

Fachbereich Wirtschaftswissenschaften

Faculty of Business Studies

IT Applications in Business Analytics

Business Analytics (M.Sc.)

IT in Business Analytics

SS2016 / Lecture 04 – The R Programming Language

Thomas Zeutschler

Let’s get started…

The R Programming Language

R is a statistical programming language developed by Ross Ihaka and Robert Gentleman, introduced in 1993.

R provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …).

R is open source, highly extensible and runs on all major platforms.

Today, R is one of the most widely used software ecosystems for statistical analysis.

The R Programming Language

The R system contains two major components:

1. Base system: contains the R language software and the high-priority add-on packages.

2. User-contributed add-on packages.

R includes:

an effective data handling and storage facility,

a suite of operators for calculations on arrays, in particular matrices,

a large collection of intermediate tools for data analysis,

graphical facilities for data analysis and display either on-screen or on hardcopy, and

a simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
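The language features listed above can be illustrated with a minimal sketch. The names m, factorial_rec and the loop variable are chosen for this example only and do not come from the lecture.

```r
# Operators on arrays: a 2x2 matrix, filled column by column
m <- matrix(1:4, nrow = 2)
m_squared <- m %*% m       # true matrix product
m_elementwise <- m * m     # element-wise product

# User-defined recursive function using a conditional
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * factorial_rec(n - 1)
}

# Loop with output to the console
for (i in 1:5) {
  cat(i, "! = ", factorial_rec(i), "\n", sep = "")
}
```

Note that %*% and * are different operators: the first multiplies matrices in the linear-algebra sense, the second multiplies element by element.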

Who uses the R Language?

Data scientists and analysts, statisticians, mathematicians.

All scientists, researchers and (product) developers who deal with data, especially in the natural sciences (medicine, biology) and the social sciences.

R is used especially often in developing countries, because it allows universal free access to state-of-the-art tools for statistical data analysis.

R is widely used for teaching statistics to undergraduates and graduates, because it is free of charge.

Who uses the R Language?

Many software vendors integrate R to provide advanced statistical capabilities from within their products.

Statistical software: SAS, SPSS, Statistica, Knime, RapidMiner, Mathematica etc.

Relational databases: Oracle, SAP HANA, Microsoft SQL Server, IBM DB2 etc.

Big data and NoSQL databases: Hadoop, MongoDB, Cassandra etc.

LOB (line-of-business) applications, e.g. ERP, CRM: SAP, Microsoft etc.

R Popularity

Google Trends 04.2016

R Eco-System

R Eco System

Packages are collections of R functions, data and/or compiled code.

CRAN, “The Comprehensive R Archive Network”, is the central repository for all publicly available R packages.

8,300 different packages available (as of April 2016)

R Eco System – Packages

The table below: Some R packages ordered by date of creation.

Many packages are constantly updated and very reliable.

The community is the reason for the success of R.

R Eco System – Packages (most popular 2015)

1. Rcpp: Seamless R and C++ Integration
2. ggplot2: An Implementation of the Grammar of Graphics
3. stringr: Simple, Consistent Wrappers for Common String Operations
4. plyr: Tools for Splitting, Applying and Combining Data
5. digest: Create Cryptographic Hash Digests of R Objects
6. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package
7. colorspace: Color Space Manipulation
8. RColorBrewer: ColorBrewer Palettes
9. manipulate: Interactive Plots for RStudio
10. scales: Scale Functions for Visualization
11. labeling: Axis Labeling
12. proto: Prototype Object-Based Programming
13. munsell: Munsell Colour System
14. gtable: Arrange Grobs in Tables
15. dichromat: Color Schemes for Dichromats
16. mime: Map Filenames to MIME Types
17. RCurl: General Network (HTTP/FTP/...) Client Interface for R
18. bitops: Bitwise Operations
19. zoo: S3 Infrastructure for Regular and Irregular Time Series
20. knitr: A General-Purpose Package for Dynamic Report Generation in R


RStudio

Native R is a console application; RStudio is a wrapper around it for convenience…

R Basics

R Basics (source: http://www.ats.ucla.edu/stat/r/seminars/intro.htm)

Simple Mathematics


# Simple arithmetic
1 + 2

# Declaration and usage of variables
# Attention: R is case sensitive (A and a are different variables)
A <- 2
B <- 3
x <- seq(0, 2*pi, 0.1)
y <- sin(x)

# Plot the sine curve
plot(x, y, main = "Sine Plot",
     sub = "made with R")



R Basics – Install and use packages

Using Packages

Installing Packages (remove the #)

Automatic Load and (if required) Installation of a Package
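The three steps named on this slide might look as follows. This is a sketch only: load_or_install is a hypothetical helper name introduced here, not part of base R, and "stats" is used merely because it ships with every R installation.

```r
# Using a package that is already installed
library(stats)

# Installing a package from CRAN (remove the # to run)
# install.packages("ggplot2")

# Load a package and install it first only if it is missing.
# load_or_install is a hypothetical helper name for this example.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

load_or_install("stats")
```

requireNamespace() only checks whether the package can be loaded, so nothing is downloaded when the package is already present.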

R Basics

Loading Data

Assign Data to Objects

Accessing Data
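A minimal sketch of loading, assigning and accessing data. The inline CSV text and the names scores, group_a_scores and second_row are made up for this example, as a stand-in for a real file or URL.

```r
# Loading data: read.csv() accepts a file path, a URL, or inline text.
# The text below is an invented stand-in for a real CSV file.
csv_text <- "id,group,score
1,a,10
2,b,20
3,a,30"

# Assign the result to an object
scores <- read.csv(text = csv_text)

# Accessing data
scores$score                                            # one column by name
second_row <- scores[2, ]                               # one row by index
group_a_scores <- scores[scores$group == "a", "score"]  # rows by condition
head(scores, 2)                                         # first two rows
```

The pattern object[rows, columns] works for any data frame; leaving one side empty selects all rows or all columns.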

R Basics


Accessing Data continued / Saving Data
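Further ways of accessing data, and two ways of saving it, could be sketched like this. The data frame and all object names are invented for illustration, and temporary files are used so that nothing is written to the working directory.

```r
# Invented example data
scores <- data.frame(id = 1:3,
                     group = c("a", "b", "a"),
                     score = c(10, 20, 30))

# Select rows and columns with subset()
high <- subset(scores, score >= 20, select = c(id, score))

# Add a derived column
scores$passed <- scores$score >= 20

# Save as CSV (readable by other tools)
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)

# Save as RDS (keeps R data types exactly) and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)
restored <- readRDS(rds_path)
```

CSV is the better choice for exchanging data with other programs; RDS is convenient when the data stays within R.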

R Basics


Simple Data Analysis

# load the dplyr package (provides filter)
library(dplyr)

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")

# return the number of observations (rows) and variables (columns) in d
dim(d)

# get the structure of d, including the class (type) of all variables
str(d)

# return the distributional summaries of variables in the dataset
summary(d)

# return a summary of the dataset for all rows where variable 'read' >= 60
# note that filter is in the dplyr package
summary(filter(d, read >= 60))

R Basics



# load the lattice charting package
library(lattice)

# draw a simple scatter plot
xyplot(read ~ write, data = d)

# conditioned scatter plot
xyplot(read ~ write | prog, data = d)

# box and whisker plots
bwplot(read ~ factor(prog), data = d)

More Charting (ggplot2 package)

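# load the ggplot2 charting package
library(ggplot2)

# draw a kernel density plot
ggplot(d, aes(x = write)) + geom_density()

# draw a kernel density plot per prog
# (the + must end the line so that R continues the expression)
ggplot(d, aes(x = write)) + geom_density() +
  facet_wrap(~ prog)

# inspect univariate and bivariate
# relationships using a scatter plot matrix
# (ggpairs is in the GGally package)
library(GGally)
ggpairs(d[, 7:11])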

Exercise in R

First Exercise in R

"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)










First Exercise in R

Data Import… …/sleep.csv


1. What is the average lifespan of the animals?

2. Which species lives the longest?

3. Can we have a histogram of lifespan?

4. What is the correlation between lifespan and size of an animal?

5. Can we have a full correlation matrix of all variables (see figure 1)?

6. Can we have a scatter plot of species size vs. danger factor (see figure 2)?

Figure 1

Figure 2
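One possible way to approach the six questions is sketched below. The column names of sleep.csv are not given in these slides, so the names species, body_wt, life_span and danger are assumptions, and the small data frame contains invented values that only stand in for the real data set.

```r
# Toy stand-in for sleep.csv: invented values, assumed column names.
sleep <- data.frame(
  species   = c("A", "B", "C", "D", "E"),
  body_wt   = c(0.5, 3, 60, 250, 2500),
  life_span = c(3, 12, 20, 30, 60),
  danger    = c(3, 4, 2, 5, 1)
)

# 1. Average lifespan (na.rm = TRUE skips missing values)
mean_life <- mean(sleep$life_span, na.rm = TRUE)

# 2. Species with the longest lifespan
oldest <- as.character(sleep$species[which.max(sleep$life_span)])

# 3. Histogram of lifespan
hist(sleep$life_span, main = "Lifespan", xlab = "Years")

# 4. Correlation between lifespan and body weight
r_life_size <- cor(sleep$life_span, sleep$body_wt, use = "complete.obs")

# 5. Correlation matrix of all numeric variables
num_cols <- sapply(sleep, is.numeric)
cor_matrix <- cor(sleep[, num_cols], use = "pairwise.complete.obs")

# 6. Scatter plot of body weight vs. danger factor
plot(sleep$danger, sleep$body_wt,
     xlab = "Danger factor", ylab = "Body weight")
```

With the real file, the data.frame() call would be replaced by a read.csv() call and the column names adjusted to those in the file.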

Lecture Summary & Homework

Lessons Learned

CRISP-DM is a widely adopted, standardized process for data mining projects.

Ex-ante definition of success criteria is essential for successful projects.

Data understanding and preparation are typically the most costly and time-consuming phases in CRISP-DM (~80%).

CRISP-DM is an iterative approach. Certain phases are likely to be passed through multiple times (e.g. modelling and evaluation).


Learn R

Interactive Web Training: http://tryr.codeschool.com/

Learn R in R with Swirl: http://swirlstats.com/students.html

Swirl Courses: https://github.com/swirldev/swirl_courses#swirl-courses

Tips & Tricks

Tips & Tricks: https://www.stat.wisc.edu/network-skills/learnR#guide

R by example: http://www.mayin.org/ajayshah/KB/R/

R Tutorials: https://ww2.coastal.edu/kingw/statistics/R-tutorials/


#1 R blog to subscribe: http://www.r-bloggers.com/

Get Prepared (Homework)

Take the full course = retype and execute each command. Enjoy…


Get prepared for next lesson: Install Knime on your PC/Laptop.

Any Questions?

