HSD Hochschule Düsseldorf

University of Applied Sciences

Fachbereich Wirtschaftswissenschaften

Faculty of Business Studies

IT Applications in Business Analytics

Business Analytics (M.Sc.)

IT in Business Analytics

SS2016 / Lecture 04 – The R Programming Language

Thomas Zeutschler

HSD Faculty of Business Studies

Thomas Zeutschler

Associate Lecturer

Let’s get started…

SS 2016 - IT Applications in Business Analytics - 4. The R Programming Language

Introduction

The R Programming Language

R is a statistical programming language developed by Ross Ihaka and Robert Gentleman and introduced in 1993.

R provides a wide variety of statistical and graphical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …).

R is open source, highly extensible and runs on all major platforms.

Today, R is one of the most widely used software ecosystems for statistical analysis.

The R system contains two major components:

1. Base system: contains the R language software and the high-priority add-on packages.
2. User-contributed add-on packages.

R includes…

an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large collection of intermediate tools for data analysis,
graphical facilities for data analysis and display, either on-screen or on hardcopy, and
a simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
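
A few of the language features listed above can be tried out with a minimal sketch like the following. The function name `factorial_rec` and all values are made up for illustration and are not part of the lecture material.

```r
# Operators for calculations on matrices
m <- matrix(1:4, nrow = 2)      # filled column-wise: columns (1, 2) and (3, 4)
m_t <- t(m)                     # transpose
m_prod <- m %*% m_t             # matrix product
m_elem <- m * 2                 # element-wise multiplication

# Conditional inside a loop: add up the even numbers from 1 to 10
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) {
    total <- total + i
  }
}

# User-defined recursive function (hypothetical name)
factorial_rec <- function(n) {
  if (n <= 1) return(1)
  n * factorial_rec(n - 1)
}

# Output facilities
print(m_prod)
cat("Sum of even numbers from 1 to 10:", total, "\n")
cat("5! =", factorial_rec(5), "\n")
```

Running the lines one by one in the R console is probably the easiest way to see what each construct does.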

Who uses the R Language?

Data scientists and analysts, statisticians, mathematicians.

All scientists, researchers and (product) developers who deal with data, especially in the natural sciences (medicine, biology) and the social sciences.

R is used especially often in developing countries, because it gives universal free access to state-of-the-art tools for statistical data analysis.

R is very widely used for teaching statistics to undergraduate and graduate students, because it is free of charge.

Many software vendors integrate R to provide advanced statistical capabilities from within their products.

Statistical software: SAS, SPSS, Statistica, KNIME, RapidMiner, Mathematica etc.
Relational databases: Oracle, SAP HANA, Microsoft SQL Server, IBM DB2 etc.
Big data and NoSQL databases: Hadoop, MongoDB, Cassandra etc.
LOB (line-of-business) applications, e.g. ERP, CRM: SAP, Microsoft etc.

R Popularity

[Figure: Google Trends chart, April 2016]

R Ecosystem

Packages are collections of R functions, data and/or compiled code.

CRAN, "The Comprehensive R Archive Network", is the central repository for publicly available R packages.

https://cran.r-project.org/

8,300 different packages available (as of April 2016)

R Ecosystem – Packages

[Table: some R packages ordered by date of creation]

Many packages are constantly updated and very reliable.

The community is the reason for the success of R.

R Ecosystem – Packages (most popular 2015)

1. Rcpp: Seamless R and C++ Integration (693,288 downloads)
2. ggplot2: An Implementation of the Grammar of Graphics
3. stringr: Simple, Consistent Wrappers for Common String Operations
4. plyr: Tools for Splitting, Applying and Combining Data
5. digest: Create Cryptographic Hash Digests of R Objects
6. reshape2: Flexibly Reshape Data: A Reboot of the Reshape Package
7. colorspace: Color Space Manipulation
8. RColorBrewer: ColorBrewer Palettes
9. manipulate: Interactive Plots for RStudio
10. scales: Scale Functions for Visualization
11. labeling: Axis Labeling
12. proto: Prototype Object-Based Programming
13. munsell: Munsell Colour System
14. gtable: Arrange Grobs in Tables
15. dichromat: Color Schemes for Dichromats
16. mime: Map Filenames to MIME Types
17. RCurl: General Network (HTTP/FTP/...) Client Interface for R
18. bitops: Bitwise Operations
19. zoo: S3 Infrastructure for Regular and Irregular Time Series
20. knitr: A General-Purpose Package for Dynamic Report Generation in R (295,528 downloads)

RStudio

Native R is a console application; RStudio is a wrapper around it for convenience…

R Basics

Source: http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Variables

# Declaration and usage of variables
A <- 2
B <- 3
x <- seq(0, 2*pi, 0.1)
y <- sin(x)
# Attention: R is case sensitive

Simple Mathematics

1 + 2
sin(2*3)

Charting

# Plot y against x with a title, a subtitle and axis labels
plot(x, y, main="Sine Plot",
     sub="made with R",
     xlab="x-axis",
     ylab="y-axis")

R Basics – Install and use packages

Using Packages

Installing Packages (remove the #)

Automatic Load and (if required) Installation of a Package
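
The code for these three steps is not part of this transcript. A minimal sketch of what they might look like is given below; the helper name `load_or_install` is hypothetical, and `ggplot2` and `tools` are only example package names.

```r
# Using packages: attach an installed package to the current session
library(utils)                   # 'utils' ships with every R installation

# Installing packages from CRAN (remove the # to run; needs internet access)
# install.packages("ggplot2")

# Automatic load and (if required) installation of a package
# (load_or_install is a hypothetical helper name)
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)        # only reached if the package is missing
  }
  library(pkg, character.only = TRUE)
}

load_or_install("tools")         # 'tools' is also part of base R
```

`requireNamespace()` only checks whether the package can be loaded, so the install step is skipped for packages that are already present.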

Loading Data

Assign Data to Objects

Accessing Data
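
The slide's own code for loading, assigning and accessing data is not included in this transcript. The following sketch shows one possible version of the three steps in base R; the data set and the object names (`students`, `scores`, `first_row`) are invented for illustration.

```r
# Loading data: read.csv() accepts a file path, a URL or, as here, inline text
csv_text <- "id,name,score
1,Anna,82
2,Ben,67
3,Clara,91"
students <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Assigning data to objects
scores <- students$score          # one column as a vector
first_row <- students[1, ]        # one row as a one-row data frame

# Accessing data
students$name[2]                  # second value of column 'name'
students[3, "score"]              # row 3, column 'score'
students[students$score > 80, ]   # all rows with a score above 80
head(students, 2)                 # the first two rows
```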

Accessing Data continued / Saving Data
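
As a rough illustration of further data access and of saving data (the original code is not in this transcript), the sketch below uses a small invented data frame `d_small` and temporary files, so nothing is written to the working directory.

```r
# Small made-up data frame for illustration
d_small <- data.frame(id = 1:4,
                      group = c("a", "b", "a", "b"),
                      value = c(10, 20, 30, 40),
                      stringsAsFactors = FALSE)

# Accessing data continued
d_small[, c("id", "value")]            # select columns by name
subset(d_small, group == "a")          # select rows by condition
d_small$double <- d_small$value * 2    # add a derived column

# Saving data as CSV (portable) and as RDS (R's own format, keeps types)
csv_file <- tempfile(fileext = ".csv")
write.csv(d_small, csv_file, row.names = FALSE)
rds_file <- tempfile(fileext = ".rds")
saveRDS(d_small, rds_file)

# Reading the saved data back in
d_csv <- read.csv(csv_file, stringsAsFactors = FALSE)
d_rds <- readRDS(rds_file)
```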

Simple Data Analysis

# filter() used below comes from the dplyr package
library(dplyr)

d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
# return the number of observations (rows) and variables (columns) in d
dim(d)
# get the structure of d, including the class (type) of all variables
str(d)
# return the distributional summaries of variables in the dataset
summary(d)
# return a summary of the dataset for all rows where variable 'read' >= 60
summary(filter(d, read >= 60))
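
If dplyr is not installed, the same row filter can be written in base R. The sketch below uses a small invented data frame `toy` (not the hsb2 data) so that it runs without internet access.

```r
# Made-up data for illustration; the real hsb2 data has more rows and columns
toy <- data.frame(id = 1:5,
                  read = c(45, 60, 72, 58, 65),
                  write = c(50, 62, 70, 55, 59))

# Base R equivalent of filter(toy, read >= 60): logical row indexing
high_readers <- toy[toy$read >= 60, ]

# The same selection with subset()
high_readers2 <- subset(toy, read >= 60)

dim(high_readers)       # number of rows and columns after filtering
summary(high_readers)   # distributional summaries of the filtered rows
```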

Charting

# load the lattice charting package
library(lattice)
# draw a simple scatter plot
xyplot(read ~ write, data = d)
# conditioned scatter plot
xyplot(read ~ write | prog, data = d)
# box and whisker plots
bwplot(read ~ factor(prog), data = d)

More Charting (ggplot2 package)

library(ggplot2)
# draw a kernel density plot
ggplot(d, aes(x = write)) + geom_density()
# draw a kernel density plot per prog
# (the + must end the line, otherwise R stops reading the expression there)
ggplot(d, aes(x = write)) + geom_density() +
  facet_wrap(~ prog)
# inspect univariate and bivariate relationships using a scatter plot matrix
# (ggpairs() comes from the GGally package)
library(GGally)
ggpairs(d[, 7:11])

Exercise in R

First Exercise in R

"Sleep in Mammals: Ecological and Constitutional Correlates" by Allison, T. and Cicchetti, D. (1976)

https://www.stat.auckland.ac.nz/~stats330/datasets.dir/sleep.txt

…/sleep.csv

Source: https://www.stat.auckland.ac.nz/~stats330/datasets.dir/

Training Video: https://www.youtube.com/watch?v=Uo1C7Iligw0

Data Import… …/sleep.csv

1. What is the average lifespan of the animals?
2. Which species lives longest?
3. Can we have a histogram of lifespan?
4. What is the correlation between the lifespan and the size of an animal?
5. Can we have a full correlation matrix of all variables (see Figure 1)?
6. Can we have a scatter plot of species size vs. danger factor (see Figure 2)?

Figure 1: correlation matrix of all variables

Figure 2: scatter plot of species size vs. danger factor
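
One possible way to approach the six questions is sketched below. It does not use the real sleep data: the data frame `sleep_toy`, its column names (`species`, `body_wt`, `life_span`, `danger`) and all values are invented for illustration, so the column names of the actual file will have to be substituted after import.

```r
# Invented stand-in for the imported data (values are NOT from the dataset)
sleep_toy <- data.frame(
  species   = c("Mouse", "Cat", "Horse", "Elephant", "Bat"),
  body_wt   = c(0.02, 3.3, 521, 2547, 0.01),
  life_span = c(3.2, 28, 46, 69, NA),       # NA marks a missing value
  danger    = c(3, 1, 5, 3, 1),
  stringsAsFactors = FALSE
)

# 1. Average lifespan, ignoring missing values
avg_life <- mean(sleep_toy$life_span, na.rm = TRUE)

# 2. Species with the longest lifespan
oldest <- sleep_toy$species[which.max(sleep_toy$life_span)]

# 3. Histogram of lifespan
hist(sleep_toy$life_span, main = "Lifespan", xlab = "years")

# 4. Correlation between lifespan and body size
cor_size_life <- cor(sleep_toy$body_wt, sleep_toy$life_span,
                     use = "complete.obs")

# 5. Correlation matrix of all numeric variables
num_cols <- sapply(sleep_toy, is.numeric)
cor_matrix <- cor(sleep_toy[, num_cols], use = "pairwise.complete.obs")

# 6. Scatter plot of body size (log scale) vs. danger factor
plot(sleep_toy$body_wt, sleep_toy$danger, log = "x",
     xlab = "body weight", ylab = "danger factor")
```

The `na.rm` and `use` arguments matter because real-world data often contains missing values; without them `mean()` and `cor()` return `NA`.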

Lecture Summary & Homework

Lessons Learned

CRISP-DM is a widely adopted and standardized process for data mining projects.

Ex-ante definition of success criteria is essential for successful projects.

Data understanding and preparation are typically the most costly and time-consuming phases (~80%) in CRISP-DM.

CRISP-DM is an iterative approach. Certain phases (modelling and evaluation) are likely to be passed through multiple times.

Resources

Learn R

Interactive Web Training: http://tryr.codeschool.com/

Learn R in R with Swirl: http://swirlstats.com/students.html

Swirl Courses: https://github.com/swirldev/swirl_courses#swirl-courses

Tips & Tricks

Tips & Tricks: https://www.stat.wisc.edu/network-skills/learnR#guide

R by example: http://www.mayin.org/ajayshah/KB/R/

R Tutorials: https://ww2.coastal.edu/kingw/statistics/R-tutorials/

Blogs

#1 R blog to subscribe: http://www.r-bloggers.com/

Get Prepared (Homework)

Take the full course: retype and execute each command. Enjoy…

http://www.ats.ucla.edu/stat/r/seminars/intro.htm

Get prepared for the next lesson: install KNIME on your PC/laptop.

Any Questions?
