
Post on 26-Mar-2015


Page 1: Dr Andy Pryke - The Data Mine Ltd An Introduction to R Free software for repeatable statistics, visualisation and modeling Dr Andy Pryke, The Data Mine

Dr Andy Pryke - The Data Mine Ltd

An Introduction to R

Free software for repeatable statistics, visualisation and modeling

Dr Andy Pryke,

The Data Mine Ltd

[email protected]

Page 2

Outline

1. Overview
– What is R?
– When to use R?
– Wot no GUI?
– Help and Support

2. Examples
– Simple Commands
– Statistics
– Graphics
– Modeling and Mining
– SQL Database Interface

3. Going Forward
– Relevant Libraries
– Online Courses etc.

Page 3

What is R?

• Open source, well supported, command-line-driven statistics package
• 100s of extra “packages” available free
• Large number of users, particularly in bioinformatics and social science
• Good design - John Chambers received the ACM 1998 Software System Award for “S”, the language that R implements

Dr. Chambers' work “will forever alter the way people analyze, visualize, and manipulate data…”

Page 4

When Should I Use R?

• To do a full cycle of:
– data import
– data pre-processing
– exploratory statistics and graphics
– modeling and data mining
– report production
– integration into other systems

• Or any one of these steps, e.g. just to standardise pre-processing of data
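The cycle listed above can be sketched in a few lines of base R. This is only an illustration: the CSV file is a stand-in written from the built-in iris data, and the names `csv_path`, `flowers`, `species_means`, `fit` and `report_path` are hypothetical, not part of the original slides.

```r
# Stand-in for an external data source: write the built-in iris data to a CSV file
csv_path <- file.path(tempdir(), "iris.csv")
write.csv(iris, csv_path, row.names = FALSE)

# 1. Data import
flowers <- read.csv(csv_path, stringsAsFactors = TRUE)

# 2. Pre-processing: drop incomplete rows
flowers <- flowers[complete.cases(flowers), ]

# 3. Exploratory statistics: mean petal length per species
species_means <- aggregate(Petal.Length ~ Species, data = flowers, FUN = mean)

# 4. Modeling: petal length in terms of sepal length
fit <- lm(Petal.Length ~ Sepal.Length, data = flowers)

# 5. Report production: save the printed results as a text file
report_path <- file.path(tempdir(), "iris_report.txt")
writeLines(capture.output(print(species_means), print(coef(fit))), report_path)
```

Because every step is a line in a script, the whole cycle can be re-run unchanged when the input data is updated.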

Page 5

Wot no GUI? or “The Advantages of Scripting”

• Repeatable
• Debug-able
• Documentable
• Build on previous work
• Automation
– Report generation
– Website or system integration
– Links from Perl, Python, Java, C, TCP/IP…
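As a rough sketch of the automation point, a script can take its input from the command line so that other systems can call it. The file name `summarise.R` and the function `summarise_file` are hypothetical names chosen for this example.

```r
# summarise.R (hypothetical file name). Run as: Rscript summarise.R input.csv

# Return the mean of every numeric column in a CSV file
summarise_file <- function(path) {
  data <- read.csv(path)
  numeric_cols <- data[sapply(data, is.numeric)]
  colMeans(numeric_cols)
}

# Only act when a readable file name was supplied on the command line
args <- commandArgs(trailingOnly = TRUE)
if (length(args) >= 1 && file.exists(args[1])) {
  print(summarise_file(args[1]))
}
```

The same script gives the same answer every time it is run on the same file, which is the "repeatable" advantage in practice.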

Page 6

Help and Support

• Built-in help/example system (e.g. type “?plot”)
• Many tutorials available free
• R-Help mailing list
– Archived online
– Key R developers respond
– Contributors understand statistical concepts
• Large user community
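A few ways into the built-in help system, as a short sketch. The search term "correlation" and the pattern "^cor" are arbitrary choices for illustration; the help-page calls are wrapped in `interactive()` because they open a pager or browser.

```r
if (interactive()) {
  # Open the help page for plot (the same as typing ?plot)
  help("plot")
  # Search the installed help pages for a keyword
  help.search("correlation")
}

# Run the examples from the help page for mean
example(mean)

# List objects whose names start with "cor"
matches <- apropos("^cor")
print(matches)
```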

Page 7

Simple Commands

1+1

2

10*3

30

c(1,2,3)

1 2 3

c(1,2,3)*10

10 20 30

x <- 5

x*x

25

exp(1)

2.718282

q()

Save workspace image? [y/n/c]: n

Page 8

Simple Statistics

colnames(iris)

"Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

plot(iris$Sepal.Length, iris$Petal.Length)

# Pearson Correlation

cor(iris$Sepal.Length, iris$Petal.Length)

0.8717538

# Spearman Correlation

cor(rank(iris$Sepal.Length), rank(iris$Petal.Length))

0.8818981
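The Spearman calculation on this slide ranks the data by hand. As a small check, `cor()` also accepts a `method` argument, and the two routes should agree; the variable names below are chosen for this example only.

```r
# Spearman correlation computed by ranking first, as on the slide
spearman_manual <- cor(rank(iris$Sepal.Length), rank(iris$Petal.Length))

# The same statistic requested directly from cor()
spearman_builtin <- cor(iris$Sepal.Length, iris$Petal.Length, method = "spearman")

# TRUE when both approaches give the same value
print(all.equal(spearman_manual, spearman_builtin))
```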

Page 9

Graphics

[Figure: mosaic plot of hair colour (Black, Brown, Red, Blond), eye colour (Brown, Blue, Hazel, Green) and sex (Male, Female), shaded by Pearson residuals from -4.33 to 7.61; p-value < 2.22e-16]

[Figure: scatterplot matrix “Edgar Anderson's Iris Data” of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]

[Figure: pie chart “January Pie Sales” with slices Blueberry, Cherry, Apple, Boston Cream, Other and Vanilla Cream]
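Figures of this kind can be approximated with base R graphics. This is a sketch, not the code behind the original slide: the mosaic plot uses base `mosaicplot()` on the built-in `HairEyeColor` table, which may differ from the function used for the slide, and the pie slice sizes are made-up illustrative values because the slide gives none.

```r
# Draw into a PDF file so the script also works without a display
fig_path <- file.path(tempdir(), "r_graphics_examples.pdf")
pdf(fig_path)

# Mosaic plot of hair colour, eye colour and sex, shaded by residuals
mosaicplot(HairEyeColor, shade = TRUE, main = "Hair, Eye and Sex")

# Scatterplot matrix of the four iris measurements
pairs(iris[1:4], main = "Edgar Anderson's Iris Data")

# Pie chart; the slice sizes are illustrative values, not data from the slide
pie_sales <- c(Blueberry = 0.12, Cherry = 0.30, Apple = 0.26,
               "Boston Cream" = 0.16, Other = 0.04, "Vanilla Cream" = 0.12)
pie(pie_sales, main = "January Pie Sales")

# Close the device so the file is written
invisible(dev.off())
```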

Page 10

Linear Models

## Scatterplot of Sepal and Petal Length

plot(iris$Sepal.Length, iris$Petal.Length)

## Make a Model of Petals in terms of Sepals

irisModel <- lm(iris$Petal.Length ~ iris$Sepal.Length)

## plot the model as a line

abline(irisModel)

[Figure: scatterplot of iris$Petal.Length against iris$Sepal.Length with the fitted model drawn as a line]
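Once fitted, the model can be inspected and used for prediction. The sketch below refits the same model with the `data =` form of `lm()`, a choice made here because it lets `predict()` accept new data; `irisModel2`, `r_squared` and `predictions` are names invented for this example.

```r
# Same model as on the slide, written with column names and a data argument
irisModel2 <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted line
print(coef(irisModel2))

# With a single predictor, R-squared equals the squared Pearson correlation
r_squared <- summary(irisModel2)$r.squared
print(all.equal(r_squared, cor(iris$Sepal.Length, iris$Petal.Length)^2))

# Predicted petal lengths for three new sepal lengths
predictions <- predict(irisModel2, newdata = data.frame(Sepal.Length = c(5, 6, 7)))
print(predictions)
```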

Page 11

Classification Trees

[Figure: conditional inference tree for the iris data. Node 1 splits on Petal.Length at 1.9 (p < 0.001), node 3 on Petal.Width at 1.7 (p < 0.001) and node 4 on Petal.Length at 4.8 (p < 0.001). Terminal nodes 2 (n = 50), 5 (n = 46), 6 (n = 8) and 7 (n = 46) each show a bar chart of species proportions from 0 to 1.]

# ctree comes from the "party" package
library("party")

# Model Species
irisct <- ctree(Species ~ . , data = iris)

# Show the model tree
plot(irisct)

# Compare predictions
table(predict(irisct), iris$Species)
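The prediction table can be reduced to a single accuracy figure by dividing the diagonal (correct predictions) by the total. This sketch assumes the party package is installed and skips itself otherwise; `confusion` and `accuracy` are names chosen for this example. Note that it measures accuracy on the training data, which tends to be optimistic.

```r
# Run only when the party package is available (assumed installed, not part of base R)
if (requireNamespace("party", quietly = TRUE)) {
  library("party")

  # Fit the tree and tabulate predicted against actual species
  irisct <- ctree(Species ~ ., data = iris)
  confusion <- table(predicted = predict(irisct), actual = iris$Species)
  print(confusion)

  # Proportion of flowers on the diagonal, i.e. classified correctly
  accuracy <- sum(diag(confusion)) / sum(confusion)
  print(accuracy)
}
```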

Page 12

SQL Interface

Connect to databases with ODBC

library("RODBC")

# Open a connection to the ODBC data source named "PostgreSQL30w"
channel <- odbcConnect("PostgreSQL30w", case="postgresql")

# Write the iris data frame to a database table
sqlSave(channel, iris, tablename="iris")

# Read the table back into a data frame, then close the connection
myIris <- sqlQuery(channel, "select * from iris")

odbcClose(channel)

Page 13

Data Mining Libraries (i)

randomForest
– Random forests - Robust prediction

party
– Conditional inference trees - Statistically principled
– Model-based partitioning - Advanced regression
– cForests - Random forests with ctrees

e1071
– Naïve Bayes, Support Vector Machines, Fuzzy Clustering and more...

Page 14

Data Mining Libraries (ii)

nnet
– Feed-forward Neural Networks
– Multinomial Log-Linear Models

BayesTree
– Bayesian Additive Regression Trees

gafit & rgenoud
– Genetic Algorithm based optimisation

varSelRF
– Variable selection using random forests

Page 15

Data Mining Libraries (iii)

arules
– Association Rules (links to ‘C’ code)

RWeka
– Access to the many data mining algorithms found in the open source package “Weka”

dprep
– Data pre-processing
– You can easily write your own functions too.

Bioconductor
– Multiple packages for analysis of genomic (and biological) data
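As an illustration of writing your own pre-processing function, the sketch below rescales every numeric column of iris to the range 0 to 1 using base R only. The function name `rescale01` and the variable `irisScaled` are hypothetical; this is not code from dprep.

```r
# Rescale a numeric vector so its minimum maps to 0 and its maximum to 1
# (a constant vector would give NaN, since its range has zero width)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply the function to the numeric columns only, leaving Species untouched
numeric_cols <- sapply(iris, is.numeric)
irisScaled <- iris
irisScaled[numeric_cols] <- lapply(iris[numeric_cols], rescale01)

print(summary(irisScaled$Sepal.Length))
```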

Page 16

Sources of Further Information

Download these slides + the examples & find links to online courses in R here:

http://www.andypryke.com/pub/R

Page 18

Editors which Link to R

• Rgui (not really a GUI)

• Emacs (with “ESS” mode)

• Rcmdr

• Tinn-R

• jgr - Ja

• SciViews

• and more...