Dr Andy Pryke - The Data Mine Ltd
An Introduction to R
Free software for repeatable statistics, visualisation and modeling
Dr Andy Pryke,
The Data Mine Ltd
Outline
1. Overview
   What is R?
   When to use R?
   Wot no GUI?
   Help and Support
2. Examples
   Simple Commands
   Statistics
   Graphics
   Modeling and Mining
   SQL Database Interface
3. Going Forward
   Relevant Libraries
   Online Courses etc.
What is R?
• Open source, well supported, command-line driven statistics package
• 100s of extra “packages” available free
• Large number of users - particularly in bio-informatics and social science
• Good Design - John Chambers received the ACM 1998 Software System Award for “S”, the language that R implements
Dr. Chambers' work “will forever alter the way people analyze, visualize, and manipulate data…”
When Should I Use R?
• To do a full cycle of:
– data import
– data pre-processing
– exploratory statistics and graphics
– modeling and data mining
– report production
– integration into other systems
• Or any one of these steps, e.g. just to standardise pre-processing of data
Wot no GUI? or “The Advantages of Scripting”
• Repeatable
• Debug-able
• Documentable
• Build on previous work
• Automation
– Report generation
– Website or system integration
– Links from Perl, Python, Java, C, TCP/IP…
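A minimal sketch of what "repeatable" can mean in practice: the whole analysis lives in a script, so re-running it on the same data gives the same result. The file name summarise_iris.R, the function name summariseBySpecies and the output location are assumptions made for illustration; they are not from the original slides.

```r
# summarise_iris.R (hypothetical file name, for illustration only)

# Mean of every numeric column, grouped by species
summariseBySpecies <- function(data) {
  aggregate(. ~ Species, data = data, FUN = mean)
}

report <- summariseBySpecies(iris)
print(report)

# Write the summary to a CSV file in the session's temporary directory
outFile <- file.path(tempdir(), "iris_summary.csv")
write.csv(report, outFile, row.names = FALSE)
```

Saved to a file, a script like this can be run unattended with "Rscript summarise_iris.R", which is what makes scheduled report generation and system integration possible.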
Help and Support
Built in help/example system (e.g. type “?plot”)
Many tutorials available free
R-Help mailing list
- Archived online
- Key R developers respond
- Contributors understand statistical concepts
Large User Community
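A few ways into the built-in help system, as a small sketch. The interactive forms are shown as comments because they open a help viewer; the variable name helpTopics is only illustrative.

```r
# Interactive forms (open a help page or search results):
#   ?plot
#   help("plot")
#   help.search("correlation")

# Find objects on the search path whose names start with "plot"
helpTopics <- apropos("^plot")
print(helpTopics)

# Show the arguments a function accepts
print(args(cor))

# Run the examples from a function's help page
example(mean)
```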
Simple Commands
> 1+1
[1] 2
> 10*3
[1] 30
> c(1,2,3)
[1] 1 2 3
> c(1,2,3)*10
[1] 10 20 30
> x <- 5
> x*x
[1] 25
> exp(1)
[1] 2.718282
> q()
Save workspace image? [y/n/c]: n
Simple Statistics
> colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> plot(iris$Sepal.Length, iris$Petal.Length)
> # Pearson Correlation
> cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
> # Spearman Correlation
> cor(rank(iris$Sepal.Length), rank(iris$Petal.Length))
[1] 0.8818981
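The Spearman correlation is the Pearson correlation of the ranks, which is why ranking both columns by hand works. As a short sketch, cor() also takes a method argument that gives the same value directly; the variable names below are illustrative.

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Spearman correlation computed by hand: Pearson correlation of the ranks
rankBased <- cor(rank(x), rank(y))

# The same statistic via the method argument of cor()
builtIn <- cor(x, y, method = "spearman")

print(all.equal(rankBased, builtIn))

# cor.test() adds a significance test and confidence interval (Pearson here)
print(cor.test(x, y))
```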
Graphics
[Figure: mosaic plot of Hair, Eye and Sex, shaded by Pearson residuals (range -4.33 to 7.61), p-value < 2.22e-16]
[Figure: scatterplot matrix "Edgar Anderson's Iris Data" with panels Sepal.Length, Sepal.Width, Petal.Length and Petal.Width]
[Figure: pie chart "January Pie Sales" with slices Blueberry, Cherry, Apple, Boston Cream, Other and Vanilla Cream]
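The three figures on this slide (a mosaic plot, "Edgar Anderson's Iris Data" and "January Pie Sales") can be approximated with base R graphics, as sketched below. Two assumptions to flag: the original mosaic plot, with its Pearson residual legend and p-value, looks like output from an add-on package, whereas this sketch swaps in base mosaicplot(); and the pie proportions are the ones recalled from the pie() help page example, so treat them as illustrative values rather than the slide's data.

```r
# Mosaic plot of hair colour, eye colour and sex, shaded by residuals
# (base graphics stand-in for the original figure)
mosaicplot(HairEyeColor, shade = TRUE, main = "Hair, Eye and Sex")

# Scatterplot matrix of the four numeric iris columns
pairs(iris[1:4], main = "Edgar Anderson's Iris Data")

# Pie chart; proportions are illustrative (assumed from the pie() help example)
pieSales <- c(0.12, 0.30, 0.26, 0.16, 0.04, 0.12)
names(pieSales) <- c("Blueberry", "Cherry", "Apple",
                     "Boston Cream", "Other", "Vanilla Cream")
pie(pieSales, main = "January Pie Sales")
```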
Linear Models
## Scatterplot of Sepal and Petal Length
plot(iris$Sepal.Length, iris$Petal.Length)
## Make a Model of Petals in terms of Sepals
irisModel <- lm(iris$Petal.Length ~ iris$Sepal.Length)
## plot the model as a line
abline(irisModel)
[Figure: scatterplot of iris$Petal.Length against iris$Sepal.Length with the fitted regression line]
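Once a linear model is fitted, its coefficients and fit quality can be inspected and it can predict for new values. The sketch below refits the same model using the data argument, since predict() needs plain column names to match new data against; the names irisModel2 and newSepals, and the example sepal lengths, are illustrative.

```r
# Same model as above, written with a data argument so predict() can be used
irisModel2 <- lm(Petal.Length ~ Sepal.Length, data = iris)

# Intercept and slope of the fitted line
print(coef(irisModel2))

# Proportion of variance explained by the model
rSquared <- summary(irisModel2)$r.squared
print(rSquared)

# Predicted petal lengths for some new sepal lengths (illustrative values)
newSepals <- data.frame(Sepal.Length = c(5, 6, 7))
predictedPetals <- predict(irisModel2, newdata = newSepals)
print(predictedPetals)
```

For a single-predictor model like this one, R-squared equals the square of the Pearson correlation between the two columns.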
Classification Trees
# ctree() comes from the "party" package
library("party")
# Model Species
irisct <- ctree(Species ~ . , data = iris)
# Show the model tree
plot(irisct)
# Compare predictions
table(predict(irisct), iris$Species)
[Figure: conditional inference tree for Species. Node 1 splits on Petal.Length (p < 0.001) at 1.9, node 3 on Petal.Width (p < 0.001) at 1.7, node 4 on Petal.Length (p < 0.001) at 4.8; terminal nodes are Node 2 (n = 50), Node 5 (n = 46), Node 6 (n = 8) and Node 7 (n = 46), each shown as a bar chart of class proportions]
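Comparing predictions against the same rows the tree was trained on tends to flatter the model. A hedged sketch of a simple hold-out check follows; it assumes the party package is installed (the code is skipped otherwise), and the seed, the 100/50 split and the variable names are arbitrary choices for illustration.

```r
if (requireNamespace("party", quietly = TRUE)) {
  library("party")

  # Reproducible random split: 100 rows for training, 50 held out
  set.seed(42)
  trainRows <- sample(nrow(iris), 100)
  train <- iris[trainRows, ]
  test <- iris[-trainRows, ]

  # Fit on the training rows only
  holdoutTree <- ctree(Species ~ ., data = train)

  # Predict species for the held-out rows and tabulate against the truth
  predicted <- predict(holdoutTree, newdata = test)
  confusion <- table(predicted = predicted, actual = test$Species)
  print(confusion)

  # Share of held-out rows classified correctly
  holdoutAccuracy <- sum(diag(confusion)) / sum(confusion)
  print(holdoutAccuracy)
}
```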
SQL Interface
Connect to databases with ODBC
library("RODBC")
channel <- odbcConnect("PostgreSQL30w", case="postgresql")
sqlSave(channel,iris, tablename="iris")
myIris <- sqlQuery(channel, "select * from iris")
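A sketch of the same round trip wrapped in a function that checks the connection and always closes it. The function name roundTripIris is hypothetical, and the code assumes that the RODBC package is installed and that an ODBC data source such as "PostgreSQL30w" has been configured on the machine; nothing connects until the function is called.

```r
# Hypothetical helper: save iris to a database and read it back
roundTripIris <- function(dsn) {
  channel <- RODBC::odbcConnect(dsn, case = "postgresql")

  # odbcConnect() returns -1 rather than a connection object on failure
  if (!inherits(channel, "RODBC")) {
    stop("could not connect to data source: ", dsn)
  }

  # Close the connection when the function exits, even after an error
  on.exit(RODBC::odbcClose(channel), add = TRUE)

  RODBC::sqlSave(channel, iris, tablename = "iris")
  RODBC::sqlQuery(channel, "select * from iris")
}

# Usage, assuming the data source exists:
# myIris <- roundTripIris("PostgreSQL30w")
```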
Data Mining Libraries (i)
randomForest
– Random forests - Robust prediction
party
– Conditional inference trees - Statistically principled
– Model-based partitioning - Advanced regression
– cForests - Random Forests with ctrees
e1071
– Naïve Bayes, Support Vector Machines, Fuzzy Clustering and more...
Data Mining Libraries (ii)
nnet
– Feed-forward Neural Networks
– Multinomial Log-Linear Models
BayesTree
– Bayesian Additive Regression Trees
gafit & rgenoud
– Genetic Algorithm based optimisation
varSelRF
– Variable selection using random forests
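As one example from this list, a multinomial log-linear model can be fitted to the iris data with multinom() from nnet. This is a minimal sketch that assumes the nnet package is available (it is normally distributed with R, and the code is skipped otherwise); the variable names are illustrative, and the accuracy is measured on the training rows, so it is optimistic.

```r
if (requireNamespace("nnet", quietly = TRUE)) {
  # Fit Species as a function of all four measurements
  irisMultinom <- nnet::multinom(Species ~ ., data = iris, trace = FALSE)

  # Predicted class for each training row
  predictedSpecies <- predict(irisMultinom, type = "class")

  # Share of training rows classified correctly
  multinomAccuracy <- mean(predictedSpecies == iris$Species)
  print(multinomAccuracy)
}
```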
Data Mining Libraries (iii)
arules
– Association Rules (links to ‘C’ code)
RWeka
– Access to the many data mining algorithms found in open source package “Weka”
dprep
– Data pre-processing
– You can easily write your own functions too.
Bioconductor
– Multiple packages for analysis of genomic (and biological) data
Sources of Further Information
Download these slides + the examples
& find links to online courses in R here:
http://www.andypryke.com/pub/R
Editors which Link to R
• Rgui (not really a GUI)
• Emacs (with “ESS” mode)
• Rcmdr
• Tinn-R
• jgr - Ja
• SciViews
• and more...