big data in r - university of groningen · 2013-04-15 · big data in r nazia and fentaw university...
TRANSCRIPT
![Page 1: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/1.jpg)
Big data in R
Nazia and Fentaw
University of Groningen
JBI of Mathematics and Computer Science
April 15, 2013
Nazia and Fentaw rug
![Page 2: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/2.jpg)
Outline
1 Memory usage and R
2 biglm package
3 bigmemory package
Nazia and Fentaw rug
![Page 3: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/3.jpg)
R holds all objects in memory
Limited amount of memory that can be used by all objects
Memory error message while working with big data in R
“cannot allocate vector of size _ MB”
memory.size() % amount currently in use
memory.limit() % memory limit
[1] 1535 MB
memory.limit(size=1800) % increase memory limit (?)
[1] 1800 MB
Nazia and Fentaw rug
![Page 4: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/4.jpg)
R objects and memory usage (my experience)
object.sizes()
net z ycor ybay
1071880 400024 218688 205824
xtn yy Y X
200120 200112 180120 180120
arth800.expr cor outscad BB1
177656 168580 160552 152192
initial_5 initial_4 initial_3 initial_2
139356 139356 139356 139356
initial_1 data_long ybay1 B.net
139356 135168 131488 124040
arth800.mexpr loglik.full test.results arth800.descr
106824 101308 99008 97240
LL L LLtheta bd
83556 83244 80200 80112
bb a2 a1 nw_full
80112 80112 80112 67360 ...
Nazia and Fentaw rug
![Page 5: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/5.jpg)
Memory usage and R
net
zycorybay
xtnyy
Y
X
arth800.expr
cor
outscad
BB1
initial_5
initial_4
initial_3
initial_2
initial_1data_long
ybay1B.netarth800.mexprloglik.full test.resultsarth800.descrLLL LLthetabdbb
a2a1
nw_fullarth800.symbolarth800.namexinew
yarth800.probetest.resultslfdrBhat_dctfBhat_dcmdatapcor.dynuptomegaLomega1LthetaBtrueZvalsigma1lwtDestBhat_btf99Bhat_btfBhat_bm99Bhat_bmdata_matrixbaROC1ROC2panel.corarr2listmatrix.indexlogllxttxvalue1valueobject.sizesvar_mattranssum1sumsigmagcovariancecorr1loglitemna.var_matna.transna.sum1na.sumout7latentdata_sout8xtx0etuuuauuamucheckb99checkbcheckltprobcthetarho.oldthetarhoz1xirepbetaordmeanBTP1oTP1bsqdistdsqdistb99sqdistbSENdSENb99SENbrhoprho_vecrho.oldmSPEdmSPEb99mSPEblam2lam1L2norm_out1oL2norm_out1binitialFP1oFP1bv_beta.oldv_betav.beta.oldv.betau4u3u2u1sexbetaB1average.resultsalpha2alpha1idv5_d1.oldv5_d1v5_beta.oldv5_alpha2.oldv5_alpha1.oldv5_alpha1v4_d1.oldv4_d1v4_beta.oldv4_alpha2.oldv4_alpha1.oldv4_alpha1v3_d1.oldv3_d1v3_beta.oldv3_alpha2.oldv3_alpha1.oldv3_alpha1v2_d1.oldv2_d1v2_beta.oldv2_alpha2.oldv2_alpha1.oldv2_alpha1v1_d1.oldv1_d1v1_beta.oldv1_alpha2.oldv1_alpha1.oldv1_alpha1ttitotaltit_probtSPEdSPEb99SPEbsd_4sd_3sd_2sd_1rho34rho24rho23rho14rho13rho12rho_s.oldrho_srho_45.oldrho_45rho_35.oldrho_35rho_34.oldrho_34rho_25.oldrho_25rho_24.oldrho_24rho_23.oldrho_23rho_15.oldrho_15rho_14.oldrho_14rho_13.oldrho_13rho_12.oldrho_12rpomegaold.valuenum.signew.valuen.simNnKkjigenderdeltadcn
Memory usage by object
Nazia and Fentaw rug
![Page 6: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/6.jpg)
rm(list=ls())
# create dummy variables for memory usage demonstration
x <- 1:1000
y <- 1:10000
z <- 1:100000
object.sizes <- function()
{
return(rev(sort(sapply(ls(envir=.GlobalEnv),
function (object.name)
object.size(get(object.name))))))
}
object.sizes()
z y x object.sizes
400024 40024 4024 3580
Nazia and Fentaw rug
![Page 7: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/7.jpg)
Memory usage and R
z y x object.sizes
Memory usage by object
Variable name
Byte
s
0e+
00
1e+
05
2e+
05
3e+
05
4e+
05
Nazia and Fentaw rug
![Page 8: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/8.jpg)
Data types stored as R object, eg. matrix
z <- matrix(0,1e+06,3) ##numeric/double type
object.size(Z)
24000112 bytes #24MB
W <- matrix(as.integer(0),1e+06,3) ##integer
object.size(w)
12000112 bytes #12MB
Nazia and Fentaw rug
![Page 9: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/9.jpg)
Big data packages
biglm
bigmemory, biganalytics, bigtabulate
snow
ff
filehash
mapReduce
HadoopStreaming, etc.
Nazia and Fentaw rug
![Page 10: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/10.jpg)
biglm package: bounded memory linear and generalized
linear models
Question: How it works?
load data into memory in chunks
process loaded data and update the sufficient statisticrequired for the model
dispose the loaded chunk and load the next chunk
repeat until end of file
Nazia and Fentaw rug
![Page 11: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/11.jpg)
bigglm(formula, data, family, ...,chunksize=5000)
formula A model formula
data eg. data frame
family A glm family object: gaussian, gamma, poisson
chunksize Size of chunks for processng the data frame...
Nazia and Fentaw rug
![Page 12: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/12.jpg)
mydat_r <- matrix(rnorm(15000000,0,2),5000000,3)
colnames(mydat_r) <- c("first","second","third")
mydat_rdf <- as.data.frame(mydat_r)
> object.size(mydat_rdf)
120MB
## fit linear model using lm
linear.mod <- lm( first ~ second + third, data=mydat_rdf)
Error: cannot allocate vector of size 114.4 Mb
## fit linear model using biglm
biglinear.mod <- bigglm( first ~ second + third,
data=mydat_rdf, family=gaussian(), chunksize=100000)
Nazia and Fentaw rug
![Page 13: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/13.jpg)
> summary(biglinear.mod)
Large data regression model: bigglm(first ~ second + third,
data = mydat_rdf, family = gaussian(),
chunksize = 1e+05)
Sample size = 5e+06
Coef (95% CI) SE p
(Intercept) -2e-04 -0.0020 0.0016 9e-04 0.8198
second 3e-04 -0.0006 0.0012 4e-04 0.4925
third -6e-04 -0.0014 0.0003 4e-04 0.2165
elapsed time is 21.370000 seconds
Nazia and Fentaw rug
![Page 14: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/14.jpg)
bigmemory package: manage massive matrices with shared
memory and memory-mapped files
bigmemory package allows powerful and memory-efficientanalyses
Permits storage of large objects (matrices, etc) in memory (onthe RAM) using pointer objects to refer them
Data sets may be file-backed (stored on har drive/disk), toeasily manage and analyze datasets larger than available RAM
several R processes on the same computer can also share bigmemory objects
Nazia and Fentaw rug
![Page 15: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/15.jpg)
bigmemory package
bigmemory creates a variable X < − big.matrix,
X is apointer to the dataset saved in the RAM or the harddrive
when X is passed as an argument to a function, it isessentially providing call-by reference rather thancall-by-value behavior
big.matrix is an R object that simply points to a datastructure in C++
The bigmemory package allows users to create matrices thatare stored on disk, rather than in RAM. When an element isneeded, it is read from the disk and cached in RAM.
Nazia and Fentaw rug
![Page 16: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/16.jpg)
x <- big.matrix(nrow, ncol,
type = options()$bigmemory.default.type,
init = NULL, dimnames = NULL, separated = FALSE,
backingfile = NULL, backingpath = NULL,
descriptorfile = NULL,
shared = TRUE)
x <- filebacked.big.matrix(nrow, ncol,
type = options()$bigmemory.default.type, init = NULL,
dimnames = NULL, separated = FALSE,
backingfile = NULL,
backingpath = NULL, descriptorfile = NULL)
x <- as.big.matrix(x, type = NULL, separated = FALSE,
backingfile = NULL, backingpath = NULL,
descriptorfile = NULL,
shared=TRUE)
Nazia and Fentaw rug
![Page 17: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/17.jpg)
big.matrix is local to a single R process and is limited byavailable RAM
big.matrix(shared.true) is like big.matrix but can be sharedamong multiple R processes.
filebacked.bigmatrix : point to a file on disk containing thematrix and the file can be shared across clusters.
Nazia and Fentaw rug
![Page 18: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/18.jpg)
bigmemory package
x a matrix, vector, or data.frame for as.big.matrix;
nrow number of rows.
ncol number of columns.
type the type of the atomic element (options: “double",“integer", “short", or“char").
init a scalar value for initializing the matrix
Nazia and Fentaw rug
![Page 19: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/19.jpg)
bigmemory package
dimnames a list of the row and column names; use withcaution for large objects.
separated use separated column organization of the data.
backingfile the root name for the file(s) for the cache of x.
backingpath the path to the directory containing the filebacking cache.
descriptorfile the name of the file to hold the backingfiledescription, for subsequent use with attach.big.matrix.
shared if TRUE, the resulting big.matrix can be sharedacross processes.
Nazia and Fentaw rug
![Page 20: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/20.jpg)
> z <- big.matrix(3, 4, type="integer", init=5)
> z
An object of class "big.matrix"
Slot "address":
<pointer: 0x06148810>
> z[,]
[,1] [,2] [,3] [,4]
[1,] 5 5 5 5
[2,] 5 5 5 5
[3,] 5 5 5 5
> dim(z)
[1] 3 4
> z[2,2] <- as.integer(3)
> z[,]
[,1] [,2] [,3] [,4]
[1,] 5 5 5 5
[2,] 5 3 5 5
[3,] 5 5 5 5
Nazia and Fentaw rug
![Page 21: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/21.jpg)
> zz <- filebacked.big.matrix(3, 4, type="integer", init=5,
+ backingfile="example0.bin", descriptorfile="example0.desc",
+ dimnames=NULL) ##shared =TRUE always
> zz
An object of class "big.matrix"
Slot "address":
<pointer: 0x017d4238>
> zz[,]
[,1] [,2] [,3] [,4]
[1,] 5 5 5 5
[2,] 5 5 5 5
[3,] 5 5 5 5
Nazia and Fentaw rug
![Page 22: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/22.jpg)
Open a second R session
> yy <- attach.big.matrix("example0.desc")
> yy
An object of class "big.matrix"
Slot "address":
<pointer: 0x017d46d8>
> yy[,]
[,1] [,2] [,3] [,4]
[1,] 5 5 5 5
[2,] 5 5 5 5
[3,] 5 5 5 5
Nazia and Fentaw rug
![Page 23: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/23.jpg)
Second R session
> yy[2,2] <- 100 ##change a value
> yy[,]
[,1] [,2] [,3] [,4]
[1,] 5 5 5 5
[2,] 5 100 5 5
[3,] 5 5 5 5
> Check the first R session
> Notice that zz = yy
Nazia and Fentaw rug
![Page 24: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/24.jpg)
write.big.matrix(x, filename, row.names = FALSE,
col.names = FALSE, sep=‘‘,’’)
read.big.matrix(filename, sep = ‘‘,’’ , header = FALSE,
col.names = NULL, row.names = NULL,
has.row.names=FALSE, ignore.row.names=FALSE,
type = NA, skip = 0, separated = FALSE,
backingfile = NULL, backingpath = NULL,
descriptorfile = NULL, extraCols = NULL,
shared=TRUE)
Nazia and Fentaw rug
![Page 25: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/25.jpg)
# Without specifying the type, big.matrix x will hold integers.
> x <- as.big.matrix(matrix(1:10, 5, 2))
> x[2,2] <- NA
> x[,]
[,1] [,2]
[1,] 1 6
[2,] 2 NA
[3,] 3 8
[4,] 4 9
[5,] 5 10
> write.big.matrix(x, "sample.txt")
#Read it back in as character (1-byte integers):
> y <- read.big.matrix("sample.txt", type="char")
> y[,]
[,1] [,2]
[1,] 1 6
[2,] 2 NA
[3,] 3 8
[4,] 4 9
[5,] 5 10
Nazia and Fentaw rug
![Page 26: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/26.jpg)
Pitfalls of big.matrix
only one type of data structure
nonnumeric data values (no guarantee)
> w <- as.big.matrix(matrix(1:10, 5, 2), type="double")
> w[,]
[,1] [,2]
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
> w[1,2] <- NA
> w[2,2] <- -Inf
> w[3,2] <- "b"
Warning message:
In SetElements.bm(x, i, j, value) : NAs introduced by coercion
> w[4,2] <- NaN
> w[,]
[,1] [,2]
[1,] 1 NA
[2,] 2 -Inf
[3,] 3 NA
[4,] 4 NaN
[5,] 5 10
Nazia and Fentaw rug
![Page 27: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/27.jpg)
> write.big.matrix(w, "nonnumeric.txt")
> w1 <- read.big.matrix("nonnumeric.txt", type="double")
> w1[,]
[,1] [,2]
[1,] 1 NA
[2,] 2 -1
[3,] 3 NA
[4,] 4 NA
[5,] 5 10
> w2 <- read.big.matrix("nonnumeric.txt", type="integer")
> w3 <- read.big.matrix("nonnumeric.txt", type="char")
> w4 <- read.big.matrix("nonnumeric.txt", type="short")
Nazia and Fentaw rug
![Page 28: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/28.jpg)
library(bigmemory)
library(biganalytics)
mydata.big = as.big.matrix(mydat_r) ## from R matrix
> mydata.big
An object of class "big.matrix"
Slot "address":
<pointer: 0x06109940>
> options(bigmemory.allow.dimnames=TRUE)
##permission to change col names
> colnames(mydata.big) = c("first","second","third")
> head(mydata.big)
first second third
[1,] -1.8358754 -2.9763708 0.3542316
[2,] -1.4135088 1.5770370 -1.1113071
[3,] -2.0222139 -1.4661885 -4.0510420
[4,] -2.3542718 1.1143254 5.1107734
[5,] 0.9703098 3.4126123 0.5251787
[6,] 0.3535078 -0.7306655 1.3299772
Nazia and Fentaw rug
![Page 29: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/29.jpg)
## fit linear model using biglm.big.matrix
>bigmatrix.linear.mod <- biglm.big.matrix(first ~ second + third,
data=mydata.big)
> summary(bigmatrix.linear.mod)
Large data regression model: biglm(formula = formula,
data = data, ...)
Sample size = 5000000
Coef (95% CI) SE p
(Intercept) -2e-04 -0.0020 0.0016 9e-04 0.8198
second 3e-04 -0.0006 0.0012 4e-04 0.4925
third -6e-04 -0.0014 0.0003 4e-04 0.2165
elapsed time is 7.300000 seconds
Nazia and Fentaw rug
![Page 30: Big data in R - University of Groningen · 2013-04-15 · Big data in R Nazia and Fentaw University of Groningen JBI of Mathematics and Computer Science April 15, 2013 Nazia and Fentaw](https://reader033.vdocument.in/reader033/viewer/2022052800/5f0f73547e708231d4443917/html5/thumbnails/30.jpg)
Thank you!
Nazia and Fentaw rug