
MANIPULATING BIG DATA IN R
ROBERT J. CARROLL

MAY 21, 2014

This document introduces the data.table package for fast manipulation of big data objects. This is but one option among a few, so we begin by considering the full constellation of big data options. With that addressed, we move on to using data.table for a variety of tasks, including subsetting, group joins, and expression evaluation.

Data are getting bigger and bigger, and the most basic tools¹ we learn for data manipulation quickly become obsolete as data sets grow large.

¹ By basic, I mean basic: subsetting data, applying functions, and so on.

But even if we restrict the conversation to orthodox data, manipulation can become unwieldy: those in IR know all too well that 50 years' worth of directed dyads is enough to crash a computer unless one takes proper care.

I want this tutorial to be something useful, and as a result I have made the decision to restrict my attention to a single R package that helps make big data manipulation easier: the data.table package. I have chosen data.table because it is the most incremental improvement over what we already do: a data.table is a kind of data object, just like a matrix or the more familiar data.frame. In fact, a data.table is a data.frame: anything that can be done with the latter can be done with the former as well, and data.tables even pass the is.data.frame() test. But data.tables have added functionality that makes them easy to use in big data contexts.² Given the similarities between data.table and the existing data.frame, it is possible to change your approach to R programming. Put differently: when working with data by myself, I only work with data.tables.

² These are not a panacea. The package will only be of use to you up to, say, 100GB. Beyond that, you'd want to use a more specialized platform like Hadoop.

The tutorial proceeds as follows. First, just to make sure all bases are covered, I will briefly list some of the better-known big data options. Then I will introduce the data.table package, going through each of the inputs to its [ operator in turn. The tutorial is an extension of the vignettes provided by the developers of data.table, and the interested reader is referred there for more details.

Your big data options

Hadley Wickham³ listed five options for working with big data in R:

³ In case it is not obvious, Hadley Wickham is a well-known R developer and statistician. It was he that gave us ggplot, reshape, plyr, and many other packages.

1. Sampling: if your data are problematically large, then take out a manageable sample and run your models on the sample. I like this option, but others object to discarding a potentially large majority of the observations.

2. Bigger hardware: R can incorporate a lot of memory; on a 64-bit machine, up to 4TB. But the fact that more memory is available does not mean that R's capabilities are being used as well as possible.

3. Store objects on a hard disc. This is what many of us do when we parallelize: keep one big data file, break it into chunks, perform analysis on each of the chunks, and then synthesize. You can do this sort of thing using the ff package and the ffbase set of functions.

4. Integration of higher-performing programming languages. Most notably, you can bring C++ into the equation; people who have used the MCMCpack package know that much of the analysis is performed using the Scythe library in C++. This is a fine option, but it certainly isn't incremental to the skills most R users have, as it requires knowing something about the advanced language in question.

5. Alternative interpreters. R interpreters can implement (or reimplement) R on other platforms, for example in Java. But again, this is not a good incremental option.

What is a regular R user to do for that cumbersome calculation needed to finish a dissertation chapter? So long as the data are not too large (say, under 100GB or so), most things can be done within the confines of R without resorting to new languages. But this often requires doing something a little more advanced. The most well-known option seems to be Wickham's plyr, but this is not the only approach. plyr is a set of functions that quickly breaks large data objects into chunks, performs some function(s) on the chunks, and then synthesizes. This is a fine option, and many of the functions (e.g. dlply) are standard tools in the kit. However, plyr is not always the fastest standard option, and quite often the data.table option is much faster. What's more, instead of learning a set of functions, we will be learning a different approach to manipulation. That said, the best option is probably to work with data.tables, then incorporate reshape and plyr as needed.

For both data.table and plyr, loops are discarded. For data.table,vectorizing is discarded.
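The chunk-and-synthesize pattern that plyr automates can be written out by hand in base R. The following is a minimal sketch of the idea on toy data, not the package's internals:

```r
## split-apply-combine by hand: the pattern plyr (and data.table) streamlines
df <- data.frame(g = rep(c("a", "b"), each = 3), x = 1:6)

## 1. split the data into chunks by group
chunks <- split(df, df$g)

## 2. apply a function to each chunk
sums <- lapply(chunks, function(d) sum(d$x))

## 3. synthesize the results back into one object
out <- data.frame(g = names(sums), sum = unlist(sums))
## out$sum is 6 for group "a" and 15 for group "b"
```

The point of both packages is to make step 2 fast and steps 1 and 3 invisible.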

Preliminaries for the tutorial

We will use the following packages:

    #### preliminaries ####
    library(data.table)
    library(plyr)
    library(ggplot2)
    library(reshape2)

For the present purposes, it will be easiest to make a somewhat large data object and then manipulate it a few ways. The introductory vignettes provide a nice, simple matrix to build upon. After, we will have some real-data demonstrations. Accordingly, I make a 10,000,068-row object as follows:

    > #### make a very large data frame ####
    >
    > ### set the size
    > size <- ceiling(1e7/26^2)
    >
    > ### make the object. here, variables x and y are characters that will serve as
    > ### potential indicators to be subsetted by, etc. w, v, and z are all
    > ### numeric.
    > tt <- system.time(
    +
    +   DF <- data.frame(x = rep(LETTERS, each = 26 * size),
    +                    y = rep(letters, each = size),
    +                    w = rpois(size * 26^2, lambda = 1),
    +                    v = runif(size * 26^2),
    +                    z = rnorm(size * 26^2),
    +                    stringsAsFactors = F
    +                    )
    +
    + );tt
       user  system elapsed
       4.10    0.41    4.49

Throughout, I will make constant use of the system.time() command, which, as you may have deduced, tells you how long a given task took.

Here DF is a data.frame with two character indicators: x is uppercase letters and y is lowercase letters. Each unique uppercase/lowercase combination has 14,793 observations, for a total of 10,000,068 rows. w, v, and z are all randomly-chosen numbers. To see what the data look like:

    > ### see the data and its dimensions
    > head(DF)
      x y w         v           z
    1 A a 2 0.6531179  0.51545958
    2 A a 2 0.8353646 -0.66023503
    3 A a 1 0.7110683  0.72828359
    4 A a 2 0.6086215  2.32045624
    5 A a 0 0.1605766  0.05041016
    6 A a 1 0.2185821 -0.65137406
    > tail(DF)
             x y w          v          z
    10000063 Z z 2 0.64095400  0.3615745
    10000064 Z z 0 0.52693859  1.7547271
    10000065 Z z 0 0.06636169  0.2180307
    10000066 Z z 0 0.92682736  0.9488914
    10000067 Z z 1 0.24955435 -1.2584311
    10000068 Z z 0 0.55297120 -1.1169271
    > dim(DF)
    [1] 10000068        5

Now, we will make a new object, DT, that is a data.table version of the same object.⁴

⁴ You will ordinarily read data in as a data.table by the fread command, which we will discuss later.

    > ### make a data.table version
    > DT <- data.table(DF)
    >
    > ### note that the data.table version looks a little different
    > DT
              x y w          v           z
           1: A a 2 0.65311793  0.51545958
           2: A a 2 0.83536463 -0.66023503
           3: A a 1 0.71106833  0.72828359
           4: A a 2 0.60862148  2.32045624
           5: A a 0 0.16057663  0.05041016
          ---
    10000064: Z z 0 0.52693859  1.75472706
    10000065: Z z 0 0.06636169  0.21803070
    10000066: Z z 0 0.92682736  0.94889145
    10000067: Z z 1 0.24955435 -1.25843110
    10000068: Z z 0 0.55297120 -1.11692707

Observe that, if you ask to see an object of class data.table that is over 100 rows, you will automatically be shown a summarized version like what you see above.⁵

⁵ The row numbers are different in a data.table, and you will find that, unlike their data.frame counterparts, they always get reset. Other than that, there is no major difference.
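As the margin note says, fread is the usual way to read data in as a data.table. A minimal sketch, assuming the data.table package is installed and using a throwaway temporary file rather than a real data set:

```r
library(data.table)

## write a tiny CSV to a temporary file, purely for illustration
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(x = c("A", "B"), v = c(0.1, 0.2)), csv, row.names = FALSE)

## fread() returns a data.table directly, and is much faster than
## read.csv() on large files
dt <- fread(csv)
class(dt)  # "data.table" "data.frame"
```

On files in the gigabyte range, the speed difference over read.csv() is dramatic.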

At any point, you can use the tables() command to see how many data.tables you've got cooking, along with some basic statistics:

    > ### we can see how many data.tables are currently in play using the
    > ### tables() command
    > tables()
         NAME       NROW  MB COLS      KEY
    [1,] DT   10,000,068 382 x,y,w,v,z
    Total: 382MB

So, even though this data set is trivial, it's still 382MB. Most importantly, a data.table is a data.frame. Observe:

    > ### most importantly, a data.table IS a data.frame
    > is.data.frame(DT)
    [1] TRUE

Almost all of the functionality we are used to applies:

    > ### and it works normally:
    > mean(DT$w)
    [1] 1.000054
    >
    > ### you can still call out the rows like normal
    > DT[1:10,]
        x y w         v           z
     1: A a 2 0.6531179  0.51545958
     2: A a 2 0.8353646 -0.66023503
     3: A a 1 0.7110683  0.72828359
     4: A a 2 0.6086215  2.32045624
     5: A a 0 0.1605766  0.05041016
     6: A a 1 0.2185821 -0.65137406
     7: A a 0 0.8731887  0.27886137
     8: A a 0 0.8399500  0.16626315
     9: A a 2 0.4460591 -1.38831287
    10: A a 0 0.1154040 -2.55948095

The thing that doesn't work: numeric appeals to columns.

    > ### ...but not the columns
    > DT[,1:2]
    [1] 1 2

What?! Why did that happen? As we will see, data.table has an entirely different set of functionality built into the [ apparatus, and the second entry, that is, the j entry, is an expression rather than an appeal.⁶ More on that later. First, we will discuss how to use the first entry, that is, the i entry, to subset data.

⁶ That is, data.table did as it was told: we asked it to evaluate 1:2, and it gave us just that.

How to subset: the i entry

data.table searches for values by binary search rather than by vector search. When you do your familiar x == x sort of thing, R must make an n × 1 vector of TRUEs and FALSEs, find the TRUEs, and then go get the values where TRUE holds. It's quite cumbersome. On the other hand, using binary search, R does not have to go through every value. To illustrate, let's work through the Wikipedia explanation. (You don't have to read this for data.table to be useful.) Suppose we have the following vector: x = {1, 3, 4, 6, 8, 9, 11}, and we want to identify which entry is equal to 4. When vectorizing, we must create a vector that looks like the following: [F F T F F F F]. That's 7 operations, one for each entry. With binary search, we sort the vector (here, already done), and then start from the middle and work backwards. Since we're looking for 4, we start with the middle value of x, 6, and then observe that 6 is bigger than 4. Accordingly, we cut x into the half of x that is less than the middle score of 6, x′ = {1, 3, 4}, the middle value of which is 3. 4 is bigger than 3, so we look at the half of x′ that is greater than 3. This is just x′′ = {4}, which is equal to our desired value. So, we are done. This took us only 3 operations: 6 == 4, 3 == 4, and 4 == 4.
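The three comparisons above can be traced with a short base-R sketch of binary search. This is an illustration of the idea only, not data.table's C implementation:

```r
## binary search on a sorted vector; returns the index of target, or NA
binsearch <- function(x, target) {
  lo <- 1L; hi <- length(x)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (x[mid] == target) {
      return(mid)
    } else if (x[mid] > target) {
      hi <- mid - 1L            # discard the upper half
    } else {
      lo <- mid + 1L            # discard the lower half
    }
  }
  NA_integer_
}

x <- c(1, 3, 4, 6, 8, 9, 11)
binsearch(x, 4)   # 3: it checks x[4] = 6, then x[2] = 3, then x[3] = 4
```

Each failed comparison halves the remaining search space, which is why the gains grow with the size of the data.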

So, for a binary search algorithm, we assign a "key" indicator to serve as our sorting variable. data.table then sorts the data based on that key. We then use DT[i,] to use the key to pull out data. An example will help. Suppose we want to pull out the subset of data where x = "R" and y = "c". The traditional way takes a long time with such big data:

    > ### subsetting via vector scan
    > tt <- system.time(
    +
    +   ans1 <- DF[DF$x == "R" & DF$y == "c",]
    +
    + );tt
       user  system elapsed
       0.94    0.04    0.97

So, merely making a subset took a second. Why? Well, R had to make a 10,000,068-by-1 vector for x, another 10,000,068-by-1 vector for y, combine them into a joint conditional, and then apply that to the data. Conversely, for data.table, we begin by setting x and y as the "key" variables, which is reflected in the tables() output:

    > ### set a key---this now serves as the "sorting" variable
    > setkey(DT,x,y)
    > tables()
         NAME       NROW  MB COLS      KEY
    [1,] DT   10,000,068 382 x,y,w,v,z x,y
    Total: 382MB

Now subsetting is fast.

    > ### subsetting via binary search. here J("R","c") "passes" the two-column,
    > ### one-row data.table into DT, a process known as joining.
    > ss <- system.time(
    +
    +   ans2 <- DT[J("R", "c")]
    +
    + );ss
       user  system elapsed
       0.02    0.00    0.01

We get the same result either way:

    > mapply(identical, ans1, ans2)
       x    y    w    v    z
    TRUE TRUE TRUE TRUE TRUE

So, all the columns are identical. Note that this scales: the bigger the data, the bigger the gains from using binary search. You can do this for as many keys at a time as you would like, though it gets hard to keep track.

You may be asking: what is this J() thing? That simplifies the notation greatly. We are passing another data.table into DT so that they "join up." To see what I mean, consider the following:

    > foo <- data.table("R","c")
    > foo
       V1 V2
    1:  R  c
    > mapply(identical, ans2, DT[foo])
       x    y    w    v    z
    TRUE TRUE TRUE TRUE TRUE
    >
    > ### the comma is optional
    > mapply(identical, ans2, DT[foo,])
       x    y    w    v    z
    TRUE TRUE TRUE TRUE TRUE
    >
    > ### could have just used data.table:
    > mapply(identical, ans2, DT[data.table("R","c")])
       x    y    w    v    z
    TRUE TRUE TRUE TRUE TRUE

For a single variable, you need not use such notation.

    > ### even simpler for a single variable.
    > setkey(DT, x)
    > DT["R"]
            x y w          v          z
         1: R a 0 0.07252648 -0.3446414
         2: R a 1 0.48033637  1.9893531
         3: R a 2 0.36097745  1.4174725
         4: R a 1 0.98904894  0.2934811
         5: R a 3 0.94297707  0.2300404
        ---
    384614: R z 1 0.73335683  0.3167247
    384615: R z 1 0.17367247 -0.9046357
    384616: R z 1 0.20860220  0.7953754
    384617: R z 2 0.68243385  1.8762872
    384618: R z 1 0.78726622 -0.1740556
    > DT[LETTERS[18]]
            x y w          v          z
         1: R a 0 0.07252648 -0.3446414
         2: R a 1 0.48033637  1.9893531
         3: R a 2 0.36097745  1.4174725
         4: R a 1 0.98904894  0.2934811
         5: R a 3 0.94297707  0.2300404
        ---
    384614: R z 1 0.73335683  0.3167247
    384615: R z 1 0.17367247 -0.9046357
    384616: R z 1 0.20860220  0.7953754
    384617: R z 2 0.68243385  1.8762872
    384618: R z 1 0.78726622 -0.1740556

Observe that we can pass something like LETTERS[18] into i. This ishandy in case you do need to wrap something into a loop, which hopefullyyou will not have to do.

You can still use vector scans in data.table, but they will not be fast.

    > ### you can use data.table badly
    > DT[DT$x == "R"]
            x y w          v          z
         1: R a 0 0.07252648 -0.3446414
         2: R a 1 0.48033637  1.9893531
         3: R a 2 0.36097745  1.4174725
         4: R a 1 0.98904894  0.2934811
         5: R a 3 0.94297707  0.2300404
        ---
    384614: R z 1 0.73335683  0.3167247
    384615: R z 1 0.17367247 -0.9046357
    384616: R z 1 0.20860220  0.7953754
    384617: R z 2 0.68243385  1.8762872
    384618: R z 1 0.78726622 -0.1740556

How to apply functions by group: the j entry

We have learned what happens before the first comma in [. We now move to the second entry: j. j captures many different functionalities, but the most important thing to remember is that it is an expression. So, most basically, j is how you ask for a column or columns, but you do this by name rather than by column number:

    > ### it's how you get columns called out:
    > gladys <- DT[,x]
    > head(gladys)
    [1] "A" "A" "A" "A" "A" "A"
    > tail(gladys)
    [1] "Z" "Z" "Z" "Z" "Z" "Z"
    > DT[,list(x,y,z)]
              x y           z
           1: A a  0.51545958
           2: A a -0.66023503
           3: A a  0.72828359
           4: A a  2.32045624
           5: A a  0.05041016
          ---
    10000064: Z z  1.75472706
    10000065: Z z  0.21803070
    10000066: Z z  0.94889145
    10000067: Z z -1.25843110
    10000068: Z z -1.11692707

Observe that you must ask for a list of variable names, rather than a vector.

But you can do much more than just ask for variables. For example, you can create new variables; to add them to the existing object, use the := command:

    > ### it's how you make new variables
    > DT[,u := rchisq(nrow(DT), df = 10)]
    > DT
              x y w          v           z         u
           1: A a 2 0.65311793  0.51545958  9.876925
           2: A a 2 0.83536463 -0.66023503  9.284063
           3: A a 1 0.71106833  0.72828359 16.482668
           4: A a 2 0.60862148  2.32045624  8.015294
           5: A a 0 0.16057663  0.05041016  8.002566
          ---
    10000064: Z z 0 0.52693859  1.75472706  8.477444
    10000065: Z z 0 0.06636169  0.21803070  8.157248
    10000066: Z z 0 0.92682736  0.94889145  6.330571
    10000067: Z z 1 0.24955435 -1.25843110  4.646359
    10000068: Z z 0 0.55297120 -1.11692707  8.928785
    > DF$u <- DT$u

Importantly, if you don't use :=, then the new object you've created will not be attached to the object in question; instead, you just get a new object:

    > ### if we hadn't used := but rather =, we'd just get a new object
    > DT[,list(davis = rchisq(nrow(DT), df = 10),
    +          coltrane = 1:nrow(DT))
    +    ]
                  davis coltrane
           1: 10.049122        1
           2: 18.314233        2
           3: 11.352462        3
           4: 12.732192        4
           5: 12.928894        5
          ---
    10000064:  6.538454 10000064
    10000065:  8.429185 10000065
    10000066:  9.890027 10000066
    10000067:  5.632696 10000067
    10000068: 16.316146 10000068

In other words, DT has remained unaffected: davis and coltrane have not been added to it. If we want to add a few variables to an existing object, we use the notation `:=`, as follows:

    > ### did you want a few more variables?
    > DT[, `:=`(mingus = u^2, evans = 16-w)]
    > DT
              x y w          v           z         u    mingus evans
           1: A a 2 0.65311793  0.51545958  9.876925  97.55364    14
           2: A a 2 0.83536463 -0.66023503  9.284063  86.19383    14
           3: A a 1 0.71106833  0.72828359 16.482668 271.67835    15
           4: A a 2 0.60862148  2.32045624  8.015294  64.24493    14
           5: A a 0 0.16057663  0.05041016  8.002566  64.04106    16
          ---
    10000064: Z z 0 0.52693859  1.75472706  8.477444  71.86705    16
    10000065: Z z 0 0.06636169  0.21803070  8.157248  66.54069    16
    10000066: Z z 0 0.92682736  0.94889145  6.330571  40.07612    16
    10000067: Z z 1 0.24955435 -1.25843110  4.646359  21.58865    15
    10000068: Z z 0 0.55297120 -1.11692707  8.928785  79.72320    16

To remove variables, the notation is straightforward:

    > DT[,c("mingus", "evans") := NULL]
    > DT
              x y w          v           z         u
           1: A a 2 0.65311793  0.51545958  9.876925
           2: A a 2 0.83536463 -0.66023503  9.284063
           3: A a 1 0.71106833  0.72828359 16.482668
           4: A a 2 0.60862148  2.32045624  8.015294
           5: A a 0 0.16057663  0.05041016  8.002566
          ---
    10000064: Z z 0 0.52693859  1.75472706  8.477444
    10000065: Z z 0 0.06636169  0.21803070  8.157248
    10000066: Z z 0 0.92682736  0.94889145  6.330571
    10000067: Z z 1 0.24955435 -1.25843110  4.646359
    10000068: Z z 0 0.55297120 -1.11692707  8.928785

Note that you can use i and j in tandem; for example, if you wanted to change the values only for a certain set of x and y, it would look like:

    > ### did you only want to change the value of a subset of an existing
    > ### variable?
    > setkey(DT, x, y)
    > DT[J("R","c"), v := as.double(1:nrow(DT[J("R","c")]))]
    > DT[J("R","c")]
           x y w     v          z         u
        1: R c 0     1 -0.1939731 13.995871
        2: R c 0     2 -0.1662939 15.745517
        3: R c 0     3  0.3822362  7.413997
        4: R c 0     4  0.5299911 12.620969
        5: R c 0     5  0.1672528 10.445113
       ---
    14789: R c 1 14789  0.3722119 14.948018
    14790: R c 2 14790 -1.3209145  6.215501
    14791: R c 0 14791 -0.7928462  9.453355
    14792: R c 1 14792 -0.3500285 13.742152
    14793: R c 3 14793  0.3661481 10.280751

This is especially helpful if you only want to change, say, a single country’spanel in time-series cross-sectional data.
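For instance, updating a single country's panel might look like the following sketch, where the country and year variables and the tiny example data are hypothetical, and the data.table package is assumed to be installed:

```r
library(data.table)

## hypothetical TSCS data: two countries, three years each
panel <- data.table(country = rep(c("USA", "CAN"), each = 3),
                    year    = rep(2001:2003, times = 2),
                    gdp     = as.double(1:6))
setkey(panel, country)

## rescale one country's series in place; all other rows are untouched
panel["CAN", gdp := gdp * 100]
panel["CAN"]
```

The keyed i entry selects the panel and the := in j edits it in place, with no copy of the full data set being made.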

Importantly, you can put any expression into j; you need not even return data!

    > ### but much more importantly, you can feed any R expression into j. example:
    > DT[J("R","c"), plot(v ~ u)]
    Empty data.table (0 rows) of 2 cols: x,y

This produced Figure 1.

[Figure 1: A scatterplot made within j, plotting v against u.]

Grouped expressions

We often want to perform a task grouped by a certain variable; this is often why we end up performing looped subsetting in the first place! This is where plyr and data.table are demonstrably better than how we would normally do business. The basic flavor is as follows:

    > ### you can apply functions by group.
    > DT[, sum(v), by = x]
        x           V1
     1: A 1.917867e+05
     2: B 1.924702e+05
     3: C 1.922589e+05
     4: D 1.923288e+05
     5: E 1.922437e+05
     6: F 1.925284e+05
     7: G 1.921200e+05
     8: H 1.921281e+05
     9: I 1.922380e+05
    10: J 1.922401e+05
    11: K 1.923190e+05
    12: L 1.924032e+05
    13: M 1.921449e+05
    14: N 1.921843e+05
    15: O 1.923918e+05
    16: P 1.922256e+05
    17: Q 1.923917e+05
    18: R 7.352803e+10
    19: S 1.925934e+05
    20: T 1.924088e+05
    21: U 1.924787e+05
    22: V 1.923010e+05
    23: W 1.920993e+05
    24: X 1.923995e+05
    25: Y 1.925537e+05
    26: Z 1.924239e+05
        x           V1

These are the respective sums of the variable v performed across values of x. We could do the same for x,y combinations:

    > ### or by multiple groups
    > DT[, sum(v), by = c("x", "y")]
          x y       V1
       1: A a 7363.569
       2: A b 7345.532
       3: A c 7361.257
       4: A d 7418.194
       5: A e 7371.340
      ---
    672: Z v 7452.867
    673: Z w 7357.909
    674: Z x 7416.210
    675: Z y 7342.406
    676: Z z 7398.265

Notably, this approach is demonstrably faster than tapply:

    > ### this is faster than most folks' practice:
    > tt <- system.time(
    +
    +   sum1 <- tapply(X = DT$v, INDEX = DT$x, FUN = sum)
    +
    + );tt
       user  system elapsed
       4.23    0.14    4.38
    >
    > ss <- system.time(
    +
    +   sum2 <- DT[,sum(v), by = x]
    +
    + );ss
       user  system elapsed
       0.17    0.01    0.19
    >
    > identical(as.vector(sum1), sum2$V1)
    [1] TRUE

You can also do multiple transformations using similar j notation as above:

    > ### you can do multiple transformations
    > DT[, list(mingus = sum(w), evans = prod(v)), by = x]
        x mingus evans
     1: A 384613     0
     2: B 383768     0
     3: C 384028     0
     4: D 385129     0
     5: E 384726     0
     6: F 384820     0
     7: G 383470     0
     8: H 385640     0
     9: I 384968     0
    10: J 384145     0
    11: K 383850     0
    12: L 384101     0
    13: M 384556     0
    14: N 384071     0
    15: O 384748     0
    16: P 385445     0
    17: Q 384993     0
    18: R 384712   Inf
    19: S 384272     0
    20: T 383660     0
    21: U 385329     0
    22: V 385094     0
    23: W 385271     0
    24: X 384423     0
    25: Y 385489     0
    26: Z 385283     0
        x mingus evans

You may sometimes find it easier to work with a subsetted version of the data as a list, that is, a list full of data.tables, one for every value of a given variable. Then you can use lapply and other similar functions as needed. To do that, we use the notation .SD. Let's see what that looks like:

    > DT[, lapply(.SD, sum), by = x, .SDcols = c("v", "w", "z")]
        x            v      w           z
     1: A 1.917867e+05 384613 -1209.63153
     2: B 1.924702e+05 383768  -491.23979
     3: C 1.922589e+05 384028   313.84671
     4: D 1.923288e+05 385129   -90.57516
     5: E 1.922437e+05 384726  -415.52676
     6: F 1.925284e+05 384820 -1902.85779
     7: G 1.921200e+05 383470  -109.40697
     8: H 1.921281e+05 385640   438.77160
     9: I 1.922380e+05 384968    57.81505
    10: J 1.922401e+05 384145    90.89442
    11: K 1.923190e+05 383850  -195.15367
    12: L 1.924032e+05 384101  -563.61040
    13: M 1.921449e+05 384556  -184.21719
    14: N 1.921843e+05 384071  -443.48006
    15: O 1.923918e+05 384748  -394.51991
    16: P 1.922256e+05 385445  -451.40151
    17: Q 1.923917e+05 384993  1014.98518
    18: R 7.352803e+10 384712   523.21781
    19: S 1.925934e+05 384272   766.85862
    20: T 1.924088e+05 383660  -725.40286
    21: U 1.924787e+05 385329  -544.52190
    22: V 1.923010e+05 385094  -147.51449
    23: W 1.920993e+05 385271    32.05419
    24: X 1.923995e+05 384423  1154.66628
    25: Y 1.925537e+05 385489   254.71810
    26: Z 1.924239e+05 385283  -302.45207
        x            v      w           z

Here we have used the .SD command to subset the data into a new object with 26 entries, and then applied the sum command to the specified columns of v, w, and z.
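Within each group, .SD is itself a data.table holding that group's rows. A small self-contained sketch on toy data (assuming data.table is installed) makes this visible:

```r
library(data.table)

toy <- data.table(g = c("a", "a", "b"), x = c(1, 2, 10))

## .SD is the Subset of Data for the current group: nrow(.SD) counts the
## rows per group, and max(.SD$x) takes a per-group maximum
res <- toy[, list(n = nrow(.SD), biggest = max(.SD$x)), by = g]
## res has one row per group: a (n = 2, biggest = 2), b (n = 1, biggest = 10)
```

Because .SD is a full data.table, anything you can do to a data.table, you can do per group.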

It need not be the case that you use this sort of notation to "collapse" the data; suppose you wanted to add a by-indicator percentile for z by both x and y. Easy:

    > ### note that this need not be a "collapsing" sort of transformation
    > DT[, zpercentile := rank(z)/length(z), by = c("x", "y")]
    > DT
              x y w          v           z         u zpercentile
           1: A a 2 0.65311793  0.51545958  9.876925   0.7083080
           2: A a 2 0.83536463 -0.66023503  9.284063   0.2631650
           3: A a 1 0.71106833  0.72828359 16.482668   0.7752315
           4: A a 2 0.60862148  2.32045624  8.015294   0.9900629
           5: A a 0 0.16057663  0.05041016  8.002566   0.5305888
          ---
    10000064: Z z 0 0.52693859  1.75472706  8.477444   0.9572771
    10000065: Z z 0 0.06636169  0.21803070  8.157248   0.5791929
    10000066: Z z 0 0.92682736  0.94889145  6.330571   0.8237680
    10000067: Z z 1 0.24955435 -1.25843110  4.646359   0.1039681
    10000068: Z z 0 0.55297120 -1.11692707  8.928785   0.1320895

A few applications

Just to show you some flexibility, here is a problem: suppose that you wanted to break these data down by values of x, run an OLS regression on each subset, and then save the coefficients. In 505, you might write up a script like the following:

    > ### the 505 way
    > tt <- system.time({
    +
    +   coefs1 <- matrix(NA, nrow = 26, ncol = 4)
    +
    +   for(i in 1:26){
    +
    +     piece <- DF[DF$x == LETTERS[i],]
    +     coefs1[i,] <- coef(lm(w ~ v + z + u, data = piece))
    +
    +   }
    +
    + });tt
       user  system elapsed
      39.55    3.39   43.14
    >
    > coefs1
               [,1]          [,2]          [,3]          [,4]
     [1,] 0.9967335  0.0043255313  0.0023009268  1.105387e-04
     [2,] 1.0025524 -0.0131911473  0.0002201425  1.839118e-04
     [3,] 0.9875414  0.0095193322  0.0005937307  6.161587e-04
     [4,] 1.0018826 -0.0030160729 -0.0009244225  9.532975e-05
     [5,] 0.9986710  0.0011057573 -0.0017969269  1.054551e-04
     [6,] 1.0070935 -0.0076919648 -0.0016824387 -2.726625e-04
     [7,] 0.9937026 -0.0048305683  0.0025053514  5.727455e-04
     [8,] 0.9998594 -0.0045061421  0.0021532664  5.047661e-04
     [9,] 1.0021244 -0.0002771667 -0.0005902548 -1.076862e-04
    [10,] 1.0022800 -0.0038178402 -0.0007240292 -1.601389e-04
    [11,] 0.9992882  0.0037992022  0.0011495521 -3.180637e-04
    [12,] 1.0007567 -0.0064986820  0.0027587647  1.154958e-04
    [13,] 1.0015299 -0.0108109919  0.0021047146  3.715152e-04
    [14,] 0.9968747  0.0048773806 -0.0007185543 -7.348219e-05
    [15,] 0.9976900  0.0033982248  0.0010579466  9.494114e-05
    [16,] 1.0029305 -0.0034994993 -0.0035536237  9.643136e-05
    [17,] 0.9976515  0.0062629682 -0.0014009404  1.943619e-05
    [18,] 1.0030362 -0.0050053556 -0.0002495498 -2.870706e-05
    [19,] 0.9972203  0.0052514141  0.0012352377 -7.514891e-05
    [20,] 0.9934930  0.0053785986 -0.0020267336  1.322008e-04
    [21,] 0.9967530  0.0036978510 -0.0008680150  3.244288e-04
    [22,] 0.9995008 -0.0048016468 -0.0009455275  4.138265e-04
    [23,] 1.0012034  0.0037786034 -0.0003363080 -1.392472e-04
    [24,] 1.0030570 -0.0066906694  0.0011964442 -2.206894e-05
    [25,] 1.0026660 -0.0049284783  0.0006725549  2.067052e-04
    [26,] 1.0052937  0.0002326185  0.0016023204 -3.679136e-04

That took 43 seconds, and it's not hard to think up slightly clunkier ways to write a script like this. But, at the very least, our plucky little for loop got the job done.
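The same by-group fit is a one-liner in data.table's j-and-by idiom. Here is a hedged sketch on small simulated data (assuming data.table is installed), separate from the timing comparisons in the text:

```r
library(data.table)

set.seed(1)
toy <- data.table(g = rep(c("a", "b"), each = 50),
                  x = rnorm(100))
toy[, y := 2 * x + rnorm(100)]   # true slope is 2 in both groups

## run one lm() per group inside j; as.list() spreads the coefficient
## vector across the columns of the result
fits <- toy[, as.list(coef(lm(y ~ x))), by = g]
fits   # one row per group: g, (Intercept), x
```

Because j is an expression evaluated once per group, any model-fitting call can sit there directly.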

Relatively savvy plyr users might use the following compact notation to get the job done much faster:

> ### plyr
> uu <- system.time(
+
+   coefs3 <- dlply(DF, "x", function(DF) coef(lm(w ~ v + z + u, data = DF)))
+
+ );uu
   user  system elapsed
  16.57    3.04   19.73
> coefs3
$A
 (Intercept)            v            z            u
0.9967334947 0.0043255313 0.0023009268 0.0001105387

$B
 (Intercept)             v            z            u
1.0025523992 -0.0131911473 0.0002201425 0.0001839118

$C
 (Intercept)            v            z            u
0.9875413678 0.0095193322 0.0005937307 0.0006161587

$D
  (Intercept)             v             z             u
 1.001883e+00 -3.016073e-03 -9.244225e-04  9.532975e-05

$E
 (Intercept)            v             z            u
0.9986709730 0.0011057573 -0.0017969269 0.0001054551

$F
 (Intercept)             v             z             u
1.0070935293 -0.0076919648 -0.0016824387 -0.0002726625

$G
 (Intercept)             v            z            u
0.9937026198 -0.0048305683 0.0025053514 0.0005727455

$H
 (Intercept)             v            z            u
0.9998594336 -0.0045061421 0.0021532664 0.0005047661

$I
 (Intercept)             v             z             u
1.0021243940 -0.0002771667 -0.0005902548 -0.0001076862

$J
 (Intercept)             v             z             u
1.0022799819 -0.0038178402 -0.0007240292 -0.0001601389

$K
 (Intercept)            v            z             u
0.9992882339 0.0037992022 0.0011495521 -0.0003180637

$L
 (Intercept)             v            z            u
1.0007567218 -0.0064986820 0.0027587647 0.0001154958

$M
 (Intercept)             v            z            u
1.0015298907 -0.0108109919 0.0021047146 0.0003715152

$N
  (Intercept)             v             z             u
 9.968747e-01  4.877381e-03 -7.185543e-04 -7.348219e-05

$O
  (Intercept)             v             z             u
 9.976900e-01  3.398225e-03  1.057947e-03  9.494114e-05

$P
  (Intercept)             v             z             u
 1.002930e+00 -3.499499e-03 -3.553624e-03  9.643136e-05

$Q
  (Intercept)             v             z             u
 9.976515e-01  6.262968e-03 -1.400940e-03  1.943619e-05

$R
  (Intercept)             v             z             u
 1.003036e+00 -5.005356e-03 -2.495498e-04 -2.870706e-05

$S
  (Intercept)             v             z             u
 9.972203e-01  5.251414e-03  1.235238e-03 -7.514891e-05

$T
 (Intercept)            v             z            u
0.9934930344 0.0053785986 -0.0020267336 0.0001322008

$U
 (Intercept)            v             z            u
0.9967530003 0.0036978510 -0.0008680150 0.0003244288

$V
 (Intercept)             v             z            u
0.9995008267 -0.0048016468 -0.0009455275 0.0004138265

$W
 (Intercept)            v             z             u
1.0012034219 0.0037786034 -0.0003363080 -0.0001392472

$X
  (Intercept)             v             z             u
 1.003057e+00 -6.690669e-03  1.196444e-03 -2.206894e-05

$Y
 (Intercept)             v            z            u
1.0026660393 -0.0049284783 0.0006725549 0.0002067052

$Z
 (Intercept)            v            z             u
1.0052937421 0.0002326185 0.0016023204 -0.0003679136

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
    x
1   A
2   B
3   C
4   D
5   E
6   F
7   G
8   H
9   I
10  J
11  K
12  L
13  M
14  N
15  O
16  P
17  Q
18  R
19  S
20  T
21  U
22  V
23  W
24  X
25  Y
26  Z

I don't like this as much because it's got the coefficients in a list rather than in a single object, but that's an easy fix. And observe that we have saved quite a bit of time using this approach: we're down to about 20 seconds. Now data.table:
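The "easy fix" is a one-liner: rbind the named list of coefficient vectors into a single matrix with one row per group. A sketch, where coefs3 is a tiny hand-made stand-in for the real dlply() result:

```r
## Collapse a named list of coefficient vectors into one matrix.
## Values here are invented, abbreviated stand-ins for the output above.
coefs3 <- list(A = c("(Intercept)" = 0.997, v =  0.004, z = 0.0023, u = 0.0001),
               B = c("(Intercept)" = 1.003, v = -0.013, z = 0.0002, u = 0.0002))
coefmat <- do.call(rbind, coefs3)   # rows named A, B; columns named by term
```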

> ### data.table
> ss <- system.time(
+
+   coefs2 <- DT[, as.list(coef(lm(w ~ v + z + u))), by = x]
+
+ );ss
   user  system elapsed
  13.80    1.94   15.81
> coefs2
     x (Intercept)             v             z             u
 1:  A   0.9967335  4.325531e-03  0.0023009268  1.105387e-04
 2:  B   1.0025524 -1.319115e-02  0.0002201425  1.839118e-04
 3:  C   0.9875414  9.519332e-03  0.0005937307  6.161587e-04
 4:  D   1.0018826 -3.016073e-03 -0.0009244225  9.532975e-05
 5:  E   0.9986710  1.105757e-03 -0.0017969269  1.054551e-04
 6:  F   1.0070935 -7.691965e-03 -0.0016824387 -2.726625e-04
 7:  G   0.9937026 -4.830568e-03  0.0025053514  5.727455e-04
 8:  H   0.9998594 -4.506142e-03  0.0021532664  5.047661e-04
 9:  I   1.0021244 -2.771667e-04 -0.0005902548 -1.076862e-04
10:  J   1.0022800 -3.817840e-03 -0.0007240292 -1.601389e-04
11:  K   0.9992882  3.799202e-03  0.0011495521 -3.180637e-04
12:  L   1.0007567 -6.498682e-03  0.0027587647  1.154958e-04
13:  M   1.0015299 -1.081099e-02  0.0021047146  3.715152e-04
14:  N   0.9968747  4.877381e-03 -0.0007185543 -7.348219e-05
15:  O   0.9976900  3.398225e-03  0.0010579466  9.494114e-05
16:  P   1.0029305 -3.499499e-03 -0.0035536237  9.643136e-05
17:  Q   0.9976515  6.262968e-03 -0.0014009404  1.943619e-05
18:  R   0.9995745  5.046324e-09 -0.0002509601 -2.945720e-05
19:  S   0.9972203  5.251414e-03  0.0012352377 -7.514891e-05
20:  T   0.9934930  5.378599e-03 -0.0020267336  1.322008e-04
21:  U   0.9967530  3.697851e-03 -0.0008680150  3.244288e-04
22:  V   0.9995008 -4.801647e-03 -0.0009455275  4.138265e-04
23:  W   1.0012034  3.778603e-03 -0.0003363080 -1.392472e-04
24:  X   1.0030570 -6.690669e-03  0.0011964442 -2.206894e-05
25:  Y   1.0026660 -4.928478e-03  0.0006725549  2.067052e-04
26:  Z   1.0052937  2.326185e-04  0.0016023204 -3.679136e-04
     x (Intercept)             v             z             u


It's even faster.

Reading in data is faster via fread than it is via commands like read.table or read.csv. Observe:

> ### read it in as a data.frame
> tt <- system.time(
+
+   baby <- read.csv("data/bnames.csv")
+
+ );tt
   user  system elapsed
   0.93    0.00    1.08
>
> dim(baby)
[1] 258000      4
>
> ### read it in as a data.table
> tt <- system.time(
+
+   babyDT <- fread("data/bnames.csv")
+
+ );tt
   user  system elapsed
   0.34    0.00    0.39
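If you don't have data/bnames.csv handy, the same comparison can be rerun self-contained on a throwaway file; the column names and sizes below are invented for illustration:

```r
## Write a temporary CSV, then time base R's read.csv() against
## data.table::fread() on the same file.
library(data.table)

tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = rnorm(1e5), b = rnorm(1e5)), tmp, row.names = FALSE)

t_df <- system.time(baby   <- read.csv(tmp))   # base R reader
t_dt <- system.time(babyDT <- fread(tmp))      # data.table reader
unlink(tmp)
```

On a file this small the gap is modest; fread's advantage grows with file size because its parser is implemented in carefully optimized C.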

These data, the most popular baby names by gender and year from 1880 to the present, are not an especially large object, but the roughly threefold speed gain scales to much larger files as well. Again, in just two lines, you can get a plot of the sum of the percentages over time (to get a sense of "name diffusion"):

> ### you want to get a sense of whether the "most popular" names
> ### differ in mass across genders or years---like, has there been diffusion?
> foo <- babyDT[, sum(percent), by = c("year", "sex")]
> ggplot(foo, aes(x = year, y = V1, colour = sex)) + geom_line()

[Figure: Sum of the most popular names' popularity by gender and year, one line per sex (boy, girl), 1880 onward.]

Apparently, there are a lot more naming options lately!

Note also that many commands are faster with data.table than with data.frame. Importantly, these include merge and reshape. To give an example of the latter:

> tt <- system.time(
+
+   foo <- melt(baseball, idvar = id)
+
+ );tt
Using id, team, lg as id variables
   user  system elapsed
   1.13    0.05    1.17
>
> ss <- system.time(
+
+   foo1 <- melt(baseDT, idvar = id)
+
+ );ss
Using id, team, lg as id variables
   user  system elapsed
   0.30    0.05    0.35
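For a self-contained look at the same reshaping, melt() works directly on a data.table; the tiny wide table below (and its column names) is invented for illustration:

```r
## Wide-to-long reshape on a data.table with melt().
library(data.table)

wide <- data.table(id = c("a", "b"), x1 = 1:2, x2 = 3:4)
long <- melt(wide, id.vars = "id",
             variable.name = "var", value.name = "val")
## long has one row per (id, measured column) pair: 2 ids x 2 columns = 4 rows
```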


Wrapping up

In short, you should use data.table to get better functionality when working in regular old R. Many jobs that seemed to require parallelization do not; they just require a better approach than the usual subset-looping.
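One example of such a better approach: setting a key lets data.table subset by binary search rather than scanning every row, which is where much of its speed on large tables comes from. A sketch on invented toy data (toyDT and its columns are not from the examples above):

```r
## Keyed subsetting: sort once by id, then look rows up by binary search.
library(data.table)

set.seed(2)
toyDT <- data.table(id = sample(LETTERS, 1e5, replace = TRUE),
                    y  = rnorm(1e5))
setkey(toyDT, id)        # sorts the table by id once
sub <- toyDT[.("M")]     # keyed lookup, no full vector scan
## the same rows via a vector scan, slower on big tables: toyDT[id == "M"]
```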

You can learn more about data.table at its home page. There are a few quick vignettes (from which I have borrowed heavily!), along with updated documentation.

Again, if you have very large data, then you probably should be working in a different computing context than regular old R.