plyr, one data analytic strategy



Page 1: Plyr, one data analytic strategy

plyr: One data-analytic strategy

Hadley Wickham, Rice University

Friday, 29 May 2009

Page 2: Plyr, one data analytic strategy

1. Motivation: Deseasonalising ozone measurements

2. Outline of strategy: split-apply-combine

3. Specifics: input vs. output

4. Fiddly details

5. Thoughts on data analysis


Page 3: Plyr, one data analytic strategy

[Figure: spatial plot; vertical axis from −20 to 30, horizontal axis from −110 to −60]

24 x 24 x 72 = 41,472


Page 5: Plyr, one data analytic strategy

[Figure: plot with both axes running from −1.0 to 1.0]

Page 6: Plyr, one data analytic strategy

[Figure: two panels; one with both axes from −1.0 to 1.0, the other with both axes from 0.0 to 1.0]

Page 7: Plyr, one data analytic strategy

[Figure: value (0.3 to 1.0) plotted against time (0.0 to 1.0)]

Page 8: Plyr, one data analytic strategy

[Figure: resid(deseas1) + mean(one$value) (0.3 to 1.0) plotted against time (0.0 to 1.0)]


Page 10: Plyr, one data analytic strategy

How can we do this for all 24 x 24 locations?

(assume ozone levels stored in a 24 x 24 x 72 array)


Page 11: Plyr, one data analytic strategy

# Pre-allocate a 24 x 24 list-matrix of models and an array of residuals
models <- as.list(rep(NA, 24 * 24))
dim(models) <- c(24, 24)
deseas <- array(NA, c(24, 24, 72))
dimnames(deseas) <- dimnames(ozone)

# Fit one model per location and store its residuals
for (i in seq_len(24)) {
  for (j in seq_len(24)) {
    mod <- deseasf(ozone[i, j, ])
    models[[i, j]] <- mod
    deseas[i, j, ] <- resid(mod)
  }
}

With a for loop


Page 13: Plyr, one data analytic strategy

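# apply() returns a 24 x 24 list-matrix of models
models <- apply(ozone, 1:2, deseasf)
# Flatten the residuals, then restore the 24 x 24 x 72 shape
resids <- unlist(lapply(models, resid))
dim(resids) <- c(72, 24, 24)
deseas <- aperm(resids, c(2, 3, 1))
dimnames(deseas) <- dimnames(ozone)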

With apply
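The reshaping in the apply version is easier to follow on a small array. The following is a minimal sketch in base R; the 2 x 3 x 4 array `arr` and the per-location function `centre` are made up for illustration and stand in for the ozone array and the model residuals.

```r
# Toy stand-in for the ozone array: 2 x 3 locations, 4 time points.
arr <- array(seq_len(24), dim = c(2, 3, 4))

# Hypothetical per-location function: subtract the location's mean.
centre <- function(x) x - mean(x)

# apply() puts the function's output in the first dimension,
# so the result is 4 x 2 x 3 rather than 2 x 3 x 4.
res <- apply(arr, 1:2, centre)
dim(res)

# aperm() moves the time dimension back to the end.
fixed <- aperm(res, c(2, 3, 1))
dim(fixed)
```

This is why the slide needs the extra `dim<-` and `aperm()` steps after `apply()`.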


Page 15: Plyr, one data analytic strategy

models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

With plyr

Succinct, but you need to know what aaply does

cf. onomatopoeia, schadenfreude, soliloquy

Page 16: Plyr, one data analytic strategy

[Figure: spatial plot with an avg legend running from 250 to 310; vertical axis from −20 to 30, horizontal axis from −110 to −60]

Page 17: Plyr, one data analytic strategy

[Figure: spatial plot; vertical axis from −20 to 30, horizontal axis from −110 to −60]

Page 18: Plyr, one data analytic strategy

Many problems involve splitting up a large data structure, operating on each piece and joining the results back together:

split-apply-combine
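The three steps can be written out by hand in base R. The following is a minimal sketch on a made-up data frame; the column names `group` and `value` are assumptions chosen for illustration.

```r
# Made-up data: a few measurements in two groups.
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Split: one data frame per group.
pieces <- split(df, df$group)

# Apply: summarise each piece independently.
summaries <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], mean_value = mean(piece$value))
})

# Combine: stack the per-group results into one data frame.
result <- do.call(rbind, summaries)
result
```

Each plyr function bundles these three steps into a single call.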


Page 19: Plyr, one data analytic strategy

How you split up depends on the type of input: arrays, data frames, lists

How you combine depends on the type of output: arrays, data frames, lists, nothing


Page 20: Plyr, one data analytic strategy

Input \ Output   array   data frame   list    nothing
array            aaply   adply        alply   a_ply
data frame       daply   ddply        dlply   d_ply
list             laply   ldply        llply   l_ply
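In each function name the first letter gives the input type and the second the output type. The following sketch, which assumes the plyr package is installed and uses a made-up data frame, applies the same split and the same function three times and changes only the output type.

```r
library(plyr)

# Made-up data frame, used only for illustration.
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Same split (by group) and same function; only the output type changes.
mean_value <- function(piece) mean(piece$value)

as_array <- daply(df, .(group), mean_value)  # data frame in, array out
as_frame <- ddply(df, .(group), mean_value)  # data frame in, data frame out
as_list  <- dlply(df, .(group), mean_value)  # data frame in, list out
```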


Page 21: Plyr, one data analytic strategy

Input \ Output   array    data frame   list     nothing
array            apply    adply        alply    a_ply
data frame       daply    aggregate    by       d_ply
list             sapply   ldply        lapply   l_ply

Page 22: Plyr, one data analytic strategy

Split: array, data frame, list

[Figure: a 2-d array split by dimension 1, by dimension 2, or by both (1,2)]

Page 23: Plyr, one data analytic strategy

Split: array, data frame, list

[Figure: a 3-d array split by one dimension (1, 2 or 3), by two dimensions (1,2; 1,3; 2,3), or by all three (1,2,3)]

Page 24: Plyr, one data analytic strategy

models <- aaply(ozone, 1:2, deseasf)
deseas <- aaply(models, 1:2, resid)

Take 3d array, split up by first two dimensions.

Splitting up ozone gives 576 vectors of length 72. Splitting up models gives 576 rlm models.

How are they combined?
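One way to see the answer is to check the dimensions of the result on a small array. The following sketch assumes the plyr package is installed; the 2 x 3 x 4 array `arr` and the function `centre` are made up for illustration.

```r
library(plyr)

# Toy stand-in for the ozone array: 2 x 3 locations, 4 time points.
arr <- array(seq_len(24), dim = c(2, 3, 4))

# Hypothetical per-location function returning a vector of length 4.
centre <- function(x) x - mean(x)

# aaply() keeps the split dimensions first and appends the dimensions
# of each result, so the output is again 2 x 3 x 4.
out <- aaply(arr, 1:2, centre)
dim(out)
```

Unlike the base `apply()` version, no `aperm()` call is needed to put the dimensions back in order.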


Page 25: Plyr, one data analytic strategy

4D!

Combine: array, data frame, list


Page 26: Plyr, one data analytic strategy

Combine: array, data frame, list


Page 27: Plyr, one data analytic strategy

Split: array, data frame, list

Original data frame:

name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

Split by .(sex):

name     age  sex
John     13   Male
Peter    13   Male
Roger    14   Male

name     age  sex
Mary     15   Female
Alice    14   Female
Phyllis  13   Female

Split by .(age):

name     age  sex
John     13   Male
Peter    13   Male
Phyllis  13   Female

name     age  sex
Alice    14   Female
Roger    14   Male

name     age  sex
Mary     15   Female

Page 28: Plyr, one data analytic strategy

Combine: array, data frame, list

.(sex):

sex     value
Male    3
Female  3

.(age):

age  value
13   3
14   2
15   1

.(sex, age):

sex     age  value
Male    13   2
Male    14   1
Female  13   1
Female  14   1
Female  15   1

Applying nrow to each piece
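The counts can be reproduced with `ddply`. The following sketch assumes the plyr package is installed and rebuilds the six-person data frame from the splitting slide. The slide labels the count column `value`; as far as I know, `ddply` names an unnamed scalar result `V1`.

```r
library(plyr)

# The six people from the splitting example.
people <- data.frame(
  name = c("John", "Peter", "Roger", "Mary", "Alice", "Phyllis"),
  age  = c(13, 13, 14, 15, 14, 13),
  sex  = c("Male", "Male", "Male", "Female", "Female", "Female")
)

# Count the rows in each piece for three different splits.
by_sex     <- ddply(people, .(sex), nrow)
by_age     <- ddply(people, .(age), nrow)
by_sex_age <- ddply(people, .(sex, age), nrow)
```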


Page 29: Plyr, one data analytic strategy

Case study: Baseball


Page 30: Plyr, one data analytic strategy

id        year  team  g    ab   r    h
ruthba01  1914  BOS   5    10   1    2
ruthba01  1915  BOS   42   92   16   29
ruthba01  1916  BOS   67   136  18   37
ruthba01  1917  BOS   52   123  14   40
ruthba01  1918  BOS   95   317  50   95
ruthba01  1919  BOS   130  432  103  139
ruthba01  1920  NYA   142  457  158  172
ruthba01  1921  NYA   152  540  177  204
ruthba01  1922  NYA   110  406  94   128
ruthba01  1923  NYA   152  522  151  205
ruthba01  1924  NYA   153  529  143  200
ruthba01  1925  NYA   98   359  61   104
ruthba01  1926  NYA   152  495  139  184
ruthba01  1927  NYA   151  540  158  192
ruthba01  1928  NYA   154  536  163  173
ruthba01  1929  NYA   135  499  121  172

21,699 records

1228 players

15-31 years for each player

Page 31: Plyr, one data analytic strategy

How does performance (rbi/ab) change over the course of a career?

First need to add column that gives “career year”

Easy for a single player.

baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)

For many players, use ddply + transform

baseball <- ddply(baseball, "id", transform, cyear = year - min(year) + 1)
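The same call can be checked on a small example. The following sketch assumes the plyr package is installed and uses made-up records for two hypothetical players. The `ave()` line is a base R alternative that is not from the slides, included only for comparison.

```r
library(plyr)

# Made-up records for two hypothetical players.
toy <- data.frame(
  id   = c("a01", "a01", "a01", "b01", "b01"),
  year = c(1914, 1915, 1916, 1920, 1921)
)

# Split by player, add the column within each piece, and recombine.
toy <- ddply(toy, "id", transform, cyear = year - min(year) + 1)

# Base R alternative (not from the slides): ave() computes the
# per-player minimum year without splitting the data frame.
toy$cyear_ave <- toy$year - ave(toy$year, toy$id, FUN = min) + 1
```

Because `min(year)` is evaluated inside each player's piece, every player's first season gets `cyear` equal to 1.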


Page 32: Plyr, one data analytic strategy

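library(plyr)
library(ggplot2)  # provides qplot()

# Drop player-seasons with fewer than 25 at bats
baseball <- subset(baseball, ab >= 25)

# Common axis limits so that every plot is comparable
xlim <- range(baseball$cyear, na.rm=TRUE)
ylim <- range(baseball$rbi / baseball$ab, na.rm=TRUE)

plotpattern <- function(df) {
  qplot(cyear, rbi / ab, data = df, geom = "line",
    xlim = xlim, ylim = ylim)
}

# One plot per player; failwith() keeps going if a plot fails
pdf("paths.pdf", width = 8, height = 4)
d_ply(baseball, .(reorder(id, rbi / ab)), failwith(NA, plotpattern), .print = TRUE)
dev.off()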

Draw time series for all 1228 players
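The role of `failwith` in that call can be seen in isolation. The following sketch assumes the plyr package is installed; `risky_log` is a hypothetical function invented for illustration, standing in for a plot or model fit that fails on some pieces.

```r
library(plyr)

# Hypothetical function that fails on some inputs.
risky_log <- function(x) {
  if (x < 0) stop("negative input")
  log(x)
}

# failwith(default, f) returns a version of f that yields
# `default` instead of raising an error.
safe_log <- failwith(NA_real_, risky_log, quiet = TRUE)

# The loop continues past the failing element instead of stopping.
results <- laply(c(1, -1, exp(2)), safe_log)
```

Without the wrapper, a single failing piece would abort the whole run and discard the pieces already processed.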


Page 33: Plyr, one data analytic strategy

[Figure: histogram of rsquare (0.0 to 1.0); count from 0 to 200]

Page 34: Plyr, one data analytic strategy

[Figure: two scatterplots of intercept against slope, each with an rsquare legend from 0.00 to 1.00. First panel: slope from −0.04 to 0.08, intercept from −0.5 to 1.0. Second panel: slope from −0.010 to 0.010, intercept from −0.10 to 0.25.]

Page 35: Plyr, one data analytic strategy

Fiddly details

Labelling

Progress bars

Consistent argument names

Missing values / Nulls
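Two of these details show up in a single call. The following sketch assumes the plyr package is installed; `slow_square` is a hypothetical function standing in for a slow model fit.

```r
library(plyr)

# Hypothetical slow step, standing in for a model fit.
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

# Arguments of the plyr functions themselves start with a dot
# (.data, .fun, .progress), which keeps them apart from arguments
# passed on to the function being applied.
squares <- laply(.data = 1:20, .fun = slow_square, .progress = "text")
```

The `.progress = "text"` setting prints a text progress bar to the console while the pieces are processed.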


Page 36: Plyr, one data analytic strategy

Data analysis

What other patterns of data analysis are waiting to be discovered?

How can we identify these strategies and then develop software to support them?

Does teaching these patterns make it easier for novices to become experts?


Page 37: Plyr, one data analytic strategy

http://had.co.nz/plyr
