R meetup talk
Fast lookups in R
Joseph Adler
April 13 2010
About me
Relevant work
• Tasks
– Computer security research
– Credit risk modeling
– Pricing strategy
– Direct marketing
• Places
– American Express
– Johnson and Johnson
– DoubleClick
– VeriSign
– LinkedIn (now)
About me
Books
Today’s talk
What I wrote
If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value in a list with n elements can take O(n) time. Looking up the value in an environment object takes O(1) time on average.
Today’s talk
What I read after the book was printed
Re: [R] beginner Q: hashtable or dictionary?
From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>
Date: Mon 30 Jan 2006 - 18:37:00 EST
On Sun, 29 Jan 2006, hadleywickham wrote:
>> use a 'list':
>
> Is a list O(1) for setting and getting?
Can you elaborate? R is a vector language, and normally you create
a list in one pass, and you can retrieve multiple elements at once.
Retrieving elements by name from a long vector (including a
list) is very fast, as an internal hash table is used. Does the
following item from ONEWS answer your question?
Indexing a vector by a character vector was slow if both
the vector and index were long (say 10,000). Now
hashing is used and the time should be linear in the
longer of the lengths (but more memory is used).
Indexing by number is O(1) except where replacement causes the
list vector to be copied. There is always the option to use match() to
convert to numeric indexing.
-- Brian D. Ripley,
Professor of Applied Statistics,
University of Oxford
Retrieving elements by name from a
long vector (including a list) is very
fast, as an internal hash table is used.
Professor Brian D. Ripley
Today’s talk
• A short introduction to objects in R
• Looking up values in R
– How lookup tables are implemented in R
– Measuring lookup speed
– Optimizing lookup speed
Objects in R
Everything in R is an object. Here are some
examples of objects.
Numeric Vector:
> onehalf <- 1/2
> class(onehalf)
[1] "numeric"
Objects in R
Integer Vector:
> four <- as.integer(4)
> four
[1] 4
> class(four)
[1] "integer"
Objects in R
Character vector:
> zero <- "zero"
> class(zero)
[1] "character"
Objects in R
Logical vector:
> this.is.interesting <- FALSE
> class(this.is.interesting)
[1] "logical"
Objects in R
Vectors can have multiple elements
> one.to.five <- 1:5
> class(one.to.five)
[1] "integer"
> six.to.ten <- c(6, 7, 8, 9, 10)
> class(six.to.ten)
[1] "numeric"
Objects in R
Lists contain heterogeneous collections of objects
> stuff <- list(3.14, "hat", FALSE)
> class(stuff)
[1] "list"
Objects in R
Functions are also objects in R:
> f <- function(x, y) {
+   x + y
+ }
> f
function(x, y) {
  x + y
}
> class(f)
[1] "function"
Objects in R
Environments map names to objects. They are
used within R itself to map variable names to
objects. You can access these environment
objects, or create your own.
> one <- 1
> two <- 2
> three <- 3
> objects()
[1] "one" "three" "two"
> e <- .GlobalEnv
> class(e)
[1] "environment"
> objects(e)
[1] "e" "one" "three" "two"
Lookups
You can look up an item in a vector, list, or array
within R
– Let's define a vector:
> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> a
[1] 1 2 3 4 5 6 7 8 9 10
– You can refer to elements by index:
> a[3]
[1] 3
Lookups
It's also possible to name elements in a vector, then refer to
them by name:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2
This can be very convenient: you can use every vector in R
as a table. You can access the name vector through the
names function:
> names(b)
[1] "Joe" "Bob" "Jim"
Lookups
Named vectors in R are implemented using two
different arrays:
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Lookups
The name lookup algorithm works roughly like this:
function(vector, name) {
  # check each name in turn, stopping at the first match
  for (i in seq_along(vector)) {
    if (names(vector)[i] == name)
      return(vector[i])
  }
  # no name matched
  return(NA)
}
Lookups
Example: Look up a.20["F"]
names(a.20): B C D E F G H I J BA BB BC BD BE BF BG BH BI BJ CA
a.20:        1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The search compares "F" with names(a.20)[1], then names(a.20)[2], and so on,
until it finds a match at names(a.20)[5].
Lookups
In vectors,
– Looking up a value by index takes a constant amount
of time.
– Looking up a value by name (potentially) requires
looking at every name in the names array. (This
means that lookup times scale linearly with the
number of items in the table.)
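Because positional lookups are cheap and name lookups may scan the names array, one option (the one suggested at the end of the mailing-list reply quoted earlier) is to resolve names to positions once with match() and then index by position. A minimal sketch; the vector prices and its contents are made up for illustration:

```r
# Hypothetical lookup table: a named numeric vector
prices <- c(apple = 0.5, pear = 0.75, plum = 0.2)

# Resolve all the names we need in a single pass over names(prices)
wanted <- c("plum", "apple")
pos <- match(wanted, names(prices))

# Later lookups use integer positions only
by.position <- prices[pos]

# For comparison: the same lookup done directly by name
by.name <- prices[wanted]
```

Both forms return the same named values; the positional form only pays for the name search once.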
Lookups
Environments store (and fetch) data using a
different structure. They use hash tables.
Hash tables rely on a hash function to map labels
to indices.
Lookups
Simple hash table implementation
Example: store 15 ¾ for "Joe"
1. Calculate h("Joe")
2. Store 15 ¾ in the table in slot h("Joe")
Here h("Joe") = 4, so the table looks like this:
Slot  Value
1
2
3
4     15 ¾
5
6
Lookups
If you carefully choose the size of the hash table
and the hash function, you can store and look up
values in constant time (on average) in hash
tables.
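To make the idea concrete, here is a toy hash table written in R itself. It is only a sketch: the names toy.hash, toy.put and toy.get are invented for this example, and the hash function (sum of character codes modulo the table size) is an assumption chosen for simplicity, not the one R uses internally.

```r
# Toy hash function: map a label to a slot number between 1 and size
toy.hash <- function(key, size) {
  sum(utf8ToInt(key)) %% size + 1
}

# Store a value: compute the slot, then file the value under its label there.
# Each slot is a small named list, so labels that collide can share a slot.
toy.put <- function(slots, key, value) {
  s <- toy.hash(key, length(slots))
  slots[[s]][[key]] <- value
  slots
}

# Fetch a value: only the one slot picked by the hash function is searched
toy.get <- function(slots, key) {
  s <- toy.hash(key, length(slots))
  slots[[s]][[key]]
}

slots <- rep(list(list()), 6)          # six empty slots
slots <- toy.put(slots, "Joe", 15.75)
slots <- toy.put(slots, "Bob", 2)
joe <- toy.get(slots, "Joe")
```

A lookup never touches the other slots, which is why the cost stays roughly constant as long as the slots stay short.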
Measuring Lookup Speed
In theory, looking up values in environments
should be faster than looking up values in vectors.
In practice, how much difference does this make?
Let’s measure how much time it takes to look up
values in vectors and environments, using different
lookup methods
Measuring Lookup Speed
Let's build a large, labeled vector for testing:
labeled.array <- function(n) {
  a <- 1:n
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    names(a)[i] <- chartr(from, to, i)
  }
  a
}
Here's an example of the output of this function:
> a.20 <- labeled.array(20)
> a.20
A B C D E F G H I AJ AA AB AC AD AE AF AG AH AI BJ
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
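The loop above assigns one name at a time, which can get slow when n is large. A possible alternative, sketched here under the assumption that only the final labels matter, builds every label with a single chartr call. The function name labeled.array.vectorized is made up for this example.

```r
labeled.array.vectorized <- function(n) {
  a <- 1:n
  # chartr() works on a whole character vector, so all labels are built at once
  names(a) <- chartr("1234567890", "ABCDEFGHIJ", as.character(a))
  a
}

a.20 <- labeled.array.vectorized(20)
```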
Measuring Lookup Speed
Let's also create environment objects for testing:
labeled.environment <- function(n) {
  e <- new.env(hash=TRUE, size=n)
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    assign(x=chartr(from, to, i), value=i, envir=e)
  }
  e
}
Here’s an example of the output of this function:
> e.20 <- labeled.environment(20)
> e.20
<environment: 0x143756c>
Measuring Lookup Speed
You can fetch values from an environment object
with the get function
> get("A", envir=e.20)
[1] 1
> get("BJ", envir=e.20)
[1] 20
You can also fetch values from an environment
with the double bracket operator
> e.20[["A"]]
[1] 1
> e.20[["BJ"]]
[1] 20
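The two fetch styles behave differently when a label is missing, and get also searches enclosing environments by default. A small sketch of both points; the environment and labels below are made up for illustration:

```r
e <- new.env(hash = TRUE)
assign("A", 1, envir = e)

# get() stops with an error when the name is absent, so check with exists() first.
# inherits = FALSE restricts the search to e itself.
has.A <- exists("A", envir = e, inherits = FALSE)
has.c <- exists("c", envir = e, inherits = FALSE)

# With the default inherits = TRUE, the search continues into the enclosing
# environments and finds base R's function c
found.anyway <- exists("c", envir = e)

# The double-bracket form returns NULL for a missing label instead of failing
missing.value <- e[["no.such.label"]]
```

When an environment is used as a lookup table, passing inherits = FALSE to get and exists keeps unrelated objects from being returned by accident.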
Measuring Lookup Speed
• Creating examples for testing
arrays <- list()
for (i in 10:15) {
arrays[[as.character(2 ** i)]] <-
labeled.array(2 ** i)
}
environments <- list()
for (i in 10:15) {
environments[[as.character(2 ** i)]] <-
labeled.environment(2 ** i)
}
Measuring Lookup Speed
• Using the test function:
test_expressions("first element, by index:",
  function(d, l, r) {
    s <- 0
    for (v in 1:r) {
      s <- s + d[1]
    }
  },
  arrays, 1024)
• Output:
first element, by index:
 1024  2048  4096  8192 16384 32768
0.010 0.003 0.004 0.003 0.005 0.004
Measuring Lookup Speed
• Results for 1024 lookups (user time in seconds, by number of elements):
Lookup method                                1024   2048   4096   8192  16384  32768
Array index First                            0.01  0.003  0.004  0.003  0.005  0.004
Array index Last                             0.01  0.004  0.004  0.004  0.003  0.004
Array Label Single Bracket First            0.268  0.282  0.588  1.439  2.728  5.397
Array Label Single Bracket Last             0.173  0.278  0.582  1.517  2.713  5.266
Array Label Double Bracket Exact First      0.002  0.002  0.002  0.002  0.003  0.002
Array Label Double Bracket Exact Last       0.036   0.07  0.136  0.273  0.549  1.107
Array Label Double Bracket Not exact First   0.01  0.003  0.003  0.002  0.003  0.003
Array Label Double Bracket Not exact Last   0.042  0.069  0.137  0.275  0.551  1.112
Environment Label First                     0.012  0.005  0.006  0.006  0.005  0.005
Environment Label Last                      0.012  0.005  0.006  0.005  0.006  0.005
Notice that the times for single-bracket label lookups, and for double-bracket
lookups of the last element, increase roughly linearly with the number of
elements in the array
Let’s focus on the results for the largest arrays (which are the
most precise)
Measuring Lookup Speed
• Results for 1024 lookups, 32768 elements:
Array index First 0.004
Array index Last 0.004
Array Label Single Bracket First 5.397
Array Label Single Bracket Last 5.266
Array Label Double Bracket Exact First 0.002
Array Label Double Bracket Exact Last 1.107
Array Label Double Bracket Not exact First 0.003
Array Label Double Bracket Not exact Last 1.112
Environment Label First 0.005
Environment Label Last 0.005
Optimizing Lookup Speed
How to write efficient code:
1. Write code for clarity, not speed
2. Check to see if the code is fast enough. If it is
fast enough, stop.
3. Test your code to find where time is being spent
4. Fix the parts of your code that are taking
the most time.
5. Go to step 2
Optimizing Lookup Speed
• How do you make lookups fast?
– Lookups by position are fastest
– If you have to look up single values by name, write
your code with double brackets
• Double-bracket lookups are a little faster than single-bracket
lookups
• If you discover that your code is too slow, you can easily
change from vectors to environments
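The switch from a named vector to an environment can be done in one step. A minimal sketch, assuming the table fits in memory as a list; the vector b reuses the small example from earlier in the talk:

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# list2env() copies each named element of a list into an environment
e <- list2env(as.list(b), envir = new.env(hash = TRUE, size = length(b)))

# Code that already uses double brackets keeps working unchanged
bob <- e[["Bob"]]

# Going back: collect the values for a set of names into a named vector
back <- unlist(mget(names(b), envir = e))
```

Because both vectors and environments accept the double-bracket form, code written that way needs no changes at the lookup sites.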
Optimizing Lookup Speed
• What if
– Your code is too slow
– You need to look up values by name
– It would be hard to change your code to use double-
bracket notation
• Define a bracket operator for environments!
Optimizing Lookup Speed
Remember that every operation in R is a function call, even
lookup operators.
Example code:
> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2
Optimizing Lookup Speed
Translation of the example code:
> b["Bob"]
Bob
  2
> as.list(quote(b["Bob"]))
[[1]]
`[`
[[2]]
b
[[3]]
[1] "Bob"
Optimizing Lookup Speed
R translates
b["Bob"]
to
`[`(b, "Bob")
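You can check this equivalence directly by calling the operator as an ordinary function. A short sketch using the same small vector:

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# Usual bracket syntax
bracket.form <- b["Bob"]

# The same lookup written as an explicit function call
call.form <- `[`(b, "Bob")

# The double-bracket operator is a function too
double.form <- `[[`(b, "Bob")
```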
Optimizing Lookup Speed
Here is the code for our new subset function
`[` <- function(x, ...) {
  if (is.environment(x)) {
    # the first index argument is the name to look up
    get(x=..1, envir=x)
  } else {
    # pass every index argument through unchanged
    .Primitive("[")(x, ...)
  }
}
Optimizing Lookup Speed
Assignments through bracket notation are a little
funny. For example, R evaluates
x[3:5] <- 13:15
as if this code had been executed:
`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
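The same result can be reproduced by calling the replacement function by hand. A small sketch; it skips the temporary variable and simply reassigns the result:

```r
x <- 1:10
y <- x

# Usual replacement syntax
x[3:5] <- 13:15

# Spelled out as a call to the replacement function, whose result is assigned back
y <- `[<-`(y, 3:5, value = 13:15)
```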
Optimizing Lookup Speed
Here is the code for our new subset assignment
function
`[<-` <- function(x, ..., value) {
  if (is.environment(x)) {
    assign(x=..1, value=value, envir=x)
    # the assign statement returns value,
    # but we want to return the environment:
    x
  } else {
    .Primitive("[<-")(x, ..., value=value)
  }
}
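Masking the base operators affects every bracket lookup in the session. One alternative, which is not from the talk and is offered only as a sketch, is to tag the environment with an S3 class and define bracket methods for that class. The class name lookup_env and the constructor new.lookup are invented for this example.

```r
# Hypothetical constructor: an environment tagged with an S3 class
new.lookup <- function() {
  e <- new.env(hash = TRUE)
  class(e) <- "lookup_env"
  e
}

# x["name"] fetches the value stored under that name
`[.lookup_env` <- function(x, i, ...) {
  get(i, envir = x, inherits = FALSE)
}

# x["name"] <- value stores the value and returns the environment itself
`[<-.lookup_env` <- function(x, i, ..., value) {
  assign(i, value, envir = x)
  x
}

tbl <- new.lookup()
tbl["Joe"] <- 15.75
joe <- tbl["Joe"]
```

Ordinary vectors and lists keep using the built-in operators, since the methods only apply to objects carrying the new class.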
Backup Slides
• A function to test the performance of a lookup
function on an object:
test_expressions <-
  function(description, fun, data, reps) {
    cat(paste(description, "\n"))
    results <- vector()
    for (n in names(data)) {
      results[[n]] <- system.time(
        fun(data[[n]], as.integer(n), reps)
      )[["user.self"]]
    }
    print(results)
  }
To figure out the full argument list for the bracket
operator, use the getGeneric function:
> getGeneric("[")
standardGeneric for "[" defined from package "base"
function (x, i, j, ..., drop = TRUE)
standardGeneric("[", .Primitive("["))
<environment: 0x11a6828>
Methods may be defined for arguments: x, i, j, drop
Use showMethods("[") for currently available ones.
In general, you should set new methods with the setMethod function. Example:
setClass("myenv", representation(e="environment"))
setMethod("[",
  signature(x="myenv", i="character", j="missing"),
  function(x, i, j, ..., drop=TRUE) {
    get(x=i, envir=x@e)
  }
)
Unfortunately, R doesn’t let you redefine these operators for environments, so we have to do something trickier.