Extending lifespan with Hadoop and R
DESCRIPTION
Many experts believe that ageing can be delayed; this is one of the main goals of the Institute of Healthy Ageing at University College London. I will present the results of my lifespan-extension research, in which we integrated publicly available gene databases in order to identify ageing-related genes. I will show what challenges we met and what we have learned about the process of ageing. Ageing is one of the fundamental mysteries in biology, and many scientists are starting to study this fascinating process. I am part of the research group led by Dr Eugene Schuster at the UCL Institute of Healthy Ageing. We experiment with Drosophila and Caenorhabditis elegans, modifying their genes in order to create long-lived mutants. The results of our experiments are quantified using high-throughput microarray analysis. Finally, we apply information technology in order to understand how the ageing process works. I will show how we mine microarray data to find connections between thousands of genes and how we identify candidate ageing genes. We are interested in building a better understanding of gene functions by harnessing the large quantity of experimental microarray data in public databases. Our hope is that after understanding the ageing process in simpler organisms we will be able to apply this knowledge to humans. Cross-referencing expression levels across thousands of genes and hundreds of experiments turned out to be a computationally challenging problem, but Hadoop and the Amazon cloud came to our rescue. In this talk I will present a case study based on our use of R with Amazon Elastic MapReduce and give background on our bioinformatics challenges. These slides were presented at ApacheCon Europe 2012: http://www.apachecon.eu/schedule/presentation/3/

TRANSCRIPT
![Page 1: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/1.jpg)
Extending lifespan with R and Hadoop
Radek Maciaszek
Founder of DataMine Lab, CTO Ad4Game, studying towards PhD in Bioinformatics at UCL
![Page 2: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/2.jpg)
2
Agenda
● Project background
● Parallel computing in R
● Hadoop + R
● Future work (Storm)
● Results and summary
![Page 3: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/3.jpg)
3
Project background
● Lifespan extension – a project at UCL during an MSc in Bioinformatics
● Bioinformatics – computer science applied to biology (DNA, proteins, drug discovery, etc.)
● Institute of Healthy Ageing at UCL – lifespan is king. Dozens of scientists, dedicated journals.
● Ageing is a complex process – or is it? In C. elegans a single gene (daf-2) can double lifespan, and extensions of up to 10x have been achieved.
● Goal of the project: find genes responsible for ageing

Caenorhabditis elegans
![Page 4: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/4.jpg)
4
Primer in Bioinformatics
● Central dogma of molecular biology
● Cell (OS + 3D), gene (a program), TF (transcription factor – like a read head on a HDD)
● How to find ageing genes (such as DAF-2)?

Images: Wikipedia
![Page 5: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/5.jpg)
5
RNA microarray
DAF-2 pathway in C. elegans
Source: Staal et al., 2003
Source: Partridge & Gems, 2002
![Page 6: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/6.jpg)
6
Goal: raw data → network
100 x 100 x 50 x 10 (~10k genes)
Gene network
● Pairwise comparisons of 10k x 10k genes + clustering
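The pairwise comparison step can be sketched in a few lines of R. This is an illustrative toy, not the project's actual pipeline: the data are invented, only 5 probes are used instead of ~10k, and expression calls are discretized so each pair of profiles can be compared with a Fisher exact test. At 10k x 10k scale the same double loop means ~100M tests, which is why the talk turns to Hadoop.

```r
set.seed(42)

# Toy data: 5 probes, 20 experiments, binary up/down calls (invented)
calls <- matrix(rbinom(5 * 20, 1, 0.5), nrow = 5,
                dimnames = list(paste0("gene", 1:5), NULL))

# Pairwise Fisher exact tests on 2x2 contingency tables of calls
p.values <- matrix(NA_real_, 5, 5,
                   dimnames = list(rownames(calls), rownames(calls)))
for (i in 1:5) {
  for (j in 1:5) {
    tab <- table(factor(calls[i, ], levels = 0:1),
                 factor(calls[j, ], levels = 0:1))
    p.values[i, j] <- fisher.test(tab)$p.value
  }
}
p.values
```

Edges of the gene network are then drawn between probe pairs whose p-values pass a significance threshold, and the resulting graph is clustered.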
![Page 7: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/7.jpg)
7
Why R?
● Incredibly powerful for data science with big data
● Functional scripting programming language with many packages
● Popular in mathematics, bioinformatics, finance, social science and more
● TechCrunch lists R as a trendy technology for Big Data
● Designed by statisticians for statisticians
![Page 8: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/8.jpg)
8
R example
K-Means clustering

```r
require(graphics)

# Two 2-D Gaussian clouds, then k-means with 2 centres
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
```
![Page 9: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/9.jpg)
9
R limitations & Hadoop
● 10k x 10k (~100M) pairwise Fisher exact tests are slow
● Memory allocation is a common problem
● Single-threaded
● Hadoop integration:
– Hadoop Streaming
– Rhipe: http://ml.stat.purdue.edu/rhipe/
– Segue: http://code.google.com/p/segue/
![Page 10: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/10.jpg)
10
Scaling R
● Explicit
  – snow, parallel, foreach
● Implicit
  – multicore (2.14.0)
● Hadoop
  – RHIPE, rmr, Segue, RHadoop
● Storage
  – rhbase, rredis, Rcassandra, rhdfs
![Page 11: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/11.jpg)
11
R and Hadoop
● Streaming API (low level)

mapper.R

```r
#!/usr/bin/env Rscript
# Read CSV records from stdin, process each, write results to stdout
con <- file("stdin", "r")   # "in" is a reserved word in R
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- unlist(strsplit(line, ","))
  ret <- expensiveCalculations(fields)  # the expensive per-record work
  cat(ret, "\n", sep = "")
}
close(con)
```

```shell
hadoop jar hadoop-streaming-*.jar -input data.csv -output data.out -mapper mapper.R
```
![Page 12: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/12.jpg)
12
RHIPE
● Can be used with your own Hadoop cluster
● Write mappers/reducers using R only

```r
# Word count: the map emits (word, count) pairs per input chunk
map <- expression({
  f <- table(unlist(strsplit(unlist(map.values), " ")))
  n <- names(f)
  p <- as.numeric(f)
  sapply(seq_along(n), function(r) rhcollect(n[r], p[r]))
})

# The reduce sums the counts for each word
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

z <- rhmr(map = map, reduce = reduce,
          inout = c("text", "sequence"),
          ifolder = filename,
          ofolder = sprintf("%s-out", filename))
job.result <- rhstatus(rhex(z, async = TRUE), mon.sec = 2)
```

Example from the Rhipe wiki
![Page 13: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/13.jpg)
13
Segue
● Works with Amazon Elastic MapReduce
● Creates a cluster for you
● Designed for big computations (rather than big data)
● Implements a cloud version of lapply()
● Parallelization in 2 lines of code!
● Allowed us to cut our calculation time to 2 hours using 16 servers
![Page 14: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/14.jpg)
14
Segue workflow (emrlapply)
![Page 15: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/15.jpg)
15
lapply()

```r
m <- list(a = 1:10, b = exp(-3:3))
lapply(m, mean)
# $a
# [1] 5.5
# $b
# [1] 4.535125
```
lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
![Page 16: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/16.jpg)
16
Segue in a cluster

```r
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe, ]
    p.values <- c()
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name, ]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return(p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
```
Moving to the cloud in 3 lines of code!
![Page 17: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/17.jpg)
17
Segue in a cluster

```r
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe, ]
    p.values <- c()
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name, ]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return(p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances = 5,
                             masterBidPrice = "0.68",
                             slaveBidPrice = "0.68",
                             masterInstanceType = "c1.xlarge",
                             slaveInstanceType = "c1.xlarge",
                             copy.image = TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)
```
![Page 18: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/18.jpg)
18
R + HBase
```r
library(rhbase)
hb.init(serialize = "raw")

# create a new table
hb.new.table("mytable", "x", "y", "z",
             opts = list(y = list(compression = "GZ")))

# insert some values into the table
hb.insert("mytable", list(
  list(1, c("x", "y", "z"), list("apple", "berry", "cherry"))))

# scan rows whose value contains "ber"
rows <- hb.scan.ex("mytable", filterstring = "ValueFilter(=,'substring:ber')")
rows$get()
```
https://github.com/RevolutionAnalytics/RHadoop/wiki/rhbase
![Page 19: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/19.jpg)
19
Discovering genes
Topomaps of clustered genes
This work was based on: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim et al., Science 293, 2087 (2001)
![Page 20: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/20.jpg)
20
Genes clusters
Clusters based on Fisher exact pairwise gene comparisons

Green lines represent random probes
Red lines represent up-regulated probes
Blue lines represent down-regulated probes (in the daf-2 vs daf-2;daf-16 experiment)
![Page 21: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/21.jpg)
21
Genes networks
Network created with Cytoscape, platform for complex network analysis: http://www.cytoscape.org/
![Page 22: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/22.jpg)
22
Future work – real-time R
● Hadoop has high throughput but is slow for small tasks; it is not well suited to continuous calculations.
● A possible solution is to use Storm.
● Storm's multilang protocol can be used with any language, including R.
![Page 23: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/23.jpg)
23
Storm R
Image source: Storm github wiki
Storm may be easily integrated with third-party languages and databases:

● Java
● Python
● Ruby

● Redis
● HBase
● Cassandra
![Page 24: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/24.jpg)
24
Storm R

```r
source("storm.R")

initialize <- function() {
  emitBolt(list("bolt initializing"))
}

process <- function(tup) {
  word <- tup$tuple
  rand <- runif(1)
  if (rand < 0.75) {
    emitBolt(list(paste0(word, "lalala")))  # R concatenates with paste, not "+"
  } else {
    log(paste0(word, " randomly skipped!"))
  }
}

boltRun(process, initialize)
```
https://github.com/rathko/storm
![Page 25: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/25.jpg)
25
Summary
● It's easy to scale R using Hadoop.
● R is not only great for statistics; it is a versatile programming language.
● Is ageing a disease? Are we all going to live very long lives?
![Page 26: Extending lifespan with Hadoop and R](https://reader034.vdocument.in/reader034/viewer/2022042623/54b6cc884a79593a378b45a2/html5/thumbnails/26.jpg)
26
Questions?

● References:
http://hadoop.apache.org/
http://hbase.apache.org/
http://code.google.com/p/segue/
http://www.datadr.org/
https://github.com/RevolutionAnalytics/
https://github.com/rathko/storm