
R on BioHPC: RStudio, Parallel R and BioconductoR

Updated for 2016-04-19

Today we’ll be looking at…

Why R?

• The dominant statistics environment in academia

• Large number of packages to do a lot of different analyses

• Excellent uptake in Bioinformatics – specialist packages

• (Relatively) easy to accomplish complex stats work

• Very active development right now: R Foundation, R Consortium, Revolution Analytics, RStudio, Microsoft…

Why not R?

• Quirky language – painful for e.g. Python programmers

• Generally thought to be quite slow – except for optimized linear algebra

• Complex ‘old-fashioned’ documentation

• Parallelization packages can be complex / outdated

… but it’s getting better quickly.

Exciting Recent Developments in R

RStudio – An IDE for R, on the web

http://rstudio.biohpc.swmed.edu

BioHPC optimized R, access to cluster storage, persistent sessions

When to use RStudio

• Development work with small datasets

• Creating R Markdown documents

• Working with Shiny for dataset visualizations

• Any small, short-running data analysis tasks

Large datasets, very long-running jobs, or parallel code?

Then you must use R on the cluster…

Using R on the cluster / clients

module load R/3.2.1-intel

Latest version, optimized, same as used by rstudio.biohpc.swmed.edu

Use ‘R’ for command line R, or run scripts with ‘Rscript’
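
As a minimal sketch of a script written to be run with Rscript rather than typed at the R prompt: the file name walk_stats.R and the helpers parse_count() and main() are hypothetical and are not part of the workshop materials.

```r
# Hypothetical script, e.g. saved as walk_stats.R and run on the cluster with:
#   Rscript walk_stats.R 500
# parse_count() and main() are illustrative helpers, not BioHPC functions.

parse_count <- function(args, default = 100L) {
  # No argument given: fall back to the default sample size
  if (length(args) == 0) return(default)
  n <- suppressWarnings(as.integer(args[1]))
  if (is.na(n) || n < 1L) stop("first argument must be a positive integer")
  n
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  # commandArgs(trailingOnly = TRUE) holds only the arguments after the script name
  n <- parse_count(args)
  x <- rnorm(n)
  cat(sprintf("n = %d, mean = %.3f, sd = %.3f\n", n, mean(x), sd(x)))
  invisible(n)
}

# Demonstration call with an explicit argument vector
main(args = "5")
```

In a real script the last line would simply be main(), so that the arguments come from the Rscript command line.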

RStudio in a GUI Session

Start a webGUI Session

$ module load R/3.2.1-intel

$ module load rstudio

$ rstudio

Standard 20 hr limit

Whole node to yourself

Installing Packages

We have a set of common packages pre-installed in the R module.

You can install your own into your home directory (~/R)

install.packages(c("microbenchmark", "data.table"))

Some packages need additional system libraries and won’t compile successfully without them. Ask us to install them for you ([email protected])

This is for packages from CRAN. Bioconductor packages install differently. See later!
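
A small sketch of checking what is already available before installing anything. missing_packages() is an illustrative helper (not part of base R), and the install call is left commented out because it needs network access to a CRAN mirror.

```r
# Report which of the requested packages are not yet installed.
# missing_packages() is a hypothetical helper written for this example.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("microbenchmark", "data.table")
todo <- missing_packages(wanted)
cat("Still to install:", if (length(todo)) todo else "nothing", "\n")

# The library search path: a personal library, if one exists, is listed
# together with the library that ships with the R module
print(.libPaths())

# Uncomment to install only what is missing
# if (length(todo) > 0) install.packages(todo)
```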

Our R is faster than standard downloads

Compiled using Intel compiler and Intel Math Kernel Library

Task                     Standard R   BioHPC R   Speedup
Matrix Multiplication        139.15       1.80       77x
Cholesky Decomposition        19.53       0.32       61x
SVD                           45.66       1.95       23x
PCA                          201.30       6.25       32x
LDA                          135.37      17.60        7x

This is on a cluster node – speedup is less on clients with fewer CPU cores

For your own Mac or PC see http://www.revolutionanalytics.com/revolution-r-open

mkl_test.R
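
The contents of mkl_test.R are not shown in the slides. The following is only a rough sketch of how timings of this kind could be collected with base R; the matrix size, the repetition count and the helper time_it() are assumptions made for illustration.

```r
# time_it() is a hypothetical helper: median elapsed seconds over a few runs
time_it <- function(fun, reps = 3) {
  median(vapply(seq_len(reps),
                function(i) system.time(fun())[["elapsed"]],
                numeric(1)))
}

set.seed(42)
n <- 300                          # kept small so the sketch finishes quickly
a <- matrix(rnorm(n * n), n, n)
spd <- crossprod(a) + diag(n)     # symmetric positive definite input for chol()

timings <- c(
  matmul   = time_it(function() a %*% a),
  cholesky = time_it(function() chol(spd)),
  svd      = time_it(function() svd(a))
)
print(timings)
```

Running the same script under a standard R build and under the optimized module gives a like-for-like comparison on one machine.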

Benchmarking functions in R (and compiling them)

Compiling a function that is called often can increase speed.

The microbenchmark package allows you to benchmark functions.

library(compiler)
f <- function(n, x) for (i in 1:n) x = (1 + sin(x))^(cos(x))
g <- cmpfun(f)

library(microbenchmark)
compare <- microbenchmark(f(1000, 1), g(1000, 1), times = 1000)

library(ggplot2)
autoplot(compare)

functions.R
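
A base-R sketch of the same comparison, for cases where microbenchmark is not installed. f2() and g2() are hypothetical variants of the loop above that return x, so the two versions can be checked against each other. Note that from R 3.4.0 onwards the byte-code JIT is enabled by default, so on newer versions the gap between the two may be small.

```r
library(compiler)

# Variant of the loop that returns its final value (an assumption made
# here so that the interpreted and compiled results can be compared)
f2 <- function(n, x) {
  for (i in 1:n) x <- (1 + sin(x))^(cos(x))
  x
}
g2 <- cmpfun(f2)

# Compiling must not change the result
stopifnot(isTRUE(all.equal(f2(1000, 1), g2(1000, 1))))

# Rough timing with system.time(): 200 calls of each version
t_f <- system.time(for (k in 1:200) f2(1000, 1))[["elapsed"]]
t_g <- system.time(for (k in 1:200) g2(1000, 1))[["elapsed"]]
cat("plain:", t_f, "s   cmpfun:", t_g, "s\n")
```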

For speed – always vectorize!

54x speedup!

Compiling the function improved the median run time somewhat (< 2x).

Using the vectorized form was much faster.

distnorm <- function(){
  x <- seq(-5, 5, 0.01)
  y <- rep(NA,length(x))
  for(i in 1:length(x)) {
    y[i] <- stdnorm(x[i])
  }
  return(list(x=x,y=y))
}

vdistnorm <- function(){
  x <- seq(-5, 5, 0.01)
  y <- stdnorm(x)
  return(list(x=x, y=y))
}

functions.R
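
The slides do not show stdnorm(). The sketch below assumes it is the standard normal density, defines it that way, and checks that the loop and vector forms agree; only the definition of stdnorm() is an assumption, the other two functions follow the slide.

```r
# Assumed definition: the slides do not include stdnorm()
stdnorm <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)

# Loop form: one call to stdnorm() per element
distnorm <- function() {
  x <- seq(-5, 5, 0.01)
  y <- rep(NA, length(x))
  for (i in seq_along(x)) {
    y[i] <- stdnorm(x[i])
  }
  list(x = x, y = y)
}

# Vector form: a single call on the whole vector
vdistnorm <- function() {
  x <- seq(-5, 5, 0.01)
  list(x = x, y = stdnorm(x))
}

loop_result <- distnorm()
vec_result  <- vdistnorm()

# Both forms give the same values, which also match R's built-in dnorm()
stopifnot(isTRUE(all.equal(loop_result$y, vec_result$y)))
stopifnot(isTRUE(all.equal(vec_result$y, dnorm(vec_result$x))))
```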

Our Example Application

# Define a function that performs a random walk with a
# specified bias that decays
rw2d <- function(n, mu, sigma){
  steps=matrix(, nrow=n, ncol=2)
  for (i in 1:n){
    steps[i,1] <- rnorm(1, mean=mu, sd=sigma )
    steps[i,2] <- rnorm(1, mean=mu, sd=sigma )
    mu <- mu/2
  }
  return( apply(steps, 2, cumsum) )
}

mc_parallel.R
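
In the spirit of the earlier advice to vectorize, here is a sketch of a loop-free variant. rw2d_vec() is a hypothetical function, not part of mc_parallel.R; it relies on the i-th step having mean mu / 2^(i - 1), and draws the x and y values in the same order as the loop so that the same seed should reproduce the same walk.

```r
# Loop version, as defined above
rw2d <- function(n, mu, sigma) {
  steps <- matrix(, nrow = n, ncol = 2)
  for (i in 1:n) {
    steps[i, 1] <- rnorm(1, mean = mu, sd = sigma)
    steps[i, 2] <- rnorm(1, mean = mu, sd = sigma)
    mu <- mu / 2
  }
  apply(steps, 2, cumsum)
}

# Hypothetical vectorized variant
rw2d_vec <- function(n, mu, sigma) {
  step_means <- mu / 2^(seq_len(n) - 1)   # bias halves at every step
  # Order of draws: x1, y1, x2, y2, ... to mirror the loop
  draws <- rnorm(2 * n, mean = rep(step_means, each = 2), sd = sigma)
  steps <- matrix(draws, ncol = 2, byrow = TRUE)
  apply(steps, 2, cumsum)
}

set.seed(1); walk_loop <- rw2d(50, 3, 1)
set.seed(1); walk_vec  <- rw2d_vec(50, 3, 1)
stopifnot(isTRUE(all.equal(walk_loop, walk_vec)))
```

For very long walks the two versions can differ by a negligible amount once the bias underflows towards zero, so the check above uses a short walk.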

A bigger task…

library(foreach)

# Generate random walks of lengths between 1000 and 5000

# foreach loop
system.time(
  results <- foreach(l=1000:5000) %do% rw2d(l, 3, 1)
)
#    user  system elapsed
#  85.872   0.145  86.242

# Apply
system.time(
  results <- lapply( 1000:5000, rw2d, 3, 1)
)
#    user  system elapsed
#  81.175   0.114  81.511

mc_parallel.R

Start a cluster (of R slave workers on a single machine)

Single node, multiple cores running multiple R slaves

# Parallel, single node
library(parallel)
library(doParallel)

# Create a cluster of workers using all cores
cl <- makeCluster( detectCores() )
# Tell foreach with %dopar% to use this cluster
registerDoParallel(cl)

# ... parallel code runs here ...

# Shut down the workers when finished
stopCluster(cl)

mc_parallel.R
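
A minimal sketch of the full worker lifecycle using only the base parallel package. The worker count of 2 is an assumption chosen to keep the example light; on a cluster node detectCores() could be used as in the slide.

```r
library(parallel)

# Start two worker processes on the local machine
cl <- makeCluster(2)

# Run a trivial task on the workers; the finally clause releases the
# workers even if the parallel call fails
squares <- tryCatch(
  parSapply(cl, 1:10, function(i) i^2),
  finally = stopCluster(cl)
)
print(squares)
```

Wrapping the work in tryCatch() with a finally clause avoids leaving orphaned R processes behind after an error.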

Explicit Parallelization in R

Our optimized R automatically parallelizes linear algebra on a single machine, which is enough in a lot of cases!

Always prefer using vector/matrix form over for loops and apply functions to get the most out of these optimizations.

If you need more options you can control the parallelization:

library(parallel)    # Single-node and cluster parallelization
                     # apply functions and explicit execution

library(doParallel)  # Simple parallel foreach loops

Can run parallel code on a single node (multicore) or across nodes (MPI)

R parallel vs MKL conflict

Intel MKL tries to use all cores for every linear algebra operation.

R is running multiple iterations of a loop in parallel using all cores.

If used together too many threads/processes are launched – far more than cores!

export OMP_NUM_THREADS=1 # on terminal before running R

Sys.setenv(OMP_NUM_THREADS="1") # within R

~ 5% improvement by disabling MKL multi-threading
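
A sketch of one way to confirm that the setting reaches the workers. It assumes local workers started by makeCluster(), which inherit the environment of the R session that launches them; the worker count of 2 is arbitrary.

```r
library(parallel)

# Set the variable before the workers are started so they inherit it
Sys.setenv(OMP_NUM_THREADS = "1")

cl <- makeCluster(2)
worker_threads <- tryCatch(
  # Ask every worker what it sees for OMP_NUM_THREADS
  unlist(clusterEvalQ(cl, Sys.getenv("OMP_NUM_THREADS"))),
  finally = stopCluster(cl)
)
print(worker_threads)
```

Exporting the variable in the terminal before starting R, as shown above, is the safer option for the main R process, since a threaded library may read the setting only once when it is first loaded.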

This time in parallel!

cl <- makeCluster( detectCores() )
registerDoParallel(cl)
Sys.setenv(OMP_NUM_THREADS="1")

# Generate random walks of lengths between 1000 and 5000

# Parallel foreach loop (5x speedup)
system.time(
  results <- foreach(l=1000:5000) %dopar% rw2d(l, 3, 1)
)
#    user  system elapsed
#   2.928   0.441  17.374

# Parallel apply (9x speedup)
system.time(
  results <- parLapply( cl, 1000:5000, rw2d, 3, 1)
)
#    user  system elapsed
#   0.339   0.171   8.460

stopCluster(cl)

mc_parallel.sh
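
A scaled-down sketch of the parallel apply that runs with base R alone. The worker count, the shorter walk lengths and the seed are assumptions for illustration; clusterSetRNGStream() is added here (it is not in the slide) so that the random walks are reproducible across runs.

```r
library(parallel)

# Random walk function from the example application
rw2d <- function(n, mu, sigma) {
  steps <- matrix(, nrow = n, ncol = 2)
  for (i in 1:n) {
    steps[i, 1] <- rnorm(1, mean = mu, sd = sigma)
    steps[i, 2] <- rnorm(1, mean = mu, sd = sigma)
    mu <- mu / 2
  }
  apply(steps, 2, cumsum)
}

cl <- makeCluster(2)
walks <- tryCatch({
  # Give each worker its own reproducible random number stream
  clusterSetRNGStream(cl, iseed = 123)
  # rw2d is shipped to the workers as the function argument
  parLapply(cl, 100:120, rw2d, 3, 1)
}, finally = stopCluster(cl))

# One matrix per requested length; collect the final position of each walk
end_points <- t(sapply(walks, function(w) w[nrow(w), ]))
print(head(end_points))
```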

MPI parallelization – for really big jobs

MPI is available on R/3.1.2-intel only

We will continue to use the simple parallel and doParallel packages

Lots online about ‘snow’ – this is now behind the scenes in new versions of R

Please join us for coffee to discuss MPI projects using R

Work in progress: optimizations with your help

MPI parallelization – easy!

cl <- makeCluster( 128, type="MPI" )

Number of MPI tasks: cores per node * nodes (or fewer if RAM limited)

48 cores per node for the 256GB partition

32 cores per node for other partitions

mpi_parallel.R

mpi.exit()

Add to bottom of your R code to ensure tidy exit
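
The task count rule above can be sketched as a small calculation. mpi_task_count() and its RAM arguments are hypothetical names introduced for illustration; the per-task memory figure in the second call is an invented example value, not a BioHPC recommendation.

```r
# Hypothetical helper: tasks = nodes * cores per node, reduced when the
# memory needed per task does not allow one task per core
mpi_task_count <- function(nodes, cores_per_node,
                           ram_per_node_gb = NULL, ram_per_task_gb = NULL) {
  per_node <- cores_per_node
  if (!is.null(ram_per_node_gb) && !is.null(ram_per_task_gb)) {
    per_node <- min(per_node, floor(ram_per_node_gb / ram_per_task_gb))
  }
  as.integer(nodes * per_node)
}

# 4 nodes with 32 cores each
n_full <- mpi_task_count(4, 32)

# 4 nodes of the 256GB partition (48 cores), assuming 8 GB per task:
# memory allows only 32 tasks per node
n_ram_limited <- mpi_task_count(4, 48, ram_per_node_gb = 256, ram_per_task_gb = 8)

cat("tasks:", n_full, "and", n_ram_limited, "\n")

# Inside a Slurm job the allocated task count is normally also available
# in the SLURM_NTASKS environment variable (falls back to 1 outside a job)
n_slurm <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "1"))
```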

MPI parallelization – submitting the job

#!/bin/bash

#SBATCH --job-name R_MPI_TEST

# Number of nodes required to run this job
#SBATCH -N 4
# Distribute n tasks per node
#SBATCH --ntasks-per-node=32

#SBATCH -t 0-2:0:0
#SBATCH -o job_%j.out
#SBATCH -e job_%j.err
#SBATCH --mail-type ALL
#SBATCH --mail-user [email protected]

module load R/3.2.1-intel

ulimit -l unlimited
# R is started directly here, without mpirun
R --vanilla < mpi_parallel.R

# END OF SCRIPT

No mpirun!

mpi_parallel.sh

MPI Performance

# Sequential (with MKL multi-threading)
system.time(
  results <- lapply( 1000:10000, rw2d, 3, 1)
)
#    user  system elapsed
# 329.173   0.610 330.607

# Parallel apply, 4 nodes, 128 MPI tasks
system.time(
  results <- parLapply( cl, 1000:10000, rw2d, 3, 1)
)
#    user  system elapsed
#  18.815   0.951  19.848

16x Speedup
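
The quoted speedups follow from the elapsed times reported on the slides. A short worked calculation, using only those reported numbers; the efficiency figure (speedup divided by the number of MPI tasks) is an extra derived quantity that the slides do not state.

```r
# Elapsed times in seconds, copied from the system.time() output above
timings <- data.frame(
  run      = c("foreach %dopar%", "parLapply (single node)", "parLapply (MPI, 128 tasks)"),
  serial   = c(86.242, 81.511, 330.607),
  parallel = c(17.374, 8.460, 19.848)
)

# Speedup = sequential elapsed time / parallel elapsed time
raw_speedup <- timings$serial / timings$parallel
timings$speedup <- round(raw_speedup, 1)
print(timings)

# Parallel efficiency of the MPI run: speedup per task
mpi_efficiency <- raw_speedup[3] / 128
cat(sprintf("MPI efficiency: %.0f%%\n", 100 * mpi_efficiency))
```

An efficiency well below 100% on 128 tasks suggests that communication and start-up overhead, rather than computation, dominate for a job of this size.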

R Markdown / knitr

Write R code inside markdown documents

Create attractive HTML, PDF, Word output that includes the code and output

BioconductoR

A comprehensive set of Bioinformatics related packages for R

Software and datasets

Bioconductor

Base packages installed, plus some commonly used extras

Install additional packages to home directory:

source("http://bioconductor.org/biocLite.R")
biocLite('limma')

Ask [email protected] for packages that fail to compile
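
The biocLite() route shown above matches the R version used in this session. Later Bioconductor releases replaced it with the BiocManager package, so a sketch for newer R versions might look like the following. install_bioc() is a hypothetical wrapper written for this example, and its dry-run default means nothing is downloaded unless asked for.

```r
# Hypothetical wrapper around BiocManager::install() for newer R versions.
# With dry_run = TRUE it only reports which packages are missing.
install_bioc <- function(pkgs, dry_run = TRUE) {
  have <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  todo <- pkgs[!have]
  if (!dry_run && length(todo) > 0) {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
      install.packages("BiocManager")   # BiocManager itself comes from CRAN
    }
    BiocManager::install(todo)
  }
  invisible(todo)
}

# Report whether limma would need installing, without installing it
needed <- install_bioc("limma")
cat("Would install:", if (length(needed)) needed else "nothing", "\n")
```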

BioconductoR

Bioconductor workflows are fantastic tutorials

http://www.bioconductor.org/help/workflows/

BioconductoR Example

DEMO

RNA-Seq Analysis & UCSC Genome Browser

See bioconductor.Rmd

Dallas R Users Group

http://www.meetup.com/Dallas-R-Users-Group/

University of Dallas, Irving, Saturdays