LRZ Course: Big Data Analysis


Page 1: Lrz kurs: big data analysis

Big Data Analysis

Christoph Bernau and Ferdinand Jamitzky

[email protected]

http://goo.gl/kS31X


Page 4: Lrz kurs: big data analysis

Contents

1. A short introduction to big data

2. Parallel programming is hard

3. Hardware @LRZ

4. Functional Programming

5. Available packages for R

6. Parallel Programming Tools

7. SMP Programming

8. Cluster Programming

9. Job Scheduler

10. Calling external binary code

Page 5: Lrz kurs: big data analysis

big data

a short introduction

Page 6: Lrz kurs: big data analysis

What is Big Data?

In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools. (from Wikipedia)

● Buzzword

● High-dimensional data

● Memory-intensive data and/or algorithms

Page 7: Lrz kurs: big data analysis

Who does Big Data?

● Bioinformatics

● Genomics and other "Omics"

● Astronomy

● Meteorology

● Environmental Research

● Multiscale physics simulations

● Economic and financial simulations

● Social Networks

● Text Mining

● Large Hadron Collider

Page 8: Lrz kurs: big data analysis

Hardware for Big Data

● Large Arrays of Hard Disks

● Solid State Disks as temp storage

● Large RAM

● Manycore

● Multicore

● Accelerators

● Tape Archives

Page 9: Lrz kurs: big data analysis

Software Middleware for Big Data

● MapReduce

● Distributed File Systems

● Parallel File Systems

● Distributed Databases

● Task Queues

● Memory Attached Files

Page 10: Lrz kurs: big data analysis

Supercomputer for Big Data

(Flash) Gordon: Data-Intensive Supercomputing at the San Diego Supercomputer Center

● 1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB DDR3-1333 memory

● over 300 TB of high-performance Intel flash memory (SSDs) via 64 dual-socket Intel Westmere I/O nodes

● large-memory supernodes capable of presenting over 2 TB of cache-coherent memory

● dual-rail QDR InfiniBand network

http://www.sdsc.edu/supercomputing/gordon/

Page 11: Lrz kurs: big data analysis

SuperMUC as Big Data System

SuperMUC

● 9,216 dual-socket Intel Sandy Bridge nodes, each with 32 GB DDR3-1333 memory

● parallel file system GPFS

● FDR10 InfiniBand network

● bandwidth to GPFS: 200 GByte/s

● no flash :-(

Page 12: Lrz kurs: big data analysis

parallel programming is hard

Page 13: Lrz kurs: big data analysis

Why parallel programming?

End of the free lunch:

Moore's law no longer delivers faster processors, only more of them. But beware:

2 x 3 GHz < 6 GHz

(cache consistency, multi-threading, etc.)

Page 14: Lrz kurs: big data analysis

The future is parallel

● Moore's law is still valid

● the number of transistors doubles every 2 years

● clock speed saturates at 3 to 4 GHz

● multi-core processors vs. many-core processors

● grid/cloud computing

● clusters

● GPGPUs

(Intel 2000)

Page 15: Lrz kurs: big data analysis

The future is massively parallel

Connection Machine

CM-1 (1983)

12-D Hypercube

65536 1-bit cores

(AND, OR, NOT)

Rmax: 20 GFLOP/s

Page 16: Lrz kurs: big data analysis

The future is massively parallel

JUGENE

Blue Gene/P (2007)

3-D Torus or Tree

65536 64-bit cores

(PowerPC 450)

Rmax: 222 TFLOP/s

now: 1 PFLOP/s

294912 cores

Page 17: Lrz kurs: big data analysis

Supercomputer: SMP

SMP Machine:

shared memory

typically 10s of cores

threaded programs

bus interconnect

in R:

library(multicore)

and inlined code

Example: gvs1

128 GB RAM

16 cores

Example: uv3.cos.lrz.de

2000 GB RAM

1120 cores

Page 18: Lrz kurs: big data analysis

Supercomputer: MPI

Cluster of machines:

distributed memory

typically 100s of cores

message passing interface

infiniband interconnect

in R:

library(Rmpi)

and inlined code

Example: coolMUC

4700 GB RAM

2030 cores

Example: superMUC

320,000 GB RAM

160,000 cores

Page 19: Lrz kurs: big data analysis

Levels of Parallelism

● Node Level (e.g. SuperMUC has approx. 10,000 nodes)

  o each node has 2 sockets

● Socket Level

  o each socket contains 8 cores

● Core Level

  o each core has 16 vector registers

● Vector Level (e.g. the lxgp1 GPGPU has 480 vector registers)

● Pipeline Level (how many simultaneous pipelines)

  o hyperthreading

● Instruction Level (instructions per cycle)

  o out-of-order execution, branch prediction

Page 20: Lrz kurs: big data analysis

Problems: Access Times

Getting data from:              Getting some food from:

CPU register    1 ns            fridge                  10 s
L2 cache        10 ns           microwave               100 s (~2 min)
memory          80 ns           pizza service           800 s (~15 min)
network (IB)    200 ns          city mall               2,000 s (~0.5 h)
GPU (PCIe)      50,000 ns       mum sends cake          500,000 s (~1 week)
hard disk       500,000 ns      grown in own garden     5 Ms (~2 months)

Page 21: Lrz kurs: big data analysis

Computing MFlop/s

mflops.internal <- function(np) {
  a <- matrix(runif(np**2), np, np)
  b <- matrix(runif(np**2), np, np)
  nflops <- np**2 * (2*np - 1)
  time <- system.time(a %*% b)[[3]]
  nflops / time / 1000000
}

This function computes a matrix-matrix multiplication using np x np random matrices. The number of floating point operations is:

● np x np matrix elements

● np multiplications and (np-1) additions per element

resulting in np x np x (np + np - 1) = np**2 * (2*np - 1) FLOPs.
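A quick usage sketch (results are machine- and BLAS-dependent):

mflops.internal(1000)   # e.g. a few thousand MFlop/s with an optimized BLAS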

Page 22: Lrz kurs: big data analysis

Amdahl's law

Computing time for N processors:

T(N) = T(1)/N + T_serial + T_comm * N

Acceleration factor:

S(N) = T(1)/T(N) = N / (1 + (T_serial/T(1)) * N + (T_comm/T(1)) * N^2)

small N: S(N) ~ N

large N: S(N) ~ 1/N

saturation point!
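The saturation point can be made explicit by minimizing T(N) over N (a short derivation, not on the original slide):

\[
  \frac{dT}{dN} = -\frac{T(1)}{N^2} + T_{\mathrm{comm}} = 0
  \quad\Longrightarrow\quad
  N_{\mathrm{opt}} = \sqrt{\frac{T(1)}{T_{\mathrm{comm}}}}
\]

Beyond N_opt the communication term dominates and adding processors increases the total runtime.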

Page 23: Lrz kurs: big data analysis

Amdahl's Law II

Acceleration factor for

Tserial/T(1)=0.01

Page 24: Lrz kurs: big data analysis

Amdahl's law III

> N <- 1:1000   # N was not defined on the slide; any range of core counts works
> plot(N, type="l")
> lines(N/(1+0.01*N), col="red")
> lines(N/(1+0.01*N+0.001*N**2), col="green")

Page 25: Lrz kurs: big data analysis

R on the HLRB-II

Strong scaling for up to 120 cores; beyond that, the computing time per core is too low.

Page 26: Lrz kurs: big data analysis

Leibniz Supercomputing Centre

Hardware @ LRZ

Page 27: Lrz kurs: big data analysis

The Leibniz Supercomputing Centre is…

● Computer Centre (~175 employees) for all Munich universities, with

  o more than 80,000 students and

  o more than 26,000 employees,

  o including 8,500 scientists

● Regional Computer Centre for all Bavarian universities

  o Capacity computing

  o Special equipment

  o Backup and Archiving Centre (10 petabytes, more than 6 billion files)

  o Distributed file systems

  o Competence centre (networks, HPC, IT management)

● National Supercomputing Centre

  o Gauss Centre for Supercomputing

  o Integrated in European HPC and Grid projects

Page 28: Lrz kurs: big data analysis

Hardware @ LRZ
http://www.lrz.de/services/compute/linux-cluster/overview/

The LRZ Linux Cluster: a heterogeneous cluster of Intel-compatible systems

● lx64ia, lx64ia2, lx64ia3 (login nodes)

● gvs1, gvs2, gvs3, gvs4 (remote visualisation nodes, 8 GPUs)

● uv2, uv3 (SMP nodes, 1,040 cores)

● ice1-login (cluster)

● lxa1 (coolMUC, MPP cluster)

The SuperMUC:

● superMIG (migration system and fat island, 8,200 cores)

● superMUC (cluster of thin islands, 147,456 cores, available in Sept 2012)

Page 29: Lrz kurs: big data analysis

Hardware @ LRZ (new Sept 2012): SuperMUC and Linux Cluster

[Diagram: systems and core counts]

● SuperMUC: 147,456 cores

● SuperMIG: 8,200 cores

● CoolMUC: 4,300 cores

● SGI UV: 2,080 cores

● SGI ICE: 512 cores

● gvs1...4: 64 cores

● login nodes: lx64ia2 (8 cores), lx64ia3 (8 cores), supzero (80 cores), supermuc (16 cores)

● architectures: ia64, x86_64, GPU

Page 30: Lrz kurs: big data analysis

File space @ LRZ
http://www.lrz.de/services/compute/backup/

$HOME

25 GB per group, with backup and snapshots (cd $HOME/.snapshot)

$OPT_TMP

temporary scratch space (beware!): High-Watermark Deletion. When the filling of the file system exceeds some limit (typically between 80% and 90%), files will be deleted, starting with the oldest and largest files, until a filling of between 60% and 75% is reached. The precise values may vary.

$PROJECT

project space (max 1 TB), no automatic backup, use dsmc

Page 31: Lrz kurs: big data analysis

module system @ LRZ
http://www.lrz.de/services/software/utilities/modules/

module avail
module list
module load <name>      (e.g. module load matlab)
module unload <name>
module show <name>

To make the module system available inside a batch job, insert:

. /etc/profile

or

. /etc/profile.d/modules.sh

Page 32: Lrz kurs: big data analysis

What our users do: Usage 2010 by research area

Page 33: Lrz kurs: big data analysis

Performance per core by Research area

Page 34: Lrz kurs: big data analysis

batch system @ LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel

simple SLURM script (the slide's annotations are given as comments):

#!/bin/bash                        # ignored by the batch system, but used if the script is executed normally
#SBATCH -J myjob                   # (placeholder) name of job
#SBATCH --mail-user=me@my_domain   # (placeholder) e-mail address (don't forget!)
#SBATCH --time=00:05:00            # maximum run time; may be increased up to the queue limit
. /etc/profile                     # load the standard environment (see the module system slide)
cd mydir                           # change to working directory
./myprog.exe                       # start executable
echo $JOB_ID
ls -al
pwd

Page 35: Lrz kurs: big data analysis

batch system @ LRZ
http://www.lrz.de/services/compute/linux-cluster/batch-parallel

sbatch jobfile.sh      # submit job to SLURM
squeue -u <userid>     # get status of my jobs
scancel <jobid>        # delete my job

Start an interactive shell:

srun --ntasks=32 --partition=uv2_batch xterm

Page 36: Lrz kurs: big data analysis

R makes life easier

functional programming matters

Page 37: Lrz kurs: big data analysis

How are High-Performance Codes constructed?

● “Traditional” construction of high-performance codes:

  o C/C++/Fortran

  o libraries

● “Alternative” construction of high-performance codes:

  o scripting for the ‘brains’

  o GPUs/multicore for the ‘inner loops’

● Play to the strengths of each programming environment.

● Hybrid programming: use cluster and task parallelism at the same time (see the sketch below)

  o cluster parallelism: separated memory

  o task parallelism: shared memory
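A minimal sketch of this hybrid idea (the hostnames 'node1' and 'node2' are hypothetical; assumes both are reachable via ssh and have 4 cores each):

library(snow)
# cluster parallelism: one R process per node, separated memory
cl <- makeSOCKcluster(c('node1', 'node2'))
res <- clusterApply(cl, 1:2, function(i) {
  # task parallelism inside each node: forked processes, shared memory
  library(multicore)
  mclapply(1:4, function(j) sum(runif(1e6)), mc.cores=4)
})
stopCluster(cl)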

Page 38: Lrz kurs: big data analysis

Why scripting?

A scripting language...

● is discoverable and interactive.

● has comprehensive built-in functionality.

● manages resources automatically.

● is dynamically typed.

● works well for "gluing" lower-level blocks together.

● examples: Tcl/Tk, Perl, Python, Ruby, R, MATLAB

Page 39: Lrz kurs: big data analysis

Why functional matters...

● for parallel programming:

  o no side effects

  o code as data

● for structured programming:

  o late binding

  o recursion

  o lazy evaluation

  o very high abstraction

Page 40: Lrz kurs: big data analysis

R functions

● R can define named and anonymous functions.

● Define a (named or anonymous) function:

todB <- function(X) {10*log10(X)}

● Functions can even return (anonymous) functions.

● The last value evaluated is the return value.

● Variables from the enclosing environment are visible (lexical scoping).

● All other variables are local unless specified otherwise.

● Variable number of inputs:

myfunc <- function(...) list(...)

● Variable names and predefined values:

myfunc <- function(a, b=1, c=a*b) c+1
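Two of these points sketched in code (the names todB.rel and to.dBm are illustrative, not from the slide):

# a function returning a function (a closure): 'ref' is captured lexically
todB.rel <- function(ref) function(X) 10*log10(X/ref)
to.dBm <- todB.rel(1e-3)   # dB relative to 1 mW
to.dBm(0.1)                # 20

# default values may refer to earlier arguments:
myfunc <- function(a, b=1, c=a*b) c+1
myfunc(3)      # c defaults to 3*1, result 4
myfunc(3, 2)   # c defaults to 3*2, result 7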

Page 41: Lrz kurs: big data analysis

Available packages for R

Page 42: Lrz kurs: big data analysis

How to use multiple cores with R

● R provides modularization

● R provides high-level abstractions

● R provides mixing of programming paradigms

● R provides dynamic libraries

● R provides vector expressions

Use it!

You can write multi-machine, multi-core, GPGPU-accelerated, client-server based, web-enabled applications using R.

Page 43: Lrz kurs: big data analysis

Parallel R Packages

● foreach: parallel abstraction

● pnmath/MKL: parallel intrinsic functions

● multicore: SMP programming

● snow: Simple Network of Workstations

● Rmpi: Message Passing Interface

● rgpu, gputools: GPGPU programming

● R webservices: client/server webservices

● sqldf: SQL server for R

● rredis: noSQL server for R

● mapReduce: large-scale parallelization

Page 44: Lrz kurs: big data analysis

Parallel programming with R

● Parallel APIs:

  o SMP: multicore

  o MPP/MPI: Rmpi

  o ssh/sockets: snow

● Abstraction:

  o foreach package with the backends doMC, doMPI, doSNOW, doRedis

Example:

library(doMC)
registerDoMC(cores=5)
foreach(i=1:10) %dopar% sqrt(i)           # returns a list of results
roots <- foreach(i=1:10) %dopar% sqrt(i)  # assign the result

Page 45: Lrz kurs: big data analysis

SMP programming

Page 46: Lrz kurs: big data analysis

library(multicore)

● send tasks into the background with parallel()

● wait for completion and gather results with collect()

library(multicore)

# spawn two tasks
p1 <- parallel(sum(runif(10000000)))
p2 <- parallel(sum(runif(10000000)))

# gather results, blocking
collect(list(p1, p2))

# gather results, non-blocking
collect(list(p1, p2), wait=FALSE)

Page 47: Lrz kurs: big data analysis

library(multicore)

● an extension of the apply function family in R

● a function operating on functions, i.e. a functional

● utilizes SMP:

library(multicore)
doit <- function(x, np) sum(sort(runif(np)))

# single call
system.time( doit(0, 10000000) )

# serial loop
system.time( lapply(1:16, doit, 10000000) )

# parallel loop
system.time( mclapply(1:16, doit, 10000000, mc.cores=4) )

Page 48: Lrz kurs: big data analysis

doMC

# R
> library(foreach)
> library(doMC)
> registerDoMC(cores=4)
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
  9.352   2.652  12.002
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  7.228   7.216   3.296

Page 49: Lrz kurs: big data analysis

multithreading with R

library(foreach)

foreach(i=1:N) %do% {
  mmult.f()
}
# serial execution

library(foreach)
library(doMC)
registerDoMC()

foreach(i=1:N) %dopar% {
  mmult.f()
}
# threaded execution

Page 50: Lrz kurs: big data analysis

Cluster Programming

Page 51: Lrz kurs: big data analysis

doSNOW

# R
> library(doSNOW)
> registerDoSNOW(makeSOCKcluster(4))
> system.time(foreach(i=1:10) %do% sum(runif(10000000)))
   user  system elapsed
 15.377   0.928  16.303
> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))
   user  system elapsed
  4.864   0.000   4.865

Page 52: Lrz kurs: big data analysis

SNOW with R

library(foreach)

foreach(i=1:N) %do% {
  mmult.f()
}
# serial execution

library(foreach)
library(doSNOW)
registerDoSNOW(makeSOCKcluster(4))   # registerDoSNOW needs a cluster object

foreach(i=1:N) %dopar% {
  mmult.f()
}
# cluster execution

Page 53: Lrz kurs: big data analysis

Job Scheduler

Page 54: Lrz kurs: big data analysis

noSQL databases

Redis is an open-source, advanced key-value store. It is often referred to as a data-structure server, since keys can contain strings, hashes, lists, sets and sorted sets.

http://www.redis.io

Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R, Ruby, Scala, Smalltalk and Tcl.

Page 55: Lrz kurs: big data analysis

doRedis / workers

start a redis worker:

> echo "require('doRedis'); redisWorker('jobs')" | R --no-save

The workers can be distributed over the internet:

> startRedisWorkers(100)

Page 56: Lrz kurs: big data analysis

doRedis

# R

> library(doRedis)

> registerDoRedis("jobs")

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))

user system elapsed

15.377 0.928 16.303

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))

user system elapsed

4.864 0.000 4.865

Page 57: Lrz kurs: big data analysis

doMC

# R

> library(doMC)

> registerDoMC(cores=4)

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))

user system elapsed

9.352 2.652 12.002

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))

user system elapsed

7.228 7.216 3.296

Page 58: Lrz kurs: big data analysis

doSNOW

# R

> library(doSNOW)

> cl <- makeSOCKcluster(4)

> registerDoSNOW(cl)

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))

user system elapsed

15.377 0.928 16.303

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))

user system elapsed

4.864 0.000 4.865

Page 59: Lrz kurs: big data analysis

redis and R: rredis, doRedis

redisConnect()            # connect to redis store
redisSet('x', runif(5))   # store a value
redisGet('x')             # retrieve value from store
redisClose()              # close connection

redisAuth(pwd)            # simple authentication
redisConnect()
redisLPush('x', 1)        # push numbers into list
redisLPush('x', 2)
redisLPush('x', 3)
redisLRange('x', 0, 2)    # retrieve list

Page 60: Lrz kurs: big data analysis

Calling external binary code

Page 61: Lrz kurs: big data analysis

One R to rule them all

● C/C++/Objective-C

● Fortran

● Java

● MPI

● threads

● OpenGL

● ssh

● web server/client

● Linux, Mac, MS Windows

● R shell

● R GUI

● math notebooks

● automatic LaTeX/PDF

● VTK

Page 62: Lrz kurs: big data analysis

One R to bind them

● C/C++/Objective-C: .C("funcname", args...)

● Fortran: .Fortran("test", args...)

● Java: .jcall("class", args...)

● R objects: .Call

● R objects: .External

Page 63: Lrz kurs: big data analysis

Use R as scripting language

R can dynamically load shared objects:

dyn.load("lib.so")

The functions they contain can then be called via

.C("fname", args)
.Fortran("fname", args)

Page 64: Lrz kurs: big data analysis

C integration

● shared object libraries can be used in R out of the box

● R arrays are mapped to C pointers:

R          C
integer    int*
numeric    double*
character  char*

Example:

R CMD SHLIB -o test.so test.c

use in R:

> dyn.load("test.so")
> .C("test", args)
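The whole round trip can even be scripted from within R (a sketch; the function name 'test' and the file names are illustrative, and a C compiler must be available):

# write a tiny C function that doubles a vector in place
writeLines(c(
  "void test(double *x, int *n) {",
  "  int i;",
  "  for (i = 0; i < *n; i++) x[i] = 2*x[i];",
  "}"), "test.c")
system("R CMD SHLIB -o test.so test.c")   # compile to a shared object
dyn.load("test.so")
.C("test", x=as.double(1:5), n=as.integer(5))$x   # returns 2 4 6 8 10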

Page 65: Lrz kurs: big data analysis

Fortran 90 Example

program myprog
  ! simulate harmonic oscillator
  integer, parameter :: np=1000, nstep=1000
  real :: x(np), v(np), dx(np), dv(np), dt=0.01
  integer :: i, j
  forall(i=1:np) x(i)=i
  forall(i=1:np) v(i)=i
  do j=1,nstep
    dx=v*dt; dv=-x*dt
    x=x+dx; v=v+dv
  end do
  print *, " total energy: ", sum(x**2+v**2)
end program

Page 66: Lrz kurs: big data analysis

Fortran Compiler

use the Intel Fortran compiler:

$ ifort -o myprog.exe myprog.f90
$ time ./myprog.exe

exercise for you:

● compute the MFlop/s rate (floating point operations: 4 * np * nstep)

● optimize (hint: -fast, -O3)

Page 67: Lrz kurs: big data analysis

R subroutine

subroutine mysub(x, v, nstep)
  ! simulate harmonic oscillator
  integer, parameter :: np=1000000
  real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001
  integer :: i, j, nstep
  forall(i=1:np) x(i)=real(i)/np
  forall(i=1:np) v(i)=real(i)/np
  do j=1,nstep
    dx=v*dt; dv=-x*dt
    x=x+dx; v=v+dv
  end do
  return
end subroutine

Page 68: Lrz kurs: big data analysis

Matrix Multipl. in FORTRAN

subroutine mmult(a, b, c, np)
  integer np
  real*8 a(np,np), b(np,np), c(np,np)
  integer i, j, k
  do k=1,np
    forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)
  end do
  return
end subroutine

Page 69: Lrz kurs: big data analysis

Call FORTRAN from R

# compile f90 to shared object library
system("ifort -shared -fPIC -o mmult.so mmult.f90")

# dynamically load library
dyn.load("mmult.so")

# define multiplication function
mmult.f <- function(a, b, c)
  .Fortran("mmult", a=a, b=b, c=c, np=as.integer(dim(a)[1]))

Page 70: Lrz kurs: big data analysis

Call FORTRAN binary

np <- 100
system.time(
  mmult.f(
    a = matrix(numeric(np*np), np, np),
    b = matrix(numeric(np*np)+1., np, np),
    c = matrix(numeric(np*np)+1., np, np)
  )
)

Exercise: make a plot of system time vs. matrix dimension.
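One possible sketch for this exercise (assumes mmult.so has been compiled and mmult.f defined as on the previous slide):

np.values <- c(100, 200, 400, 800)
times <- sapply(np.values, function(np) {
  system.time(
    mmult.f(a = matrix(numeric(np*np), np, np),
            b = matrix(numeric(np*np)+1., np, np),
            c = matrix(numeric(np*np)+1., np, np))
  )[[3]]   # elapsed time in seconds
})
plot(np.values, times, type="b",
     xlab="matrix dimension np", ylab="elapsed time [s]")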

Page 71: Lrz kurs: big data analysis

Big Memory

[Diagram: four logical setups of a node]

● without shared memory: each R process has its own MEM, backed by its own disk

● with shared memory: several R processes share one MEM region

● with file-backed memory: the shared MEM region is backed by a file on disk

● with network-attached file-backed memory: the backing file lives on a network file system

Page 72: Lrz kurs: big data analysis

library(bigmemory)

● shared memory regions for several processes on an SMP node

● file-backed arrays for several nodes over network file systems

library(bigmemory)
x <- as.big.matrix(matrix(runif(1000000), 1000, 1000))
sum(x[1, 1:1000])

Page 73: Lrz kurs: big data analysis

Part II

Applications

Page 74: Lrz kurs: big data analysis

Potential Problems on Big Data Sets

1. many small tasks have to be performed for each of many thousands of variables (long run time)

2. analysis/processing needs more main memory than available

3. several R processes on a node need to process the same big data set, and each process creates its own big R object

4. the data set cannot be loaded into R at all, because the R object representing it would be too big for the available main memory (worst case)

Page 75: Lrz kurs: big data analysis

Approaches for Big Data Problems

1. C functions (shared libraries)

2. accelerators (GPGPUs, MICs)

3. SMP parallelisation

4. cluster parallelisation

5. distributed data

6. in-memory data files (arrays as big as the available memory)

7. parallel file systems (file-backed arrays, no size limit)

8. hierarchical and heterogeneous file systems

Page 76: Lrz kurs: big data analysis

Problem 1: Example (Microarray Data)

● gene expressions for approximately 20,000 genes

● the influence of each variable on a survival response shall be tested

Compute a Cox survival model for each variable:

S(t|x) = S_0(t)^exp(b*x)

● in R: function coxph() in package survival (shipped with every R distribution)

● an even more challenging problem: test all second-order interactions (all pairs, 20000 choose 2)

Page 77: Lrz kurs: big data analysis

Problem 1: Example (Microarray Data)

First approach: a for-loop in R using function coxph() [which internally calls a C function via dyn.load to compute the Cox model]:

library(survHD)
data(beer.survival)
data(beer.exprs)
set.seed(123)
X <- t(as.matrix(beer.exprs))
y <- Surv(beer.survival[,2], beer.survival[,1])
coefs <- c()
system.time(
  for(j in 1:ncol(X)){
    fit <- coxph(y ~ X[,j])
    coefs <- rbind(coefs, summary(fit)$coefficients[1, c(1,3,5)])
  })

   user  system elapsed
 34.635   0.002  34.686

Second approach: using apply:

system.time(output <- apply(t(X), 1, function(xrow){
  fit <- coxph(y ~ xrow)
  summary(fit)$coefficients[1, c(1,3,5)]
}))

   user  system elapsed
 26.531   0.020  26.676

Page 78: Lrz kurs: big data analysis

Problem 1: Example (Microarray Data)

Next approach: moving the loop to C

● pass the whole matrix to C and perform the for-loop inside C

● only the coefficients and corresponding p-values are returned for each variable

● function rowCoxTests in R package survHD

time <- y[,1]
status <- y[,2]
sorted <- order(time)
time <- time[sorted]
status <- status[sorted]
X <- X[sorted,]

## compute columnwise Cox models
# dyn.load is not necessary, because 'coxmat.so' is integrated into survHD
system.time(out <- .C('coxmat',
  regmat  = as.double(X),
  ncolmat = as.integer(ncol(X)),
  nrowmat = as.integer(nrow(X)),
  reg     = as.double(X[,1]),
  zscores = as.double(numeric(ncol(X))),
  coefs   = as.double(numeric(ncol(X))),
  maxiter = as.integer(20), ...))

   user  system elapsed
  0.229   0.000   0.229

max(abs(out$coefs - coefs[,1]))
[1] 1.004459e-07

● performing computations in C/Fortran, i.e. optimizing sequential code, often yields a significant speed-up

● but it is in principle difficult to program and quite error-prone

● C functions for single variables are usually available, and wrappers are usually easy to program

Page 79: Lrz kurs: big data analysis

Comparison to parallel programming: parallelization of the for-loop using snow:

# create cluster
library(snow)
cl <- makeSOCKcluster(10)

# broadcast X
Z <- X
clusterExport(cl=cl, list=list('Z'))

# function to be applied in parallel
parcoxph <- function(ind, y){
  require(survHD)
  zcol <- Z[,ind]
  fit <- coxph(y ~ zcol)
  summary(fit)$coefficients[1, c(1,3,5)]}

# run function on 10 cores
system.time(result <- parLapply(cl=cl, x=1:ncol(Z), fun=parcoxph, y=y))

   user  system elapsed
  0.031   0.003   3.474

● parallelization of very small and short tasks is usually not efficient

● possible improvement: rewrite the code such that bunches of tests are performed per task

Page 80: Lrz kurs: big data analysis

Combining both approaches:

For really big data sets (>100,000 variables) one can combine both approaches:

X2 <- X
for(i in 1:30){
  X2 <- cbind(X2, X)}
colnames(X2) <- 1:ncol(X2)

system.time(tt <- rowCoxTests(t(X2), y, option='fast'))
   user  system elapsed
  0.593   0.010   0.606

system.time(rowCoxTests(t(X), y, option='fast'))
   user  system elapsed
  0.303   0.000   0.303

## using snow
# create cluster
library(snow)
cl <- makeSOCKcluster(10)

# function to be applied in parallel
parfun <- function(ind, Z, y){
  require(survHD)
  rowCoxTests(X=t(Z), y=y, option='fast')}

# run function on 10 cores
system.time(result <- parLapply(cl=cl, x=1:30, fun=parfun, Z=X, y=y))
   user  system elapsed
  1.825   0.291   7.215

X2 <- cbind(X, X, X)
system.time(result <- parLapply(cl=cl, x=1:10, fun=parfun, Z=X, y=y))
   user  system elapsed
  2.255   0.206   3.436

Page 81: Lrz kurs: big data analysis

Combining both approaches: Exercise

In the current example, however, parallel computing is less effective anyway.

Exercise:

1. create a large data set by concatenating the gene-expression matrix 20 times (use cbind)

2. apply the function rowCoxTests() and measure the runtime

3. use snow in order to send the expression matrix to 20 cores and let each core perform rowCoxTests() on its own matrix

4. measure the runtime

Page 82: Lrz kurs: big data analysis

Problem 2: Example

Normalization of gene-expression microarrays:

● approximately 500k measurements per array

● background correction has to be performed

● ca. 50 measurements have to be summarized into a single value representing one gene expression (summarization step)

● R functions: rma() or vsn() in Bioconductor package affy

● high memory requirements as soon as the number of observations exceeds 100 arrays (>10 GB RAM)

Distributed data approach (Bioconductor package affyPara)

Page 83: Lrz kurs: big data analysis

Problem 2: Example

Distributed data approach for background correction

source: Markus Schmidberger: Parallel Computing for Biological Data, Dissertation

Page 84: Lrz kurs: big data analysis

AffyPara: Code Example

# load packages and initialize snow cluster (for affyPara)
library(snow)      # parallelization
library(affyPara)  # parallel preprocessing
library(affy)      # for reading in affy batches
ncpusaffy <- 7                     # number of cpus
cl <- makeSOCKcluster(ncpusaffy)   # create cluster

# read AffyBatch from CEL files
setwd('~/dataCEL/wang05/cel')      # directory containing CEL files
aboall <- ReadAffy()               # reading

# create subcluster of length ncores
ncores <- 7
cll <- cl[1:ncores]

# perform preprocessing using subcluster cll
res <- system.time(arrs.out <- preproPara(aboall,
  bgcorrect=T, bgcorrect.method='rma',
  normalize=T, normalize.method='quantiles',
  pmcorrect.method='pmonly', summary.method='avgdiff',
  cluster=cll))

### stop cluster / finalize MPI
stopCluster(cl)

single core: RAM > 6 GB; 7 cores: ca. 1.5 GB/core, minor speedup

Page 85: Lrz kurs: big data analysis

Problem 2: Exercise

Exercise for you:

1. perform a microarray background correction using serial code (ReadAffy(), bg.correct() in package affy)

2. use top to observe the memory consumption of the process

3. additionally, measure its runtime

4. perform the background correction as a distributed data approach using snow (you can pass a character vector of filenames to ReadAffy() in order to load specific CEL files)

5. compare memory consumption and runtime to the sequential code

Page 86: Lrz kurs: big data analysis

Problem 3/4: Data set too large for RAM

● R cannot handle data indices larger than 2 billion (16 GB of doubles; 4 GB in Windows XP)

● modern biological data can reach several dozen GB (e.g. next-generation sequencing)

● if the R object representing the data set grows larger than the available RAM, R stops with an error reading "cannot allocate vector of size xx"

Possible solution: R package bigmemory (based on C++ libraries for big data objects)

2 areas of usage:

● several processes operate on the same big matrix

● file-backed matrices if data sets are larger than the available main memory

and the combination of both situations.

Page 87: Lrz kurs: big data analysis

R-Package bigmemory

Essential functions:

● big.matrix(): creates a big matrix (useful if the RAM is large enough but several processes have to access the matrix)

● filebacked.big.matrix(): creates a file-backed matrix (necessary if the main memory is too small)

● describe(): creates a descriptor for an existing (file-backed) big.matrix object

● bigmatrix[i1,i2]: big.matrix objects can be handled in R code like normal matrix objects, i.e. their elements can be accessed using brackets
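A minimal sketch of the shared-memory use case (two R sessions on the same node; the file names are illustrative):

# session 1: create a shared matrix and save its descriptor
library(bigmemory)
x <- big.matrix(1000, 1000, type='double', init=0)
desc <- describe(x)
save(desc, file='desc_x.RData')

# session 2 (same node): attach the same memory region
library(bigmemory)
load('desc_x.RData')
y <- attach.big.matrix(desc)
y[1,1] <- 42   # immediately visible as x[1,1] in session 1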

Page 88: Lrz kurs: big data analysis

bigmemory: code example

### write
library(bigmemory)
data(golub)
setwd('~/tmp/bigmem')
X <- as.matrix(golub[,-1])

# create filebacked.big.matrix and write data into its elements
z <- filebacked.big.matrix(nrow=30*5000, ncol=ncol(X), type='double',
       backingfile="magolub.bin", descriptorfile="magolub.desc")
k <- 0
for(i in 1:5000){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+(k*30),] <- X[inds,]
  k <- k+1}

# create and save descriptor file for later usage
desc <- describe(z)
save(desc, file='desc_z.RData')

Page 89: Lrz kurs: big data analysis

bigmemory: code example

### read
library(bigmemory)
setwd('~/tmp/bigmem')

# load descriptor file
load('desc_z.RData')

# attach big.matrix object using the descriptor
y <- attach.big.matrix(desc)

# access elements
y[1:10, 7]

# read element 7 in the 5th row
b <- y[5, 7]

# compute the sum of a submatrix
(sum1 <- sum(y[1:10, 5:20]))

Page 90: Lrz kurs: big data analysis

bigmemory: exercise

Exercise for you:

1. create a bigmatrix object using big.matrix()

2. create a descriptor and save it

3. start another R-session on the same node

4. load the descriptor file and attach the bigmatrix

5. use the bigmatrix object for communication between both R processes

Page 91: Lrz kurs: big data analysis

Gaining Flexibility: doRedis

● separates job administration and execution

● subtasks are stored in a redis database

  o the master process sends subtasks of a computation to the server

  o workers can log in and request the tasks

  o all necessary R objects are stored in the redis server, too

● necessary software:

  o R packages: rredis, doRedis

  o database: redis-server (Debian package)

Page 92: Lrz kurs: big data analysis

doRedis: essential functionality

● Master process:

  o registerDoRedis(jobqueue, host): connects to the redis server at 'host' and specifies a job queue for the tasks to come

  o foreach(j=1:n) %dopar% {FUN(j)}: sends subtasks to the redis database

  o redisFlushAll(): clears the database

  o removeQueue(): removes a queue from the database

● Worker process:

  o registerDoRedis(jobqueue, host): registers a job queue whose tasks shall be processed

  o startLocalWorkers(n, jobqueue, host): starts n local worker processes which process the tasks specified in jobqueue (uses multicore)

  o redisWorker(jobqueue, host): useful in MPI environments

Usually users do not request or set the database values directly; this is the typical parallelization known from the other "do" packages (a minimal sketch follows below).
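A minimal end-to-end sketch (assumes a redis server running on localhost; the queue name 'myjobs' is illustrative):

# master:
library(doRedis)
registerDoRedis('myjobs')                  # create/attach the job queue
res <- foreach(i=1:100) %dopar% sqrt(i)    # subtasks travel through redis
removeQueue('myjobs')

# worker side (can be started at any time, on any machine that reaches the server):
library(doRedis)
startLocalWorkers(n=4, queue='myjobs')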

Page 93: Lrz kurs: big data analysis

Worker processes can run on any R-compatible hardware and can connect at any time.

[Diagram: the master process (doRedis) sends jobs and R objects to the redis-server, which distributes them to workers (1a...4z) on nodes 1-4 and eventually returns the results]

● robust

● flexible

● dynamic

Page 94: Lrz kurs: big data analysis

doRedis: code example

Master (sends subtasks to the redis server and waits for results):

# redis-server ~/redis/redis-2.2.14/redis.conf   (in a linux shell, starts the redis server)

# cross-validation of classification on microarray data
library(CMA)
X <- as.matrix(golub[,-1])
y <- golub[,1]
ls <- GenerateLearningsets(y=y, method='CV', fold=10, niter=10000)

# function to be applied on each node
cl2 <- function(j){
  require(CMA)
  ttt <- system.time(cl <- svmCMA(y=y, X=X, learnind=ls@learnmatrix[j,], cost=10))
  list(cl, ttt, Sys.info())}

# connect to redis server, send subtasks and wait for results
library(doRedis)
redisFlushAll()
registerDoRedis('jobscmanew')
numtodo <- nrow(ls@learnmatrix)
lll3 <- foreach(j=1:numtodo) %dopar% {cl2(j)}

Page 95: Lrz kurs: big data analysis

doRedis: code example

Worker processes (connect to the server, receive subtasks and objects, return results):

### using multicore (just two lines)
# register job queue from the redis server
registerDoRedis('jobscmanew', host='bernau1.ibe.med.uni-muenchen.de')
# start 10 local workers
startLocalWorkers(n=10, queue='jobscmanew')

### using MPI
# function to be run by each mpi process
startdr <- function(ll){
  library(doRedis)
  redisWorker('jobscmanew', host='bernau1.ibe.med.uni-muenchen.de')
}
# start Rmpi
library(Rmpi)
numworker <- mpi.universe.size()
mpi.spawn.Rslaves()
# let each mpi process connect to the redis server and perform subtasks
mpi.apply(1:numworker, startdr)

Page 96: Lrz kurs: big data analysis

doRedis: exercise

1. connect to the redis server in R

2. submit a job queue

3. start workers to perform the subtasks

4. set a value for variable xnewinteger (use redisSet())

5. request the value of variable xnewinteger (use redisGet())

Page 97: Lrz kurs: big data analysis

Combining doRedis and bigmemory

● redis and doRedis provide high flexibility for performing independent subtasks:

  o worker processes can connect at any time

  o errors in individual processes do not stop the entire computation (robustness)

  o worker processes can run on totally different architectures

  o worker processes can run all around the world

● disadvantage: the database can become a bottleneck if large R objects have to be stored/sent

Solution: separate the large data objects (bigmemory) from the job tasks (redis).

Page 98: Lrz kurs: big data analysis

Combining doRedis and bigmemory: separate task and data channel

Page 99: Lrz kurs: big data analysis

doRedis/bigmemory: Code Example

worker process:

redisbigreadwrite <- function(procind){
  require(CMA)
  require(bigmemory)
  j <- procind
  setwd('~/tmp/bigmemlrz')
  load('desc_z.RData')        # big data object containing many gene expression sets
  load('desc_out.RData')      # big data file for misclassification rates
  z <- attach.big.matrix(desc)
  out <- attach.big.matrix(descout)
  load('descresmat.RData')
  resmat <- attach.big.matrix(descresmat)   # big data object for simulating a large writing operation
  for(iter in 1:10){
    start <- (j-1)*30*10*10 + (iter-1)*30*10 + 1
    X <- z[start:(start+299),]              # read gene expression matrix
    # construct classifier
    cl <- svmCMA(y=sample(c(1,2), nrow(X), replace=T), X=X, learnind=1:25, cost=10)
    out[(j-1)*10+iter] <- mean(abs(cl@y-cl@yhat))   # compute misclassification rate
    resmat[start:(start+299),] <- X         # write X
  }
  # flush
  flush(resmat); flush(out)}

Page 100: Lrz kurs: big data analysis

doRedis/bigmemory: Code Example

master process:

### create big matrix (gene expressions)
library(bigmemory)
setwd('~/tmp/bigmemlrz')
X <- as.matrix(golub[,-1])
z <- filebacked.big.matrix(nrow=30*1500, ncol=ncol(X), type='double',
       backingfile="magolub.bin", descriptorfile="magolub.desc")
for(i in 1:1500){
  inds <- sample(1:nrow(X), 30)
  z[(1:30)+((i-1)*30),] <- X[inds,]}   # (i-1) keeps the row indices within nrow(z)

# create descriptor file and save it for other processes
desc <- describe(z)
save(desc, file='desc_z.RData')

### doRedis part
library(doRedis)
registerDoRedis('rwbigmem')
lll3 <- foreach(j=1:1500) %dopar% redisbigreadwrite(j)

Results are returned in a file-backed object, so the master could quit.

Page 101: Lrz kurs: big data analysis

doRedis/bigmemory: code example

main difference: the underlying network and network file system (IBE: NFS vs. LRZ: NAS)

Page 102: Lrz kurs: big data analysis

Comparison to the standard MPI-IO approach

Difference: MPI is less flexible

● not robust

● collective open/close calls

[Side-by-side code on the slide: Fortran90 MPI-IO implementation vs. R bigmemory implementation]

Page 103: Lrz kurs: big data analysis

doRedis/bigmemory: Exercise

1. run the previous example using only two doRedis workers which perform only a single task

2. rewrite the previous example such that the proportion of class-1 predictions is returned

3. try to rewrite the previous example such that each worker process reads 10 sub-datasets at a time and then constructs a classifier for each of the ten sub-datasets read in

4. create a larger bigmemory matrix of gene expression data (e.g. 1500 matrices of dimension 200x10000) using random numbers and run the previous example using that input 'bigmatrix'

Page 104: Lrz kurs: big data analysis

Thanks for your attention.

Further questions?

The End
