using&sprint&and¶llelised& func6ons&for&analysis&of&large ... ·...

38
Using SPRINT and parallelised func6ons for analysis of large data on mul6core Mac and HPC pla?orms Eilidh Troup 1* , Thorsten Forster 2 , Luis Cebamanos 1 , Terence Sloan 1 , Peter Ghazal 2 1. Edinburgh Parallel Compu6ng Centre, University of Edinburgh, Edinburgh, UK 2. Division of Pathway Medicine, University of Edinburgh Medical School, Edinburgh, UK *Contact author: [email protected]

Upload: others

Post on 26-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Using  SPRINT  and  parallelised  func6ons  for  analysis  of  large  data  on  mul6-­‐core  Mac  and  HPC  pla?orms    

Eilidh  Troup1*,  Thorsten  Forster2,  Luis  Cebamanos1,  Terence  Sloan1,  Peter  Ghazal2  

   1.  Edinburgh  Parallel  Compu6ng  Centre,  University  of  Edinburgh,  Edinburgh,  UK                                2.  Division  of  Pathway  Medicine,  University  of  Edinburgh  Medical  School,  

Edinburgh,  UK  *Contact  author:  [email protected]  

Page 2: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo6va6on  How  to  use  SPRINT  SPRINT  Implementa6on  SPRINT  Func6ons  Performance  Case  study  

Page 3: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo@va@on  How  to  use  SPRINT  SPRINT  Implementa6on  SPRINT  Func6ons  Performance  Case  study  

Page 4: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

R performance bottlenecks

Large  data  set  

Complex  analysis  +  

Running  6me  (CPU)  

Memory  overflow  (RAM)  

Small  data  set  

Complex  analysis  +/-­‐    

Page 5: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

R solutions for parallelisation and memory usage

Original R script

parallel() rmpi() snow() ff() bigmemory() …

For extensive resources, see Dirk Eddelbuettel’s HPC task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html

1.  Simplify analysis, reduce data set, batch process 2. Use R functionality to parallelise code and extend memory (on supercomputers, clusters, multicore machines)

OS command line parallel script submission

Page 6: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Not  all  func6ons  are  easy  to  parallelise.  

•  Correla6on  for  example.  

1  2  3  4  5  6  

•  Clustering  is  another  example  where  the  data  cannot  be  considered  separately.  

•  Other  SPRINT  func6ons  provide  op6mised  implementa6ons,  or  handle  larger  datasets.    

cor()  

cor()  

Page 7: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT  approach  Overcomes  limita6ons  on  data  size  and  analysis  @me  and  by  providing  easy  access  to  HPC  for  all  R  users  

 

Simple  Parallel  R  INTerface  (www.r-­‐sprint.org)    

“SPRINT:  A  new  parallel  framework  for  R”,  J  Hill  et  al,  BMC  Bioinforma6cs,  Dec  2008.  

Photo:  Mark  Sadowski  

Page 8: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo6va6on  How  to  use  SPRINT  SPRINT  Implementa6on  SPRINT  Func6ons  Performance  Case  study  

Page 9: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Example"

1 Dec 2011 9

!

my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!

genecor <- cor( t(my.matrix) )!

!

quit(save="no”)!

Page 10: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Example"

1 Dec 2011 10

library("sprint”)!

my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!

genecor <- cor( t(my.matrix) )!

!

quit(save="no”)!

Page 11: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Example"

1 Dec 2011 11

library("sprint”)!

my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!

genecor <- pcor( t(my.matrix) )!

!

quit(save="no”)!

Page 12: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Example"

1 Dec 2011 12

library("sprint”)!

my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!

genecor <- pcor( t(my.matrix) )!

pterminate()!

quit(save="no”)!

Page 13: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

How to run"

library("sprint”)!

my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!

genecor <- pcor( t(my.matrix) )!

pterminate()!

quit(save="no”)!

sprint_script.R    

$  mpiexec  -­‐n  4  R  -­‐f  sprint_script.R  

Page 14: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo6va6on  How  to  use  SPRINT  SPRINT  Implementa@on  SPRINT  Func6ons  Performance  Case  study  

Page 15: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

mpiexec "

$  mpiexec  -­‐n  4  R  -­‐f  sprint_script.R  

R  -­‐f  sprint_script.R   R  -­‐f  sprint_script.R   R  -­‐f  sprint_script.R   R  -­‐f  sprint_script.R  

Page 16: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT architecture SPRINT  overview  schema,  for  those  interested  in  implementa:on    (not  important  to  using  SPRINT)  

Page 17: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo6va6on  How  to  use  SPRINT  SPRINT  Implementa6on  SPRINT  Func@ons  Performance  Case  study  

Page 18: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

pcor() ppam() prandomForest() pmaxt() pRP() psvm() pstringdistmatrix() papply() pboot() pdist()*

Pearson correlation for pairs of numeric variables Partitioning-Around-Medoids clustering Random Forest classification algorithm Permutation-adjusted p-values Rank-Product non-parametric statistical permutation-based test Support-Vector-Machine classification algorithm Hamming distance for pairs of character strings Apply any function to each row/column in a matrix Bootstrap estimates for any given statistic/function A variety of distance metric to compute (dis)similarity of data vectors

Parallelised SPRINT functions

Page 19: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

pcor() Pearson correlation for pairs of numeric variables

Obs

1

Obs

2

Obs

3

Obs

4

Obs

5

Obs

6

Obs

7

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

Var10

Var11

Var12

Var13

Var14

VarP

Input data Va

r1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1 1

Var2 1

Var3 1

Var4 1

Var5 1

Var6 1

Var7 1

Var8 1

Var9 1

Var10 1

Var11 1

Var12 1

Var13 1

Var14 1

VarP 1

Output – Correlation matrix (“adjacency matrix”, “similarity matrix”)

(N rows) (N2 correlation coefficients)

Perform correlation on all possible pairs of variables.

(1-correlation matrix = distance matrix)

Page 20: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

ppam() Partitioning-Around-Medoids clustering

Obs

1

Obs

2

Obs

3

Obs

4

Obs

5

Obs

6

Obs

7

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

Var10

Var11

Var12

Var13

Var14

VarP

Input data

4 clusters

Mea

sure

men

t

Observations

Mea

sure

men

t

Observations

Compute distance between all possible pairs of variables, then partition into sets of maximal distinction

Op6misa6on  and  parallelisa6on  of  the  par66oning  around  medoids  func6on  in  R.  Piotrowksi  M.  et  al.  BILIS  2011,  Jul  2011.  

Page 21: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

prandomForest()

Obs

1

Obs

2

Obs

3

Obs

4

Obs

5

Obs

6

Obs

7

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

Var10

Var11

Var12

Var13

Var14

VarP

Data

Construct ‘forest’ of decision trees. Each tree is for a bootstrap sample of the input data set. Aggregate by majority vote.

Treated Control Tr

eate

d Tr

eate

d C

ontro

l C

ontro

l Tr

eate

d C

ontro

l Tr

eate

d Tr

eate

d

Predicted class for new observations

Most important variables for predicting class

Var5

Var7

Var34

Var100

Var29

Var655 Known class:

Random Forest classification algorithm

C   T  T   T  C  

C   T  T   T  C  

C   T  T   T  C  

A  parallel  random  forest  classifier  for  R,  L.  Mitchell  et  al,  HPDC  2011,  Jun  2011.  

Page 22: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

pmaxT()

22

Permutation-adjusted p-values Based on mt.maxT()

Obs

1

Obs

2

Obs

3

Obs

4

Obs

5

Obs

6

Obs

7

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

VarP

Data

Treated Control Class:

Adj

uste

d p-

valu

e

p-va

lue

t Obs

3

Obs

7

Obs

4

Obs

5

Obs

1

Obs

6

Obs

2

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

VarP

Resample columns and re-compute statistical test

Repeat B times to derive a Null distribution, base adjusted p on this

“Treated” “Control”

Op6miza6on  of  a  parallel  permuta6on  tes6ng  func6on  for  the  SPRINT  R  package,  S.  Petrou  et  al,  Concurrency  and  Computa6on:  Prac6ce  and  Experience,  Jun  2011.  

Page 23: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

pRP() Rank-Product non-parametric statistical permutation-based test

Obs

1

Obs

2

Obs

3

Obs

4

Obs

5

Obs

6

Obs

7

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

VarP

Treated Control Class:

T/C

ratio

1

T/C

ratio

2

T/C

ratio

TxC

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

VarP

Data

Ratio data (all possible pairs of T/C)

Ran

k Pr

oduc

t

Ran

k Pr

oduc

t

RP

p-v

alue

Repeat steps after permuting rows of source data B times to derive a p-value

Parallel  classifica6on  and  feature  selec6on  in  microarray  data  using  SPRINT.  Mitchell  L.  et  al.  2012.  Concurrency  and  Computa6on:  Prac6ce  and  Experience.  

Page 24: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

pstringdistmatrix() Hamming distance for pairs of character strings

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1

Var1 0

Var2 0

Var3 0

Var4 0

Var5 0

Var6 0

Var7 0

Var8 0

Var9 0

Var10 0

Var11 0

Var12 0

Var13 0

Var14 0

VarP 0

Distance matrix

(N strings)

(N2 alignment scores)

Calculate string alignment on all possible string pairs.

Input data= string vector

abc acd abc … bcd ada

Page 25: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

papply()

Obs

1

Obs

2

Obs

3

Obs

4

Obs

5

Obs

6

Obs

7

Obs

N

Var1

Var2

Var3

Var4

Var5

Var6

Var7

Var8

Var9

Var10

Var11

Var12

Var13

Var14

VarP

Input

Apply any function to each row or column in a matrix*

Output papply()

papply()

Output *papply() includes lapply() functionality

Page 26: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

pboot() Bootstrap estimates for any given statistic/function

Input

Group 1 Group 2

Random sample 1

Random sample 2

Random sample B

Random sample …

For each sample, calculate: Mean(Group 1) / Mean (Group 2)

Ratio = Mean(Group 1) / Mean (Group 2)

Summary estimate (e.g. standard error)

Parallel  Op6misa6on  of  Bootstrapping  in  R.  Sloan  TM,  Piotrowski  M,  Forster  T,  Ghazal  P.  arXiv.org  pre-­‐publica6on  January  2014.  

Page 27: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo6va6on  How  to  use  SPRINT  SPRINT  Implementa6on  SPRINT  Func6ons  Performance  Case  study  

Page 28: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT and Data Size"Overcome limitations on data size and analysis time by providing easy

access to High Performance Computing for all R users

Input  Matrix    Size  

Output  Matrix    Size  

Serial    Run  Time  

Parallel  Run  Time  

11,000  x  320  26.85  MB  

0.9  GB   63.18 secs 4.76  secs  

22,000  x  320  53.7  MB  

3.6  GB  Insufficient    memory  

13.87  secs  

35,000  x  320  85.44  MB  

9.12  GB   Crashed 36.64  secs  

45,000  x  320  109.86  MB  

15.08  GB   Crashed 42.18  secs  

For example, Pearson’s correlation, pcor() enables processing of datasets where the output does not fit in physical memory using R ff package.

Benchmark  on  HECToR  -­‐  UK  Na6onal  Supercompu6ng  Service  on  256  cores.  S.  Petrou  et  al,  dCSE  NAG  Report,  www.r-­‐sprint.org.  

Page 29: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Performance increase – pcor()

The pcor() function scales well. (Shown here up to 256 cores).

Page 30: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT and Analysis Time "Overcome limitations on data size and analysis time by providing easy

access to High Performance Computing for all R users

Input  Matrix  Size  

#  Permuta@ons    

Serial  Run  Time  (es@mated)  

Parallel  Run  Time  

36,612  x  76   500,000   6  hrs   73.18  secs  

36,612  x  76   1,000,000   12  hrs   146.64  secs  

36,612  x  76   2,000,000   23  hrs   290.22  secs  

73,224  x  76   500,000   10  hrs   148.46  secs  

73,224  x  76   1,000,000   20  hrs   294.61  secs  

73,224  x  76   2,000,000   39  hrs   591.48  secs  

Benchmark  on  HECToR  -­‐  UK  Na6onal  Supercompu6ng  Service  on  256  cores.  S.  Petrou  et  al,  HPDC  2010  &  CCPE,  2011.  

For  example,  permuta6on  tes6ng,  pmaxT()  is  a  parallel  implementa6on  of  mt.maxT()  from  mulkest  package  (available  from  CRAN)  

Page 31: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT Data Size and Analysis Time "

Overcome limitations on data size and analysis time by providing easy access to High Performance Computing for all R users

Input  Data  Size  

#  Clusters    

Serial  Run  Time  Pam()  

Parallel  Run  Time    Ppam()  

10  000   24   99  mins   1.2  mins  

22  374   24   Insufficient  memory   4.5  mins  

Benchmark on a shared memory cluster with 8 dual-core 2.6GHz AMD Opteron processors with 2GB of RAM per core. M.  Piotrowski  et  al,  BILIS  2011.  

For example, clustering with partitioning around medoids, ppam() •  Parallel implementation of pam() from cluster package (available from CRAN) •  Optimisation of serial version through memory and data storage management •  Increased capacity by using external memory (i.e. ff objects)

Page 32: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Overview  

Mo6va6on  How  to  use  SPRINT  SPRINT  Implementa6on  SPRINT  Func6ons  Performance  Case  study  

Page 33: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Case study 3 microarray gene expression time courses (each a data matrix of 14,819 rows x 25 columns)

~14

,000

gen

es 25 time points

~14

,000

gen

es 25 time points

~14

,000

gen

es 25 time points

T1 T25

Gen

e ex

pres

sion

leve

l

=14,819 gene expression profiles

A B C

T1 T25

Gen

e ex

pres

sion

leve

l

=14,819 gene expression profiles

T1 T25 G

ene

expr

essi

on le

vel

=14,819 gene expression profiles

Page 34: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Usual approach

To measure correlation of gene expression profiles within each of the data matrices OR between the 3 possible pairs of data matrices: N = 3 x 14,8192 ≈ 659 million correlation computations

BUT…

Page 35: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

But we wanted to expand on this

We want to look at time-shifted correlations, where part of each gene’s expression profile in one of the data sets could match a part of another gene’s in another of the data sets.

T1 T25

We split each gene’s expression profile into 21 overlapping time windows of length 5 (= 2 hours)

Page 36: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

Serial cor(), computing all correlations of 2-hour time windows between 2 data sets (14819 x 21 time windows)2 ≈ 97 billion calculations

pcor() use case

Fails

Serial cor() with reduced number of genes (5561) (5561x 21 time windows)2 ≈ 14 billion calculations

~3 hours

Same computation in parallel (and using ‘ff’ package to exceed RAM constraints) on full gene set with SPRINT pcor()

~10 min

Page 37: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT future

- Ensembl learning (multiple classification algorithms and multiple classification parameter values) with clinical microarray data sets to diagnose/prognose disease

- Data fusion of clinical and biological data sets

Biomedical research projects will drive parallelisation of R functionality

…but we’re open to collaborations if there are specific problems to solve

Page 38: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time

SPRINT

EPCC Eilidh Troup Luis Cebamanos Terence Sloan (PI)

DPM Thorsten Forster Peter Ghazal (PI)

Former Contributors and Funders Muriel Mewissen Savaas Petrou Michal Piotrowski Jon Hill Florian Scharinger Laurence Baldwin Bartek Dobrzelecki Lawrence Mitchell Kevin Robertson Andy Turner

www.r-­‐sprint.org