using&sprint&and¶llelised& func6ons&for&analysis&of&large ... ·...
TRANSCRIPT
![Page 1: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/1.jpg)
Using SPRINT and parallelised func6ons for analysis of large data on mul6-‐core Mac and HPC pla?orms
Eilidh Troup1*, Thorsten Forster2, Luis Cebamanos1, Terence Sloan1, Peter Ghazal2
1. Edinburgh Parallel Compu6ng Centre, University of Edinburgh, Edinburgh, UK 2. Division of Pathway Medicine, University of Edinburgh Medical School,
Edinburgh, UK *Contact author: [email protected]
![Page 2: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/2.jpg)
Overview
Mo6va6on How to use SPRINT SPRINT Implementa6on SPRINT Func6ons Performance Case study
![Page 3: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/3.jpg)
Overview
Mo@va@on How to use SPRINT SPRINT Implementa6on SPRINT Func6ons Performance Case study
![Page 4: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/4.jpg)
R performance bottlenecks
Large data set
Complex analysis +
Running 6me (CPU)
Memory overflow (RAM)
Small data set
Complex analysis +/-‐
![Page 5: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/5.jpg)
R solutions for parallelisation and memory usage
Original R script
parallel() rmpi() snow() ff() bigmemory() …
For extensive resources, see Dirk Eddelbuettel’s HPC task view: http://cran.r-project.org/web/views/HighPerformanceComputing.html
1. Simplify analysis, reduce data set, batch process 2. Use R functionality to parallelise code and extend memory (on supercomputers, clusters, multicore machines)
OS command line parallel script submission
![Page 6: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/6.jpg)
Not all func6ons are easy to parallelise.
• Correla6on for example.
1 2 3 4 5 6
• Clustering is another example where the data cannot be considered separately.
• Other SPRINT func6ons provide op6mised implementa6ons, or handle larger datasets.
cor()
cor()
![Page 7: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/7.jpg)
SPRINT approach Overcomes limita6ons on data size and analysis @me and by providing easy access to HPC for all R users
Simple Parallel R INTerface (www.r-‐sprint.org)
“SPRINT: A new parallel framework for R”, J Hill et al, BMC Bioinforma6cs, Dec 2008.
Photo: Mark Sadowski
![Page 8: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/8.jpg)
Overview
Mo6va6on How to use SPRINT SPRINT Implementa6on SPRINT Func6ons Performance Case study
![Page 9: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/9.jpg)
Example"
1 Dec 2011 9
!
my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!
genecor <- cor( t(my.matrix) )!
!
quit(save="no”)!
![Page 10: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/10.jpg)
Example"
1 Dec 2011 10
library("sprint”)!
my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!
genecor <- cor( t(my.matrix) )!
!
quit(save="no”)!
![Page 11: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/11.jpg)
Example"
1 Dec 2011 11
library("sprint”)!
my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!
genecor <- pcor( t(my.matrix) )!
!
quit(save="no”)!
![Page 12: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/12.jpg)
Example"
1 Dec 2011 12
library("sprint”)!
my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!
genecor <- pcor( t(my.matrix) )!
pterminate()!
quit(save="no”)!
![Page 13: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/13.jpg)
How to run"
library("sprint”)!
my.matrix <- matrix(rnorm(500000,9,1.7), nrow=20000, ncol=25)!
genecor <- pcor( t(my.matrix) )!
pterminate()!
quit(save="no”)!
sprint_script.R
$ mpiexec -‐n 4 R -‐f sprint_script.R
![Page 14: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/14.jpg)
Overview
Mo6va6on How to use SPRINT SPRINT Implementa@on SPRINT Func6ons Performance Case study
![Page 15: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/15.jpg)
mpiexec "
$ mpiexec -‐n 4 R -‐f sprint_script.R
R -‐f sprint_script.R R -‐f sprint_script.R R -‐f sprint_script.R R -‐f sprint_script.R
![Page 16: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/16.jpg)
SPRINT architecture SPRINT overview schema, for those interested in implementa:on (not important to using SPRINT)
![Page 17: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/17.jpg)
Overview
Mo6va6on How to use SPRINT SPRINT Implementa6on SPRINT Func@ons Performance Case study
![Page 18: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/18.jpg)
pcor() ppam() prandomForest() pmaxt() pRP() psvm() pstringdistmatrix() papply() pboot() pdist()*
Pearson correlation for pairs of numeric variables Partitioning-Around-Medoids clustering Random Forest classification algorithm Permutation-adjusted p-values Rank-Product non-parametric statistical permutation-based test Support-Vector-Machine classification algorithm Hamming distance for pairs of character strings Apply any function to each row/column in a matrix Bootstrap estimates for any given statistic/function A variety of distance metric to compute (dis)similarity of data vectors
Parallelised SPRINT functions
![Page 19: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/19.jpg)
pcor() Pearson correlation for pairs of numeric variables
Obs
1
Obs
2
Obs
3
Obs
4
Obs
5
Obs
6
Obs
7
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
Var10
Var11
Var12
Var13
Var14
VarP
Input data Va
r1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1 1
Var2 1
Var3 1
Var4 1
Var5 1
Var6 1
Var7 1
Var8 1
Var9 1
Var10 1
Var11 1
Var12 1
Var13 1
Var14 1
VarP 1
Output – Correlation matrix (“adjacency matrix”, “similarity matrix”)
(N rows) (N2 correlation coefficients)
Perform correlation on all possible pairs of variables.
(1-correlation matrix = distance matrix)
![Page 20: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/20.jpg)
ppam() Partitioning-Around-Medoids clustering
Obs
1
Obs
2
Obs
3
Obs
4
Obs
5
Obs
6
Obs
7
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
Var10
Var11
Var12
Var13
Var14
VarP
Input data
4 clusters
Mea
sure
men
t
Observations
Mea
sure
men
t
Observations
Compute distance between all possible pairs of variables, then partition into sets of maximal distinction
Op6misa6on and parallelisa6on of the par66oning around medoids func6on in R. Piotrowksi M. et al. BILIS 2011, Jul 2011.
![Page 21: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/21.jpg)
prandomForest()
Obs
1
Obs
2
Obs
3
Obs
4
Obs
5
Obs
6
Obs
7
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
Var10
Var11
Var12
Var13
Var14
VarP
Data
Construct ‘forest’ of decision trees. Each tree is for a bootstrap sample of the input data set. Aggregate by majority vote.
Treated Control Tr
eate
d Tr
eate
d C
ontro
l C
ontro
l Tr
eate
d C
ontro
l Tr
eate
d Tr
eate
d
Predicted class for new observations
Most important variables for predicting class
Var5
Var7
Var34
Var100
Var29
Var655 Known class:
Random Forest classification algorithm
C T T T C
C T T T C
C T T T C
A parallel random forest classifier for R, L. Mitchell et al, HPDC 2011, Jun 2011.
![Page 22: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/22.jpg)
pmaxT()
22
Permutation-adjusted p-values Based on mt.maxT()
Obs
1
Obs
2
Obs
3
Obs
4
Obs
5
Obs
6
Obs
7
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
VarP
Data
Treated Control Class:
Adj
uste
d p-
valu
e
p-va
lue
t Obs
3
Obs
7
Obs
4
Obs
5
Obs
1
Obs
6
Obs
2
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
VarP
Resample columns and re-compute statistical test
Repeat B times to derive a Null distribution, base adjusted p on this
“Treated” “Control”
Op6miza6on of a parallel permuta6on tes6ng func6on for the SPRINT R package, S. Petrou et al, Concurrency and Computa6on: Prac6ce and Experience, Jun 2011.
![Page 23: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/23.jpg)
pRP() Rank-Product non-parametric statistical permutation-based test
Obs
1
Obs
2
Obs
3
Obs
4
Obs
5
Obs
6
Obs
7
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
VarP
Treated Control Class:
T/C
ratio
1
T/C
ratio
2
…
T/C
ratio
TxC
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
VarP
Data
Ratio data (all possible pairs of T/C)
Ran
k Pr
oduc
t
Ran
k Pr
oduc
t
RP
p-v
alue
Repeat steps after permuting rows of source data B times to derive a p-value
Parallel classifica6on and feature selec6on in microarray data using SPRINT. Mitchell L. et al. 2012. Concurrency and Computa6on: Prac6ce and Experience.
![Page 24: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/24.jpg)
pstringdistmatrix() Hamming distance for pairs of character strings
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1
Var1 0
Var2 0
Var3 0
Var4 0
Var5 0
Var6 0
Var7 0
Var8 0
Var9 0
Var10 0
Var11 0
Var12 0
Var13 0
Var14 0
VarP 0
Distance matrix
(N strings)
(N2 alignment scores)
Calculate string alignment on all possible string pairs.
Input data= string vector
abc acd abc … bcd ada
![Page 25: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/25.jpg)
papply()
Obs
1
Obs
2
Obs
3
Obs
4
Obs
5
Obs
6
Obs
7
Obs
N
Var1
Var2
Var3
Var4
Var5
Var6
Var7
Var8
Var9
Var10
Var11
Var12
Var13
Var14
VarP
Input
Apply any function to each row or column in a matrix*
Output papply()
papply()
Output *papply() includes lapply() functionality
![Page 26: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/26.jpg)
pboot() Bootstrap estimates for any given statistic/function
Input
Group 1 Group 2
…
Random sample 1
Random sample 2
Random sample B
Random sample …
For each sample, calculate: Mean(Group 1) / Mean (Group 2)
Ratio = Mean(Group 1) / Mean (Group 2)
Summary estimate (e.g. standard error)
Parallel Op6misa6on of Bootstrapping in R. Sloan TM, Piotrowski M, Forster T, Ghazal P. arXiv.org pre-‐publica6on January 2014.
![Page 27: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/27.jpg)
Overview
Mo6va6on How to use SPRINT SPRINT Implementa6on SPRINT Func6ons Performance Case study
![Page 28: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/28.jpg)
SPRINT and Data Size"Overcome limitations on data size and analysis time by providing easy
access to High Performance Computing for all R users
Input Matrix Size
Output Matrix Size
Serial Run Time
Parallel Run Time
11,000 x 320 26.85 MB
0.9 GB 63.18 secs 4.76 secs
22,000 x 320 53.7 MB
3.6 GB Insufficient memory
13.87 secs
35,000 x 320 85.44 MB
9.12 GB Crashed 36.64 secs
45,000 x 320 109.86 MB
15.08 GB Crashed 42.18 secs
For example, Pearson’s correlation, pcor() enables processing of datasets where the output does not fit in physical memory using R ff package.
Benchmark on HECToR -‐ UK Na6onal Supercompu6ng Service on 256 cores. S. Petrou et al, dCSE NAG Report, www.r-‐sprint.org.
![Page 29: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/29.jpg)
Performance increase – pcor()
The pcor() function scales well. (Shown here up to 256 cores).
![Page 30: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/30.jpg)
SPRINT and Analysis Time "Overcome limitations on data size and analysis time by providing easy
access to High Performance Computing for all R users
Input Matrix Size
# Permuta@ons
Serial Run Time (es@mated)
Parallel Run Time
36,612 x 76 500,000 6 hrs 73.18 secs
36,612 x 76 1,000,000 12 hrs 146.64 secs
36,612 x 76 2,000,000 23 hrs 290.22 secs
73,224 x 76 500,000 10 hrs 148.46 secs
73,224 x 76 1,000,000 20 hrs 294.61 secs
73,224 x 76 2,000,000 39 hrs 591.48 secs
Benchmark on HECToR -‐ UK Na6onal Supercompu6ng Service on 256 cores. S. Petrou et al, HPDC 2010 & CCPE, 2011.
For example, permuta6on tes6ng, pmaxT() is a parallel implementa6on of mt.maxT() from mulkest package (available from CRAN)
![Page 31: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/31.jpg)
SPRINT Data Size and Analysis Time "
Overcome limitations on data size and analysis time by providing easy access to High Performance Computing for all R users
Input Data Size
# Clusters
Serial Run Time Pam()
Parallel Run Time Ppam()
10 000 24 99 mins 1.2 mins
22 374 24 Insufficient memory 4.5 mins
Benchmark on a shared memory cluster with 8 dual-core 2.6GHz AMD Opteron processors with 2GB of RAM per core. M. Piotrowski et al, BILIS 2011.
For example, clustering with partitioning around medoids, ppam() • Parallel implementation of pam() from cluster package (available from CRAN) • Optimisation of serial version through memory and data storage management • Increased capacity by using external memory (i.e. ff objects)
![Page 32: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/32.jpg)
Overview
Mo6va6on How to use SPRINT SPRINT Implementa6on SPRINT Func6ons Performance Case study
![Page 33: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/33.jpg)
Case study 3 microarray gene expression time courses (each a data matrix of 14,819 rows x 25 columns)
~14
,000
gen
es 25 time points
~14
,000
gen
es 25 time points
~14
,000
gen
es 25 time points
T1 T25
Gen
e ex
pres
sion
leve
l
=14,819 gene expression profiles
A B C
T1 T25
Gen
e ex
pres
sion
leve
l
=14,819 gene expression profiles
T1 T25 G
ene
expr
essi
on le
vel
=14,819 gene expression profiles
![Page 34: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/34.jpg)
Usual approach
To measure correlation of gene expression profiles within each of the data matrices OR between the 3 possible pairs of data matrices: N = 3 x 14,8192 ≈ 659 million correlation computations
BUT…
![Page 35: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/35.jpg)
But we wanted to expand on this
We want to look at time-shifted correlations, where part of each gene’s expression profile in one of the data sets could match a part of another gene’s in another of the data sets.
T1 T25
We split each gene’s expression profile into 21 overlapping time windows of length 5 (= 2 hours)
![Page 36: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/36.jpg)
Serial cor(), computing all correlations of 2-hour time windows between 2 data sets (14819 x 21 time windows)2 ≈ 97 billion calculations
pcor() use case
Fails
Serial cor() with reduced number of genes (5561) (5561x 21 time windows)2 ≈ 14 billion calculations
~3 hours
Same computation in parallel (and using ‘ff’ package to exceed RAM constraints) on full gene set with SPRINT pcor()
~10 min
![Page 37: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/37.jpg)
SPRINT future
- Ensembl learning (multiple classification algorithms and multiple classification parameter values) with clinical microarray data sets to diagnose/prognose disease
- Data fusion of clinical and biological data sets
Biomedical research projects will drive parallelisation of R functionality
…but we’re open to collaborations if there are specific problems to solve
![Page 38: Using&SPRINT&and¶llelised& func6ons&for&analysis&of&large ... · SPRINT&Func6ons& Performance Case&study& SPRINT and Data Size" Overcome limitations on data size and analysis time](https://reader034.vdocument.in/reader034/viewer/2022042323/5f0e423a7e708231d43e5fa4/html5/thumbnails/38.jpg)
SPRINT
EPCC Eilidh Troup Luis Cebamanos Terence Sloan (PI)
DPM Thorsten Forster Peter Ghazal (PI)
Former Contributors and Funders Muriel Mewissen Savaas Petrou Michal Piotrowski Jon Hill Florian Scharinger Laurence Baldwin Bartek Dobrzelecki Lawrence Mitchell Kevin Robertson Andy Turner
www.r-‐sprint.org