The State of HPC in the Open Source R Ecosystem
Drew Schmidt
November 12, 2016
Slides: wrathematics.github.io/hpcdevcon2016/ Drew Schmidt The State of HPC in the Open Source R Ecosystem
Support and Disclaimer
This material is based upon work supported by the National Science Foundation Division of Mathematical Sciences under Grant No. 1418195.
The findings and conclusions in this presentation have not been formally disseminated by the U.S. Department of Health & Human Services nor by the U.S. Department of Energy, and should not be construed to represent any determination or policy of University, Agency, Administration and National Laboratory.
Speaker Bio
M.S. in mathematics.
Former statistics consultant.
Former full-time university researcher.
Now a miserable grad student.
Prolific complainer on Twitter.
Goals of This Talk
Convince you that R has a legitimate place in HPC.
Give a broad overview of the R package landscape.
Make some very safe predictions.
Contents
1 Background and Motivation
2 A Little History
3 Packages
4 A Closer Look at HPC and R
5 Concluding Remarks
Background and Motivation
1 Background and Motivation
    R Is Weird
    R Is Popular
Background and Motivation R Is Weird
Types
logical (“boolean”)
integer (32-bit int)
numeric (double)
complex (double complex)
character (string)
Also raw and external pointer
Data Structures
Vectors (matrices, n-dim arrays)
Lists (arrays of pointers)
Dataframes (lists with constraints)
Environments (hash tables?!)
That’s it.
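A minimal sketch of the types and data structures listed above, using only base R. The variable names (m, df, e) are illustrative and do not come from the talk.

```r
# Atomic types as reported by typeof()
typeof(TRUE)        # "logical"
typeof(1L)          # "integer"
typeof(1)           # "double" (what R calls numeric)
typeof(1i)          # "complex"
typeof("a")         # "character"
typeof(as.raw(1))   # "raw"

# A matrix is just a vector with a dim attribute
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)        # TRUE

# A data frame is a list whose elements all have the same length
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
is.list(df)         # TRUE

# An environment can be used like a mutable hash table
e <- new.env(hash = TRUE)
assign("key", 42, envir = e)
get("key", envir = e)   # 42
```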
Happy Opposite Day!
T
## [1] TRUE
F
## [1] FALSE

T <- FALSE
F <- TRUE

T
## [1] FALSE
F
## [1] TRUE
Odd Conventions
. has no semantic meaning (except when it does)
t.test()
t.data.frame()
A package is installed in a library.
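The case where the dot does carry meaning is S3 method dispatch, where a function named generic.class is picked up as the method for that class. Here is a small sketch of that mechanism; the generic describe() and its methods are hypothetical names invented for this illustration.

```r
# 'describe' is a hypothetical generic defined only for this illustration
describe <- function(x, ...) UseMethod("describe")

# Fallback method, used when no class-specific method exists
describe.default <- function(x, ...) "something else"

# Here the dot matters: this is the 'describe' method for class 'data.frame'
describe.data.frame <- function(x, ...) "a data frame"

describe(data.frame(a = 1))   # dispatches to describe.data.frame
describe(1:3)                 # dispatches to describe.default

# In base R, t() on a data frame dispatches to t.data.frame(),
# whereas t.test() is its own function and not a method of t()
is.matrix(t(data.frame(a = 1:2, b = 3:4)))
```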
Package or Library?
I wrote a library.
I put that library into a package.
I installed the package . . . into a library.
I load the package with library() ???
*BOOM*
Background and Motivation R Is Popular
Part Programming Language, Part Data Analysis Package
“R is a shockingly dreadful language for an exceptionally useful data analysis environment.”
Tim Smith, from aRrgh: a newcomer’s (angry) guide to R.
IEEE Spectrum’s 2014 Ranking of Programming Languages
IEEE Spectrum’s 2016 Ranking of Programming Languages
Rexer 2015 data scientist survey
Why use R at all?
Most diverse set of statistical methods available.
Rapid prototyping.
CRAN (and increasingly GitHub) packages.
Awesome community.
Syntax is designed for analysis of data.
A Little History
2 A Little History
    Statistics, Data Science, Big Data, and So On
    Enter R
A Little History Statistics, Data Science, Big Data, and So On
HPC: Not Just for PDEs Anymore!
R’s use in HPC.
No traditional HPC. . .
Lots of interesting work
About Traditional HPC. . .
Changing Landscape of HPC
“non-traditional” HPC: everybody but physics.
What kind of software do they need?
Can we leverage any existing HPC stuff?
Problems with “Big Data” Software
Many frameworks; what do they all do?
Don’t always play nice with HPC systems.
Often not as “high level” as advertised.
Almost exclusively batch!
Data Analysis Is An Interactive Activity
Data analysis is an interactive activity. [a]
[a] Data analysis is an interactive activity.
Data science in action
A Little History Enter R
http://datascience.la/john-chambers-user-2014-keynote/
Packages
3 Packages
    Advanced Compute Packages
    HPC Packages
    Hadoop and Applications
    Ok, So What?
Where to Begin?
Many packages of varying scope and quality.
1 core package (parallel)
Over 100 contributed packages
https://cran.r-project.org/web/views/HighPerformanceComputing.html
Even more on GitHub.
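As a rough sketch of what the one core package offers, the following uses parallel to spread a function over a small local cluster. The workload slow_square() and the choice of 2 workers are assumptions made for illustration, not something from the talk.

```r
library(parallel)

# Hypothetical workload: a deliberately slow function of one argument
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Start a small socket cluster; 2 workers is an arbitrary choice
cl <- makeCluster(2)

# Apply the function to each input, splitting the work across the workers
res <- parLapply(cl, 1:8, slow_square)

# Shut the workers down when finished
stopCluster(cl)

unlist(res)
```

On Unix-alikes, mclapply(1:8, slow_square, mc.cores = 2) does the same job by forking instead of starting separate worker processes.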
Packages Advanced Compute Packages
Out of Core Packages
ff, bigmemory and friends
R is very “copy happy”
Many statisticians don’t know about things like XSEDE.
Others hear “Linux” and run away screaming.
Bizarrely, cloud computing is changing this.
Rcpp
Rcpp
RcppArmadillo, RcppEigen
RcppParallel
. . .
Packages HPC Packages
Accelerator Packages
gputools, Magma, HiPLARM, a few others.
Accessibility mostly from things like nvblas and Intel MKL.
Distributed Packages
Rmpi
snow
pbdMPI and friends
Remote Evaluation Packages
rzmq, pbdZMQ
remoter, future
Packages Hadoop and Applications
Hadoop et al. Packages
RHadoop, RHIPE
SparkR
sparklyr
h2o
“Applications”
dplyr and data.table
caret
randomForest
xgboost
Packages Ok, So What?
Is the R community using this stuff?
Short answer: yes.
Long answer: mostly single-node parallelism.
Hard truth: in addition to hype and buzzwords — fear and distrust
Source: https://twitter.com/eddelbuettel/status/787740983433854977
A Closer Look at HPC and R
HPC may be dying, but we’re behind the times
[Figure: package download market share by date, 2014 to 2017, for the packages pbdMPI and Rmpi; y-axis ticks run from 0 to 150.]
“OLCF Researchers Scale R to Tackle Big Science Data Sets”
A problem that takes several hours on Apache Spark [was analyzed] in less than a minute using R on OLCF high-performance hardware.
“. . . for situations where one needs interactive near-real-time analysis, the pbdR approach is much better.”
https://www.hpcwire.com/2016/07/06/olcf-researchers-scale-r-tackle-big-science-data-sets/
[Figure: three kinds of parallel hardware.
Distributed Memory: several processors (PROC + cache), each with its own memory (Mem), joined by an Interconnection Network.
Shared Memory: several cores (CORE + cache) attached to a common Memory.
Co-Processor: Local Memory; GPU (Graphics Processing Unit) or MIC (Many Integrated Core).
A Network label also appears in the diagram.]
[Figure: the same hardware diagram (Distributed Memory, Shared Memory, Co-Processor), annotated with library and package names. A legend distinguishes Released from Under Development.
Names shown: Trilinos, PETSc, PLASMA, DPLASMA, LibSci (Cray), MKL (Intel), ACML (AMD), ScaLAPACK, PBLAS, BLACS, CombBLAS, cuBLAS (NVIDIA), cuSPARSE (NVIDIA), MAGMA, PAPI, Tau, MPI, mpiP, fpmpi, NetCDF4, ADIOS, ZeroMQ, pbdMPI, pbdPAPI, pbdNCDF4, pbdADIOS, pbdPROF, pbdDEMO, pbdDMAT, pbdBASE, pbdSLAP, pbdCS, HiPLAR, HiPLARM, magma.
Group labels: Profiling, I/O, Learning.]
Concluding Remarks
The Future?
Better dplyr backends.
More threading + accelerator usage in packages (Rcpp + RcppParallel).
Astronomical amounts of buzz in the Hadoop/Spark-and-friends space will ultimately hurt us in the MPI space.
∼Thanks!∼
Questions?
Email: [email protected]
GitHub: https://github.com/wrathematics
Web: http://wrathematics.info
Twitter: @wrathematics