gpu and mic programming (using python, r and … and mic programming (using python, r and matlab) *...

GPU and MIC programming (using python, R and MATLAB)

*

Ferdinand Jamitzky ([email protected])http://goo.gl/JkYJFY

Moore’s Law (1965-2015)

Number of transistors doubles every 2 years

Why parallel programming?

End of the free lunchin 2000 (heat death)

Moore's law means not faster processors, only more of them.

But!2 x 3 GHz < 6 GHz

(cache consistency,multi-threading, etc)

From Supercomputers to Notebook PCs

The future was (always) massively parallel

Connection MachineCM-1 (1983)

12-D Hypercube

65536 1-bit cores (AND, OR, NOT)

Rmax: 20 GFLOP/s

Today’s notebook PC

The future is massively parallel

JUGENEBlue Gene/P (2007)

3-D Torus or Tree

65536 64-bit cores (PowerPC 450)

Rmax: 222 TFLOP/s

now: 1 PFLOP/s294912 cores

Problem: Moving Data/Latency

Getting data from:CPU register 1ns

L2 cache 10ns

memory 80 ns

network(IB) 200 ns

GPU(PCIe) 50.000 ns

harddisk 500.000 ns

Light ray travels:30 cm

3 m

24 m

60 m

15 km

150 km

Data hungry...

Getting data from:CPU register 1ns

L2 cache 10ns

memory 80 ns

network(IB) 200 ns

GPU(PCIe) 50.000 ns

harddisk 500.000 ns

Getting some food from:fridge 10s

microwave 100s ~ 2min

pizza service 800s ~ 15min

city mall 2000s ~ 0.5h

mum sends cake 500.000 s~1 week

grown in own garden 5Ms ~ 2months

Problem: Transport energy

Moving Data is expensive:

FLOP on CPU 170 pJ

FLOP on GPU 20 pJ

Read from RAM 16000 pJ

Wire 10 cm 3100 pJWire per mm 0.15 pJ/bit/cm

source: http://www.davidglasco.com/Papers/ieee-micro.pdf and W. Dally (nVidia)

http://www.davidglasco.com/Papers/ieee-micro.pdf

Outlook: The Ultimate Supercomputer

How will the ultimate Supercomputer look like?

● Massively parallel architecture● Unified RAM and CPU (ExaFlop/s, PetaByte)● Uniform generic basic building blocks● Fast Algorithms in Hardware, reconfigurable● Highly energy efficient (20-40W)● Hibernate/Switch off unused parts● Warm water cooling (37°C)● Low Clock Frequency ( ~1kHz)● High Resiliency (self repairing)

Supercomputer: Shared Memory

SMP Machine:

shared memorytypically 10s of coresthreaded programsbus interconnect

in R:

library(multicore)and inlined code

Example: gvs1

128 GB RAM16 cores

Example: uv2/3

3.359 GB RAM2.080 cores

Supercomputer: Distributed Mem

Cluster of machines:

distributed memorytypically 100s of coresmessage passing interfaceinfiniband interconnect

in R:

library(Rmpi) and inlined code

Example: linux MPP cluster

2,752 GB RAM2,752 cores

Example: superMUC

340,000 GB RAM155,656 Intel cores

Supercomputer: GPGPU

Graphics Card:

shared memorytypically 1000s of coresCUDA or openCLon chip interconnect

in R:

library(gputools)and dyn.load code

Example: Tesla K20X6 GB RAM2688 Threads

Example: Titan ORNL262,000 GB RAM18,688 GPU Cards50,233,344 Threads

Supercomputer: MIC

Many Core Accelerator:

shared memory60 coresoffload or nativeon chip interconnect

in R:

MKL auto-offloadand dyn.load code

Example: SuperMIC8 GB RAM240 Threads

Example: Tianhe-21,024,000 GB RAM3,120,000 cores

Levels of Parallelism

● Node Level (e.g. SuperMUC has approx. 10000 nodes)each node has 2 sockets

● Socket Level each socket contains 8 cores

● Core Leveleach core has 16 vector registers

● Vector Level (e.g. lxgp1 GPGPU has 480 vector registers)● Pipeline Level (how many simultaneous pipelines)

hyperthreading● Instruction Level (instructions per cycle)

out of order execution, branch prediction, FMA

Amdahl's law

Computing time for N processors

T(N) = T(1)/N + Tserial + Tcomm * N

Acceleration factor (Speedup)

T(1)/T(N) = N / (1 + Tserial/T(1)*N + Tcomm/T(1)*N^2)

small N Speedup: T(1)/T(N) ~ Nlarge N Speedup: T(1)/T(N) ~ T(1)/Tcomm * 1/N

saturation point!

Amdahl's law

> plot(N,type="l")> lines(N/(1+0.01*N+0.001*N**2),col="green")> lines(N/(1+0.01*N),col="red")> Tserial=0.01> Tcomm=0.001

Gustafson's law

large N Speedup: T(1proc)/T(Nproc) ~ T(1proc)/Tcomm * 1/NGrow your problem then it scales better.

Weak Scaling vs. Strong Scaling

e.g. molecularsimulations:

100 atoms/coreare needed

How are High-Performance Codes constructed?

● “Traditional” Construction of High-Performance Codes:○ C/C++/Fortran○ Libraries

● “Alternative” Construction of High-Performance Codes:○ Scripting for ‘brains’ (Computer Games: Logic, AI)○ GPUs for ‘inner loops’ (Computer Games: Visualisation)

● Play to the strengths of each programming environment.

Hierarchical architecture of hardware vs software

● accelerators (gpus, xeon phi)● in-core vectorisation (avx)● multicore nodes (qpi, pci bus)● strongly coupled nodes (infiniband, 10GE)● weakly coupled clusters (cloud)

● Cuda, intrinsics● vectorisation pragmas● openMP● MPI● workflow middleware

Why Scripting?

Do you:● want to reuse CUDA code easily (e.g. as a library) ?● want to dynamically determine whether CUDA is available?● want to use multi-threading (painlessly)?● want to use MPI (painlessly)?● want to use loose coupling (grid computing)?● want dynamic exception handling and fallbacks?● want dynamic compilation of CUDA code?

If you answered "yes" to one of these questions, you should consider a scripting language

Parallel Tools in python, R and MATLAB

SMPmulticore

parallelism

doMC, doSMP, pnmath, BLASno max cores

multiprocessingfutures

MMP massive parallel

processing

doSNOW, doMPI, doRedis

parallel python, mpi4py

GPGPUCUDA

openCL

rgpu, gputools

pyCUDA, pyOpenCL

parfor, spmdmax 8 cores

jobs, pmode gpuArray

R

python

MATLAB

Scripting CUDA

Compiler

CUDA

Interpreter

PGI Fortran NumbraPro pyCUDA rgpu MATLAB

python R

MATLAB GPU Commands

MATLAB GPU @ LRZ

# load matlab module and start command line version

module load cudamodule load matlab/R2011Amatlab -nodesktop

MATLAB gpuArray

● Copy data to GPGPU and return a handle on the object● All operations on the handle are performed on the GPGPU

x=rand(100);gx=gpuArray(x);

● how to compute the GFlop/s

tic;M=gpuArray(rand(np*1000));gather(sum(sum(M*M)));2*np^3/toc

pyCUDA

Gives you the following advantages:

1. Combining Two Strong Tools2. Scripting CUDA3. Run-Time Code Generation

http://mathema.tician.de/software/pycuda

special thanks to a.klöckner




pyCUDA @ LRZ

log in to lxgp1

$ module load python$ module load cuda$ module load boost

$ pythonPython 2.6.1 (r261:67515, Apr 17 2009, 17:25:25) [GCC 4.1.2 20070115 (SUSE Linux)] on linux2Type "help", "copyright", "credits" or "license" for more information.>>>

Simple Example

from numpy import *import pycuda.autoinitimport pycuda.gpuarray as gpu

a = random.randn(4,4).astype(float32) #single

a2_cpu = 2*a #runs on cpu

ag = gpu.to_gpu(a)#transfer to gpuag2 = 2∗ag #runs on gpua2_gpu = ag2.get()#transfer back

gpuarray class

pycuda.gpuarray:

Meant to look and feel just like numpy.

● gpuarray.to gpu(numpy array)● numpy array = gpuarray.get()● +, -, ∗, /, fill, sin, exp, rand, basic indexing, norm, inner product● Mixed types (int32 + float32 = float64)● print gpuarray for debugging.● Allows access to raw bits● Use as kernel arguments, textures, etc.

gpuarray: Elementwise expressions

Avoiding extra store-fetch cycles for elementwise math:

from pycuda.curandom import rand as curanda_gpu = curand((50,))b_gpu = curand((50,))

from pycuda.elementwise import ElementwiseKernellin_comb = ElementwiseKernel(” float a, float ∗x, float b, float ∗y, float ∗z”,”z[ i ] = a∗x[i ] + b∗y[i]”)

c_gpu = gpuarray.empty_like (a_gpu)lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

assert la.norm((c_gpu − (5∗a_gpu+6∗b_gpu)).get()) < 1e−5

gpuarray: Reduction made easy

Example: A scalar product calculation

from pycuda.reduction import ReductionKerneldot = ReductionKernel(dtype_out=numpy.float32, neutral=”0”,reduce_expr=”a+b”, map_expr=”x[i]∗y[i]”,arguments=”const float ∗x, const float ∗y”)

from pycuda.curandom import rand as curandx = curand((1000∗1000), dtype=numpy.float32)y = curand((1000∗1000), dtype=numpy.float32)

x_dot_y = dot(x,y).get()x_dot_y_cpu = numpy.dot(x.get(), y.get ())

CUDA Kernels in pyCUDA

import pycuda.autoinitimport pycuda.driver as drvimport numpyfrom pycuda.compiler import SourceModulemod = SourceModule("""__global__ void multiply_them(float *dest, float *a, float *b){ const int i = threadIdx.x; dest[i] = a[i] * b[i];}""")multiply_them = mod.get_function("multiply_them")a = numpy.random.randn(400).astype(numpy.float32)b = numpy.random.randn(400).astype(numpy.float32)dest = numpy.zeros_like(a)multiply_them( drv.Out(dest), drv.In(a), drv.In(b), block=(400,1,1)print dest-a*b

Completeness

PyCUDA exposes all of CUDA.

For example:● Arrays and Textures● Pagelocked host memory● Memory transfers (asynchronous, structured)● Streams and Events● Device queries● GL Interop

And furthermore:● Allow interactive use● Integrate tightly with numpy

pyCUDA showcase

http://wiki.tiker.net/PyCuda/ShowCase● Agent-based Models● Computational Visual Neuroscience● Discontinuous Galerkin Finite Element PDE Solvers● Estimating the Entropy of Natural Scenes● Facial Image Database Search● Filtered Backprojection for Radar Imaging● LINGO Chemical Similarities● Recurrence Diagrams● Sailfish: Lattice Boltzmann Fluid Dynamics● Selective Embedded Just In Time Specialization● Simulation of spiking neural networks● Time Encoding and Decoding Toolkit● Copenhagen CT toolbox

http://wiki.tiker.net/PyCuda/ShowCase#Copenhagen_CT_toolbox

http://wiki.tiker.net/PyCuda/ShowCase#Copenhagen_CT_toolbox

NumbraPro

Generate CUDA Kernels using a Just-in-time compiler

from numbapro import cuda

@cuda.jit('void(float32[:], float32[:], float32[:])')

def sum(a, b, result):

i = cuda.grid(1) # equals to threadIdx.x + blockIdx.x *

blockDim.x

result[i] = a[i] + b[i]

# Invoke like: sum[grid_dim, block_dim](big_input_1, big_input_2,

result_array)

The Language R

http://www.r-project.org/



R in a nutshell

module load cuda/2.3module load R/serial/2.13

> x=1:10> y=x**2> str(y)> print(x)> times2 = function(x) 2*xgraphics!> plot(x,y)

= and <- are interchangable

rgpu

a set of functions for loading data to a gpu and manipulating the data there:● exportgpu(x)● evalgpu(x+y)● lsgpu()● rmgpu("x")● sumgpu(x), meangpu(x), gemmgpu(a,b)● cos, sin,.., +, -, *, /, **, %*%

https://trac.nbic.nl/rgpu/

Example

load the correct R module$ module load R/serial/2.13

start R$ RR version 2.13.1 (2011-07-08)Copyright (C) 2011 The R Foundation for Statistical ComputingISBN 3-900051-07-0load rgpu library> library(rgpu)> help(package="rgpu")> rgpudetails()

Data on the GPGPU

one million random uniform numbers> x=runif(10000000)

send data to gpu> exportgpu(x)

do some calculations> evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x)))

do some timing comparisons (GPU vs CPU):> system.time(evalgpu(sumgpu(sin(x)+cos(x)+tan(x)+exp(x))))> system.time(sum(sin(x)+cos(x)+tan(x)+exp(x)))

real world examples: gputools

gputools is a package of precompiled CUDA functions for statistics, linear algebra and machine learning● chooseGpu● getGpuId()● gpuCor, gpuAucEstimate● gpuDist, gpuDistClust, gpuHclust, gpuFastICA● gpuGlm, gpuLm● gpuGranger, gpuMi ● gpuMatMult, gpuQr, gpuSvd, gpuSolve● gpuLsfit● gpuSvmPredict, gpuSvmTrain● gpuTtest

Example: Matrix Inversion

np <- 2000x <- matrix(runif(np**2), np,np)

system.time(gpuSolve(x))

system.time(solve(x))

Example: Hierarchical Clustering

numVectors <- 5dimension <- 10Vectors <- matrix(runif(numVectors*dimension), numVectors, dimension)distMat <- gpuDist(Vectors, "euclidean")myClust <- gpuHclust(distMat, "single")plot(myClust)

for other examples try:example(hclust)

Fortran 90 Example

program myprog! simulate harmonic oscillator integer, parameter :: np=1000, nstep=1000 real :: x(np), v(np), dx(np), dv(np), dt=0.01 integer :: i,j forall(i=1:np) x(i)=i forall(i=1:np) v(i)=i do j=1,nstep dx=v*dt; dv=-x*dt x=x+dx; v=v+dv end do print*, " total energy: ",sum(x**2+v**2)end program

PGI Compiler

log in to lxgp1

$ module load fortran/pgi/11.8

$ pgf90 -o myprog.exe myprog.f90

$ time ./myprog.exe

exercise for you: ● compute MFlop/s (Floating Point Operations: 4 * np * nstep)● optimize (hint: -Minfo, -fast, -O3)

Fortran 90 Example

program myprog! simulate harmonic oscillator integer, parameter :: np=1000, nstep=1000 real :: x(np), v(np), dx(np), dv(np), dt=0.01 integer :: i,j forall(i=1:np) x(i)=i forall(i=1:np) v(i)=i do j=1,nstep!$acc region dx=v*dt; dv=-x*dt x=x+dx; v=v+dv!$acc end region end do print*, " total energy: ",sum(x**2+v**2)end program

PGI Compiler accelerator

module load fortran/pgi

pgf90 -ta=nvidia -o myprog.exe myprog.f90

time ./myprog.exe

exercise for you: ● compute MFlop/s (Floating Point Operations: 4 * np * nstep)● optimize (hint: change acc region)

Use R as scripting language

R can dynamically load shared objects:

dyn.load("lib.so")

these functions can then be called via

.C("fname", args)

.Fortran("fname", args)

R subroutine

subroutine mysub_cuda(x,v,nstep)! simulate harmonic oscillator integer, parameter :: np=1000000 real*8 :: x(np), v(np), dx(np), dv(np), dt=0.001 integer :: i,j, nstep forall(i=1:np) x(i)=real(i)/np forall(i=1:np) v(i)=real(i)/np do j=1,nstep dx=v*dt; dv=-x*dt x=x+dx; v=v+dv end do returnend subroutine

Compile two versions

don't forget to load the modules!module unload ccomp fortranmodule load ccomp/pgi/11.8module load fortran/pgi/11.8module load R/serial/2.13

pgf90 -shared -fPIC -o mysub_host.so mysub_host.f90

pgf90 -ta=nvidia -shared -fPIC -o mysub_cuda.so mysub_cuda.f90

Load and run

Load dynamic libraries> dyn.load("mysub_host.so"), dyn.load("mysub_cuda.so"); np=1000000Benchmark> system.time(str(.Fortran("mysub_host",x=numeric(np),v=numeric(np),nstep=as.integer(1000)))) total energy: 666667.6633012500 total energy: 667334.6641391169 List of 3 $ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ... $ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ... $ nstep: int 1000 user system elapsed 26.901 0.000 26.900 > system.time(str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np),nstep=as.integer(1000)))) total energy: 666667.6633012500 total energy: 667334.6641391169 List of 3 $ x : num [1:1000000] -3.01e-07 -6.03e-07 -9.04e-07 -1.21e-06 -1.51e-06 ... $ v : num [1:1000000] 1.38e-06 2.76e-06 4.15e-06 5.53e-06 6.91e-06 ... $ nstep: int 1000 user system elapsed 0.829 0.000 0.830 Acceleration Factor:> 26.9/0.83[1] 32.40964

Matrix Multipl. in FORTRAN

subroutine mmult(a,b,c,np) integer np real*8 a(np,np), b(np,np), c(np,np) integer i,j, k do k=1, np forall(i=1:np,j=1:np)a(i,j)=a(i,j)+b(i,k)*c(k,j) end do returnend subroutine

two inner loops, one outer loop: np*np*npaddition and multiplication: 2 Flop

2*np**3 Float Operations per call!

Call FORTRAN from R

# compile f90 to shared object librarysystem("pgf90 -shared -fPIC -o mmult.so mmult.f90");

# dynamically load librarydyn.load("mmult.so")

# define multiplication functionmmult.f <- function(a,b,c) .Fortran("mmult",a=a,b=b,c=c, np=as.integer(dim(a)[1]))

Call FORTRAN binary

np=100

system.time( mmult.f( a = matrix(numeric(np*np),np,np), b = matrix(numeric(np*np)+1.,np,np), c = matrix(numeric(np*np)+1.,np,np) ))

Exercise: make a plot system-time vs matrix-dimension

PGI accelerator directives

subroutine mmult(a,b,c,np) integer np real*8 a(np,np), b(np,np), c(np,np) integer i,j, k do k=1, np!$acc region forall(i=1:np, j=1:np) a(i,j) = a(i,j) + b(i,k)*c(k,j)!$acc end region end do returnend subroutine

Call FORTRAN from R

# compile f90 to shared object librarysystem("pgf90 -ta=nvidia -shared -fPIC -o mmult.so mmult.f90");

# dynamically load librarydyn.load("mmult.so")

# define multiplication functionmmult.f <- function(a,b,c) .Fortran("mmult",a=a,b=b,c=c, np=as.integer(dim(a)[1]))

Compute MFlop/s

print(paste(2.*np**3/1000000./system.time( str(mmult.f(...)) )[[3]]," MFlop/s"))

Exercise: Compare MFlop/s vs dimension for serial and accelerated code

Intel accelerator directives

subroutine mmult(a,b,c,np) integer np real*8 a(np,np), b(np,np), c(np,np) integer i,j, k !dir$ offload begin target(mic) inout(a,b,c)!$omp parallel shared(a,b,c) !$omp dodo j=1,np do k=1,np forall(i=1:np) a(i,j)=a(i,j)+b(i,k)*c(k,j) end doend do!$omp end do!$omp end parallel!dir$ end offload returnend subroutine

Compiler command Intel Fortran

$ ifort -vec-report=3 -openmp-report -openmp -shared -fPIC mmult_mic.f90 -o mmult_mic.so

mmult_omp.f90(7): (col. 7) remark: OpenMP DEFINED LOOP WAS PARALLELIZEDmmult_omp.f90(6): (col. 7) remark: OpenMP DEFINED REGION WAS PARALLELIZEDmmult_omp.f90(10): (col. 20) remark: LOOP WAS VECTORIZEDmmult_omp.f90(9): (col. 3) remark: loop was not vectorized: not inner loopmmult_omp.f90(8): (col. 1) remark: loop was not vectorized: not inner loopmmult_omp.f90(7): (col. 7) remark: * MIC* OpenMP DEFINED LOOP WAS PARALLELIZEDmmult_omp.f90(6): (col. 7) remark: * MIC* OpenMP DEFINED REGION WAS PARALLELIZEDmmult_omp.f90(10): (col. 20) remark: * MIC* LOOP WAS VECTORIZEDmmult_omp.f90(10): (col. 20) remark: * MIC* PEEL LOOP WAS VECTORIZEDmmult_omp.f90(10): (col. 20) remark: * MIC* REMAINDER LOOP WAS VECTORIZEDmmult_omp.f90(9): (col. 3) remark: * MIC* loop was not vectorized: not inner loopmmult_omp.f90(8): (col. 1) remark: * MIC* loop was not vectorized: not inner loop

mmult on MIC (offload)

$ R -f mmult_mic.R

> system.time(mmult.f(a,b,c))[Offload] [MIC 0] [File] mmult_mic.f90[Offload] [MIC 0] [Line] 5[Offload] [MIC 0] [Tag] Tag 0[Offload] [HOST] [Tag 0] [CPU Time] 173.768076(seconds)[Offload] [MIC 0] [Tag 0] [CPU->MIC Data] 2400000280 (bytes)[Offload] [MIC 0] [Tag 0] [MIC Time] 155.217991(seconds)[Offload] [MIC 0] [Tag 0] [MIC->CPU Data] 2400000016 (bytes) user system elapsed 157.034 2.844 176.542

> system.time(b%*%c)[MKL] [MIC --] [AO Function] DGEMM[MKL] [MIC --] [AO DGEMM Workdivision] 0.50 0.25 0.25[MKL] [MIC 00] [AO DGEMM CPU Time] 5.699716 seconds[MKL] [MIC 00] [AO DGEMM MIC Time] 1.142100 seconds[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1001600000 bytes[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 806400000 bytes[MKL] [MIC 01] [AO DGEMM CPU Time] 5.699716 seconds[MKL] [MIC 01] [AO DGEMM MIC Time] 1.255698 seconds[MKL] [MIC 01] [AO DGEMM CPU->MIC Data] 1001600000 bytes[MKL] [MIC 01] [AO DGEMM MIC->CPU Data] 806400000 bytes user system elapsed 42.414 6.492 6.326

mmult on host (16c) vs MIC

$ R -f mmult_mic.R

* Fortran Version HOST> system.time(mmult.f(a,b,c)) user system elapsed 1297.197 0.576 104.143 38 GFlop/s

* MKL Version HOST> system.time(b%*%c) user system elapsed 93.022 0.248 8.955 450 GFlop/s

compare:* Fortran Version MIC offload> system.time(mmult.f(a,b,c)) user system elapsed 157.034 2.844 176.542 22 GFlop/s

* MKL Version MIC auto-offload> system.time(b%*%c) user system elapsed 9.421 0.948 13.046 300 GFlop/s

optimal: HOST+MIC: 670 GFlop/s

Scripting Parallel Execution

implicit

R

explicite

jit pnmath doSNOWdoMPIdoMC doRedis

hierarchical parallelisation: - accelerator: rgpu, pnmath, MKL - intra-node: jit, doMC, MKL - intra-cluster: SNOW, MPI, pbdMPI - inter-cluster: Redis, SNOW

MKLrgpu

foreach package

# new R foreach

library(foreach)

alist <- foreach (i=1:N) %do% call(i)

foreach is a function

# old R code

alist=list()

for(i in 1:N) alist[i]<-call(i)

for is a language keyword

multithreading with R

library(foreach)

foreach(i=1:N) %do% { mmult.f() }

# serial execution

library(foreach)library(doMC)registerDoMC()

foreach(i=1:N) %dopar% { mmult.f() }

# thread execution

MPI with R

library(foreach)

foreach(i=1:N) %do% { mmult.f() }

# serial execution

library(foreach)library(doSNOW)registerDoSNOW()

foreach(i=1:N) %dopar% { mmult.f() }

# MPI execution

doSNOW

# R

> library(doSNOW)

> cl <- makeSOCKcluster(4)

> registerDoSNOW(cl)

> system.time(foreach(i=1:10) %do% sum(runif(10000000)))

user system elapsed

15.377 0.928 16.303

> system.time(foreach(i=1:10) %dopar% sum(runif(10000000)))

user system elapsed

4.864 0.000 4.865

doMC

# R

> library(doMC)

> registerDoMC(cores=4)


user system elapsed

9.352 2.652 12.002


user system elapsed

7.228 7.216 3.296

MPI-CUDA with R

Using doSNOW and dyn.load with pgifortran:

library(doSNOW)cl=makeCluster(c("gvs1","gvs2"),type="SOCK")registerDoSNOW(cl)

foreach(i=1:2) %dopar% setwd("~/KURSE/R_cuda")foreach(i=1:2) %dopar% dyn.load("mysub_cuda.so")

system.time(foreach(i=1:4) %dopar% str(.Fortran("mysub_cuda",x=numeric(np),v=numeric(np), nstep=as.integer(1000))))

noSQL databases

Redis is an open source, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets.

http://www.redis.io

Clients are available for C, C++, C#, Objective-C, Clojure, Common Lisp, Erlang, Go, Haskell, Io, Lua, Perl, Python, PHP, R ruby, scala, smalltalk, tcl

doRedis / workers

start redis worker:

> echo "require('doRedis');redisWorker('jobs')" | R

The workers can be distributed over the internet

> startRedisWorkers(100)

doRedis

# R

> library(doRedis)

> registerDoRedis("jobs")


user system elapsed

15.377 0.928 16.303


user system elapsed

4.864 0.000 4.865

Disk

Big Memory

R R

MEM MEM

Logical Setup of Node without shared memory

R R

MEM

Logical Setup of Node with shared memory

DiskDisk

R R

MEM

Logical Setup of Node with file-backed memory

R R

MEM

Logical Setup of Node with network attached file-

backed memory

Network Network Network

library(bigmemory)

● shared memory regions for several processes in SMP

● file backed arrays for several node over network file systems

library(bigmemory)

x <- as.big.matrix(matrix(runif(1000000), 1000, 1000)))

sum(x[1,1:1000])

tl;dr

● the future is massively parallel● consider scripting for rapid prototypes● think hybrid (vector+openmp+mpi+acc)● try to use high level abstractions● you must use all features of modern cpus/gpus to

get fast and scalable code"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." (Donald Knuth)

The End

Questions?

gpu and mic programming (using python, r and … and mic programming (using python, r and matlab) *...

Documents