
FEM Integration with Quadrature on the GPU

Matthew Knepley

Computation Institute, University of Chicago
Department of Molecular Biology and Physiology, Rush University Medical Center

GPU-SMP 2012, Shenzhen, China, June 1–4, 2012


Collaborators

Andy R. Terrel
Andreas Klöckner
Jed Brown
Robert Kirby


Outline

1 Why Scientific Libraries?
  What is PETSc?

2 Linear Systems are Easy

3 Finite Element Integration

4 Future Direction


Main Point

To be widely accepted,
GPU computing must be transparent to the user,
and reuse existing infrastructure.


Lessons from Clusters and MPPs

Failure
  Parallelizing compilers
  Automatic program decomposition

Success
  MPI (library approach)
  PETSc (parallel linear algebra)
  User provides only the mathematical description


Outline

1 Why Scientific Libraries?
  What is PETSc?


What is PETSc?

A freely available and supported research code
  Download from http://www.mcs.anl.gov/petsc
  Free for everyone, including industrial users
  Hyperlinked manual, examples, and manual pages for all routines
  Hundreds of tutorial-style examples
  Support via email: [email protected]
  Usable from C, C++, Fortran 77/90, and Python


What is PETSc?

Portable to any parallel system supporting MPI, including:
  Tightly coupled systems: Cray XT5, BG/Q, NVIDIA Fermi, Earth Simulator
  Loosely coupled systems, such as networks of workstations: IBM, Mac, iPad/iPhone, PCs running Linux or Windows

PETSc History
  Begun September 1991
  Over 60,000 downloads since 1995 (version 2)
  Currently 400 per month

PETSc Funding and Support
  Department of Energy: SciDAC, MICS Program, AMR Program, INL Reactor Program
  National Science Foundation: CIG, CISE, Multidisciplinary Challenge Program


The PETSc Team

Bill Gropp, Barry Smith, Satish Balay
Jed Brown, Matt Knepley, Lisandro Dalcin
Hong Zhang, Victor Eijkhout, Dmitry Karpeev


Who Uses PETSc?

Computational Scientists

Earth Science
  PyLith (CIG)
  Underworld (Monash)
  Magma Dynamics (LDEO, Columbia)

Subsurface Flow and Porous Media
  STOMP (DOE)
  PFLOTRAN (DOE)


Who Uses PETSc?

Computational Scientists

CFD
  Fluidity
  OpenFOAM
  freeCFD
  OpenFVM

MicroMagnetics
  MagPar

Fusion
  NIMROD


Who Uses PETSc?

Algorithm Developers

Iterative methods
  Deflated GMRES
  LGMRES
  QCG
  SpecEst

Preconditioning researchers
  Prometheus (Adams)
  ParPre (Eijkhout)
  FETI-DP (Klawonn and Rheinbach)


Who Uses PETSc?

Algorithm Developers

Finite Elements
  PETSc-FEM
  libMesh
  deal.II
  OOFEM

Other Solvers
  Fast Multipole Method (PetFMM)
  Radial Basis Function Interpolation (PetRBF)
  Eigensolvers (SLEPc)
  Optimization (TAO)


What Can We Handle?

PETSc has run implicit problems with over 500 billion unknowns
  UNIC on BG/P and XT5
  PFLOTRAN for flow in porous media

PETSc has run efficiently on over 290,000 cores
  UNIC on the IBM BG/P Intrepid at ANL
  PFLOTRAN on the Cray XT5 Jaguar at ORNL

PETSc applications have run at 22 Teraflops
  Kaushik on XT5
  LANL PFLOTRAN code


Interface Questions

How should the user interact with manycore systems?
  Through computational libraries

How should the user interact with the library?
  Strong, data structure-neutral API (Smith and Gropp, 1996)

How should the library interact with manycore systems?
  Existing library APIs
  Code generation (CUDA, OpenCL, PyCUDA)
  Custom multi-language extensions


Outline

1 Why Scientific Libraries?

2 Linear Systems are Easy

3 Finite Element Integration

4 Future Direction


Interface Maturity

Some parts of PDE computation are less mature.

Linear Algebra
  One universal interface: BLAS, PETSc, Trilinos, FLAME, Elemental
  Entire problem can be phrased in the interface: Ax = b
  Standalone component

Finite Elements
  Many interfaces: FEniCS, FreeFEM++, DUNE, deal.II, Fluent
  Problem definition requires general code: physics, boundary conditions
  Crucial interaction with other simulation components: discretization, mesh/geometry


PETSc-GPU

PETSc now has support for Krylov solves on the GPU

Configure with -with-cuda=1 -with-cusp=1 -with-thrust=1
  (and possibly -with-precision=single)

New classes VECCUDA and MATAIJCUDA
  Just change the type on the command line: -vec_type veccuda (see the sketch below)

Uses the Thrust and Cusp libraries from NVIDIA
Does not communicate vectors during the solve
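
A minimal sketch (not from the slides) of how this transparency looks in user code: the Vec type is resolved at runtime by VecSetFromOptions(), so passing -vec_type veccuda moves the vector operations to the GPU with no source changes.

/* Sketch: a runtime-typed vector; CPU or GPU is chosen by -vec_type */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec x;

  PetscInitialize(&argc, &argv, NULL, NULL);
  VecCreate(PETSC_COMM_WORLD, &x);
  VecSetSizes(x, PETSC_DECIDE, 1000000);
  VecSetFromOptions(x);   /* honors -vec_type veccuda from the command line */
  VecSet(x, 1.0);         /* executes on the device for a GPU vector type */
  VecDestroy(&x);
  PetscFinalize();
  return 0;
}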


Example: Driven Cavity Velocity-Vorticity with Multigrid

ex50 -da_vec_type seqcusp -da_mat_type aijcusp -mat_no_inode  # Setup types
     -da_grid_x 100 -da_grid_y 100                            # Set grid size
     -pc_type none -pc_mg_levels 1                            # Setup solver
     -preload off -cuda_synchronize                           # Setup run
     -log_summary


Example: PFLOTRAN

Flow Solver, 32 × 32 × 32 grid

        Routine    Time (s)   MFlops   MFlops/s
  CPU   KSPSolve     8.3167     4370        526
        MatMult      1.5031      769        512
  GPU   KSPSolve     1.6382     4500       2745
        MatMult      0.3554      830       2337

P. Lichtner, G. Hammond, R. Mills, B. Phillip


Outline

1 Why Scientific Libraries?

2 Linear Systems are Easy

3 Finite Element Integration

4 Future Direction


Form Decomposition

Element integrals are decomposed into analytic and geometric parts:

\int_T \nabla\phi_i(x) \cdot \nabla\phi_j(x)\,dx                                                    (1)
  = \int_T \frac{\partial\phi_i(x)}{\partial x_\alpha} \frac{\partial\phi_j(x)}{\partial x_\alpha}\,dx    (2)
  = \int_{T_{ref}} \frac{\partial\xi_\beta}{\partial x_\alpha} \frac{\partial\phi_i(\xi)}{\partial\xi_\beta} \frac{\partial\xi_\gamma}{\partial x_\alpha} \frac{\partial\phi_j(\xi)}{\partial\xi_\gamma} |J|\,d\xi    (3)
  = \frac{\partial\xi_\beta}{\partial x_\alpha} \frac{\partial\xi_\gamma}{\partial x_\alpha} |J| \int_{T_{ref}} \frac{\partial\phi_i(\xi)}{\partial\xi_\beta} \frac{\partial\phi_j(\xi)}{\partial\xi_\gamma}\,d\xi    (4)
  = G^{\beta\gamma}(T)\, K^{ij}_{\beta\gamma}                                                       (5)

Coefficients are also put into the geometric part.
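
A minimal sketch (hypothetical names; P1 triangles assumed) of what the decomposition buys: the element matrix is a small contraction of the precomputed reference tensor K with a cheap per-element geometric tensor G.

/* Sketch: element matrix E[i][j] = G[b][g] * K[i][j][b][g].
   K is computed once on the reference element; G varies per element. */
#define NB  3   /* P1 basis functions on a triangle (assumption) */
#define DIM 2

void elementMatrix(const double G[DIM][DIM],         /* geometric part, includes |J| */
                   const double K[NB][NB][DIM][DIM], /* analytic reference tensor */
                   double       E[NB][NB])
{
  for (int i = 0; i < NB; ++i)
    for (int j = 0; j < NB; ++j) {
      double sum = 0.0;
      for (int b = 0; b < DIM; ++b)
        for (int g = 0; g < DIM; ++g)
          sum += G[b][g] * K[i][j][b][g];
      E[i][j] = sum;
    }
}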


Tensor Product Formulation

FEniCS-based code achieves
  90 GF/s on the 3D P1 Laplacian
  100 GF/s on 2D P1 elasticity

Relies on analytic integration
Dot products are the workhorse
Crossover point with quadrature when there are multiple fields

Finite Element Integration on GPUs, ACM TOMS, Andy R. Terrel and Matthew G. Knepley


Why Quadrature?

Quadrature can handle

many fields (linearization)

non-affine elements (Argyris)

non-affine mappings (isoparametric)

functions not in the FEM space

Optimizations for Quadrature Representations of Finite Element Tensors through Automated Code Generation, ACM TOMS, Kristian B. Ølgaard and Garth N. Wells


Jed Brown’s Model

We consider weak forms dependent only on fields and gradients,

  \int_\Omega \phi \cdot f_0(u, \nabla u) + \nabla\phi : f_1(u, \nabla u) = 0.    (6)

Discretizing, we have

  \sum_e E_e^T \left[ B^T W^q f_0(u^q, \nabla u^q) + \sum_k D_k^T W^q f_1^k(u^q, \nabla u^q) \right] = 0    (7)

  f_n    pointwise physics functions
  u^q    field at a quadrature point
  W^q    diagonal matrix of quadrature weights
  B, D   basis function matrices which reduce over quadrature points
  E      assembly operator
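
A minimal CPU sketch (hypothetical names; a scalar field in 2D) of evaluating one element's contribution to (7): tabulated basis matrices reduce the weighted quadrature-point values back to basis coefficients, and only f0 and f1 change with the physics.

/* Sketch: one element's residual r = B^T (W f0) + sum_d D_d^T (W f1_d).
   B[q][i] tabulates basis function i at quadrature point q;
   D[d][q][i] tabulates its d-th derivative. */
#define NQ 3   /* quadrature points (assumption) */
#define NB 3   /* basis functions   (assumption) */

void elementResidual(const double B[NQ][NB], const double D[2][NQ][NB],
                     const double w[NQ],     /* quadrature weights times |J| */
                     const double u[NQ], const double gradU[NQ][2],
                     double r[NB])
{
  for (int i = 0; i < NB; ++i) r[i] = 0.0;
  for (int q = 0; q < NQ; ++q) {
    double f0    = 0.0;                          /* pointwise physics; 0 for the Laplacian, k*k*u[q] for Helmholtz */
    double f1[2] = { gradU[q][0], gradU[q][1] }; /* the Laplacian: f1 = grad u */
    for (int i = 0; i < NB; ++i) {
      r[i] += B[q][i] * w[q] * f0;
      for (int d = 0; d < 2; ++d)
        r[i] += D[d][q][i] * w[q] * f1[d];
    }
  }
}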


Physics code

∇φi · ∇u


__device__ vecType f1(realType u[], vecType gradU[], int comp) {
  return gradU[comp];   /* f1 = grad u for the Laplacian */
}


∇φi · (∇u + ∇uᵀ)

__device__ vecType f1(realType u[], vecType gradU[], int comp) {
  vecType f1;

  switch (comp) {
  case 0:
    f1.x = 0.5*(gradU[0].x + gradU[0].x);
    f1.y = 0.5*(gradU[0].y + gradU[1].x);
    break;
  case 1:
    f1.x = 0.5*(gradU[1].x + gradU[0].y);
    f1.y = 0.5*(gradU[1].y + gradU[1].y);
  }
  return f1;
}


∇φi · ∇u + φi k²u

__device__ vecType f1(realType u[], vecType gradU[], int comp) {
  return gradU[comp];
}

__device__ realType f0(realType u[], vecType gradU[], int comp) {
  return k*k*u[0];   /* reaction term */
}


∇φi · ∇u − (∇ · φ)p

void f1(PetscScalar u[], const PetscScalar gradU[], PetscScalar f1[])
{
  const PetscInt dim   = SPATIAL_DIM_0;
  const PetscInt Ncomp = NUM_BASIS_COMPONENTS_0;
  PetscInt       comp, d;

  for (comp = 0; comp < Ncomp; ++comp) {
    for (d = 0; d < dim; ++d) {
      f1[comp*dim+d] = gradU[comp*dim+d];
    }
    f1[comp*dim+comp] -= u[Ncomp];   /* subtract pressure on the diagonal */
  }
}


∇φi · ν0 e^{−βT} ∇u − (∇ · φ)p

void f1(PetscScalar u[], const PetscScalar gradU[], PetscScalar f1[])
{
  const PetscInt dim   = SPATIAL_DIM_0;
  const PetscInt Ncomp = NUM_BASIS_COMPONENTS_0;
  PetscInt       comp, d;

  for (comp = 0; comp < Ncomp; ++comp) {
    for (d = 0; d < dim; ++d) {
      /* temperature-dependent viscosity */
      f1[comp*dim+d] = nu_0*exp(-beta*u[Ncomp+1])*gradU[comp*dim+d];
    }
    f1[comp*dim+comp] -= u[Ncomp];   /* subtract pressure on the diagonal */
  }
}


Why Not Quadrature?

Vectorization is a Problem

Strategy                                      Problem
Vectorize over quad points                    Reduction needed to compute basis coefficients
Vectorize over basis coef for each quad pt    Too many passes through global memory
Vectorize over basis coef and quad points     Some threads idle when the sizes differ


Thread Transposition

[Figure: threads t0–t5 evaluate basis functions and process values at the
quadrature points, then remap so the same threads gather the quadrature-point
values into basis coefficients, and continue with the kernel.]
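
A minimal CUDA sketch (hypothetical sizes and names, not the paper's kernel) of the transposition: threads stage their quadrature-phase results in shared memory, synchronize, and re-index so that each thread then owns one basis coefficient.

/* Sketch: phase 1 parallelizes over quadrature points, phase 2 re-reads the
   staged values parallelized over basis coefficients. Launch with
   blockDim.x >= max(NQ, NB) and one block per cell. */
#define NQ 3   /* quadrature points per cell (assumption) */
#define NB 3   /* basis functions per cell   (assumption) */

__global__ void integrateCells(const float *uq,  /* field at quad points    */
                               const float *B,   /* basis tabulation        */
                               const float *w,   /* quad weights times |J|  */
                               float *relem)     /* element residuals       */
{
  __shared__ float fq[NQ];          /* staged quadrature-phase values */
  const int cell = blockIdx.x;
  const int tid  = threadIdx.x;

  /* Quadrature phase: one thread per quadrature point */
  if (tid < NQ) fq[tid] = w[tid] * uq[cell*NQ + tid];  /* pointwise physics (identity here) */
  __syncthreads();                  /* the transposition point */

  /* Basis phase: one thread per basis coefficient */
  if (tid < NB) {
    float r = 0.0f;
    for (int q = 0; q < NQ; ++q) r += B[q*NB + tid] * fq[q];
    relem[cell*NB + tid] = r;
  }
}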

[Figure: thread layouts for the basis phase and quadrature phase, with
Nt = 24, Nbc = 12, Nbs = 6, Nsbc = 3, Nsqc = 2, and Nbl = 2.]


PETSc FEM Organization

GPU evaluation is transparent to the user:

User Input     Automation              Solver Input
domain      == Triangle/TetGen     ==> Mesh
element     == FIAT                ==> Tabulation
fn          == Generic Evaluation  ==> Residual

Loops are done in batches
Remainder cells are integrated on the CPU
PETSc ex52 is a single-field example


PETSc Multiphysics

Each block of the Jacobian is evaluated separately:
  Reuse single-field code
  Vectorize over cells, rather than fields
  Retain sparsity of the Jacobian

Solver integration is seamless:
  Nested block preconditioners from the command line (see the sketch below)
  Segregated KKT MG smoothers from the command line
  Fully composable with AMG, LU, Schur complement, etc.

PETSc ex62 solves the Stokes problem, and ex31 adds temperature.
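
A hypothetical command line (PCFieldSplit options with the default split prefixes; not from the slides) showing how such a nested Schur-complement solver is composed at runtime:

./ex62 -ksp_type fgmres \
       -pc_type fieldsplit -pc_fieldsplit_type schur \
       -fieldsplit_0_pc_type gamg \
       -fieldsplit_1_ksp_type minres -fieldsplit_1_pc_type none

Here split 0 is the velocity block, preconditioned with algebraic multigrid, and split 1 is the pressure Schur complement.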


Performance Expectations: Element Integration

FEM integration, at the element level, is also limited by memory bandwidth
rather than by peak flop rate.

We expect speedup proportional to the bandwidth ratio, roughly 3x–6x for most
systems (for example, about 144 GB/s on a Fermi-class GPU versus 25–50 GB/s
on a contemporary CPU socket).

Input for FEM is a vector of coefficients (auxiliary fields)

Output is a vector of coefficients for the residual


2D P1 Laplacian Performance

[Figure: performance plot. Reaches 100 GF/s by 100K elements.]

2D P1 Laplacian Performance

[Figure: performance plot. Linear scaling for both GPU and CPU integration.]

2D P1 Rate-of-Strain Performance

[Figure: performance plot. Reaches 100 GF/s by 100K elements.]

2D P1 Rate-of-Strain Performance

[Figure: performance plot. Linear scaling for both GPU and CPU integration.]

Outline

1 Why Scientific Libraries?

2 Linear Systems are Easy

3 Finite Element Integration

4 Future Direction


Competing Models

How should kernels be integrated into libraries?

CUDA + Code Generation
  Explicit vectorization
  Can inspect/optimize the generated code
  Errors easily localized
  Can use high-level reasoning for optimization (FErari)
  Kernel fusion is easy

TBB + C++ Templates
  Implicit vectorization
  Generated code is hidden
  Notoriously difficult debugging
  Low-level, compiler-style optimization
  Kernel fusion is really hard
