TRANSCRIPT
When the desktop is not enough
Cliff Addison, Steve Downing, Ian Smith,
Manhui Wang
09 April 2019
Slide 2
Why Advanced Research Computing?
• Computational modelling - the third pillar of scientific
research
• Along with theory and experimentation
• Models the expensive, difficult or dangerous
• Computational wind tunnels for F1 cars, aircraft
• Weather forecasts, explosions in buildings, the Universe
• Assists understanding of observed data
• Large Hadron Collider, earthquake data, astronomical data
• Spots promising areas for theory / experiments
Slide 3
Alternatives when a desktop isn’t enough
• Condor – Windows
  • 1,400 cores; serial high throughput; Windows only
• Chadwick – local Linux cluster
  • 1,888 cores for distributed memory parallel; 128 cores / 2 TB SMP
• Barkla – new local Linux cluster
  • 1,840 cores for distributed memory parallel (plus dedicated nodes)
• GridPP – largely for LHC data
  • Authentication requires a certificate; largely serial high throughput
• ARCHER – national system via grants
  • 118,080 cores for distributed memory parallel – world-class scale
• Hartree – national systems, largely project driven
• PRACE – world-class systems, upon application
Slide 4
What are the differences among options?
• Condor
  • Intended for lots (up to 100,000-ish per month) of serial jobs
  • Service runs on Teaching Centre PCs, so largely evenings (~12 hr)
• Linux clusters
  • Use commodity Intel x86_64 processors and a high-speed / low-latency interconnect (GridPP systems mainly Ethernet)
  • Traditional parallel systems (typically 16 cores / node, 2-4 GB memory per core – Barkla has 40 cores / node, 9.5 GB / core)
  • Support large-scale parallel and high-throughput workflows
• Accelerator systems
  • Use Nvidia / AMD GPUs or FPGAs to boost performance
GridPP Status
~10% of WLCG; 18 sites hosting hardware.
Tier-1 at RAL and 4 distributed Tier-2 centres: ScotGrid, NorthGrid, London, SouthGrid.
Network Connectivity
RAL Tier-1:
• 20 + 10 Gb/s active/active LHCOPN connection
• 2 x 40 Gb/s active/standby UK core network
Tier-2s:
• Typically 10-40 Gb/s dedicated link to UK core network
GridPP not just for HEP / Higgs
A 2013 analysis suggests that many genes evolved in parallel in bats and dolphins as each developed the remarkable ability to echolocate.
• GridPP took roughly a month to perform analyses that would have taken years on a desktop computer.
Slide 7
Condor @ Liverpool
• Windows-based service to leverage idle PCs
• Desktop Condor simplifies submission
• Supports running of over 100,000 jobs / month
• Jobs typically:
  • Only require 8 hours to complete
  • Only need 1 GB of memory
  • Are stand-alone Windows executables that do not need graphics
  • MATLAB and R most common
• If a job is interrupted during a run, Condor resubmits the job automatically.
• Help is available to automate workflows, particularly with large studies
Condor contribution
"The industrial melanism mutation in British peppered moths is a transposable element"
Arjen E. van’t Hof, Pascal Campagne, Daniel J. Rigden, Carl J. Yung, Jessica Lingley, Michael A. Quail, Neil Hall, Alistair C. Darby & Ilik J. Saccheri
Nature, Vol. 534, 2 June 2016
Slide 9
Chadwick
• Entered production summer 2013
• Picked in 2012 by a cross-university evaluation panel
  • Intel Sandy Bridge nodes for regular job mix
  • Two Visualisation nodes for remote visualisation and GPU experimentation
  • One Shared Memory (SMP) node with 2 TB memory and 128 cores
  • About 180 TB of local disk storage (TB = 10^12 bytes)
• 28.4 million-million floating point operations per second achievable (90% of peak)
  • Tera is 10^12, so talk of TeraFlops or TF; also have PetaFlops – PF
• Combination of Sandy Bridge and InfiniBand popular
  • Newer systems have faster InfiniBand and processors with more cores.
Slide 10
Barkla
• Entered production January 2018
• Picked in 2017 by a cross-university evaluation panel
  • Intel Skylake nodes for regular job mix; 100 Gb/s OmniPath interconnect
  • Two Visualisation nodes for remote visualisation and two nodes with 4x GPUs for deep learning experimentation
  • Two Shared Memory (SMP) nodes with 1.1 TB memory and 40 cores
  • About 500 TB of local disk storage
• 160 TeraFlops (65% of peak)
• Combination of Skylake and OmniPath is state of the art outside roughly the top 20 world systems.
Slide 11
ARCHER
• EPSRC and NERC research council support
  • STFC have their own substantial facilities via DiRAC
• Grants are provided with a certain parcel of allocation units to be used over a particular time scale.
• ARCHER is a Cray XC30 – about 1.5 PF achievable
  • Based around Ivy Bridge, 24 cores / node; 4,920 nodes
  • No GPU or accelerator nodes – maybe in the child of ARCHER
  • 4.4 PB of storage with a fast read-write bandwidth
• Vital large-scale resource for many Liverpool projects
• ARCHER2 planned Q2 2019 – 15 PF target
Liverpool / N8 HPC / ARCHER contribution
Molecular trick alters rules of attraction for non-magnetic metals
Fatma Al Ma’Mari et al., "Beating the Stoner criterion using molecular interfaces", Nature 524 (2015) 69-73
http://www.nature.com/nature/journal/v524/n7563/full/nature14621.html
Dr Gilberto Teobaldi from the University of Liverpool provided theoretical insight from atomistic modelling to this project.
Access to the recently upgraded University of Liverpool, N8 HPC, ARCHER UK (via the UKCP consortium) and HPC Wales High Performance Computing facilities was essential to the theoretical and computational work of this research.
Slide 13
Hartree (Daresbury)
• Main compute component
  • Bull Sequana, 4 PF – upgraded service in 2018
• Accelerated and energy-efficient platforms:
  • Nvidia GPU – largely the JADE national system for Deep Learning
• IBM Cognitive computing and data analytics
  • Nvidia GPU + IBM POWER8
• Some cycles are available on a pay-per-use basis (HPC as a service)
• Most applications are dependent on having a UK industrial partner.
But how do I get off the desktop???
Expect to move out of
your comfort zone.
Dorothy, Wizard of Oz, 1939
Slide 15
Consider several scenarios
• Need to run a simulation-based statistical model to validate a hypothesis (needing 100,000+ runs)
  • Melanism moth mutation
• Conducting a parameter sweep over input data to identify the best combination of components (Monte Carlo)
  • Ensemble forecasts – perturb initial conditions, look for “likely” weather patterns – AKA uncertainty quantification
• Optimisation where the objective function is very expensive to evaluate
  • 3D gait modelling
  • Many model states to assess (e.g. wind angles)
Slide 16
Personalised Medicine example
• Project is an example of a Genome-Wide Association Study
  • Aims to identify genetic predictors of response to anti-epileptic drugs
  • Tries to identify regions of the human genome that differ between individuals (referred to as Single Nucleotide Polymorphisms, or SNPs)
• 800 patients genotyped at 500,000 SNPs along the entire genome
• Statistically test the association between SNPs and outcomes (e.g. time to withdrawal of a drug due to adverse effects)
• Large data-parallel problem using R – ideal for Condor
  • Divide datasets into small partitions so that individual jobs run for 15-30 minutes (a sketch of this partitioning follows below)
  • A batch of 26 chromosomes (2,600 jobs) required ~5 hours wallclock time on Condor but ~5 weeks on a single PC
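As a rough sketch of that partitioning step (the file name, format, chunk size and function names here are invented for illustration, not the project's actual pipeline), a small Python script can split one large SNP table into many small inputs, one per independent Condor job:

import csv

CHUNK_SIZE = 5000  # SNP rows per job – hypothetically sized so each job runs 15-30 minutes

def write_chunk(header, rows, name):
    with open(name, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

def split_snps(infile="snps.csv", prefix="chunk"):
    """Split one large SNP table into many small files, one per job."""
    with open(infile, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, count = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == CHUNK_SIZE:
                write_chunk(header, chunk, f"{prefix}_{count:04d}.csv")
                chunk, count = [], count + 1
        if chunk:  # remaining rows
            write_chunk(header, chunk, f"{prefix}_{count:04d}.csv")

if __name__ == "__main__":
    split_snps()

Each chunk file then becomes the input to one independent job, and the per-chunk result files are merged afterwards.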
Slide 17
Monte Carlo - related
• Covers a wide range of topics
  • Sometimes looking for a needle in a haystack
  • Often want to build a picture of an underlying probability distribution
• Choices of compute solution driven by
  • Size of input / output data
  • Length of time for a run / job
  • Complexity of each job: serial, multi-threaded, or distributed memory parallel?
  • How results are pulled together for final analysis
  • Is the framework you are using up to the task?
Complicated optimisation problems
• Bipedal gait example
  • Institute of Ageing and Chronic Disease
• Want an accurate gait model from footprints and skeleton – non-intrusive
• Seek the most efficient (lowest energy) way of moving
• Given a configuration (muscle tensions, bone positions etc.), what is its energy demand?
Slide 19
CFD – Direct Numerical Simulations
• Mach 3, Re 16,800 flow over an aircraft flap
• TriGlobal eigenmode of a 3D wing
Slide 20
CFD – supersonic flow over an aircraft flap
Slide 21
Using nature-based coastal flood defences
FVCOM Modelling
Funding via NOC – UK, EPSRC – UK, NWO – Netherlands, NSFC – China
Slide 22
Still more CFD
• Modelling of sand transport under irregular and breaking waves.
[Figures: OpenFOAM – waves on a beach; two-phase model]
Slide 23
Parallelism – the big logical stumbling block!
• Three basic levels to deal with:
  1) Lots of similar, independent jobs, each job uses 1 core (high-throughput)
  2) Jobs can use multiple cores on the same computer (multi-threaded, AKA shared memory or SMP jobs)
  3) Jobs can use multiple cores on different computers – communication via special libraries (distributed memory jobs, AKA MPI jobs as MPI is the main library used)
• Can have any mixture of the above (a minimal sketch of level 2 follows this list).
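To make level 2 concrete, here is a minimal Python sketch of a multi-core, single-machine job using the standard multiprocessing module; the work function, core count and task count are invented for illustration:

from multiprocessing import Pool
import random

def one_run(seed):
    # Stand-in for one independent piece of work (e.g. one model evaluation)
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100000))

if __name__ == "__main__":
    # Level 2: several cores on the same computer share the work
    with Pool(processes=4) as pool:
        results = pool.map(one_run, range(16))  # 16 independent tasks
    print(sum(results) / len(results))

Level 3 replaces the in-node Pool with message passing between separate computers, as in the MPI example later in the talk.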
Parallel Monte Carlo – Estimating 𝜋 (badly)
Generate a large number N of random pairs (x_i, y_i) in [0,1].
Count the cases M when x_i^2 + y_i^2 ≤ 1.
Then M/N ≈ π/4.
Idea – generate the N pairs in parallel.
Careful of random seeds and the global sum for M.
[Figure: circle inscribed in the square with corners (±1, ±1)]
program monte
  ! Each MPI rank draws samplesize points in the unit square and counts
  ! those inside the quarter circle; rank 0 averages the per-rank estimates.
  include 'mpif.h'
  double precision x, y, pi, pisum
  integer*4 ierr, rank, np, i
  integer*4 incircle, samplesize
  integer*4 Rand_size
  integer*4, allocatable :: Rand_seed(:)
  parameter(samplesize=20000000)

  call random_seed(SIZE=Rand_size)
  allocate (Rand_seed(Rand_size))

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

  ! Give every rank a different seed so the random streams are independent
  call random_seed(get=Rand_seed)
  Rand_seed = (rank+1)*Rand_seed
  call random_seed(put=Rand_seed)

  incircle = 0
  do i = 1, samplesize
    call random_number(x)
    call random_number(y)
    if ((x*x + y*y) .lt. 1.0d0) then
      incircle = incircle + 1      ! point is in the circle
    endif
  end do

  ! Local estimate of pi on this rank
  pi = 4.0d0 * DBLE(incircle) / DBLE(samplesize)

  ! Sum the per-rank estimates onto rank 0
  call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank .eq. 0) then
    pi = pisum / DBLE(np)          ! average over all ranks
    print '(A,I4,A,F12.8,A)', 'Monte-Carlo estimate of pi by ', np, &
          ' processes is ', pi, '.'
  endif

  call MPI_FINALIZE(ierr)
end
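With a typical MPI installation this would be compiled with the Fortran MPI wrapper and launched over several processes, for example something like mpif90 monte.f90 -o monte followed by mpirun -np 8 ./monte; the exact wrapper and launcher names, and whether you go through a batch script, depend on the local cluster setup.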
A different estimation of 𝜋 (not so badly)
π = ∫_0^1 4/(1+x^2) dx
But also
π = Σ_{i=1}^{N} ∫_{(i-1)/N}^{i/N} 4/(1+x^2) dx, where N can be large.
Note: Each integral can be approximated independently, then all the partial sums are added to form our π estimate.
http://uk.mathworks.com/help/distcomp/examples/numerical-estimation-of-pi-using-message-passing.html
Slide 27
Practical comment on computing 𝜋
• Good example of using local vector syntax in a MATLAB or Python solution – approximate each sub-integral with the mid-point rule (a NumPy version is sketched after the MATLAB snippet):
n = fscanf(fileid,'%i');            % number of sub-intervals, read from an already opened input file
h = 1/n;                            % width of each sub-interval
x = h*([1:n]-0.5);                  % mid-point of each sub-interval
vecsum = sum(4.0 ./ (1.0 + x.*x));  % sum of the integrand at the mid-points
piapprox = h*vecsum;                % mid-point rule estimate of pi
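The same mid-point computation sketched in Python / NumPy (here n is simply hard-coded rather than read from a file):

import numpy as np

n = 1000000                           # number of sub-intervals
h = 1.0 / n                           # width of each sub-interval
x = h * (np.arange(1, n + 1) - 0.5)   # mid-point of each sub-interval
piapprox = h * np.sum(4.0 / (1.0 + x * x))
print(piapprox)                       # ~3.14159265...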
Slide 28
Final comment on parallelism
• Look for independent blocks of computation, but then need to stitch the results of those blocks together.
• Also must be certain that preconditions support this parallel operation (e.g. different random seeds for each Monte Carlo process – see the sketch after this list).
• Generally, divide and conquer is a useful paradigm.
• Parallel computation always has overheads.
• Sometimes parallel computation is only time-effective if overheads are sufficiently small (e.g. parallel FFT).
• Must assert correctness of the algorithm!!
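One concrete way to honour that "different seeds" precondition, sketched with NumPy (the master seed and worker count are arbitrary choices for the example):

import numpy as np

n_workers = 8
# Spawn statistically independent child seeds from one master seed so
# no two Monte Carlo workers reuse the same random stream.
children = np.random.SeedSequence(12345).spawn(n_workers)
rngs = [np.random.default_rng(s) for s in children]

# Worker i then draws all of its samples from rngs[i]
print(rngs[0].random(3))
print(rngs[1].random(3))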
Slide 29
Hints on using any batch system
• Understand what you need to do:
  • What sort of problem am I trying to address?
  • How much input / output data am I going to need / generate?
  • Order of magnitude on the number of runs needed?
  • Idea of how long each run takes?
  • What do I want to get back from a run that I can analyse / process?
  • What are the weaknesses in my workflow for dealing with lots of runs / bigger data volumes etc.?
  • How can I automate things to fit into a batch environment?
• The batch environment is shared with others!!!
• Check – some Linux / command line skills are likely needed.
Slide 30
Hints continued…
• Spend some time understanding the new environment. (Lots of material about Linux / UNIX is available online.)
• Ask for example scripts with walk-throughs.
  • Particularly examples that match what you need!
• Plan to start learning by copying.
  • Editing a working script to match your particular job, data etc. is a good way to get going.
• Large number of jobs to submit?
  • Try to automate the process – often tools / scripts are there already (a small sketch of generating per-job inputs follows this list)
  • Array jobs
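As one small illustration of that kind of automation (the parameters and file names are invented for the example), a short Python script can write one parameter file per task, which an array job then picks up by its task index:

import itertools

# Hypothetical sweep: every (angle, speed) combination becomes one array task
angles = [0, 15, 30, 45]
speeds = [1.0, 2.0, 5.0]

task_id = 0
for angle, speed in itertools.product(angles, speeds):
    task_id += 1
    with open(f"params_{task_id}.txt", "w") as f:
        f.write(f"angle={angle}\nspeed={speed}\n")

print(f"Wrote {task_id} parameter files; submit an array job with tasks 1-{task_id}")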