TRANSCRIPT
When the desktop is not enough
Cliff Addison, Steve Downing, Ian Smith,
Manhui Wang
09 April 2019
Slide 2
Why Advanced Research Computing?
• Computational modelling - the third pillar of scientific
research
• Along with theory and experimentation
• Models the expensive, difficult or dangerous
• Computational wind tunnels for F1 cars, aircraft
• Weather forecasts, explosions in buildings, the Universe
• Assists understanding of observed data
• Large Hadron Collider, earthquake data, astronomical data
• Spots promising areas for theory / experiments
Slide 3
Alternatives when a desktop isn’t enough
• Condor – Windows
  • 1,400 cores; serial high throughput; Windows only
• Chadwick – local Linux cluster
  • 1,888 cores for distributed memory parallel; 128 cores / 2 TB SMP
• Barkla – new local Linux cluster
  • 1,840 cores for distributed memory parallel (plus dedicated nodes)
• GridPP – largely for LHC data
  • Authentication requires a certificate; largely serial high throughput
• ARCHER – national system via grants
  • 118,080 cores for distributed memory parallel – world-class scale
• Hartree – national systems, largely project driven
• PRACE – world-class systems, upon application
Slide 4
What are the differences among options?
• Condor
  • Intended for lots (up to 100,000-ish per month) of serial jobs
  • Service runs on Teaching Centre PCs, so largely evenings (~12 hr)
• Linux clusters
  • Use commodity Intel x86_64 processors and a high-speed / low-latency interconnect (GridPP systems mainly Ethernet)
  • Traditional parallel systems (typically 16 cores / node, 2-4 GB memory per core – Barkla has 40 cores / node, 9.5 GB / core)
  • Support large-scale parallel and high-throughput workflows
• Accelerator systems
  • Use Nvidia / AMD GPUs or FPGAs to boost performance
GridPP Status
~10% of WLCG; 18 sites hosting hardware.
Tier-1 at RAL and 4 distributed Tier-2 centres: ScotGrid, NorthGrid, London, SouthGrid.
Network Connectivity
RAL Tier-1:
• 20 + 10 Gb/s active/active LHCOPN connection
• 2 x 40 Gb/s active/standby UK core network
Tier-2s:
• Typically 10-40 Gb/s dedicated link to UK core network
GridPP not just for HEP / Higgs
A 2013 analysis suggests that many genes evolved in parallel in bats and dolphins as each developed the remarkable ability to echolocate.
• GridPP took roughly a month to perform analyses that would have taken years on a desktop computer.
Slide 7
Condor @ Liverpool
• Windows-based service to leverage idle PCs
• Desktop Condor simplifies submission
• Supports running of over 100,000 jobs / month
• Jobs typically:
  • Only require 8 hours to complete
  • Only need 1 GB of memory
  • Are stand-alone Windows executables that do not need graphics
  • MATLAB and R most common
• If a job is interrupted during a run, Condor resubmits the job automatically.
• Help is available to automate workflows, particularly with large studies
Condor contribution
"The industrial melanism mutation in British peppered moths is a transposable element"
Arjen E. van’t Hof, Pascal Campagne, Daniel J. Rigden, Carl J. Yung, Jessica Lingley, Michael A. Quail, Neil Hall, Alistair C. Darby & Ilik J. Saccheri
Nature, Vol. 534, 2 June 2016
Slide 9
Chadwick
• Entered production summer 2013
• Picked in 2012 by a cross-university evaluation panel
  • Intel Sandy Bridge nodes for regular job mix
  • Two Visualisation nodes for remote visualisation and GPU experimentation
  • One Shared Memory (SMP) node with 2 TB memory and 128 cores
  • About 180 TB of local disk storage (TB = 10^12 bytes)
• 28.4 million-million floating point operations per second achievable (90% of peak)
  • Tera is 10^12, so talk of TeraFlops or TF; also have PetaFlops – PF
• Combination of Sandy Bridge and InfiniBand popular
  • Newer systems have faster InfiniBand and processors with more cores.
Slide 10
Barkla
• Entered production January 2018
• Picked in 2017 by a cross-university evaluation panel
  • Intel Skylake nodes for regular job mix; 100 Gb/s OmniPath interconnect
  • Two Visualisation nodes for remote visualisation and two nodes with 4x GPUs for deep learning experimentation
  • Two Shared Memory (SMP) nodes with 1.1 TB memory and 40 cores
  • About 500 TB of local disk storage
• 160 TeraFlops (65% of peak)
• Combination of Skylake and OmniPath is state of the art outside roughly the top 20 world systems.
Slide 11
ARCHER
• EPSRC and NERC research council support
  • STFC have their own substantial facilities via DiRAC
• Grants are provided with a certain parcel of allocation units to be used over a particular time scale.
• ARCHER is a Cray XC30 – about 1.5 PF achievable
  • Based around Ivy Bridge, 24 cores / node; 4,920 nodes
  • No GPU or accelerator nodes – maybe in the child of ARCHER
  • 4.4 PB of storage with a fast read-write bandwidth
• Vital large-scale resource for many Liverpool projects
• ARCHER2 planned Q2 2019 – 15 PF target
Liverpool / N8 HPC / ARCHER contribution
Molecular trick alters rules of attraction for non-magnetic metals
Fatma Al Ma’Mari et al., "Beating the Stoner criterion using molecular interfaces", Nature 524 (2015) 69-73
http://www.nature.com/nature/journal/v524/n7563/full/nature14621.html
Dr Gilberto Teobaldi from the University of Liverpool provided theoretical insight from atomistic modelling to this project.
Access to the recently upgraded University of Liverpool, N8 HPC, ARCHER UK (via the UKCP consortium) and HPC Wales High Performance Computing facilities was essential to the theoretical and computational work of this research.
Slide 13
Hartree (Daresbury)
• Main compute component
  • Bull Sequana, 4 PF – upgraded service in 2018
• Accelerated and energy-efficient platforms:
  • Nvidia GPU – largely the JADE national system for Deep Learning
• IBM Cognitive computing and data analytics
  • Nvidia GPU + IBM POWER8
• Some cycles are available on a pay-per-use basis (HPC as a service)
• Most applications are dependent on having a UK industrial partner.
But how do I get off the desktop???
Expect to move out of
your comfort zone.
Dorothy, Wizard of Oz, 1939
Slide 15
Consider several scenarios
• Need to run a simulation-based statistical model to validate a hypothesis (needing 100,000+ runs)
  • Melanism moth mutation
• Conducting a parameter sweep over input data to identify the best combination of components (Monte Carlo)
  • Ensemble forecasts – perturb initial conditions, look for “likely” weather patterns – AKA uncertainty quantification
• Optimisation where the objective function is very expensive to evaluate
  • 3D gait modelling
  • Many model states to assess (e.g. wind angles)
Slide 16
Personalised Medicine example
• Project is an example of a Genome-Wide Association Study
  • Aims to identify genetic predictors of response to anti-epileptic drugs
  • Tries to identify regions of the human genome that differ between individuals (referred to as Single Nucleotide Polymorphisms, or SNPs)
• 800 patients genotyped at 500,000 SNPs along the entire genome
• Statistically test the association between SNPs and outcomes (e.g. time to withdrawal of a drug due to adverse effects)
• Large data-parallel problem using R – ideal for Condor
  • Divide datasets into small partitions so that individual jobs run for 15-30 minutes (a sketch of this partitioning follows below)
  • A batch of 26 chromosomes (2,600 jobs) required ~5 hours wallclock time on Condor but ~5 weeks on a single PC
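As a rough sketch of that partitioning step (the file name, format, chunk size and function names here are invented for illustration, not the project's actual pipeline), a small Python script can split one large SNP table into many small inputs, one per independent Condor job:

import csv

CHUNK_SIZE = 5000  # SNP rows per job – hypothetically sized so each job runs 15-30 minutes

def write_chunk(header, rows, name):
    with open(name, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)

def split_snps(infile="snps.csv", prefix="chunk"):
    """Split one large SNP table into many small files, one per job."""
    with open(infile, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, count = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == CHUNK_SIZE:
                write_chunk(header, chunk, f"{prefix}_{count:04d}.csv")
                chunk, count = [], count + 1
        if chunk:  # remaining rows
            write_chunk(header, chunk, f"{prefix}_{count:04d}.csv")

if __name__ == "__main__":
    split_snps()

Each chunk file then becomes the input to one independent job, and the per-chunk result files are merged afterwards.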
Slide 17
Monte Carlo - related
• Covers a wide range of topics
  • Sometimes looking for a needle in a haystack
  • Often want to build a picture of an underlying probability distribution
• Choices of compute solution driven by
  • Size of input / output data
  • Length of time for a run / job
  • Complexity of each job: serial, multi-threaded, or distributed memory parallel?
  • How results are pulled together for final analysis
  • Is the framework you are using up to the task?
Complicated optimisation problems
• Bipedal gait example
  • Institute of Ageing and Chronic Disease
• Want an accurate gait model from footprints and skeleton – non-intrusive
• Seek the most efficient (lowest energy) way of moving
• Given a configuration (muscle tensions, bone positions etc.), what is its energy demand?
Slide 19
CFD – Direct Numerical Simulations
• Mach 3, Re 16,800 flow over an aircraft flap
• TriGlobal eigenmode of a 3D wing
Slide 20
CFD – supersonic flow over an aircraft flap
Slide 21
Using nature-based coastal flood defences
FVCOM Modelling
Funding via NOC – UK, EPSRC – UK, NWO – Netherlands, NSFC – China
Slide 22
Still more CFD
• Modelling of sand transport under irregular and breaking waves.
[Figures: OpenFOAM – waves on a beach; two-phase model]
Slide 23
Parallelism – the big logical stumbling block!
• Three basic levels to deal with:
  1) Lots of similar, independent jobs, each job uses 1 core (high-throughput)
  2) Jobs can use multiple cores on the same computer (multi-threaded, AKA shared memory or SMP jobs)
  3) Jobs can use multiple cores on different computers – communication via special libraries (distributed memory jobs, AKA MPI jobs as MPI is the main library used)
• Can have any mixture of the above (a minimal sketch of level 2 follows this list).
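To make level 2 concrete, here is a minimal Python sketch of a multi-core, single-machine job using the standard multiprocessing module; the work function, core count and task count are invented for illustration:

from multiprocessing import Pool
import random

def one_run(seed):
    # Stand-in for one independent piece of work (e.g. one model evaluation)
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(100000))

if __name__ == "__main__":
    # Level 2: several cores on the same computer share the work
    with Pool(processes=4) as pool:
        results = pool.map(one_run, range(16))  # 16 independent tasks
    print(sum(results) / len(results))

Level 3 replaces the in-node Pool with message passing between separate computers, as in the MPI example later in the talk.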
Parallel Monte Carlo – Estimating 𝜋 (badly)
Generate a large number N of random pairs (x_i, y_i) in [0,1].
Count the cases M when x_i^2 + y_i^2 ≤ 1.
Then M/N ≈ π/4.
Idea – generate the N pairs in parallel.
Careful of random seeds and the global sum for M.
[Figure: circle inscribed in the square with corners (±1, ±1)]
program monte
  ! Each MPI rank draws samplesize points in the unit square and counts
  ! those inside the quarter circle; rank 0 averages the per-rank estimates.
  include 'mpif.h'
  double precision x, y, pi, pisum
  integer*4 ierr, rank, np, i
  integer*4 incircle, samplesize
  integer*4 Rand_size
  integer*4, allocatable :: Rand_seed(:)
  parameter(samplesize=20000000)

  call random_seed(SIZE=Rand_size)
  allocate (Rand_seed(Rand_size))

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)

  ! Give every rank a different seed so the random streams are independent
  call random_seed(get=Rand_seed)
  Rand_seed = (rank+1)*Rand_seed
  call random_seed(put=Rand_seed)

  incircle = 0
  do i = 1, samplesize
    call random_number(x)
    call random_number(y)
    if ((x*x + y*y) .lt. 1.0d0) then
      incircle = incircle + 1      ! point is in the circle
    endif
  end do

  ! Local estimate of pi on this rank
  pi = 4.0d0 * DBLE(incircle) / DBLE(samplesize)

  ! Sum the per-rank estimates onto rank 0
  call MPI_REDUCE(pi, pisum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  if (rank .eq. 0) then
    pi = pisum / DBLE(np)          ! average over all ranks
    print '(A,I4,A,F12.8,A)', 'Monte-Carlo estimate of pi by ', np, &
          ' processes is ', pi, '.'
  endif

  call MPI_FINALIZE(ierr)
end
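With a typical MPI installation this would be compiled with the Fortran MPI wrapper and launched over several processes, for example something like mpif90 monte.f90 -o monte followed by mpirun -np 8 ./monte; the exact wrapper and launcher names, and whether you go through a batch script, depend on the local cluster setup.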
A different estimation of 𝜋 (not so badly)
π = ∫_0^1 4/(1+x^2) dx
But also
π = Σ_{i=1}^{N} ∫_{(i-1)/N}^{i/N} 4/(1+x^2) dx, where N can be large.
Note: Each integral can be approximated independently, then all the partial sums are added to form our π estimate.
http://uk.mathworks.com/help/distcomp/examples/numerical-estimation-of-pi-using-message-passing.html
Slide 27
Practical comment on computing 𝜋
• Good example of using local vector syntax in a MATLAB or Python solution – approximate each sub-integral with the mid-point rule (a NumPy version is sketched after the MATLAB snippet):
n = fscanf(fileid,'%i');            % number of sub-intervals, read from an already opened input file
h = 1/n;                            % width of each sub-interval
x = h*([1:n]-0.5);                  % mid-point of each sub-interval
vecsum = sum(4.0 ./ (1.0 + x.*x));  % sum of the integrand at the mid-points
piapprox = h*vecsum;                % mid-point rule estimate of pi
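The same mid-point computation sketched in Python / NumPy (here n is simply hard-coded rather than read from a file):

import numpy as np

n = 1000000                           # number of sub-intervals
h = 1.0 / n                           # width of each sub-interval
x = h * (np.arange(1, n + 1) - 0.5)   # mid-point of each sub-interval
piapprox = h * np.sum(4.0 / (1.0 + x * x))
print(piapprox)                       # ~3.14159265...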
Slide 28
Final comment on parallelism
• Look for independent blocks of computation, but then need to stitch the results of those blocks together.
• Also must be certain that preconditions support this parallel operation (e.g. different random seeds for each Monte Carlo process – see the sketch after this list).
• Generally, divide and conquer is a useful paradigm.
• Parallel computation always has overheads.
• Sometimes parallel computation is only time-effective if overheads are sufficiently small (e.g. parallel FFT).
• Must assert correctness of the algorithm!!
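One concrete way to honour that "different seeds" precondition, sketched with NumPy (the master seed and worker count are arbitrary choices for the example):

import numpy as np

n_workers = 8
# Spawn statistically independent child seeds from one master seed so
# no two Monte Carlo workers reuse the same random stream.
children = np.random.SeedSequence(12345).spawn(n_workers)
rngs = [np.random.default_rng(s) for s in children]

# Worker i then draws all of its samples from rngs[i]
print(rngs[0].random(3))
print(rngs[1].random(3))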
Slide 29
Hints on using any batch system
• Understand what you need to do:
  • What sort of problem am I trying to address?
  • How much input / output data am I going to need / generate?
  • Order of magnitude on the number of runs needed?
  • Idea of how long each run takes?
  • What do I want to get back from a run that I can analyse / process?
  • What are the weaknesses in my workflow for dealing with lots of runs / bigger data volumes etc.?
  • How can I automate things to fit into a batch environment?
• The batch environment is shared with others!!!
• Check – some Linux / command line skills are likely needed.
Slide 30
Hints continued…
• Spend some time understanding the new environment. (Lots of material about Linux / UNIX is available online.)
• Ask for example scripts with walk-throughs.
  • Particularly examples that match what you need!
• Plan to start learning by copying.
  • Editing a working script to match your particular job, data etc. is a good way to get going.
• Large number of jobs to submit?
  • Try to automate the process – often tools / scripts are there already (a small sketch of generating per-job inputs follows this list)
  • Array jobs
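As one small illustration of that kind of automation (the parameters and file names are invented for the example), a short Python script can write one parameter file per task, which an array job then picks up by its task index:

import itertools

# Hypothetical sweep: every (angle, speed) combination becomes one array task
angles = [0, 15, 30, 45]
speeds = [1.0, 2.0, 5.0]

task_id = 0
for angle, speed in itertools.product(angles, speeds):
    task_id += 1
    with open(f"params_{task_id}.txt", "w") as f:
        f.write(f"angle={angle}\nspeed={speed}\n")

print(f"Wrote {task_id} parameter files; submit an array job with tasks 1-{task_id}")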