UQ’s MeDiCI: Data in the right place, at the right time.
Jake Carroll, Senior ICT Manager (Research), The Queensland Brain Institute, The University of Queensland, Australia
This is a story of data locality, performance, namespace and financial complexity.
QBI
CAI
IMB
AIBN
100s of TBs of data generated per day - an eclectic mixture of life sciences, engineering, physics and nanotech data
Every man, woman and child seems to build a (little) supercomputer to deal with their problems…
Compute + Storage are tightly connected, in each building.
Instrument outputs + scientific endeavors grow - budgets for storage and compute do not.
To add another complexity…
The MeDiCI Journey
Thus, the problem (or question) definition:
“How do we provide parallel access to scientific data through a multitude of protocols, give the illusion that the data is ‘next to’ the applications, and keep the right data near the right type of computational infrastructure, all within our budgetary constraints?”
Spectrum Scale AFM (cache)
{Parallel IO via NSD protocol}
Spectrum Scale AFM (home)
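The cache/home relationship above can be sketched conceptually. This is a minimal, hypothetical Python model of how an AFM cache fileset behaves from a user's point of view (it is not Spectrum Scale code, and the file name is made up): the first read of a file pulls it from the home site over the WAN; subsequent reads are served from local storage.

```python
# Conceptual sketch (NOT Spectrum Scale code): the user-visible behaviour
# of an AFM "cache" fileset backed by a remote "home" fileset.

class HomeSite:
    """The authoritative 'home' fileset holding the data."""
    def __init__(self, files):
        self.files = dict(files)
        self.wan_fetches = 0  # counts reads that crossed the WAN

    def fetch(self, name):
        self.wan_fetches += 1
        return self.files[name]

class AfmCache:
    """A read-only 'cache' fileset: pulls from home on first access,
    then serves subsequent reads locally."""
    def __init__(self, home):
        self.home = home
        self.local = {}

    def read(self, name):
        if name not in self.local:                # cache miss -> WAN fetch
            self.local[name] = self.home.fetch(name)
        return self.local[name]                   # cache hit -> local read

home = HomeSite({"scan_001.tif": b"microscope data"})  # hypothetical file
cache = AfmCache(home)

cache.read("scan_001.tif")    # first read: fetched from home
cache.read("scan_001.tif")    # second read: served locally
print(home.wan_fetches)       # -> 1 (only one WAN fetch occurred)
```

The point of the sketch is the illusion the slide describes: applications at the cache site see the data as if it were local, while the fabric decides when bytes actually move.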
Back at UQ
uqjcarr1
Scale cluster “A”, using UQ creds
Scale cluster “B”, using other creds
Out at Polaris
someOtherName
mmname2uuid / mmuuid2Name
Turns out, all that code was missing from Spectrum Scale.
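The idea behind that missing mmname2uuid / mmuuid2Name translation can be sketched. Below is a minimal, hypothetical Python illustration (the real work was done inside Spectrum Scale): the same person has different local names on each cluster, so ownership is carried between sites via a shared stable UUID rather than a site-local name. The site labels "A"/"B" are assumptions; "uqjcarr1" and "someOtherName" are the names from the slides.

```python
# Sketch of cross-cluster identity mapping via a shared UUID.
# One human, two credential domains, one stable identifier.
import uuid

person_uuid = uuid.uuid4()  # stable identity shared by both sites

# name -> UUID (site-independent), and (site, UUID) -> local name
name_to_uuid = {"uqjcarr1": person_uuid, "someOtherName": person_uuid}
uuid_to_name = {
    ("A", person_uuid): "uqjcarr1",      # local name on the UQ cluster
    ("B", person_uuid): "someOtherName", # local name on the Polaris cluster
}

def to_local_name(name, target_site):
    """Translate a local name on one cluster into the corresponding
    local name on another, going via the shared UUID."""
    return uuid_to_name[(target_site, name_to_uuid[name])]

print(to_local_name("uqjcarr1", "B"))       # -> someOtherName
print(to_local_name("someOtherName", "A"))  # -> uqjcarr1
```

With a mapping like this in the fabric, file ownership survives the hop between clusters even though the local uid/name differs at each end.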
Network stumbles…
• We had, at best, 10GbE between our buildings and around the campus.
• Not made for the parallel IO aggression of Spectrum Scale AFM over the NSD protocol.
• Needed to spawn an entire mini-project to upgrade campus networks for big storage IO to 40/100G around the “ring” of nodes.
Recovery storms - AFM is a work in progress
• When you’re trying to recover tens of millions of files, AFM doesn’t always keep up.
• IBM working on it, for us (and others, globally).
• Scaling to hundreds of millions of files in a single (or multiple) file-sets, if not billions of files, in sync/push/recovery is required.
Things we assumed users would do, as per our mental model.
User puts data in the cache from instruments, to send to a supercomputer at a remote site.
User processes data out at the remote site on said supercomputer.
Things people actually did, breaking our mental model.
User puts data in the cache from instruments. They start processing on a supercomputer locally.
Simultaneously, they start using the storage fabric to process other “bits” of the run’s outputs on the other supercomputer, for an additive workflow [culminating in the fabric becoming a means for both supercomputers to work on the same tasks at the same time].
The same data namespace ended up everywhere.
That much was intentional.
As a result, users could leverage *every bit of the compute*, everywhere, simultaneously, if their workflow is smart enough…
IMB QBI
RCC
Turns out, we’re onto something
Thank you.
• UQ RCC, David Abramson for mentorship and a true sense of adventure.
• The Queensland Cyber Infrastructure Foundation (QCIF)
• My colleagues at UQ QBI, IMB, CAI, AIBN, ITS
• AIIA, ACS
• Justin Glen @ DDN