e-Infrastructure across Photon and Neutron Sources


TRANSCRIPT

  • e-Infrastructure across Photon and Neutron Sources

    Brian Matthews, [email protected]

    Scientific Computing Department, STFC Rutherford Appleton Laboratory

  • STFC

  • Formed by Royal Charter in 2007, the Science and Technology Facilities Council is one of Europe's largest multidisciplinary research organisations, supporting scientists and engineers world-wide.

    The Council operates world-class, large scale research facilities and provides strategic advice to the UK government on their development.

    The Council funds university research in particle physics, nuclear physics, astronomy and space.

  • CERN: LHC; ESO: ALMA Array; ESA: TopSat; ILL and ESRF; James Clerk Maxwell Telescope, Hawaii; RAL: Diamond; Daresbury Lab; Chilbolton Lab; Rutherford Appleton Lab; UK ATC

  • The scientific data centre at RAL

    RAL Base of Scientific Computing

    UK LHC Tier 1: ~20 PB of data. EMERALD: largest production GPU facility in the UK, 372 NVIDIA Tesla M2090 GPUs. JASMIN/CEMS: 4 PB of climate modelling data. SCARF: >2200-core cluster for simulation.

  • Facilities and Resources of The Hartree Centre: Scientific Computing Department at Daresbury Laboratory

    Projects and codes developed on state-of-the-art systems: BlueGene/Q, the fastest UK machine and the world's largest software development platform; over 5 PB disc and 15 PB tape stores; iDataPlex cluster; data-intensive systems; visualisation system.

  • Data Infrastructure for Large-Scale Facilities

  • STFC Rutherford Appleton Laboratory

  • The science we do: structure of materials; fitting experimental data to models; bioactive glass for bone growth; hydrogen storage for zero-emission vehicles; magnetic moments in electronic storage. ~30,000 user visitors each year in Europe: physics, chemistry, biology, medicine, energy, environmental science, materials, culture, pharmaceuticals, petrochemicals, microelectronics.

    Longitudinal strain in an aircraft wing; diffraction pattern from a sample. Visit the facility on the research campus; place the sample in the beam. Billions of euros of investment (c. 400M for DLS plus running costs). Over 5,000 high-impact publications per year in Europe. But so far no integrated data repositories; lacking sustainability and traceability.

  • Now: data monitoring, data synchronisation, network monitoring, data archive, data cataloguing

  • ICAT Tool Suite and Clients: ICAT + Mantid (desktop client); ICAT APIs; IDS (ICAT Data Service); ICAT Job Portal; TopCAT (web interface to ICATs); ICAT Data Explorer (Eclipse plugin in DAWN); desktop apps; clusters/HPC; disk; tape. http://www.mantidproject.org/ http://www.dawnsci.org/ https://code.google.com/p/icat-job-portal/

  • Scaling

  • Facility Data Lifecycle: proposal, approval, scheduling, experiment, data reduction, data analysis, publication. Metadata catalogue: http://www.icatproject.org

  • Managing Data Processing Pipelines (credits: Martin Dove, Erica Yang, Nov. 2009): raw data, derived data, resultant data. Issues: valuable data amongst noise; software versions; data provenance; distributed analysis; complex and dynamic workflows; usability of tools. (Credit: Phil Withers, Andy Alderson, Sam McDonald)
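    The issues listed above (software versions, provenance, distributed analysis) are exactly the things a pipeline can record for itself as it runs. Below is a minimal, hypothetical Python sketch, not the facility's actual tooling, showing one way to attach a provenance record (input/output checksums, software version, timestamp) to each step of a raw -> derived -> resultant data pipeline.

        import hashlib
        import json
        import time
        from pathlib import Path

        def sha256_of(path: Path) -> str:
            """Checksum that identifies a file version unambiguously."""
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        def record_step(name: str, software_version: str, inputs: list[Path],
                        outputs: list[Path], log: Path) -> None:
            """Append a provenance record for one completed processing step."""
            record = {
                "step": name,
                "software_version": software_version,  # e.g. the reduction code release used
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                "inputs": {str(p): sha256_of(p) for p in inputs},
                "outputs": {str(p): sha256_of(p) for p in outputs},
            }
            with open(log, "a") as f:
                f.write(json.dumps(record) + "\n")

    Each line of such a log links a derived file back to the exact raw files and software version that produced it, which is the minimum needed to make distributed analysis traceable.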

  • Detector Rates: Dectris Pilatus 6M, 2463 x 2527 pixels, 7 MB images, 25 frames per second, 175 MB/s. High duty cycles mean that 10 TB/day is quite possible.

    "The Pilatus detector has completely transformed the way X-ray photons are being detected today at synchrotron radiation sources, such as Diamond. This is something we could only have dreamt of in the early days of synchrotron sciences."

    Prof. Gerhard Materlik CBE, CEO of Diamond Light Source, June 18th, 2012
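    For reference, the quoted figures are consistent with each other: a 7 MB image at 25 frames per second gives 175 MB/s, and sustaining that rate for a large part of the day reaches roughly 10 TB. A quick back-of-the-envelope check in Python (the 70% duty cycle is an assumed value for illustration):

        # Back-of-the-envelope check of the Pilatus 6M figures quoted above.
        image_mb = 7                # MB per frame
        frames_per_sec = 25         # frames per second

        rate_mb_s = image_mb * frames_per_sec               # 175 MB/s sustained
        duty_cycle = 0.7                                     # assumed fraction of the day spent collecting
        tb_per_day = rate_mb_s * duty_cycle * 86400 / 1e6    # 86400 s/day, MB -> TB

        print(f"{rate_mb_s} MB/s, ~{tb_per_day:.1f} TB/day at {duty_cycle:.0%} duty cycle")
        # -> 175 MB/s, ~10.6 TB/day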

  • Infrastructure for managing data flows: scan; reconstruct; segment + quantify; 3D mesh + image-based modelling; predict + compare. (Some image credit: Avizo, Visualization Sciences Group (VSG).) Data catalogue; petabyte data storage; parallel file system; HPC (CPU+GPU); visualisation. Infrastructure + software + expertise! Tomography: dealing with high data volumes, 200 GB/scan, ~5 TB/day (one experiment). MX: high data volumes, smaller files, but a lot more experiments. Hard to move the data: it needs to be handled at the facility?

  • Managing Processed Data: HDF lib, NeXus lib, Python; CSV, JSON, XML, Excel; standalone web client, hosted web client, RESTful APIs; file system, HDF files, NeXus files. MVC: Model (content), View, Controller (access).
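    To make the "Model" side of that stack concrete, the sketch below reads one array out of an HDF5/NeXus file with h5py and exports it to CSV and JSON. The group path "entry1/data/counts" is only an illustrative example; real files follow instrument-specific NeXus application definitions.

        import csv
        import json
        import h5py  # NeXus files are HDF5 underneath

        def export_dataset(nexus_path: str, dataset_path: str = "entry1/data/counts") -> None:
            """Read one array from a NeXus/HDF5 file and export it as CSV and JSON."""
            with h5py.File(nexus_path, "r") as f:
                data = f[dataset_path][...]              # load the array into memory

            with open("counts.csv", "w", newline="") as out:
                writer = csv.writer(out)
                for row in data.reshape(len(data), -1):  # flatten trailing dimensions per row
                    writer.writerow(row.tolist())

            with open("counts.json", "w") as out:
                json.dump(data.tolist(), out)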

  • Sharing

  • PaNdata: Photon and Neutron Data Infrastructure. Established in 2007 with 4 facilities, now standing at 13, with friends around the world. Combined number of unique users: more than 35,000 in 2011. Combines scientific and IT staff from the collaborating facilities. European Framework 7 projects: PaNdata-Europe (SA, 2009-11); PaNdata Open Data Infrastructure (IP, 2011-14).

    Guesstimates: investment > 4,000,000,000; running costs > 500,000,000/yr; publications > 10,000/yr; running costs per publication ~ 50,000; data volume >> 10 PB/yr

  • PaN-data ODI an Open Data Infrastructure for European Photon and Neutron laboratories

    Federated data catalogues supporting cross-facility, cross-discipline interaction at the scale of atoms and molecules; unification of data management policies; shared protocols for the exchange of user information; common scientific data formats; interoperation of data analysis software. Data Provenance WP: linking data and publications. Digital Preservation: supporting the long-term preservation of the research outputs.

  • Counting Users: http://pan-data.eu/Users2012-Results

    [Table: numbers of users shared between each pair of facilities. Overall unique-user totals per facility: ALBA 773; BER II 1563; DESY 4197; DLS 4407; ELETTRA 3167; ESRF 10287; FRM-II 1095; ILL 4649; ISIS 2880; LLB 1235; SINQ 1219; SLS 3827; SOLEIL 4568; neutron sources 10023; photon sources 25336; all sources 33025. The full pairwise matrix is not legible in this transcript.]

  • PaN-Data Integration

  • Shared Model and Terms: Core Metadata
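    At its core, the shared model (as in the CSMD and the ICAT schema) is a simple hierarchy: an investigation owns datasets, which own datafiles, annotated with facility, instrument and proposal information. A minimal Python sketch of that hierarchy, with illustrative field names rather than the exact schema:

        from dataclasses import dataclass, field
        from typing import List

        @dataclass
        class Datafile:
            name: str
            location: str        # path or URL in the archive
            size_bytes: int
            checksum: str = ""

        @dataclass
        class Dataset:
            name: str
            dataset_type: str    # e.g. "raw", "reduced", "derived"
            datafiles: List[Datafile] = field(default_factory=list)

        @dataclass
        class Investigation:
            title: str
            facility: str        # e.g. "ISIS", "DLS"
            instrument: str
            proposal_id: str
            datasets: List[Dataset] = field(default_factory=list)

    Federation across facilities then amounts to agreeing on these core entities and terms, while still allowing facility-specific parameters alongside them.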

  • Towards the Future. Provenance: integrating context, analysis and publication into the record. Preservation: long-term need for archiving and curating data; persistent identifiers, integrity, context; costs and benefits of data preservation. Scalability: managing high data rates and volumes; parallel file stores.

  • Publishing

  • DOI Data Access Process (credit: Brian Matthews): paper -> DataCite -> STFC landing page -> TopCAT

  • Facilities Data Lifecycle: investigation as a Research Object; generate the DOI landing page from the RO; construct; publish

  • Futures

  • E-Infrastructure Requirements. Data rates begin to require dedicated central IT infrastructure, way beyond previous requirements. Data sets become too numerous to keep track of: data management, common metadata. Data sets become too large to take home: data policies; archive at the facility and cloud data access. Analysis requires a high level of computational power: archive at the facility and cloud access to HPC; integrating data analysis processes into data management processes; integrating workflow into large-scale data management processes. The variety of scientific areas leads to a variety of data formats and analysis software: common data formats and APIs. Large numbers of users move between labs: federated authentication and data catalogues.

    The rise of data-intensive experiments and computation: real-time data processing for live experiments; streaming data processing.

  • Integrating Data. Facilities offer complementary experimental techniques for a single beamline (e.g. tomography + diffraction).

    Users increasingly use multiple facilities, leading to the need for multi-stream data fusion and processing, including remote access to HPC and data storage resources.

    Using provenance information effectively: data publication; data tracing; data publication in context; reproducibility.

    Developments that will influence how the data is managed

  • RDA Interest Group: proposing a new RDA Interest Group on Photon and Neutron Science (PaNSIG)

    PaNdata partners plus US and Australian partners. Plan to hold the first workshop in Dublin, March 2013.

  • Brian Matthews

    [email protected] Thank you. Prioriphora schroederhohenwarthi, X-ray imaging at ESRF; Solorzano et al., Systematic Entomology (2011).

    Users, users, users People, people, people

    This talk aims to give you an overview of the data management infrastructure in our lab and describes the emerging trends and the current activities we are conducting.

    Perhaps, one of the unique

    Data have changed the way we do science. For example, data analysis used to be a single-person job; now it is the norm to analyse data collaboratively, with more people involved, maybe from different places and different institutions. At RAL, data analysis is certainly a staged process: for example, instrument scientists are involved in the initial analysis, and once back home, users conduct simulations before and after. Computing is also playing a central role in data analysis: software and, in some cases, large computers/HPC are involved heavily in the process, even before users get their data (for example, pre-processing at the DLS MX beamlines). Because of modern detector technologies, high-throughput experimental techniques have directly led to the demand for high-throughput computing, high-data-rate capture, and storage. In the lab, simulation is also mentioned a lot across many sciences.

    Most branches of modern science are heavily influenced by the rapid development of modern computing technologies. We, the Science and Technology Facilities Council, one of the seven research councils in the UK, are certainly a very good example of this. This talk aims to give you a flavour of the behind-the-scenes computing technologies supporting the modern instruments offered by the UK national laboratories: a range of infrastructure technologies for managing their diverse range of research data. This talk is given from my perspective as a computer scientist.

    Since 2001, the lab has gone through a significant journey to develop a common data management infrastructure for the facilities, from the live experiment to long-term archival storage. From national data centres (e.g. BADC) to STFC facility-oriented labs (e.g. DLS, ISIS, and CLF), they are now all sharing a common data infrastructure managed by SCD.

    Now we have a substantial and mature infrastructure to manage our data. This is the result of a journey over the past 10+ years. This talk will give you a quick glance at this journey, tell you where we are today, and the directions we are heading in. Of course, we would also like to share the lessons we have learned (perhaps we are a bit short of time to explain that), but we will certainly try to give you a picture of the exciting opportunities lying ahead of us, and hopefully give you ideas of how you might want to collaborate with us.

    === From the PaNdata website: ISIS is the world's leading pulsed spallation neutron source. It runs 700 experiments per year performed by 1600 users on the 22 instruments. These experiments generate 1 TB of data in 700,000 files. All data ever measured at ISIS over twenty years is stored at the facility, some 2.2 million files in all. ISIS use is predominantly UK-based but includes most European countries through bilateral agreements and EU-funded access. There are nearly 10,000 people registered on the ISIS user database, of which 4000 are non-UK EU. The user base is expanding significantly with the arrival of the Second Target Station.

    Since 2001, e-Science has been developing a common e-Infrastructure supporting a single user experience across the STFC facilities. Much of this is now in place at ISIS and Diamond, as well as at the STFC Central Laser Facility. Components are also being adopted by ILL, the Australian National Synchrotron and Oak Ridge National Laboratory in the US. On ISIS today, the experiment instrument computers are closely coupled to the data acquisition electronics and the main neutron beam control. Data is produced in an ISIS-specific RAW format and access is at the instrument level, indexed by experiment run numbers. Beyond this, data management comprises a series of discrete steps. RAW files are copied to intermediate and long-term data stores for preservation. Reduction of RAW files, analysis of intermediate data and generation of data for publication is largely decoupled from the handling of the RAW data. Some connections in the chain between experiment and publication are not currently preserved. Future data management will focus on the development of loosely coupled components with standardised interfaces, allowing more flexible interactions between components. The RAW format is being replaced by NeXus. The ICAT metadata catalogue sits at the heart of this new strategy, implementing policy controlling access to files and metadata; using single authentication, it allows linking of data from beamline counts through to publications and supports web-based searching across facilities.

    ===

    This talk discusses

    the emerging opportunities opened up by the availability of systematically managed data in a large laboratory setting, and the technical barriers to making extensive use of them in real-world open data infrastructure developments. The latest developments of the CSMD will be presented to highlight the current direction we are taking in the context of the PaNdata-ODI project.

    * Erica Yang: Managing research data for diverse scientific experiments

    The Rutherford Appleton Laboratory is a UK national facility that handles a considerable amount of experimental research and operates many instruments that generate large volumes of data, such as the ISIS Pulsed Neutron Source, the Central Laser Facility, and the Diamond Light Source. It has developed a Core Scientific Metadata Model (CSMD) for the management of the data resources of the facilities in a uniform way. This talk discusses the emerging opportunities opened up by the availability of systematically managed data in a large laboratory setting and the technical barriers to making extensive use of them in real-world open data infrastructure developments. The latest developments of the CSMD will be presented to highlight the current direction we are taking in the context of the PaNdata-ODI project, a collaborative European project involving 13 major world-class research laboratories that operate one or more neutron or photon sources in Europe.

    * Simon Coles: The data explosion and the need to manage diverse data sources in scientific research

    Crystal structure determination has become a high-throughput activity, and even at the level of the departmental laboratory, automation is increasingly important in managing the experiment and, crucially, the data collected from the experiment and its subsequent analysis and dissemination. The UK National Crystallography Service is a medium-scale facility which needs to address issues of data management, accountability and dissemination, on top of its efforts to achieve best experimental practice. In providing a service to chemists as well as crystallographers, it has considerable experience in cross-discipline ontology building, data publication via repository platforms, and integration with laboratory management systems.

    John Helliwell: of raw diffraction images

    The IUCr Executive Committee has charged a working group with an assessment of the potential benefits of depositing raw experimental datasets (with initial emphasis on X-ray diffraction images), and the cost, technical and structural ramifications of doing so. There are a number of potential locations for depositing raw images that allow their reuse in validation, re-refinements, reanalysis for new science, education and software development: for example, in discipline-specific data centres, in large-scale instrument facilities, or in institutional repositories. These are not necessarily exclusive (for example, a central data centre might archive only datasets associated with published structures), and initiatives such as the Australian TARDIS demonstrate approaches to federating separate repository platforms. Crucial to interoperability between such federated archives will be well-defined metadata and procedural standards.

    ===== Erica's Bergen talk

    Linking raw experimental data with scientific workflow and software repository: some early experience in the PaNdata-ODI project. Large facility providers have often developed mature data and publication infrastructures to capture the scientific outputs from experiments. The aim is not only to ensure the long-term accessibility of these digital assets, but also to demonstrate the prolonged impact of the research they support. Traditionally, the emphasis is on the cataloguing and archiving processes at the two ends: raw experimental data and publications. However, due to the rapidly rising data rates and volumes from scientific experiments and the complexity of certain types of data analysis, researchers are becoming increasingly reliant on the infrastructure services provided by the facility operators. This talk presents the early evidence we gathered in the data provenance WP of the PaNdata-ODI project to demonstrate the emerging needs in the community and to present some early snapshots of our approach to addressing the problem. In particular, we will examine the interplay of experimental data archives, scientific workflows, and software repositories. =====

    *** IBM Blue Gene/Q: 6+1 racks

    6 racks: 98,304 cores; 6144 nodes; 16 cores & 16 GB per node; 1.26 Pflop/s peak

    1 rack of BGAS (Blue Gene Advanced Storage) data-intensive system: 16,384 cores

    IBM iDataPlex

    Each node has 16 cores, 2 sockets; Intel Sandy Bridge (AVX etc.)

    252 nodes with 32 GB; 4 nodes with 256 GB; 12 nodes with X3090 GPUs

    Data Intensive Computer

    256 nodes with 128 GB; ScaleMP virtualization software allows up to 4 TB of virtual shared memory

    8192 cores, 196 Tflop/s peak

    Storage:

    5.76 PB usable disk storage; 15 PB tape store

    Ease of use!!! Don't need to be software jockeys. Data intensive... big data... informatics. * Diverse range of facilities

    === Diversities. File formats: NeXus, HDF, RAW, text, TIFF/images. Data rate/volume:

    This is the site plan of RAL, showing the three major science facilities operated at the site: on one side, Diamond, the UK national synchrotron facility; on the other, ISIS TS2 and TS1; and in the middle, the CLF high-power laser.

    The orange rectangle in the middle is SCD, which is the data centre for STFC's science data, including DLS, ISIS and CLF. So we manage data from live experiments (i.e. from the facilities) and also from data archives (satellite data and environmental data) under the same roof.

    Offering a large range of experiment techniques:

    DLS: imaging, diffraction, scattering, macromolecular crystallography, spectroscopy. ISIS: diffraction, scattering, crystallography, imaging.

    Serving a large range of scientific disciplines

    It took us about 10 years to get to where we are today. Different facilities progress at different paces. ISIS was the first to adopt. ** 30,000 user visitors each year. Billions of euros of infrastructure. Support for experiments in many scientific fields: physics, chemistry, biology, material sciences, energy technology, environmental science, medical technology, cultural heritage investigations.

    Industrial applications are growing: pharmaceuticals, petrochemicals and microelectronics.

    experimental techniques including photoemission and spectromicroscopy, macromolecular crystallography, low-angle scattering, dichroic absorption spectroscopy, neutron and x-ray imaging.

    This is where we are now: an (internal) common data management infrastructure (i.e. services) for diverse experiments. The point is that there is a sophisticated range of services behind the scenes to ensure the operation and stability of the data management infrastructure. Compared with most smaller laboratories, the efficiency of the whole operation perhaps differentiates us from others. In other words, users, and not only the users on site at RAL, are able to have much better access to their data via this modern computing infrastructure, which is underpinned by a set of tools: Nagios, ICAT, TopCAT, CASTOR and StorageD.

    Our data management infrastructure is backed by a comprehensive range of data services hosted within the data infrastructure at RAL to support the experiments:

    Data monitoring of the data acquisition stations at the instruments (ISIS); synchronisation (ISIS); site-wide network traffic monitoring between key network components, from the experimental halls to the ISIS main routers, to the site-wide intranet, to the SCD main router, to SCD's petabyte data store; data cataloguing (ISIS+SCD); long-term data archive and preservation store (SCD), the petabyte data store.
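    The cataloguing step in that list is essentially: watch the instrument output area, extract minimal metadata, and register it. A hypothetical sketch of that loop; the directory path is a placeholder, and printing the records stands in for the real ingestion route into ICAT and the archive.

        from datetime import datetime, timezone
        from pathlib import Path

        def scan_new_files(instrument_dir: Path, already_seen: set[str]) -> list[dict]:
            """Collect minimal metadata for NeXus files not yet registered in the catalogue."""
            records = []
            for path in sorted(instrument_dir.glob("*.nxs")):
                if path.name in already_seen:
                    continue
                stat = path.stat()
                records.append({
                    "name": path.name,
                    "location": str(path),
                    "size_bytes": stat.st_size,
                    "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                })
                already_seen.add(path.name)
            return records

        # In production these records would be pushed to the metadata catalogue (ICAT)
        # and the files copied to the long-term archive; here we only print them.
        if __name__ == "__main__":
            seen: set[str] = set()
            for record in scan_new_files(Path("/instrument/output"), seen):
                print(record)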

    DLS is similar. === DLS: 30+ beamlines (including Phase III, planned and being built). ISIS: 40+ instruments (including those being built).

    Data volumes. DLS: tomography imaging, 120 GB/file, reconstructed file about the same, 4-5 TB/experiment; more beamlines will be equipped with imaging capability. ISIS: maximum on the WISH instrument, 27 GB/file in event mode (continuous data acquisition); similarly, multiple techniques.

    A constant challenge is the scalability of the infrastructure, especially the data cataloguing and archiving processes, in response to the increasing data rate and volume. For example, DLS is introducing a localised data catalogue which is synchronised with the remote ICAT hosted by SCD, because DLS data ingestion into the catalogue needs to cope with the high data rate and volume. In other words, this constantly challenges the architecture of the infrastructure.

    === Of course, this is the view of the computing people, the ISIS computing team and SCD. Users don't get to see these. The whole point of managing data is to maximise the usage of the data.

    Immediately, for live experiments, this is of the utmost importance for a facility operator like STFC, so that experimenters can access their experiment data as soon as they can. After the experiment, this also means that users may need to come back to their data: traditionally via emails, direct from instrument scientists; increasingly via the data portal, because data size and volume have increased rapidly in recent years. Perhaps due to the open access movement, open access datasets are slowly being made available via the data portal. People are interested in DOIs (like this group), and communities are interested in analysing other people's data (for example, a tomography dataset is made available for the imaging community to explore better ways of reconstructing images, achieving better quality, testing codes on different computer architectures/GPUs, and for algorithm development).

    Two types of data provision: (1) via the data catalogue and its interface, where data is catalogued soon after the experiment (typically 5 minutes, at most 1 hour, after an ISIS experiment); (2) via the network file system, where, typically, users get a copy of their experiment data and do the analysis back home.

    * Mantid: ISIS/SNS; DAWN: ESRF+DLS. Mantid and DAWN are both desktop applications and can run standalone.

    ICAT JobPortal: LSF/CLF

    The major difference is that the ICAT Job Portal is a web application, built on a distributed processing framework. It is used to run jobs remotely and can be leveraged to harness HPC or cluster computational capabilities. It is built on ICAT, the data catalogue, and is integrated with ICAT to steer its processing. Hence, it captures, tracks, and links data provenance with experiment data.

    CSMD has been extended to accommodate provenance data captured in the downstream data processing following experiments.

    * The more advanced the computing infrastructure (compute, storage, archiving) is, the better and faster the science can be!

    In other words, researchers can move quickly from experiment to research outputs (derived data and publications).

    Experiment formats: ISIS is migrating to NeXus and also has some ISIS-specific RAW format files; DLS uses NeXus, with

    *Project I2S2

    In recent years, we have put a lot of work into understanding the downstream data processing pipelines/workflows after the experiment, all the way to simulation and modelling. Through various JISC-funded projects and internal projects, it is clear that users are facing increasing challenges in managing their processed data and workflows. Most processing in * Project: LightPath

    A new dimension for RAL, a science facility that manages data? For big data, big data analysis, big data provision, access, exploitation? What is your feeling?

    Data flow, not user workflow. We don't manage workflow, but we provide the glue that lets users integrate data with computation seamlessly, namely a computational data infrastructure.

    Data, and a pipeline of surrounding activities and services offered around the data. This may change the way facilities are operated (not only providing experimental facilities and immediate data reduction, but also centralised data processing). If it works, it will also change the way users access their data, process their data, and their long-term relationship with the facilities.

    ===== paper notes:

    1. Needs: a dynamic mechanism to allow the addition or removal of metadata (due to the fast-changing pace of upgrades) and its incorporation in the data management infrastructure. 2. Coordinated and collaborative means to work with other facilities to investigate and invest in common technologies. * Project Accelerator. * How can federated data catalogues be used by scientists to drive multi-technique data processing, i.e. processing data from multiple sources via the federated data catalogues? Should they be application-specific, like the NeXus application profiles? If so, how can we harmonise these profiles among the facilities? * Data Object Persistency and Data Publication. DOI: Digital Object Identifier.

    Either users reading an article can select the DOI, or users searching the DataCite metadata can select the DOI. This will bring them to the landing page for a dataset, from which they go to TopCAT to access the actual data files within that dataset.
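    The chain above (DOI, DataCite, landing page, TopCAT) starts with ordinary DOI resolution. A minimal sketch using the public doi.org resolver and the requests library; the DOI shown is a placeholder, not a real dataset identifier:

        import requests

        def landing_page_for(doi: str) -> str:
            """Resolve a dataset DOI to its landing page URL via the doi.org resolver."""
            response = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
            response.raise_for_status()
            return response.url  # final URL after redirects, i.e. the landing page

        # Placeholder DOI for illustration only.
        print(landing_page_for("10.5286/ISIS.E.EXAMPLE"))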

    The process works, it does the job, BUT

    * I hope this gives you a flavour of how the lab is managing its data. I would like to conclude this talk by listing a range of developments that will, in my opinion, influence how the data in the lab is managed, hoping that maybe the audience will be interested in collaborating with us on these topics. One thing I know for sure is that to tackle them, we need resources and collaborative effort.

    Emerging trends that will definitely have an impact on how we manage our data.

    So, what does it mean? What are the implications of these trends/opportunities to how we manage data?

    Well, I don't have all the answers to these questions, but I can show you some current activities we are conducting to explore support for these trends.

    All these could have important implications for the way data is managed, used, and exploited (by facilities, users, and communities).

    LightPath: because we have a data catalogue, which stores
    * Collaborating institutions and funding bodies for our work *