

Page 1:   Three data delivery cases for EMBL- EBI’s Embassyenvironmentalomics.org/wp-content/uploads/2015/06/...COMPARE platform Embassy infrastructure DTU infrastructure Embassy virtual

Three data delivery cases for EMBL-EBI’s Embassy

Guy Cochrane

www.ebi.ac.uk


EMBL European Bioinformatics Institute

Genes, genomes & variation •  European Nucleotide Archive •  1000 Genomes •  Ensembl •  Ensembl Genomes •  Ensembl Plants •  European Genome-phenome Archive •  Metagenomics portal •  GWAS Catalog browser

Expression •  ArrayExpress •  Expression Atlas •  MetaboLights •  PRIDE

Protein sequences •  InterPro •  Pfam •  UniProt

Chemical biology •  ChEMBL •  ChEBI

Literature & ontology •  Europe PubMed Central •  Gene Ontology •  Experimental Factor Ontology

Molecular structures •  Protein Data Bank in Europe •  Electron Microscopy Data Bank

Reactions, interactions & pathways •  IntAct •  Reactome •  MetaboLights

Systems •  BioModels •  Enzyme Portal •  BioSamples


Sequence data at EMBL-EBI

European Nucleotide Archive
-  Unrestricted data
-  Pan-species and application
-  http://www.ebi.ac.uk/ena/
-  Diagram labels: Read, Alignment, Annotation, Sample/method, Assembly

European Genome-phenome Archive
-  Controlled access data
-  Human data around molecular medicine
-  http://www.ebi.ac.uk/ega/
-  Diagram labels: Read, Alignment, Sample/method

Infrastructure provision
-  BBSRC: RNAcentral, MG Portal
-  MRC: 100k Genomes data implementation
-  EC: COMPARE, MicroB3, ESGI, BASIS
-  etc.


Challenges

•  Data have high volume and grow rapidly

•  Data are dynamic (continuous feed) and their application has urgency

•  Users require arbitrary and ad hoc access


Tara Oceans

Capacity


Infectious disease

•  Opportunity: A methodological revolution in clinical and public health towards shotgun sequencing-based methods

•  Scientific power: Sequence harbours rich information

•  Diagnostic: identification, typing, resistance profiling, etc.

•  Public health: outbreak detection, response strategy, vaccine development

•  Mechanistic: host interactions, pathogenicity, virulence, transmission, antimicrobial resistance

•  Informatics roles for EMBL:

•  COMPARE: Rapid global sharing of surveillance and outbreak data, systematic integrated analysis, compute provision (Embassy)

•  Standards for reporting, analysis and the communication of results

•  New algorithms and analysis methods

•  User interfaces for surveillance data reporting, across the domains

COMPARE: recently launched Horizon 2020 project in which EMBL-EBI is informatics provider

Global Microbial Identifier: Initiative with EMBL-EBI involvement supporting technologies, standards and data sharing for pathogen surveillance


COMPARE platform

[Architecture diagram; labels recovered from the slide:]

•  Embassy infrastructure, DTU infrastructure, Embassy virtual domain, EBI infrastructure

•  COMPARE workflow engine, COMPARE Portal, COMPARE Registry, COMPARE Data Resource

•  ‘Default’ tools, ‘Hosted tools’

•  Sources, Processes, Portals and environments

•  Public data, Managed access data, Private data

•  Assembly & alignment, Annotation, Typing

•  Food workflow development, Clinical workflow development, Outbreak workflow development, Workflow integration

•  INSDC data exchange, API

Urgency


Personalised medicine

•  Motivation: Personalised studies of variation, cancer mutation, epigenetics, regulation, expression require references for comparison and interpretation

•  As part of GA4GH, EMBL-EBI is working on

•  Resources serving reference human genomic and transcriptomic data, including Google read API, variant ‘Beacons’, etc.

•  CRAM compression supporting greater data fluidity and APIs to allow direct computational access

•  Delivery and synchronisation of high volume datasets to local Embassy and remote cloud infrastructures

•  Past and current FP7 projects include SLING, BASIS, ESGI

Arbitrary access


ENA conventional read data delivery

[Diagram labels: ENA data (NFS); FIRE1; ENA metadata; Conventional infrastructure (FTP, Aspera, GridFTP)]


ENA Embassy read data delivery

[Diagram; labels recovered from the slide:]

•  ENA data (Cleversafe), FIRE2, ENA metadata

•  FUSE, HTTP

•  Conventional infrastructure (FTP, Aspera, GridFTP)

•  Embassy cloud infrastructure (VMware -> OpenStack)

•  Marine cache, Pathogen cache, CRAM cache

•  Tara Oceans Embassy, COMPARE Embassy, GA4GH Embassy


ENA external read data delivery

…phase II


EMBL-EBI Embassy Cloud

Steven Newhouse

Head of Technical Services


The Challenge Facing EMBL-EBI

•  Volume and variety of genomic data expanding

•  EMBL-EBI data doubling every year - replication is challenging

•  Infrastructure currently 50,000 CPUs & 60+PB

•  Need to support complex analysis scenarios

•  Web and programmatic access to services (3M unique users)

•  Access to both public and managed access data sets

•  Bespoke workflows and tools across a variety of domains

•  Hard for users to replicate data sets for local analysis

•  Use the ‘cloud’ to bring local analysis to EMBL-EBI data
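The "doubling every year" and "hard to replicate" bullets can be made concrete with a little arithmetic. The Python sketch below projects storage under strict annual doubling and estimates how long one full copy would take over a network link. The 60 PB starting point is from the slide; the 10 Gbit/s link speed and the assumption of perfectly sustained throughput are illustrative only, not figures from the talk.

```python
def projected_volume_pb(start_pb: float, years: float,
                        doubling_time_years: float = 1.0) -> float:
    """Volume after `years`, assuming it doubles every `doubling_time_years`."""
    return start_pb * 2 ** (years / doubling_time_years)


def transfer_days(volume_pb: float, link_gbit_per_s: float) -> float:
    """Days needed to copy `volume_pb` (decimal petabytes) over an ideal link."""
    bits = volume_pb * 1e15 * 8            # petabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 86_400                # seconds per day


if __name__ == "__main__":
    # 60 PB comes from the slide; 10 Gbit/s is an assumed link speed.
    for year in range(4):
        volume = projected_volume_pb(60.0, year)
        days = transfer_days(volume, 10.0)
        print(f"year +{year}: {volume:.0f} PB, about {days:.0f} days at 10 Gbit/s")
```

Under these assumptions a single copy of 60 PB already takes roughly 556 days, and the figure doubles each year, which is the case for moving the analysis to the data rather than the reverse.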


EMBL-EBI Embassy Cloud

•  Service hosted at EMBL-EBI data centres

•  Direct network access to public and managed data sets

•  Direct network access to public services

•  Expect both academic and commercial users

•  Technical Implementation

•  Logically isolated outside EMBL-EBI’s LANs

•  Secure flexible infrastructure for both tenant and host

•  Resources exposed using VMware’s vCloud Director & OpenStack

•  Provide isolated IaaS clouds to multiple users


Why ‘Embassy’ Cloud?

•  An embassy is sovereign territory in a host country

•  Host Country: EMBL-EBI Data Centre

•  Sovereign Territory: Host Country not allowed to enter

•  Virtualisation provides the protection for ‘tenant’ and ‘host’

•  Host puts boundaries in place to protect it from the tenant

•  Tenant has freedom and control within those boundaries


Embassy Cloud Concept

[Diagram labels: Virtualised EMBL-EBI Hardware; Public Data; Managed Data; Public Services; Embassy Cloud 1; Embassy Cloud 2; Embassy Cloud 3; Private Data; PanCancer]


User Benefits for the IaaS Model

•  Tenant organisations get an empty virtual infrastructure

•  They establish their own virtual machines and networks

•  System administration performed by the tenant

•  EMBL-EBI staff have no access to the VMs

•  Added value from EMBL-EBI over other clouds

•  Machines and data hosted in known jurisdiction

•  Direct network access to data sets (public & managed access)

•  Direct network access to public EMBL-EBI services


Benefits to EMBL-EBI of the IaaS Model

•  A secure collaborative workspace

•  Work does not contend with main EMBL-EBI resources

•  Clearly define the committed IT resources and data

•  Explore how to build more data focused analysis services

•  Move the analysis to where the big data is located

•  Learn from and inform other big data scientific communities


Embassy Cloud: Typical Uses

•  Collaborative Environment

•  Neutral ground outside internal network

•  CTTV: Resources and VMs to host intranet, databases, …

•  Data Staging

•  Undertake submission from a local machine (following data staging) rather than from a remote location

•  BRAEMBL: Remote submission unreliable due to file upload

•  Data Analysis

•  Large scale management and analysis of data

•  PanCancer: 1,000 cores, 2.5 TB RAM, 1.0 PB HDD


Issues

•  Object Store Storage Infrastructure

•  Essential for scalable high-performance storage

•  Applications need to adapt to flat model

•  Current caching strategy will have a limit

•  Sharing resources between sites/communities/clouds

•  Adopt a standards based model for federating resources

•  Solutions for uploading and distributing VMs (+containers?)

•  Replicating large data sets to ‘attract’ workloads to a cloud
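As an illustration of the "applications need to adapt to flat model" bullet, here is a minimal Python sketch of what a flat namespace means for code written against directories. It is a toy in-memory store, not the Cleversafe or FIRE interface; `FlatObjectStore`, `path_to_key` and the example paths are all hypothetical names chosen for this sketch.

```python
from __future__ import annotations

import posixpath


class FlatObjectStore:
    """Toy object store: a single flat key -> bytes mapping, no directories."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_prefix(self, prefix: str) -> list[str]:
        # "Directory listing" is emulated by filtering keys on a shared prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


def path_to_key(path: str) -> str:
    """Turn a POSIX-style path into a flat key (normalised, no leading slash)."""
    return posixpath.normpath(path).lstrip("/")


if __name__ == "__main__":
    store = FlatObjectStore()
    for path in ("/reads/run1/a.cram", "/reads/run1/b.cram", "/reads/run2/c.cram"):
        store.put(path_to_key(path), b"placeholder")
    print(store.list_prefix("reads/run1/"))
```

The point of the sketch is that operations such as rename or recursive listing, which are cheap on a file system, become key rewrites or prefix scans on a flat store, so applications have to be designed around whole-object reads and writes.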


Gaps → Activities → Solutions?

•  Data Set Replication

•  Strategic pre-positioning of data into clouds

•  Leverage JANET/GEANT, GridFTP + Globus Transfers, …

•  Cloud federation for mobile computing

•  EGI has a federated cloud and VM distribution model

•  ELIXIR plans to build on existing infrastructure where possible

•  Wide-area file access needed for collaborative data analysis

•  High performance wide-area object-store

•  Need access control for human related data

•  Coordinated investment in infrastructure

•  Where is the UK coordination? What coordination is needed?

•  Integrating commercial resources where they add value

•  Integration with EU Infrastructure (ELIXIR)
