2013-b_whitty-biomedical_cloud


Page 1: 2013-B_Whitty-biomedical_cloud

Brett Whitty
ICGC Data Coordination Center Curation Manager

Ontario Institute for Cancer Research

Open Cloud Consortium
“Towards a Biomedical Commons Cloud” Working Group

April 2013

Some Considerations for Enabling Users of International Cancer Genome Consortium (ICGC) Data in a Biomedical Compute Cloud

Page 2: 2013-B_Whitty-biomedical_cloud


• 53 projects
• 16 countries/regions
• > 25,000 tumors committed

Page 3: 2013-B_Whitty-biomedical_cloud

ICGC Data

Current data (represents ~1/3 of goal):

• ~100 GB of gzipped analysis results (open access)
  ◦ hosted via HTTP(S)/FTP at ICGC DCC data portal

• ~700 TB raw sequencing and array datasets* (controlled access)
  ◦ hosted at EBI EGA repository (and other public repos)

*excluding data from TCGA projects (~50% of ICGC member projects are TCGA projects)


Page 4: 2013-B_Whitty-biomedical_cloud

ICGC Data Access

• Blanket access to ICGC data is granted by the ICGC Data Access & Compliance Office (DACO)
  ◦ Excludes TCGA data, for which access is granted by the TCGA project

• DACO, ICGC.org & DCC support OpenID for authentication
  ◦ Access to ICGC & TCGA data at NCBI, CGHub, and EBI EGA uses different authentication mechanisms

• ICGC datasets are presently distributed across several public repositories
  ◦ Presents a challenge to end users
  ◦ Need to aggregate the data through a single access point, virtually if not physically

• Ideally a single user sign-on method would be recognized by all resources
  ◦ May be impossible due to technical/organizational challenges


Page 5: 2013-B_Whitty-biomedical_cloud

ICGC Computes (1)

• No common ICGC data analysis centers (yet)

• No common ICGC workflow systems (yet)

• No common ICGC pipelines (yet)


Page 6: 2013-B_Whitty-biomedical_cloud

ICGC Computes (2)

• Who are the cloud-based data consumers?
  ◦ What do they need/want?

• Is it sufficient for ICGC to simply provide datasets?

• Does ICGC also need to provide canned analysis pipelines?
  ◦ Reproduce methods used in ICGC publications?
  ◦ Who creates/maintains these?
  ◦ Using which workflow system?


Page 7: 2013-B_Whitty-biomedical_cloud

Other Issues

• Can ICGC DACO assure authorization and compliance of cloud-based data consumers?
  ◦ Auditing, revoking access, etc.
  ◦ How is this achieved?

• What are the support needs of “ICGC Cloud” users?
  ◦ How much effort will they require?
  ◦ From whom?

• What is the minimal metadata we need to collect to make the data useful?
  ◦ Who ensures this?
