Brett Whitty
ICGC Data Coordination Center Curation Manager
Ontario Institute for Cancer Research
Open Cloud Consortium “Towards a Biomedical Commons Cloud” Working Group
April 2013
Some Considerations for Enabling Users of International Cancer Genome Consortium (ICGC) Data in a Biomedical Compute Cloud
53 projects, 16 countries/regions, > 25,000 tumors committed
ICGC Data
Current data (represents ~1/3 of goal):
• ~100 GB of gzipped analysis results (open access)
  ◦ Hosted via HTTP(S)/FTP at the ICGC DCC data portal
• ~700 TB of raw sequencing and array datasets* (controlled access)
  ◦ Hosted at the EBI EGA repository (and other public repositories)
* Excluding data from TCGA projects (~50% of ICGC member projects are TCGA projects)
ICGC Data Access
• Blanket access to ICGC data is granted by the ICGC Data Access & Compliance Office (DACO)
  ◦ Excludes TCGA data, for which access is granted by the TCGA project
• DACO, ICGC.org & DCC support OpenID for authentication
  ◦ NCBI, CGHub, and EBI EGA each use a different authentication mechanism for access to ICGC & TCGA data
• ICGC datasets are presently distributed across several public repositories
  ◦ Presents a challenge to end users
  ◦ Need to aggregate the data through a single access point, virtually if not physically
• Ideally a single user sign-on method would be recognized by all resources
  ◦ May be impossible due to technical/organizational challenges
ICGC Computes (1)
• No common ICGC data analysis centers (yet)
• No common ICGC workflow systems (yet)
• No common ICGC pipelines (yet)
ICGC Computes (2)
• Who are the cloud-based data consumers?
  ◦ What do they need/want?
• Is it sufficient for ICGC to simply provide datasets?
• Does ICGC also need to provide canned analysis pipelines?
  ◦ Reproduce methods used in ICGC publications?
  ◦ Who creates/maintains these?
  ◦ Using which workflow system?
Other Issues
• Can ICGC DACO assure authorization and compliance of cloud-based data consumers?
  ◦ Auditing, revoking access, etc.
  ◦ How is this achieved?
• What are the support needs of “ICGC Cloud” users?
  ◦ How much effort will they require?
  ◦ From whom?
• What is the minimal metadata we need to collect to make the data useful?
  ◦ Who ensures this?