slas2012 whoeck

Facilitating target candidate prioritization via integrated, interactive visualizations of molecular profiling data Wolfgang Hoeck, Ph.D., Research Informatics, Amgen Inc.


Topics for today’s presentation

• What is Molecular Profiling Data?

• The problem of sharing large volume data

• Sending files isn’t working well

• Public molecular profiling efforts

• The Cancer Genome Atlas

• Sanger COSMIC

• Broad CCLE

• TARO - an integrated database plus interactive visualizations

• Identities and Standard Terminologies (Taxonomies)

• Commercial molecular profiling data repositories

• Leveraging internal and external efforts

• Pulling everything together

• Closing thoughts

2/5/2012 2 Wolfgang Hoeck

Molecular Profiling Data as a source of potential Targets

• What is Molecular Profiling Data?

• High-volume data (millions of data points) measuring genomic or transcriptomic endpoints

• Gene Expression: How much of my gene is expressed under a certain condition?

• Comparing gene expression of two groups – Normal/Tumor or Tumor/Tumor

• Surveying a panel of normal tissues

• Gene Copy Number: How many copies of my gene are present in the genome?

• Which genes are contained in an amplified region of a chromosome?

• Is a gene or gene family amplified or deleted in a given tumor setting?

• Can we validate the copy number status in an independent dataset?

• Somatic Mutations: Is my gene normal or mutated?

• Is the gene clearly mutated or is there conflicting evidence?

• Are mutations affecting genes in the same pathway?
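As an illustration of the "comparing gene expression of two groups" question above, here is a minimal sketch of a log2 fold change between a tumor group and a normal group. The function name, the pseudocount, and the sample values are assumptions made for this example; the presentation does not describe how TARO computes such comparisons.

```python
import math
from statistics import mean


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Log2 ratio of mean tumor expression to mean normal expression.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is not expressed in one of the groups.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))


# Hypothetical expression values (arbitrary units) for a single gene.
tumor_samples = [31.0, 27.0, 35.0]
normal_samples = [7.0, 6.0, 8.0]

lfc = log2_fold_change(tumor_samples, normal_samples)
print(f"log2 fold change (tumor vs. normal): {lfc:.2f}")
```

A positive value means higher expression in the tumor group. A real analysis would also need a statistical test and multiple-testing correction across all genes.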

Multiple Genomic Data Types lead to a list of possible targets

[Figure: a List of Targets (Target Classes) is scored by data type, producing Prioritized Target List #1 and Prioritized Target List #2]

• Data types shown: Gene Expression, Gene Copy Number, Gene Mutation, Gene Fusion, Methylation

• Technologies shown: Microarray, RNA-seq, CGH Array, SNP Array, Exome Sequencing, RNA-seq, ChIP-seq, ChIP-chip

• Each data type contributes Scores to the prioritized lists

Public and Commercial Molecular Profiling Efforts

Source | Name | Content | Data Type | Value

NCI | The Cancer Genome Atlas (TCGA) | 20+ tumor types, 500+ samples each | Gene Expression (uA, NGS), Copy Number, Clinical Data | Target Identification & Validation, Patient Stratification

Sanger Wellcome Trust | Cancer Genome Project (CGP) | COSMIC | Somatic Mutation Data | Target Identification & Validation, Patient Stratification, Model Selection

Broad Institute | Cancer Cell Line Encyclopedia (CCLE) | 800+ Cancer Cell Lines | Gene Expression (uA), Copy Number (uA) | Target Identification & Validation, Model Selection

GSK-caBIG | Wooster Cell Line Panel | 300+ Cancer Cell Lines | Gene Expression (uA), Copy Number (uA) | Target Identification & Validation, Model Selection

RICERCA | OncoPanel | 240 Cancer Cell Lines | Gene Expression (uA) | Target Identification & Validation, Model Selection

TARO Data Sharing Solution Strategy

• Data type focused

• Gene Expression, Copy Number and Somatic Mutations

• Technology Independent

• Data from Microarray, NextGen Sequencing, Sanger Sequencing, etc.

• Source Independent

• Data comes from multiple sources: Amgen, TCGA, Broad, Sanger, Publications

• Data Standardization enables integration at multiple Levels

• Gene, Tissue, Disease, Sample (Tissue Sample/ Cell Line Sample)

• Modular Development

• Independent Database

• Support Multiple User Interfaces

• Visualization UI

• Central Research Discovery Tool

• Web Services
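The "Data Standardization enables integration" point above can be sketched as follows: records from different sources are mapped to canonical gene symbols and cell line names before they are combined. The alias table, function names, and normalization rule are assumptions for illustration only; the presentation does not show TARO's actual taxonomy or curation logic.

```python
import re

# Hypothetical alias table; a curated taxonomy would be far larger.
GENE_ALIASES = {"HER2": "ERBB2", "NEU": "ERBB2", "P53": "TP53"}


def canonical_gene(symbol):
    """Map a gene alias to its standard symbol (unknown symbols pass through)."""
    key = symbol.strip().upper()
    return GENE_ALIASES.get(key, key)


def canonical_cell_line(name):
    """Normalize a cell line name by upper-casing and dropping punctuation.

    This is a simplification: purely rule-based normalization can merge
    distinct cell lines, which is one reason manual curation is needed.
    """
    return re.sub(r"[^A-Z0-9]", "", name.upper())


# Two hypothetical sources that spell the same entities differently.
source_a = {"gene": "HER2", "cell_line": "MCF-7"}
source_b = {"gene": "erbb2", "cell_line": "mcf 7"}

key_a = (canonical_gene(source_a["gene"]), canonical_cell_line(source_a["cell_line"]))
key_b = (canonical_gene(source_b["gene"]), canonical_cell_line(source_b["cell_line"]))
print(key_a, key_b, key_a == key_b)
```

Once both records share the same key, data from different sources and technologies can be joined on gene and sample.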

TARO Use Cases

• Target Identification:

• Systematically identify targets via differential expression and/or copy number in one or multiple tissue datasets

• Target Validation:

• Validate target expression in independent tissue datasets

• Verify target expression across many normal and diseased tissue types to determine tissue specificity and potential off-target effects

• Model Selection:

• Identify cell line models with high or low expression of the target of interest

• Identify cell line models that contain a target gene amplification

• Provide mutation data on typical genes within selected cell lines to highlight the mutational background

• Identify cell lines with a specific mutation pattern (e.g., EGFR mut and KRAS wt)
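The last model-selection use case, finding cell lines with a pattern such as "EGFR mut and KRAS wt", can be sketched as a simple set query. The data layout, function name, and cell line identifiers below are hypothetical and are not taken from TARO.

```python
def lines_matching(mutations, mutated, wild_type):
    """Return cell lines mutated in every gene of `mutated` and in none of `wild_type`.

    `mutations` maps a cell line name to the set of genes reported as mutated.
    Assumption of this sketch: a gene absent from the set is treated as wild
    type, which conflates "not mutated" with "not tested".
    """
    required = set(mutated)
    excluded = set(wild_type)
    return sorted(
        line
        for line, genes in mutations.items()
        if required <= genes and not (excluded & genes)
    )


# Hypothetical mutation calls per cell line.
mutations = {
    "LINE_A": {"EGFR"},
    "LINE_B": {"EGFR", "KRAS"},
    "LINE_C": {"KRAS", "TP53"},
    "LINE_D": set(),
}

hits = lines_matching(mutations, mutated=["EGFR"], wild_type=["KRAS"])
print(hits)
```

In practice the wild-type call would need explicit evidence that the gene was sequenced in that cell line, since sources can disagree or be incomplete.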

Target Identification

Target Validation

Model Selection

Layering the Information Landscape

[Figure: four stacked layers, listed here from top to bottom]

• Decision Support (Research Gateway, TARO-Guides): Query tools for Amgen scientists to search across internal and external data repositories

• Convergence (TARO Data Mart, Omics Repository): Centralizes and organizes the storage of ‘Omics data for bioinformaticists and biologists alike

• Transactional (Omics Analysis, Experiments): Operational systems to handle the day-to-day execution of ‘Omics experiments and their initial analysis

• Reference Data (Research Taxonomy Foundation (RTF): Gene Index, Tissue, Disease, Organism, Cell Line): Fulfills the baseline requirements for biology identity / reference data systems. Without these systems none of the above is possible.

• Processing steps shown in the figure: Normalize, Aggregate, Summarize

Take it apart, standardize, then connect and integrate …

TARO Guide Collection – covering the spectrum from summaries to details


Interactive Visualizations in Spotfire Client or Webplayer

• Gene Expression

– Gene-level or probe-set level

– Panels or Comparisons

• Copy Number

– Whole chromosome view

– Detail per sample

• Somatic Mutations

– 1700+ cancer cell lines

– COSMIC and other mutation data

Surveying the mutation landscape in Cancer Cell Lines

[Figure: mutation matrix of genes versus cell lines, annotated with the three standards below]

• Standard Gene Symbols

• Standard Canonical Cell Line Name

• Standard Mutation Nomenclature

Integrating Expression and Mutation Data
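One way to picture this kind of integration is to group expression values by mutation status once both datasets use the same cell line names. The sketch below is an assumption-laden illustration: the function name, data layout, and values are hypothetical, and the slide does not describe the actual visualization logic.

```python
from statistics import mean


def expression_by_mutation_status(expression, mutated_lines):
    """Average a gene's expression separately for mutant and wild-type cell lines.

    `expression` maps cell line name to an expression value; `mutated_lines`
    is the set of cell lines carrying a mutation in the gene of interest.
    Both inputs are assumed to use the same canonical cell line names.
    """
    groups = {"mutant": [], "wild_type": []}
    for line, value in expression.items():
        status = "mutant" if line in mutated_lines else "wild_type"
        groups[status].append(value)
    # Skip empty groups so mean() is never called on an empty list.
    return {status: mean(values) for status, values in groups.items() if values}


# Hypothetical expression values and mutation calls.
expression = {"LINE_A": 8.0, "LINE_B": 10.0, "LINE_C": 3.0, "LINE_D": 5.0}
mutated_lines = {"LINE_A", "LINE_B"}

summary = expression_by_mutation_status(expression, mutated_lines)
print(summary)
```

The join only works because both inputs share one naming standard, which is the point of the standard gene symbols and canonical cell line names in the presentation.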

Successes and Shortcomings of TARO

• Ideal for pointed questions

• Show me the expression, copy number and mutation status of Gene X

• Generate a list of differentially expressed genes for upload into NextBio

• Identify cell lines with a particular mutation profile

• Great for data important to Amgen

• Provides a foundation for accumulating knowledge

• Shortcomings

• Breadth of data is resource-limited

• Data isn’t always available immediately; curation takes time

• Complexity of data space, capability vs. simplicity

• There is still some learning involved for scientists

• Chosen technology doesn’t always allow the desired User Interface


Commercial Molecular Profiling Data Repositories

• Oncomine and Oncomine Power Tools (OPT)

• Organizing and annotating oncology data in a consistent fashion

• Oncomine Enterprise: Web user interface, enabling customer data uploads

• OPT: Integrated Gene Browser - Bringing multiple data-types together in a summary view

• NextBio

• NextBio Enterprise: Web user interface, enabling customer data uploads

• Multiple Apps for variety of profiling data

• Includes literature data

• Provides Meta Analysis: Surveying studies across multiple sources


Integrated Gene Browser – Oncomine Power Tools


BodyAtlas Cell Lines – NextBio

Where do we go from here?

• Why do this in the first place?

• Better informed decisions

• Achieve higher throughput, consider more targets

• Help in understanding the complexity of the landscape

• We are starting to see the fruits of “semantic integration efforts”

• Ad-hoc integration with stand-alone profiling data of different data types becomes much easier (e.g., Phosphoprotein Arrays)

• Utilization of other public profiling datasets is easier (e.g.: from publications)

• Migrating into the “screening data” space (e.g.: compound-treated cell line panels) now becomes possible

• In-House Challenges: Domain knowledge for curation, presentation of complex data in limited space, Database Performance – can we make it good enough?

• Vendor Challenges: Interfaces for Integration,

• Balance knowledge management efforts: Are we just data collectors? But wait, there is more ….


Acknowledgements

• Interdisciplinary team work

• Database Designers

• Database Administrators

• System Administrators

• Business Analysts

• Scientists

• Bioinformaticists

• Support Analysts

• Project Manager


NONE OF THIS WOULD BE POSSIBLE WITHOUT TEAM WORK

THANK YOU FOR YOUR TIME
