Exploratory Gene Association Networks October 2009 Jesse Paquette Helen Diller Family Comprehensive Cancer Center University of California, San Francisco


Exploratory Gene Association Networks

October 2009

Jesse Paquette

Helen Diller Family Comprehensive Cancer Center

University of California, San Francisco


What EGAN is

• Software that runs on a biologist’s computer
– Java 6 and Java WebStart
– Utilizes Cytoscape libraries for graph rendering

• A searchable library of genes and gene annotation
– Links out to web resources (Entrez/PubMed/KEGG/Google/etc.)

• A visualization tool that shows how genes and annotation terms are related
– User constructs dynamic hypergraphs using experiment results and enrichment statistics


Why EGAN was made

• To accelerate exploratory assay analysis by providing a pre-compiled knowledge network

• As an alternative to presentation of exploratory assay results as gene lists

• To allow researchers to combine multiple analysis results from potentially different platforms


Exploratory assays

• AKA high-throughput experiments
– Measure hundreds to millions of entities

• Empirical assays
– Expression microarrays
– aCGH
– MS/MS proteomics
– Yeast two-hybrid interaction assays
– QTL/SNP associations
– DNA methylation
– ChIP chips
– Next-gen sequencing

• In-silico algorithms
– Sequence
– Structure
– Literature


The exploratory assay workflow


Post-computational analysis questions

• Given a set of entities (genes), S:

– How are the entities in S related to each other?

– What annotation terms/pathways are enriched in S?

– How are the entities in S and the annotation terms related?

– Are there any pertinent literature references?

– Are there any entities not in S that have relationships with multiple entities in S?

– How does S compare to the set published by Soandso et al.?

– What changes when entities are added to or removed from S? (e.g. when the p-value cutoff is changed)
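
One of the questions above, finding entities outside S that are related to several entities in S, can be sketched in a few lines. This is only an illustration under assumptions: the function name `third_party_genes`, the `min_links` threshold, and the toy edge list are hypothetical and are not taken from EGAN.

```python
from collections import defaultdict

def third_party_genes(edges, S, min_links=2):
    """Return genes outside S that are linked to at least min_links genes in S."""
    S = set(S)
    links = defaultdict(set)  # candidate gene -> members of S it is linked to
    for a, b in edges:
        if a in S and b not in S:
            links[b].add(a)
        elif b in S and a not in S:
            links[a].add(b)
    return {g: sorted(m) for g, m in links.items() if len(m) >= min_links}

# Toy, made-up gene-gene edges (not from EGAN's network)
edges = [("A", "X"), ("B", "X"), ("C", "Y"), ("A", "B"), ("X", "Y")]
print(third_party_genes(edges, {"A", "B", "C"}))
```

Here X is reported because it touches two members of S, while Y touches only one and is left out.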


EGAN lets the biologist investigate results quickly and independently

• Point-and-click interface
– Buttons
– Context-specific pop-up menus
– Spreadsheet-like data tables
– Graph visualization

• All network information is pre-collated
– No programming/scripting
– No data transfer/download steps

• Automated gene-level integration of multiple experiment results


How are computational analysis results commonly presented to the biologist?


• Gene lists
– Show gene annotation (but too much at once)
– Do not show gene-gene relationships

• Enriched annotation lists
– Do not identify the genes annotated with each term
– Do not show which genes share annotation terms

• Gene graphs
– Show gene-gene relationships
– Do not adequately show annotation


Gene lists



Reducing information by significance cutoff


Reducing information by taking away genes

• Prevents the user from wasting time investigating actual negatives

• But what about genes that just missed a stringent cutoff?
– These genes are likely to have some importance
– Biologists are often given the impression that genes that fail to pass the cutoff are negatives

• Valuable information is lost by only focusing on a “significant” set
– See Gene Set Enrichment Analysis (GSEA), Subramanian (2005)


Enriched annotation lists


What is enrichment?

• Annotation terms/pathways define sets of genes

• Enrichment
– Overrepresentation

• Set-based enrichment
– Given a significant set, S, of genes (or a cluster)
– Use the hypergeometric distribution to test the overlap between each gene set, T, and S

• Global empirical enrichment
– Use generated statistics for each gene in the assay
– Summarize the statistics for all genes in each set, T
– Test to see if the statistics show a non-random trend
– GSEA
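
The set-based test can be illustrated with a short sketch. It assumes the usual one-sided formulation (the probability of seeing at least the observed overlap by chance); the function and parameter names are hypothetical, and the slides do not say which tail or background EGAN uses.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(overlap >= k) when n genes (the set S) are drawn without replacement
    from N background genes, K of which belong to gene set T."""
    upper = min(K, n)
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / total

# Made-up numbers: 20 background genes, |T| = 5, |S| = 4, overlap = 2
print(hypergeom_enrichment_p(20, 5, 4, 2))
```

A small value suggests that T is overrepresented in S relative to the background.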


Enriched annotation lists


Gene graphs


Canonical pathway maps

• Start with fixed pathway graph
• Color the gene nodes by empirical values (only significant genes?)
• Enriched annotation terms not shown
• Most useful when
– This pathway is expected to be affected in experiment
– Little interest in other pathways/unassigned genes
– Most genes in pathway graph have significant empirical data values
– These conditions are rare in exploratory experiments

GenMAPP, Dahlquist (2002)


Association enrichment graphs

• Calculate enrichment of terms

• Nodes are annotation terms

• Edges are ontological relationships

• Color represents enrichment score

• What about other annotation types?

• Which genes are implicated?

BiNGO, Maere (2005)


Custom gene set graphs

• Start with significant set of genes or cluster

• Show gene-gene relationships as edges

• How is gene annotation shown?

– Hypergraphs

Ingenuity IPA, www.ingenuity.com

PubGene, Jensen (2001)


Hypergraphs

• A graph is a collection of nodes and edges

• A hypergraph is a graph with hyperedges

• A hyperedge is a set of nodes
– Annotation terms and pathways are hyperedges

• Choice of hypergraph visualization method (HVM) is critical as the number of nodes and hyperedges scales upwards
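
The definitions above map naturally onto a dictionary of sets. The following is a minimal sketch with made-up term and gene names. It is not EGAN's internal data model, which the slides do not describe.

```python
# Hypothetical toy data: each hyperedge (annotation term) is a set of gene nodes.
hyperedges = {
    "term_1": {"GENE_A", "GENE_B", "GENE_C"},
    "term_2": {"GENE_B", "GENE_C", "GENE_D"},
    "term_3": {"GENE_E"},
}

def terms_for_gene(hyperedges, gene):
    """Return the hyperedges (annotation terms) that contain a gene."""
    return sorted(t for t, members in hyperedges.items() if gene in members)

def shared_genes(hyperedges, term_a, term_b):
    """Return the genes that two hyperedges have in common."""
    return sorted(hyperedges[term_a] & hyperedges[term_b])

print(terms_for_gene(hyperedges, "GENE_B"))
print(shared_genes(hyperedges, "term_1", "term_2"))
```

Unlike an ordinary edge, a hyperedge here can hold any number of nodes, including just one.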


Hypergraph visualization methods


HVM: Venn diagram

• Draw a curve around nodes in a set

• Shows hyperedge overlap effectively

• Limited to 3 hyperedges

• No legend required


HVM: Clique

• Use edges to fully connect all nodes in a set

• Scales poorly

– For a hyperedge with n nodes, 0.5n² - 0.5n = n(n - 1)/2 edges must be used

• Layout algorithms use additional edges

• Legend required
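
The quadratic growth of the clique method is easy to check numerically. This sketch expands a single hyperedge into pairwise edges; the helper name `clique_edges` and the generated node labels are illustrative only.

```python
from itertools import combinations

def clique_edges(members):
    """Expand one hyperedge into the edges of a fully connected clique."""
    return list(combinations(sorted(members), 2))

for n in (3, 10, 100):
    members = [f"g{i}" for i in range(n)]
    # Edge count matches n(n - 1)/2
    print(n, len(clique_edges(members)))
```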


HVM: Node-coloring

• Give all nodes in a set the same color or shape (Ingenuity uses shapes)

• Scales poorly
– Nodes associated with multiple hyperedges must be divided

– Hyperedge count limited to number of distinguishable colors

• Layout algorithms do not use hyperedges

• Shows hyperedge overlap poorly

• Legend required


HVM: Association node

• Hyperedges as association nodes on the graph
– Connect each association node to its node members
– Incomplete, semi-bipartite graph
– Association nodes given different shapes/colors

• Scales well
– For a hyperedge with n nodes, 1 node and n edges must be used

• Extra association nodes/edges complicate dense graphs
– Exploratory assay gene graphs are sparse

• Layout algorithms use hyperedges

• No legend required
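
As a rough illustration of the association node method, the sketch below turns each hyperedge into one extra node plus one edge per member. The tuple labels `("term", ...)` and `("gene", ...)` and the function name are assumptions made for this example, not EGAN's representation.

```python
def association_node_graph(hyperedges):
    """Represent each hyperedge as one extra node linked to each of its members."""
    nodes = set()
    edges = []
    for term, members in hyperedges.items():
        nodes.add(("term", term))
        for gene in sorted(members):
            nodes.add(("gene", gene))
            edges.append((("term", term), ("gene", gene)))
    return nodes, edges

# One made-up hyperedge with 10 member genes
hyperedges = {"term_1": {f"g{i}" for i in range(10)}}
nodes, edges = association_node_graph(hyperedges)
print(len(nodes), len(edges))  # 1 association node + 10 genes, and 10 edges
```

For the same 10-member hyperedge, a clique would need 45 edges, which is where the scaling advantage comes from.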


HVM comparison


EGAN


EGAN features

• Entire pre-collated hypergraph is available in memory
– Mostly defined by NCBI Entrez Gene
– Allows dynamic selection of genes and gene sets

• Useful interface tools for finding genes and terms/pathways of interest
– Advanced queries using mouse clicks
– Spreadsheet-like tables
– Selective addition and removal of information

• Association node HVM
– Thought-provoking display of genes and annotation

• Node and edge references
– Nodes link to NCBI/UCSC/AmiGO/KEGG/etc.
– Edges can link to PubMed


Mockup from 12/2007


EGAN as of 10/2009


Data in the default human gene association network as of 06/08/2009

Node Type | Source | # Nodes | # Edges | Node Links | Edge Links
Gene | NCBI Entrez Gene | 40556 | 0 | Entrez Gene, UCSC | N/A
MeSH | NCBI PubMed | 16204 | 1113983 | MeSH | PubMed ID
Conserved Domain | NCBI Conserved Domain Database | 17168 | 295287 | CDD | None
Gene Ontology Process | NCBI Entrez Gene | 6779 | 211391 | AmiGO | PubMed ID
MIM | NCBI Entrez Gene | 3951 | 5082 | OMIM | None
Gene Ontology Function | NCBI Entrez Gene | 3114 | 68240 | AmiGO | PubMed ID
Cytoband | NCBI Entrez Gene | 987 | 67422 | None | None
Gene Ontology Component | NCBI Entrez Gene | 937 | 41040 | AmiGO | PubMed ID
KEGG | NCBI Entrez Gene | 195 | 8017 | KEGG | None
NHGRI GWA Catalog | NCBI Entrez Gene | 214 | 1271 | PubMed | None
Reactome | NCBI Entrez Gene | 49 | 3594 | Reactome | None
PubMed Co-occurrence | NCBI Entrez Gene | 0 | 118596 | N/A | PubMed ID
Chromosomal Sequence | NCBI Entrez Gene | 0 | 42468 | N/A | PubMed ID
BioGRID | NCBI Entrez Gene | 0 | 24401 | N/A | PubMed ID
IntAct | EBI IntAct | 0 | 22229 | N/A | None
HPRD | NCBI Entrez Gene | 0 | 17380 | N/A | PubMed ID
MINT | MINT | 0 | 11903 | N/A | None
BIND | NCBI Entrez Gene | 0 | 3879 | N/A | PubMed ID
Total | | 90154 | 2056183 | |


The data is fully customizable

• The pre-collated network
– Stored as flat, tab-delimited text
– Users can specify alternative/supplemental data files

• Updates are easily pushed to the end users
– Using Java WebStart
– Compressed in .jar files (.zip)

• Additional gene sets are already available at MSigDB
– Broad Institute, non-redistributable
– EGAN loads gene sets in .gmt and .gmx file formats
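
For readers unfamiliar with .gmt files, the sketch below reads the commonly documented layout: one gene set per line, tab-delimited, with the set name, a description, and then the member genes. It is an illustration, not EGAN's loader; the function name `parse_gmt` and the example sets are made up.

```python
import io

def parse_gmt(handle):
    """Parse GMT text: name<TAB>description<TAB>gene<TAB>gene... per line."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets

# Made-up two-line GMT content for demonstration
example = io.StringIO(
    "SET_ONE\tmade-up description\tGENE_A\tGENE_B\n"
    "SET_TWO\tna\tGENE_B\tGENE_C\tGENE_D\n"
)
gene_sets = parse_gmt(example)
print(sorted(gene_sets))
```

Each parsed set is a hyperedge in the sense used earlier in the talk: a named set of gene nodes.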


Using EGAN: The simple case


Three EGAN use cases

• 1) Characterize a gene using protein interaction neighbors

• 2) Characterize a pre-collated gene set

• 3) Characterize a gene set defined by experiment results


Characterize a gene using protein interaction neighbors

• Find gene PPARG in the Entrez Gene Node Table

• Show PPARG and all gene neighbors

• Hide protein-protein interaction edges

• Calculate enrichment for all gene sets

• Use enrichment statistics to selectively show association nodes on the graph


PPARG and all protein interaction neighbors


Characterize a pre-collated gene set

• Find the conserved domain DDHD in the Conserved Domain Node Table

• Show DDHD and all gene neighbors

• Hide DDHD association node

• Calculate enrichment for all gene sets

• Use enrichment statistics to selectively show association nodes on the graph


Genes with the DDHD domain


Characterize a gene set from empirical data

Genes reported by Beier et al. (2007)

• Format custom gene sets
• Format empirical data (after computational analysis)
• Load custom gene set file and empirical file in EGAN
• Find custom gene sets in Custom Node Node Table
• Show custom sets and all gene neighbors
– Border color shows statistic
– Border width shows p-value
• Hide custom set association nodes
• Calculate enrichment for all gene sets
• Use enrichment statistics to selectively show association nodes on the graph


Gene sets from Beier et al. (2007)


Additional functionality in EGAN

• Comparison of multiple experiments/gene sets
– Different normalization methods
– Different analysis parameters
– Different platforms
– Published experiments/gene sets

• Discovery of third-party genes not present in S

• Characterization of sequence-derived gene sets
– Transcription regulation motifs
– Translation regulation motifs
– Clusters

• Scripting for automatic network generation


Future plans

• More diverse, more complete, higher quality data
– Species beyond H. sapiens
– Activation/inhibition/modification relationships

• Examples with non-microarray empirical data
– SNP, aCGH, MS/MS

• Quantitative analysis of the hypergraph

• Mapping of samples into gene set space

• Restriction of edges by quality parameters

• Cytoscape 3.0 plug-in?

• Improved graph layout algorithms


Where to get EGAN

• http://akt.ucsf.edu/EGAN/
– Downloads

• http://groups.google.com/group/ucsf-egan/
– Documentation
– Discussion forum

• The EGAN manuscript is currently under review at Bioinformatics


Acknowledgements

• UCSF HDFCCC BCB
– Taku Tokuyasu
– Adam Olshen
– Ajay Jain

• Use of Cytoscape libraries
– David Quigley
– Scooter Morris
– Alex Pico
– Alan Kuchinsky

• Testing
– Donna Albertson
– Antoine Snijders
– Ingrid Revet
– Stephan Gysin
– Ritu Roydasgupta
– Sook Wah Yee
– Scot Federman
– Mike Baldwin

• Interpretation of GBM stem cell experiments
– Joachim Silber

• Figure editing
– Ben Kopman


Methods


Example custom gene set file format


Example empirical file format


Mapping empirical data to genes

• Exploratory assays don’t directly measure genes

• Entities may map to multiple genes
– EGAN adds the entity statistic/p-value to all genes

• Multiple entities may map to a single gene
– EGAN generates summary statistics/p-values
    • Statistic median (default)
    • P-value median
    • Maximum/minimum |statistic|
    • Minimum/maximum p-value
    • Statistic/p-value mean

• Entity-to-gene mapping is customizable
– Tab-based text format
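
The mapping rules above can be sketched as follows, using the median as the default summary. The probe and gene identifiers, the function name `summarize_by_gene`, and the choice to silently drop unmapped entities are assumptions made for illustration; the slides do not state how EGAN handles unmapped entities.

```python
from collections import defaultdict
from statistics import median

def summarize_by_gene(entity_stats, entity_to_genes, summary=median):
    """Collapse entity-level statistics to one value per gene."""
    per_gene = defaultdict(list)
    for entity, stat in entity_stats.items():
        # An entity's statistic is added to every gene it maps to
        for gene in entity_to_genes.get(entity, ()):
            per_gene[gene].append(stat)
    return {gene: summary(values) for gene, values in per_gene.items()}

# Made-up example: three probes map to G1, one of which also maps to G2
entity_stats = {"probe1": 2.0, "probe2": -1.0, "probe3": 4.0, "probe4": 0.5}
entity_to_genes = {"probe1": ["G1"], "probe2": ["G1"], "probe3": ["G1", "G2"]}
print(summarize_by_gene(entity_stats, entity_to_genes))
```

Passing a different `summary` callable, for example `lambda v: max(v, key=abs)`, gives the maximum |statistic| variant from the list above.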


Set-based enrichment

Given a set of genes made visible on graph


Global empirical enrichment

• Set Enrichment by Empirical Data (SEED)

• ParaSEED
– Take statistic for each gene in a set S
– Calculate summary statistics (s-mean, standard deviation, n)
– Two-tailed t-test
    • Probability that S is drawn randomly from a normal distribution centered on 0
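
A sketch of the ParaSEED summary step, assuming it corresponds to a standard one-sample t-test against a mean of 0. The code stops at the t statistic and degrees of freedom so that it needs only the standard library; the function name `one_sample_t` and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(stats, mu=0.0):
    """Return (t, degrees of freedom) for testing whether mean(stats) differs from mu."""
    n = len(stats)
    if n < 2:
        raise ValueError("need at least two gene statistics")
    s_mean = mean(stats)
    sd = stdev(stats)  # sample standard deviation (n - 1 denominator)
    t = (s_mean - mu) / (sd / sqrt(n))
    return t, n - 1

# Made-up per-gene statistics for one gene set
t, df = one_sample_t([1.0, 2.0, 3.0, 4.0, 5.0])
print(t, df)
```

The two-tailed p-value is the probability of a value at least as extreme as |t| under Student's t distribution with n - 1 degrees of freedom; with SciPy available, `2 * scipy.stats.t.sf(abs(t), df)` would compute it.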


Global empirical enrichment

• PermuSEED
– Take statistic for each gene in a set S
– Calculate summary statistics (s-mean, n)
– Randomly sample n genes from background p times
– Score is the fraction of sample means that were lower than s-mean
    • Score of 0.001 (p = 1000) means 1 of the 1000 random sample means was lower than s-mean
    • Score of 0.999 (p = 1000) means 999 of the 1000 random sample means were lower than s-mean

• PermuSEED absolute
– Use |statistic| for each gene in S
– Pathway gene sets are likely to have activators and inhibitors
– PermuSEED absolute finds gene sets that are strongly affected
– Parametric version might use variance
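
The PermuSEED steps can be sketched as below. Several details are assumptions, because the slides do not specify them: sampling is done without replacement, the background is the full list of gene statistics (including the genes in S), the absolute variant also takes |statistic| for the background, and a fixed seed is used only to make the example repeatable. The function name `permuseed_score` is hypothetical.

```python
import random
from statistics import mean

def permuseed_score(set_stats, background_stats, p=1000, absolute=False, seed=0):
    """Fraction of p random sample means that are lower than the set mean."""
    if absolute:
        set_stats = [abs(x) for x in set_stats]
        background_stats = [abs(x) for x in background_stats]
    n = len(set_stats)
    s_mean = mean(set_stats)
    rng = random.Random(seed)
    lower = sum(
        mean(rng.sample(background_stats, n)) < s_mean for _ in range(p)
    )
    return lower / p

# Made-up background of 100 gene statistics centered near 0
background = [float(x) for x in range(-50, 50)]
print(permuseed_score([40.0, 45.0, 49.0], background))
print(permuseed_score([-49.0, 49.0], background, absolute=True))
```

The second call shows why the absolute variant exists: a set with one strongly negative and one strongly positive gene has a mean near 0, but its mean |statistic| is high.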


Multiple testing adjustment

• Set-based enrichment
– Can’t use q-value due to non-uniform distribution of p-values
– Optional permutation-based minP method
    • Westfall & Young (1993)
    • When specifically requested by user

• Global empirical enrichment (SEED)
– q-value
    • Automatically generated
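
To give a feel for the minP idea, here is a sketch of the single-step variant: each raw p-value is compared with the smallest p-value seen across all gene sets in each permutation. Whether EGAN uses the single-step or a step-down form, and how it generates the permutations, is not stated in the slides; the function name `minp_adjust` and the numbers are made up.

```python
def minp_adjust(observed_p, permuted_p):
    """Single-step minP adjusted p-values.

    observed_p: raw p-values, one per gene set.
    permuted_p: one list per permutation, holding the p-values
                recomputed for all gene sets under that permutation.
    """
    min_ps = [min(row) for row in permuted_p]
    b = len(min_ps)
    # Adjusted p = fraction of permutations whose minimum p-value is at least as small
    return [sum(m <= p for m in min_ps) / b for p in observed_p]

# Made-up values: 2 gene sets, 4 permutations
observed = [0.01, 0.2]
permuted = [[0.5, 0.3], [0.02, 0.9], [0.005, 0.4], [0.6, 0.25]]
print(minp_adjust(observed, permuted))
```

Because the comparison uses the permutation distribution of the minimum, the adjustment does not rely on the raw p-values being uniformly distributed.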