data mining: functional statistical genetics


Page 1: Data Mining: Functional Statistical Genetics

Section on Statistical Genetics
Department of Biostatistics

Laura Kelly Vaughan, Ph.D.
Assistant Professor
Section on Statistical Genetics
[email protected]

Data Mining: Functional Statistical Genetics & Bioinformatics

Page 2: Data Mining: Functional Statistical Genetics

NCBI (National Center for Biotechnology Information)

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.

http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

Page 3: Data Mining: Functional Statistical Genetics

Integrative Data Analysis

Genetic studies tend to focus on one data source
  Genetic variation
  RNA levels
  Blood biochemistry

This fails to utilize the information contained in the connections among these variables…

Page 4: Data Mining: Functional Statistical Genetics

Central Dogma of Molecular Biology

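[Figure: DNA → RNA → Protein → Phenotype. Process labels: replication, transcription (TXN), translation (TSN), post-translational modification (PTM). Field labels: genetics, structural genomics, functional genomics (transcriptomics), proteomics, metabolomics, phenomics.]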

Page 5: Data Mining: Functional Statistical Genetics

Different sources of annotation data

Gene Ontology
Pathways/Networks
Protein/protein interactions
Literature
Functional annotations
Expression
Cross species
Cellular localization
Methylation
ChIP
Sequence similarity
Promoter & Regulatory Network
Protein domains

Page 6: Data Mining: Functional Statistical Genetics

Gene Ontology

www.geneontology.org

The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in a species-independent manner.
  Biological processes: series of events accomplished by one or more ordered assemblies of molecular functions
  Cellular components: parts of the cell
  Molecular functions: activities, such as catalytic or binding activities, that occur at the molecular level

Page 7: Data Mining: Functional Statistical Genetics

http://www.yeastgenome.org/help/images/cytokinesisDAGrels.jpg

Example of a GO annotation
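
GO terms are arranged as a directed acyclic graph, so a gene annotated to a specific term is implicitly annotated to every more general term above it. The sketch below illustrates that propagation on a toy graph. The term names, the parent links, and the gene names (`geneA`, `geneB`) are illustrative assumptions, not the real GO structure or real annotations.

```python
# Toy ontology: each term maps to its direct parents. It is a DAG, so a term
# may have more than one parent. Names and edges are illustrative only and
# are NOT taken from the real Gene Ontology.
PARENTS = {
    "cytokinesis": ["cell division", "cell cycle process"],
    "cell division": ["cellular process"],
    "cell cycle process": ["cellular process"],
    "cellular process": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand direct gene -> term annotations to include all ancestor terms."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


# Hypothetical direct annotations for two made-up genes.
direct = {"geneA": {"cytokinesis"}, "geneB": {"cell division"}}
expanded = propagate(direct)
```

Propagating annotations upward before testing is one reason general terms tend to collect many genes while specific terms collect few.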

Page 8: Data Mining: Functional Statistical Genetics

What is a Pathway?

Physical and functional interactions between genes and gene products
  Metabolic pathways
  Kinase-based signaling cascades
  Transcriptional signaling pathways

Page 9: Data Mining: Functional Statistical Genetics

[Figure: TNF Signaling pathway diagram. Components shown: TNF, TNFR1, TNFR2, TRADD, FADD, RAIDD, RIP, TRAF2, TRAF3, I-TRAF, cIAP, MADD, SODD, Caspases 1, 2, 3, 6, 7, 8 and 9, BID, tBID, CytoC, APAF1, NIK, IKKs, IκBs (IκB degradation), NF-κB, MEKKs, ERKs, p38, JNKK1, JNK1, TAK1, ceramides, ATFs, Elk1, c-Jun, c-Fos. Outcomes shown: apoptosis; gene expression and cell survival. © 2007-2009 SABiosciences.com]

Page 10: Data Mining: Functional Statistical Genetics

What is a Network?

Graphical representation of relationships between genes, gene products, or other objects

Formed with information such as
  Genes in interacting pathways
  Gene products that share protein-protein interactions
  Gene products with protein-nucleotide relationships
  Regulatory relationships
  Metabolic interactions
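
As a minimal sketch of how such relationships become a network, the snippet below stores typed pairwise relationships as an undirected adjacency map and counts each node's connections. The edge list is a small illustrative example chosen for this sketch, and `build_network` is a hypothetical helper name.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (node_a, node_b, kind) triples."""
    adjacency = defaultdict(set)
    for node_a, node_b, _kind in edges:
        adjacency[node_a].add(node_b)
        adjacency[node_b].add(node_a)
    return dict(adjacency)


# Illustrative edges only; the relationship type is kept alongside each pair
# so edges from different information sources can be told apart later.
edges = [
    ("TNF", "TNFR1", "protein-protein"),
    ("TNFR1", "TRADD", "protein-protein"),
    ("TRADD", "TRAF2", "protein-protein"),
    ("TRADD", "FADD", "protein-protein"),
]

network = build_network(edges)
# Degree = number of distinct neighbours; highly connected nodes are "hubs".
degree = {node: len(neighbours) for node, neighbours in network.items()}
hub = max(degree, key=degree.get)
```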

Page 11: Data Mining: Functional Statistical Genetics

Metabolic Disease Network

Lee D. et al. PNAS 2008;105:9880-9885. ©2008 by National Academy of Sciences

Page 12: Data Mining: Functional Statistical Genetics

Analysis tools

Numerous methods have been developed to aid in the interpretation of biological experiments

Two basic categories
  Pre-analysis methods, where the raw data are grouped together and the groups are tested
    Dimension reduction
  Post-analysis methods, where significant or interesting results are grouped together to identify trends

Page 13: Data Mining: Functional Statistical Genetics

Before you start…

There are many methods available for integrative data analysis

Before you choose one, you must properly define the questions you are trying to answer… What is your hypothesis?

Page 14: Data Mining: Functional Statistical Genetics

DBA ~10 mins

Page 15: Data Mining: Functional Statistical Genetics

Methods

Unsupervised, or data-based, methods
  Utilize all the data to identify trends
  Hypothesis generating

Supervised, or prior-information-based, methods
  Require the user to provide a ‘training set’ of genes
  Hypothesis testing

Page 16: Data Mining: Functional Statistical Genetics

Gene Set Analysis

A test statistic is calculated to measure the deviation of gene-set expression measurements from the null hypothesis of no association with the phenotype

The statistical significance (P-value) for each gene set is calculated based on permutation of samples
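
The two steps above can be sketched as follows. This is a minimal illustration, not any specific published method: the set statistic (mean absolute difference in group means) is an assumption chosen for simplicity, and the expression values, gene names, and the function names `set_statistic` and `permutation_p_value` are hypothetical.

```python
import random


def set_statistic(expr, labels, gene_set):
    """Mean absolute difference in group means across the genes in the set.

    This particular statistic is an assumption made for the sketch.
    """
    total = 0.0
    for gene in gene_set:
        values = expr[gene]
        case = [v for v, lab in zip(values, labels) if lab == 1]
        ctrl = [v for v, lab in zip(values, labels) if lab == 0]
        total += abs(sum(case) / len(case) - sum(ctrl) / len(ctrl))
    return total / len(gene_set)


def permutation_p_value(expr, labels, gene_set, n_perm=1000, seed=0):
    """Permute sample labels to build the null distribution of the statistic."""
    rng = random.Random(seed)
    observed = set_statistic(expr, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_statistic(expr, shuffled, gene_set) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up expression values: 4 case samples followed by 4 control samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "g1": [5, 6, 5, 6, 1, 2, 1, 2],
    "g2": [4, 5, 4, 5, 2, 1, 2, 1],
    "g3": [3, 1, 2, 3, 2, 3, 1, 2],
}
observed, p_value = permutation_p_value(expr, labels, ["g1", "g2"])
```

Permuting sample labels rather than genes keeps the correlation between genes in the set intact, which is the usual argument for sample permutation.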

Page 17: Data Mining: Functional Statistical Genetics

Types of enrichment methods

Class 1: Singular enrichment (SEA)
  P-value calculated on each term from a pre-selected list, and enriched terms are listed

Class 2: Gene set enrichment (GSEA)
  All genes (without pre-selection) are included
  No need to select a list
  Experimental values integrated into P-value calculations
  Pairwise comparisons (e.g., disease vs. control)
  Most appropriate for expression data

Class 3: Modular enrichment (MEA)
  Predetermined list, with term-term or gene-gene relationships included in the enrichment P-value calculation
  Closest to the nature of biological data structure
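
For Class 1 (SEA), the per-term P-value is commonly obtained from a hypergeometric (one-sided Fisher's exact) test. The slide does not name the test, so the choice here is an assumption; the counts in the example are made up, and `hypergeom_enrichment_p` is a hypothetical helper name.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_term, n_list, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_universe: genes in the background
    n_term:     background genes annotated with the term
    n_list:     genes in the pre-selected list
    n_overlap:  list genes annotated with the term
    """
    upper = min(n_term, n_list)
    total = comb(n_universe, n_list)
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20 background genes, 5 carry the term,
# 5 genes were selected, and 3 of those carry the term.
p_term = hypergeom_enrichment_p(20, 5, 5, 3)
```

In practice one such P-value is computed per annotation term, so the results then need a multiple-testing adjustment.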

Page 18: Data Mining: Functional Statistical Genetics

DAVID

Provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes

Extensive annotation database
  Includes both pathways and GO
SEA and MEA algorithms
Visualization tools

http://david.abcc.ncifcrf.gov/

Page 19: Data Mining: Functional Statistical Genetics

DAVID and LVH gene expression

GO clustering of significant genes between different mouse treatment groups

Stansfield et al 2009 Cardiopulmonary Support and Physiology

Page 20: Data Mining: Functional Statistical Genetics

Babelomics Suite

Suite of web tools for the functional profiling of genome-scale experiments
Multiple annotation sources
  Pathways, GO, regulation, text mining, interactions
Allows for functional enrichment
Several gene set methods
  Mostly SEA methods

http://babelomics.bioinfo.cipf.es/

Page 21: Data Mining: Functional Statistical Genetics

Babelomics and thyroid carcinoma

Montero-Conde et al 2008 Oncogene

Identified 1031 genes with differential expression
Enriched pathways included
  MAP kinase
  TGF-β
  Focal adhesion
  Cell motility
  Activation of actin polymerization
  Cell cycle

Identified 30 genes that predict prognosis with 95% accuracy

Page 22: Data Mining: Functional Statistical Genetics

GSEA

Computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

http://www.broad.mit.edu/gsea/

Page 23: Data Mining: Functional Statistical Genetics

GSEA: Steps in the Methodology

Define a gene set from prior knowledge
Order the genes by correlation with phenotype
Estimate the gene set’s enrichment score
Assess statistical significance using permutation tests
Adjust for multiple hypothesis testing

Subramanian et al., PNAS, 2005
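
The enrichment-score step can be sketched as a weighted running sum over the ranked gene list: the sum rises at each gene in the set (in proportion to its correlation raised to a power `p`) and falls at each gene outside it, and the score is the largest deviation from zero. The code below is a simplified illustration of that step only; permutation testing and multiple-testing adjustment are not shown. The gene names and correlation values are made up, and `enrichment_score` is a hypothetical function name.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score over a ranked gene list.

    ranked_genes: genes ordered by decreasing correlation with the phenotype
    scores:       gene -> correlation with the phenotype
    gene_set:     the a priori defined set of genes
    p:            weighting exponent for hits (p=0 gives an unweighted walk)
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_norm = sum(
        abs(scores[g]) ** p for g, hit in zip(ranked_genes, in_set) if hit
    )
    if hit_norm == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene, hit in zip(ranked_genes, in_set):
        if hit:
            running += abs(scores[gene]) ** p / hit_norm
        else:
            running -= 1.0 / n_miss
        # Keep the signed value with the largest deviation from zero.
        if abs(running) > abs(best):
            best = running
    return best


# Made-up correlations for six hypothetical genes.
scores = {"g1": 2.0, "g2": 1.5, "g3": 1.0, "g4": -0.5, "g5": -1.0, "g6": -2.0}
ranked = sorted(scores, key=scores.get, reverse=True)

es_top = enrichment_score(ranked, scores, {"g1", "g2"})
es_bottom = enrichment_score(ranked, scores, {"g5", "g6"})
```

A set concentrated at the top of the ranking gives a positive score and one concentrated at the bottom gives a negative score, which is how the method captures the direction of the concordant difference.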

Page 24: Data Mining: Functional Statistical Genetics

Biological pathways involved in chemotherapy response in breast cancer

Tordai et al 2008 Breast Cancer Research

GSEA for ER+ breast cancer tumors: chemotherapy responders vs. non-responders

Of >850 gene sets, 4 were significant

Page 25: Data Mining: Functional Statistical Genetics

Significance Analysis of Function and Expression (SAFE)

Generalization and extension of the GSEA method

Two-stage, permutation-based approach to assess significant changes in gene expression across experimental conditions
  First computes gene-specific local statistics to test for association between gene expression and the phenotype
  Gene-specific statistics are then used to estimate a global statistic that detects shifts in the local statistics within a gene category

The significance of the global statistic is assessed by repeatedly permuting the response values.

SAFE implements rank-based global statistics that enable better use of marginally significant genes than statistics based on a p-value cutoff.

http://www.bioconductor.org/packages/bioc/1.6/src/contrib/html/safe.html

Page 26: Data Mining: Functional Statistical Genetics

Dietary resveratrol and aging in mice

SAFE analysis based on GO annotations

Overlap of classes significantly affected by caloric restriction and by low-dose resveratrol

Barger et al 2008 PLoS One

Page 27: Data Mining: Functional Statistical Genetics

Supervised Analysis: Endeavour

Web-based prioritization of candidate genes

Infers models for the training set in each data source

Applies each model to the candidate genes to rank them against the profiles of the training set

Merges rankings from each data source to give global ranking of genes
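
The final merging step can be illustrated with a simple mean-rank fusion. This technique is swapped in here for simplicity and is not claimed to be Endeavour's actual fusion method. The data source names, candidate gene names, similarity scores, and the helpers `rank_candidates` and `merge_rankings` are all hypothetical.

```python
def rank_candidates(similarity):
    """Rank genes from best (1) to worst by descending similarity score."""
    ordered = sorted(similarity, key=lambda gene: (-similarity[gene], gene))
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def merge_rankings(per_source_scores):
    """Fuse per-source rankings into one global order by mean rank.

    Mean rank is a simplifying assumption of this sketch.
    """
    per_source_ranks = [rank_candidates(s) for s in per_source_scores.values()]
    # Only genes scored in every source are merged in this simple version.
    genes = set.intersection(*(set(ranks) for ranks in per_source_ranks))
    mean_rank = {
        gene: sum(ranks[gene] for ranks in per_source_ranks) / len(per_source_ranks)
        for gene in genes
    }
    return sorted(genes, key=lambda gene: (mean_rank[gene], gene))


# Made-up similarity of each candidate to the training-set model, per source.
per_source_scores = {
    "expression": {"c1": 0.9, "c2": 0.4, "c3": 0.1},
    "gene_ontology": {"c1": 0.7, "c2": 0.8, "c3": 0.2},
    "interactions": {"c1": 0.6, "c2": 0.5, "c3": 0.3},
}
global_ranking = merge_rankings(per_source_scores)
```

Working with ranks rather than raw scores lets sources with very different score scales contribute equally to the global ranking.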

http://homes.esat.kuleuven.be/~bioiuser/endeavour/endeavour.php

Page 28: Data Mining: Functional Statistical Genetics


Tranchevent, L.-C. et al. Nucl. Acids Res. 2008 36:W377-W384; doi:10.1093/nar/gkn325

ENDEAVOUR: the algorithm behind the wizard

Page 29: Data Mining: Functional Statistical Genetics

Genetic disorder prioritization using Endeavour

Page 30: Data Mining: Functional Statistical Genetics

Network Analysis

Dynamic representation of cellular processes through the incorporation of annotation and experimental data
  Structures are not fixed and change with context
Many methods available…

Suderman & Hallett 2007 Bioinformatics

Page 31: Data Mining: Functional Statistical Genetics

Ingenuity IPA

Page 32: Data Mining: Functional Statistical Genetics

Pathway Analysis of WTCCC Type 2 Hypertension GWAs

No single SNP was significant at the genome wide level

High degree of relationship between pathways suggests multiple related mechanisms
  Large number of low-penetrance risk alleles

Pathway analysis with MetaCore

Torkamani et al. 2008 Genomics

Page 33: Data Mining: Functional Statistical Genetics

English, S. B. et al. Bioinformatics 2007 23:2910-2917; doi:10.1093/bioinformatics/btm483

The next step: Translational Science

Integration of 49 genome-wide experiments for the prediction of previously unknown obesity-related genes
  Greatly outperforms individual experiments

Page 34: Data Mining: Functional Statistical Genetics

References

Song & Black 2008. BMC Bioinformatics 9:502
Huang et al 2009. NAR 37(1):1-13
Chen et al 2008. Nature 452(27):429-435
Dinu et al 2007. Journal of Biomedical Info 40:75-760
Al-Shahrour et al. NAR 36:W341-346
Barry et al 2005. Bioinformatics 21(9):1943-1949
Huang et al. Nature Protocols 4(1):44-57
Tranchevent et al 2008. NAR 36:W377-384
Mehta et al 2006. Physiol Genomics 28:24-32
Suderman & Hallett. Bioinformatics 23(20):2651-2659
Dinu et al 2008. Briefings in Bioinformatics
Curtis et al 2005. Trends in Biotech 23(8)
Price and Shmulevich 2007. Current Op in Biotech 18:365-370
Zhang et al 2008. BMC Systems Bio 2:5
Werner 2008. Current Op in Biotech 19:50-54
Liu et al 2007. BMC Bioinformatics 8:431
Goeman & Bühlmann 2007. Bioinformatics 23(8):980-987
Rivals et al 2007. Bioinformatics 23(4):401-407
Nam & Kim 2008. Briefings in Bioinformatics 9(3):89-97
