
Page 1: GeneExpressionDataAnalysis - University of Georgiacsbl.bmb.uga.edu/mirrors/JLU/BioinformaticsCourse/... · 2017. 12. 11. · Bioinformatics • This interdisciplinary science …

Gene Expression Data Analysis

Qin Ma, Ph.D.

December 10, 2017


Bioinformatics

• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.

– Temple Smith, Current Topics in Computational Molecular Biology (2002)

Bioinformatics

Omics data

• Genomics
• Transcriptomics
• Metabolomics
• Metagenomics
• Epigenomics
• Proteomics
• Interactomics
• …

Systems biology


Characteristics of Biological Big Data

Big Small Data vs. Small Big Data

• 36.8 million transactions per day on Amazon
• Next Generation Sequencing Data
• Biomedical Data (behavioral outcomes in observational study)


The Hierarchical Structure of Computational Techniques

Models

Algorithms

Programs, Tools, Software


Central Dogma

• DNA → RNA → Protein

Intro to gene expression (central dogma). (n.d.). Retrieved November 05, 2017, from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/central-dogma-transcription/a/intro-to-gene-expression-central-dogma


Information derivable from gene expression data

• Inference: genes with similar expression patterns might be functionally related, e.g., working in the same pathway or co-regulated (co-expression -> co-regulation)
• Inference: genes x, y are highly expressed under conditions W while genes a, b are not expressed
• Inference: gene X is significantly more highly expressed in diseased cells than in normal cells; hence gene X could potentially serve as a marker of the disease (differentially expressed genes)

[Figure: expression along the genome sequence, Control vs. Treatment]


Gene Expression Measurement

• Microarray (GEO)
• RNA-seq (SRA)
• Read quality check (FastQC)
• RNA-seq read mapping (BWA, Bowtie)
• RNA-seq assembly with reference genome (Cufflinks)
• RNA-seq assembly without reference genome (Trinity: de novo assembly)


RNA-seq

• Process
• Purpose
  – Analysis of Big Genomic Data
  – Gene Expression Estimation
• Variations
  – Differential Gene Expression Analysis
  – Functional Enrichment Analysis
  – Network Analysis

Forde, B. M., & O’Toole, P. W. (2013). Next-generation sequencing technologies and their impact on microbial genomics. Briefings in functional genomics, 12(5), 440-453.


Non-trivial RNA-seq Analysis Pipeline

• RNA-seq Reads
• Quality Check
• Data Trimming
• Read Mapping
• Gene Read Count
• Differential Expression Analysis
• Functional Enrichment Analysis
• De-novo (Bi)-Clustering
• Network Analysis & Modeling
• (De-novo) Assembly
• Operon Prediction


Non-trivial RNA-seq Analysis Tools

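• RNA-seq Reads
• Quality Check: FastQC
• Data Trimming: Btrim
• Read Mapping: HISAT
• Gene Read Count: HTSeq
• Differential Expression Analysis: edgeR/DESeq
• Functional Enrichment Analysis: DAVID/GO
• De-novo (Bi)-Clustering: MCL/QUBIC
• Network Analysis & Modeling: NCA/GtrieScanner
• (De-novo) Assembly: Cufflinks, Trinity
• Operon Prediction: DOOR, SeqTU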


Existing RNA-seq Pipeline Tools

[Figure: timeline, 2009 to 2017, of RNA-seq tools: HISAT2, Bridger, HTSeq, RSEM, Cufflinks, Cutadapt, Novoalign, Bowtie 2, BWA, Bowtie, TopHat, GNUMAP, TopHat2, STAR, Trinity, DESeq2, GSNAP, edgeR, FastQC, FastX, kallisto, sleuth]


ViDGER

• Tool to assist in interpreting and analyzing count matrices
• PCA, MDS, Clustering
• DGEA
• Visualizations
• Basic R package
• Shiny implementation


ViDGER Compatibility

• Count & condition matrix
• Popular DGE tools by citation count
  – Cuffdiff*
  – edgeR
  – DESeq2
  – DEGseq
  – limma
  – sleuth*

[Pie chart: citation counts 12021 (64%), 5200 (28%), 1607 (8%); legend: DGE & Visualization, Visualization Only, None]


Shiny Input

• Count Matrix
• Generates basic figures from matrix
• Initial Analyses
  – PCA
  – MDS


Differential Gene Expression

• Select DGE tool to analyze data
• Interactive results table
• DGE results visualizations for improved interpretation
• Interactivity between table & figures


Pitfall I: Popularity ≠ High Performance

Tool    MapSplice2  CRAC   GSNAP  Novoalign  TopHat2
Human   97.8%       86.1%  98.9%  90.3%      12.5%


Pitfall II: Gene expression estimation

[Figure: the RNA-seq analysis pipeline shown earlier, annotated "Mapping uncertainty!"]


Pitfall II: Gene expression estimation

RNA-seq reads mapping uncertainty


Mapping Uncertainty Occurrences

• Plants
  – Highly duplicative nature of genome
• Animals
  – Alternative splicing
• Metagenomics
  – Sequencing of entire microbial communities simultaneously
  – Identical genes across different species
  – Similar, mutated or evolved genes
  – Currently other issues compounding mapping uncertainty


Pitfall II: How Serious?

Species               Group             Unique-mapped  Multi-mapped  Un-mapped
Arabidopsis thaliana  Diploid plants    77%~89%        8%~17%        2%~5%
Vitis vinifera        Diploid plants    55%~82%        10%~25%       8%~23%
Solanum lycopersicum  Diploid plants    49%~87%        6%~34%        5%~44%
Solanum tuberosum     Polyploid plants  55%~69%        18%~26%       12%~19%
Triticum aestivum     Polyploid plants  62%~69%        18%~25%       9%~18%

Similar things happen in Human (transcript) and Metagenome


Mapping Uncertainty in Real Data

Species / reference         Group             Datasets  Size (G)  Unique-mapped  Multi-mapped  Un-mapped  (Multi-mapped)/(Total mapped)
Arabidopsis thaliana        Diploid plants    10        153.7     69%~89%        8%~17%        2%~17%     8%~18%
Vitis vinifera              Diploid plants    10        152.3     55%~82%        9%~25%        8%~23%     10%~31%
Solanum lycopersicum        Diploid plants    10        151.8     52%~88%        5%~34%        4%~16%     6%~39%
Panicum virgatum            Polyploid plants  10        385.7     47%~66%        17%~33%       13%~25%    22%~39%
Triticum aestivum           Polyploid plants  13        348.1     61%~69%        17%~25%       9%~18%     21%~28%
Human genome                Animal            11        249.9     55%~65%        21%~28%       12%~21%    25%~33%
Human transcriptome         Animal            11        249.9     10%~15%        23%~31%       55%~65%    61%~72%
Mus musculus genome         Animal            10        129.9     40%~70%        10%~38%       3%~31%     13%~48%
Mus musculus transcriptome  Animal            10        129.9     11%~27%        9%~42%        43%~67%    29%~77%
Total                                         95        1951      55%            22%           23%        29%
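The last quantity in the table, (Multi-mapped)/(Total mapped), can be recomputed from the unique-mapped and multi-mapped percentages, since total mapped = unique + multi. A minimal sketch using the Total entries of the table (55% and 22%); the function name is illustrative only.

```python
def multi_mapped_fraction(unique_pct: float, multi_pct: float) -> float:
    """Share of mapped reads that are multi-mapped: multi / (unique + multi)."""
    total_mapped = unique_pct + multi_pct
    if total_mapped == 0:
        raise ValueError("no mapped reads")
    return multi_pct / total_mapped


# Total entries of the table: 55% unique-mapped, 22% multi-mapped.
ratio = multi_mapped_fraction(55.0, 22.0)
print(f"{ratio:.1%}")  # about 29%, in line with the table
```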


Mapping Uncertainty in Plant Data


Mapping Uncertainty in Animal Data


Pitfall II: How to Proceed?

a) Ignore them: only consider unique mapping
  – 30%-70% of reads are discarded from further analysis in plants

b) Random mapping: if there are multiple equally good best matches, choose one at random
  – TopHat

c) Report all: try to keep more information
  – Cufflinks: distribute these multi-mapping reads uniformly or based on the expression level of uniquely mapped reads.
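The three options can be compared on a toy example. The sketch below is only an illustration of the ideas (a), (b) and (c); it is not the actual algorithm of TopHat or Cufflinks, and the read and gene names are hypothetical.

```python
import random
from collections import Counter

# Toy alignments: each read lists every gene it maps to equally well.
# Read and gene names are hypothetical.
alignments = {
    "r1": ["geneA"],
    "r2": ["geneA"],
    "r3": ["geneA"],
    "r4": ["geneB"],
    "r5": ["geneA", "geneB"],  # multi-mapped
    "r6": ["geneA", "geneB"],  # multi-mapped
}


def count_unique_only(aln):
    """(a) Ignore multi-mapped reads entirely."""
    counts = Counter()
    for genes in aln.values():
        if len(genes) == 1:
            counts[genes[0]] += 1
    return counts


def count_random(aln, seed=0):
    """(b) Assign each multi-mapped read to one candidate gene at random."""
    rng = random.Random(seed)
    counts = Counter()
    for genes in aln.values():
        counts[rng.choice(genes)] += 1
    return counts


def count_proportional(aln):
    """(c) Split each multi-mapped read by the unique-read counts of its genes."""
    unique = count_unique_only(aln)
    counts = Counter({g: float(c) for g, c in unique.items()})
    for genes in aln.values():
        if len(genes) > 1:
            weights = [unique[g] for g in genes]
            total = sum(weights)
            for g, w in zip(genes, weights):
                # Fall back to a uniform split when no candidate has unique reads.
                share = w / total if total > 0 else 1.0 / len(genes)
                counts[g] += share
    return counts


print(count_unique_only(alignments))
print(count_random(alignments))
print(count_proportional(alignments))
```

Option (a) drops two of the six reads here, while (b) and (c) keep all six but attribute them differently.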


Pitfall II: How to Proceed?

It is an OPEN and challenging problem!


Quantifying Mapping Uncertainty

• Gene Expression Quality Check (GeneQC)
  – Computational program collecting relevant information from datasets
  – Interprets information in meaningful way to provide quantification of mapping uncertainty
• Two levels of observations
  – Genomic level: Sequence Similarity between two genomic locations
  – Transcriptomic level: Proportion of shared ambiguous reads


GeneQC

[Figure: GeneQC illustration with panel labels A, B, C, D; the values 97 and 0.5 appear in the figure]


D-score

• Allows for comparable metric of mapping uncertainty
• Combines three statistics
  – Maximum proportion of shared ambiguous reads
  – Maximum base-pair similarity
  – Number of gene pair interactions
• Normalized between 0 and 1 for each dataset

[Figure: gene $i$ with example D-scores 0.07, 0.18, 0.65, 0.95, 0.84]


Variables: $D_1$

• $D_1$: Sequence Similarity * Match Length
  – $D_1 = \max_{y} \{ ss_{i,y} \cdot l_{i,y} \}$
  – $ss_{i,y}$ = sequence similarity of gene $i$ and gene $y$
  – $l_{i,y}$ = match length
• Additional constraints for $D_1$
  – e-value < $10^{-6}$
  – SS * Match Length > 100
  – Mismatch < 5
  – Gap < 5

Example for gene $i$:

gene $y_1$: $ss_{i,1} = 65\%$; $l_{i,1} = 100$
gene $y_2$: $ss_{i,2} = 85\%$; $l_{i,2} = 200$
gene $y_3$: $ss_{i,3} = 85\%$; $l_{i,3} = 350$
gene $y_4$: $ss_{i,4} = 85\%$; $l_{i,4} = 200$
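The example values for gene i can be run through the max rule directly. This is a sketch under two assumptions that the slide does not state: similarity is treated as a fraction (0.85 rather than 85), and a gene with no qualifying match gets 0. Only the "SS * Match Length > 100" constraint is modeled; the e-value, mismatch and gap filters are left out.

```python
# Candidate matches for gene i, taken from the slide's example:
# (label, sequence similarity as a fraction, match length).
matches = [
    ("y1", 0.65, 100),
    ("y2", 0.85, 200),
    ("y3", 0.85, 350),
    ("y4", 0.85, 200),
]

MIN_PRODUCT = 100  # constraint from the slide: similarity * match length > 100


def d1(candidates, min_product=MIN_PRODUCT):
    """Return the max over y of ss_{i,y} * l_{i,y}, ignoring matches below the cutoff.

    Returns 0.0 when no candidate passes (an assumption of this sketch).
    """
    scores = [ss * length for _, ss, length in candidates]
    kept = [s for s in scores if s > min_product]
    return max(kept, default=0.0)


print(d1(matches))  # y3 gives the maximum: 0.85 * 350
```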


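Variables: $D_2$

• $D_2$: Max MMR percentage
  – $D_2 = \frac{|G_i \cap X|}{|G_i|}$
  – $G_i$ = reads aligned to gene $i$
  – $X = \arg\max_{Y} |G_i \cap Y|$

[Figure: Venn diagram with labels $G_i$, $X$, $Y_1$, $Y_2$ and $G_i \cap X$]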
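With reads stored as sets of IDs, this quantity is a set intersection followed by a ratio. A minimal sketch, assuming the formula reads |G_i ∩ X| / |G_i| with X the other gene sharing the most reads with gene i; the read IDs and gene labels are hypothetical.

```python
def d2(reads_i, other_read_sets):
    """Largest fraction of gene i's reads that is shared with one other gene.

    reads_i: set of read IDs aligned to gene i (G_i).
    other_read_sets: dict mapping another gene's name to its read-ID set (Y).
    Returns (D2, name of the gene X achieving the maximum), or (0.0, None).
    """
    if not reads_i or not other_read_sets:
        return 0.0, None
    best = max(other_read_sets, key=lambda g: len(reads_i & other_read_sets[g]))
    shared = len(reads_i & other_read_sets[best])
    if shared == 0:
        return 0.0, None
    return shared / len(reads_i), best


# Hypothetical read IDs for illustration.
G_i = {"r1", "r2", "r3", "r4", "r5"}
others = {
    "Y1": {"r1", "r9"},              # shares 1 read with gene i
    "Y2": {"r2", "r3", "r4", "r8"},  # shares 3 reads with gene i
}
print(d2(G_i, others))  # (0.6, 'Y2')
```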


Variables: $D_3$

• $D_3$: Degree weight
  – $D_3 = \log_{10}(|S_i \cup M_i| + 1)$
  – $S_i$ = {genomic locations where $D_1 > 0$}
  – $M_i$ = {genomic locations where $D_2 > 0$}
• Separated into two populations
  – $D_2 = 0$
  – $D_2 \neq 0$
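The degree weight counts how many distinct locations interact with gene i on either level. A short sketch, assuming the formula reads log10(|S_i ∪ M_i| + 1); the location labels are hypothetical.

```python
import math


def d3(similar_locations, shared_read_locations):
    """Degree weight: log10(|S_i union M_i| + 1)."""
    union = set(similar_locations) | set(shared_read_locations)
    return math.log10(len(union) + 1)


# Hypothetical location labels.
S_i = {"locA", "locB", "locC"}  # locations with D1 > 0
M_i = {"locC", "locD"}          # locations with D2 > 0
print(d3(S_i, M_i))      # union has 4 locations, so this is log10(5)
print(d3(set(), set()))  # no interacting locations gives 0.0
```

The "+ 1" keeps the weight at zero for a gene with no interacting locations.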


Variables by Species


D-score Development

• $D_1$, $D_2$, $D_3$ combined into one distinct value
• Regression-based approach to optimize effect of each parameter

$D = \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_1 D_2 + \alpha_5 D_1 D_3 + \alpha_6 D_2 D_3 + \alpha_7 D_1 D_2 D_3$

$SD = D_3 (\alpha_1 D_1 + \alpha_2 D_2)$

• $D^*$ used as dependent variable to represent mapping uncertainty
  – $G_i$ = reads mapped to gene $i$ (all matches)
  – $U_i$ = reads uniquely mapped to gene $i$ (unique mapping)
  – Real alignment $R_i$ falls somewhere between: $|U_i| \le |R_i| \le |G_i|$
  – $D^* = \frac{|G_i| - |R_i|}{|G_i|} = 1 - \frac{|R_i|}{|G_i|} = 1 - \frac{|G_i| + |U_i|}{2|G_i|}$
  – Since $|U_i| \le |G_i|$, the ratio $\frac{|G_i| + |U_i|}{2|G_i|}$ lies between $\frac{1}{2}$ and 1, so $0 \le D^* \le \frac{1}{2}$
• $D^*$ regressed upon ($D_1$, $D_2$, $D_3$) to determine optimized coefficients for each dataset
• Interpretations for each set of coefficients can be used to understand biological mechanisms behind species-specific mapping uncertainty
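The dependent variable and the simplified score are plain arithmetic once the counts are known. The sketch below assumes the reading D* = 1 − (|G_i| + |U_i|) / (2|G_i|), i.e. the unknown real count |R_i| is taken as the midpoint of |U_i| and |G_i|. The coefficient values passed to `simplified_d` are placeholders, not fitted values from the slides, and the regression step itself is not shown.

```python
def d_star(n_all, n_unique):
    """Proxy for the mapping uncertainty of gene i.

    n_all:    |G_i|, reads mapped to gene i counting all matches.
    n_unique: |U_i|, reads uniquely mapped to gene i.
    The unknown real count |R_i| is taken as the midpoint of the two.
    """
    if n_all <= 0 or not 0 <= n_unique <= n_all:
        raise ValueError("need 0 <= |U_i| <= |G_i| and |G_i| > 0")
    r_mid = (n_all + n_unique) / 2
    return 1 - r_mid / n_all


def simplified_d(d1, d2, d3, a1, a2):
    """Simplified score SD = D3 * (a1 * D1 + a2 * D2)."""
    return d3 * (a1 * d1 + a2 * d2)


print(d_star(100, 100))  # every read is unique: 0.0
print(d_star(100, 50))   # half of the reads are unique: 0.25
print(d_star(100, 0))    # no unique reads: 0.5
print(simplified_d(0.5, 0.25, 2.0, a1=1.0, a2=1.0))  # placeholder coefficients
```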


D-scores


Simplified D-score


Simplified D-score Distributions

• Density plots appear to show mixture distributions
• Individual distributions can help indicate categorizations for mapping uncertainty


Level of Mapping Uncertainty from D-scores

• Mixture model distributions fit to set of D-scores
  – Indicates level of mapping uncertainty for each annotated gene
  – Normal & Gamma distribution fitting
  – Variable number of distributions
• Mixture Model Fitting using Expectation-Maximization Algorithm
  – $P(X \mid \theta) = \sum_{k} \beta_k Y_k(X \mid \theta_k)$
  – $X = \{x_1, x_2, \dots, x_N\}$ represent the set of D-scores
  – $\beta_k$ represent the weight for the $k^{th}$ component with $\sum_{k} \beta_k = 1$
  – $Y_k(X \mid \theta_k)$ represent the distribution of the $k^{th}$ component
  – $\theta_k$ is the set of parameters for the $k^{th}$ component


Mixture Model Fitting: Initialization

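• Assume $Y_k(X, \theta_k) = N(X; \mu_k, \sigma_k^2)$
• Initial parameterization
  – K-means clustering to separate into $k$ components
  – $\theta_k$, $\beta_k$ calculated for each component using MLE based on $N_k$
  – $MLE(\mu_k) = \frac{\sum_{j=1}^{N_k} x_{k,j}}{N_k}$
  – $MLE(\sigma_k^2) = \frac{\sum_{j=1}^{N_k} (x_{k,j} - \mu_k)^2}{N_k}$
  – $\beta_k = \frac{N_k}{N}$, with $N_k$ = number of data points in component $k$ and $\sum_{k} N_k = N$

[Figure: example with $k = 4$]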
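The initialization can be sketched in a few lines: cluster the scores with k-means, then take the per-cluster mean, the per-cluster variance (dividing by N_k) and the weight N_k / N. The starting centers at evenly spaced quantiles are a choice made for this sketch so that it is deterministic; the slides do not say how k-means is seeded. The D-scores are hypothetical and k = 2 is used only to keep the example small.

```python
def kmeans_1d(data, k, iters=100):
    """Plain 1-D k-means (Lloyd's algorithm) with quantile-spaced starting centers."""
    xs = sorted(data)
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    clusters = [xs]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


def initial_parameters(data, k):
    """Per-component MLEs (mu_k, sigma_k^2) and weights beta_k = N_k / N."""
    n = len(data)
    params = []
    for cluster in kmeans_1d(data, k):
        if not cluster:
            continue  # empty clusters are skipped in this sketch
        mu = sum(cluster) / len(cluster)
        var = sum((x - mu) ** 2 for x in cluster) / len(cluster)  # MLE divides by N_k
        params.append({"beta": len(cluster) / n, "mu": mu, "var": var})
    return params


# Hypothetical D-scores forming two visible groups.
scores = [0.05, 0.07, 0.10, 0.12, 0.80, 0.84, 0.90, 0.95]
for p in initial_parameters(scores, k=2):
    print(p)
```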


Mixture Model Fitting: Expectation & Maximization

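• Posterior probability of containment within each component for each D-score is calculated

$P(x_j \in k_i \mid x_j) = \frac{P(x_j \mid x_j \in k_i)\, P(k_i)}{P(x_j)} = \frac{N(x_j \mid \mu_k, \sigma_k)\, \frac{N_k}{N}}{\sum_{k} \beta_k N(x_j \mid \mu_k, \sigma_k)} = \frac{\beta_k N(x_j \mid \mu_k, \sigma_k)}{\sum_{k} \beta_k N(x_j \mid \mu_k, \sigma_k)}$

• Parameters for each component calculated after Expectation Step

$\mu_k = \frac{\sum_{j=1}^{N} P(x_j \in k_i \mid x_j)\, x_j}{\sum_{j=1}^{N} P(x_j \in k_i \mid x_j)}$

$\sigma_k^2 = \frac{\sum_{j=1}^{N} P(x_j \in k_i \mid x_j)\, (x_j - \mu_k)^2}{\sum_{j=1}^{N} P(x_j \in k_i \mid x_j)}$

$\beta_k = \frac{\sum_{j=1}^{N} P(x_j \in k_i \mid x_j)}{N}$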
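One pass of the Expectation and Maximization steps for a one-dimensional Gaussian mixture can be sketched as follows. The D-scores and starting values are hypothetical, and the small variance floor is a numerical safeguard added for this sketch rather than part of the update rules on the slide. The code assumes every component keeps some responsibility, which holds for this toy data.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_step(data, betas, mus, variances):
    """One E step followed by one M step for a 1-D Gaussian mixture."""
    k, n = len(betas), len(data)
    # E step: posterior probability that x_j belongs to each component.
    resp = []
    for x in data:
        weighted = [betas[c] * normal_pdf(x, mus[c], variances[c]) for c in range(k)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    # M step: responsibility-weighted updates of mu_k, sigma_k^2 and beta_k.
    new_betas, new_mus, new_vars = [], [], []
    for c in range(k):
        r_sum = sum(resp[j][c] for j in range(n))
        mu = sum(resp[j][c] * data[j] for j in range(n)) / r_sum
        var = sum(resp[j][c] * (data[j] - mu) ** 2 for j in range(n)) / r_sum
        new_betas.append(r_sum / n)
        new_mus.append(mu)
        new_vars.append(max(var, 1e-6))  # variance floor added in this sketch
    return new_betas, new_mus, new_vars


def log_likelihood(data, betas, mus, variances):
    return sum(
        math.log(sum(b * normal_pdf(x, m, v) for b, m, v in zip(betas, mus, variances)))
        for x in data
    )


# Hypothetical D-scores and starting values.
scores = [0.05, 0.07, 0.10, 0.12, 0.80, 0.84, 0.90, 0.95]
params = ([0.5, 0.5], [0.2, 0.7], [0.05, 0.05])
previous = log_likelihood(scores, *params)
for _ in range(200):
    params = em_step(scores, *params)
    current = log_likelihood(scores, *params)
    if current - previous < 1e-8:  # stop when the log likelihood stalls
        break
    previous = current
print(params)
```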


Mixture Model Fitting: Optimization

• Expectation and Maximization steps repeated until no significant improvement achieved after each iteration
  – log likelihood fails to substantially increase
• Implementation in R with $k \in \{1, \dots, 9\}$
  – Best model fitting determined by lowest Bayesian Information Criterion (BIC)

[Figure: fitted mixture with $k = 4$]
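Model selection by BIC reduces to comparing one number per candidate k. The sketch below uses the standard definition BIC = p·ln(n) − 2·ln(L) and counts 3k − 1 free parameters for a k-component one-dimensional Gaussian mixture. The log likelihood values are made up for illustration and are unrelated to the k = 4 result on the slide.

```python
import math


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: n_params * ln(n_obs) - 2 * log_lik (lower is better)."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def gmm_free_parameters(k):
    """A k-component 1-D Gaussian mixture has k means, k variances and k - 1 free weights."""
    return 3 * k - 1


def select_k(log_liks, n_obs):
    """log_liks maps k to its maximized log likelihood; return the k with the lowest BIC."""
    return min(log_liks, key=lambda k: bic(log_liks[k], gmm_free_parameters(k), n_obs))


# Hypothetical maximized log likelihoods for k = 1..4 on 500 D-scores.
fits = {1: -120.0, 2: -40.0, 3: -36.0, 4: -35.0}
for k, ll in fits.items():
    print(k, round(bic(ll, gmm_free_parameters(k), 500), 2))
print("selected k =", select_k(fits, 500))
```

In this toy case the gain in log likelihood beyond k = 2 is too small to pay for the extra parameters, so the penalty term decides the choice.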


Mixture Model Fitting

$k = 4$: The four distributions provide criteria for separating genes into 4 categories based on mapping uncertainty level


Addressing Mapping Uncertainty

• Co-expression Modules (CEMs)
  – Genes typically co-expressed at certain rates with other genes forming co-expression modules
  – Can use expression levels for known co-expressed genes (CEGs) to predict likely expression levels for the gene locations
  – This information can be in turn used to determine which location is most likely for any particular ambiguous read
• Can use existing information to gain insight into the likelihood of the correct location for alignment
• If no prior CEMs are available, biclustering of data can provide dataset-specific CEMs.


Pitfall III: T-test for differential expression analysis

Wilcoxon (nonparametric) test has better performance than T-test (parametric)

Bioinformatics. 2002 Nov;18(11):1454-61. Cited by 308

P-value < 0.01
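One reason a rank-based test can behave better is that it is less sensitive to outliers. The sketch below computes Welch's t statistic and the Wilcoxon rank-sum z score (normal approximation, no tie correction) on made-up expression values in which a single control value is an outlier. With only six samples per group the normal approximation is rough, so this is an illustration rather than a recommended testing procedure.

```python
import math


def welch_t(a, b):
    """Welch's t statistic (parametric; compares means)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))


def rank_sum_z(a, b):
    """Wilcoxon rank-sum z score via the normal approximation.

    Assumes no tied values, so no tie correction is applied.
    """
    ranked = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    rank_sum_a = sum(r for r, (_, grp) in enumerate(ranked, start=1) if grp == 0)
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2  # Mann-Whitney U for sample a
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


def two_sided_p_from_z(z):
    """Two-sided p-value of a standard normal z score."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical expression values for one gene; the last control value is an outlier.
control = [5.1, 5.3, 5.0, 5.2, 5.4, 12.0]
treated = [6.1, 6.3, 6.0, 6.2, 6.4, 6.5]
t_stat = welch_t(control, treated)
z = rank_sum_z(control, treated)
p = two_sided_p_from_z(z)
print("Welch t:", t_stat)
print("rank-sum z:", z, "approximate p:", p)
```

The outlier pulls the control mean up to the treated mean, so the t statistic is close to zero, while the ranks still show that five of the six control values lie below every treated value.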


Pitfall IV: co-expression correlation

       chip1  chip2  chip3  chip4  chip5  chip6  chip7  chip8  chip9  chip10
Gene1  7.6    6.0    10.8   8.3    9.1    8.7    7.4    6.4    10.2   6.5
Gene2  8.1    7.2    7.0    8.4    8.9    8.8    6.5    10.4   6.9    7.5

Pearson Spearman

• Pearson correlation measures linear relationship
• Spearman’s rank correlation measures monotonic relationship

Pearson or Spearman?
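Both coefficients can be computed for the Gene1 and Gene2 rows of the table above. This is a minimal sketch with hand-written helpers; the rank function assumes no tied values, which holds for these two rows.

```python
import math

# Expression values of the two genes across the ten chips in the table above.
gene1 = [7.6, 6.0, 10.8, 8.3, 9.1, 8.7, 7.4, 6.4, 10.2, 6.5]
gene2 = [8.1, 7.2, 7.0, 8.4, 8.9, 8.8, 6.5, 10.4, 6.9, 7.5]


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks starting at 1; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


print("Pearson :", round(pearson(gene1, gene2), 3))
print("Spearman:", round(spearman(gene1, gene2), 3))
```

The two values differ because Pearson uses the measured magnitudes while Spearman uses only their order, so a few extreme chips weigh differently in each.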


Pitfall V: Co-expression in LARGE data set

Genes are not necessarily co-expressed under all experimental conditions when we have a large data set!

[Figure: expression matrix with axes Genes and Conditions]

• One dimensional clustering (genes or conditions)
• Bi-clustering (genes & conditions)
• More data!!


Computer Lab Requirement

• Recent version of following software
  – R
  – RStudio
  – MiKTeX (or TeXLive)
• Install the following R packages on your personal computer
  – edgeR
  – QUBIC
  – sand


Final Report Presentation

• 12 teams, 3 people per team
• For each team, 15 mins team presentation
  – 12 mins presentation
  – 3 mins question-and-answer
• One score per team