Exploring the package TopHat-CuffDiff. Jean-François Taly, Bioinformatics Core Facilities. Group meeting, October 2nd 2012.


Page 1: Exploring the package  TopHat-CuffDiff

1

Exploring the package TopHat-CuffDiff

Jean-François Taly, Bioinformatics Core Facilities

Group meeting, October 2nd 2012

Page 2: Exploring the package  TopHat-CuffDiff

2

RNAseq expression data analysis

1. TopHat for mapping reads to the reference
   – Reads directionality

2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1

3. Enrichment threshold
   – Which transcripts are present in mitochondria?

Page 3: Exploring the package  TopHat-CuffDiff

3

MitomiR project

Question 1: Are nuclear DNA-encoded miRNAs imported into mitochondria?

[Diagram (MitomiR_EU0183): PNPASE, miRNP, miRNAs, mRNAs and mito proteins; regulation of mitochondrial translation]

Page 4: Exploring the package  TopHat-CuffDiff

4

MitomiR project

Question 2: Do miRNAs exist in the mitochondrial genome?

[Diagram (MitomiR_EU0183): miRNP?, miRNAs, mRNAs and proteins; regulation of mitochondrial translation]

Page 5: Exploring the package  TopHat-CuffDiff

5

One cell, two DNAs

Mitochondria

- Circular DNA
- Human mitochondrial genome (mtDNA) = 16.6 kb
  - 13 genes for subunits of respiratory complexes I, III, IV and V
  - 22 genes for mitochondrial tRNAs
  - 2 genes for rRNAs

* One mitochondrion can contain two to ten copies of its DNA

* Exceptions to the universal genetic code (UGC) in mitochondria

From Lung et al., 2006

Nucleus

- 23 chromosome pairs
- Human DNA: 2.9 billion base pairs
- Between 20,000 and 25,000 human protein-coding genes
- "Junk" DNA or non-coding DNA
- Non-coding functional RNA (tRNA, rRNA, miRNA…)

The human genome may encode over 1000 miRNAs, which may target about 60% of mammalian genes

MitomiR_EU0183

Page 6: Exploring the package  TopHat-CuffDiff

6

RNAseq libraries

• Short insert size: searching for miRNAs
  – No poly-A selection
  – No fragmentation
  – Size selected: 18-36 nt
  – Stranded

• Long insert size: searching for lncRNAs
  – No poly-A selection
  – Fragmented
  – Size selected: 200 nt
  – Stranded

Page 7: Exploring the package  TopHat-CuffDiff

7

2 Conditions

• Total fraction (tot)
  – Full cell lysate

• Mitochondrial fraction (mit)
  – RNA extracted from mitochondria

Page 8: Exploring the package  TopHat-CuffDiff

8

RNAseq expression data analysis

1. TopHat for mapping reads to the reference
   – Reads directionality

2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1

3. Enrichment threshold
   – Which transcripts are present in mitochondria?

Page 9: Exploring the package  TopHat-CuffDiff

9

Stranded RNAseq: Vocabulary

Forward / Reverse = Plus / Minus
Coding / Template = Sense / Anti-sense

Forward = the strand whose 5’ end is the closest to the centromere (in human)
50% of the genes are coded on the forward strand

[Diagram: forward strand (5’ to 3’) and reverse strand (3’ to 5’), each with a coding region marked]

http://www.biostars.org/post/show/3423/forward-and-reverse-strand-conventions/

Page 10: Exploring the package  TopHat-CuffDiff

10

Orientation of reads?

[Diagram: how library preparation determines read orientation]

coding DNA (5’ to 3’) / template DNA (3’ to 5’)

Transcription: RNA (5’ to 3’)

Reverse-transcription: cDNA (3’ to 5’)
  First strand sequencing: dUTP, NSR, NNSR

Duplication: copy of the coding DNA (5’ to 3’)
  Second strand sequencing: Directional Illumina (Ligation), Standard SOLiD

Page 11: Exploring the package  TopHat-CuffDiff

11

Proper TopHat option?

--library-type:
• fr-unstranded: default, standard Illumina reads
• fr-firststrand: dUTP, NSR, NNSR
• fr-secondstrand: Directional Illumina (Ligation), Standard SOLiD

We mapped the reads with both the unstranded and the secondstrand settings for comparison.
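The protocol-to-option mapping above can be written down as a small lookup. This is only a sketch: the dictionary keys reuse the slide's wording, and the function name, the index name and the file paths are hypothetical placeholders, not values from the analysis.

```python
# Protocol -> TopHat --library-type, as listed on the slide.
LIBRARY_TYPE_BY_PROTOCOL = {
    "Standard Illumina": "fr-unstranded",
    "dUTP": "fr-firststrand",
    "NSR": "fr-firststrand",
    "NNSR": "fr-firststrand",
    "Directional Illumina (Ligation)": "fr-secondstrand",
    "Standard SOLiD": "fr-secondstrand",
}


def tophat_command(protocol, bowtie_index, reads, output_dir):
    """Build a TopHat argument list for a given library protocol.

    bowtie_index, reads and output_dir are placeholders chosen by the caller.
    """
    library_type = LIBRARY_TYPE_BY_PROTOCOL[protocol]
    return [
        "tophat",
        "--library-type", library_type,
        "-o", output_dir,
        bowtie_index,
        reads,
    ]


# Hypothetical paths, for illustration only.
print(" ".join(tophat_command("Standard SOLiD", "genome_index", "reads.fastq", "tophat_out")))
```

Running the same reads once per setting, as done here, only requires changing the protocol key.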

Page 12: Exploring the package  TopHat-CuffDiff

12

How can we evaluate directionality?

• Reads mapping to the forward (F) strand should align with genes that are also coded on the forward strand.

• Bitwise FLAG of the BAM file:
  – How many reads in forward?
      samtools view -c -F 16 accepted_hits.bam
  – How many reads in reverse?
      samtools view -c -f 16 accepted_hits.bam

--library-type    Total number of reads   Percentage of Forward Mapping (PFM)
fr-secondstrand   173,219,584             55%
default           173,196,005             55%
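The PFM in the table is the forward count divided by the total. A minimal sketch of the same computation on a list of SAM FLAG values follows; the function names are illustrative, and it assumes every read in the input is mapped (as in TopHat's accepted_hits.bam), since an unmapped read without bit 16 would otherwise be counted as forward.

```python
REVERSE_FLAG = 0x10  # bit 16 of the SAM FLAG: read aligned to the reverse strand


def is_forward(flag):
    """True when bit 16 is not set, i.e. what `samtools view -F 16` keeps."""
    return (flag & REVERSE_FLAG) == 0


def percentage_forward_mapping(flags):
    """Percentage of reads whose FLAG does not carry the reverse-strand bit."""
    flags = list(flags)
    if not flags:
        raise ValueError("no reads given")
    forward = sum(1 for flag in flags if is_forward(flag))
    return 100.0 * forward / len(flags)


# Toy FLAG values: 0 and 99 are forward, 16 and 83 are reverse.
print(percentage_forward_mapping([0, 16, 99, 83]))
```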

Page 13: Exploring the package  TopHat-CuffDiff

13

How can we evaluate directionality? (2)

• Gene by gene
  – Bitwise FLAG + gene strand annotation

Transcripts grouped by annotated strand:

--library-type    Metric                  (+) strand   (-) strand   Both strands
fr-secondstrand   Number of transcripts   82,782       80,648       163,430
fr-secondstrand   Average PFM             77%          24%          51%
fr-secondstrand   Median PFM              92%          1%           55%
default           Number of transcripts   82,868       80,693       163,561
default           Average PFM             77%          24%          51%
default           Median PFM              92%          1%           54%

A small number of genes received a huge number of mis-mapped reads!
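The gene-by-gene summary can be sketched as follows, assuming the forward and reverse read counts per transcript have already been extracted (for example with the FLAG filter shown earlier). The input layout and the function name are assumptions made for illustration; the toy numbers are not from the experiment.

```python
from collections import defaultdict
from statistics import mean, median


def pfm_by_strand(read_counts, gene_strand):
    """Summarise per-gene PFM for each annotated strand.

    read_counts: {gene: (forward_reads, reverse_reads)}
    gene_strand: {gene: '+' or '-'}
    """
    per_strand = defaultdict(list)
    for gene, (forward, reverse) in read_counts.items():
        total = forward + reverse
        if total == 0:
            continue  # no reads, PFM undefined for this gene
        per_strand[gene_strand[gene]].append(100.0 * forward / total)
    return {
        strand: {"n": len(values), "mean": mean(values), "median": median(values)}
        for strand, values in per_strand.items()
    }


# Toy data only.
counts = {"g1": (90, 10), "g2": (80, 20), "g3": (5, 95), "g4": (1, 99)}
strands = {"g1": "+", "g2": "+", "g3": "-", "g4": "-"}
print(pfm_by_strand(counts, strands))
```

A large gap between the mean and the median on one strand, as in the table above (24% versus 1%), points to a few genes with many reads on the unexpected strand.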

Page 14: Exploring the package  TopHat-CuffDiff

14

Example of mis-aligned reads

• AC097532.1: chr2:133038647-133038738
  – miRNA automatically annotated in E67 but retired from E68;
  – the CIGAR string of some reads spans 26 kb;
  – 11,000,115 reads mapped (6% of total);
  – 8,205,667 mapped to position 133,038,644;
  – NCBI BLAST of the major sequence:
    • hit on the opposite strand, but with 100% coverage and 100% identity to the 28S ribosomal RNA.
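One way to spot such reads is to compute how much of the reference each alignment covers from its CIGAR string. The sketch below assumes that "26 kb" refers to the reference span of the alignment; the function name and the example CIGAR strings are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference positions according to the SAM specification.
REFERENCE_CONSUMING = set("MDN=X")


def reference_span(cigar):
    """Number of reference bases covered by an alignment, introns (N) included."""
    return sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in REFERENCE_CONSUMING
    )


# A short read split around a 26,000-base skipped region (invented example).
print(reference_span("20M26000N16M"))
```

Filtering on a maximum span is one possible screen for alignments like the ones described above.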

Page 15: Exploring the package  TopHat-CuffDiff

15

RNAseq expression data analysis

1. TopHat for mapping reads to the reference
   – Reads directionality

2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1

3. Enrichment threshold
   – Which transcripts are present in mitochondria?

Page 16: Exploring the package  TopHat-CuffDiff

16

CuffDiff needs a special GTF

• CuffDiff needs a GTF with the following 2 tags:
  – tss_id: the ID of this transcript's inferred start site.
  – p_id: the ID of the coding sequence this transcript contains.

• You can produce a compatible GTF with CuffCompare:
    cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf
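Before running CuffDiff it can help to check whether a GTF already carries both tags. The following is a rough sketch that parses the attribute column of a single GTF line; the helper names and the example record are hypothetical. Note that p_id may legitimately be absent on non-coding transcripts.

```python
import re

# GTF attributes look like: key "value"; key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(gtf_line):
    """Return the attribute column (9th field) of a GTF line as a dict."""
    fields = gtf_line.rstrip("\n").split("\t")
    return dict(ATTRIBUTE.findall(fields[8]))


def missing_cuffdiff_tags(gtf_line, required=("tss_id", "p_id")):
    """List the required tags that are absent from this GTF line."""
    attributes = parse_attributes(gtf_line)
    return [tag for tag in required if tag not in attributes]


# Invented record, for illustration only.
line = 'chrM\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
print(missing_cuffdiff_tags(line))
```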

Page 17: Exploring the package  TopHat-CuffDiff

17

Effect of CuffCompare

[Figure: two panels, with axes labelled "CuffDiff V2.0.2" and "CuffCompare + CuffDiff V2.0.2"]

Page 18: Exploring the package  TopHat-CuffDiff

18

Effect of CuffDiff Version

[Figure: two panels, with axes labelled "CuffDiff V2.0.1" and "CuffDiff V2.0.2"]

Page 19: Exploring the package  TopHat-CuffDiff

19

Highly sensitive statistics

Reproducibility?

Version effect?

CuffCompare effect?

Genome annotation effect?

From 902 differentially expressed genes with V2.0.1, we went to 15 with V2.0.2!

Page 20: Exploring the package  TopHat-CuffDiff

20

RNAseq expression data analysis

1. TopHat for mapping reads to the reference
   – Reads directionality

2. CuffDiff for the differential enrichment
   – Statistics with version 2.0.0 or 2.0.1

3. Enrichment threshold
   – Which transcripts are present in mitochondria?

Page 21: Exploring the package  TopHat-CuffDiff

21

Expression data reflects expectations

Ensembl ID        Gene      Length (shortest)   qPCR(tot)/qPCR(mit) 21-07-2011   qPCR 29-07-2011   RNA-seq ShortIS   RNA-seq LongIS
ENSG00000198899   MT-ATP6   681                 0.600                            0.500             -                 0.18
ENSG00000198840   MT-ND3    346                 0.400                            0.400             -                 0.21
ENSG00000111640   GAPDH     390                 416.000                          362.000           -                 7.1
ENSG00000089157   RPLP0     402                 611.000                          446.000           -                 8.6

The statistics may not be trustworthy, but the fold change is! Define an enrichment threshold based on log2(FPKMtot/FPKMmit).

Gene categories: cytosol, vicinity of mitochondria, mitochondrial genes
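The enrichment measure log2(FPKMtot/FPKMmit) is negative when a transcript is more abundant in the mitochondrial fraction. A minimal sketch follows; the function names are illustrative and the cut-off of -1.0 is a made-up placeholder, since the slides do not state the threshold that was used. The example values are those of MT-ND6 in the long insert library from the backup slides.

```python
import math


def log2_enrichment(fpkm_tot, fpkm_mit):
    """log2(FPKM tot / FPKM mit); negative means enriched in the mitochondrial fraction."""
    if fpkm_tot <= 0 or fpkm_mit <= 0:
        raise ValueError("both FPKM values must be positive")
    return math.log2(fpkm_tot / fpkm_mit)


def is_mito_enriched(fpkm_tot, fpkm_mit, threshold):
    """True when the log2 ratio is at or below the chosen threshold."""
    return log2_enrichment(fpkm_tot, fpkm_mit) <= threshold


# MT-ND6, long insert library: FPKM tot = 377, FPKM mit = 1820.
print(round(log2_enrichment(377, 1820), 2))
# The threshold below is a hypothetical value, not the one used in the study.
print(is_mito_enriched(377, 1820, threshold=-1.0))
```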

Page 22: Exploring the package  TopHat-CuffDiff

22

Compartmentalized genes

• Cytosolic genes:
  – UniProt: experimentally observed in cytosol
  – Ensembl: no automatic annotations

• Vicinity of mitochondria:
  – Paper from Kang et al. 2012

• Mitochondrial genes
  – The 37 genes in the mitochondrial chromosome

Page 23: Exploring the package  TopHat-CuffDiff

23

Log2(Fold Change) distributions for the long insert library

Page 24: Exploring the package  TopHat-CuffDiff

24

Summary

Library   Statistic   All     Cyt Ensembl67   Cyt UniProt   Mitochondrial   Kang2012 VicinityMit
ShortIS   DE Mean     1.7     0.41            -             -0.6            -
ShortIS   DE Median   2.05    0.46            -             -0.65           -
ShortIS   SeqNumb     2117    9               0             22              0
LongIS    DE Mean     0.46    1.05            0.9           -2.21           1.94
LongIS    DE Median   0.5     1.14            0.96          -2.27           2.2
LongIS    SeqNumb     21030   1664            127           34              13

Page 25: Exploring the package  TopHat-CuffDiff

25

Significantly enriched genes

Method            Short Insert   Long Insert
CuffDiff V2.0.1   988            908
Threshold         309            714
Intersection      22             99
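The "Intersection" row corresponds to genes selected by both methods. A sketch of that set operation is shown below, assuming a set of CuffDiff-significant gene names and a dictionary of log2(tot/mit) values; the function name, the CuffDiff set and the threshold are invented for illustration, while the three log2 values come from the long insert columns of the backup slides.

```python
def enriched_by_both(cuffdiff_significant, log2_ratios, threshold):
    """Genes called by CuffDiff that also pass the log2(tot/mit) threshold."""
    by_threshold = {
        gene for gene, ratio in log2_ratios.items() if ratio <= threshold
    }
    return set(cuffdiff_significant) & by_threshold


# log2(tot/mit) values for the long insert library.
ratios = {"MT-ND6": -2.27, "MT-CO2": -2.41, "J01415.14": -1.24}
# Hypothetical CuffDiff calls and threshold, for illustration only.
calls = {"MT-ND6", "J01415.14"}
print(enriched_by_both(calls, ratios, threshold=-2.0))
```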

Page 26: Exploring the package  TopHat-CuffDiff

26

Back Up slides

Page 27: Exploring the package  TopHat-CuffDiff

27

Mitochondrial genome

Page 28: Exploring the package  TopHat-CuffDiff

28

Mitochondrial genome – first 3 genes

Page 29: Exploring the package  TopHat-CuffDiff

29

                                    Short insert                       Long insert
Ensembl ID       Gene       Length  FPKM mit  FPKM tot  log2(tot/mit)  FPKM mit  FPKM tot  log2(tot/mit)
ENSG00000198695  MT-ND6        525        81        23          -1.81      1820       377          -2.27
ENSG00000198712  MT-CO2        684       459       169          -1.44      4063       764          -2.41
ENSG00000198727  MT-CYB       1141       159       144          -0.15      2332       504          -2.21
ENSG00000198763  MT-ND2       1042       172        59          -1.53      1559       285          -2.45
ENSG00000198786  MT-ND5       1812       129        58          -1.15      2153       437          -2.30
ENSG00000198804  MT-CO1       1542       154        58          -1.42      4186       766          -2.45
ENSG00000198840  MT-ND3        346       226        66          -1.77      2890       610          -2.24
ENSG00000198886  MT-ND4       1378       166        56          -1.56      3400       698          -2.28
ENSG00000198888  MT-ND1        956       150        92          -0.71      1183       233          -2.35
ENSG00000198899  MT-ATP6       681        94        26          -1.83      2357       431          -2.45
ENSG00000198938  MT-CO3        784       270       269          -0.01      2037       401          -2.34
ENSG00000209082  J01415.1       75     39041     34034          -0.20     56409      9045          -2.64
ENSG00000210049  J01415.2       71    179164     80467          -1.15    257938     55524          -2.22
ENSG00000210077  J01415.3       69     96298     67810          -0.51   2524440    682409          -1.89
ENSG00000210082  J01415.4     1559      1546       642          -1.27    HIDATA     27286           0.00
ENSG00000210100  J01415.5       69     10163     12512           0.30     63087     11058          -2.51
ENSG00000210107  J01415.6       72     75946     35617          -1.09      2191       455          -2.27
ENSG00000210112  J01415.7       68    171524     97116          -0.82     67897     22503          -1.59
ENSG00000210117  J01415.8       68     11418      7479          -0.61      7944      2424          -1.71
ENSG00000210127  J01415.9       69      1932      1427          -0.44     13615      3971          -1.78
ENSG00000210135  J01415.10      73     20509     12667          -0.70      1864       196          -3.25
ENSG00000210140  J01415.11      66     12550      7616          -0.72     77355     13629          -2.50
ENSG00000210144  J01415.12      66      9804      5234          -0.91     74448     11999          -2.63
ENSG00000210151  J01415.13      69      5078      1809          -1.49    NOTEST    NOTEST         NOTEST
ENSG00000210154  J01415.14      68      5943      3392          -0.81      1800       760          -1.24
ENSG00000210156  J01415.15      70     28619     32650           0.19      1734       345          -2.33
ENSG00000210164  J01415.16      68      5627      3232          -0.80      5572      1972          -1.50
ENSG00000210174  J01415.17      65      7569     10780           0.51     11149      4206          -1.41
ENSG00000210176  J01415.18      69     43092     28770          -0.58    150713     34863          -2.11
ENSG00000210184  J01415.19      59   1175590    395027          -1.57    735380    208681          -1.82
ENSG00000210191  J01415.20      71     67641     36817          -0.88     70081     14281          -2.29
ENSG00000210194  J01415.21      69    157602    115972          -0.44    603010    124182          -2.28
ENSG00000210195  J01415.22      66     71836     77279           0.11     19871      4777          -2.06
ENSG00000210196  J01415.23      68     45761     30983          -0.56    121678     15826          -2.94
ENSG00000211459  J01415.24     954       943       583          -0.69    HIDATA     29151           0.00
ENSG00000212907  MT-ND4L       297       412       141          -1.54      9230      1991          -2.21
ENSG00000228253  J01415.25     207       735       160          -2.20     36590      8531          -2.10

Page 30: Exploring the package  TopHat-CuffDiff

30

Ensembl ID       Gene       Type            Status  Level
ENSG00000198695  MT-ND6     protein_coding  KNOWN   3
ENSG00000198712  MT-CO2     protein_coding  KNOWN   3
ENSG00000198727  MT-CYB     protein_coding  KNOWN   3
ENSG00000198763  MT-ND2     protein_coding  KNOWN   3
ENSG00000198786  MT-ND5     protein_coding  KNOWN   3
ENSG00000198804  MT-CO1     protein_coding  KNOWN   3
ENSG00000198840  MT-ND3     protein_coding  KNOWN   3
ENSG00000198886  MT-ND4     protein_coding  KNOWN   3
ENSG00000198888  MT-ND1     protein_coding  KNOWN   3
ENSG00000198899  MT-ATP6    protein_coding  KNOWN   3
ENSG00000198938  MT-CO3     protein_coding  KNOWN   3
ENSG00000209082  J01415.1   Mt_tRNA         NOVEL   3
ENSG00000210049  J01415.2   Mt_tRNA         NOVEL   3
ENSG00000210077  J01415.3   Mt_tRNA         NOVEL   3
ENSG00000210082  J01415.4   Mt_rRNA         KNOWN   3
ENSG00000210100  J01415.5   Mt_tRNA         NOVEL   3
ENSG00000210107  J01415.6   Mt_tRNA         NOVEL   3
ENSG00000210112  J01415.7   Mt_tRNA         NOVEL   3
ENSG00000210117  J01415.8   Mt_tRNA         NOVEL   3
ENSG00000210127  J01415.9   Mt_tRNA         NOVEL   3
ENSG00000210135  J01415.10  Mt_tRNA         NOVEL   3
ENSG00000210140  J01415.11  Mt_tRNA         NOVEL   3
ENSG00000210144  J01415.12  Mt_tRNA         KNOWN   3
ENSG00000210151  J01415.13  Mt_tRNA         NOVEL   3
ENSG00000210154  J01415.14  Mt_tRNA         NOVEL   3
ENSG00000210156  J01415.15  Mt_tRNA         NOVEL   3
ENSG00000210164  J01415.16  Mt_tRNA         NOVEL   3
ENSG00000210174  J01415.17  Mt_tRNA         NOVEL   3
ENSG00000210176  J01415.18  Mt_tRNA         NOVEL   3
ENSG00000210184  J01415.19  Mt_tRNA         NOVEL   3
ENSG00000210191  J01415.20  Mt_tRNA         NOVEL   3
ENSG00000210194  J01415.21  Mt_tRNA         KNOWN   3
ENSG00000210195  J01415.22  Mt_tRNA         NOVEL   3
ENSG00000210196  J01415.23  Mt_tRNA         NOVEL   3
ENSG00000211459  J01415.24  Mt_rRNA         KNOWN   3
ENSG00000212907  MT-ND4L    protein_coding  KNOWN   3
ENSG00000228253  J01415.25  protein_coding  KNOWN   3

Page 31: Exploring the package  TopHat-CuffDiff

31

Cellular metabolism regulation (E2C slide)

[Diagram labels]
Warburg effect / Differentiation
• Proliferative cells, undifferentiated cells: biosynthesis efficiency
• Working cells, differentiated cells: energetic efficiency
• Glucose, pyruvate, lactate, O2, CO2; glycolysis (2 ATP); OXPHOS (36 ATP)
• Mitochondrial dysfunction: glucose, pyruvate, lactate; glycolysis (2 ATP); amino acids, nucleotides

Cells grown in different metabolic conditions might represent a unique way to distinguish the RNA subpopulations expressed in mitochondria (ncRNA and … miRNA?)

MCF7 is a breast cancer cell line able to grow in OXPHOS conditions

Page 32: Exploring the package  TopHat-CuffDiff

32

Experimental design

[Diagram labels]
• Stable MCF-7 cell lines (ATCC: HTB-22; AGB: CH3854): High Glucose, Low Glucose, OXPHOS (0 mM glucose)
• MCF7 grown in OXPHOS medium or HIGH Glucose medium for a minimum of 3 weeks: stable cell lines MCF7oxphos and MCF7High Gluc
• SHIFTS (D0 to D1): MCF7oxphos shifted to High Gluc; MCF7High Gluc shifted to OXPHOS
• Total cells and mito extraction, N = 3 to 4 independent batches
• Readouts: TLDA and RNA-seq (TLDA = microfluidic miRNA qPCR)

Page 33: Exploring the package  TopHat-CuffDiff

33

[Figure: exon diagram (Exon 1, Exon 2)]