cufflinks
![Page 1: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/1.jpg)
Cufflinks
Matt Paisner, Hua He, Steve Smith and Brian Lovett
![Page 2: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/2.jpg)
The Vision
• RNA-Seq can be used for transcript discovery and abundance estimation.
• What’s missing: algorithms that aren’t restricted by prior gene annotations (which are often incomplete) and that account for alternative transcription and splicing.
• Hence, Cufflinks.
![Page 3: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/3.jpg)
The Need
• Evidence of ambiguous assignment of isoforms: the authors had previously found changes in TSS/promoter use and in splice sites.
• Longer reads and paired-end reads alone do not do enough to resolve this.
![Page 4: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/4.jpg)
The Biology
• General assumption of randomization of reads
• Central Dogma
• Transcription Start Site (TSS)
• Splice site
• Isoform
![Page 5: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/5.jpg)
Central Dogma and Regulation
![Page 6: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/6.jpg)
Splicing
• Thisblahblahblahblahblahblahisblahblahimportant
• “This” “is” “important” - Exons
• “blah” - Introns (Intrusions)
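The word-play on this slide can be run as a toy: treat every run of "blah" as an intron and keep what is left as exons. This is only an illustration of the analogy; the function names are hypothetical, and real splicing is defined by splice-site signals in the sequence, not by a fixed filler word.

```python
import re

def split_exons(pre_mrna: str, intron: str = "blah") -> list[str]:
    """Split the toy transcript on runs of the intron word; the remaining pieces are exons."""
    pattern = "(?:" + re.escape(intron) + ")+"
    return [piece for piece in re.split(pattern, pre_mrna) if piece]

def splice(pre_mrna: str, intron: str = "blah") -> str:
    """Join the exons back together, which is what the mature transcript would contain."""
    return "".join(split_exons(pre_mrna, intron))

pre_mrna = "Thisblahblahblahblahblahblahisblahblahimportant"
exons = split_exons(pre_mrna)
mature = splice(pre_mrna)
print(exons)
print(mature)
```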
![Page 7: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/7.jpg)
Major Change 1
![Page 8: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/8.jpg)
Major Change 2
![Page 9: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/9.jpg)
Why it matters: Isoforms
• Not only different sizes, but different shapes
• Shape determines function
• Isoforms would map to the same section of the genome: undetected without Cufflinks
• Separating transcripts into isoforms gives a more realistic representation of what is happening
![Page 10: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/10.jpg)
![Page 11: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/11.jpg)
TopHat
Mapping short reads
Trapnell et al., Bioinformatics, 2009
![Page 12: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/12.jpg)
TopHat
• No genome reference annotations are needed
• The output of TopHat is the input of Cufflinks.
• Input: Reads and genome
• Output: Read mappings
• Short reads present computational challenges
o BOWTIE
![Page 13: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/13.jpg)
How does TopHat Work?!
Big Idea: “Exon Inference”!!
![Page 14: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/14.jpg)
Step 1: Initial Mapping via Bowtie
• Group 1: Mapped Reads (Segments)
• Group 2: Initially Unmapped (IUM) Reads
o possibly intron-spanning reads
• Based on Group 1, we want to get intron-spanning reads from Group 2
[Figure: mapped reads aligned to the reference]
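A minimal sketch of the two groups produced by the initial mapping. It assumes exact substring search as a stand-in for Bowtie (which is far more sophisticated and tolerates mismatches); the sequences and the function name `initial_mapping` are made up for illustration.

```python
def initial_mapping(reads, reference):
    """Partition reads into mapped reads (with position) and initially unmapped (IUM) reads."""
    mapped, ium = {}, []
    for read in reads:
        pos = reference.find(read)  # exact match only; a stand-in for a real aligner
        if pos >= 0:
            mapped[read] = pos
        else:
            ium.append(read)
    return mapped, ium

# Hypothetical toy genome: two exons separated by one intron.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTTTTCAG", "CCGGATTA"
reference = exon1 + intron + exon2

reads = [
    "CGTTGC",    # lies inside exon 1
    "GGATTA",    # lies inside exon 2
    "TGCACCGG",  # end of exon 1 joined to start of exon 2: spans the intron
]
mapped, ium = initial_mapping(reads, reference)
print("Group 1 (mapped):", mapped)
print("Group 2 (IUM):", ium)
```

The read that crosses the exon-exon junction does not occur contiguously in the genome, so it ends up in Group 2; the later steps try to recover exactly such reads.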
![Page 15: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/15.jpg)
Step 2: Generate Putative Exons
![Page 16: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/16.jpg)
Step 3: Look for Potential Splice Signals
Putative Exons
![Page 17: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/17.jpg)
Step 4: Seed-and-Extend
![Page 18: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/18.jpg)
![Page 19: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/19.jpg)
Cufflinks
Isoform/Transcript Detection and Quantification
Trapnell et al., Nature Biotech, 2010
![Page 20: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/20.jpg)
Step 5: Identify Compatible Reads
Two reads are compatible if their overlap contains the exact same implied introns (or none). If two reads are not compatible they are incompatible.
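The compatibility rule can be sketched as follows. The sketch assumes each read is given as a list of aligned blocks `(start, end)` in half-open genome coordinates, so that the gaps between consecutive blocks are the implied introns. The representation and function names are assumptions for illustration, not the Cufflinks data structures, and non-overlapping reads are treated here as trivially compatible.

```python
def implied_introns(blocks):
    """Gaps between consecutive aligned blocks are the introns a read implies."""
    return [(blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1)]

def clipped(introns, lo, hi):
    """Restrict introns to the window [lo, hi), dropping those that fall outside it."""
    out = set()
    for start, end in introns:
        s, e = max(start, lo), min(end, hi)
        if s < e:
            out.add((s, e))
    return out

def compatible(a, b):
    """Two reads are compatible if they imply the same introns (or none) where they overlap."""
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][1], b[-1][1])
    if lo >= hi:  # no overlap: nothing to disagree about (an assumption of this sketch)
        return True
    return clipped(implied_introns(a), lo, hi) == clipped(implied_introns(b), lo, hi)

x = [(0, 10), (20, 30)]   # spliced read, intron at 10..20
y = [(5, 10), (20, 35)]   # same intron
z = [(5, 25)]             # reads straight through the intron
w = [(22, 40)]            # overlaps x only downstream of the intron

print(compatible(x, y), compatible(x, z), compatible(x, w))
```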
![Page 21: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/21.jpg)
Step 6: Less BIOLOGY, and NOW it is time for some GRAPH THEORY…
“We emphasize that the definition of a transcription locus is not biological…” - Authors
![Page 22: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/22.jpg)
Step 6: Create Overlap Graph
• Connect compatible reads in order
• Create a DAG
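A rough sketch of the overlap graph. It assumes reads are dictionaries with `start`, `end` and a list of `introns` (a hypothetical representation), sorts them by start coordinate, and adds an edge from an earlier read to a later one when the two overlap and are compatible. Because edges always point forward in the sorted order, the result is acyclic by construction. Details of the real method, such as handling of paired ends or contained reads, are left out.

```python
def introns_within(read, lo, hi):
    """Introns of a read restricted to the window [lo, hi)."""
    return {(max(s, lo), min(e, hi)) for s, e in read["introns"] if max(s, lo) < min(e, hi)}

def overlap_compatible(a, b):
    """True if the reads overlap and imply the same introns inside the overlap."""
    lo, hi = max(a["start"], b["start"]), min(a["end"], b["end"])
    return lo < hi and introns_within(a, lo, hi) == introns_within(b, lo, hi)

def overlap_graph(reads):
    """Adjacency lists of a DAG: edges run from earlier-starting to later-starting reads."""
    order = sorted(range(len(reads)), key=lambda i: (reads[i]["start"], reads[i]["end"]))
    edges = {i: [] for i in order}
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if overlap_compatible(reads[i], reads[j]):
                edges[i].append(j)
    return edges

reads = [
    {"start": 0, "end": 30, "introns": [(10, 20)]},  # 0: spliced
    {"start": 5, "end": 25, "introns": []},          # 1: reads through the intron
    {"start": 22, "end": 40, "introns": []},         # 2
    {"start": 35, "end": 50, "introns": []},         # 3
]
edges = overlap_graph(reads)
print(edges)
```

Reads 0 and 1 disagree about the intron, so no edge joins them; each can still continue into read 2, which is how alternative isoforms show up as alternative paths.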
![Page 23: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/23.jpg)
A path in this graph corresponds to a transcript isoform
![Page 24: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/24.jpg)
Theory
1. Solving minimum path cover in the overlap graph gives the fewest transcripts (isoforms) necessary to explain the reads.
2. Solve minimum path cover by finding the largest set of individual reads such that no two are compatible.
3. By Dilworth's theorem, this reduces to finding a maximum matching in a bipartite graph.
![Page 25: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/25.jpg)
Step 8: Convert a DAG into a Bipartite Graph
![Page 26: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/26.jpg)
Step 9: Find a Maximum Matching in the Bipartite Graph via the Bipartite Matching Algorithm
BIPARTITE-MATCHING algorithm: find an augmenting path via BFS and add it to the matching; repeat until no augmenting path remains.
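Steps 8 and 9 can be sketched together. The sketch below takes a small DAG as adjacency lists, forms its transitive closure, treats each node as a "left" and a "right" copy (the bipartite graph), and grows a matching one augmenting path at a time. Two caveats: it searches for augmenting paths depth-first rather than with the BFS named on the slide, because that is shorter to write, and it ignores any weighting of edges. The number of chains in the cover is the number of nodes minus the size of the matching. All names are hypothetical.

```python
def transitive_closure(edges):
    """For every node, the set of nodes reachable from it in the DAG."""
    closure = {}
    def reach(u):
        if u not in closure:
            closure[u] = set()
            for v in edges[u]:
                closure[u] |= {v} | reach(v)
        return closure[u]
    for u in edges:
        reach(u)
    return closure

def maximum_matching(adj):
    """Augmenting-path bipartite matching; returns right node -> matched left node."""
    match_right = {}
    def augment(u, seen):
        for v in sorted(adj[u]):
            if v in seen:
                continue
            seen.add(v)
            # Use v if it is free, or if its current partner can be re-matched elsewhere.
            if v not in match_right or augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False
    for u in adj:
        augment(u, set())
    return match_right

def minimum_path_cover(edges):
    """Chains covering every node exactly once; their count is n minus the matching size."""
    match_right = maximum_matching(transitive_closure(edges))
    successor = {u: v for v, u in match_right.items()}
    paths = []
    for u in edges:
        if u in match_right:  # u is somebody's successor, so it does not start a chain
            continue
        path = [u]
        while path[-1] in successor:
            path.append(successor[path[-1]])
        paths.append(path)
    return paths

edges = {0: [2], 1: [2], 2: [3], 3: []}  # toy overlap graph
paths = minimum_path_cover(edges)
print(len(paths), paths)
```

In this toy graph reads 0 and 1 are mutually incompatible, so at least two transcripts are needed, and the cover indeed contains two chains.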
![Page 27: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/27.jpg)
A path in this graph corresponds to a transcript isoform
![Page 28: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/28.jpg)
Projective normalization underestimates expression
[Figure: isoform a and isoform b; project all isoforms into genome coordinates]
R reads total, r reads for the gene:
- ra for isoform a
- rb for isoform b
[Equations not preserved in the transcript: “… but … so …”]
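The equations on this slide did not survive extraction, so the following is a reconstruction of the usual argument under stated assumptions, not the slide's own formula. If the gene's r = ra + rb reads are divided by the projected length l (the union of both isoforms in genome coordinates), and l is at least as long as either isoform, then r / (l·R) can be no larger than ra / (la·R) + rb / (lb·R). The lengths and counts below are invented for illustration.

```python
def density(read_count, length, total_reads):
    """Reads per base per sequenced read: a simple length- and depth-normalized value."""
    return read_count / (length * total_reads)

# Hypothetical numbers, chosen only to illustrate the inequality.
l_a, l_b = 1000, 1500      # isoform lengths
l_proj = 2000              # projected (union) length; never shorter than either isoform
r_a, r_b = 300, 450        # reads from each isoform
R = 1_000_000              # total reads in the experiment

r = r_a + r_b
projected = density(r, l_proj, R)
per_isoform = density(r_a, l_a, R) + density(r_b, l_b, R)

print(f"projected:   {projected:.2e}")
print(f"per isoform: {per_isoform:.2e}")
```

With these numbers the projected value is 3.75e-07 against a per-isoform total of 6.00e-07, so collapsing the isoforms understates expression.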
![Page 29: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/29.jpg)
How should expression levels be estimated?
• A-B are distinguished by the presence of splice junction (a) or (b).
• A-C are distinguished by the presence of splice junction (a) and change in UTR
• B-C are distinguished by the presence of splice junction (b) and change in UTR
[Figure labels: (a), (b)]
![Page 30: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/30.jpg)
How should expression levels be estimated?
• Longer transcripts contain more reads.
• Reads that could have originated from multiple transcripts are informative.
• Relative abundance estimation requires “discriminatory reads”.
[Figure labels: (a), (b)]
![Page 31: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/31.jpg)
A model for RNA-Seq
• ρ = Transcript proportions for assignment of reads to transcripts
• L = Likelihood of this assignment
• R = all reads
![Page 32: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/32.jpg)
A model for RNA-Seq
• T = All transcripts
![Page 33: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/33.jpg)
A model for RNA-Seq
Define:
• Expected possible positions for an arbitrary fragment in Transcript t
• F(i) = pr(random fragment has length i)
• l(t) = Full length of transcript t
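The formula for the first quantity is missing from the transcript. One natural reading, and an assumption here, is an effective length: a fragment of length i can start at l(t) - i + 1 positions in transcript t, so the expected number of positions is the sum over i of F(i)·(l(t) - i + 1). The sketch below computes that sum for a made-up fragment length distribution; the function name and the numbers are illustrative only.

```python
def effective_length(transcript_length, frag_len_prob):
    """Expected number of start positions for a random fragment in a transcript.

    frag_len_prob maps fragment length i to F(i); fragments longer than the
    transcript cannot be placed and contribute nothing.
    """
    return sum(
        prob * (transcript_length - i + 1)
        for i, prob in frag_len_prob.items()
        if i <= transcript_length
    )

# Hypothetical fragment length distribution: half the fragments are 200 bp, half 300 bp.
F = {200: 0.5, 300: 0.5}

long_tx = effective_length(1000, F)   # 0.5 * 801 + 0.5 * 701
short_tx = effective_length(250, F)   # only 200 bp fragments fit: 0.5 * 51
print(long_tx, short_tx)
```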
![Page 34: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/34.jpg)
A model for RNA-Seq
![Page 35: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/35.jpg)
A model for RNA-Seq
• I_t(r) = Implied length of r’s fragment if r is assigned to transcript t
• Recall: F(i) = pr(fragment length = i)
![Page 36: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/36.jpg)
Projective normalization underestimates expression
![Page 37: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/37.jpg)
A model for RNA-Seq
• Now we have a likelihood function in terms of ρ, the distribution of reads among transcripts, which we can maximize.
• Non-negative linear model
![Page 38: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/38.jpg)
Inference with the sequencing model
• The likelihood function is concave, so it can be optimized with the EM algorithm.
• Asymptotic MLE theory leads to a covariance matrix for the estimator in the form of the inverse of the observed Fisher information matrix.
• Importance sampling from the posterior distribution is used to estimate the abundances (as the posterior expectation) and 95% confidence intervals for the estimates.
• This approach extends the log linear model of H. Jiang and W. Wong, Bioinformatics 2009 to a linear model for paired end reads.
• For more background see Li et al., Bioinformatics, 2010 and Bullard et al., BMC Bioinformatics, 2010.
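To make the EM step concrete, here is a deliberately simplified sketch, not the Cufflinks implementation. It assumes we already have, for each read, the probability of observing it given each transcript (zero when the read is incompatible), and it estimates only the proportion of reads arising from each transcript. Fragment length terms, conversion to length-normalized abundances, and the confidence intervals mentioned above are all omitted. The name `em_proportions` and the toy data are hypothetical.

```python
def em_proportions(cond_prob, iterations=200):
    """EM for mixture proportions.

    cond_prob[r][t] is P(read r | transcript t). Every read is assumed to be
    compatible with at least one transcript.
    """
    n_reads, n_tx = len(cond_prob), len(cond_prob[0])
    alpha = [1.0 / n_tx] * n_tx  # start from equal proportions
    for _ in range(iterations):
        totals = [0.0] * n_tx
        for row in cond_prob:
            # E-step: share each read among transcripts in proportion to alpha * likelihood.
            weights = [a * c for a, c in zip(alpha, row)]
            norm = sum(weights)
            for t, w in enumerate(weights):
                totals[t] += w / norm
        # M-step: new proportions are the average fractional assignments.
        alpha = [x / n_reads for x in totals]
    return alpha

# Toy locus with two isoforms: 6 reads fit only A, 2 fit only B, 4 fit both equally.
cond_prob = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 2 + [[1.0, 1.0]] * 4
alpha = em_proportions(cond_prob)
print([round(a, 4) for a in alpha])
```

The ambiguous reads end up split according to the evidence from the discriminatory reads, which is why both kinds of read are informative.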
![Page 39: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/39.jpg)
Utility of Cufflinks
• mRNA as proxy for gene expression & action
• Control points
o transcriptional vs
o post-transcriptional
• Does isoform-level discovery & quantification matter?
o Apparently, yes
o Putatively discovered about 12K new isoforms while recovering about 13K known
o Plus other stuff…
![Page 40: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/40.jpg)
The skeletal myogenesis transcriptome
RNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation
[Figure: time course with samples at -24, 60, 120 and 168 hours; differentiation starts at 0 hours; myocytes fuse into myotubes]
Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832
Per-sample counts, in the order they appear on the slide:
• Reads: 84,369,078; 140,384,062; 82,138,212; 123,575,666
• Alignments: 66,541,668; 103,681,081; 47,431,271; 89,162,512
• Alignments to junctions: 10,754,363; 19,194,697; 9,015,806; 17,449,848
• Transfrags: 58,008; 69,716; 55,241; 63,664
Slide courtesy of Hector Corrada Bravo
![Page 41: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/41.jpg)
Validation
![Page 42: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/42.jpg)
Validation
![Page 43: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/43.jpg)
Projective normalization underestimates expression
Slide courtesy of Hector Corrada Bravo
![Page 44: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/44.jpg)
Discovery is necessary for accurate abundance estimates
Slide courtesy of Hector Corrada Bravo
![Page 45: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/45.jpg)
Some Questions…
• Do isoforms of a given gene have interesting temporal patterns?
o Increasing, decreasing, more complex…
• What does this mean biologically?
• What about transcriptional versus post-transcriptional regulation?
o Differential transcription
o Differential splicing
![Page 46: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/46.jpg)
Dynamics of Myc expression
Slide courtesy of Hector Corrada Bravo
![Page 47: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/47.jpg)
Overloading Metric using Jensen-Shannon Divergence
Metric: One-sided t-test under the null hypothesis that there is no difference in abundance; Type I errors controlled with Benjamini-Hochberg correction (FDR)
[Equation labels: Entropy of Average, Average Entropy]
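The two labels on this slide are the two terms of the Jensen-Shannon divergence: the entropy of the average distribution minus the average of the individual entropies. A small sketch, assuming each distribution is a list of relative isoform abundances that sums to 1; the function names are illustrative. The square root is included because the square root of the Jensen-Shannon divergence is known to satisfy the properties of a metric.

```python
import math

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(distributions):
    """Entropy of the average distribution minus the average of the entropies."""
    n = len(distributions)
    average = [sum(column) / n for column in zip(*distributions)]
    return entropy(average) - sum(entropy(p) for p in distributions) / n

def js_distance(distributions):
    """Square root of the divergence; clamps tiny negative rounding errors to zero."""
    return math.sqrt(max(js_divergence(distributions), 0.0))

# Relative abundances of two isoforms at two time points (made-up values).
same = js_divergence([[0.3, 0.7], [0.3, 0.7]])        # identical: no change
switched = js_divergence([[1.0, 0.0], [0.0, 1.0]])    # complete isoform switch
print(same, switched, js_distance([[0.9, 0.1], [0.5, 0.5]]))
```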
![Page 48: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/48.jpg)
Regulatory Overloading
Differential splicing
Differential TSS preference
231
101
17
Fibronectin, Tropomyosin 1, Mef2d…
Fhl3, Fhl1, Myl1…
# Genes (FDR < 0.05)
![Page 49: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/49.jpg)
Dynamics of Myc expression
d( , )
Slide courtesy of Hector Corrada Bravo
![Page 50: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/50.jpg)
New TSS = New Points of Regulation
TSS=Transcription Start Site
What would a “collapsed” RNA-seq alignment look like? Microarray?
![Page 51: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/51.jpg)
Questions?
![Page 52: Cufflinks](https://reader035.vdocument.in/reader035/viewer/2022070421/56816142550346895dd0b474/html5/thumbnails/52.jpg)
I am the DNA, and I want a protein!
The DNA wants a protein.
Transcription
Translation
mRNA
Protein