-
Vijayachitra Modhukur BIIT
Next generation sequencing (NGS)-RNA sequencing
1 11/20/13 Bioinformatics course
-
NGS lectures
11/20/13 Bioinformatics course 2
Genomics
Transcriptomics
Proteomics
Epigenomics
-
Recap
11/20/13 Bioinformatics course 4
-
Sequencing
5 11/20/13 Bioinformatics course
-
Different generations of sequencing
6 11/20/13 Bioinformatics course
-
Second generation sequencing
7 11/20/13 Bioinformatics course
-
NGS platforms
11/20/13 Bioinformatics course 8
Leading platforms (specifications as of summer 2008)
Platform      454          Solexa/Illumina      SOLiD (ABI)
Bp per run    400 Mb       2-3 Gb               3-6 Gb
Read length   250-400 bp   35-50 (70-100) bp    35-50 bp
Run time      10 hr        2.5 days             5 days
Download      20 min       27 hr (44 min)       ~1 day
Analysis      2-5 hr       2 days               2-3 days
Files         20-50 Gb     1 T                  1 T
With 3730s, ~60 Mb per year.
-
Massive amount of sequenced data
Bioinformatics course 11/20/13 9
-
Sequence alignment
De novo alignment
Reference alignment
10 11/20/13 Bioinformatics course
-
Short read mapping (de novo): the shortest common superstring problem (SSP)
11
• Let $f_1, f_2, \ldots, f_k$ be words in $\Sigma^*$.
• We want to find the shortest string $g \in \Sigma^*$ such that every $f_i$ is a substring of $g$.
• Example: suppose we have the set of strings $f_1$ = ACGTA, $f_2$ = CTTGA, $f_3$ = ACTT, $f_4$ = GTAAC.
• Find the shortest common superstring of these 4 strings.
So $\Sigma^*$ is the "free language" on the alphabet $\Sigma$. Note: if you like monoids, $\Sigma^*$ has the algebraic structure of a monoid.
Definition 1. Let $f, g \in \Sigma^*$, so that we write
$f = s_1 s_2 \cdots s_n, \quad g = t_1 t_2 \cdots t_m$
where $s_i, t_j \in \Sigma$ for all $i$ and $j$. Then $f$ is a substring of $g$ if there exists an index $k \geq 1$ such that $s_1 s_2 \cdots s_n = t_k t_{k+1} \cdots t_{n+k-1}$. Conversely, $g$ is a superstring of $f$.
Example 1. Let $f$ = ACTG and $g$ = AAACTGCA. Then $f$ is a substring of $g$, because $g$ = AA[ACTG]CA (here $k = 3$).
The Shortest Common Superstring Problem
Let $f_1, f_2, \ldots, f_k$ be words in $\Sigma^*$. We want to find the shortest string $g \in \Sigma^*$ such that each $f_i$ is a substring of $g$. This is known as the Shortest Common Superstring Problem (SSP). We practice a solution to this problem informed by Ockham's Razor: we assume that the best reconstruction is also the simplest.
Example 2. Say we have this set of reads: $f_1$ = ACGTA, $f_2$ = CTTGA, $f_3$ = ACTT, $f_4$ = GTAAC. Find the shortest common superstring of these four strings.
Figure 2: The SSP for $f_1, \ldots, f_4$
By treating these reads like puzzle pieces, we can put the four reads into this superstring, which in this case is of shortest possible length (by inspection). We will soon see that this particular common superstring can be constructed using an algorithm, although the algorithm has some issues.
It turns out that this problem is difficult to solve in practice:
Theorem 3 (Gallant 1980). The SSP is NP-complete.
While we will not deal with complexity theory in detail in this class, we can take this to mean that the SSP is provably hard in a nasty way. This problem is related to graph theory.
11/20/13 Bioinformatics course
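A minimal sketch of one common heuristic for the SSP: greedy merging, which repeatedly joins the two fragments with the longest suffix/prefix overlap. Whether this is the algorithm the notes above allude to is an assumption, and the function names `overlap` and `greedy_superstring` are illustrative only. Greedy merging always returns a common superstring, but it is not guaranteed to return the shortest one.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    # Drop exact duplicates (order-preserving) and reads contained in other reads.
    frags = list(dict.fromkeys(reads))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


# The four reads from Example 2.
reads = ["ACGTA", "CTTGA", "ACTT", "GTAAC"]
superstring = greedy_superstring(reads)
print(superstring, len(superstring))
```

For these four reads the heuristic happens to find a superstring of length 11; on other inputs the greedy choice can miss the optimum, which is one reason assemblers use graph-based formulations.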
-
Reference alignment
11/20/13 Bioinformatics course 12
Find locations where short read is identical to reference genome
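As a naive illustration of this slide, the sketch below lists every position where a read matches a reference string exactly. It is a toy under stated assumptions: the function name `find_exact_matches` and the example strings are made up, and real aligners rely on indexes (spaced seeds, BWT) and tolerate mismatches rather than scanning the genome like this.

```python
def find_exact_matches(read, reference):
    """Return 0-based start positions where read occurs exactly in reference."""
    if not read:
        return []
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping occurrences are also found.
        start = reference.find(read, start + 1)
    return positions


reference = "AAACTGCAACTG"  # hypothetical toy reference
print(find_exact_matches("ACTG", reference))
```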
-
NGS Analysis
13 11/20/13 Bioinformatics course
-
Data analysis
cpu/memory intensive
14 11/20/13 Bioinformatics course
-
Quality scores
Each base from a sequencer comes with a quality score
Base-calling error probabilities
Phred quality score: $Q = -10 \log_{10} P$, where $P$ is the base-calling error probability
A higher quality score indicates a smaller probability of error
15
http://www.illumina.com/truseq/quality_101/quality_scores.ilmn
11/20/13 Bioinformatics course
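A short sketch of the Phred relation, using the standard definition $Q = -10 \log_{10} P$. The helper names `phred_from_error` and `error_from_phred` are illustrative, not part of any sequencing library.

```python
import math


def phred_from_error(p):
    """Phred quality score for a base-calling error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Error probability implied by a Phred quality score q."""
    return 10.0 ** (-q / 10.0)


# Each step of 10 in Q corresponds to a tenfold drop in error probability.
for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```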
-
Quality scores
16
http://www.illumina.com/truseq/quality_101/quality_scores.ilmn
11/20/13 Bioinformatics course
-
File formats
17 11/20/13 Bioinformatics course
-
FASTQ
Raw data
18 11/20/13 Bioinformatics course
-
Alignment methods
11/20/13 Bioinformatics course 19
Reference assembly
• Spaced seed
• BWT (Burrows-Wheeler transform)
De novo assembly
• Greedy assemblers
• Graph based: overlap-layout-consensus
• Graph based: de Bruijn graph
-
RNA sequencing
11/20/13 Bioinformatics course 20
-
Transcription
11/20/13 Bioinformatics course 21
-
RNA world hypothesis
11/20/13 Bioinformatics course 22
-
What is RNA-seq?
Use of high-throughput sequencing technologies to assess the RNA content of a sample.
Journal of Biomedicine and Biotechnology, p. 11
[Figure labels: exon, intron, sequence read, signal from annotated exons, non-exonic signal]
Figure 5: Mapping and quantification of the signal. RNA-seq experiments produce short reads sequenced from processed mRNAs. When a reference genome is available the reads can be mapped on it using efficient alignment software. Classical alignment tools will accurately map reads that fall within an exon, but they will fail to map spliced reads. To handle such problem suitable mappers, based either on junctions library or on more sophisticated approaches, need to be considered. After the mapping step annotated features can be quantified.
In order to derive a quantitative expression for annotated elements (such as exons or genes) within a genome, the simplest approach is to provide the expression as the total number of reads mapping to the coordinates of each annotated element. In the classical form, such method weights all the reads equally, even though they map the genome with different stringency. Alternatively, gene expression can be calculated as the sum of the number of reads covering each base position of the annotated element; in this way the expression is provided in terms of base coverage. In both cases, the results depend on the accuracy of the used gene models and the quantitative measures are a function of the number of mapped reads, the length of the region of interest and the molar concentration of the specific transcript. A straightforward solution to account for the sample size effect is to normalize the observed counts for the length of the element and the number of mapped reads. In [37], the authors proposed the Reads Per Kilobase per Million of mapped reads (RPKM) as a quantitative normalized measure for comparing both different genes within the same sample and differences of expression across biological conditions. In [84], the authors considered two alternative measures of relative expression: the fraction of transcripts and the fraction of nucleotides of the transcriptome made up by a given gene or isoform.
Although apparently easy to obtain, RPKM values can have several differences between software packages, hidden at first sight, due to the lack of a clear documentation of the analysis algorithms used. For example ERANGE [37] uses a union of known and new exon models to aggregate reads and determines a value for each region that includes spliced reads and assigned multireads too, whereas [30, 40, 81, 90] are restricted to known or prespecified exons/gene models. However, as noticed in [91], several experimental issues influence the RPKM quantification, including the integrity of the input RNA, the extent of ribosomal RNA remaining in the sample, the size selection steps and the accuracy of the gene models used.
In principle, RPKMs should reflect the true RNA concentration; this is true when samples have relatively uniform sequence coverage across the entire gene model. The problem is that all protocols currently fall short of providing the desired uniformity, see for example [37], where the Kolmogorov-Smirnov statistics is used to compare the observed reads distribution on each selected exon model with the theoretical uniform one. Similar conclusions are also illustrated in [57, 58], among others.
Additionally, it should be noted that RPKM measure should not be considered as the panacea for all RNA-Seq experiments. Despite the importance of the issue, the expression quantification did not receive the necessary attention from the community and in most of the cases the choice has been done regardless of the fact that the main question is the detection of differentially expressed elements. Regarding this point in [92] it is illustrated the inherent bias in transcript length that affect RNA-Seq experiments. In fact the total number of reads for a given transcript is roughly proportional to both the expression level and the length of the transcript. In other words, a long transcript will have more reads mapping to it compared to a short gene of similar expression. Since the power of an experiment is proportional to the sampling size, there will be more statistical power
slides from Halisha Holloway 11/20/13 Bioinformatics course 23
-
RNA-seq vs microarray
RNA-seq
• ID novel genes, transcripts, & exons
• Greater dynamic range
• Less bias due to genetic variation
• Repeatable
• No species-specific primer/probe design
• More accurate relative to qPCR
• Many more applications
Microarray
• Well vetted QC and analysis methods
• Well characterized biases
• Quick turnaround from established core facilities
• Currently less expensive
11/20/13 Bioinformatics course 24
-
RNA-Seq vs microarray
11/20/13 Bioinformatics course 25
-
Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome
• Ribosome profiling to measure translation
11/20/13 Bioinformatics course 26
-
Why do an RNA-seq experiment?
Skelly et al. 2011 11/20/13 Bioinformatics course 27
-
Why do an RNA-seq experiment?
[Figure labels: Pluripotent Stem Cell, Cardiogenic Mesoderm, Cardiac Precursors, Cardiomyocytes]
11/20/13 Bioinformatics course 30
-
RNA-seq protocol
11/20/13 Bioinformatics course 32
-
11/20/13 Bioinformatics course 33
RNA-Seq protocol
Sample RNA -> (reverse transcription + PCR) -> amplified cDNA -> (fragmentation) -> cDNA fragments -> (sequencing machine) -> reads
Example reads:
CCTTCNCACTTCGTTTCCCAC
TTTTTNCAGAGTTTTTTCTTG
GAACANTCCAACGCTTGGTGA
GGAAANAAGACCCTGTTGAGC
CCCGGNGATCCGCTGGGACAA
GCAGCATATTGATAGATAACT
CTAGCTACGCGTACGCGATCG
CATCTAGCATCGCGTTGCGTT
CCCGCGCGCTTAGGCTACTCG
TCACACATCTCTAGCTAGCAT
CATGCTAGCTATGCCTATCTA
CACCCCGGGGATATATAGGAT
-
RNA-seq data
11/20/13 Bioinformatics course 34
@HWUSI-EAS1789_0001:3:2:1708:1305#0/1
CCTTCNCACTTCGTTTCCCACTTAGCGATAATTTG
+HWUSI-EAS1789_0001:3:2:1708:1305#0/1
VVULVBVYVYZZXZZ\ee[a^b`[a\a[\\a^^^\
@HWUSI-EAS1789_0001:3:2:2062:1304#0/1
TTTTTNCAGAGTTTTTTCTTGAACTGGAAATTTTT
+HWUSI-EAS1789_0001:3:2:2062:1304#0/1
a__[\Bbbb`edeeefd`cc`b]bffff`ffffff
@HWUSI-EAS1789_0001:3:2:3194:1303#0/1
GAACANTCCAACGCTTGGTGAATTCTGCTTCACAA
+HWUSI-EAS1789_0001:3:2:3194:1303#0/1
ZZ[[VBZZY][TWQQZ\ZS\[ZZXV__\OX`a[ZZ
@HWUSI-EAS1789_0001:3:2:3716:1304#0/1
GGAAANAAGACCCTGTTGAGCTTGACTCTAGTCTG
+HWUSI-EAS1789_0001:3:2:3716:1304#0/1
aaXWYBZVTXZX_]Xdccdfbb_\`a\aY_^]LZ^
@HWUSI-EAS1789_0001:3:2:5000:1304#0/1
CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA
+HWUSI-EAS1789_0001:3:2:5000:1304#0/1
aaaaaBeeeeffffehhhhhhggdhhhhahhhadh
Each read record has three parts: name, sequence, qualities
1 Illumina (GAIIX) lane: ~20 million reads
Paired-end reads: read1 and read2
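A minimal sketch of reading records laid out like the ones on this slide (four lines per read: name, sequence, "+" line, qualities). The function names `parse_fastq` and `phred_scores` are illustrative. The quality offset of 64 is an assumption, chosen because the quality characters on the slide run up to 'h'; many newer files use an offset of 33 instead, so check the encoding before decoding real data.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) tuples from an iterable of FASTQ lines."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=64):
    """Decode a quality string into integer scores. offset=64 is an assumption."""
    return [ord(c) - offset for c in qual]


# Last record shown on the slide.
record = [
    "@HWUSI-EAS1789_0001:3:2:5000:1304#0/1",
    "CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA",
    "+HWUSI-EAS1789_0001:3:2:5000:1304#0/1",
    "aaaaaBeeeeffffehhhhhhggdhhhhahhhadh",
]
for name, seq, qual in parse_fastq(record):
    scores = phred_scores(qual)
    print(name, len(seq), min(scores), max(scores))
```

Under this assumed offset, the uncalled base N at position 6 carries the lowest score in the record.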
-
Coverage
11/20/13 Bioinformatics course 35
Coverage = Number of sequenced bases / Size of the original genome
Number of sequenced bases = Number of reads × Length of the reads
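The coverage relation can be checked with a few lines of arithmetic. This is a sketch with made-up numbers; the function name `coverage` and the example genome size are hypothetical.

```python
def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average coverage = (number of reads x read length) / genome size."""
    sequenced_bases = num_reads * read_length_bp
    return sequenced_bases / genome_size_bp


# Hypothetical example: 1 million reads of 100 bp on a 5 Mb genome.
print(coverage(1_000_000, 100, 5_000_000))
```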
-
Some things to consider in experimental design
11/20/13 Bioinformatics course 36
-
Plan it well: experimental design
• Biological replicates
• Reference genome? Good gene annotation?
• Read depth
• Read length
• Paired vs. single-end
• Technical variation
• Biological variation
11/20/13 Bioinformatics course 37
-
Plan it well: experimental design
[Figure: Robustness of transcript identification as input data are removed. x-axis: fraction of total number of reads in jackknifed data set (0.0 to 1.0); y-axis: fraction of transcripts with non-zero FPKM, relative to 100% (0.0 to 1.0); point labels: 10%, 5%, 2%, 1%, 0.1%; legend: Cufflinks, USeq-DESeq]
11/20/13 Bioinformatics course 39
-
How much data do we need?
• ~15-20K genes expressed in a tissue or cell line.
• Genes are on average 3 kb.
• For 1x coverage using 100 bp reads, we would need 600K sequence reads.
• In reality, we need MUCH higher coverage to accurately estimate gene expression levels: 30-50 million reads.
11/20/13 Bioinformatics course 40
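The 600K figure follows from the numbers on the slide, taking the upper estimate of 20K expressed genes. The sketch below redoes that arithmetic; the variable names are illustrative, and the depth computed for 30 million reads is only an average that ignores the very uneven expression of real genes.

```python
genes_expressed = 20_000        # upper end of the ~15-20K estimate
mean_gene_length_bp = 3_000     # 3 kb average gene length
read_length_bp = 100

# Total expressed sequence to cover once.
transcriptome_bp = genes_expressed * mean_gene_length_bp
reads_for_1x = transcriptome_bp // read_length_bp
print(transcriptome_bp, reads_for_1x)

# Average depth if 30 million reads were spread evenly (they are not in practice).
average_depth_30m = 30_000_000 * read_length_bp / transcriptome_bp
print(average_depth_30m)
```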
-
Plan it well: experimental design
Read length
Number of unique sequences = 4^(read length)
Read length    Unique sequences
25             1.1 x 10^15
50             1.3 x 10^30
100            1.6 x 10^60
~60 million coding bases in vertebrate genome
11/20/13 Bioinformatics course 41
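The table values can be reproduced directly, since each position in a read can be one of four bases. A small sketch; `unique_sequences` is an illustrative helper name.

```python
def unique_sequences(read_length):
    """Number of distinct DNA sequences of a given length (4 possible bases per position)."""
    return 4 ** read_length


for length in (25, 50, 100):
    # Scientific notation with one decimal, matching the table's precision.
    print(length, f"{unique_sequences(length):.1e}")
```

Even at 25 bp the number of possible sequences far exceeds the ~60 million coding bases quoted on the slide.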
-
Plan it well: experimental design
• Biological replicates
• Reference genome? Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end
11/20/13 Bioinformatics course 42
-
Power of paired-end reads
• Huge impact on read mapping
• Pairs give two locations to determine whether a read is unique
• Critical for estimating transcript-level abundance
• Increases the number of splice junction spanning reads
11/20/13 Bioinformatics course 43
-
Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones.
Auer P L , and Doerge R W Genetics 2010;185:405-416
Copyright © 2010 by the Genetics Society of America 11/20/13 Bioinformatics course 44
-
RNA-seq pipeline
11/20/13 Bioinformatics course 45
-
Typical RNA-seq experiment
11/20/13 Bioinformatics course 46
-
11/20/13 Bioinformatics course 47
RNA-seq informatics workflow
1. QC and genome mapping
2. Splice junction fragments
3. Predict novel junctions/exons
4. Counts
5. Normalize
6. Differential expression
7. Gene lists
-
Quality control
11/20/13 Bioinformatics course 48
-
QC: Raw Data Sequence call quality
11/20/13 Bioinformatics course 49
-
QC: Raw Data Sequence bias
11/20/13 Bioinformatics course 50
-
QC: Raw Data Duplication level
11/20/13 Bioinformatics course 51
-
Mapping
11/20/13 Bioinformatics course 52
-
Mapping
11/20/13 Bioinformatics course 53
Align read to the genome
• Simple for genomic sequences
• Difficult for transcripts with splice junctions
-
Junction reads
11/20/13 Bioinformatics course 54
-
TopHat pipeline
11/20/13 Bioinformatics course 55
-
Alternative splicing
11/20/13 Bioinformatics course 56
-
Alternative splicing
11/20/13 Bioinformatics course 57
-
Cufflinks
11/20/13 Bioinformatics course 58
-
RNA-seq complete pipeline
11/20/13 Bioinformatics course 59
-
RNA seq-summarization
11/20/13 Bioinformatics course 60
-
Normalization aims
11/20/13 Bioinformatics course 61
Comparable across features (genes, isoforms, etc.)
Comparable across different samples (libraries)
• Between samples (libraries)
• Within samples (libraries)
Easily interpretable
-
Within library normalization
11/20/13 Bioinformatics course 62
Allows quantification of the expression level of each gene relative to the other genes within the library
Longer transcripts have higher read counts (at the same expression level)
Widely used: RPKM (Reads Per Kilobase per Million mapped reads)
-
RPKM-example
11/20/13 Bioinformatics course 63
No. of mapped reads = 3
Length of transcript = 300 bp
Total no. of reads = 10,000
RPK = 3 / (300/1000) = 3 / 0.3 = 10
RPKM = 10 / (10,000/1,000,000) = 10 / 0.01 = 1000
RPKM = 1000
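The worked example can be written as a small function. This is a sketch: the name `rpkm` and its parameters are illustrative, and tools differ in which reads they count (for example spliced reads or multireads), so values from real software may not match this simple formula.

```python
def rpkm(mapped_reads, transcript_length_bp, total_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Equivalent to (mapped_reads / (length / 1e3)) / (total_reads / 1e6),
    rearranged into a single division.
    """
    return mapped_reads * 1e9 / (transcript_length_bp * total_reads)


# Numbers from the slide: 3 reads on a 300 bp transcript, 10,000 reads in total.
print(rpkm(3, 300, 10_000))
```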
-
Between library normalization
11/20/13 Bioinformatics course 64
Adjust by the total number of reads in the library
A small number of highly expressed genes can consume a significant amount of sequences
Solution: scaling factor
• Scaling the number of reads in a library to a common value
• Quantile normalization
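A minimal sketch of the simplest scaling-factor idea, rescaling every library to a common total. The function name `scale_to_common_size`, the choice of the mean library size as the default target, and the toy counts are all assumptions for illustration. Quantile normalization is not shown, and this total-count scaling does not address the problem of a few highly expressed genes dominating a library.

```python
def scale_to_common_size(counts_by_library, target=None):
    """Rescale raw counts so that every library sums to the same total.

    counts_by_library: {library_name: {gene: raw_count}}
    target: common library size; defaults to the mean library size (an assumption).
    """
    totals = {lib: sum(genes.values()) for lib, genes in counts_by_library.items()}
    if target is None:
        target = sum(totals.values()) / len(totals)
    return {
        lib: {gene: n * target / totals[lib] for gene, n in genes.items()}
        for lib, genes in counts_by_library.items()
    }


# Hypothetical toy counts: library B was sequenced twice as deeply as library A.
raw = {
    "A": {"gene1": 10, "gene2": 90},
    "B": {"gene1": 40, "gene2": 160},
}
scaled = scale_to_common_size(raw)
print(scaled)
```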
-
Differential expression
11/20/13 Bioinformatics course 65
List genes whose abundance changes significantly across different experimental conditions
Not the same as microarrays, since the data are not log transformed
If reads are independently sampled from a population, read counts follow a multinomial distribution, which can be approximated by a Poisson distribution:
$\Pr(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
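The Poisson probability mass function can be evaluated directly. A small sketch; `poisson_pmf` is an illustrative name and the rate of 3 reads is a made-up value.

```python
import math


def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Hypothetical gene with an expected count of 3 reads.
lam = 3.0
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
```

A Poisson variable has variance equal to its mean, which is a strong constraint when modelling counts from replicate samples.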
-
Several tools for differential expression…
11/20/13 Bioinformatics course 66
Source: Nature Methods, Vol. 8 No. 6, June 2011, p. 470 (Review)
each approach and their application to RNA-seq analysis. We also discuss how these different methodologies can impact the results and interpretation of the data. Although we discuss each of the three categories as separate units, RNA-seq data analysis often requires using methods from all three categories. The methods described here are largely independent of the choice of library construction protocols, with the notable exception of 'paired-end' sequencing (reading from both ends of a fragment), which provides valuable information at all stages of RNA-seq analysis [28–30].
As a reference for the reader, we provide a list of currently available methods in each category (Table 1). To provide a general indication of the compute resources and tradeoffs of different methods, we selected a representative method from each category and applied it to a published RNA-seq dataset consisting of 58 million paired-end 76-base reads from mouse embryonic stem cell RNA [28] (Supplementary Table 1).
Mapping short RNA-seq reads
One of the most basic tasks in RNA-seq analysis is the alignment of reads to either a reference transcriptome or genome. Alignment of reads is a classic problem in bioinformatics with several solutions specifically for EST mapping [8,9]. RNA-seq reads, however, pose particular challenges because they are short (~36–125 bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions. There are two major algorithmic approaches to map RNA-seq reads to a reference transcriptome. The first, to which we collectively refer as 'unspliced read aligners', align reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, 'seed methods' and 'Burrows-Wheeler transform methods'. Seed methods [31–38] such as mapping and assembly with quality (MAQ) [33] and Stampy [35] find matches for short subsequences, termed 'seeds', assuming that at least
Table 1 | Selected list of RNA-seq analysis programs
Read mapping
  Unspliced aligners (a). Uses: aligning reads to a reference transcriptome. Input: reads and reference transcriptome.
    Seed methods
      Short-read mapping package (SHRiMP) [41]: Smith-Waterman extension
      Stampy [39]: probabilistic model
    Burrows-Wheeler transform methods
      Bowtie [43]
      BWA [44]: incorporates quality scores
  Spliced aligners. Uses: aligning reads to a reference genome; allows for the identification of novel splice junctions. Input: reads and reference genome.
    Exon-first methods
      MapSplice [52]: works with multiple unspliced aligners
      SpliceMap [50]
      TopHat [51]: uses Bowtie alignments
    Seed-extend methods
      GSNAP [53]: can use SNP databases
      QPALMA [54]: Smith-Waterman for large gaps
Transcriptome reconstruction
  Genome-guided reconstruction. Uses: identifying novel transcripts using a known reference genome. Input: alignments to reference genome.
    Exon identification
      G.Mor.Se: assembles exons
    Genome-guided assembly
      Scripture [28]: reports all isoforms
      Cufflinks [29]: reports a minimal set of isoforms
  Genome-independent reconstruction. Uses: identifying novel genes and transcript isoforms without a known reference genome. Input: reads.
    Genome-independent assembly
      Velvet [61]: reports all isoforms
      TransABySS [56]
Expression quantification
  Expression quantification
    Gene quantification. Uses: quantifying gene expression. Input: reads and transcript models.
      Alexa-seq [47]: quantifies using differentially included exons
      Enhanced read analysis of gene expression (ERANGE) [20]: quantifies using union of exons
      Normalization by expected uniquely mappable area (NEUMA) [82]: quantifies using unique reads
    Isoform quantification. Uses: quantifying transcript isoform expression levels. Input: read alignments to isoforms.
      Cufflinks [29]: maximum likelihood estimation of relative isoform expression
      MISO [33]
      RNA-seq by expectation maximization (RSEM) [69]
  Differential expression. Uses: identifying differentially expressed genes or transcript isoforms. Input: read alignments and transcript models.
      Cuffdiff [29]: uses isoform levels in analysis
      DegSeq [79]: uses a normal distribution
      EdgeR [77]
      Differential Expression analysis of count data (DESeq) [78]
      Myrna [75]: cloud-based permutation method
(a) This list is not meant to be exhaustive as many different programs are available for short-read alignment. Here we chose a representative set capturing the frequently used tools for RNA-seq or tools representing fundamentally different approaches.
-
Analysis of differentially expressed gene list
11/20/13 Bioinformatics course 67
-
Gene ontology analysis
11/20/13 Bioinformatics course 68
Source: Nucleic Acids Research, 2007, p. 3
The main input of g:Sorter is a single gene ID. The user selects an expression dataset, a mathematical measure of distance like the Pearson correlation or Euclidean distance, and the size of the desired result. The result of g:Sorter analysis is a list of probes most similar (or dissimilar) to the query gene in the selected dataset. Visualisation shows the relative distances between probes. In case a gene is represented by several probes, a search is conducted
Figure 1. (A) A typical user input and output scenario of g:Profiler. User inserts a set of genes in the main text window and optionally adjusts query parameters. Results are provided either graphically or in textual format. Genes are presented in columns, and significant functional categories in rows. The analysis of an ordered list shows the length of the most significant query head. GO annotation evidence codes are coloured like a heat map, showing the strength of evidence between a gene and GO term. The legend is provided at the top of the page. It is displayed when the user clicks on the tree icon on the results page. The g:Orth, g:Convert and g:Sorter tools are directly linked to relevant genes from the current query. Additional examples are available in Supplementary Data. (B) Hierarchical relations between the resulting GO categories can be browsed by clicking on corresponding icons.
-
Gene ontology: GOsummaries
11/20/13 Bioinformatics course 69
[Figure: GOsummaries word clouds of enriched GO terms and pathways, one panel per comparison]
Panel 1: cell.line VS brain (G1 > G2: 2168, G1 < G2: 2132). Terms shown include cell cycle phase, mitotic cell cycle, nuclear division, DNA replication; and neuron development, axon guidance, regulation of synaptic transmission, ion transport.
Panel 2: muscle VS hematopoietic.system (G1 > G2: 1527, G1 < G2: 1159). Terms shown include cardiovascular system development, muscle structure development, muscle system process, cell adhesion; and cell activation, regulation of immune response, hemopoiesis, leukocyte migration.
Panel 3: hematopoietic.system VS cell.line (G1 > G2: 1221, G1 < G2: 1289). Terms shown include cell activation, innate immune response, regulation of immune response; and cell cycle phase, mitotic cell cycle, regulation of cell cycle process, nuclear division.
gland morphogenesisinterspecies interaction between...cell migrationtissue morphogenesis
Tissue
brain
cell line
hematopoietic system
muscle
Enrichment P−value
10−70
10−35
1
A B
C
D
E
Figure 1: Elements of a GO summaries figure
3 Usage of GOsummaries
In most cases, GOsummaries figures can be created with only two commands: gosummaries, which creates the object holding all the information needed to draw the plot, and plot.gosummaries, which draws it.
The gosummaries function requires a set of gene lists as input. It applies GO enrichment analysis to these gene lists using the g:Profiler (http://biit.cs.ut.ee/gprofiler/) web toolkit and saves the results into a gosummaries object. One can then add experimental data and configure the slots for additional information.
However, this can be somewhat complicated. Therefore, we have provided several convenience functions that generate gosummaries objects from the output of the most common analyses: gosummaries.kmeans, gosummaries.prcomp and gosummaries.MArrayLM, for k-means clustering, principal component analysis (PCA) and linear models with limma, respectively. These functions extract the gene lists directly from the corresponding objects, run the GO enrichment and optionally add the experimental data in the right format.
A gosummaries object can be plotted using the plot function. The figure might not fit into the plotting window, since the plot needs a rather strict layout to be readable. It is therefore advisable to write it to a file (the file name can be given as a parameter).
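To make the idea of GO enrichment of a gene list more concrete, here is a minimal Python sketch of a one-sided hypergeometric test for a single GO term. It is an illustration only: it is not g:Profiler's or GOsummaries' implementation, it applies no multiple-testing correction, and the function name and the numbers in the usage line are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_annotated, n_query, n_overlap):
    """P(X >= n_overlap) when n_query genes are drawn without replacement
    from n_background genes, n_annotated of which carry the GO term."""
    upper = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_query - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 20000 background genes, 200 annotated to the term,
# a gene list of 100 genes, 10 of which carry the term.
p_value = hypergeom_enrichment_p(20000, 200, 100, 10)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value means the term occurs in the gene list more often than expected by chance; a real analysis would test many terms and correct for multiple testing.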
-
Pathway analysis
-
And many more…..
-
Novel genomes
How do we compute RNA-seq gene expression for novel genomes?
• Must have a complete genome sequence (or contigs).
• Use predicted gene models (all-protein BLASTX or EST vs genome data) to create an exon map, or de novo assembly of transcripts from RNA-seq data.
• Computationally huge problem: all-against-all similarity searching and multiple overlapping transcripts.
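The cost of all-against-all comparison can be seen in a small Python sketch that computes suffix-prefix overlaps for every ordered pair of reads, a building block of overlap-based de novo assembly. The reads are the four strings from the shortest common superstring example earlier in the lecture; the function names are hypothetical and the exact-match overlap is a simplification (real assemblers tolerate sequencing errors and use indexing to avoid comparing every pair).

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def all_against_all(reads):
    """Overlap length for every ordered pair of distinct reads."""
    return {(a, b): overlap(a, b) for a, b in permutations(reads, 2)}

reads = ["ACGTA", "CTTGA", "ACTT", "GTAAC"]
table = all_against_all(reads)
for (a, b), k in sorted(table.items(), key=lambda item: -item[1]):
    if k > 0:
        print(f"{a} -> {b}: overlap {k}")
```

With n reads this makes n(n-1) comparisons, so four reads need 12 comparisons while millions of reads need on the order of 10^12 or more, which is why the naive approach does not scale.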
-
RNA-seq analysis programs
… each approach and their application to RNA-seq analysis. We also discuss how these different methodologies can impact the results and interpretation of the data. Although we discuss each of the three categories as separate units, RNA-seq data analysis often requires using methods from all three categories. The methods described here are largely independent of the choice of library construction protocols, with the notable exception of ‘paired-end’ sequencing (reading from both ends of a fragment), which provides valuable information at all stages of RNA-seq analysis [28–30].
As a reference for the reader, we provide a list of currently available methods in each category (Table 1). To provide a general indication of the compute resources and tradeoffs of different methods, we selected a representative method from each category and applied it to a published RNA-seq dataset consisting of 58 million paired-end 76-base reads from mouse embryonic stem cell RNA [28] (Supplementary Table 1).
Mapping short RNA-seq reads
One of the most basic tasks in RNA-seq analysis is the alignment of reads to either a reference transcriptome or genome. Alignment of reads is a classic problem in bioinformatics with several solutions specifically for EST mapping [8,9]. RNA-seq reads, however, pose particular challenges because they are short (~36–125 bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions. There are two major algorithmic approaches to map RNA-seq reads to a reference transcriptome. The first, to which we collectively refer as ‘unspliced read aligners’, align reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, ‘seed methods’ and ‘Burrows-Wheeler transform methods’. Seed methods [31–38] such as mapping and assembly with quality (MAQ) [33] and Stampy [35] find matches for short subsequences, termed ‘seeds’, assuming that at least …
Table 1 | Selected list of RNA-seq analysis programs
Each entry reads: category: package [reference] (notes). Uses and input are stated once per group.
Read mapping, unspliced aligners (a)
Uses: aligning reads to a reference transcriptome. Input: reads and reference transcriptome.
• Seed methods: Short-read mapping package (SHRiMP) [41] (Smith-Waterman extension); Stampy [39] (probabilistic model)
• Burrows-Wheeler transform methods: Bowtie [43]; BWA [44] (incorporates quality scores)
Read mapping, spliced aligners
Uses: aligning reads to a reference genome; allows for the identification of novel splice junctions. Input: reads and reference genome.
• Exon-first methods: MapSplice [52] (works with multiple unspliced aligners); SpliceMap [50]; TopHat [51] (uses Bowtie alignments)
• Seed-extend methods: GSNAP [53] (can use SNP databases); QPALMA [54] (Smith-Waterman for large gaps)
Transcriptome reconstruction, genome-guided reconstruction
Uses: identifying novel transcripts using a known reference genome. Input: alignments to reference genome.
• Exon identification: G.Mor.Se (assembles exons)
• Genome-guided assembly: Scripture [28] (reports all isoforms); Cufflinks [29] (reports a minimal set of isoforms)
Transcriptome reconstruction, genome-independent reconstruction
Uses: identifying novel genes and transcript isoforms without a known reference genome. Input: reads.
• Genome-independent assembly: Velvet [61] (reports all isoforms); TransABySS [56]
Expression quantification, gene quantification
Uses: quantifying gene expression. Input: reads and transcript models.
• Alexa-seq [47] (quantifies using differentially included exons); Enhanced read analysis of gene expression (ERANGE) [20] (quantifies using union of exons); Normalization by expected uniquely mappable area (NEUMA) [82] (quantifies using unique reads)
Expression quantification, isoform quantification
Uses: quantifying transcript isoform expression levels. Input: read alignments to isoforms.
• Cufflinks [29] (maximum likelihood estimation of relative isoform expression); MISO [33]; RNA-seq by expectation maximization (RSEM) [69]
Expression quantification, differential expression
Uses: identifying differentially expressed genes or transcript isoforms. Input: read alignments and transcript models.
• Cuffdiff [29] (uses isoform levels in analysis); DegSeq [79] (uses a normal distribution); EdgeR [77]; Differential Expression analysis of count data (DESeq) [78]; Myrna [75] (cloud-based permutation method)
(a) This list is not meant to be exhaustive as many different programs are available for short-read alignment. Here we chose a representative set capturing the frequently used tools for RNA-seq or tools representing fundamentally different approaches.
Source: Review, Nature Methods, vol. 8, no. 6, June 2011, p. 470.
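The seed idea from the excerpt above (find exact matches for short subsequences of a read, then check the whole read at each candidate position) can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not the algorithm of MAQ, Stampy or SHRiMP: the function names, the toy reference and the mismatch threshold are hypothetical, and no gaps or quality scores are handled.

```python
from collections import defaultdict

def build_seed_index(reference, seed_len):
    """Map every substring of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def map_read(read, reference, index, seed_len, max_mismatches=1):
    """Return sorted (position, mismatches) for ungapped placements of read."""
    hits = {}
    # Cut the read into non-overlapping seeds and look each one up.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            # Extend: compare the full read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(r != w for r, w in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())

reference = "TTGACCGTAAGCTTAGGCATCG"
index = build_seed_index(reference, seed_len=4)
print(map_read("GTAAGCTA", reference, index, seed_len=4))
```

The read in the usage line differs from the reference in its last base, so only its first seed matches exactly; the extension step then accepts the placement because it stays within the mismatch limit.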
-
Comparison of tools
-
Challenges
• Several sequencing technologies
• Complex normalization
• Mappability is difficult to achieve
• Accurate detection of splice junctions
• Proper summarization methods are needed
• Most challenging for novel genomes
• Fewer algorithms exist for de novo assembly than for reference-based assembly
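As one illustration of why normalization matters, the Python sketch below computes RPKM (reads per kilobase of transcript per million mapped reads), a widely used measure that corrects raw counts for gene length and sequencing depth. RPKM is not named on this slide and is only one of several normalization schemes; the function name, gene names and counts are hypothetical, and the total number of mapped reads is approximated here by the sum of the counts supplied.

```python
def rpkm(read_counts, gene_lengths_bp):
    """RPKM per gene: count * 1e9 / (gene length in bp * total mapped reads)."""
    total_mapped = sum(read_counts.values())
    if total_mapped == 0:
        raise ValueError("no mapped reads")
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_mapped)
        for gene, count in read_counts.items()
    }

# Hypothetical data: equal read counts, but geneB is twice as long as geneA.
counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 1000, "geneB": 2000}
values = rpkm(counts, lengths)
print(values)
```

With equal raw counts, the longer gene receives half the RPKM of the shorter one, which reflects that longer transcripts yield more reads at the same expression level.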
-
Summary
• RNA-seq is used to study RNA content
• More quantitative than microarrays
• Can be used for studying different layers of transcription
• Several factors must be considered in experimental design
• Mapping, transcript assembly, summarization, differential expression and visualization are the major steps in RNA-seq
• Gene ontology analysis, pathway analysis and integrative study, followed by systems biology, are possible follow-up steps for RNA-seq gene lists