next generation sequencing (ngs)- rna sequencing...use of high-throughput sequencing technologies to...

Vijayachitra Modhukur, BIIT ([email protected]). Next generation sequencing (NGS): RNA sequencing. Bioinformatics course, 11/20/13.

Post on 04-Feb-2021

  • NGS lectures

    Genomics

    Transcriptomics

    Proteomics

    Epigenomics

  • Recap

  • Sequencing

  • Different generations sequencing

  • Second generation sequencing

  • NGS platforms

    Leading platforms (specifications as of summer 2008)

                   454          Solexa/Illumina      SOLiD (ABI)
    Bp per run     400 Mb       2-3 Gb               3-6 Gb
    Read length    250-400 bp   35-50 (70-100) bp    35-50 bp
    Run time       10 hr        2.5 days             5 days
    Download       20 min       27 hr (44 min)       ~1 day
    Analysis       2-5 hr       2 days               2-3 days
    Files          20-50 Gb     1 T                  1 T

    With 3730s, ~60 Mb per year.

  • Massive amount of sequenced data

  • Sequence alignment
    • De novo alignment
    • Reference alignment

  • Short read mapping (de novo): the shortest common superstring problem (SSP)

    • Let $f_1, f_2, \ldots, f_k$ be words in $\Sigma^*$.
    • We want to find the shortest string $g \in \Sigma^*$ such that every $f_i$ is a substring of $g$.
    • Example: say we have the set of strings $f_1$ = ACGTA, $f_2$ = CTTGA, $f_3$ = ACTT, $f_4$ = GTAAC. Find the shortest common superstring of these four strings.

    So $\Sigma^*$ is the "free language" on the alphabet $\Sigma$. Note: if you like monoids, $\Sigma^*$ has the algebraic structure of a monoid.

    Definition 1. Let $f, g \in \Sigma^*$, so that we write

    $f = s_1 s_2 \cdots s_n, \quad g = t_1 t_2 \cdots t_m$

    where $s_i, t_j \in \Sigma$ for all $i$ and $j$. Then $f$ is a substring of $g$ if there exists an index $k \geq 1$ such that $s_1 s_2 \cdots s_n = t_k t_{k+1} \cdots t_{n+k-1}$. Conversely, $g$ is a superstring of $f$.

    Example 1. Let f = ACTG and g = AAACTGCA. Then f is a substring of g, because g = AA ACTG CA (f occupies positions 3 to 6 of g).

    The Shortest Common Superstring Problem

    Let $f_1, f_2, \ldots, f_k$ be words in $\Sigma^*$. We want to find the shortest string $g \in \Sigma^*$ such that each $f_i$ is a substring of $g$. This is known as the Shortest Common Superstring Problem (SSP). We practice a solution to this problem informed by Ockham's Razor: we assume that the best reconstruction is also the simplest.

    Example 2. Say we have this set of reads: $f_1$ = ACGTA, $f_2$ = CTTGA, $f_3$ = ACTT, $f_4$ = GTAAC. Find the shortest common superstring of these four strings.

    Figure 2: The SSP for $f_1, \ldots, f_4$

    By treating these reads like puzzle pieces, we can put the four reads into this superstring, which in this case is of shortest possible length (by inspection). We will soon see that this particular common superstring can be constructed using an algorithm, although the algorithm has some issues.

    It turns out that this problem is difficult to solve in practice:

    Theorem 3 (Gallant 1980). The SSP is NP-Complete.

    While we will not deal with complexity theory in detail in this class, we can take this to mean that the SSP is provably hard in a nasty way. This problem is related to graph theory.
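The notes above say a common superstring for Example 2 can be built by an algorithm "with some issues" but do not name it. One plausible candidate is greedy merging: repeatedly join the two strings with the largest suffix/prefix overlap. The sketch below is an assumption about what is meant, not the lecture's own code; the function names `overlap` and `greedy_superstring` are made up for illustration. Greedy merging always returns a common superstring, but it is not guaranteed to return the shortest one.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    pool = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    # Reads contained in another read add nothing to the superstring.
    pool = [r for r in pool if not any(r != s and r in s for s in pool)]
    while len(pool) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # ties: the first pair found wins
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [r for idx, r in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""


reads = ["ACGTA", "CTTGA", "ACTT", "GTAAC"]
superstring = greedy_superstring(reads)
print(superstring, len(superstring))
```

For the four reads of Example 2 the merges are ACGTA + GTAAC (overlap 3), ACTT + CTTGA (overlap 3), and finally the two results (overlap 2). The quadratic pair search is fine for a toy example but far too slow for real read sets.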

  • Reference alignment

    Find locations where short read is identical to reference genome
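As a minimal illustration of this statement, the sketch below lists every position at which a read matches a reference string exactly. It is a naive scan, assuming error-free reads and a reference small enough to hold in memory; real short-read mappers use indexes (spaced seeds, BWT) and tolerate mismatches. The function name `exact_match_positions` and the toy sequences are hypothetical.

```python
def exact_match_positions(reference, read):
    """Return all 0-based start positions where read occurs exactly in reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping occurrences are also reported.
        start = reference.find(read, start + 1)
    return positions


reference = "AAACTGCAACTG"
print(exact_match_positions(reference, "ACTG"))
```

A read that matches at more than one position, as here, is what mapping tools usually call a multi-mapped read.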

  • NGS Analysis

  • Data analysis

    CPU/memory intensive

  • Quality scores
    • Each base from a sequencer comes with a quality score
    • Base-calling error probabilities
    • Phred quality score: $Q = -10 \log_{10} P$, where $P$ is the base-calling error probability
    • A higher quality score indicates a smaller probability of error

    http://www.illumina.com/truseq/quality_101/quality_scores.ilmn
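The Phred relation between a quality score Q and the base-calling error probability P can be checked numerically. The sketch below converts in both directions using the standard definition Q = -10 log10 P; the function names are chosen here for illustration only.

```python
import math


def phred_to_error_probability(q):
    """Error probability P for a Phred quality score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality score Q for an error probability P: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    # Q20 means a 1 in 100 chance of a wrong base call, Q30 means 1 in 1000.
    print(q, phred_to_error_probability(q))
```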

  • Quality scores

    http://www.illumina.com/truseq/quality_101/quality_scores.ilmn

  • File formats

  • FASTQ

    Raw data

  • Alignment methods

    Reference assembly
    • Spaced seed
    • BWT (Burrows-Wheeler transform)

    De novo assembly
    • Greedy assemblers
    • Graph-based: overlap-layout-consensus
    • Graph-based: de Bruijn graph

  • RNA sequencing

  • Transcription

  • RNA world hypothesis

  • What is RNA-seq? Use of high-throughput sequencing technologies to assess the RNA content of a sample.

    Journal of Biomedicine and Biotechnology, p. 11

    Figure labels: Exon; Intron; Sequence read; Signal from annotated exons; Non-exonic signal

    Figure 5: Mapping and quantification of the signal. RNA-seq experiments produce short reads sequenced from processed mRNAs. When a reference genome is available the reads can be mapped on it using efficient alignment software. Classical alignment tools will accurately map reads that fall within an exon, but they will fail to map spliced reads. To handle such problem suitable mappers, based either on junctions library or on more sophisticated approaches, need to be considered. After the mapping step annotated features can be quantified.

    In order to derive a quantitative expression for annotated elements (such as exons or genes) within a genome, the simplest approach is to provide the expression as the total number of reads mapping to the coordinates of each annotated element. In the classical form, such method weights all the reads equally, even though they map the genome with different stringency. Alternatively, gene expression can be calculated as the sum of the number of reads covering each base position of the annotated element; in this way the expression is provided in terms of base coverage. In both cases, the results depend on the accuracy of the used gene models and the quantitative measures are a function of the number of mapped reads, the length of the region of interest and the molar concentration of the specific transcript. A straightforward solution to account for the sample size effect is to normalize the observed counts for the length of the element and the number of mapped reads. In [37], the authors proposed the Reads Per Kilobase per Million of mapped reads (RPKM) as a quantitative normalized measure for comparing both different genes within the same sample and differences of expression across biological conditions. In [84], the authors considered two alternative measures of relative expression: the fraction of transcripts and the fraction of nucleotides of the transcriptome made up by a given gene or isoform.

    Although apparently easy to obtain, RPKM values can have several differences between software packages, hidden at first sight, due to the lack of a clear documentation of the analysis algorithms used. For example ERANGE [37] uses a union of known and new exon models to aggregate reads and determines a value for each region that includes spliced reads and assigned multireads too, whereas [30, 40, 81, 90] are restricted to known or prespecified exons/gene models. However, as noticed in [91], several experimental issues influence the RPKM quantification, including the integrity of the input RNA, the extent of ribosomal RNA remaining in the sample, the size selection steps and the accuracy of the gene models used.

    In principle, RPKMs should reflect the true RNA concentration; this is true when samples have relatively uniform sequence coverage across the entire gene model. The problem is that all protocols currently fall short of providing the desired uniformity, see for example [37], where the Kolmogorov-Smirnov statistics is used to compare the observed reads distribution on each selected exon model with the theoretical uniform one. Similar conclusions are also illustrated in [57, 58], among others.

    Additionally, it should be noted that RPKM measure should not be considered as the panacea for all RNA-Seq experiments. Despite the importance of the issue, the expression quantification did not receive the necessary attention from the community and in most of the cases the choice has been done regardless of the fact that the main question is the detection of differentially expressed elements. Regarding this point in [92] it is illustrated the inherent bias in transcript length that affect RNA-Seq experiments. In fact the total number of reads for a given transcript is roughly proportional to both the expression level and the length of the transcript. In other words, a long transcript will have more reads mapping to it compared to a short gene of similar expression. Since the power of an experiment is proportional to the sampling size, there will be more statistical power

    Slides from Halisha Holloway

  • RNA-seq vs microarray

    RNA-seq:
    • ID novel genes, transcripts, & exons
    • Greater dynamic range
    • Less bias due to genetic variation
    • Repeatable
    • No species-specific primer/probe design
    • More accurate relative to qPCR
    • Many more applications

    Microarray:
    • Well vetted QC and analysis methods
    • Well characterized biases
    • Quick turnaround from established core facilities
    • Currently less expensive

  • RNA-Seq vs microarray

  • Why do an RNA-seq experiment?
    • Detect differential expression
    • Assess allele-specific expression
    • Quantify alternative transcript usage
    • Discover novel genes/transcripts, gene fusions
    • Profile transcriptome
    • Ribosome profiling to measure translation

    Skelly et al. 2011

    Figure labels: Pluripotent Stem Cell; Cardiomyocytes; Cardiogenic Mesoderm; Cardiac Precursors


  • RNA-seq protocol

    Sample RNA → reverse transcription + PCR → amplified cDNA → fragmentation → cDNA fragments → sequencing machine → reads

    Example reads:

    CCTTCNCACTTCGTTTCCCAC
    TTTTTNCAGAGTTTTTTCTTG
    GAACANTCCAACGCTTGGTGA
    GGAAANAAGACCCTGTTGAGC
    CCCGGNGATCCGCTGGGACAA
    GCAGCATATTGATAGATAACT
    CTAGCTACGCGTACGCGATCG
    CATCTAGCATCGCGTTGCGTT
    CCCGCGCGCTTAGGCTACTCG
    TCACACATCTCTAGCTAGCAT
    CATGCTAGCTATGCCTATCTA
    CACCCCGGGGATATATAGGAT

  • RNA-seq data

    Each read is stored with its name, its sequence and its qualities:

    @HWUSI-EAS1789_0001:3:2:1708:1305#0/1
    CCTTCNCACTTCGTTTCCCACTTAGCGATAATTTG
    +HWUSI-EAS1789_0001:3:2:1708:1305#0/1
    VVULVBVYVYZZXZZ\ee[a^b`[a\a[\\a^^^\
    @HWUSI-EAS1789_0001:3:2:2062:1304#0/1
    TTTTTNCAGAGTTTTTTCTTGAACTGGAAATTTTT
    +HWUSI-EAS1789_0001:3:2:2062:1304#0/1
    a__[\Bbbb`edeeefd`cc`b]bffff`ffffff
    @HWUSI-EAS1789_0001:3:2:3194:1303#0/1
    GAACANTCCAACGCTTGGTGAATTCTGCTTCACAA
    +HWUSI-EAS1789_0001:3:2:3194:1303#0/1
    ZZ[[VBZZY][TWQQZ\ZS\[ZZXV__\OX`a[ZZ
    @HWUSI-EAS1789_0001:3:2:3716:1304#0/1
    GGAAANAAGACCCTGTTGAGCTTGACTCTAGTCTG
    +HWUSI-EAS1789_0001:3:2:3716:1304#0/1
    aaXWYBZVTXZX_]Xdccdfbb_\`a\aY_^]LZ^
    @HWUSI-EAS1789_0001:3:2:5000:1304#0/1
    CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA
    +HWUSI-EAS1789_0001:3:2:5000:1304#0/1
    aaaaaBeeeeffffehhhhhhggdhhhhahhhadh

    • 1 Illumina (GAIIX) lane: ~20 million reads
    • Paired-end reads: read1 and read2
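To make the record layout concrete, here is a small sketch that reads four-line FASTQ records and decodes the quality string into Phred scores. It uses one record from the slide. The ASCII offset of 64 is an assumption: the quality characters on the slide range from 'B' to 'h', which fits the older Illumina encoding, whereas current FASTQ files normally use offset 33. The names `read_fastq` and `phred_scores` are illustrative, and the reader assumes records are never wrapped over extra lines.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, qualities) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qualities = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(qualities):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, qualities


def phred_scores(qualities, offset=64):
    """Decode a quality string; offset=64 is assumed for this older Illumina data."""
    return [ord(ch) - offset for ch in qualities]


EXAMPLE = (
    "@HWUSI-EAS1789_0001:3:2:5000:1304#0/1\n"
    "CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA\n"
    "+HWUSI-EAS1789_0001:3:2:5000:1304#0/1\n"
    "aaaaaBeeeeffffehhhhhhggdhhhhahhhadh\n"
)

records = list(read_fastq(StringIO(EXAMPLE)))
name, sequence, qualities = records[0]
scores = phred_scores(qualities)
print(name, len(sequence), min(scores), max(scores))
```

Note that the lowest score in this read sits at the position of the uncalled base N.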

  • Coverage

    • Coverage = number of sequenced bases / size of the original genome
    • Number of sequenced bases = number of reads × length of the reads
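The coverage definition can be written as a one-line function. In the sketch below the read count and read length follow the lane described earlier in the deck (~20 million reads of 35 bp); the 100 Mb genome size is a made-up value for illustration, and `coverage` is a hypothetical helper name.

```python
def coverage(n_reads, read_len_bp, genome_size_bp):
    """Average coverage: sequenced bases (reads x read length) divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return n_reads * read_len_bp / genome_size_bp


# ~20 million reads of 35 bp against a hypothetical 100 Mb genome.
print(coverage(20_000_000, 35, 100_000_000))
```

This is an average; actual coverage varies along the genome, and in RNA-seq it also varies with expression level.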

  • Some things to consider in experimental design

  • Plan it well: experimental design
    • Biological replicates
    • Reference genome?
    • Good gene annotation?
    • Read depth
    • Read length
    • Paired vs. single-end

    Technical variation

    Biological variation


  • Plan it well: experimental design

    Figure: Robustness of transcript identification as input data are removed. x-axis: fraction of total number of reads in jackknifed data set (0.0 to 1.0); y-axis: fraction of transcripts with non-zero FPKM, relative to 100% (0.0 to 1.0). Series labels: 10%, 5%, 2%, 1%, 0.1%. Tools named: Cufflinks, USeq-DESeq.

  • How much data do we need?
    • ~15-20K genes are expressed in a tissue or cell line.
    • Genes are on average 3 kb long.
    • For 1x coverage using 100 bp reads, we would need 600K sequence reads (20K genes × 3 kb / 100 bp).
    • In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
    • 30-50 million reads
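The back-of-the-envelope estimate on this slide can be reproduced as follows. The sketch assumes every expressed gene is covered evenly, which real transcriptomes are not, so it gives a lower bound rather than a sequencing recommendation; `reads_for_coverage` is a name chosen here.

```python
import math


def reads_for_coverage(n_genes, mean_gene_len_bp, read_len_bp, coverage=1.0):
    """Reads needed to cover n_genes of a given mean length at the requested average coverage."""
    return math.ceil(coverage * n_genes * mean_gene_len_bp / read_len_bp)


# Slide numbers: 20K expressed genes, 3 kb each, 100 bp reads, 1x coverage.
print(reads_for_coverage(20_000, 3_000, 100))
# The same transcriptome at an average of 50x.
print(reads_for_coverage(20_000, 3_000, 100, coverage=50))
```

Under these assumptions 1x needs 600,000 reads, and an average of 50x already needs 30 million reads, which is the low end of the range quoted on the slide.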

  • Plan it well: experimental design

    Number of unique sequences = $4^{\text{read length}}$

    Read length    Unique sequences
    25             $1.1 \times 10^{15}$
    50             $1.3 \times 10^{30}$
    100            $1.6 \times 10^{60}$

    ~60 million coding bases in vertebrate genome
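The table values follow directly from four possible bases at each position. A short check, with `unique_sequences` as an illustrative helper name:

```python
def unique_sequences(read_len):
    """Number of distinct DNA sequences of a given length (4 choices per base)."""
    return 4 ** read_len


for read_len in (25, 50, 100):
    print(f"{read_len:>3} bp: {unique_sequences(read_len):.1e} possible sequences")
```

Even at 25 bp the number of possible sequences (about 1.1e15) is far larger than the ~60 million coding bases quoted on the slide. Real genomes are repetitive rather than random, though, so longer reads still help with unique placement.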

  • Plan it well: experimental design
    • Biological replicates
    • Reference genome?
    • Good gene annotation?
    • Read depth
    • Barcoding
    • Read length
    • Paired vs. single-end

  • Power of paired-end reads
    • Huge impact on read mapping
      • Pairs give two locations to determine whether a read is unique
    • Critical for estimating transcript-level abundance
      • Increases the number of splice-junction-spanning reads

  • Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones.

    Auer P L and Doerge R W, Genetics 2010;185:405-416

    Copyright © 2010 by the Genetics Society of America

  • RNA-seq pipeline

  • Typical RNA-seq experiment

  • RNA-seq informatics workflow

    1. QC and genome mapping
    2. Splice junction fragments
    3. Predict novel junctions/exons
    4. Counts
    5. Normalize
    6. Differential expression
    7. Gene lists

  • Quality control

  • QC: Raw Data

    Sequence call quality

  • QC: Raw Data

    Sequence bias

  • QC: Raw Data

    Duplication level

  • Mapping


    Align reads to the genome
    • Simple for genomic sequences
    • Difficult for transcripts with splice junctions

  • Junction reads

  • TopHat pipeline

  • Alternative splicing

  • Alternative splicing

  • Cufflinks

  • RNA-seq complete pipeline

  • RNA-seq summarization

  • Normalization aims

    • Comparable across features (genes, isoforms, etc.)
    • Comparable across different samples (libraries)
      • Between samples (libraries)
      • Within samples (libraries)
    • Easily interpretable

  • Within library normalization

    • Allows quantification of the expression level of each gene relative to the other genes within the library
    • Longer transcripts have higher read counts (at the same expression level)
    • Widely used: RPKM (Reads Per Kilobase per Million mapped reads)

  • RPKM-example

    • Number of mapped reads = 3
    • Length of transcript = 300 bp
    • Total number of reads = 10,000

    RPK = 3 / (300 / 1000) = 3 / 0.3 = 10

    RPKM = 10 / (10,000 / 1,000,000) = 10 / 0.01 = 1000
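The worked example can be reproduced with a small function that follows the same two steps: reads per kilobase first, then per million reads. The function name `rpkm` and its parameter names are chosen here for illustration.

```python
def rpkm(mapped_reads, transcript_len_bp, total_reads):
    """Reads Per Kilobase of transcript per Million reads in the library."""
    rpk = mapped_reads / (transcript_len_bp / 1_000)   # normalise for transcript length
    return rpk / (total_reads / 1_000_000)             # normalise for library size


# Slide example: 3 reads on a 300 bp transcript, 10,000 reads in total.
print(rpkm(3, 300, 10_000))
```

Doubling the transcript length at the same read count halves the RPKM, which is how the measure compensates for longer transcripts collecting more reads.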

  • Between library normalization

    • Adjust by the total number of reads in the library
    • A small number of highly expressed genes can consume a significant amount of the sequences
    • Solution: scaling factor
    • Scaling the number of reads in a library to a common value
    • Quantile normalization
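A minimal sketch of the "scale to a common value" idea: each library's counts are multiplied by a factor so that all libraries end up with the same total. The choice of the mean library size as the default target, the dictionary layout and the name `scale_to_common_depth` are assumptions made for this example. This simple total-count scaling does not address the problem noted above that a few highly expressed genes can dominate a library.

```python
def scale_to_common_depth(libraries, target=None):
    """Scale per-gene counts so that every library sums to the same total.

    libraries: {library_name: {gene: count}}
    target:    common library size; defaults to the mean library size.
    """
    totals = {name: sum(counts.values()) for name, counts in libraries.items()}
    if any(total == 0 for total in totals.values()):
        raise ValueError("cannot scale a library with zero reads")
    if target is None:
        target = sum(totals.values()) / len(totals)
    return {
        name: {gene: count * target / totals[name] for gene, count in counts.items()}
        for name, counts in libraries.items()
    }


libraries = {
    "A": {"g1": 10, "g2": 90},    # 100 reads in total
    "B": {"g1": 40, "g2": 160},   # 200 reads in total
}
scaled = scale_to_common_depth(libraries)
print(scaled)
```

After scaling, both toy libraries sum to 150, so the g1 values (15 vs 30) can be compared directly.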

  • Differential expression

    • List genes whose abundance changes significantly across different experimental conditions
    • Not the same as for microarrays, since the counts are not log transformed
    • If reads are independently sampled from a population, the read counts would follow a multinomial distribution, approximated by the Poisson distribution:

    $\Pr(X = k) = \lambda^{k} e^{-\lambda} / k!$
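The Poisson probability of observing exactly k reads for a gene with expected count λ can be evaluated directly from the standard formula λ^k e^(-λ) / k!. The sketch below is a plain implementation for small counts; the function name `poisson_pmf` is chosen here, and the expected count of 3 reads is an arbitrary illustration.

```python
import math


def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


# Probability of seeing 0, 1, 2, ... reads when 3 reads are expected.
for k in range(6):
    print(k, round(poisson_pmf(k, 3.0), 4))
```

Under the Poisson model the variance equals the mean, which is the property differential expression tools test against or relax.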

  • Several tools for differential expression…

    Source: Nature Methods, vol. 8, no. 6, June 2011, p. 470 (Review)

    each approach and their application to RNA-seq analysis. We also discuss how these different methodologies can impact the results and interpretation of the data. Although we discuss each of the three categories as separate units, RNA-seq data analysis often requires using methods from all three categories. The methods described here are largely independent of the choice of library construction protocols, with the notable exception of 'paired-end' sequencing (reading from both ends of a fragment), which provides valuable information at all stages of RNA-seq analysis [28–30].

    As a reference for the reader, we provide a list of currently available methods in each category (Table 1). To provide a general indication of the compute resources and tradeoffs of different methods, we selected a representative method from each category and applied it to a published RNA-seq dataset consisting of 58 million paired-end 76-base reads from mouse embryonic stem cell RNA [28] (Supplementary Table 1).

    Mapping short RNA-seq reads

    One of the most basic tasks in RNA-seq analysis is the alignment of reads to either a reference transcriptome or genome. Alignment of reads is a classic problem in bioinformatics with several solutions specifically for EST mapping [8,9]. RNA-seq reads, however, pose particular challenges because they are short (~36–125 bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions. There are two major algorithmic approaches to map RNA-seq reads to a reference transcriptome. The first, to which we collectively refer as 'unspliced read aligners', align reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, 'seed methods' and 'Burrows-Wheeler transform methods'. Seed methods [31–38] such as mapping and assembly with quality (MAQ) [33] and Stampy [35] find matches for short subsequences, termed 'seeds', assuming that at least

    Table 1 | Selected list of RNA-seq analysis programs (Class > Category > Package: Notes)

    Read mapping
    • Unspliced aligners (a). Uses: aligning reads to a reference transcriptome. Input: reads and reference transcriptome.
      • Seed methods
        • Short-read mapping package (SHRiMP) [41]: Smith-Waterman extension
        • Stampy [39]: probabilistic model
      • Burrows-Wheeler transform methods
        • Bowtie [43]
        • BWA [44]: incorporates quality scores
    • Spliced aligners. Uses: aligning reads to a reference genome; allows for the identification of novel splice junctions. Input: reads and reference genome.
      • Exon-first methods
        • MapSplice [52]: works with multiple unspliced aligners
        • SpliceMap [50]
        • TopHat [51]: uses Bowtie alignments
      • Seed-extend methods
        • GSNAP [53]: can use SNP databases
        • QPALMA [54]: Smith-Waterman for large gaps

    Transcriptome reconstruction
    • Genome-guided reconstruction. Uses: identifying novel transcripts using a known reference genome. Input: alignments to reference genome.
      • Exon identification
        • G.Mor.Se: assembles exons
      • Genome-guided assembly
        • Scripture [28]: reports all isoforms
        • Cufflinks [29]: reports a minimal set of isoforms
    • Genome-independent reconstruction. Uses: identifying novel genes and transcript isoforms without a known reference genome. Input: reads.
      • Genome-independent assembly
        • Velvet [61]: reports all isoforms
        • TransABySS [56]

    Expression quantification
    • Expression quantification
      • Gene quantification. Uses: quantifying gene expression. Input: reads and transcript models.
        • Alexa-seq [47]: quantifies using differentially included exons
        • Enhanced read analysis of gene expression (ERANGE) [20]: quantifies using union of exons
        • Normalization by expected uniquely mappable area (NEUMA) [82]: quantifies using unique reads
      • Isoform quantification. Uses: quantifying transcript isoform expression levels. Input: read alignments to isoforms.
        • Cufflinks [29]: maximum likelihood estimation of relative isoform expression
        • MISO [33]
        • RNA-seq by expectation maximization (RSEM) [69]
    • Differential expression. Uses: identifying differentially expressed genes or transcript isoforms. Input: read alignments and transcript models.
      • Cuffdiff [29]: uses isoform levels in analysis
      • DegSeq [79]: uses a normal distribution
      • EdgeR [77]
      • Differential Expression analysis of count data (DESeq) [78]
      • Myrna [75]: cloud-based permutation method

    (a) This list is not meant to be exhaustive as many different programs are available for short-read alignment. Here we chose a representative set capturing the frequently used tools for RNA-seq or tools representing fundamentally different approaches.

  • Analysis of differentially expressed gene list

  • Gene ontology analysis

    The main input of g:Sorter is a single gene ID. The user selects an expression dataset, a mathematical measure of distance like the Pearson correlation or Euclidean distance, and the size of the desired result. The result of g:Sorter analysis is a list of probes most similar (or dissimilar) to the query gene in the selected dataset. Visualisation shows the relative distances between probes. In case a gene is represented by several probes, a search is conducted

    Figure 1. (A) A typical user input and output scenario of g:Profiler. User inserts a set of genes in the main text window and optionally adjusts query parameters. Results are provided either graphically or in textual format. Genes are presented in columns, and significant functional categories in rows. The analysis of an ordered list shows the length of the most significant query head. GO annotation evidence codes are coloured like a heat map, showing the strength of evidence between a gene and GO term. The legend is provided at the top of the page. It is displayed when the user clicks on the tree icon on the results page. The g:Orth, g:Convert and g:Sorter tools are directly linked to relevant genes from the current query. Additional examples are available in Supplementary Data. (B) Hierarchical relations between the resulting GO categories can be browsed by clicking on corresponding icons.

    Nucleic Acids Research, 2007, p. 3

  • Gene ontology: GOsummaries

    Figure: word-cloud style summaries of enriched GO terms and pathways for two comparisons (terms in the original figure are partly truncated).

    Panel 1: cell.line vs brain (G1 > G2: 2168; G1 < G2: 2132). The first cloud is dominated by cell-cycle terms (e.g. cell cycle phase, mitotic cell cycle, nuclear division, DNA replication, chromosome segregation); the second by neuronal terms (e.g. neuron development, regulation of synaptic transmission, axon guidance, ion transport, Glutamatergic synapse).

    Panel 2: muscle vs hematopoietic.system (G1 > G2: 1527; G1 < G2: 1159). The first cloud is dominated by muscle and energy-metabolism terms (e.g. cardiovascular system development, muscle structure development, cell adhesion, Oxidative phosphorylation, Cardiac muscle contraction); the second by immune terms (e.g. regulation of immune response, hemopoiesis, leukocyte migration, innate immune response, inflammatory response).

    regulation of defense response

    positive regulation of protein m...integrin−mediated signaling pathway

    regulation of hydrolase activity

    vesicle−mediated transportpeptidyl−tyrosine phosphorylation

    regulation of protein phosphoryl...

    positive regulation of cytokine ...

    positive regulation of cytokine ...

    actin polymerization or depolyme...positive regulation of lymphocyt...

    cell adhesion

    regulation of protein kinase act...

    induction of apoptosis

    negative regulation of programme...

    Hematopoietic cell lineage

    protein complex subunit organiza...

    intracellular protein kinase cas...

    Chemokine signaling pathway

    Natural killer cell mediated cyt...

    positive regulation of leukocyte...

    Leukocyte transendothelial migra...

    hematopoietic.system VS cell.lineG1 > G2: 1221G1 < G2: 1289

    cell activationinnate immune responseregulation of immune response

    positive regulation of immune sy...

    response to other organism

    immune effector process

    cytokine production

    leukocyte differentiation

    leukocyte migration response to cytokine stimulusinflammatory response

    regulation of defense response

    cell chemotaxis

    Signaling in Immune system

    lymphocyte proliferation

    adaptive immune response

    positive regulation of cytokine ...

    hemostasis blood coagulation

    Measles

    intracellular protein kinase cas...

    interspecies interaction between...

    Osteoclast differentiation

    integrin−mediated signaling pathway

    peptidyl−tyrosine modification

    B cell receptor signaling pathway

    positive regulation of cell deathNatural killer cell mediated cyt...

    Chemokine signaling pathway

    Hemostasis

    negative regulation of immune sy...

    Hematopoietic cell lineage

    regulation of cytokine biosynthe...

    cell adhesion

    positive regulation of cytokine ...negative regulation of programme...

    regulation of response to extern...

    regulation of phosphorylation

    positive regulation of lymphocyt...

    nucleotide−binding domain, leuci... cell cycle phasemitotic cell cycle

    regulation of cell cycle process

    Cell Cycle, Mitotic

    nuclear division

    cell division

    anaphase−promoting complex−depen...regulation of ubiquitin−protein ...

    chromosome segregation

    positive regulation of ubiquitin...

    Cell Cycle Checkpoints

    DNA ReplicationCell cycle

    response to DNA damage stimulusnegative regulation of ubiquitin...

    negative regulation of ligase ac...cytoskeleton organization

    DNA replication

    spindle organization

    protein complex assembly

    regulation of cellular amine met...

    DNA damage response, signal tran...cellular amino acid metabolic pr...

    sister chromatid segregation

    Proteasome

    Degradation multiubiquitinated C...

    Ornithine metabolism

    Degradation of beta−catenin by t...

    Degradation of ubiquitinated CD4

    APC/C:Cdh1−mediated degradation ...

    regulation of mitosis

    Regulation of activated PAK−2p34...

    Proteasome mediated degradation ...

    cell morphogenesis involved in d...

    regulation of cyclin−dependent p...

    p53 signaling pathway

    gland morphogenesisinterspecies interaction between...cell migrationtissue morphogenesis

    Tissue

    brain

    cell line

    hematopoietic system

    muscle

    Enrichment P−value

    10−70

    10−35

    1

    A B

    C

    D

    E

    Figure 1: Elements of a GO summaries figure

    3 Usage of GOsummaries

    In most cases the GOsummaries figures can be created using only two commands: gosummaries to create the object that has all the necessary information for drawing the plot, and plot.gosummaries to actually draw the plot.

    The gosummaries function requires a set of gene lists as input. It applies GO enrichment analysis to these gene lists using the g:Profiler (http://biit.cs.ut.ee/gprofiler/) web toolkit and saves the results into a gosummaries object. Then one can add experimental data and configure the slots for additional information.

    However, this can be somewhat complicated. Therefore, we have provided several convenience functions that generate the gosummaries objects based on the output of the most common analyses. We have functions gosummaries.kmeans, gosummaries.prcomp and gosummaries.MArrayLM for k-means clustering, principal component analysis (PCA) and linear models with limma. These functions extract the gene lists right from the corresponding objects, run the GO enrichment and optionally add the experimental data in the right format.

    The gosummaries objects can be plotted using the plot function. The figures might not fit into the plotting window, since the plot has to have a rather strict layout to be readable. Therefore, it is advisable to write it into a file (the file name can be given as a parameter).
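The GO enrichment step mentioned above can be illustrated with a small calculation. The sketch below scores one term against one gene list with a one-sided hypergeometric test, which is a common choice for enrichment analysis. It is not the g:Profiler or GOsummaries implementation, it applies no multiple-testing correction, and the function name and the toy gene sets are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background, n_annotated, n_query, n_overlap):
    """Return P(X >= n_overlap) for a hypergeometric draw.

    n_background: number of genes in the background set
    n_annotated:  background genes annotated with the term
    n_query:      size of the gene list being tested
    n_overlap:    genes in the list that carry the term
    """
    max_overlap = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy data (hypothetical): 20 background genes, 5 of them annotated with a term.
background = {f"gene{i}" for i in range(20)}
term_genes = {f"gene{i}" for i in range(5)}
query = {"gene0", "gene1", "gene2", "gene10"}

p_value = hypergeom_enrichment_pvalue(
    len(background), len(term_genes), len(query), len(query & term_genes)
)
print(f"enrichment p-value: {p_value:.4f}")  # 155/4845, about 0.032
```

In a real analysis the p-values of all tested terms would still need correction for multiple testing before being visualised.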


  • Pathway analysis

  • And many more…..

  • Novel genomes

    How do we compute RNA-seq gene expression for novel genomes?

    • Must have a complete genome sequence (or contigs).
    • Use predicted gene models (all protein BLASTX or EST vs genome data) to create an exon map, or
    • use de novo assembly of transcripts from RNA-seq data.
    • Computationally huge problem: all-against-all similarity searching and multiple overlapping transcripts.
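To see why the all-against-all step is expensive, here is a minimal sketch that computes suffix-prefix overlaps for every ordered pair of reads, the raw material of an overlap-based de novo assembler. It is only an illustration: the function names, the `min_len` threshold and the toy reads are hypothetical, and real assemblers use indexed data structures instead of this quadratic loop.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_against_all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads: n * (n - 1) comparisons."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        length = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if length:
            overlaps[(i, j)] = length
    return overlaps


# Toy reads (hypothetical) that chain into ACGTACGGTT.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
overlaps = all_against_all_overlaps(reads)
print(overlaps)  # {(0, 1): 4, (1, 2): 4}
```

With n reads the loop performs n(n - 1) comparisons, so a data set with hundreds of millions of reads cannot be handled this way without indexing or filtering.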

  • RNA-seq analysis programs

    […] each approach and their application to RNA-seq analysis. We also discuss how these different methodologies can impact the results and interpretation of the data. Although we discuss each of the three categories as separate units, RNA-seq data analysis often requires using methods from all three categories. The methods described here are largely independent of the choice of library construction protocols, with the notable exception of ‘paired-end’ sequencing (reading from both ends of a fragment), which provides valuable information at all stages of RNA-seq analysis [28–30].

    As a reference for the reader, we provide a list of currently available methods in each category (Table 1). To provide a general indication of the compute resources and tradeoffs of different methods, we selected a representative method from each category and applied it to a published RNA-seq dataset consisting of 58 million paired-end 76-base reads from mouse embryonic stem cell RNA [28] (Supplementary Table 1).

    Mapping short RNA-seq reads

    One of the most basic tasks in RNA-seq analysis is the alignment of reads to either a reference transcriptome or genome. Alignment of reads is a classic problem in bioinformatics with several solutions specifically for EST mapping [8,9]. RNA-seq reads, however, pose particular challenges because they are short (~36–125 bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions. There are two major algorithmic approaches to map RNA-seq reads to a reference transcriptome. The first, to which we collectively refer as ‘unspliced read aligners’, align reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, ‘seed methods’ and ‘Burrows-Wheeler transform methods’. Seed methods [31–38] such as mapping and assembly with quality (MAQ) [33] and Stampy [35] find matches for short subsequences, termed ‘seeds’, assuming that at least […]
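The seed idea used by unspliced read aligners, as described in the excerpt, can be sketched in a few lines: index every short subsequence of the reference, look up the seeds of a read, and verify each candidate position by counting mismatches. This is a toy illustration and not the algorithm of MAQ, Stampy or any other listed tool; the function names, the seed length and the sequences are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Map every substring of length `seed_len` to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, index, seed_len, max_mismatches=1):
    """Return (start, mismatches) for each ungapped hit of `read`."""
    candidates = set()
    # Non-overlapping seeds: exact seed hits propose candidate start positions.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for pos in index.get(seed, []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    # Verification: compare the full read against each candidate window.
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for x, y in zip(read, window) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Toy reference and reads (hypothetical sequences).
reference = "TTGACCGTAAGCTAGGCATT"
index = build_seed_index(reference, seed_len=4)
print(map_read("GCTAGGCA", reference, index, seed_len=4))  # [(10, 0)]
print(map_read("CCGTAAGA", reference, index, seed_len=4))  # [(4, 1)]
```

With two non-overlapping seeds per read, a read with a single mismatch still has one seed that matches exactly, which is why the second read is found. Reads spanning exon-exon junctions would fail the ungapped verification step, which is the motivation for spliced aligners.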

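    Table 1 | Selected list of RNA-seq analysis programs

    Read mapping

    | Class | Category | Package | Notes | Uses | Input |
    |---|---|---|---|---|---|
    | Unspliced aligners (a) | Seed methods | Short-read mapping package (SHRiMP) [41] | Smith-Waterman extension | Aligning reads to a reference transcriptome | Reads and reference transcriptome |
    | Unspliced aligners (a) | Seed methods | Stampy [39] | Probabilistic model | Aligning reads to a reference transcriptome | Reads and reference transcriptome |
    | Unspliced aligners (a) | Burrows-Wheeler transform methods | Bowtie [43] | | Aligning reads to a reference transcriptome | Reads and reference transcriptome |
    | Unspliced aligners (a) | Burrows-Wheeler transform methods | BWA [44] | Incorporates quality scores | Aligning reads to a reference transcriptome | Reads and reference transcriptome |
    | Spliced aligners | Exon-first methods | MapSplice [52] | Works with multiple unspliced aligners | Aligning reads to a reference genome. Allows for the identification of novel splice junctions | Reads and reference genome |
    | Spliced aligners | Exon-first methods | SpliceMap [50] | | Aligning reads to a reference genome. Allows for the identification of novel splice junctions | Reads and reference genome |
    | Spliced aligners | Exon-first methods | TopHat [51] | Uses Bowtie alignments | Aligning reads to a reference genome. Allows for the identification of novel splice junctions | Reads and reference genome |
    | Spliced aligners | Seed-extend methods | GSNAP [53] | Can use SNP databases | Aligning reads to a reference genome. Allows for the identification of novel splice junctions | Reads and reference genome |
    | Spliced aligners | Seed-extend methods | QPALMA [54] | Smith-Waterman for large gaps | Aligning reads to a reference genome. Allows for the identification of novel splice junctions | Reads and reference genome |

    Transcriptome reconstruction

    | Class | Category | Package | Notes | Uses | Input |
    |---|---|---|---|---|---|
    | Genome-guided reconstruction | Exon identification | G.Mor.Se | Assembles exons | Identifying novel transcripts using a known reference genome | Alignments to reference genome |
    | Genome-guided reconstruction | Genome-guided assembly | Scripture [28] | Reports all isoforms | Identifying novel transcripts using a known reference genome | Alignments to reference genome |
    | Genome-guided reconstruction | Genome-guided assembly | Cufflinks [29] | Reports a minimal set of isoforms | Identifying novel transcripts using a known reference genome | Alignments to reference genome |
    | Genome-independent reconstruction | Genome-independent assembly | Velvet [61] | Reports all isoforms | Identifying novel genes and transcript isoforms without a known reference genome | Reads |
    | Genome-independent reconstruction | Genome-independent assembly | TransABySS [56] | | Identifying novel genes and transcript isoforms without a known reference genome | Reads |

    Expression quantification

    | Class | Category | Package | Notes | Uses | Input |
    |---|---|---|---|---|---|
    | Expression quantification | Gene quantification | Alexa-seq [47] | Quantifies using differentially included exons | Quantifying gene expression | Reads and transcript models |
    | Expression quantification | Gene quantification | Enhanced read analysis of gene expression (ERANGE) [20] | Quantifies using union of exons | Quantifying gene expression | Reads and transcript models |
    | Expression quantification | Gene quantification | Normalization by expected uniquely mappable area (NEUMA) [82] | Quantifies using unique reads | Quantifying gene expression | Reads and transcript models |
    | Expression quantification | Isoform quantification | Cufflinks [29] | Maximum likelihood estimation of relative isoform expression | Quantifying transcript isoform expression levels | Read alignments to isoforms |
    | Expression quantification | Isoform quantification | MISO [33] | | Quantifying transcript isoform expression levels | Read alignments to isoforms |
    | Expression quantification | Isoform quantification | RNA-seq by expectation maximization (RSEM) [69] | | Quantifying transcript isoform expression levels | Read alignments to isoforms |
    | Differential expression | | Cuffdiff [29] | Uses isoform levels in analysis | Identifying differentially expressed genes or transcript isoforms | Read alignments and transcript models |
    | Differential expression | | DegSeq [79] | Uses a normal distribution | Identifying differentially expressed genes or transcript isoforms | Read alignments and transcript models |
    | Differential expression | | EdgeR [77] | | Identifying differentially expressed genes or transcript isoforms | Read alignments and transcript models |
    | Differential expression | | Differential Expression analysis of count data (DESeq) [78] | | Identifying differentially expressed genes or transcript isoforms | Read alignments and transcript models |
    | Differential expression | | Myrna [75] | Cloud-based permutation method | Identifying differentially expressed genes or transcript isoforms | Read alignments and transcript models |

    (a) This list is not meant to be exhaustive as many different programs are available for short-read alignment. Here we chose a representative set capturing the frequently used tools for RNA-seq or tools representing fundamentally different approaches.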

    Source: Nature Methods review, Vol. 8 No. 6, June 2011, p. 470.
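The "union of exons" note in the gene quantification rows can be made concrete with a small sketch: merge the exons of all isoforms of a gene into non-overlapping intervals, then count the reads that overlap the merged model. This is an assumption-laden toy, not the ERANGE implementation; coordinates are half-open, and the isoform names, exon coordinates and read positions are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def count_reads_in_union(exons_by_isoform, read_intervals):
    """Count reads overlapping the union of all isoforms' exons."""
    union = merge_intervals(
        exon for exons in exons_by_isoform.values() for exon in exons
    )
    count = 0
    for r_start, r_end in read_intervals:
        if any(r_start < e_end and e_start < r_end for e_start, e_end in union):
            count += 1
    return union, count


# Toy gene with two isoforms and five aligned 36-base reads (hypothetical).
exons_by_isoform = {
    "isoform_A": [(100, 200), (300, 400)],
    "isoform_B": [(150, 250), (300, 350)],
}
reads = [(90, 126), (240, 276), (260, 296), (380, 416), (500, 536)]

union, count = count_reads_in_union(exons_by_isoform, reads)
print(union)  # [(100, 250), (300, 400)]
print(count)  # 3
```

A gene-level count like this ignores which isoform a read came from, which is the problem the isoform quantification tools in the table address.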

  • Comparison of tools

  • Challenges

    • Several sequencing technologies
    • Complex normalization
    • Difficulty achieving mappability
    • Accurate detection of splice junctions
    • Proper summarization methods needed
    • Most challenging for novel genomes
    • Fewer algorithms exist for de novo assembly than for reference-based assembly.

  • Summary

    • RNA-seq is used to study RNA content
    • More quantitative than microarrays
    • Can be used for studying different layers of transcription
    • Several factors need to be considered in experimental design
    • Mapping, transcript assembly, summarization, differential expression and visualization are the major steps in RNA-seq
    • Gene ontology analysis, pathway analysis and integrative study followed by systems biology are possible next steps for RNA-seq gene lists.