Vijayachitra Modhukur BIIT

Next generation sequencing (NGS)-RNA sequencing

1 11/20/13 Bioinformatics course

NGS lectures

Genomics

Transcriptomics

Proteomics

Epigenomics

Recap


Sequencing


Different generations of sequencing


Second generation sequencing


NGS platforms


Leading Platforms

              454          Solexa/Illumina      SOLiD (ABI)
Bp per run    400 Mb       2-3 Gb               3-6 Gb
Read length   250-400 bp   35-50 (70-100) bp    35-50 bp
Run time      10 hr        2.5 days             5 days
Download      20 min       27 hr (44 min)       ~1 day
Analysis      2-5 hr       2 days               2-3 days
Files         20-50 Gb     1 T                  1 T

With 3730s, ~60 Mb per year

Specifications as of summer 2008

Massive amount of sequenced data

Sequence alignment   De novo alignment   Reference alignment


Short read mapping (de novo) - SSP

•  Let f1, f2, …, fk be words in Σ*.
•  We want to find the shortest string g ∈ Σ* such that each fi is a substring of g.
•  Example: say we have the set of strings f1 = ACGTA, f2 = CTTGA, f3 = ACTT, f4 = GTAAC.
•  Find the shortest common superstring of these 4 strings.

So Σ* is the "free language" on the alphabet Σ. Note: if you like monoids, Σ* has the algebraic structure of a monoid.

Definition 1. Let f, g ∈ Σ*, so that we write

f = s_1 s_2 ··· s_n, g = t_1 t_2 ··· t_m

where s_i, t_j ∈ Σ for all i and j. Then f is a substring of g if there exists an index k ≥ 1 such that s_1 s_2 ··· s_n = t_k t_{k+1} ··· t_{n+k-1}. Conversely, g is a superstring of f.

Example 1. Let f = ACTG and g = AAACTGCA. Then f is a substring of g, because g = AA[ACTG]CA (take k = 3).

The Shortest Common Superstring Problem

Let f_1, f_2, ···, f_k be words in Σ*. We want to find the shortest string g ∈ Σ* such that each f_i is a substring of g. This is known as the Shortest Common Superstring Problem (SSP). We practice a solution to this problem informed by Ockham's Razor: we assume that the best reconstruction is also the simplest.

Example 2. Say we have this set of reads: f_1 = ACGTA, f_2 = CTTGA, f_3 = ACTT, f_4 = GTAAC. Find the shortest common superstring of these four strings.

Figure 2: The SSP for f_1, ..., f_4

By treating these reads like puzzle pieces, we can put the four reads into this superstring, which in this case is of shortest possible length (by inspection). We will soon see that this particular common superstring can be constructed using an algorithm, although the algorithm has some issues.

It turns out that this problem is difficult to solve in practice:

Theorem 3 (Gallant 1980). The SSP is NP-Complete.

While we will not deal with complexity theory in detail in this class, we can take this to mean that the SSP is provably hard in a nasty way. This problem is related to graph theory.
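A minimal sketch of one common heuristic for the SSP: repeatedly merge the two reads with the largest suffix-prefix overlap. This is an assumption about which algorithm is meant, not necessarily the one the notes go on to describe, and the function names `overlap` and `greedy_superstring` are made up for illustration. The greedy merge is not guaranteed to return a shortest superstring in general.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(reads):
    """Greedy heuristic: merge the pair with the largest overlap until one string is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are already contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (-1, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:  # ties keep the first pair found
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0]


# The four reads from Example 2.
reads = ["ACGTA", "CTTGA", "ACTT", "GTAAC"]
superstring = greedy_superstring(reads)
print(superstring, len(superstring))
```

On the reads of Example 2 the sketch merges ACGTA with GTAAC, then ACTT with CTTGA, and finally joins the two results into a superstring of length 11.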

Reference alignment


Find locations where short read is identical to reference genome
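As a rough illustration of the idea, exact matching of a read against a reference can be sketched with plain string search. The toy reference below is hypothetical, and real short-read mappers use indexed methods and tolerate mismatches rather than scanning the genome like this.

```python
def exact_match_positions(read, reference):
    """Return every 0-based start position where read occurs exactly in reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping hits are also reported.
        start = reference.find(read, start + 1)
    return positions


reference = "AAACTGCAACTG"  # hypothetical toy reference
print(exact_match_positions("ACTG", reference))  # [2, 8]
```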

NGS Analysis


Data analysis

cpu/memory intensive


Quality scores

•  Each base from a sequencer comes with a quality score
•  The score encodes the base-calling error probability P
•  Phred quality score: Q = -10 log10 P
•  A higher quality score indicates a smaller probability of error

http://www.illumina.com/truseq/quality_101/quality_scores.ilmn
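The Phred relation Q = -10 log10 P can be checked numerically. A small sketch (the function names are illustrative only):

```python
import math


def phred_from_error(p):
    """Phred quality score for a base-calling error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Base-calling error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    print(q, error_from_phred(q))
```

So Q = 20 corresponds to an error probability of 1 in 100, and Q = 30 to 1 in 1000.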



File formats


FASTQ

Raw data


Alignment methods


Reference assembly
•  Spaced seed
•  BWT

De novo assembly
•  Greedy assemblers
•  Graph based: overlap-layout-consensus
•  Graph based: de Bruijn graph

RNA sequencing


Transcription


RNA world hypothesis


What is RNA-seq? Use of high-throughput sequencing technologies to assess the RNA content of a sample.

Journal of Biomedicine and Biotechnology, p. 11

[Figure labels: exon; intron; sequence read; signal from annotated exons; non-exonic signal]

Figure 5: Mapping and quantification of the signal. RNA-seq experiments produce short reads sequenced from processed mRNAs. When a reference genome is available the reads can be mapped on it using efficient alignment software. Classical alignment tools will accurately map reads that fall within an exon, but they will fail to map spliced reads. To handle such problem suitable mappers, based either on junctions library or on more sophisticated approaches, need to be considered. After the mapping step annotated features can be quantified.

In order to derive a quantitative expression for annotated elements (such as exons or genes) within a genome, the simplest approach is to provide the expression as the total number of reads mapping to the coordinates of each annotated element. In the classical form, such method weights all the reads equally, even though they map the genome with different stringency. Alternatively, gene expression can be calculated as the sum of the number of reads covering each base position of the annotated element; in this way the expression is provided in terms of base coverage. In both cases, the results depend on the accuracy of the used gene models and the quantitative measures are a function of the number of mapped reads, the length of the region of interest and the molar concentration of the specific transcript. A straightforward solution to account for the sample size effect is to normalize the observed counts for the length of the element and the number of mapped reads. In [37], the authors proposed the Reads Per Kilobase per Million of mapped reads (RPKM) as a quantitative normalized measure for comparing both different genes within the same sample and differences of expression across biological conditions. In [84], the authors considered two alternative measures of relative expression: the fraction of transcripts and the fraction of nucleotides of the transcriptome made up by a given gene or isoform.

Although apparently easy to obtain, RPKM values can have several differences between software packages, hidden at first sight, due to the lack of a clear documentation of the analysis algorithms used. For example ERANGE [37] uses a union of known and new exon models to aggregate reads and determines a value for each region that includes spliced reads and assigned multireads too, whereas [30, 40, 81, 90] are restricted to known or prespecified exons/gene models. However, as noticed in [91], several experimental issues influence the RPKM quantification, including the integrity of the input RNA, the extent of ribosomal RNA remaining in the sample, the size selection steps and the accuracy of the gene models used.

In principle, RPKMs should reflect the true RNA concentration; this is true when samples have relatively uniform sequence coverage across the entire gene model. The problem is that all protocols currently fall short of providing the desired uniformity, see for example [37], where the Kolmogorov-Smirnov statistics is used to compare the observed reads distribution on each selected exon model with the theoretical uniform one. Similar conclusions are also illustrated in [57, 58], among others.

Additionally, it should be noted that RPKM measure should not be considered as the panacea for all RNA-Seq experiments. Despite the importance of the issue, the expression quantification did not receive the necessary attention from the community and in most of the cases the choice has been done regardless of the fact that the main question is the detection of differentially expressed elements. Regarding this point in [92] it is illustrated the inherent bias in transcript length that affect RNA-Seq experiments. In fact the total number of reads for a given transcript is roughly proportional to both the expression level and the length of the transcript. In other words, a long transcript will have more reads mapping to it compared to a short gene of similar expression. Since the power of an experiment is proportional to the sampling size, there will be more statistical power

Slides from Halisha Holloway

RNA-seq vs microarray

RNA-seq:
•  ID novel genes, transcripts, & exons
•  Greater dynamic range
•  Less bias due to genetic variation
•  Repeatable
•  No species-specific primer/probe design
•  More accurate relative to qPCR
•  Many more applications

Microarray:
•  Well vetted QC and analysis methods
•  Well characterized biases
•  Quick turnaround from established core facilities
•  Currently less expensive


Why do an RNA-seq experiment?

•  Detect differential expression
•  Assess allele-specific expression
•  Quantify alternative transcript usage
•  Discover novel genes/transcripts, gene fusions
•  Profile transcriptome
•  Ribosome profiling to measure translation




Skelly et al. 2011



[Figure labels: Pluripotent Stem Cell; Cardiogenic Mesoderm; Cardiac Precursors; Cardiomyocytes]





RNA-seq protocol



Sample RNA -> (reverse transcription + PCR) -> amplified cDNA -> (fragmentation) -> cDNA fragments -> (sequencing machine) -> reads

Reads:

CCTTCNCACTTCGTTTCCCAC

TTTTTNCAGAGTTTTTTCTTG

GAACANTCCAACGCTTGGTGA

GGAAANAAGACCCTGTTGAGC

CCCGGNGATCCGCTGGGACAA

GCAGCATATTGATAGATAACT

CTAGCTACGCGTACGCGATCG

CATCTAGCATCGCGTTGCGTT

CCCGCGCGCTTAGGCTACTCG

TCACACATCTCTAGCTAGCAT

CATGCTAGCTATGCCTATCTA

CACCCCGGGGATATATAGGAT

RNA-seq data


Each read is stored as four lines: name, sequence, name again (after "+"), and qualities.

@HWUSI-EAS1789_0001:3:2:1708:1305#0/1
CCTTCNCACTTCGTTTCCCACTTAGCGATAATTTG
+HWUSI-EAS1789_0001:3:2:1708:1305#0/1
VVULVBVYVYZZXZZ\ee[a^b`[a\a[\\a^^^\
@HWUSI-EAS1789_0001:3:2:2062:1304#0/1
TTTTTNCAGAGTTTTTTCTTGAACTGGAAATTTTT
+HWUSI-EAS1789_0001:3:2:2062:1304#0/1
a__[\Bbbb`edeeefd`cc`b]bffff`ffffff
@HWUSI-EAS1789_0001:3:2:3194:1303#0/1
GAACANTCCAACGCTTGGTGAATTCTGCTTCACAA
+HWUSI-EAS1789_0001:3:2:3194:1303#0/1
ZZ[[VBZZY][TWQQZ\ZS\[ZZXV__\OX`a[ZZ
@HWUSI-EAS1789_0001:3:2:3716:1304#0/1
GGAAANAAGACCCTGTTGAGCTTGACTCTAGTCTG
+HWUSI-EAS1789_0001:3:2:3716:1304#0/1
aaXWYBZVTXZX_]Xdccdfbb_\`a\aY_^]LZ^
@HWUSI-EAS1789_0001:3:2:5000:1304#0/1
CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA
+HWUSI-EAS1789_0001:3:2:5000:1304#0/1
aaaaaBeeeeffffehhhhhhggdhhhhahhhadh

•  1 Illumina (GAIIX) lane: ~20 million reads
•  Paired-end reads: read1 and read2 are sequenced from the two ends of the same fragment
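A minimal sketch of reading such four-line records, using the last read on the slide as input. The helper names are illustrative, and the quality offset of 64 is an assumption: the characters on the slide (up to "h") look like the older Illumina Phred+64 encoding, whereas current FASTQ files normally use an offset of 33.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) tuples from an iterable of FASTQ lines."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def decode_qualities(qual, offset=64):
    """Convert quality characters to integer scores (offset 64 is assumed here)."""
    return [ord(c) - offset for c in qual]


record = [
    "@HWUSI-EAS1789_0001:3:2:5000:1304#0/1",
    "CCCGGNGATCCGCTGGGACAAGCAGCATATTGATA",
    "+HWUSI-EAS1789_0001:3:2:5000:1304#0/1",
    "aaaaaBeeeeffffehhhhhhggdhhhhahhhadh",
]
for name, seq, qual in parse_fastq(record):
    scores = decode_qualities(qual)
    print(name, len(seq), min(scores), max(scores))
```

Note that the uncalled base N at position 6 coincides with the lowest quality character, "B".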

Coverage


•  Coverage = number of sequenced bases / size of the original genome

•  Number of sequenced bases = number of reads × length of the reads
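The coverage formula as a short sketch (the function name and the example numbers are chosen only for illustration):

```python
def coverage(num_reads, read_length, genome_size):
    """Average coverage: total sequenced bases divided by the size of the target."""
    return num_reads * read_length / genome_size


# 600,000 reads of 100 bp over a 60 Mb target give 1x coverage.
print(coverage(600_000, 100, 60_000_000))  # 1.0
```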

Some things to consider in experimental design


Plan it well

•  Experimental design
•  Biological replicates
•  Reference genome?
•  Good gene annotation?
•  Read depth
•  Read length
•  Paired vs. single-end

Technical variation

Biological variation





[Figure: Robustness of transcript identification as input data are removed. x-axis: fraction of total number of reads in jackknifed data set (0.0 to 1.0); y-axis: fraction of transcripts with non-zero FPKM, relative to 100% (0.0 to 1.0); annotations: 10%, 5%, 2%, 1%, 0.1%; legend: Cufflinks, USeq-DESeq]


How much data do we need?

•  ~15-20K genes expressed in a tissue or cell line.
•  Genes are on average 3 kb.
•  For 1x coverage using 100 bp reads, we would need 600K sequence reads.
•  In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
•  30-50 million reads
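The 600K figure follows from 20,000 genes × 3,000 bp = 60 Mb of expressed sequence, divided by 100 bp per read. A sketch of that arithmetic, with a hypothetical helper name; it assumes reads are spread evenly over genes, which real expression levels are not:

```python
import math


def reads_for_coverage(n_genes, mean_gene_length, read_length, target_coverage=1.0):
    """Reads needed to cover the expressed genes at the requested average depth."""
    transcriptome_bases = n_genes * mean_gene_length
    return math.ceil(target_coverage * transcriptome_bases / read_length)


print(reads_for_coverage(20_000, 3_000, 100))        # 600000 reads for 1x
print(reads_for_coverage(20_000, 3_000, 100, 50.0))  # 30000000 reads for 50x
```

Under the same even-coverage assumption, 30 million reads correspond to roughly 50x average depth, which is the lower end of the range quoted on the slide.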





Unique sequences = 4^(read length)

Read length   Unique sequences
25            1.1 x 10^15
50            1.3 x 10^30
100           1.6 x 10^60

~60 million coding bases in vertebrate genome
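The table values can be reproduced directly, since each position of a read can be one of four bases (the function name is illustrative):

```python
def unique_sequences(read_length, alphabet_size=4):
    """Number of distinct sequences of the given length over A, C, G, T."""
    return alphabet_size ** read_length


for n in (25, 50, 100):
    print(n, f"{unique_sequences(n):.1e}")
```

Even at 25 bp the number of possible sequences is many orders of magnitude larger than the ~60 million coding bases, which is why short reads can often be placed uniquely.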




•  Read depth
•  Barcoding
•  Read length
•  Paired vs. single-end


Power of paired-end reads

•  Huge impact on read mapping
•  Pairs give two locations to determine whether read is unique
•  Critical for estimating transcript-level abundance
•  Increases number of splice junction spanning reads


Comparison of two designs for testing differential expression between treatments A and B. Treatment A is denoted by red tones and treatment B by blue tones.

Auer P L, and Doerge R W. Genetics 2010;185:405-416

Copyright © 2010 by the Genetics Society of America

RNA-seq pipeline


Typical RNA-seq experiment



RNA-seq informatics workflow

1. QC and genome mapping
2. Splice junction fragments
3. Predict novel junctions/exons
4. Counts
5. Normalize
6. Differential expression
7. Gene lists

Quality control


QC: Raw Data   Sequence call quality


QC: Raw Data   Sequence bias


QC: Raw Data   Duplication level


Mapping


Align reads to the genome
•  Simple for genomic sequences
•  Difficult for transcripts with splice junctions

Junction reads


TopHat pipeline


Alternative splicing


Cufflinks


RNA-seq complete pipeline


RNA-seq summarization


Normalization aims


•  Comparable across features (genes, isoforms, etc.)
•  Comparable across different samples (libraries)
   -  Between samples (libraries)
   -  Within samples (libraries)
•  Easily interpretable

Within library normalization


•  Allows quantification of the expression level of each gene relative to the other genes within the library
•  Longer transcripts have higher read counts (at the same expression level)
•  Widely used: RPKM (Reads Per Kilobase per Million mapped reads)

RPKM-example


•  No. of mapped reads = 3
•  Length of transcript = 300 bp
•  Total no. of reads = 10,000
•  RPK = 3 / (300/1000) = 3 / 0.3 = 10
•  RPKM = 10 / (10,000/1,000,000) = 10 / 0.01 = 1000
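The worked example can be written as a small function. This is a sketch with an illustrative function name; it folds the two steps (reads per kilobase, then per million reads) into one expression, which is algebraically the same calculation.

```python
def rpkm(mapped_reads, transcript_length_bp, total_reads):
    """Reads Per Kilobase of transcript per Million reads in the library."""
    # Same as (mapped_reads / (length / 1e3)) / (total_reads / 1e6).
    return mapped_reads * 1e9 / (transcript_length_bp * total_reads)


print(rpkm(3, 300, 10_000))  # 1000.0
```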

Between library normalization


•  Adjust by total number of reads in the library
•  A small number of highly expressed genes can consume a significant amount of sequences
•  Solution: scaling factor
•  Scaling the number of reads in a library to a common value
•  Quantile normalization
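A minimal sketch of the simplest option, scaling every library to a common total. The gene names, counts and function name are hypothetical, and this plain total-count scaling does not address the problem of a few highly expressed genes dominating a library.

```python
def scale_to_common_total(counts_by_library, target_total=None):
    """Rescale each library so that its read counts sum to the same total."""
    totals = {lib: sum(counts.values()) for lib, counts in counts_by_library.items()}
    if target_total is None:
        # Default: use the mean library size as the common value.
        target_total = sum(totals.values()) / len(totals)
    return {
        lib: {gene: n * target_total / totals[lib] for gene, n in counts.items()}
        for lib, counts in counts_by_library.items()
    }


# Hypothetical counts for two libraries of different depth.
counts = {
    "libA": {"gene1": 10, "gene2": 90},
    "libB": {"gene1": 40, "gene2": 160},
}
scaled = scale_to_common_total(counts, target_total=1_000_000)
print(scaled["libA"]["gene1"], scaled["libB"]["gene1"])
```

After scaling, gene1 can be compared between the two libraries even though libB was sequenced twice as deeply.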

Differential expression


•  List genes whose abundance changes significantly across different experimental conditions
•  Not the same as for microarrays, since the data are not log-transformed
•  If reads are independently sampled from a population, the counts follow a multinomial distribution, approximated by the Poisson
•  Pr(X = k) = λ^k e^(-λ) / k!
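The Poisson probability Pr(X = k) = λ^k e^(-λ) / k! can be evaluated directly. A sketch, with an illustrative function name and an arbitrary example rate:

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing exactly k reads when the expected count is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


lam = 3.0  # hypothetical expected read count for a gene
for k in range(6):
    print(k, round(poisson_pmf(k, lam), 4))
```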

Several tools for differential expression…


each approach and their application to RNA-seq analysis. We also discuss how these different methodologies can impact the results and interpretation of the data. Although we discuss each of the three categories as separate units, RNA-seq data analysis often requires using methods from all three categories. The methods described here are largely independent of the choice of library construction protocols, with the notable exception of 'paired-end' sequencing (reading from both ends of a fragment), which provides valuable information at all stages of RNA-seq analysis [28–30].

As a reference for the reader, we provide a list of currently available methods in each category (Table 1). To provide a general indication of the compute resources and tradeoffs of different methods, we selected a representative method from each category and applied it to a published RNA-seq dataset consisting of 58 million paired-end 76-base reads from mouse embryonic stem cell RNA [28] (Supplementary Table 1).

Mapping short RNA-seq reads

One of the most basic tasks in RNA-seq analysis is the alignment of reads to either a reference transcriptome or genome. Alignment of reads is a classic problem in bioinformatics with several solutions specifically for EST mapping [8,9]. RNA-seq reads, however, pose particular challenges because they are short (~36–125 bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions. There are two major algorithmic approaches to map RNA-seq reads to a reference transcriptome. The first, to which we collectively refer as 'unspliced read aligners', align reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, 'seed methods' and 'Burrows-Wheeler transform methods'. Seed methods [31–38] such as mapping and assembly with quality (MAQ) [33] and Stampy [35] find matches for short subsequences, termed 'seeds', assuming that at least

Table 1 | Selected list of RNA-seq analysis programs
(Columns in the original: Class, Category, Package, Notes, Uses, Input)

Read mapping
  Unspliced aligners (a)
    Uses: aligning reads to a reference transcriptome. Input: reads and reference transcriptome.
    Seed methods
      Short-read mapping package (SHRiMP) [41]: Smith-Waterman extension
      Stampy [39]: probabilistic model
    Burrows-Wheeler transform methods
      Bowtie [43]
      BWA [44]: incorporates quality scores
  Spliced aligners
    Uses: aligning reads to a reference genome; allows for the identification of novel splice junctions. Input: reads and reference genome.
    Exon-first methods
      MapSplice [52]: works with multiple unspliced aligners
      SpliceMap [50]
      TopHat [51]: uses Bowtie alignments
    Seed-extend methods
      GSNAP [53]: can use SNP databases
      QPALMA [54]: Smith-Waterman for large gaps

Transcriptome reconstruction
  Genome-guided reconstruction
    Uses: identifying novel transcripts using a known reference genome. Input: alignments to reference genome.
    Exon identification
      G.Mor.Se: assembles exons
    Genome-guided assembly
      Scripture [28]: reports all isoforms
      Cufflinks [29]: reports a minimal set of isoforms
  Genome-independent reconstruction
    Uses: identifying novel genes and transcript isoforms without a known reference genome. Input: reads.
    Genome-independent assembly
      Velvet [61]: reports all isoforms
      TransABySS [56]

Expression quantification
  Expression quantification
    Gene quantification
      Uses: quantifying gene expression. Input: reads and transcript models.
      Alexa-seq [47]: quantifies using differentially included exons
      Enhanced read analysis of gene expression (ERANGE) [20]: quantifies using union of exons
      Normalization by expected uniquely mappable area (NEUMA) [82]: quantifies using unique reads
    Isoform quantification
      Uses: quantifying transcript isoform expression levels. Input: read alignments to isoforms.
      Cufflinks [29]: maximum likelihood estimation of relative isoform expression
      MISO [33]
      RNA-seq by expectation maximization (RSEM) [69]
  (category label missing in the extracted table)
      Uses: identifying differentially expressed genes or transcript isoforms. Input: read alignments and transcript models.
      Cuffdiff [29]: uses isoform levels in analysis
      DegSeq [79]: uses a normal distribution
      EdgeR [77]
      Differential Expression analysis of count data (DESeq) [78]
      Myrna [75]: cloud-based permutation method

(a) This list is not meant to be exhaustive as many different programs are available for short-read alignment. Here we chose a representative set capturing the frequently used tools for RNA-seq or tools representing fundamentally different approaches.

Source: Nature Methods, Review, vol. 8 no. 6, June 2011, p. 470

Analysis of differentially expressed gene list


Gene ontology analysis


The main input of g:Sorter is a single gene ID. The user selects an expression dataset, a mathematical measure of distance like the Pearson correlation or Euclidean distance, and the size of the desired result. The result of g:Sorter analysis is a list of probes most similar (or dissimilar) to the query gene in the selected dataset. Visualisation shows the relative distances between probes. In case a gene is represented by several probes, a search is conducted

Figure 1. (A) A typical user input and output scenario of g:Profiler. User inserts a set of genes in the main text window and optionally adjusts query parameters. Results are provided either graphically or in textual format. Genes are presented in columns, and significant functional categories in rows. The analysis of an ordered list shows the length of the most significant query head. GO annotation evidence codes are coloured like a heat map, showing the strength of evidence between a gene and GO term. The legend is provided at the top of the page. It is displayed when the user clicks on the tree icon on the results page. The g:Orth, g:Convert and g:Sorter tools are directly linked to relevant genes from the current query. Additional examples are available in Supplementary Data. (B) Hierarchical relations between the resulting GO categories can be browsed by clicking on corresponding icons.

Nucleic Acids Research, 2007

Gene ontology – GOsummaries

[Figure: word clouds of enriched GO terms and pathways for three tissue comparisons, with panel elements labelled A to E.
Panel titles:
•  cell.line VS brain (G1 > G2: 2168, G1 < G2: 2132)
•  muscle VS hematopoietic.system (G1 > G2: 1527, G1 < G2: 1159)
•  hematopoietic.system VS cell.line (G1 > G2: 1221, G1 < G2: 1289)
Example terms shown: cell cycle phase, mitotic cell cycle, neuron development, muscle structure development, innate immune response.
Legend: Tissue (brain, cell line, hematopoietic system, muscle); Enrichment P-value scale from 10^-70 through 10^-35 to 1.]

Figure 1: Elements of a GOsummaries figure

3 Usage of GOsummaries

In most cases the GOsummaries figures can be created using only two commands: gosummaries to create the object that has all the necessary information for drawing the plot, and plot.gosummaries to actually draw the plot.

The gosummaries function requires a set of gene lists as an input. It applies GO enrichment analysis to these gene lists using the g:Profiler (http://biit.cs.ut.ee/gprofiler/) web toolkit and saves the results into a gosummaries object. Then one can add experimental data and configure the slots for additional information.

However, this can be somewhat complicated. Therefore, we have provided several convenience functions that generate the gosummaries objects based on the output of the most common analyses. We have functions gosummaries.kmeans, gosummaries.prcomp and gosummaries.MArrayLM, for k-means clustering, principal component analysis (PCA) and linear models with limma. These functions extract the gene lists right from the corresponding objects, run the GO enrichment and optionally add the experimental data in the right format.

A gosummaries object can be plotted using the plot function. The figures might not fit into the plotting window, since the plot has to have a rather strict layout to be readable. Therefore, it is advisable to write it into a file (the file name can be given as a parameter).
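The GO enrichment step mentioned above can be illustrated with a basic over-representation test. The sketch below is a minimal, assumed version based on the upper tail of the hypergeometric distribution; the function name and the toy numbers are hypothetical, and it does not reproduce the statistics or multiple-testing correction that g:Profiler itself applies.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, n_query, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: number of genes in the background set
    n_annotated:  background genes annotated with the GO term
    n_query:      genes in the query gene list
    n_overlap:    query genes annotated with the GO term
    """
    max_overlap = min(n_annotated, n_query)
    total = comb(n_background, n_query)
    # Sum the probabilities of observing n_overlap or more annotated genes.
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_query - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20000 background genes, 200 carry the term,
# the query list has 100 genes and 10 of them carry the term.
p = enrichment_p_value(20000, 200, 100, 10)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value means the term appears in the gene list more often than expected by chance; in practice the values for many terms would still need correction for multiple testing.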


Pathway analysis



And many more…..




Novel genomes


•  How do we compute RNA-seq gene expression for novel genomes?
•  Must have complete genome sequence (or contigs).
•  Use predicted gene models (all protein BLASTX or EST vs genome data) to create an exon map, or
•  De novo assembly of transcripts from RNA-seq data.
•  Computationally huge problem: all-against-all similarity searching and multiple overlapping transcripts.
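To see why all-against-all similarity searching becomes expensive, the following sketch computes suffix-prefix overlaps between every ordered pair of fragments, using the four example strings from the shortest common superstring slide. It is an illustrative assumption of how the comparison step could look, not the method of any particular assembler; the function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_against_all(fragments):
    """Map each ordered pair (i, j) with a non-zero overlap to its length."""
    result = {}
    # n fragments give n * (n - 1) ordered pairs, so work grows quadratically.
    for i, j in permutations(range(len(fragments)), 2):
        length = overlap(fragments[i], fragments[j])
        if length > 0:
            result[(i, j)] = length
    return result


fragments = ["ACGTA", "CTTGA", "ACTT", "GTAAC"]
overlaps = all_against_all(fragments)
for (i, j), length in sorted(overlaps.items()):
    print(fragments[i], "->", fragments[j], "overlap", length)
```

With four fragments there are only 12 ordered pairs, but with hundreds of millions of reads the same pairwise scheme is infeasible, which is why practical assemblers rely on indexing strategies instead of naive comparison.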

RNA-seq analysis programs

Mapping short RNA-seq reads

One of the most basic tasks in RNA-seq analysis is the alignment of reads to either a reference transcriptome or genome. Alignment of reads is a classic problem in bioinformatics with several solutions specifically for EST mapping [8,9]. RNA-seq reads, however, pose particular challenges because they are short (~36–125 bases), error rates are considerable and many reads span exon-exon junctions. Additionally, the number of reads per experiment is increasingly large, currently as many as hundreds of millions. There are two major algorithmic approaches to map RNA-seq reads to a reference transcriptome. The first, to which we collectively refer as ‘unspliced read aligners’, align reads to a reference without allowing any large gaps. The unspliced read aligners fall into two main categories, ‘seed methods’ and ‘Burrows-Wheeler transform methods’. Seed methods [31–38] such as mapping and assembly with quality (MAQ) [33] and Stampy [35] find matches for short subsequences, termed ‘seeds’, assuming that at least

each approach and their application to RNA-seq analysis. We also discuss how these different methodologies can impact the results and interpretation of the data. Although we discuss each of the three categories as separate units, RNA-seq data analysis often requires using methods from all three categories. The methods described here are largely independent of the choice of library construction protocols, with the notable exception of ‘paired-end’ sequencing (reading from both ends of a fragment), which provides valuable information at all stages of RNA-seq analysis [28–30].

As a reference for the reader, we provide a list of currently available methods in each category (Table 1). To provide a general indication of the compute resources and tradeoffs of different methods, we selected a representative method from each category and applied it to a published RNA-seq dataset consisting of 58 million paired-end 76-base reads from mouse embryonic stem cell RNA [28] (Supplementary Table 1).
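The seed idea behind unspliced read aligners can be sketched as follows: index every short subsequence of the reference, look up the seeds of a read to get candidate positions, then check the whole read at each candidate. This is a simplified, assumed illustration; the function names, the non-overlapping seed layout and the mismatch-counting verification are choices made for the sketch and do not reproduce MAQ, Stampy or SHRiMP.

```python
from collections import defaultdict


def build_seed_index(reference, seed_len):
    """Record the start positions of every seed_len-mer in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, index, seed_len, max_mismatches=1):
    """Return sorted reference start positions where the read aligns ungapped."""
    hits = set()
    # Cut the read into non-overlapping seeds (assumption of this sketch).
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            # Verify the candidate by counting mismatches over the full read.
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


reference = "TTACGTACGGATCCATGCAAGT"
index = build_seed_index(reference, seed_len=4)
read = "AGGGATCCATGC"  # reference[6:18] with one substituted base
print(map_read(read, reference, index, seed_len=4))
```

The read still maps even though its first seed contains the error, because at least one other seed matches exactly; this is the assumption that seed methods depend on.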

Table 1 | Selected list of RNA-seq analysis programs

| Class | Category | Package | Notes | Uses | Input |
|---|---|---|---|---|---|
| Read mapping: unspliced aligners (a) | Seed methods | Short-read mapping package (SHRiMP) [41] | Smith-Waterman extension | Aligning reads to a reference transcriptome | Reads and reference transcriptome |
| | | Stampy [39] | Probabilistic model | | |
| | Burrows-Wheeler transform methods | Bowtie [43] | | | |
| | | BWA [44] | Incorporates quality scores | | |
| Read mapping: spliced aligners | Exon-first methods | MapSplice [52] | Works with multiple unspliced aligners | Aligning reads to a reference genome. Allows for the identification of novel splice junctions | Reads and reference genome |
| | | SpliceMap [50] | | | |
| | | TopHat [51] | Uses Bowtie alignments | | |
| | Seed-extend methods | GSNAP [53] | Can use SNP databases | | |
| | | QPALMA [54] | Smith-Waterman for large gaps | | |
| Transcriptome reconstruction: genome-guided reconstruction | Exon identification | G.Mor.Se | Assembles exons | Identifying novel transcripts using a known reference genome | Alignments to reference genome |
| | Genome-guided assembly | Scripture [28] | Reports all isoforms | | |
| | | Cufflinks [29] | Reports a minimal set of isoforms | | |
| Transcriptome reconstruction: genome-independent reconstruction | Genome-independent assembly | Velvet [61] | Reports all isoforms | Identifying novel genes and transcript isoforms without a known reference genome | Reads |
| | | TransABySS [56] | | | |
| Expression quantification | Gene quantification | Alexa-seq [47] | Quantifies using differentially included exons | Quantifying gene expression | Reads and transcript models |
| | | Enhanced read analysis of gene expression (ERANGE) [20] | Quantifies using union of exons | | |
| | | Normalization by expected uniquely mappable area (NEUMA) [82] | Quantifies using unique reads | | |
| | Isoform quantification | Cufflinks [29] | Maximum likelihood estimation of relative isoform expression | Quantifying transcript isoform expression levels | Read alignments to isoforms |
| | | MISO [33] | | | |
| | | RNA-seq by expectation maximization (RSEM) [69] | | | |
| | Differential expression | Cuffdiff [29] | Uses isoform levels in analysis | Identifying differentially expressed genes or transcript isoforms | Read alignments and transcript models |
| | | DegSeq [79] | Uses a normal distribution | | |
| | | EdgeR [77] | | | |
| | | Differential Expression analysis of count data (DESeq) [78] | | | |
| | | Myrna [75] | Cloud-based permutation method | | |

(a) This list is not meant to be exhaustive as many different programs are available for short-read alignment. Here we chose a representative set capturing the frequently used tools for RNA-seq or tools representing fundamentally different approaches.

Source: Nature Methods, Vol. 8, No. 6, June 2011, p. 470 (Review)

Comparison of tools




Challenges


•  Several sequencing technologies
•  Complex normalization
•  Difficulty in achieving mappability
•  Accurate detection of splice junctions
•  Proper summarization methods needed
•  Most challenging for novel genomes
•  Fewer algorithms exist for de novo assembly than for reference assembly.
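As one concrete instance of the normalization challenge listed above, the sketch below computes RPKM (reads per kilobase of transcript per million mapped reads), a widely used length-and-depth normalization. RPKM is not named on this slide and is chosen here only as an illustration; the function name and the toy counts are hypothetical, and the total number of mapped reads is assumed to be the sum of the supplied counts.

```python
def rpkm(read_counts, gene_lengths_bp):
    """Return RPKM per gene from raw read counts and gene lengths in bases.

    RPKM = count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    # Assumption: every mapped read is counted in exactly one listed gene.
    total_reads = sum(read_counts.values())
    if total_reads == 0:
        raise ValueError("no mapped reads to normalize against")
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total_reads)
        for gene, count in read_counts.items()
    }


# Toy data (hypothetical): two genes with equal counts but different lengths.
counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 1000, "geneB": 2000}
for gene, value in rpkm(counts, lengths).items():
    print(gene, value)
```

Both genes receive the same number of reads, yet the longer gene gets half the normalized value, which shows why raw counts cannot be compared directly across genes. Comparing across samples raises further issues that this simple scaling does not address.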

Summary


•  RNA-seq is used to study RNA content
•  More quantitative than microarrays
•  Can be used for studying different layers of transcription
•  Several factors must be considered in experimental design
•  Mapping, transcript assembly, summarization, differential expression and visualization are the major steps in RNA-seq
•  Gene ontology analysis, pathway analysis and integrative study, followed by systems biology, are possible subsequent steps for RNA-seq gene lists.
