RNA-seq experiments for bioinformaticians
DESCRIPTION
This presentation covers some quick facts about RNA-seq experiments and then short-read alignment methods.
TRANSCRIPT
RNA-seq experiment for Bioinformaticians
Ashis Kumer Biswas
BioMeCIS at CSE, UT Arlington
April 12, 2012
Ashis Kumer Biswas RNA-seq and Bioinformatics
Outline
1 Basics of RNA-seq technology
  Quick facts about RNA-seq
  RNA-seq steps
2 RNA-seq for Bioinformaticians
  Short-Read Alignments
Quick facts about RNA-seq
It is a massively parallel sequencing method for transcriptome analysis.
What is a transcriptome
The transcriptome T is the set of RNA molecules present in a cell or a population of cells.
What is RNA
Figure: The Cell [1]
Figure: DNA vs. RNA [1]
Figure: RNA secondary structure for the RNA sequence (5' end)-ACCCCCUCCUUCCUUGGAUCAAGGGGCUCAA-(3' end)
Types of RNA:
mRNA (messenger RNA): carries the code for one or more proteins from the DNA in the nucleus into the cytoplasm, where protein synthesis takes place at the ribosome.
tRNA (transfer RNA): brings amino acids to the ribosome, where the mRNA is translated into an amino acid sequence.
rRNA (ribosomal RNA): combines with some proteins to form a nucleoprotein complex, the ribosome, which serves as the site of protein synthesis and carries the necessary enzymes.
More types of RNA:
ncRNA (non-coding RNA): not translated into protein. Examples:
  tRNA
  rRNA
  snoRNA (small nucleolar RNA): guides chemical modifications of other RNAs.
  miRNA (microRNA): a post-transcriptional regulator of gene expression.
  siRNA (small interfering RNA): involved in the RNA interference pathway, which silences the expression of specific genes.
  piRNA (piwi-interacting RNA): forms RNA-protein complexes that regulate gene expression post-transcriptionally.
  lncRNA (long ncRNA): non-coding RNA longer than 200 nucleotides.
What is RNA
Roles of RNA in the “central dogma of molecular biology”:
What is a transcriptome
A genome does not change in a living cell except through mutation. A transcriptome, in contrast, varies with external environmental conditions, with the stage of the cell cycle, and with disease conditions.
What is a genome
The full set of DNA sequences of an organism is called its genome.
Humans have 23 pairs of chromosomes.
Why analyze the transcriptome?
The research field of transcriptomics deals with:
Examining expression profiles (i.e., expression levels) of mRNAs in a given cell population.
Interpreting the functional elements of the genome.
Revealing the molecular constituents of cells and tissues.
Understanding disease.
The transcriptome can be seen as a precursor of the proteome, i.e., the entire set of proteins expressed by a genome.
What is Massively Parallel Sequencing
This technique sequences from 1 million to several hundred million short reads (50-400 bases each) simultaneously from amplified DNA clones.
The technology emerged in late 1996 and has been commercially available since 2005.
Sequencing costs have decreased; the ultimate goal is $1000 per sequenced genome.
RNA-seq steps
Step 1
RNAs with a poly-A tail (i.e., a run of many adenine (A) bases) are isolated from the cytoplasm of the sample cells.
m-mRNA
Mature mRNA:
Step 2
The poly-A RNAs are reverse transcribed to produce double-stranded cDNA (complementary DNA).
Transcription
It is the process of producing single-stranded mRNA from a double-stranded DNA sequence.
Reverse Transcription
It is the opposite of transcription.
It is a way of recovering a gene sequence: the double-stranded DNA fragment from which the mRNA was transcribed.
The double-stranded DNA produced by reverse transcription is called cDNA (complementary DNA).
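The relationship between transcription and reverse transcription can be sketched on plain strings. This is a toy illustration only: the example sequence is made up, all strands are written 5' to 3', and the function names are hypothetical helpers rather than part of any library.

```python
# Toy model: sequences are plain strings written 5'->3'.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def transcribe(coding_strand):
    """mRNA carries the same sequence as the DNA coding strand, with U instead of T."""
    return coding_strand.replace("T", "U")

def reverse_transcribe(mrna):
    """Return (first_strand, second_strand) of the cDNA, both written 5'->3'."""
    second_strand = mrna.replace("U", "T")  # same sequence as the mRNA, in DNA letters
    # The first strand is synthesized against the mRNA: its reverse complement.
    first_strand = "".join(DNA_COMPLEMENT[b] for b in reversed(second_strand))
    return first_strand, second_strand

mrna = transcribe("ATGGCCTTAA")  # hypothetical coding-strand fragment
first_strand, second_strand = reverse_transcribe(mrna)
print(mrna, first_strand, second_strand)
```

The second cDNA strand reproduces the original coding-strand fragment, which is why sequencing cDNA tells us which RNA was present.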
Step 3
The cDNAs are randomly fragmented into pieces of 35-400 base pairs.
Step 4
Using massively parallel high-throughput sequencing machines (e.g., Illumina, SOLiD, Roche), the library of short cDNA fragments is sequenced.
Sequenced files
Suppose this is one short-read sequence:
The second section of the file contains the quality of each character of the sequence.
Phred quality score: $Q = -10 \log_{10} P$, where $P$ is the base-calling error probability estimated by the sequencing machine.
In other words, $P = 10^{-Q/10}$.
For example, if $Q = 30$, then $P = 10^{-30/10} = 10^{-3} = \frac{1}{1000}$.
So the base-call accuracy is $(1 - P) = (1 - \frac{1}{1000}) = 99.9\%$.
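The two formulas above translate directly into code. The sketch below is a minimal illustration; the function names are hypothetical, not taken from any sequencing toolkit.

```python
import math

def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q={q}: P={p:g}, accuracy={(1 - p) * 100:.2f}%")
```

Every 10 points of Q lowers the error probability by a factor of 10.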
Sequenced files
The range of Phred scores Q is [0, 93].
The Phred scores are converted to ASCII characters using a shift of 33 (ASCII code = Q + 33).
The ASCII codes therefore range over [33, 126], i.e., the characters from ! to ~.
Here are the scores after the conversion:
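The shift-by-33 encoding can be decoded in a few lines. The four-line record below is a made-up example in FASTQ layout, not the one shown on the original slide, and the helper names are hypothetical.

```python
PHRED_OFFSET = 33  # the shift described above: ASCII code = Q + 33

def decode_quality(quality_string):
    """Turn a quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]

def encode_quality(scores):
    """Inverse operation: integer Phred scores back to ASCII characters."""
    return "".join(chr(q + PHRED_OFFSET) for q in scores)

# Hypothetical record: identifier, bases, separator, per-base qualities.
record = ["@read_1", "GATTACA", "+", "II?5+!~"]
scores = decode_quality(record[3])
print(list(zip(record[1], scores)))
```

The characters ! and ~ decode to 0 and 93, the two ends of the score range.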
Step 5
Align the short-read sequences to exonic reference sequences.
Types of short-reads
Types of short-reads:
Step 6
Quantify the expression levels.
Units of measurements
For each transcript, the expression level is quantified using the metric RPKM.
RPKM: number of reads per kilobase of transcript per million mapped reads.
Suppose an RNA-seq experiment yields:
10 million short reads, of which only 8 million could be mapped to the reference genome.
Of those mapped reads, 1000 align to a transcript of length 1 kilobase.
So the RPKM score for that transcript is $\frac{1000}{1 \times 8} = 125$.
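The RPKM arithmetic above can be written as a small function. This is a sketch with hypothetical parameter names; it reproduces the worked example and is not the implementation used by any specific quantification tool.

```python
def rpkm(reads_on_transcript, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return reads_on_transcript / (kilobases * millions_mapped)

# Worked example from the slide: 1000 reads, 1 kb transcript, 8 million mapped reads.
print(rpkm(1000, 1000, 8_000_000))
```

Note that the denominator uses the 8 million mapped reads, not the 10 million reads that were sequenced.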
The number of RNA-seq reads generated from a transcript is proportional to that transcript's relative abundance in the sample and to its length.
Suppose a sample has 2 transcripts, A and B, both present at the same abundance.
If B is twice as long as A, an RNA-seq library will contain twice as many reads from B as from A.
So, in the RPKM calculation, the read counts are normalized by each transcript's length.
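The length effect can be checked with a toy model. The assumption here, labeled as such, is that reads are sampled uniformly along every copy of every transcript; the lengths, copy number, and sampling rate are invented for illustration.

```python
def expected_read_count(copies, length_bp, reads_per_copy_per_kb=50):
    """Expected reads from one transcript under uniform sampling (toy model)."""
    return copies * length_bp * reads_per_copy_per_kb // 1000

def reads_per_kilobase(read_count, length_bp):
    """Length normalization: the 'per kilobase' part of RPKM."""
    return read_count / (length_bp / 1000)

transcripts = {"A": 1000, "B": 2000}  # hypothetical lengths in bp
copies = 10                           # same abundance for both transcripts
for name, length_bp in transcripts.items():
    reads = expected_read_count(copies, length_bp)
    print(name, reads, reads_per_kilobase(reads, length_bp))
```

B collects twice the raw reads of A, but after dividing by length both transcripts get the same value, matching their equal abundance.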
RNA-seq steps
So, this is RNA-seq:
What we can get from an RNA-seq experiment
Quantify (count) the mRNA abundance.
Quantify the changes in the expression level of each transcript across the developmental stages of cells, or under different conditions.
RNA-seq for Bioinformaticians
Each RNA-seq experiment ("lane") produces more than 10 million short reads, which must be aligned to the reference genome.
Identifying non-coding RNAs.
And many more...
The Mapping Problem
Input: $m$ short reads $S_1, S_2, \ldots, S_m$, each $l$ bp (base pairs) long, and an approximate reference genome $R$.
Output: the positions $x_1, x_2, \ldots, x_m$ along $R$ where each short read matches.
In the human genome example:
$m$ is usually $10^7$ to $10^8$.
The length $l$ of each short read is typically 50-200 bp.
The length of the genome $|R|$ is $3 \times 10^9$ bp.
The Mapping Problem: Solution 1
Naive Algorithm:
Scan the reference genome $R$ for each short read $S_i$,
matching the read at each position $p$ and picking the best match.
Time complexity: $O(m\,l\,|R|)$.
For the human genome example: if $m = 10^8$, $l = 50$, $|R| = 3 \times 10^9$, then the operation count is $m\,l\,|R| = 1.5 \times 10^{19}$. This is clearly impractical.
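A minimal sketch of the naive scan follows. It assumes "best match" means the fewest mismatching bases (Hamming distance) and reports the leftmost such position; the reference string and reads are made up for illustration.

```python
def naive_map(read, reference):
    """Try every offset of `reference`; return (best_position, mismatches).

    Each read costs O(l * |R|) character comparisons.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:  # strict '<' keeps the leftmost best hit
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches

reference = "ACGTACGTTAGCCGATAG"  # toy reference, not real genome data
for read in ["TTAGC", "CGATA", "ACGTT"]:
    print(read, naive_map(read, reference))

# Rough operation count for the human-genome figures quoted above.
num_reads, read_len, genome_len = 10**8, 50, 3 * 10**9
print(f"{num_reads * read_len * genome_len:.1e} character comparisons")
```

Unlike the exact-matching methods that follow, this scan tolerates mismatches, which is the one thing it has going for it.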
The Mapping Problem: Solution 2
KMP (Knuth-Morris-Pratt) Algorithm:
Time complexity: $O(m(l + |R|)) = O(ml + m|R|)$.
For the human genome example: if $m = 10^8$, $l = 50$, $|R| = 3 \times 10^9$, then the operation count is $\approx 3 \times 10^{17}$. This is also impractical.
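For reference, here is a compact sketch of KMP exact matching for a single read. It is a textbook-style version written for this transcript, and it assumes a non-empty pattern. Preprocessing the read costs O(l) and the scan costs O(|R|), which gives the per-read bound quoted above.

```python
def build_failure(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(pattern, text):
    """Return all start positions of exact occurrences of a non-empty pattern in text."""
    fail = build_failure(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # reuse what has already been matched
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):        # full match ending at position i
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("ACGT", "ACGTACGTTAGCCGATAG"))
print(kmp_search("ANA", "BANANA"))
```

The text is never re-read after a mismatch, but the whole genome is still scanned once per read, which is what makes the total cost impractical.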
The Mapping Problem: Solution 3
Using Suffix Tree:
First build a suffix tree for $R$.
Once the tree is built, for each $S_i$ we can find matches efficiently by traversing the tree from the root.
Time complexity: $O(ml + |R|)$.
For the human genome example: if $m = 10^8$, $l = 50$, $|R| = 3 \times 10^9$, then the operation count is $ml + |R| = 8 \times 10^9$. This looks practical.
But storing the tree requires $O(|R| \log |R|)$ bits, i.e., ∼64 GB, which is impractical for most of today's desktop computers.
This approach also allows only exact matching.
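The "index once, query many times" idea can be sketched with a suffix array, used here as a simpler stand-in for the suffix tree (it is a different, more compact structure). Lookups in this sketch cost O(l log |R|) string comparisons rather than the O(l) of a tree walk, and the construction shown is a toy version that would not scale to a genome.

```python
def build_suffix_array(text):
    """Start positions of all suffixes in lexicographic order (toy construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_exact(read, text, suffix_array):
    """Binary-search the block of suffixes that start with `read`; return sorted positions."""
    n = len(read)
    lo, hi = 0, len(suffix_array)
    while lo < hi:                       # first suffix whose prefix is >= read
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + n] < read:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:                       # first suffix whose prefix is > read
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + n] <= read:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

text = "BANANA"
suffix_array = build_suffix_array(text)
print(suffix_array)
print(find_exact("ANA", text, suffix_array))
```

As with the suffix tree, only exact occurrences are found, and the index itself is much larger than the text it describes.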
The Mapping Problem: Solution 4
BWT (Burrows-Wheeler Transform):
It was invented by Burrows and Wheeler in 1994.
It is used in the data compression tool bzip2.
The transformation does not change the characters' values; it changes only their order.
Substrings that occur frequently in the original text tend to produce runs of the same character in the transformed text.
Such transformed text is easily compressed by other algorithms, e.g., the move-to-front transform or run-length encoding.
Input: SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output: TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
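A minimal sketch of the forward transform: sort all rotations of the text and read off the last column. This version appends an end marker `$`, which is an assumption of the sketch: the marker must not occur in the text and must sort before every other character. The quadratic construction is for illustration only.

```python
def bwt(text, end_marker="$"):
    """Burrows-Wheeler transform via sorted rotations (toy O(n^2 log n) version)."""
    assert end_marker not in text, "end marker must not occur in the text"
    s = text + end_marker
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("BANANA"))
```

The transformed text groups identical characters together (the two N's and the trailing A's). The slide's SIX.MIXED example has no end marker, so this sketch's output for that input would differ by the extra `$` and by how rotations are ordered.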
The BWT is reversible: the original text can be recovered from the transformed text alone.
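One way to see the reversibility is the sketch below. It relies on the fact that the k-th occurrence of a character in the last column and the k-th occurrence of the same character in the sorted first column refer to the same position of the text. It assumes the transformed text contains exactly one end marker `$` that sorts before all other characters.

```python
def inverse_bwt(last_column, end_marker="$"):
    """Recover the original text (without the end marker) from its BWT."""
    assert last_column.count(end_marker) == 1
    n = len(last_column)
    # Stable sort: order[j] is the index in the last column of the character
    # that sits at row j of the (sorted) first column.
    order = sorted(range(n), key=lambda i: last_column[i])
    row = order[0]                 # row 0 starts with the end marker
    chars = []
    for _ in range(n - 1):         # emit every character except the marker
        row = order[row]
        chars.append(last_column[row])
    return "".join(chars)

print(inverse_bwt("ANNB$AA"))
```

Here `ANNB$AA` is the transform of `BANANA` followed by `$`, and the function walks the text from its first character to its last.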
BWT (Burrows-Wheeler Transform): can we answer these questions now?
Is the letter "B" followed by "A", or vice versa?
Is the substring "ANA" present in the original text?
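Such questions can be answered from the transformed text alone with a backward search, the idea behind FM-index based aligners. The sketch below counts the occurrences of a pattern using only the last column. It assumes the text is `BANANA` followed by an end marker `$`, whose transform is `ANNB$AA`. The rank function is a slow linear scan; real indexes precompute it.

```python
from collections import Counter

def count_occurrences(pattern, last_column):
    """Count exact occurrences of `pattern` in the original text via backward search."""
    counts = Counter(last_column)
    # smaller[c] = number of characters in the text that sort before c
    smaller, total = {}, 0
    for ch in sorted(counts):
        smaller[ch] = total
        total += counts[ch]

    def rank(ch, i):
        """Occurrences of ch in last_column[:i] (linear scan, for clarity only)."""
        return last_column[:i].count(ch)

    lo, hi = 0, len(last_column)   # half-open range of rows whose prefix matches
    for ch in reversed(pattern):   # extend the match one character to the left
        if ch not in smaller:
            return 0
        lo = smaller[ch] + rank(ch, lo)
        hi = smaller[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

last_column = "ANNB$AA"  # BWT of "BANANA$"
for query in ("BA", "AB", "ANA"):
    print(query, count_occurrences(query, last_column))
```

A count of zero for "AB" and a non-zero count for "BA" answers the first question; the count for "ANA" answers the second.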
The Mapping Problem
The popular program TopHat [2] uses a BWT-based index (through the Bowtie aligner) to map short reads to the reference genome.
Storing the transformed text of the human genome requires only $2 \times 3 \times 10^9$ bits (2 bits per base), i.e., about 750 MB.
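The 2-bits-per-base figure comes from the four-letter DNA alphabet. The sketch below packs a sequence into an integer at 2 bits per base and repeats the size estimate. The code table is an arbitrary choice made for this example, and a practical index also stores auxiliary tables on top of the packed text.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # arbitrary 2-bit code per base
BASES = "ACGT"

def pack(seq):
    """Pack a DNA string into (integer, length) using 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value, len(seq)

def unpack(value, length):
    """Inverse of pack(): read the 2-bit codes back, most significant first."""
    return "".join(BASES[(value >> (2 * (length - 1 - i))) & 3] for i in range(length))

packed, length = pack("GATTACA")
print(packed.bit_length() <= 2 * length, unpack(packed, length))

genome_bits = 2 * 3 * 10**9
print(genome_bits // 8 // 10**6, "MB")
```

The length has to be stored alongside the packed value, since leading A's are encoded as zero bits.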
Questions?
Thanks!
References
[1] Introduction and basic molecular biology. [Online]. Available: http://compbio.pbworks.com/w/page/16252897/Introduction%20and%20Basic%20Molecular%20Biology
[2] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. Kelley, H. Pimentel, S. Salzberg, J. Rinn, and L. Pachter, "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks," Nature Protocols, vol. 7, no. 3, pp. 562–578, 2012.