RNA-seq experiments for bioinformaticians

RNA-seq experiment for Bioinformaticians Ashis Kumer Biswas BioMeCIS at CSE, UT Arlington April 12, 2012 Ashis Kumer Biswas RNA-seq and Bioinformatics

Upload: ashisbiswas

Post on 13-Apr-2015


DESCRIPTION

This presentation covers some quick facts about RNA-seq experiments and then short-read alignment methods.

TRANSCRIPT

Page 1: RNA-seq experiments for bioinformaticians

RNA-seq experiment for Bioinformaticians

Ashis Kumer Biswas

BioMeCIS at CSE, UT Arlington

April 12, 2012

Ashis Kumer Biswas RNA-seq and Bioinformatics

Page 2: RNA-seq experiments for bioinformaticians

Outline

1 Basics of RNA-seq technology
    Quick facts about RNA-seq
    RNA-seq steps

2 RNA-seq for Bioinformaticians
    Short-Read Alignments

Page 3: RNA-seq experiments for bioinformaticians

Quick facts about RNA-seq

It’s a massively parallel sequencing method for transcriptome analysis.

Page 7: RNA-seq experiments for bioinformaticians

What is a transcriptome

The transcriptome T is the set of all RNA molecules in a cell or a population of cells.

Page 10: RNA-seq experiments for bioinformaticians

What is RNA

Figure: The Cell[1]

Page 11: RNA-seq experiments for bioinformaticians

What is RNA

Figure: DNA vs. RNA[1]

Page 12: RNA-seq experiments for bioinformaticians

What is RNA

Figure: RNA secondary structure for the RNA sequence (5’ end)–ACCCCCUCCUUCCUUGGAUCAAGGGGCUCAA–(3’ end)

Page 14: RNA-seq experiments for bioinformaticians

What is RNA

Types of RNA:

mRNA (messenger RNA): carries the code for synthesizing one or more proteins from the DNA in the nucleus into the cytoplasm, where protein synthesis takes place on the ribosome.

tRNA (transfer RNA): brings amino acids to the ribosome, where the mRNA is translated into an amino acid sequence.

rRNA (ribosomal RNA): combines with proteins to form a nucleoprotein complex called the ribosome, which serves as the site of protein synthesis and carries the enzymes it requires.

Page 17: RNA-seq experiments for bioinformaticians

What is RNA

More types of RNA:

ncRNA (non-coding RNA): RNAs that are not translated into protein. Examples:

tRNA

rRNA

snoRNA (small nucleolar RNA): guides the chemical modification of other RNAs.

miRNA (microRNA): a post-transcriptional regulator of gene expression.

siRNA (small interfering RNA): involved in the RNA interference pathway (i.e., in the regulation of the expression of certain genes).

piRNA (piwi-interacting RNA): forms RNA-protein complexes that take part in post-transcriptional regulation of gene expression.

lncRNA (long non-coding RNA): a non-coding RNA longer than 200 nucleotides.

Page 25: RNA-seq experiments for bioinformaticians

What is RNA

Roles of RNA in the “central dogma of molecular biology”:

Page 26: RNA-seq experiments for bioinformaticians

What is a transcriptome

The transcriptome T is the set of all RNA molecules in a cell or a population of cells.

In contrast to the genome, which does not change in a living cell except through mutation, the transcriptome varies with external environmental conditions, with the stage of the cell cycle, and with disease conditions.

Page 28: RNA-seq experiments for bioinformaticians

What is a genome

The full set of DNA sequences of an organism is called its genome.

Humans have 23 pairs of chromosomes.

Page 32: RNA-seq experiments for bioinformaticians

Why analyze the transcriptome?

The research field of “transcriptomics” deals with:

Examining the expression profiles (i.e., expression levels) of mRNAs in a given cell population.

Interpreting the functional elements of the genome.

Revealing the molecular constituents of cells and tissues.

Understanding disease.

The transcriptome can be seen as a precursor of the proteome, i.e., the entire set of proteins expressed by a genome.

Page 37: RNA-seq experiments for bioinformaticians

What is Massively Parallel Sequencing

This technique sequences from 1 million to several hundred million short reads (50-400 bases each) simultaneously from amplified DNA clones.

The technology emerged in late 1996 and has been commercially available since 2005.

Sequencing costs have decreased; the ultimate goal is to sequence a genome for $1000.

Page 41: RNA-seq experiments for bioinformaticians

RNA-seq steps

Page 42: RNA-seq experiments for bioinformaticians

Step 1

RNAs with a poly-A tail (i.e., a run of many adenine (A) bases) are isolated from the cytoplasm of the sample cells.

Page 43: RNA-seq experiments for bioinformaticians

m-mRNA

Mature mRNA:

Page 44: RNA-seq experiments for bioinformaticians

Step 2

The poly-A RNAs are reverse transcribed to produce double-stranded cDNA (complementary DNA).

Page 48: RNA-seq experiments for bioinformaticians

Transcription

Transcription is the process of producing single-stranded mRNA from a double-stranded DNA sequence.
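At the sequence level, transcription can be sketched in a few lines of Python. This is only an illustration under one stated assumption: the input is the coding (sense) strand written 5' to 3', so the mRNA has the same sequence with thymine (T) replaced by uracil (U). The function name `transcribe` is hypothetical and does not come from the slides.

```python
def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand given 5'->3'.

    Assumption: the coding (sense) strand is supplied, so the mRNA is the
    same sequence with every T replaced by U.
    """
    dna = coding_strand.upper()
    if set(dna) - set("ACGT"):
        raise ValueError("expected only the bases A, C, G and T")
    return dna.replace("T", "U")


print(transcribe("ATGGCCTTA"))  # AUGGCCUUA
```

If the template strand were given instead, each base would have to be complemented as well; that case is left out to keep the sketch short.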

Page 49: RNA-seq experiments for bioinformaticians

Reverse Transcription

Reverse transcription is the opposite of transcription: it synthesizes DNA from an RNA template.

It is a way of recovering a gene sequence, i.e., the double-stranded DNA fragment from which the mRNA was transcribed.

The double-stranded DNA produced by reverse transcription is called cDNA (complementary DNA).
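A minimal sequence-level sketch of reverse transcription, assuming an idealized, error-free reaction: the first cDNA strand is the reverse complement of the mRNA, and the second strand has the same sense as the mRNA with U replaced by T. The name `reverse_transcribe` is illustrative only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_transcribe(mrna):
    """Return (first_strand, second_strand) of the cDNA, both written 5'->3'."""
    rna = mrna.upper()
    if set(rna) - set("ACGU"):
        raise ValueError("expected only the bases A, C, G and U")
    # Second strand: same sense as the mRNA, in the DNA alphabet.
    second_strand = rna.replace("U", "T")
    # First strand: reverse complement of the second strand.
    first_strand = "".join(COMPLEMENT[base] for base in reversed(second_strand))
    return first_strand, second_strand


print(reverse_transcribe("AUGGCC"))  # ('GGCCAT', 'ATGGCC')
```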

Page 53: RNA-seq experiments for bioinformaticians

Step 3

The cDNAs are randomly fragmented into pieces of 35-400 base pairs.
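The fragmentation step can be mimicked in software for testing purposes. The sketch below is a toy model, not a description of the laboratory protocol: it assumes fragment lengths are drawn uniformly from the 35-400 bp range and simply cuts the sequence left to right, so the last piece may be shorter than 35 bp. The function name, the seed parameter, and the example cDNA are all hypothetical.

```python
import random


def fragment(cdna, min_len=35, max_len=400, seed=None):
    """Cut cdna into consecutive pieces with lengths drawn from [min_len, max_len].

    The final piece holds whatever is left over and may be shorter than min_len.
    """
    rng = random.Random(seed)
    pieces = []
    pos = 0
    while pos < len(cdna):
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        pieces.append(cdna[pos:pos + size])
        pos += size
    return pieces


cdna = "ACGT" * 500  # hypothetical 2000-bp cDNA
pieces = fragment(cdna, seed=0)
print(len(pieces), [len(p) for p in pieces])
```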

Page 54: RNA-seq experiments for bioinformaticians

Step 4

The library of short cDNA fragments is sequenced on a massively parallel, high-throughput sequencing machine (e.g., Illumina, SOLiD, Roche).

Page 55: RNA-seq experiments for bioinformaticians

Sequenced files

Suppose this is one short-read sequence:

Page 56: RNA-seq experiments for bioinformaticians

Sequenced files

The second section of the file contains the quality of each character of the sequence.

The Phred quality score is $Q = -10 \log_{10} P$, where P is the base-calling error probability estimated by the sequencing machine.

In other words, $P = 10^{-Q/10}$.

For example, if $Q = 30$, then $P = 10^{-30/10} = 10^{-3} = \frac{1}{1000}$.

So the base-call accuracy would be $(1 - P) = \left(1 - \frac{1}{1000}\right) = 99.9\%$.
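The two Phred formulas above translate directly into code. The following sketch (the function names are illustrative) converts between Q and P and prints the accuracy for a few commonly quoted scores.

```python
import math


def phred_to_error_prob(q):
    """P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q={q}: P={p:g}, accuracy={100 * (1 - p):.2f}%")
```

Each step of 10 in Q lowers the error probability by a factor of ten.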

Page 60: RNA-seq experiments for bioinformaticians

Sequenced files

The range of Phred scores Q is [0, 93].

The Phred scores Q are converted to ASCII characters using a shift of 33 (ASCII code = Q + 33).

The resulting ASCII codes range over [33, 126], i.e., the characters from ! to ~.

Here are the scores after the conversion:
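The shift-by-33 conversion can be sketched as a pair of helper functions. The names are hypothetical, and the quality string in the example is made up for illustration; it is not the one shown on the slide.

```python
OFFSET = 33  # shift between a Phred score and its ASCII code


def encode_quality(scores):
    """Turn a list of Phred scores (0..93) into a quality string."""
    for q in scores:
        if not 0 <= q <= 93:
            raise ValueError(f"Phred score out of range: {q}")
    return "".join(chr(q + OFFSET) for q in scores)


def decode_quality(quality_string):
    """Turn a quality string back into a list of Phred scores."""
    return [ord(ch) - OFFSET for ch in quality_string]


print(encode_quality([0, 30, 40, 93]))  # !?I~
print(decode_quality("II?5!"))          # [40, 40, 30, 20, 0]
```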

Page 61: RNA-seq experiments for bioinformaticians

Step 5

Align the short-read sequences to exonic reference sequences.

Page 62: RNA-seq experiments for bioinformaticians

Types of short-reads

Types of short-reads:

Page 64: RNA-seq experiments for bioinformaticians

Step 6

Quantify the expression levels.

Page 65: RNA-seq experiments for bioinformaticians

Units of measurements

For each transcript, the expression level is quantified using a metric called RPKM.

RPKM: number of reads per kilobase of transcript per million mapped reads.

Suppose that from an RNA-seq experiment we have:

10 million short reads, of which only 8 million could be mapped to the reference genome.

Among those mapped reads, 1000 align to a transcript that is 1 kilobase long.

So the RPKM score for that transcript is $\frac{1000}{1 \times 8} = 125$.
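The RPKM arithmetic above can be written as a small function. This is a sketch with hypothetical parameter names; it reproduces the worked example (1000 reads, a 1-kilobase transcript, 8 million mapped reads).

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


print(rpkm(1000, 1000, 8_000_000))  # 125.0
```

Note that the denominator uses the 8 million mapped reads, not the 10 million reads that were sequenced.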

Page 70: RNA-seq experiments for bioinformaticians

Units of measurements

The number of RNA-seq reads generated from a transcript is proportional to that transcript’s relative abundance in the sample.

Suppose a sample has 2 transcripts, A and B, both of which are present at the same abundance.

If B is twice as long as A, an RNA-seq library will contain twice as many reads from B as from A, because the longer transcript yields more fragments.

So, in the RPKM calculation, the read counts are normalized by each transcript’s length.
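A small numeric illustration of why the length normalization matters. All numbers are invented for the example: transcript A is 1 kb with 500 reads, transcript B is 2 kb with 1000 reads, and 10 million reads were mapped in total.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count / ((transcript_length_bp / 1_000) * (total_mapped_reads / 1_000_000))


mapped = 10_000_000            # hypothetical number of mapped reads
rpkm_a = rpkm(500, 1_000, mapped)
rpkm_b = rpkm(1_000, 2_000, mapped)
print(rpkm_a, rpkm_b)          # 50.0 50.0
```

Although B has twice as many raw reads, both transcripts receive the same RPKM, which matches their equal abundance.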

Page 74: RNA-seq experiments for bioinformaticians

RNA-seq steps

So, this is RNA-seq:

Page 75: RNA-seq experiments for bioinformaticians

What we can get from an RNA-seq experiment

Quantify (count) mRNA abundance.

Quantify changes in the expression level of each transcript across the developmental stages of cells or under different conditions.

Page 77: RNA-seq experiments for bioinformaticians

RNA-seq for Bioinformaticians

Each RNA-seq experiment (“lane”) produces more than 10 million short reads, which must be aligned to the reference genome.

Identifying non-coding RNAs.

And many more...

Page 81: RNA-seq experiments for bioinformaticians

The Mapping Problem

Input: m short reads $S_1, S_2, \ldots, S_m$, each l bp (base pairs) long, and an approximate reference genome R.

Output: the positions $x_1, x_2, \ldots, x_m$ along R where each short read matches.

In the human genome example:

m is usually $10^7$ to $10^8$.

The length l of each short read is typically 50-200 bp.

The length of the genome |R| is $3 \times 10^9$ bp.

Page 82: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 1

Naive Algorithm:

Scan the reference genome R for each short read $S_i$,

matching the read at each position p and picking the best match.

Time complexity: $O(m\,l\,|R|)$.

For the human genome example: if $m = 10^8$, $l = 50$, and $|R| = 3 \times 10^9$, then the cost is about $10^8 \times 50 \times 3 \times 10^9 = 1.5 \times 10^{19}$ character comparisons. This is clearly impractical.
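The naive scan can be sketched as follows. The slide does not say how the "best match" is scored, so this example assumes the score is the number of mismatching characters (Hamming distance) and returns the leftmost position with the fewest mismatches. The reference string and reads are toy data.

```python
def naive_map(read, reference):
    """Return (position, mismatches) of the best ungapped placement of read.

    Tries every start position p in the reference and counts mismatches;
    ties are resolved in favour of the leftmost position.
    """
    best_pos, best_mismatches = -1, len(read) + 1
    for p in range(len(reference) - len(read) + 1):
        window = reference[p:p + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = p, mismatches
    return best_pos, best_mismatches


reference = "ACGTTGCATGCA"  # toy reference
for read in ("TTGC", "CATG", "GCAA"):
    print(read, naive_map(read, reference))
```

The two nested loops (over positions and over read characters) are where the $l \cdot |R|$ factor per read comes from.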

Page 83: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 2

KMP (Knuth-Morris-Pratt) Algorithm:

Time complexity: $O(m(l + |R|)) = O(ml + m|R|)$.

For the human genome example: if $m = 10^8$, $l = 50$, and $|R| = 3 \times 10^9$, then the cost is about $3 \times 10^{17}$. This is also impractical.
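For reference, here is a compact sketch of KMP exact matching. It preprocesses each read in O(l) time into a failure table and then scans the reference once in O(|R|) time, which is where the $l + |R|$ cost per read comes from. The function names are illustrative, and the example text is a toy string.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern, text):
    """Return the start positions of all exact occurrences of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return hits


print(kmp_search("ANA", "BANANA"))  # [1, 3]
```

Like the suffix-tree approach, this finds exact matches only.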

Page 84: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 3

Using Suffix Tree:

First build a suffix tree for R.

Once the tree is built, for each $S_i$ we can find matches efficiently by traversing the tree from the root.

Time complexity: $O(ml + |R|)$.

For the human genome example: if $m = 10^8$, $l = 50$, and $|R| = 3 \times 10^9$, then the cost is about $8 \times 10^9$. This looks practical.

But storing the tree requires $O(|R| \log |R|)$ bits, i.e., about 64 GB, which is impractical for most of today’s desktop computers.

This approach also allows only exact matching.
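To illustrate the idea of "traversing from the root", the sketch below builds an uncompressed suffix trie rather than a true suffix tree. This is a deliberate simplification: the trie takes quadratic time and memory, so it is usable only on toy inputs, whereas a real suffix tree compresses single-child paths and can be built in linear time. The query, however, works the same way: follow the read's characters from the root, at a cost of O(l) per read. The class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.starts = []    # start positions of the suffixes passing through


def build_suffix_trie(reference):
    """Insert every suffix of reference into a trie (toy-sized inputs only)."""
    root = TrieNode()
    for start in range(len(reference)):
        node = root
        for ch in reference[start:]:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
            node.starts.append(start)
    return root


def find(root, read):
    """Return all positions where read occurs exactly in the reference."""
    node = root
    for ch in read:
        node = node.children.get(ch)
        if node is None:
            return []
    return node.starts


trie = build_suffix_trie("BANANA")
print(find(trie, "ANA"))  # [1, 3]
print(find(trie, "NAB"))  # []
```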

Page 85: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform):

It was invented by Burrows and Wheeler in 1994.

It is used in the bzip2 data compression tool.

The transformation does not change the characters’ values; it only changes their order.

If the original text contains frequently repeated substrings, the transformed text will contain runs in which a single character repeats several times in a row.

Such transformed text is easily compressed by other algorithms, such as the move-to-front transform or run-length encoding.

Input: SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES

Output: TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
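A naive way to compute the transform is to sort all rotations of the text and read off the last column. The sketch below assumes an end marker `$` that does not occur in the text and sorts before every other character; that convention is a choice made here, and the SIX.MIXED... example above may follow a different one, so this code is not claimed to reproduce that exact output. Sorting full rotations is fine for short strings but far too slow and memory-hungry for a genome.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("BANANA"))  # ANNB$AA
```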

Page 86: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform) :

Page 91: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform) :

This shows the BWT is reversible!
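The reversal can also be demonstrated in code. The following sketch uses the simple (and slow) textbook method: repeatedly prepend the transformed text as a new first column and re-sort the rows, which rebuilds the sorted rotation table one column at a time. It assumes the transformed text contains a unique `$` end marker; the input `ANNB$AA` is the transform of `BANANA` followed by `$`.

```python
def inverse_bwt(transformed, sentinel="$"):
    """Recover the original text from its BWT (toy-sized inputs only)."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the BWT column to every row, then sort the rows.
        table = sorted(ch + row for ch, row in zip(transformed, table))
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]  # drop the end marker
    raise ValueError("sentinel not found in the transformed text")


print(inverse_bwt("ANNB$AA"))  # BANANA
```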

Page 92: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 4

BWT (Burrows-Wheeler Transform): Can we answer these questions now?

Is the letter “B” followed by “A” or vice versa?

Page 93: RNA-seq experiments for bioinformaticians

The Mapping Problem: Solution 4

Is the substring “ANA” present in the original text?
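Questions like these can be answered from the transformed text alone by "backward search", the core of BWT-based indexes. The sketch below is a simplified illustration: it assumes the text is `BANANA` with a `$` end marker (so the transform is `ANNB$AA`), and it counts occurrences by rescanning the string, whereas a real index would precompute those counts. The function and variable names are hypothetical.

```python
def backward_search(bwt_text, pattern):
    """Count occurrences of pattern in the original text, using only its BWT."""
    # first_rank[c]: number of characters in the text that sort before c.
    first_rank = {}
    for i, ch in enumerate(sorted(bwt_text)):
        first_rank.setdefault(ch, i)

    def occ(ch, i):
        # Occurrences of ch in bwt_text[:i]; a real index precomputes this.
        return bwt_text[:i].count(ch)

    lo, hi = 0, len(bwt_text)  # half-open range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first_rank:
            return 0
        lo = first_rank[ch] + occ(ch, lo)
        hi = first_rank[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


BWT_OF_BANANA = "ANNB$AA"  # transform of "BANANA" plus the "$" end marker
print(backward_search(BWT_OF_BANANA, "ANA"))  # 2
print(backward_search(BWT_OF_BANANA, "BA"))   # 1
print(backward_search(BWT_OF_BANANA, "AB"))   # 0
```

The pattern is processed from its last character to its first, and each step narrows the range of sorted rotations that begin with the suffix matched so far.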

Page 94: RNA-seq experiments for bioinformaticians

The Mapping Problem

The popular program TopHat [2] uses the BWT to map short reads to the reference genome.

To store the transformed text of the human genome, we need only $2 \times 3 \times 10^9$ bits (2 bits per base), i.e., about 750 MB.
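The 2-bits-per-base figure follows from the four-letter DNA alphabet. The sketch below shows one possible packing (the A=0, C=1, G=2, T=3 code assignment is an arbitrary choice made here) and repeats the storage arithmetic. It ignores the end marker and ambiguous bases such as N, and a practical index stores auxiliary tables in addition to the transformed text.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(sequence):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODES[base]
    return value


def unpack(value, length):
    """Recover a DNA string of the given length from its packed integer."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


print(pack("ACGT"))                       # 27, i.e. binary 00 01 10 11
print(unpack(pack("GATTACA"), 7))         # GATTACA

genome_bases = 3 * 10**9
print(2 * genome_bases / 8 / 10**6, "MB")  # 750.0 MB
```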

Page 95: RNA-seq experiments for bioinformaticians

Questions?

Page 96: RNA-seq experiments for bioinformaticians

Thanks!

Page 97: RNA-seq experiments for bioinformaticians

References

[1] Introduction and basic molecular biology. [Online]. Available: http://compbio.pbworks.com/w/page/16252897/Introduction%20and%20Basic%20Molecular%20Biology

[2] C. Trapnell, A. Roberts, L. Goff, G. Pertea, D. Kim, D. Kelley, H. Pimentel, S. Salzberg, J. Rinn, and L. Pachter, “Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks,” Nature Protocols, vol. 7, no. 3, pp. 562–578, 2012.