
Post on 18-Dec-2015


Page 1: RNA sequencing, transcriptome and expression quantification Henrik Lantz, BILS/SciLifeLab

RNA sequencing, transcriptome and expression quantification

Henrik Lantz, BILS/SciLifeLab

Page 2

Lecture synopsis

• What is RNA-seq?
• Basic concepts
• Mapping-based transcriptomics (genome-based)
• De novo based transcriptomics (genome-free)
• Expression counts and differential expression
• Transcript annotation

Page 3

RNA-seq

[Figure: from gene to protein. DNA: exons alternating with introns (each intron starts with GT and ends with AG), flanked by UTRs, with the ATG start codon and a TAG, TAA or TGA stop codon. Transcription gives the pre-mRNA. Splicing removes the introns and gives the mRNA with its poly-A tail (AAAAAAA). Translation follows.]

Page 4

Overview of RNA-Seq

From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html

Page 5

Common Data Formats for RNA-Seq

FASTA format:

>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT

FASTQ format:

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA

Quality values in increasing order: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

You might get the data in .sff or .bam format. FASTQ reads are easy to extract from both of these binary (compressed) formats!
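
The four-line FASTQ layout and the ordered quality characters can be turned into numbers with a few lines of Python. This is only a sketch: it assumes four lines per record and the Phred+33 encoding (the first character in the list above, "!", means quality 0). Older Illumina files used an offset of 64 instead. The function names are made up for illustration.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from 4-line FASTQ records."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, plus, qual = records[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at entry {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def phred_scores(quality_string, offset=33):
    """Convert quality characters to integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality_string]


# Toy record, not taken from the slides.
example = ["@read1\n", "ACGT\n", "+\n", "!5?I\n"]
for name, seq, qual in parse_fastq(example):
    print(name, seq, phred_scores(qual))
```

A score Q corresponds to an error probability of 10^(-Q/10), so a score of 30 means roughly one wrong base call in a thousand.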

Page 6

Paired-End

Page 7

Insert size

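[Figure: a paired-end DNA fragment with Read 1 and Read 2 sequenced from opposite ends and adapter + primer sequence on both sides. Labels: inner mate distance (the unsequenced stretch between the reads) and insert size (shown twice in the figure).]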
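
The quantities in the figure are related by simple arithmetic. A minimal sketch, assuming "insert size" means the length of the DNA fragment between the adapters (the term is used in more than one way, so check what your mapper expects). The function name is hypothetical.

```python
def inner_mate_distance(fragment_length, read1_length, read2_length):
    """Unsequenced gap between the two mates; negative if the reads overlap."""
    return fragment_length - read1_length - read2_length


print(inner_mate_distance(300, 50, 50))    # 300 bp fragment, 2 x 50 bp reads
print(inner_mate_distance(150, 100, 100))  # reads longer than half the fragment overlap
```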

Page 8

Paired-end gives you two files

FASTQ format (old):

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA

@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad

New:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Example:

@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
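
The newer header layout can be split into its named parts mechanically. A sketch only: the field names follow the template above, while the class and function names are invented here. The last field is kept as text because it may not always be a plain number (an assumption).

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int            # 1 or 2 for the two mates of a pair
    is_filtered: bool    # "Y" means the read was flagged by the filter
    control_number: int
    sample_number: str


def parse_header(line):
    """Parse '@instrument:run:flowcell:lane:tile:x:y read:filtered:control:sample'."""
    location, _, description = line.strip().lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, filtered, control, sample = description.split(":")
    return ReadHeader(instrument, int(run), flowcell, int(lane), int(tile),
                      int(x_pos), int(y_pos), int(read), filtered == "Y",
                      int(control), sample)


header = parse_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(header.flowcell_id, header.lane, header.read, header.is_filtered)
```

In the old format, the mate is identified by the /1 or /2 suffix of the read name instead.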

Page 9

Transcript Reconstruction from RNA-Seq Reads

Nature Biotech, 2010

Page 10

Transcript Reconstruction from RNA-Seq Reads

TopHat

Page 11

Transcript Reconstruction from RNA-Seq Reads

Cufflinks

TopHat

Page 12

Transcript Reconstruction from RNA-Seq Reads

Trinity

GMAP

Cufflinks

TopHat

The Tuxedo Suite: End-to-end Genome-based RNA-Seq Analysis Software Package

Page 13

Transcript Reconstruction from RNA-Seq Reads

Trinity

Cufflinks

TopHat

Page 14

Transcript Reconstruction from RNA-Seq Reads

Trinity

GMAP

Cufflinks

TopHat

Page 15

Transcript Reconstruction from RNA-Seq Reads

Trinity

GMAP

End-to-end Transcriptome-based RNA-Seq Analysis Software Package

Page 16

Basic concepts of mapping-based RNA-seq - Spliced reads

[Figure: from gene to protein. DNA: exons alternating with introns (each intron starts with GT and ends with AG), flanked by UTRs, with the ATG start codon and a TAG, TAA or TGA stop codon. Transcription gives the pre-mRNA. Splicing removes the introns and gives the mRNA with its poly-A tail (AAAAAAA). Translation follows.]

Page 17

RNA-seq - Spliced reads

Page 18

Pre-mRNA

[Figure: from gene to protein, with the pre-mRNA stage highlighted. DNA: exons alternating with introns (GT at the start of each intron), flanked by UTRs, with the ATG start codon and a TAG, TAA or TGA stop codon. Transcription gives the pre-mRNA, splicing gives the mRNA, and translation follows.]

Page 19

Pre-mRNA

Page 20

Pre-mRNA

Page 21

Stranded RNA-seq

Page 22

Overview of the Tuxedo Software Suite

Bowtie (fast short-read alignment)

TopHat (spliced short-read alignment)

Cufflinks (transcript reconstruction from alignments)

Cuffdiff (differential expression analysis)

CummeRbund (visualization & analysis)

Page 23

Slide courtesy of Cole Trapnell

Page 24

Tophat-mapped reads

Page 25

Field  Value
0      61G9EAAXX100520:5:100:10095:16477
1      83
2      chr1
3      51986
4      38
5      46M
6      =
7      51789
8      -264
9      CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA
10     ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG
11     MD:Z:67
12     NH:i:1
13     HI:i:1
14     NM:i:0
15     SM:i:38
16     XQ:i:40
17     X2:i:0

Alignments are reported in a compact representation: SAM format

SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

Page 26

Alignments are reported in a compact representation: SAM format

Field  Value
0      61G9EAAXX100520:5:100:10095:16477  (read name)
1      83  (FLAGS stored as bit fields; 83 = 00001010011)
2      chr1  (alignment target)
3      51986  (position alignment starts)
4      38
5      46M  (compact description of the alignment in CIGAR format)
6      =
7      51789
8      -264
9      CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA  (read sequence, oriented according to the forward alignment)
10     ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG  (base quality values)
11-17  MD:Z:67 NH:i:1 HI:i:1 NM:i:0 SM:i:38 XQ:i:40 X2:i:0  (metadata)

SAM format specification: http://samtools.sourceforge.net/SAM1.pdf
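
The FLAG value packs several yes/no answers into one integer. A short sketch of unpacking it; the bit meanings are those of the SAM specification, while the wording of the labels and the function name are my own.

```python
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM FLAG."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


print(format(83, "011b"))  # the bit pattern shown on the slide
for label in decode_flag(83):
    print(" ", label)
```

For the read above, 83 = 64 + 16 + 2 + 1: it is the first mate of a properly mapped pair and aligns to the reverse strand.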

Page 27


Still not compact enough… Millions to billions of reads take up a lot of space!

Convert SAM to binary: BAM format.

SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

Page 28

Samtools

• Tools for
– converting SAM <-> BAM
– viewing BAM files (e.g. samtools view file.bam | less)
– sorting BAM files, and lots more

Page 29

There is also CRAM…

CRAM compression rate

File format              File size (GB)
SAM                      7.4
BAM                      1.9
CRAM lossless            1.4
CRAM 8 bins              0.8
CRAM no quality scores   0.26

Page 30

Visualizing Alignments of RNA-Seq reads

Page 31

Text-based Alignment Viewer

% samtools tview alignments.bam target.fasta

Page 32

IGV

Page 33

IGV: Viewing Tophat Alignments

Page 34

Transcript Reconstruction Using Cufflinks

From Martin & Wang, Nature Reviews Genetics, 2011

Page 35

Transcript Reconstruction Using Cufflinks

From Martin & Wang, Nature Reviews Genetics, 2011

Page 36

From Martin & Wang, Nature Reviews Genetics, 2011

Transcript Reconstruction Using Cufflinks

Page 37

GFF file format

Page 38

GFF3 file format

Seqid source type start end score strand phase attributes

Chr1 Snap gene 234 3657 . + . ID=gene1; Name=Snap1;

Chr1 Snap mRNA 234 3657 . + . ID=gene1.m1; Parent=gene1;

Chr1 Snap exon 234 1543 . + . ID=gene1.m1.exon1; Parent=gene1.m1;

Chr1 Snap CDS 577 1543 . + 0 ID=gene1.m1.CDS1; Parent=gene1.m1;

Chr1 Snap exon 1822 2674 . + . ID=gene1.m1.exon2; Parent=gene1.m1;

Chr1 Snap CDS 1822 2674 . + 2 ID=gene1.m1.CDS2; Parent=gene1.m1;

Other feature types: start_codon, stop_codon

Other attributes: Alias, note, ontology_term …
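
The nine GFF3 columns and the key=value attributes are easy to pick apart in code. A minimal sketch, assuming tab-separated columns as in a real GFF3 file (the slide shows them with spaces); the function name is hypothetical.

```python
def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine tab-separated columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, got {len(columns)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = columns
    attributes = {}
    for pair in attr_text.split(";"):
        key, separator, value = pair.strip().partition("=")
        if separator:
            attributes[key] = value
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),  # 1-based, both ends included
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


line = "Chr1\tSnap\tCDS\t577\t1543\t.\t+\t0\tID=gene1.m1.CDS1; Parent=gene1.m1;"
feature = parse_gff3_line(line)
length = feature["end"] - feature["start"] + 1
print(feature["type"], feature["attributes"]["Parent"], length)
```

The Parent attribute is what ties each exon and CDS to its mRNA, and each mRNA to its gene.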

Page 39

GTF file format

Page 40

GTF file format

Seqid source type start end score strand phase attributes

Chr1 Snap exon 234 1543 . + . gene_id "gene1"; transcript_id "transcript1";

Chr1 Snap CDS 577 1543 . + 0 gene_id "gene1"; transcript_id "transcript1";

Chr1 Snap exon 1822 2674 . + . gene_id "gene1"; transcript_id "transcript1";

Chr1 Snap CDS 1822 2674 . + 2 gene_id "gene1"; transcript_id "transcript1";

start_codon

stop_codon
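
In GTF there is no Parent attribute: features are grouped through gene_id and transcript_id. A sketch that sums exon lengths per transcript, which is the spliced length needed later for length-normalized expression values. It assumes tab-separated columns and straight double quotes around attribute values; the function name is made up.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"')


def exon_length_per_transcript(gtf_lines):
    """Return {transcript_id: total exon length in bp}."""
    lengths = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != "exon":
            continue
        attributes = dict(ATTRIBUTE.findall(columns[8]))
        start, end = int(columns[3]), int(columns[4])
        lengths[attributes["transcript_id"]] += end - start + 1
    return dict(lengths)


gtf_lines = [
    'Chr1\tSnap\texon\t234\t1543\t.\t+\t.\tgene_id "gene1"; transcript_id "transcript1";',
    'Chr1\tSnap\tCDS\t577\t1543\t.\t+\t0\tgene_id "gene1"; transcript_id "transcript1";',
    'Chr1\tSnap\texon\t1822\t2674\t.\t+\t.\tgene_id "gene1"; transcript_id "transcript1";',
    'Chr1\tSnap\tCDS\t1822\t2674\t.\t+\t2\tgene_id "gene1"; transcript_id "transcript1";',
]
print(exon_length_per_transcript(gtf_lines))
```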

Page 41

Transcript Reconstruction from RNA-Seq Reads

Trinity

GMAP

Cufflinks

TopHat

The Tuxedo Suite: End-to-end Genome-based RNA-Seq Analysis Software Package

Page 42

Transcript Reconstruction from RNA-Seq Reads

Trinity

GMAP

End-to-end Transcriptome-based RNA-Seq Analysis Software Package

Page 43

De novo transcriptome assembly

No genome required

Empower studies of non-model organisms
– expressed gene content
– transcript abundance
– differential expression

Page 44

The General Approach to De novo RNA-Seq Assembly

Using De Bruijn Graphs

Page 45

Sequence Assembly via De Bruijn Graphs

From Martin & Wang, Nat. Rev. Genet. 2011

Page 46

From Martin & Wang, Nat. Rev. Genet. 2011

Page 47

From Martin & Wang, Nat. Rev. Genet. 2011
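
The idea behind de Bruijn graph assembly fits in a toy example: cut the reads into k-mers, connect each k-mer's prefix to its suffix, and read a sequence back by walking the graph. This is a bare-bones illustration with invented reads and function names. It ignores sequencing errors, coverage, reverse complements and branch resolution, all of which real assemblers have to handle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer in a read adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend from start while the current node has exactly one successor."""
    path, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle
            break
        seen.add(node)
        path += node[-1]
    return path


# Three overlapping toy reads drawn from one made-up transcript.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
print(walk_unbranched(graph, "ATG"))
```

In a transcriptome graph, a node with two successors often marks an alternative splicing event rather than an error or a repeat.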

Page 48

Contrasting Genome and Transcriptome Assembly

Genome Assembly
• Uniform coverage
• Single contig per locus
• Double-stranded

Transcriptome Assembly
• Exponentially distributed coverage levels
• Multiple contigs per locus (alt splicing)
• Strand-specific

Page 49

Trinity Aggregates Isolated Transcript Graphs

Genome Assembly: Single Massive Graph. Entire chromosomes represented.

Trinity Transcriptome Assembly: Many Thousands of Small Graphs. Ideally, one graph per expressed gene.

Page 50

Trinity – How it works:

RNA-Seq reads → Linear contigs → de Bruijn graphs (thousands of disjoint graphs) → Transcripts + Isoforms

Page 51

Trinity output: A multi-fasta file
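
A multi-FASTA file is a series of ">" header lines, each followed by one or more sequence lines. A small reader sketch; the record names in the example are invented and do not reflect Trinity's actual naming scheme.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from multi-FASTA text lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical records; sequences may be wrapped over several lines.
example_fasta = [">transcript_1 len=8", "ACGT", "ACGT", ">transcript_2 len=3", "GGA"]
for name, sequence in read_fasta(example_fasta):
    print(name, len(sequence))
```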

Page 52

Can align Trinity transcripts to genome scaffolds to examine intron/exon structures (Trinity transcripts aligned using GMAP)

Page 53

An alternative: Pacific Biosciences (PacBio)

• Pros: Long reads (average 4.5 kbp), can give you full length transcripts in one read

• Cons: High error rate on longer fragments (15%), expensive

Page 54

Abundance Estimation (a.k.a. Computing Expression Values)

Page 55

Slide courtesy of Cole Trapnell

[Figure: vertical axis labeled "Expression Value"]

Page 56

Slide courtesy of Cole Trapnell

[Figure: vertical axis labeled "Expression Value"]

Page 57

Normalized Expression Values

• Transcript-mapped read counts are normalized for both length of the transcript and total depth of sequencing.

• Reported as FPKM: number of RNA-Seq Fragments Per Kilobase of transcript per total Million fragments mapped.
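
Written out, the definition is FPKM = fragments × 10^9 / (transcript length in bp × total mapped fragments). A minimal sketch with made-up numbers; the function name and the example values are illustrative only.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / kilobases / millions


# 500 fragments on a 2 kb transcript in a library of 10 million mapped fragments.
value = fpkm(500, 2_000, 10_000_000)
print(value)
```

The same 500 fragments on a transcript twice as long, or in a library twice as deep, give half the FPKM.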

Page 58

Differential Expression Analysis Using RNA-Seq

Page 59

Differential expression

Genome

Mapped reads - condition 1

Mapped reads - condition 2

Page 60

Diff. Expression Analysis Involves
• Counting reads
• Statistical significance testing

         Sample_A   Sample_B   Fold_Change   Significant?
Gene A   1          2          2-fold        No
Gene B   100        200        2-fold        Yes

Page 61

Beware of concluding fold change from small numbers of counts

From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for

Poisson distributions for counts based on 2-fold expression differences

No confidence in 2-fold difference. Likely observed by chance.

High confidence in 2-fold difference. Unlikely observed by chance.
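
One way to put a number on this: if counts follow Poisson distributions whose means differ 2-fold, how often would a single observation from the higher-expressed condition come out no larger than one from the lower? A sketch under that Poisson assumption only. Real RNA-seq counts are usually more variable than Poisson, and the function names are invented.

```python
import math


def poisson_pmf(k, mean):
    """P(X = k) for a Poisson variable, computed in log space for stability."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def prob_order_reversed(low_mean, high_mean):
    """P(Y <= X) for independent X ~ Poisson(low_mean), Y ~ Poisson(high_mean)."""
    upper = int(high_mean + 10 * math.sqrt(high_mean) + 20)  # beyond this the tails are negligible
    total, cdf_high = 0.0, 0.0
    for k in range(upper + 1):
        cdf_high += poisson_pmf(k, high_mean)          # running P(Y <= k)
        total += poisson_pmf(k, low_mean) * cdf_high   # P(X = k) * P(Y <= k)
    return total


small = prob_order_reversed(1, 2)      # expected counts 1 vs 2
large = prob_order_reversed(100, 200)  # expected counts 100 vs 200
print(f"{small:.3f} {large:.2e}")
```

With expected counts of 1 and 2 the two distributions overlap heavily, so the order is often reversed by chance. With 100 and 200 they barely overlap at all.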

Page 62

More Counts = More Statistical Power

Example: 5000 total reads per sample. Observed 2-fold differences in read counts.

        Sample A   Sample B   Fisher’s Exact Test (P-value)
geneA   1          2          1.00
geneB   10         20         0.098
geneC   100        200        < 0.001
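
The P-values above can be checked with a two-sided Fisher's exact test written from the hypergeometric distribution. A sketch using only the standard library. It assumes each gene is tested as a 2 x 2 table (reads from the gene vs. all other reads, Sample A vs. Sample B) with 5000 reads per sample; the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(count_a, count_b, total_a, total_b):
    """Two-sided P-value: sum of all table probabilities no larger than the observed one."""
    gene_reads = count_a + count_b
    all_reads = total_a + total_b
    denominator = comb(all_reads, total_a)

    def prob(x):
        # Probability that x of the gene's reads fall in sample A (hypergeometric).
        return comb(gene_reads, x) * comb(all_reads - gene_reads, total_a - x) / denominator

    observed = prob(count_a)
    lowest = max(0, gene_reads - total_b)
    highest = min(gene_reads, total_a)
    p_value = sum(p for p in map(prob, range(lowest, highest + 1))
                  if p <= observed * (1 + 1e-7))  # small tolerance for rounding
    return min(1.0, p_value)


p_values = {
    gene: fisher_exact_two_sided(a, b, 5000, 5000)
    for gene, (a, b) in {"geneA": (1, 2), "geneB": (10, 20), "geneC": (100, 200)}.items()
}
for gene, p in p_values.items():
    print(gene, f"{p:.3g}")
```

The fold change is identical for all three genes; only the number of reads behind it changes the P-value.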

Page 63

Tools for DE analysis with RNA-Seq

See: http://www.biomedcentral.com/1471-2105/14/91

• ShrinkSeq
• NoiSeq
• baySeq
• Vsf
• Voom
• SAMseq
• TSPM
• DESeq
• EBSeq
• NBPSeq
• edgeR

+ others (not R-based), including CuffDiff

Page 64

Use of transcripts

• Transcripts can be assembled de novo or from mapped reads and then used in gene expression/differential expression studies

• Can be functionally annotated

Page 65

Functional annotation

• Take transcripts from Cufflinks or Trinity
• Annotate the sequences functionally in Blast2GO

Page 66

Blast2GO

Page 67

KEGG-mapping