Part 3 of RNA-seq for DE analysis: the mapping process

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

RNA-seq for DE analysis training: Mapping to assign reads to genes. Joachim Jacob, 22 and 24 January 2014.

Upload: joachim-jacob

Post on 10-May-2015

Category:

Science


DESCRIPTION

Third part of the training session 'RNA-seq for Differential expression analysis'. We explain the details of the mapping process of RNA-seq data that allow us to detect differential expression. Interested in following this session? Please contact http://www.jakonix.be/contact.html

TRANSCRIPT

Page 1: Part 3 of RNA-seq for DE analysis: the mapping process

This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.

RNA-seq for DE analysis training

Mapping to assign reads to genes

Joachim Jacob

22 and 24 January 2014

Page 2: Part 3 of RNA-seq for DE analysis: the mapping process

2 of 44

Goal of mapping

We want to assign reads to genes they were derived from.

The result of the mapping will be used to construct a summary of the counts: the count table.

GeneA: 12

GeneB: 5

...

1. Read

2. Mapping

3. Goal

Genome sequence
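The count table can be pictured as a plain tally of reads per gene. The sketch below assumes the mapping step has already produced (read, gene) pairs; the function name and the example data are hypothetical and serve only as illustration.

```python
from collections import Counter


def build_count_table(assignments):
    """Count how many reads were assigned to each gene.

    `assignments` is an iterable of (read_id, gene_id) pairs; gene_id is
    None when a read could not be assigned to any gene.
    """
    counts = Counter()
    for _read_id, gene_id in assignments:
        if gene_id is not None:
            counts[gene_id] += 1
    return dict(counts)


# Hypothetical mapping result, for illustration only.
example_assignments = [
    ("read1", "GeneA"),
    ("read2", "GeneB"),
    ("read3", "GeneA"),
    ("read4", None),  # unassigned read, not counted
]
count_table = build_count_table(example_assignments)
for gene, count in sorted(count_table.items()):
    print(f"{gene}: {count}")
```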

Page 3: Part 3 of RNA-seq for DE analysis: the mapping process

2 main scenarios in RNA-seq

Reference genome sequence available

NO reference genome sequence available

● De novo assembly of the reads (Trinity) (transcriptome construction)

● Map the reads to the assembly (RSEM mapper)

● Extract count table

(Note: no removal of polyA is required. Computationally expensive!)

Page 4: Part 3 of RNA-seq for DE analysis: the mapping process

Reads mapped to reference genome

The mapping algorithm takes reads and aligns them to the right location in the reference sequence. Some key issues with this process:

1. The reference is a single haploid sequence, whereas the sample carries a mixture of alleles: this leads to mismatches.

2. Reads contain sequencing errors

3. Reads derived from mRNA, genome is DNA.

35 for 2 alleles→together

If we compare samples within the same specimen, this effect is similar for all samples.

Page 5: Part 3 of RNA-seq for DE analysis: the mapping process

mRNA reads: some reads span introns

● We need gapped alignment algorithms

Many reads span introns: they need to be aligned with gaps. This can be used to detect intron-exon junctions.

[Figure: mRNA reads aligned to a gene with its exons and introns marked. One isoform! Source: http://www.ensembl.org]
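In SAM output, a gapped alignment is recorded in the CIGAR string, where the `N` operation marks a skipped region of the reference (for RNA-seq, an intron). The sketch below derives intron coordinates from a mapping position and a CIGAR string; the function name and the example values are made up for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return (intron_start, intron_end) pairs for every N operation.

    `pos` is the 1-based leftmost mapping position (SAM POS field);
    the returned coordinates are 1-based and inclusive.
    """
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return junctions


# A 50 bp read: 30 bases in one exon, a 500 bp intron, 20 bases in the next exon.
example = splice_junctions(100, "30M500N20M")
print(example)
```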

Page 6: Part 3 of RNA-seq for DE analysis: the mapping process

Isoforms complicate things

● Isoforms are transcribed at different levels, contributing differently to the number of reads.

http://www.ensembl.org

Page 7: Part 3 of RNA-seq for DE analysis: the mapping process

Gapped read mapping (1)

● Exon-first approach: TopHat (popular)

TopHat: discovering splice junctions with RNA-Seq. Vol. 25 no. 9 2009, pages 1105–1111. doi:10.1093/bioinformatics/btp120

A junction database is constructed from the reads that map to exons, and the initially unmapped reads are then mapped against it.
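The exon-first idea can be illustrated with a toy example. This is a strong simplification and not TopHat's implementation: it uses exact string matching only, and the exon coordinates are given up front rather than inferred from the first-pass alignments. All names and sequences are invented.

```python
def exon_first_map(genome, exons, reads, flank):
    """Toy exon-first mapping with exact matches.

    `exons` is a sorted list of (start, end) pairs, 0-based, end exclusive.
    Returns a dict: read -> ("unspliced", position) or ("spliced", (donor, acceptor)).
    """
    mapped, unmapped = {}, []
    # Pass 1: try to place every read on the genome without gaps.
    for read in reads:
        pos = genome.find(read)
        if pos >= 0:
            mapped[read] = ("unspliced", pos)
        else:
            unmapped.append(read)
    # Junction database: end of one exon joined to the start of the next.
    junctions = {}
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        left = genome[max(s1, e1 - flank):e1]
        right = genome[s2:min(e2, s2 + flank)]
        junctions[(e1, s2)] = left + right
    # Pass 2: try the leftover reads against the junction sequences.
    for read in unmapped:
        for junction, sequence in junctions.items():
            if read in sequence:
                mapped[read] = ("spliced", junction)
                break
    return mapped


exon1, intron, exon2 = "ATGGCCAAGT", "GTCCCCCCAG", "TTCGGATCCA"
genome = exon1 + intron + exon2
result = exon_first_map(
    genome,
    exons=[(0, 10), (20, 30)],
    reads=["GGCCAAG", "CAAGTTTCGG"],
    flank=6,
)
print(result)
```

The second read does not occur anywhere in the genome sequence as one piece; it is only found once the two exon ends are joined.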

Page 8: Part 3 of RNA-seq for DE analysis: the mapping process

Gapped read mapping (2)

● STAR: fast and suited for longer reads

STAR: ultrafast universal RNA-seq aligner. Alexander Dobin et al. Bioinformatics

Page 9: Part 3 of RNA-seq for DE analysis: the mapping process

What you need for mapping

1. A reference genome sequence (fasta), to be indexed by the mapper.

2. A genome annotation file (GFF3 or GTF), with indication of currently known annotations (you can sometimes do without: you can let the mapper detect new genes)

3. The cleaned (preprocessed) reads (fastq)

Page 10: Part 3 of RNA-seq for DE analysis: the mapping process

Getting your reference genome sequence

● Genomes to be used by TopHat can be fetched from iGenomes and for STAR here

● If your genome is not listed above, check http://ensembl.org and http://ensemblgenomes.org, and follow the instructions of the indexing software.

● If still no luck, try a specialized species website, e.g.

Page 11: Part 3 of RNA-seq for DE analysis: the mapping process

Indexing a genome

● Mapping reads is fairly fast because the heavy lifting is done beforehand: the reference genome sequence is indexed once, which takes a lot of time, and the index is then reused for every mapping run.

● On Galaxy, the indexing has already been performed for you. Just choose your genome from the list.
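Why indexing pays off can be seen in a small sketch. Real mappers use more compact structures (for example an FM-index or suffix arrays); the k-mer hash table below is a swapped-in simplification chosen for readability, with invented names and a toy sequence.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Slow step, done once: record where every k-mer of the reference starts (0-based)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(index, reference, read, k):
    """Fast step, done per read: look up the first k bases, then verify the whole read."""
    candidates = index.get(read[:k], [])
    return [p for p in candidates if reference[p:p + len(read)] == read]


reference = "ACGTACGTTAGC"  # toy reference sequence
index = build_kmer_index(reference, k=4)
hits = map_read(index, reference, "ACGTT", k=4)
print(hits)
```

The seed `ACGT` occurs twice in the toy reference, but only one of the two candidate positions survives verification of the full read.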

Page 12: Part 3 of RNA-seq for DE analysis: the mapping process

Getting genome annotation information

● Annotation info is stored in text files formatted as GTF or GFF3 files.

● If sequencing is deep enough, the complete transcriptome structure can be derived from the mapping: splice junctions, isoforms, variants, ... Cufflinks, for example, reconstructs the annotation from an alignment and generates a GFF file that can be used downstream. Potentially novel transcripts are included in this file. But remember, this is NOT OUR GOAL.

● We will use a GTF file from a respected genome database to assist the mapping of reads.

http://cufflinks.cbcb.umd.edu/

Page 13: Part 3 of RNA-seq for DE analysis: the mapping process

Using genome annotation information

Page 14: Part 3 of RNA-seq for DE analysis: the mapping process

GTF example

TIP: You can look at an example in Galaxy!
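For orientation, a GTF line has nine tab-separated columns, the last of which holds `key "value";` attribute pairs. The sketch below parses one such line; the example record (chromosome, source name, gene and transcript identifiers) is invented, and the parser assumes attribute values do not themselves contain semicolons.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]


def parse_gtf_line(line):
    """Split one GTF line into a dict; the attribute column becomes a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("a GTF line has 9 tab-separated columns")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based, inclusive
    record["end"] = int(record["end"])      # 1-based, inclusive
    attributes = {}
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical exon record, for illustration only.
example_line = "\t".join([
    "chr1", "example_db", "exon", "1000", "1200", ".", "+", ".",
    'gene_id "GeneA"; transcript_id "GeneA.1";',
])
exon = parse_gtf_line(example_line)
print(exon["feature"], exon["start"], exon["end"], exon["attributes"]["gene_id"])
```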

Page 15: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping in Galaxy

Basic settings

Page 16: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping in Galaxy

Page 17: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping in Galaxy

Page 18: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping tip

● PRO TIP: align a subsample of reads in Galaxy. Play with the settings, and determine the best outcome.

● Set the mapping parameters fairly liberally: map as many reads as possible, and let the mapper assign mapping qualities. Ideally, every read maps once ('uniquely mapped'). In the following step, we will discard reads mapped to multiple locations ('multi reads').
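Discarding multi reads can be sketched on SAM text lines. The example assumes the mapper writes the optional `NH:i:` tag (number of reported alignments for the read), which is not guaranteed by the SAM format itself; reads without the tag are treated here as not provably unique. The read names and alignments are invented.

```python
def is_uniquely_mapped(sam_line):
    """Return True if the alignment is mapped and its NH tag equals 1."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # FLAG bit 0x4: read is unmapped
        return False
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return False  # no NH tag: uniqueness cannot be established


# Hypothetical alignments: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL [tags]
sam_lines = [
    "\t".join(["read1", "0", "chr1", "100", "50", "5M", "*", "0", "0", "ACGTA", "IIIII", "NH:i:1"]),
    "\t".join(["read2", "0", "chr1", "200", "3", "5M", "*", "0", "0", "ACGTA", "IIIII", "NH:i:2"]),
    "\t".join(["read3", "4", "*", "0", "0", "*", "*", "0", "0", "ACGTA", "IIIII"]),
]
unique_reads = [line.split("\t")[0] for line in sam_lines if is_uniquely_mapped(line)]
print(unique_reads)
```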

Page 19: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping quality control

The outcome of the alignment is a file in SAM or BAM format, which you can visualize in Galaxy (or with a stand-alone viewer such as GenomeView or IGV).

Let's visualize

Check whether this visualization matches:

- paired end

- splice junctions

- strandedness

- ...

Page 20: Part 3 of RNA-seq for DE analysis: the mapping process

Practical tips

GTF

These are the reads, in 2 colours because of the sense and antisense strand (obviously this library was not stranded!).

Position

Some reads span an intron

Page 21: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC

Quality control of the mapping is done by checking the SAM file.

1. Is the fraction of reads mapping OK?

2. Is the number of uniquely mapped reads OK?

3. Is the number of mismatches OK?

4. In case of PE reads: do the reads fall into the same gene?

Tools: RSeQC, BamQC

http://rseqc.sourceforge.net/
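Two of these checks can be approximated directly from SAM text, as a rough sketch and not a substitute for RSeQC or BamQC. It assumes the mapper writes the optional `NM:i:` tag (edit distance to the reference, which counts indels as well as mismatches) and counts each SAM record as one read, so the two mates of a pair are counted separately. All records are invented.

```python
def mapping_summary(sam_lines):
    """Summarise the mapped fraction and the mean edit distance of primary alignments."""
    total = mapped = 0
    edit_distances = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:  # skip secondary and supplementary alignments
            continue
        total += 1
        if flag & 0x4:  # unmapped
            continue
        mapped += 1
        for tag in fields[11:]:
            if tag.startswith("NM:i:"):
                edit_distances.append(int(tag[5:]))
    return {
        "reads": total,
        "fraction_mapped": mapped / total if total else 0.0,
        "mean_edit_distance": (sum(edit_distances) / len(edit_distances)
                               if edit_distances else None),
    }


# Hypothetical SAM content, for illustration only.
example_sam = [
    "@HD\tVN:1.6",
    "\t".join(["read1", "0", "chr1", "100", "50", "5M", "*", "0", "0", "ACGTA", "IIIII", "NM:i:0"]),
    "\t".join(["read2", "16", "chr1", "300", "50", "5M", "*", "0", "0", "ACGTA", "IIIII", "NM:i:2"]),
    "\t".join(["read3", "4", "*", "0", "0", "*", "*", "0", "0", "ACGTA", "IIIII"]),
    "\t".join(["read1", "256", "chr2", "900", "0", "5M", "*", "0", "0", "ACGTA", "IIIII", "NM:i:1"]),
]
summary = mapping_summary(example_sam)
print(summary)
```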

Page 22: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

After checking the mapping visually, determine more metrics with RSeQC.

http://rseqc.sourceforge.net/

Page 23: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

Duplication rate observed in the RNA-seq data.

http://rseqc.sourceforge.net/
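What a duplication rate measures can be shown with a sequence-based tally, in which reads with an identical sequence count as duplicates. This is a simplified sketch and not necessarily how RSeQC derives its plot; the function names and reads are invented.

```python
from collections import Counter


def duplication_histogram(sequences):
    """Map an occurrence count to the number of distinct sequences seen that many times."""
    per_sequence = Counter(sequences)
    return dict(Counter(per_sequence.values()))


def duplication_rate(sequences):
    """Fraction of reads that are extra copies of a sequence already seen."""
    sequences = list(sequences)
    if not sequences:
        return 0.0
    return 1 - len(set(sequences)) / len(sequences)


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC", "GGCC"]  # toy reads
histogram = duplication_histogram(reads)
rate = duplication_rate(reads)
print(histogram, rate)
```

In RNA-seq data, highly expressed genes naturally produce many identical reads, so a high duplication rate is not by itself a sign of a technical problem.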

Page 24: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

Read quality of aligned reads

http://rseqc.sourceforge.net/

Page 25: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

Sequence depth saturation

http://rseqc.sourceforge.net/

Early flattening points to saturation

Q1 → Q4: from low count genes to high count genes
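The idea behind a saturation curve can be sketched by subsampling: take growing fractions of the reads and see whether the result still changes. The version below simply counts detected genes per subsample, which is a simplification and not the statistic RSeQC plots; the function name, the threshold and the read assignments are hypothetical.

```python
import random


def genes_detected(read_genes, fractions, min_reads=1, seed=0):
    """Number of genes with at least `min_reads` reads at each subsampling fraction.

    `read_genes` holds one gene identifier per assigned read.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(read_genes)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        subset = shuffled[: int(len(shuffled) * fraction)]
        counts = {}
        for gene in subset:
            counts[gene] = counts.get(gene, 0) + 1
        curve.append(sum(1 for c in counts.values() if c >= min_reads))
    return curve


# Toy data: one high count gene, one medium and one low count gene.
read_genes = ["GeneA"] * 80 + ["GeneB"] * 15 + ["GeneC"] * 5
curve = genes_detected(read_genes, fractions=[0.25, 0.5, 0.75, 1.0])
print(curve)
```

A curve that stops rising well before the full data set is reached suggests that sequencing deeper would add little.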

Page 26: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

Sequence depth saturation

http://rseqc.sourceforge.net/

Page 27: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

The 'read coverage' per gene, averaged over all genes.

http://rseqc.sourceforge.net/

Page 28: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - RSeQC

The gene body coverage graph, also per sample.

http://rseqc.sourceforge.net/

Deviating!
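A gene body coverage profile can be approximated by scaling positions along a transcript into a fixed number of bins from the 5' to the 3' end. The sketch below bins read start positions for a single transcript, which is a simplification of what a QC tool computes over all genes; the names and numbers are invented.

```python
def gene_body_coverage(read_positions, transcript_length, bins=10):
    """Count reads per relative position bin along one transcript.

    `read_positions` are 0-based positions along the transcript, 5' to 3'.
    """
    coverage = [0] * bins
    for pos in read_positions:
        bin_index = min(pos * bins // transcript_length, bins - 1)
        coverage[bin_index] += 1
    return coverage


# Toy sample in which most reads pile up at the 3' end of a 1000 nt transcript.
profile = gene_body_coverage([50, 150, 850, 900, 950, 990], transcript_length=1000)
print(profile)
```

A sample whose profile leans strongly to one end, as in this toy case, is the kind of deviation the graph is meant to expose.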

Page 29: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC - BamQC

Another useful tool is BamQC of the Qualimap suite. Be aware, however, that it is mainly useful for DNA-seq!

http://qualimap.bioinfo.cipf.es/

Page 30: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 31: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 32: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 33: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

Fraction of genome sequence not covered

http://qualimap.bioinfo.cipf.es/

Page 34: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 35: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 36: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 37: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 38: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

http://qualimap.bioinfo.cipf.es/

Page 39: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Page 40: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Page 41: Part 3 of RNA-seq for DE analysis: the mapping process

Mapping QC: BamQC

Some examples to watch out for.

http://qualimap.bioinfo.cipf.es/

Page 42: Part 3 of RNA-seq for DE analysis: the mapping process

Keywords

haplotype

Gapped mapping

GTF

duplication

isoforms

strandedness

coverage

Write in your own words what the terms mean

Page 44: Part 3 of RNA-seq for DE analysis: the mapping process

Break