de novo assembly of rna steve kelly slides are available at: example data

43
De novo assembly of RNA Steve Kelly www.stevekelly.eu Slides are available at: http://bioinformatics.plants.ox.ac.uk/EBI_2014 /EBI_2014.ppt Example data is available at: http://bioinformatics.plants.ox.ac.uk/EBI_2014 /reads.zip

Upload: jaime-canner

Post on 14-Dec-2015

216 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

De novo assembly of RNA

Steve Kelly

www.stevekelly.eu

Slides are available at:

http://bioinformatics.plants.ox.ac.uk/EBI_2014/EBI_2014.ppt

Example data is available at:

http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip

Page 2: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

What is a de novo transcriptome?

Set of nucleotide sequences obtained by sequencing reverse transcribed RNA

- Poly A selected

- Ribsomal RNA depleted

- Random hexamer primed

- Long reads/short reads

- Strand specific

Sequencing technology: Read size: Throughput per run:

Illumina ~120bp ~800Gbp

454 ~1000bp ~700Mbp

PacBio ~2700bp ~100Mbp

GridION ~100kbp ~100Gbp/day

Select for mRNA

Page 3: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Why do de novo transcriptomes

Good way of generating new data from non-model species

- It can tell you about gene presence

- It can tell you about gene expression

- It can tell you about evolution

Cheaper than you might think!

2Gb of paired end sequence data is sufficient to get 90% of expressed genes!

A de novo plant transcriptome from Illumina sequence: ~£300

Typical computing resources (standard PC desktop)

Linux desktop: cost £800-1000.

at least 1 quad core processor

at least 16gb of ram

at least 3 TB of disk space

Page 4: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Why do de novo transcriptomes

Example data for this talk is available at:http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip

Paired end Illumina sequence data from Sorghum bicolor leaves R1.fq is a file containing the first end reads.R2.fq is contains the second end reads.The first read in R2.fq is the pair of the first read in R1.fq so the order of the reads is very important in paired-end read files!

All software used is available for Linux- Web addresses for current versions are given- Software takes many hours to run of large datasets

All software is command line only- Command line examples for every step are given

Page 5: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastx_quality_stats –i R1.fq –o R1_output.txt~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png

Command line examples

The name of the software being used

Where you can get it for free

How to use software to generate the data shown on the slide

Shorthand for the full path to the directory where the program is located

Page 6: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

De novo assembly protocol outline

1) Pre-assembly read processing

Remove poor quality data before assembly

2) Assembly

Using different parameters and merging

3) Post-assembly processing

Identifying protein sequences and orthologous protein groups

4) Quantification

Mapping raw reads back onto assembly to determine expression level

Page 7: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

What do sequence files look like?

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…+Fefefggggggfggggggggfgggegggggggaggggggggdg…

Each read has a name that starts with an @ symbol

Page 8: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

What do sequence files look like?

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…+Fefefggggggfggggggggfgggegggggggaggggggggdg…

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…+gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…

Each read has a name that starts with an @ symbol

If the reads are paired, then the first in pair will end with a /1 and the

second with a /2

Page 9: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

What do sequence files look like?

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…+Fefefggggggfggggggggfgggegggggggaggggggggdg…

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…+gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…

Each read has a name that starts with an @ symbol

If the reads are paired, then the first in pair will end with a /1 and the

second with a /2

The read sequence is on the line after the read name

Page 10: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

What do sequence files look like?

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…+Fefefggggggfggggggggfgggegggggggaggggggggdg…

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…+gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…

Each read has a name that starts with an @ symbol

If the reads are paired, then the first in pair will end with a /1 and the

second with a /2

The read sequence is on the line after the read name

The sequence read is followed by a +

symbol on a new line

Page 11: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

What do sequence files look like?

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…+Fefefggggggfggggggggfgggegggggggaggggggggdg…

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…+gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…

Each read has a name that starts with an @ symbol

If the reads are paired, then the first in pair will end with a /1 and the

second with a /2

The read sequence is on the line after the read name

The base-by-base qualities are on line #4

The sequence read is followed by a +

symbol on a new line

Page 12: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Understanding read quality scores

Quality scores are the probability that the base is called incorrectly

Q = -10 Log10(p)

where p is the probability that base is incorrect

Quality score Probability of sequencing error Accuracy

10 1 in 10 90%

20 1 in 100 99%

30 1 in 1000 99.9%

Read quality is more important for assembly than for read mapping!

Page 13: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example of read quality scores

Reads in Illumina 1.5+ encoding (Quality 2 - 40)

What does quality score of B mean?

ASCII value for B = 66 (http://www.asciitable.com/)

Illumina 1.5+ offset = 64

B = 66 – 64 = 2

2 = -10 Log10(p)

p = 10^(-2/10)

p = 0.63

i.e. 63% chance that the base call is wrong

Page 14: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Visualising read quality data

Important to visualise/summarise the raw data before using it!

Most common errors can be spotted in general summary statistics!

Use FASTQC for nice visual output:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Use FASTX or Trimmomatic for simple processing tasks:

http://hannonlab.cshl.edu/fastx_toolkit/

http://www.usadellab.org/cms/?page=trimmomatic

Page 15: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastx_quality_stats –i R1.fq –o R1_output.txt~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png

Page 16: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastx_quality_stats –i R1.fq –o R1_output.txt~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png

Normal 5’ composition bias caused by non-random hexamer used during cDNA synthesis

Page 17: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastx_trimmer –f 6 –i R1.fq –o trim_R1.fq

If there is unusual composition bias at either end it is likely to cause problems. It’s best to remove these bases!

Page 18: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastx_clipper –a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG –i R1.fq –o filt_R1.fqjava -jar trimmomatic-0.30.jar PE -threads 4 -phred64 R1.fq R2.fq trimmed_R1.fq unpaired_R1.fq trimmed_R2.fq unpaired_R2.fq ILLUMINACLIP:all_adaptors.fasta:2:30:10 LEADING:10 TRAILING:10 SLIDINGWINDOW:5:15 MINLEN:50

Non-uniform composition in the reads due to sequencing adaptor sequences.

Remove the appropriate adapter sequence using fastx_clipper

Page 19: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastq_quality_boxplot_graph.sh –i R1_output.txt –o R1_box_plot.png

Pre-assembly read processing

Page 20: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Normal 3’ degradation of read quality scores.

Sequence at 3’end has higher proportion of errors

Before assembly good idea to remove poor quality sequences!Command line:~/fastq_quality_boxplot_graph.sh –i R1_output.txt –o R1_box_plot.png

Page 21: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Understanding read quality scores

Quality scores are the probability that the base is called incorrectly

Q = -10 Log10(p)

where p is the probability that base is incorrect

Quality score Probability of sequencing error Accuracy

10 1 in 10 90%

20 1 in 100 99%

30 1 in 1000 99.9%

Read quality is more important for assembly than for read mapping!

Quantification

SNP calling or assembly

Page 22: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Three ways to deal with poor quality reads:

1) Discard reads based on quality threshold

Most stringent

Lose lots of data

2) Remove poor quality sections from reads

Middle stringency

Keep most of the read data, just loosing the unreliable bits

1) Try to find and fix the errors in poor quality reads

Least stringent

Error correction may introduce errors

Keep all the data

Page 23: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Command line:~/fastq_quality_filter –q 30 –p 50 –i R1.fq –o filt_R1.fq~/fastx_quality_stats –i filt_R1.fq –o filt_R1_output.txt~/fastq_quality_boxplot_graph.sh –i filt_R1_output.txt –o filt_R1_box_plot.png

Removed reads which have quality < 30 for more than 50% of the read length

22.5 million -> 14.3 million (64%)

Pre-assembly read processing

Page 24: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using fastx toolkit

http://hannonlab.cshl.edu/fastx_toolkit/

Pre-assembly read processing

Command line:~/fastq_quality_trimmer –t 30 –l 30 –i R1.fq –o trim_R1.fq~/fastx_quality_stats –i trim_R1.fq –o trim_R1_output.txt~/fastq_quality_boxplot_graph.sh –i trim_R1_output.txt –o trim_R1_box_plot.png

Clip reads from both ends which have quality until quality > 30. Discard if read length < 30

22.5 million -> 22.2 million reads (99%)

Page 25: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Command line:~/sortmerna --I interleaved_trimmed_reads.fq -n 4 --db “the full paths to the four included ribosomal RNA libraries” --paired-in --accept ribosomal_RNA_reads --other non_ribosomal_reads --log rRNA_filter_log -a 4

Example: Using SortMeRNA

http://bioinfo.lifl.fr/RNA/sortmerna/

Even after polyA selection or ribosomal RNA depletion you will have 1-30% ribosomal RNA contamination in your RNAseq. Best to remove it before assembly!

You will need to interleave your read files to run SortMeRNA.

@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAA+Fefefggggggfggggggggfgggegggggggaggggggggd@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACA+gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^

One file containing all read pairs interleaved consecutively.

Page 26: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using ALLPATHS-LG

http://www.broadinstitute.org/software/allpaths-lg/blog/

Pre-assembly read processing

- Splits all reads into all possible sub-sequences of length k

- Makes stacks of near identical sub-sequences

- Bigger the value of k the more conservative

- Small value of k can introduce more errors than it fixes!

Page 27: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using ALLPATHS-LG

http://www.broadinstitute.org/software/allpaths-lg/blog/

Pre-assembly read processing

- Splits all reads into all possible sub-sequences of length k

- Makes stacks of near identical sub-sequences

- Bigger the value of k the more conservative

- Small value of k can introduce more errors than it fixes!

Page 28: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using ALLPATHS-LG

http://www.broadinstitute.org/software/allpaths-lg/blog/

Pre-assembly read processing

- Splits all reads into all possible >25bp sub-sequences (p < 1e-6)

- Makes stacks of near identical sub-sequences (no gaps)

Command line:~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Page 29: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using ALLPATHS-LG

http://www.broadinstitute.org/software/allpaths-lg/blog/

Pre-assembly read processing

- Splits all reads into all possible >25bp sub-sequences (p < 1e-6)

- Makes stacks of near identical sub-sequences (no gaps)

- Correct singletons and low frequency bases

Keep all 22.5 million reads (100%)

Command line:~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Page 30: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Example: Using ALLPATHS-LG

http://www.broadinstitute.org/software/allpaths-lg/blog/

Pre-assembly read processing

Join paired end reads into long reads prior to assembly.

Command line:~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Assembly contains up to 50% less reads

Assembly program doesn’t incorrectly guess insert size!

Prevents some common assembly artefacts

Steve Kelly
Page 31: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Pre-assembly read processing

Example “good practice” pre-assembly processing

- Visually inspect nucleotide composition and quality scores

- Clip off 5’ or 3’ ends containing unexpected/unusual nucleotide bias

- Filter out any reads containing adaptor sequences

- Trim back reads from both ends based on quality score

- Filter out ribosomal RNA reads

- Error correct and join overlapping remaining reads using k >= 25bp

- (Optional step here would be to normalise reads to even out coverage depth for example see program called “khmer”)

- Assemble!

Page 32: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Assembly packages:

All use the same underlying idea of assembling using de Bruijn graphs

1) Velvet/Oases - Good all round and very fast

2) Trinity - Good at getting full length transcripts

3) TransAbyss - Good at getting lots of splice variants

4) SOAPdenovo-Trans - Very fast with low memory footprint

Results of all assembly programs are quite similar.

As each of them comes up with a new idea they tend to be built into the others quickly too so that they don’t get left behind!

I recommend you start with velvet/oases as its easy to install and use!

Which assembler to use?

Page 33: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Simplified overview of assembly process

Page 34: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Simplified overview of assembly process

Page 35: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Command line example:~/oases_pipeline.py –m 31 –M 61 –s 10 –g 41 –o merged –d “ –fastq –shortPaired –separate R1.fq R2.fq“ –d –ins_length 170 –cov_cutoff 6 –min_pair_count 4 –min_trans_lgth 200

Assembling the transcriptome

Example: Using Velvet/Oases

http://www.ebi.ac.uk/~zerbino/velvet/ & http://www.ebi.ac.uk/~zerbino/oases/ Assembly only needs two parameters

- Need to specify the insert size

- Need to specify the k-mer size for constructing the de Bruijn graph

- Larger k more specific

- Smaller k more sensitive

- Best to use a range of k-mer sizes and merge the results

- For example k = 31, 41, 51 and 61

- Set a minimum transcript length

- Set a minimum coverage cut off

Page 36: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Assessing the transcriptome

Effect of kmer size of transcript length part 1

Page 37: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Assessing the transcriptome

Effect of kmer size of transcript length part 2

Page 38: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Assessing the transcriptome

Effect of kmer size on number of transcripts

Page 39: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Command line example:~/rsem-prepare-reference --bowtie-path “path-to-bowtie” transcripts.fa~/rsem-calculate-expression --bowtie-path “path-to-bowtie” --paired-end R1.fq R2.fq transcripts.fa

Quantifying your transcriptome

Example: Using RSEM & bowtie

http://deweylab.biostat.wisc.edu/rsem/ & http://bowtie-bio.sourceforge.net/index.shtml

Use processed reads for quantification

- Cant use programs such as cufflinks to estimate abundance from transcripts

- Current best method is RSEM

gene_id transcript_id(s) length effective_length expected_count TPM FPKM

transcript_1 transcript_1 2021 1936.59 519.77 5190.23 6378.74

transcript_100 transcript_100 887 802.59 96 2313.29 2843.01

transcript_101 transcript_101 880 795.59 1856.65 45132.93 55467.89

transcript_102 transcript_102 876 791.59 370 9039.72 11109.72

transcript_103 transcript_103 876 791.59 173.95 4249.96 5223.15

transcript_104 transcript_104 875 790.59 53 1296.52 1593.4

Use expected count for DE testing

Page 40: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Summary

- A de novo plant transcriptome from Illumina sequence: ~£300

- Important to check the raw data before assembly

- Correcting and removing errors before assembly improves quality

- Most assembly methods have similar performance

- Try it!

Page 41: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Assemble your own mini-transcriptome

http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip

Download and extract some pre-trimmed read data from:

Command line:wget http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip ./

unzip reads.zip

Command line:ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True

Error correct and join overlapping paired end reads:

These reads come from 5 genes detected in a sorghum RNAseq experimentSb09g020820, Sb02g034570, Sb07g000980, Sb07g023920, Sb06g016090

Page 42: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Command line:velveth kmer_31 31 -fastq -shortPaired -separate R1.fq R2.fq

velvetg kmer_31 -read_trkg yes -scaffolding yes -ins_length 170

oases kmer_31 -ins_length 170 -scaffolding yes

Assemble uncorrected (trimmed) reads

Assemble your own mini-transcriptome

Command line:velveth kmer_31ec 31 -fastq -shortPaired -separate corrected_reads.paired.A.fastq corrected_reads.paired.B.fastq

velvetg kmer_31ec -read_trkg yes -scaffolding yes -ins_length 170

oases kmer_31ec -ins_length 170 -scaffolding yes

Assemble corrected reads

Page 43: De novo assembly of RNA Steve Kelly  Slides are available at:  Example data

Command line:velveth kmer_31ecj 31 -fastq -short corrected_reads.filled.fastq

velvetg kmer_31ecj -read_trkg yes

oases kmer_31ecj

Assemble corrected “joined” reads (*MUCH* better for larger read datasets)

Assemble your own mini-transcriptome

Look in the “transcripts.fa” output files to see what transcripts you have assembled

BLAST transcripts online at NCBI BLAST

Are each of the 5 expected genes assembled?Sb09g020820, Sb02g034570, Sb07g000980, Sb07g023920 & Sb06g016090

If you assemble at k = 21 do you generate any chimeric transcripts?

Try the oases pipeline script on slide 35.