de novo assembly - oregon state universitypeople.oregonstate.edu/~knausb/rna_seq/10_denovo.pdfbasic...
TRANSCRIPT
De novo assembly Brian J. Knaus
USDA Forest Service
Pacific Northwest Research Station
Kelly Vining and Katherine Pillman
Oregon State University
1
Calendar
December meeting cancelled.
January meeting organizational.
◦ Introductions.
◦ Projects we’re involved with.
◦ What we want to get out of this group?
February meeting: Cufflinks (pending)
2
Overview
What are de Bruijn graphs?
Software:
◦ Velvet
◦ ABySS
◦ Trinity
3
Genomic vs. transcriptomic
Genomic expectations:
◦ Long contigs (i.e., chromosomes).
◦ Uniform coverage.
Transcriptomic expectations:
◦ Short contigs (i.e., genes or operons).
◦ Non-uniform coverage.
4
Two types of computer error
Errors which crash the program and
generate an error message.
Errors where the software appears to
exit successfully but has actually done
something bad.
5
Essential bioinformatic software Database/query applications
◦ Blast
◦ Blat
◦ MAQ
◦ Bowtie
◦ BWA
De novo assembly
◦ Velvet
◦ ABySS
Soap 6
De novo assembly
Compiled code (usually C; Dennis Ritchie
R.I.P).
Require LOTS of memory!
Implement de Bruijn graphs.
7
8
ACTGATTG
9
ACTGATTG
k-mer = 5
10
ACTGATTG
k-mer = 5
11
ACTGATTG
k-mer = 5
12
ACTGATTG
k-mer = 5
13
ACTGATTG
k-mer = 5
Schematic representation of our implementation of the de Bruijn graph.
Zerbino D R , Birney E Genome Res. 2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Schematic representation of our implementation of the de Bruijn graph.
Zerbino D R , Birney E Genome Res. 2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Example of Tour Bus error correction.
Zerbino D R , Birney E Genome Res. 2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Example of Tour Bus error correction.
Zerbino D R , Birney E Genome Res. 2008;18:821-829
©2008 by Cold Spring Harbor Laboratory Press
Issues:
• Bubbles.
• Tips.
• Errors.
Velvet
Kelly Vining
Oregon State University
Forestry
18
Velvet • Current version: 1.1.06 (1.1.05 at CGRB cluster)
• Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18: 821–829.
• Zerbino DR, McEwen GK, Margulies EH, Birney E, 2009 Pebble and Rock Band: Heuristic Resolution of Repeats and Scaffolding in the Velvet Short-Read de Novo Assembler. PLoS ONE 4(12): e8407. doi:10.1371/journal.pone.0008407
Basic Velvet de novo assembly
Run velveth Velveth takes in a number of sequence files, produces a hashtable, then
outputs two files in an output directory (creating it if necessary),
Sequences and Roadmaps, which are necessary to velvetg.
Run velvetg Velvetg is the core of Velvet where the de Bruijn graph is built then
manipulated.
Note that although velvetg saves some files during the process to avoid
useless recalculations, the parameters are not saved from one run to the
next.
On the command line…
/local/cluster/velvet
[viningk@needleman velvet]$ ll
total 32
drwxr-xr-x 2 root root 34 Sep 24 2007 VELVET/
drwxr-xr-x 8 root root 4096 Jul 7 2009 velvet_0.7.42/
drwxr-xr-x 8 root root 4096 Dec 23 2009 velvet_0.7.55/
drwxr-xr-x 8 root root 4096 Feb 16 2010 velvet_0.7.58/
drwxr-sr-x 8 4205 1304 4096 Oct 4 2010 velvet_1.0.12/
drwxr-xr-x 8 root root 4096 May 18 10:29 velvet_1.1.02/
drwxr-xr-x 8 boyda spatafojlab 4096 Sep 7 14:39 velvet_1.1.05/
lrwxrwxrwx 1 root root 13 Sep 22 14:15 velvet -> velvet_1.1.05/
drwxr-xr-x 9 root root 152 Sep 22 14:15 ./
drwxrwxr-x 123 root cgrb 4096 Nov 1 13:34 ../
/local/cluster/velvet/velvet_1.1.05
[viningk@needleman velvet_1.1.05]$ ll
total 884
drwxr-xr-x 3 boyda spatafojlab 23 Jul 1 2009 third-party/
drwxr-xr-x 3 boyda spatafojlab 23 Dec 9 2009 doc/
drwxr-xr-x 2 boyda spatafojlab 90 Dec 21 2010 data/
-rwxr-xr-x 1 boyda spatafojlab 26 Jan 11 2011 update_velvet.sh*
-rwxr-xr-x 1 boyda spatafojlab 479 Jan 11 2011 shuffleSequences_fastq.pl*
-rwxr-xr-x 1 boyda spatafojlab 876 Jan 11 2011 shuffleSequences_fasta.pl*
-rw-r--r-- 1 boyda spatafojlab 17988 Jan 11 2011 LICENSE.txt
-rw-r--r-- 1 boyda spatafojlab 1117 Mar 8 2011 README.txt
-rw-r--r-- 1 boyda spatafojlab 456 Mar 8 2011 For_MAC_or_SPARC_users.txt
-rw-r--r-- 1 boyda spatafojlab 1956 Mar 29 2011 CREDITS.txt
-rw-r--r-- 1 boyda spatafojlab 3834 Mar 29 2011 globals.h
drwxr-xr-x 15 boyda spatafojlab 4096 Mar 29 2011 contrib/
drwxr-xr-x 2 boyda spatafojlab 4096 Jul 27 14:35 src/
-rw-r--r-- 1 boyda spatafojlab 6118 Jul 28 13:38 Makefile
-rw-r--r-- 1 boyda spatafojlab 10871 Jul 28 13:38 ChangeLog
drwxr-xr-x 3 boyda spatafojlab 4096 Sep 7 14:39 obj/
-rwxr-xr-x 1 boyda spatafojlab 64819 Sep 7 14:39 velveth*
-rwxr-xr-x 1 boyda spatafojlab 246830 Sep 7 14:39 velvetg*
-rw-r--r-- 1 boyda spatafojlab 439712 Sep 7 14:39 Manual.pdf
-rw-r--r-- 1 boyda spatafojlab 61416 Sep 7 14:39 Columbus_manual.pdf
drwxr-xr-x 8 boyda spatafojlab 4096 Sep 7 14:39 ./
drwxr-xr-x 9 root root 152 Sep 22 14:15 ../
Velveth Designate file type (fastq, fasta, etc)
Designate read category
Designate hash size
Supported file formats are:
fasta (default)
fastq
fasta.gz
fastq.gz
sam
bam
eland
gerald
Read categories are:
short (default)
shortPaired**
short2 (same as short, but for a separate
insert-size library)
shortPaired2** (see above)
long (for Sanger, 454 or even reference
sequences)
longPaired**
**Paired-end reads must be in a single file (ShuffleSequences.pl)
Sample commands: velveth
> ./velveth output_directory hash_length
[[-file_format][-read_type] filename]
> /deepspace/strauss/viningk/velvet_1.1.02/velveth velvet_041411 29
-fastq -shortPaired s_5_seqsTrim6.txt
> /local/cluster/velvet/velvet_1.1.02/velveth velvet_091410b 31
-fastq -short s_1_seqsMergedTrim6.txt -fasta -long contigs_09.fa
Velvetg Designate library insert size
Designate minimum contig length
Designate coverage cutoff
Need to optimize according to genome size, expected genome coverage depth,
etc.
Examples: > ./velvetg output_directory/ -min_contig_lgth 100
> /deepspace/strauss/viningk/velvet_1.1.02//velvetg velvet_041411/
-ins_length 350 -min_contig_lgth 500
> /deepspace/strauss/viningk/velvet_1.1.02/velvetg velvet_051811b
-ins_length 350 -exp_cov 20 -cov_cutoff 7.9 -min_contig_lgth 500
Velvet Optimiser
Tries a range of k-mer lengths with velveth, then runs velvetg
with range of coverage cutoffs
Lives here on the CGRB cluster:
/local/cluster/velvet/velvet_1.1.05/contrib/VelvetOptimiser-2.1.7
Sample command:
/deepspace/strauss/viningk/velvet_1.1.02/contrib/VelvetOptimiser-
2.1.7/VelvetOptimiser.pl -s 27 -e 31 -f '-shortPaired -fastq s_5_seqsTrim6.txt
s_1_seqsMergedTrim6.txt' > s_15_VOtest3.txt
Sample VelvetOptimiser output ********************************************************
Assembly id: 1
Velveth timestamp: May 20 2011 13:47:08
Velveth version: 1.0.12
Readfile(s): -shortPaired -fastq s_5_seqsTrim6.txt
s_1_seqsMergedTrim6.txt
Velveth parameter string: auto_data_31 31 -shortPaired -fastq
s_5_seqsTrim6.txt s_1_seqsMergedTrim6.txt
Assembly directory: /deepspace/strauss/viningk/auto_data_31
Velvet hash value: 31
Roadmap file size: 12143769440
Total number of sequences: 150746258
**********************************************************
May 20 13:47:08
Beginning vanilla velvetg runs.
********************************************************
Assembly id: 1
Assembly score: 140
Velveth timestamp: May 20 2011 13:47:08
Velvetg timestamp: May 20 2011 18:59:34
Velveth version: 1.0.12
Velvetg version: 1.0.12
Readfile(s): -shortPaired -fastq s_5_seqsTrim6.txt
s_1_seqsMergedTrim6.txt
Velveth parameter string: auto_data_31 31 -shortPaired -fastq
s_5_seqsTrim6.txt s_1_seqsMergedTrim6.txt
Velvetg parameter string: auto_data_31
Assembly directory: /deepspace/strauss/viningk/auto_data_31
Velvet hash value: 31
Roadmap file size: 12143769440
Total number of sequences: 150746258
Total number of contigs: 2635009
n50: 140
length of longest contig: 6491
Total bases in contigs: 343232524
Number of contigs > 1k: 11471
Total bases in contigs > 1k: 15400103
Sample VelvetOptimiser output
(continued) **********************************************************
May 21 06:11:49 Setting cov_cutoff to 5.195.
********************************************************
Assembly id: 1
Assembly score: 93102316
Velveth timestamp: May 20 2011 13:47:08
Velvetg timestamp: May 21 2011 18:22:55
Velveth version: 1.0.12
Velvetg version: 1.0.12
Readfile(s): -shortPaired -fastq s_5_seqsTrim6.txt
s_1_seqsMergedTrim6.txt
Velveth parameter string: auto_data_31 31 -shortPaired -fastq
s_5_seqsTrim6.txt s_1_seqsMergedTrim6.txt
Velvetg parameter string: auto_data_31 -exp_cov 17 -
cov_cutoff 5.1952
Assembly directory: /deepspace/strauss/viningk/auto_data_31
Velvet hash value: 31
Roadmap file size: 12143769440
Total number of sequences: 150746258
Total number of contigs: 1080880
n50: 602
length of longest contig: 12710
Total bases in contigs: 261325501
Number of contigs > 1k: 52009
Total bases in contigs > 1k: 93102316
Paired Library insert stats:
Paired-end library 1 has length: 154, sample standard deviation:
57
Paired-end library 1 has length: 160, sample standard deviation:
63
**********************************************************
ABySS
Katherine Pillman
Oregon State University
Horticulture
http://www.molecularevolution.org/molevol
files/presentations/comparative-genomics-
abyss-20110714.pdf
29