09.05.2012 CS-681 PRESENTATION 1/25
ALLPATHS-LG
A new standard for assembling a billion-piece genome puzzle
CS 681
presented by
Ömer Köksal
High-quality draft assemblies of mammalian genomes from massively parallel sequence data
ALLPATHS-LG
by Sante Gnerre et al. (20 authors), Jan 25th, 2011
Agenda
Introduction
Results
Model for Input Data
Sequencing Data
ALLPATHS-LG Assembly Method
Uncertainty in Assemblies
Human and Mouse Assemblies
Human Genome
Mouse Genome
Segmental Duplications
Understanding Gaps
Discussion
Introduction
High-quality assembly of a genome sequence is critical
It is particularly challenging for large, repeat-rich genomes such as those of mammals
Using traditional capillary-based sequencing (reads of >700 bases), such assemblies have been produced for multiple mammals at a cost of tens of millions of dollars each
New massively parallel technologies were expected to lower the cost dramatically, but so far they had not, because of
• short reads (~100 bases in length)
• lower accuracy
• difficulty of assembly
Introduction (cont’d)
ALLPATHS-LG
De novo assembly of large (and small) genomes
It should be possible to generate high-quality draft assemblies of large genomes
at ~1000-fold lower cost than a decade ago
Previous versions:
• ALLPATHS 1.0 (2008)
• ALLPATHS 2.0 (2009)
Results
Model for Input Data
Sequencing Data
ALLPATHS-LG Assembly Method
Uncertainty in Assemblies
Human and Mouse Assemblies
Human Genome
Mouse Genome
Segmental Duplications
Understanding Gaps
Results - Model for Input Data
De novo genome assembly depends on
• computational methods
• nature and quantity of sequence data used
The fairly standard model for capillary-based sequencing was modified
The model sets a target of 100-fold sequence coverage to compensate for shorter reads & nonuniform coverage
Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than for capillary sequencing
Illumina sequencing was used (Table 1)
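To make the 100-fold coverage target concrete, the sketch below applies the usual relation coverage = (number of reads × read length) / genome size. The ~3 Gb genome size is an illustrative assumption, not a figure from the slides; the ~100-base read length is the one quoted in the introduction. The function name `reads_needed` is hypothetical.

```python
def reads_needed(genome_size, read_length, target_coverage):
    """Smallest read count with reads * read_length / genome_size >= target_coverage."""
    total_bases = genome_size * target_coverage
    # Ceiling division on integers, so a partial read is rounded up.
    return -(-total_bases // read_length)


# Assumed values for illustration only.
GENOME_SIZE = 3_000_000_000   # ~3 Gb mammalian genome (assumption)
READ_LENGTH = 100             # ~100-base reads, as quoted in the introduction
TARGET_COVERAGE = 100         # 100-fold coverage target from the slide

print(reads_needed(GENOME_SIZE, READ_LENGTH, TARGET_COVERAGE))
```

Under these assumptions the target works out to roughly three billion reads, which is why the per-base cost matters far more than the coverage depth.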
Results - Model for Input Data (cont’d)
Table 1 – Provisional sequencing model for de novo assembly
Results – Sequencing Data
Using the model above, sequence data were generated for:
• Human Genome
• Mouse Genome
Human Genome:
• GM12878 cell line (Coriell Institute), from the 1000 Genomes Pilot Project
• NCBI Short Read Archive: Human_NA_12878_Genome_on_illumina
Mouse Genome:
• C57BL/6J female DNA
• NCBI Short Read Archive: Mouse_B6_Genome_on_illumina
Results - ALLPATHS-LG Assembly Method
Previous versions were improved extensively
Can also assemble small genomes
Freely available at:
http://www.broadinstitute.org/science/programs/genome-biology/crd
Results - ALLPATHS-LG Assembly Method (cont’d)
Some key innovations in ALLPATHS-LG:
- Handling repetitive sequences
  - more resilient to repeats
- Error correction
  - for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer
  - the incidence of incorrect error correction was reduced
- Use of jumping data
  - it works even with jumping data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions
- Efficient memory usage
  - can assemble human genomes on commercial servers (48 processors & 512 GB RAM) in a few weeks
  - (3 weeks for mouse & 3.5 weeks for human)
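The idea of examining the stack of reads that share a k-mer can be illustrated with a deliberately simplified sketch. This is not the ALLPATHS-LG implementation: the function names, the `min_ratio` threshold, and the simple majority-vote rule are all assumptions made for illustration, and a short k-mer stands in for the 24-mers used by the real algorithm.

```python
from collections import Counter


def stack_for_kmer(reads, kmer):
    """Collect (read, offset) for every read containing kmer (first occurrence only)."""
    stack = []
    for read in reads:
        pos = read.find(kmer)
        if pos != -1:
            stack.append((read, pos))
    return stack


def suspicious_bases(reads, kmer, min_ratio=3):
    """Flag bases that disagree with a clear majority in their stack column.

    Reads are aligned on the shared k-mer; a base is flagged when the most
    common base in its column is at least `min_ratio` times more frequent.
    Returns tuples (index in stack, position in read, observed, consensus).
    """
    stack = stack_for_kmer(reads, kmer)
    columns = {}
    for read, pos in stack:
        for j, base in enumerate(read):
            # Column index is measured relative to the start of the k-mer.
            columns.setdefault(j - pos, Counter())[base] += 1
    flagged = []
    for idx, (read, pos) in enumerate(stack):
        for j, base in enumerate(read):
            counts = columns[j - pos]
            top, top_n = counts.most_common(1)[0]
            if base != top and top_n >= min_ratio * counts[base]:
                flagged.append((idx, j, base, top))
    return flagged


reads = ["ACGTACGGA", "ACGTACGGA", "ACGTACGGA", "ACGTACTGA"]
print(suspicious_bases(reads, "ACGTAC"))
```

In this toy input the fourth read carries a `T` where the other three agree on `G`, so that single base is flagged as a likely sequencing error.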
Results – Uncertainty in Assemblies
The goal of assembly is to reconstruct the genome as accurately as possible
However, in some locations the data may be compatible with more than one solution
Rather than making an arbitrary choice (& introducing errors), the algorithm retains the alternatives
The ALLPATHS-LG algorithm generates an assembly graph whose edges are sequences and whose branches represent alternate choices
ATC{A,T}GGTTTTTTT{T,TT}ACAC
Variant Call Format (.VCF file)
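The brace notation in the example above can be read as a compact encoding of several candidate sequences. The following sketch enumerates them; it assumes only the syntax visible on the slide (comma-separated alternatives inside `{}`, no nesting), and the function name `expand_alternatives` is hypothetical rather than part of ALLPATHS-LG.

```python
import itertools
import re


def expand_alternatives(assembly):
    """List every sequence encoded by a string such as 'ATC{A,T}GG'."""
    # re.split with a capturing group alternates fixed text (even indices)
    # and the contents of each {...} group (odd indices).
    parts = re.split(r"\{([^{}]*)\}", assembly)
    choices = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            choices.append([part])
        else:
            choices.append([alt.strip() for alt in part.split(",")])
    return ["".join(combo) for combo in itertools.product(*choices)]


for seq in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
    print(seq)
```

Two ambiguous sites with two alternatives each yield four candidate sequences, which shows how quickly the number of explicit sequences grows compared with the compact notation.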
Results – Uncertainty in Assemblies (cont’d)
NOTE:
The current version of ALLPATHS-LG captures only single-base and simple-sequence indel uncertainties
Better ways to capture alternatives are needed (many are still lost in the current version, giving rise to errors)
It would be desirable to assign probabilities to each alternative
Results – Human & Mouse Assemblies
Resulting genome assemblies provide good coverage of the human and mouse genomes
ALLPATHS-LG assemblies were compared with previously published assemblies
- Capillary-based sequencing
- SOAP (massively parallel sequencing)
Results – Human & Mouse Assemblies (cont’d)
Results – Human Genome
N50 contig length of 24 kb
N50 scaffold length of 11.5 Mb
Contiguity is > 4-fold greater than with the SOAP algorithm
Connectivity is > 25-fold greater than with the SOAP algorithm
Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%)
Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%)
Results are similar to capillary-based assemblies
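Since the contiguity and connectivity figures are quoted as N50 values, a short sketch of the statistic may help: N50 is the length L such that contigs (or scaffolds) of length L or longer contain at least half of all assembled bases. The lengths in the example are made up for illustration and are not taken from the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one length")


# Toy lengths (arbitrary units), chosen only to illustrate the definition.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 weights long sequences heavily, a handful of very long scaffolds can raise it sharply, which is one reason the slides report coverage and accuracy alongside it.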
Results – Human Genome (cont’d)
Local assembly error: 3.5%
- Capillary: 4.1%
- SOAP: 6.2%
Long-range accuracy: 99.1%
- Capillary: 99.7%
- SOAP: 99.5%
Results – Mouse Genome
Results are broadly similar for the mouse genome
N50 contig length of 16 kb
N50 scaffold length of 7.2 Mb
Connectivity is > 20-fold greater than with the SOAP algorithm
These values approach the capillary results (contig: 25 kb, scaffold: 16.9 Mb)
Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%)
Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%)
Results are considerably better than SOAP
Results – Mouse Genome (cont’d)
Local assembly error: 3.0%
- Capillary: 2.7%
- SOAP: 14.2%
Long-range accuracy: 99.0%
- Capillary: 99.1%
- SOAP: 98.8%
Results – Segmental Duplications
Segmental duplications pose a challenge
ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications
- Capillary: 60%
- SOAP: 12%
NOTE:
Clearly, additional work is needed here
Results – Understanding Gaps
Roughly three quarters of the gaps are captured
The remaining gaps are not spanned
The majority of the sequence within the gaps consists of repetitive elements (61.9% of gaps):
- For mouse: LINE elements are the major contributors to gaps
- For human: LINE & SINE elements
Discussion
High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome
They cost tens of millions of dollars each to generate with capillary-based sequencing
In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold
Discussion (cont’d)
ALLPATHS-LG
- Good long-range connectivity,
- Good accuracy,
- Good coverage
with respect to capillary-based sequencing, and
better than SOAP
ALLPATHS-LG
- Quality of the assemblies is considerably better:
scaffolds are > 25 times longer
Discussion (cont’d)
Improved versions of ALLPATHS-LG are anticipated to yield even better results.
ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT}
Computational hardware requirements:
- SOAP is faster (takes 3 days) but its accuracy is low
- ALLPATHS-LG is slower but produces high-quality assemblies
ALLPATHS-LG is anticipated to be sped up by algorithmic improvements (bearing in mind the trade-off between speed and accuracy)
Thank you.
Questions?