
09.05.2012 CS-681 PRESENTATION 1/25

ALLPATHS-LG

A new standard for assembling a billion-piece genome puzzle


CS 681

presented by

Ömer Köksal

High-quality draft assemblies of mammalian genomes from massively parallel sequence data

ALLPATHS-LG

by Sante Gnerre et al. (20 authors), Jan 25th, 2011


Agenda

Introduction

Results

Model for Input Data

Sequencing Data

ALLPATHS-LG Assembly Method

Uncertainty in Assemblies

Human and Mouse Assemblies

Human Genome

Mouse Genome

Segmental Duplications

Understanding Gaps

Discussion


Introduction

High-quality assembly of a genome sequence is critical

Particularly challenging for large, repeat rich genomes such as those of mammals

Using traditional capillary-based sequencing (reads of >700 bases), such assemblies were produced for multiple mammals at a cost of tens of millions of dollars each.

New massively parallel technologies were expected to lower the cost dramatically, but had not yet delivered because of:

• short reads (~100 bases in length)

• lower accuracy

• greater difficulty of assembly


Introduction (cont’d)

ALLPATHS-LG

De novo assembly of large (and small) genomes

It should be possible to generate high-quality draft assemblies of large genomes

At ~1000-fold lower cost than a decade ago

Previous versions:

• ALLPATHS 1.0 (2008)

• ALLPATHS 2.0 (2009)


RESULTS

Model for Input Data

Sequencing Data

ALLPATHS-LG Assembly Method

Uncertainty in Assemblies

Human and Mouse Assemblies

Human Genome

Mouse Genome

Segmental Duplication

Understanding Gaps

Results


De novo genome assembly depends on

• computational methods

• nature and quantity of sequence data used

The fairly standard model used for capillary-based sequencing was modified

The new model sets a target of 100-fold sequence coverage to compensate for shorter reads and nonuniform coverage

Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than for capillary sequencing

Illumina sequencing was used (Table 1)
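The cost argument can be checked with simple arithmetic. The sketch below uses the 100-fold coverage target and the ~10,000-fold per-base cost ratio stated above; the 8-fold coverage for a capillary project is an illustrative assumption, not a figure from these slides, and `relative_cost` is a hypothetical helper name.

```python
def relative_cost(new_coverage, old_coverage, per_base_cost_ratio):
    """Return how many times cheaper the new project is than the old one.

    per_base_cost_ratio is the old per-base cost divided by the new per-base cost.
    """
    # Higher coverage means proportionally more bases have to be sequenced.
    extra_bases = new_coverage / old_coverage
    return per_base_cost_ratio / extra_bases


# 100-fold coverage and ~10,000-fold cheaper bases come from the slides;
# the 8-fold capillary coverage is an assumption made for illustration.
savings = relative_cost(new_coverage=100, old_coverage=8, per_base_cost_ratio=10_000)
print(f"~{savings:.0f}-fold cheaper overall")
```

Under these assumptions, the extra coverage costs about one order of magnitude, which leaves a saving on the order of the ~1000-fold figure quoted in the introduction.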

Results - Model for Input Data


Table 1 – Provisional sequencing model for de novo assembly

Results - Model for Input Data (cont’d)


Sequence data were generated according to the model above for:

• Human Genome

• Mouse Genome

Human Genome:

• GM12878 (Coriell Institute) of 1000 Genomes Pilot Project

• NCBI Short Read: Human_NA_12878_Genome_on_illumina

Mouse Genome:

• C57BL/6J female DNA

• NCBI Short Read: Mouse_B6_Genome_on_illumina

Results – Sequencing Data


Previous versions were extensively improved

Can also assemble small genomes

Freely available at:

http://www.broadinstitute.org/science/programs/genome-biology/crd

Results - ALLPATHS-LG Assembly Method


Some key innovations in ALLPATHS-LG:

- Handling repetitive sequences

-more resilient to repeats

- Error correction

-for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer

-the incidence of incorrect error correction was reduced

- Use of jumping data

-it can work even with such data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions

- Efficient memory usage

-can assemble human genomes on commercial servers (48 processors & 512 GB RAM) in a few weeks

-(3 weeks for mouse & 3.5 weeks for human)
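The idea of examining the stack of reads that share a k-mer can be illustrated with a toy example. The sketch below is not the ALLPATHS-LG implementation: it aligns reads on each shared k-mer and applies a plain per-column majority vote. The function name `correct_reads` and the thresholds `min_depth` and `min_fraction` are hypothetical choices made only for this illustration.

```python
from collections import Counter, defaultdict


def correct_reads(reads, k=24, min_depth=3, min_fraction=0.8):
    """Toy k-mer-stack error correction using a majority vote per column."""
    # Index every k-mer occurrence: k-mer -> list of (read index, offset in read).
    stacks = defaultdict(list)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            stacks[read[pos:pos + k]].append((i, pos))

    # votes[(read index, position)] counts consensus bases proposed by the stacks.
    votes = defaultdict(Counter)
    for members in stacks.values():
        if len(members) < min_depth:
            continue  # too few reads to trust this stack
        # Align reads on the shared k-mer; a column is an offset from its start.
        columns = defaultdict(list)
        for i, pos in members:
            for j, base in enumerate(reads[i]):
                columns[j - pos].append((i, j, base))
        for entries in columns.values():
            if len(entries) < min_depth:
                continue
            base, count = Counter(b for _, _, b in entries).most_common(1)[0]
            if count / len(entries) >= min_fraction:
                # Propose the consensus base for every read that disagrees.
                for i, j, b in entries:
                    if b != base:
                        votes[(i, j)][base] += 1

    corrected = [list(read) for read in reads]
    for (i, j), counter in votes.items():
        corrected[i][j] = counter.most_common(1)[0][0]
    return ["".join(read) for read in corrected]


# Five error-free copies of a short sequence plus one read with a single error.
example = ["ACGTTGCAAG"] * 5 + ["ACGTTGCTAG"]
print(correct_reads(example, k=4))
```

A small `k` is used in the example only so that the reads fit on one line; the slides state that the real algorithm works with 24-mers.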

Results - ALLPATHS-LG Assembly Method (cont’d)


Results – Uncertainty in Assemblies

The goal of assembly is to reconstruct the genome as accurately as possible

However, in some locations the data may be compatible with more than one solution

Rather than making an arbitrary choice (and introducing errors), the algorithm retains the alternatives

The ALLPATHS-LG algorithm generates an assembly graph whose edges are sequences and whose branches represent alternative choices

ATC{A,T}GGTTTTTTT{T,TT}ACAC

Variant Call Format (.VCF file)
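To make the brace notation concrete, the sketch below enumerates every sequence that a string such as `ATC{A,T}GGTTTTTTT{T,TT}ACAC` stands for. It assumes that each `{...}` group is a flat, comma-separated list of alternatives with no nesting; the function name `expand_alternatives` is hypothetical and is not part of ALLPATHS-LG.

```python
import itertools
import re


def expand_alternatives(assembly):
    """List every concrete sequence encoded by a string like 'ATC{A,T}GG'."""
    # Splitting on a capturing group keeps the contents of each brace group.
    parts = re.split(r"\{([^{}]*)\}", assembly)
    # Even indices hold literal sequence, odd indices hold comma-separated choices.
    choices = [
        [part] if i % 2 == 0 else [c.strip() for c in part.split(",")]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


for sequence in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
    print(sequence)
```

Two independent two-way choices give four candidate sequences, which shows how quickly the number of alternatives grows as more branches are retained.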


NOTE:

The current version of ALLPATHS-LG captures only single-base and simple-sequence indel uncertainties

Better ways to capture alternatives are needed (many are still lost in the current version, giving rise to errors)

It would be desirable to assign probabilities to each alternative

Results – Uncertainty in Assemblies (cont’d)


Results – Human & Mouse Assemblies

Resulting genome assemblies provide good coverage of the human and mouse genomes

ALLPATHS-LG assemblies were compared with previously published assemblies

- Capillary-based sequencing

- SOAP (massively parallel sequencing)


Results – Human & Mouse Assemblies (cont’d)


N50 contig length of 24 kb

N50 scaffold length of 11.5 Mb

Contiguity (contig length) is > 4-fold higher than with the SOAP algorithm

Connectivity (scaffold length) is > 25-fold higher than with the SOAP algorithm

Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%)

Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%)

Results are similar to capillary based assemblies
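Since contiguity is reported as an N50 value, a short sketch of how N50 is computed may help. The contig lengths in the example are made up for illustration and have nothing to do with the human or mouse assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the largest length L such that pieces of length >= L together
    contain at least half of all assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths in kb, for illustration only.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 is weighted by length, a few long contigs raise it far more than many short ones, which is why it is used as a summary of contiguity and connectivity.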

Results – Human Genome


Local assembly error: 3.5 %

- Capillary: 4.1 %

- SOAP: 6.2 %

Long range accuracy: 99.1%

- Capillary: 99.7 %

- SOAP: 99.5 %

Results – Human Genome (cont’d)


Results are broadly similar for the mouse genome

N50 contig length of 16 kb

N50 scaffold length of 7.2 Mb

Connectivity is > 20-fold higher than with the SOAP algorithm

Results approach those of the capillary-based assembly (contig: 25 kb, scaffold: 16.9 Mb)

Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%)

Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%)

Results are considerably better than SOAP

Results – Mouse Genome


Local assembly error: 3.0 %

- Capillary: 2.7 %

- SOAP: 14.2 %

Long range accuracy: 99.0 %

- Capillary: 99.1 %

- SOAP: 98.8 %

Results – Mouse Genome (cont’d)


Segmental duplications pose a challenge

ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications

- Capillary: 60%

- SOAP: 12%

NOTE:

Clearly additional work is needed here

Results – Segmental Duplications


Roughly three quarters of the gaps are captured

The remaining gaps are not spanned

The majority of the sequence within the gaps consists of repetitive elements (61.9% of gaps):

- For mouse: LINE elements are the major contributors to gaps

- For human: LINE & SINE elements

Results – Understanding Gaps


High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome

They cost tens of millions of dollars each to generate with capillary-based sequencing

In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold.

Discussion


ALLPATHS-LG

- Good long-range connectivity,

- Good accuracy,

- Good coverage

compared with capillary-based sequencing, and

better than SOAP

ALLPATHS-LG

- Quality of the assemblies is considerably better than SOAP's:

scaffolds are > 25 times longer

Discussion (cont’d)


ALLPATHS-LG is expected to yield even better results in improved versions.

ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT}

Computational hardware requirements:

- SOAP is faster (takes 3 days) but its accuracy is low

- ALLPATHS-LG is slower but produces high-quality assemblies

ALLPATHS-LG is expected to be sped up through algorithmic improvements (keeping in mind the trade-off between speed and accuracy)

Discussion (cont’d)


Thank you.

Questions ?
