A statistical base-caller for the Illumina Genome Analyzer

Wally Gilks, University of Leeds

DNA sequencing technologies

• Sanger sequencing

• “Next-Generation” sequencing

  • Roche 454

  • ABI SOLiD

  • Illumina (Solexa)

• “Next-Next (3rd) Generation” sequencing

  • VisiGen

  • Helicos

  • Oxford Nanopore

Illumina Genome Analyzer

• Description of technology

• Technological problems

• Our statistical model for base-calling

• Comparing our accuracy with Illumina’s

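Illumina Genome Analyzer: flow cell

[Figure: a flow cell and its lanes. Source: sbss.cap.ed.ac.uk/solexa]

Layout of a flow cell

[Figure: eight lanes, numbered 1 to 8, one of them a control lane; each lane is divided into tiles (330 per lane).]

One tile of a flow cell

[Figure: one tile, containing sequence clusters (30,000 per tile). Source: Chi, K.R., Nature Methods 5, 11-14 (2008)]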

DNA sample preparation (over-simplified)

1) Extract DNA

2) Randomly shatter

3) Attach adapter sequence

4) Attach to flow-cell surface

5) PCR-amplify into clusters

Sequence clusters on the flow cell

[Figure: Cluster 1, Cluster 2 and Cluster 3 on the flow-cell surface. Each sequence fragment is attached to the surface by an adapter sequence.]

Sequencing cycles

[Figures: one cluster followed through sequencing cycles 1 to 4, with a light detector recording a frequency spectrum.]

• Cycle 1: add free adapters and dye-labelled bases; add block; fire laser and record intensities; remove block.

• Cycle 2: add dye-labelled bases; fire laser and record intensities.

• Cycle 3: fire laser and record intensities.

• Cycle 4: fire laser and record intensities.

The “sticky-T” problem

[Figure: a cluster emitting a mixed signal, caused by non-specific accumulation of T dye.]

Sticky-T: solution

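• Regress the raw intensity $x^{\mathrm{raw}}_{kic}$ for cluster c against cycle number i, for each dye k.

• Normalise:

$$x_{kic} = \frac{x^{\mathrm{raw}}_{kic} - \hat{\alpha}_k - \hat{\beta}_k\, i}{\hat{\sigma}_k}$$

Here $\hat{\alpha}_k$ and $\hat{\beta}_k$ are written for the fitted intercept and slope, and $\hat{\sigma}_k$ for the fitted scale, of dye k.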
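The regress-then-normalise step can be sketched for a single dye in a single cluster as follows. This is a minimal illustration, assuming an ordinary least-squares line against cycle number and division by the residual standard deviation; the function name `detrend_and_scale`, the choice of scale estimate and the example intensities are assumptions, not details taken from the slides.

```python
from math import sqrt

def detrend_and_scale(raw):
    """Fit raw[i] ~ a + b*i by least squares over cycles i = 1..n,
    then return (intercept, slope, normalised residuals)."""
    n = len(raw)
    cycles = range(1, n + 1)
    mean_i = sum(cycles) / n
    mean_x = sum(raw) / n
    sxx = sum((i - mean_i) ** 2 for i in cycles)
    sxy = sum((i - mean_i) * (x - mean_x) for i, x in zip(cycles, raw))
    slope = sxy / sxx
    intercept = mean_x - slope * mean_i
    resid = [x - intercept - slope * i for i, x in zip(cycles, raw)]
    # Residual standard deviation with n - 2 degrees of freedom
    # (an assumed choice of scale estimate).
    scale = sqrt(sum(r * r for r in resid) / (n - 2))
    return intercept, slope, [r / scale for r in resid]

# Made-up T-dye intensities that drift upwards over five cycles.
intercept, slope, normalised = detrend_and_scale([10.0, 12.5, 13.0, 16.5, 18.0])
print(intercept, slope, normalised)
```

The upward drift ends up in the fitted slope, so the normalised values no longer grow with cycle number.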

The “cross-talk” problem

• Ideally, base “A” would produce a strong and distinct intensity on the A dye.

• Similarly for the other bases.

• But in reality, base “A” can produce a signal on the “C” dye, and so on.

• This is called dye “cross-talk”.

[Figure: light detector and frequency spectrum.]

What is the true base at cycle 1 in cluster 1?

What is the true base at cycle 18 in cluster 1?

What is the true base at cycle 36 in cluster 1?

[Figures: observed intensities for cluster 1 at each of these cycles.]

Cross-talk: solution

Model the normalised intensity at cycle i in cluster c:

$$x_{ic} = (x_{Aic}, x_{Cic}, x_{Gic}, x_{Tic})'$$

as a 4-dimensional multivariate normal distribution

$$x_{ic} \mid b \sim N(\mu_{ib}, V_{ib})$$

whose mean vector $\mu$ and variance matrix $V$ depend on cycle number i and true base b.

The “phase” problem

[Figures: a cluster at cycle 4 in the ideal case, and two examples of a misphased cluster at cycle 4.]

Phase problem: solution

• Assume a probability, written here as $\phi_c$, of a base-incorporation error at a given cycle i, constant over all cycles but depending on cluster c.

• This implies a probability of

$$(1 - \phi_c)^i$$

of being correctly phased at cycle i.

The “drop-off” problem

[Figures: a cluster at cycle 4 in the ideal case, and the same cluster with molecules dropped off.]

Sequencing reactions terminated, perhaps due to failure of block release.

Drop-off problem: solution

• Assume a probability, written here as $\delta$, of dropping off at a given cycle i, constant over all cycles and clusters.

• This implies a probability of

$$(1 - \delta)^i$$

of not having dropped off before cycle i.
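The geometric decay implied by the phase and drop-off assumptions can be tabulated with a few lines of Python. This is only a sketch: the function names and the per-cycle probabilities (0.01 and 0.005) are hypothetical values chosen for illustration, and multiplying the two survival probabilities assumes the two events are independent, which the slides do not state.

```python
def prob_in_phase(phase_error, cycle):
    """Probability of still being correctly phased at a given cycle,
    for a constant per-cycle base-incorporation error probability."""
    return (1.0 - phase_error) ** cycle

def prob_not_dropped(drop_off, cycle):
    """Probability of not having dropped off before a given cycle,
    for a constant per-cycle drop-off probability."""
    return (1.0 - drop_off) ** cycle

def prob_clean_signal(phase_error, drop_off, cycle):
    """Probability of being both in phase and not dropped off,
    ASSUMING the two events are independent."""
    return prob_in_phase(phase_error, cycle) * prob_not_dropped(drop_off, cycle)

# Hypothetical error rates; cycles 1, 18 and 36 as in the cross-talk slides.
for cycle in (1, 18, 36):
    print(cycle, round(prob_clean_signal(0.01, 0.005, cycle), 3))
```

Even small per-cycle probabilities compound, which is why late cycles carry a weaker and less distinct signal than early ones.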

Putting it all together

• We do not know when a molecule becomes misphased or drops off. We integrate over these events.

• Many identical molecules in each cluster: assume their independence, motivating normal theory.

The resulting model of the mean intensity vector

$$x_{ic} = (x_{Aic}, x_{Cic}, x_{Gic}, x_{Tic})'$$

at cycle i in cluster c, when the true base is b, is:

$$x_{ic} \mid b \sim N(\mu_{ib}, V_{ib})$$

where

[Equations: expressions for $\mu_{ib}$ and $V_{ib}$ in terms of fixed parameters, a cluster-specific parameter and the known base frequencies.]

[Figure: $\mu_{iA}$, $\mu_{iC}$, $\mu_{iG}$, $\mu_{iT}$ and $\mathrm{trace}(V_{ib})$.]

Base-calling

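Posterior probability that cluster c at cycle i has base b is:

$$p(b \mid x_{ic}) = \frac{\pi_b \, p(x_{ic} \mid b, \hat{\theta}, \hat{\phi}_c)}{\sum_{b'} \pi_{b'} \, p(x_{ic} \mid b', \hat{\theta}, \hat{\phi}_c)}$$

where

$$x_{ic} \mid b \sim N(\mu_{ib}, V_{ib})$$

as described above. Here $\pi_b$ is written for the known base frequency, and $\hat{\theta}$ and $\hat{\phi}_c$ for the estimated fixed and cluster-specific parameters.

Call b to maximise this posterior.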
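The posterior calculation can be sketched in Python with NumPy for a single intensity vector. The mean vectors, variance matrices and base frequencies below are illustrative assumptions, not fitted values from the talk, and `call_base` and `log_mvn_density` are hypothetical names. In the full model the means and variances would also depend on the cycle and on the cluster-specific parameter.

```python
import numpy as np

BASES = "ACGT"  # channel order matches (x_A, x_C, x_G, x_T)

def log_mvn_density(x, mean, cov):
    """Log density of a multivariate normal N(mean, cov) evaluated at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + quad)

def call_base(x, means, covs, base_freqs):
    """Return (called base, posterior probability of each base)."""
    log_post = np.array([
        np.log(base_freqs[b]) + log_mvn_density(x, means[b], covs[b])
        for b in BASES
    ])
    log_post -= log_post.max()  # stabilise before exponentiating
    post = np.exp(log_post)
    post /= post.sum()
    return BASES[int(post.argmax())], dict(zip(BASES, post))

# Illustrative parameters only: each base is brightest on its own dye,
# with some cross-talk onto another dye built into the means.
means = {
    "A": np.array([1.0, 0.6, 0.0, 0.0]),
    "C": np.array([0.3, 1.0, 0.0, 0.0]),
    "G": np.array([0.0, 0.0, 1.0, 0.5]),
    "T": np.array([0.0, 0.0, 0.2, 1.0]),
}
covs = {b: 0.05 * np.eye(4) for b in BASES}
base_freqs = {b: 0.25 for b in BASES}

x = np.array([0.9, 0.7, 0.05, 0.0])
base, posterior = call_base(x, means, covs, base_freqs)
print(base, round(posterior[base], 3))
```

Working in log space and subtracting the maximum before exponentiating keeps the normalisation stable when one base is overwhelmingly more likely than the others.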

BLASTing reads

• Study should be designed with many replicates

• BLAST is used to group similar reads

• A consensus sequence is called for each group
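Consensus calling for one group of reads can be sketched as a per-position majority vote. This is an assumption for illustration: the slides do not say how the consensus is computed. The sketch also assumes that the reads in a group have already been aligned to equal length. The function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(reads):
    """Per-position majority vote over equal-length, already-aligned reads."""
    if not reads or len({len(r) for r in reads}) != 1:
        raise ValueError("expected a non-empty group of equal-length reads")
    # zip(*reads) yields one column of bases per position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )

# Three replicate reads, one with a single miscalled base.
print(consensus(["ACTGAA", "ACTGAA", "ACTCAA"]))
```

With many replicates, an isolated base-calling error in one read is outvoted by the other reads in its group.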

Conclusion

• Currently, our method performs about as well as the Illumina pipeline.

• Our method produces a posterior probability of correctness of each base call.

• Further work addressing heavy tails in the residuals should improve results.

• Others are trying to estimate the phase at each cycle for each cluster.

Thanks to:

• Irina Abnizova

• Tom Skelly

• Nava Whiteford

• Klaus Maisinger

Next-Gen Sequencing Group, Sanger Inst.

Illumina

Oxford Nanopore Technologies