next-generation sequencing – the informatics angle

Next-generation sequencing – the informatics angle Gabor T. Marth Boston College Biology Department AGBT 2008 Marco Island, FL. February 6. 2008



Page 1: Next-generation sequencing  – the informatics angle

Next-generation sequencing – the informatics angle

Gabor T. Marth
Boston College Biology Department

AGBT 2008
Marco Island, FL. February 6, 2008

Page 2: Next-generation sequencing  – the informatics angle

T1. Roche / 454 FLX system

• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size

Page 3: Next-generation sequencing  – the informatics angle

T2. Illumina / Solexa Genome Analyzer

• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation

Page 4: Next-generation sequencing  – the informatics angle

T3. AB / SOLiD system

[Figure: 2-base color-code matrix. Rows: 1st base (A, C, G, T); columns: 2nd base (A, C, G, T); each cell holds one of four colors, 0-3]

• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
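The 2-base encoding idea can be sketched in a few lines. This is a minimal illustration, assuming the commonly described scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes. The function names are hypothetical and are not part of any vendor software.

```python
# Illustrative 2-base (color-space) encoding.
# Assumption: A=0, C=1, G=2, T=3 and color = XOR of adjacent base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def encode_colors(seq):
    """Return one color per adjacent base pair in seq."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)

ref_colors = encode_colors("ATGGCA")
snp_colors = encode_colors("ATGACA")  # one base changed (G -> A)

# Positions where the two color strings differ.
changed = [i for i, (r, s) in enumerate(zip(ref_colors, snp_colors)) if r != s]
```

Under this scheme an isolated base substitution changes two adjacent colors, whereas a single changed color is more likely a measurement error. That asymmetry is what makes the encoding useful for error reduction in SNP calling.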

Page 5: Next-generation sequencing  – the informatics angle

T4. Helicos / Heliscope system

• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing

Page 6: Next-generation sequencing  – the informatics angle

A1. Variation discovery: SNPs and short-INDELs

1. sequence alignment

2. dealing with non-unique mapping

3. looking for allelic differences

Page 7: Next-generation sequencing  – the informatics angle

A2. Structural variation detection

• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

• copy number (for amplifications, deletions) from depth of read coverage
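As a rough illustration of the depth-of-coverage idea, the sketch below averages read depth in fixed windows and divides by the overall mean depth. The window size, the toy depth values, and the function name are assumptions made for this example. Real data would also need correction for mappability and sequence-composition biases.

```python
def copy_number_ratios(depths, window):
    """Mean depth per non-overlapping window, divided by the overall mean depth."""
    overall = sum(depths) / len(depths)
    ratios = []
    for start in range(0, len(depths) - window + 1, window):
        chunk = depths[start:start + window]
        ratios.append((sum(chunk) / window) / overall)
    return ratios

# Toy per-base depths (made-up numbers): normal region, a deletion-like drop,
# an amplification-like rise, then normal again.
depths = [20] * 4 + [10] * 4 + [40] * 4 + [20] * 4
ratios = copy_number_ratios(depths, window=4)
```

Windows with a ratio well below the typical value suggest a deletion, and windows well above it suggest an amplification.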

Page 8: Next-generation sequencing  – the informatics angle

A3. Identification of protein-bound DNA

[Figure: aligned reads shown against the genome sequence]

Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)

Transcription factor binding sites (Robertson et al. Nature Methods 2007)

Page 9: Next-generation sequencing  – the informatics angle

A4. Novel transcript discovery (genes)

• novel genes / exons

[Figure: Inferred exon 1, Inferred exon 2]

• novel transcripts in known genes

[Figure: Known exon 1, Known exon 2]

Page 10: Next-generation sequencing  – the informatics angle

A5. Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006

Page 11: Next-generation sequencing  – the informatics angle

A6. Expression profiling by tag counting

[Figure: aligned reads shown over two genes]

Jones-Rhoades et al. PLoS Genetics 2007
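Tag counting reduces to counting the aligned reads that fall within each gene. The sketch below is a minimal version, assuming reads are represented by their alignment start position and genes by half-open intervals. The gene names and coordinates are hypothetical.

```python
from collections import Counter

def count_tags(read_positions, genes):
    """Count aligned reads whose start falls inside each gene interval.

    genes maps a gene name to a (start, end) pair, end exclusive.
    """
    counts = Counter({name: 0 for name in genes})
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

genes = {"geneA": (100, 200), "geneB": (300, 450)}  # hypothetical annotation
reads = [105, 110, 199, 250, 300, 301, 302, 449, 500]
counts = count_tags(reads, genes)
```

Raw counts would normally be normalized, for example by gene length and total read count, before expression levels are compared.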

Page 12: Next-generation sequencing  – the informatics angle

A7. De novo organismal genome sequencing

[Figure labels: assembled sequence contigs; short reads; longer reads; read pairs]

Lander et al. Nature 2001

Page 13: Next-generation sequencing  – the informatics angle

C1. Read length

[Figure: read length (bp), axis 0-300; plotted values: ~250 (variable), 25-40 (fixed), 25-35 (fixed), 20-35 (variable)]

Page 14: Next-generation sequencing  – the informatics angle

When does read length matter?

• short reads often sufficient where the entire read length can be used for mapping:

SNPs, short-INDELs, SVs

CHIP-SEQ

short RNA discovery

counting (mRNA, miRNA)

• longer reads are needed where one must use parts of reads for mapping:

de novo sequencing

novel transcript discovery

[Figure: example read sequences (aacttagacttacagacttacatacgta; accgattactatacta) drawn against Known exon 1 and Known exon 2]

Page 15: Next-generation sequencing  – the informatics angle

C2. Read error rate

• error rate dictates how many errors the aligner should tolerate

• error rate typically 0.4 - 1%

• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned

[Figure: fraction of genome (y axis, 0.00-0.40) vs. number of mismatches allowed (x axis, 0-2)]

• in applications where the specific alleles are also essential, the error rate is even more important
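The link between error rate and the number of mismatches an aligner must tolerate can be made concrete with a simple binomial model. The sketch below assumes independent, uniform per-base errors, which is a simplification. The read length of 35 and error rate of 1% are taken from the ranges quoted on these slides, and the function name is hypothetical.

```python
from math import comb

def prob_at_most_k_errors(read_length, error_rate, k):
    """P(read has <= k errors), assuming independent per-base errors."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )

# Fraction of 35 bp reads that stay alignable at 1% error
# when 0, 1 or 2 mismatches are allowed.
for k in range(3):
    print(k, round(prob_at_most_k_errors(35, 0.01, k), 4))
```

Allowing more mismatches retains more reads, but a larger share of them will then match more than one genome location.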

Page 16: Next-generation sequencing  – the informatics angle

C3. Error rate grows with each cycle

[Figure: error rate (y axis, 0.00%-10.00%) vs. position on read (x axis, 0-40)]

• this phenomenon limits useful read length
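One way to see how a rising per-cycle error rate caps the useful read length is to find the longest read prefix whose expected number of errors stays under a budget. The error profile and the budget of 0.5 expected errors below are made-up values for illustration, and the function name is hypothetical.

```python
def useful_read_length(cycle_error_rates, max_expected_errors):
    """Longest read prefix whose summed per-cycle error rates
    (the expected number of errors) stays at or below the budget."""
    total = 0.0
    for cycle, rate in enumerate(cycle_error_rates, start=1):
        total += rate
        if total > max_expected_errors:
            return cycle - 1
    return len(cycle_error_rates)

# Made-up profile: error rate rising linearly from 0.5% to 8% over 36 cycles.
profile = [0.005 + (0.08 - 0.005) * i / 35 for i in range(36)]
length = useful_read_length(profile, max_expected_errors=0.5)
```

With this toy profile, the budget is used up well before the last cycle, so the later cycles add little usable sequence.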

Page 17: Next-generation sequencing  – the informatics angle

C4. Substitutions vs. INDEL errors

• SNP discovery may require higher coverage for allele confirmation
• INDELs can be discovered with very high confidence!

• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult

Page 18: Next-generation sequencing  – the informatics angle

C5. Quality values are important for allele calling

• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles

• inaccurate or poorly calibrated base quality values hinder allele calling

Q-values should be accurate … and high!
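The standard PHRED relationship between a quality value Q and an error probability p is Q = -10 log10(p). The sketch below applies it. The helper that multiplies error probabilities assumes independent errors across reads, which is a simplification, and all function names are hypothetical.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by a PHRED quality value: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """PHRED quality value for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def prob_all_errors(qualities):
    """Probability that every base supporting an alternate allele is a
    sequencing error, assuming the errors are independent."""
    return math.prod(phred_to_error_prob(q) for q in qualities)
```

For example, two Q20 bases that both show the alternate allele are both wrong with probability of about 1 in 10,000 under this model. This is why high and accurate Q-values help separate true alleles from errors.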

Page 19: Next-generation sequencing  – the informatics angle

Quality values should be well-calibrated

the assigned base quality value should be calibrated to match the actual (observed) base quality in every sequencing cycle
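A calibration check along these lines can be sketched by comparing assigned quality values with the error rate actually observed, binned by sequencing cycle. The input format (cycle, assigned Q, error flag), the add-one smoothing, and the function name are assumptions for this example. The error flags would typically come from reads aligned to a trusted reference.

```python
import math
from collections import defaultdict

def empirical_quality(observations):
    """observations: iterable of (cycle, assigned_q, is_error) tuples.

    Returns {(cycle, assigned_q): empirical PHRED quality}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cycle, assigned_q, is_error in observations:
        key = (cycle, assigned_q)
        totals[key] += 1
        errors[key] += int(is_error)
    # Add-one smoothing keeps bins with zero observed errors finite.
    return {
        key: -10 * math.log10((errors[key] + 1) / (totals[key] + 2))
        for key in totals
    }

# Toy data: 1000 bases assigned Q20 in cycle 1, of which 10 are errors.
observations = [(1, 20, True)] * 10 + [(1, 20, False)] * 990
table = empirical_quality(observations)
```

Well-calibrated values give an empirical quality close to the assigned one in every bin. Large gaps in particular cycles point to a calibration problem.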

Page 20: Next-generation sequencing  – the informatics angle

C6. Representational biases / library complexity

fragmentation biases

amplification biases

PCR

sequencing biases

sequencing

low/no representation

high representation

Page 21: Next-generation sequencing  – the informatics angle

Dispersal of read coverage

• this affects variation discovery (deeper starting read coverage is needed)
• it has a major impact on counting applications
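One simple way to quantify the dispersal of read coverage is the variance-to-mean ratio of per-position depth. It is about 1 when reads are sampled uniformly at random (Poisson) and larger when coverage is uneven. The choice of this statistic and the function name are assumptions for illustration.

```python
from statistics import mean, pvariance

def dispersion_index(depths):
    """Variance-to-mean ratio of coverage depths (about 1 for Poisson sampling)."""
    return pvariance(depths) / mean(depths)

even = dispersion_index([10, 10, 10, 10])   # perfectly even toy coverage
uneven = dispersion_index([0, 20, 0, 20])   # same mean depth, strongly uneven
```

Both toy examples have the same mean depth, but the uneven one leaves half of the positions with no reads at all. This is why over-dispersed libraries need deeper starting coverage.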

Page 22: Next-generation sequencing  – the informatics angle

Amplification errors

many reads from clonal copies of a single fragment

• early PCR errors in “clonal” read copies lead to false positive allele calls

early amplification error gets propagated onto every clonal copy
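A common guard against clonal copies is to count distinct fragments rather than reads when weighing the evidence for an allele. The sketch below treats reads with the same start position and strand as copies of one fragment. That grouping rule, the tuple format, and the function name are assumptions for this example.

```python
def distinct_fragment_support(reads):
    """reads: iterable of (start, strand, allele) tuples covering one site.

    Reads sharing start and strand are treated as clonal copies of one
    fragment and counted once per allele.
    """
    support = {}
    for start, strand, allele in reads:
        support.setdefault(allele, set()).add((start, strand))
    return {allele: len(frags) for allele, frags in support.items()}

# Five clonal reads carrying "T" versus three independent reads carrying "C".
reads = [(100, "+", "T")] * 5 + [(90, "+", "C"), (95, "-", "C"), (102, "+", "C")]
support = distinct_fragment_support(reads)
```

Here the five "T" reads collapse to a single supporting fragment, so an early PCR error copied five times no longer outweighs three independent observations.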

Page 23: Next-generation sequencing  – the informatics angle

C7. Paired-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends)
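The rescue of a non-uniquely mapping end can be sketched as a filter on candidate positions, keeping only those at a plausible distance from the uniquely mapped mate. This simplified version ignores read orientation and read length. The function name and coordinates are hypothetical, and the 100 - 600 bp range is the fragment-amplification range quoted above.

```python
def rescue_mate(anchor_pos, candidate_positions, min_frag, max_frag):
    """Pick the candidate position of a non-uniquely mapping end that is
    consistent with the fragment length range, given the unique mate position.

    Returns the position if exactly one candidate fits, otherwise None.
    """
    consistent = [
        p for p in candidate_positions
        if min_frag <= abs(p - anchor_pos) <= max_frag
    ]
    return consistent[0] if len(consistent) == 1 else None

# Unique end at 1000; the other end maps to three places in the genome.
rescued = rescue_mate(1000, [1300, 50000, 900000], min_frag=100, max_frag=600)
```

If several candidates fit the constraint, the pair stays ambiguous and the function returns None rather than guessing.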

Page 24: Next-generation sequencing  – the informatics angle

Paired-end reads for SV discovery

• longer fragments increase the chance of spanning SV breakpoints and/or entire events

• SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)

• longer fragments tend to have wider fragment length distributions
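The effect of the fragment length distribution width can be illustrated with a back-of-the-envelope calculation. A deletion makes a spanning read pair map farther apart than expected by roughly the deletion size, so the signal can be expressed in standard deviations of the fragment length. The 3 kb mean fragment length is an assumption borrowed from the "sweet spot" quoted earlier. The slide does not state which fragment length its 2 kb example used, and the function name is hypothetical.

```python
def deletion_z_score(deletion_size, mean_fragment, relative_std):
    """Number of standard deviations by which a deletion of the given size
    shifts the apparent (mapped) fragment length."""
    return deletion_size / (mean_fragment * relative_std)

z_tight = deletion_z_score(2000, 3000, 0.10)  # 10% std -> 300 bp
z_wide = deletion_z_score(2000, 3000, 0.30)   # 30% std -> 900 bp
```

Under these assumptions the 2 kb deletion stands out by more than six standard deviations with the tight distribution but by only a little over two with the wide one. In the second case a single read pair would be hard to tell apart from normal fragment length variation.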

Page 25: Next-generation sequencing  – the informatics angle

C8. Technologies / properties / applications

Technology                    Roche/454               Illumina/Solexa   AB/SOLiD

Read properties
Read length                   250bp                   20-40bp           25-35bp
Error rate                    <0.5%                   <1.0%             <0.5%
Dominant error type           INDEL                   SUB               SUB
Paired-end reads available    yes                     yes               yes
Paired-end separation         <10kb (3kb optimal)     100-600bp         500bp-10kb (3kb optimal)

Applications
SNP discovery                 ○                       ●                 ●
short-INDEL discovery                                 ●                 ○
SV discovery                  ○                       ○                 ●
CHIP-SEQ                      ○                       ●                 ●
small RNA/gene discovery      ○                       ●                 ●
mRNA transcript discovery     ●                       ○                 ○
Expression profiling          ○                       ●                 ●
De novo sequencing            ●                       ?                 ?

Page 26: Next-generation sequencing  – the informatics angle

Thanks

http://bioinformatics.bc.edu/marthlab

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

MOSAIK talk Thursday, 7:40PM

Michael Egholm

David Bentley

Francisco de la Vega

Kristen Stoops

Ed Thayer

Clive Brown

Elaine Mardis