next-generation sequencing course, part 1: technologies


Upload: jan-aerts

Post on 14-Jun-2015


Page 1: Next-generation sequencing course, part 1: technologies

[I0D51A] Bioinformatics: High-Throughput Analysis

Next-generation sequencing. Part 1: Technologies

Prof Jan Aerts

Faculty of Engineering - ESAT/SCD

[email protected]

TA: Alejandro Sifrim ([email protected])


Page 2: Next-generation sequencing course, part 1: technologies

Announcements

May 27th (9am-noon): evaluation

open book


Page 3: Next-generation sequencing course, part 1: technologies

Note to self...

Upload s_1_sequence.txt and s_2_sequence.txt to Galaxy first...


Page 4: Next-generation sequencing course, part 1: technologies

Overview

• linux refresher (6/5)

• next-generation sequencing technologies and applications (6/5)

• sequence mapping (13/5)

• variant calling - SNPs (20/5)

• variant calling - structural variation (20/5)


Page 5: Next-generation sequencing course, part 1: technologies

Linux Refresher...


Page 6: Next-generation sequencing course, part 1: technologies

Next-generation sequencing technologies


Page 7: Next-generation sequencing course, part 1: technologies

General principle


Page 8: Next-generation sequencing course, part 1: technologies

Big data...


Page 9: Next-generation sequencing course, part 1: technologies

First vs second generation sequencing

Shendure & Ji, 2008

Sanger sequencing (1st gen) 2nd/next gen sequencing


Page 10: Next-generation sequencing course, part 1: technologies

Paired-end sequencing

Korbel et al, 2007

Page 11: Next-generation sequencing course, part 1: technologies

General approaches

• 2nd generation: clonally amplified single molecules

  • Roche 454: pyrosequencing

  • Illumina Genome Analyzer -> HiSeq: reversible terminator technology

  • ABI SOLiD: ligation-based extension

• Next-next-generation/3rd generation: true single molecule

  • Helicos: HeliScope

  • Pacific Biosciences: SMRT


Page 12: Next-generation sequencing course, part 1: technologies


Mardis, 2011

Page 13: Next-generation sequencing course, part 1: technologies

Steps

genome enrichment

template preparation

sequencing and imaging

data analysis


Page 14: Next-generation sequencing course, part 1: technologies

A. Genome enrichment


Page 15: Next-generation sequencing course, part 1: technologies

Sequencing costs


Page 16: Next-generation sequencing course, part 1: technologies

What?

Only sequence relevant parts of the genome instead of whole genome, e.g.:

• specific Mb-scale regions known to be involved in particular disease (e.g. based on GWAS)

• specific candidate genes belonging to disease pathway

• exome (= all exons)

=> how to isolate these from non-target sequence? “pulldown”


Page 17: Next-generation sequencing course, part 1: technologies

Pulldown: on-array

Turner et al, 2009


Page 18: Next-generation sequencing course, part 1: technologies

Pulldown: in-solution

Turner et al, 2009


Page 19: Next-generation sequencing course, part 1: technologies

Performance metrics

• fold-enrichment: ratio of abundance of target sequences post-enrichment vs pre-enrichment

• capture specificity: fraction of sequence reads that map to target

• uniformity: relative abundance of individual targets after enrichment

• completeness: fraction of target bases detectably captured
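The metrics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up for this example, and the inputs (fraction of reads on target before and after enrichment, read counts, per-base depth over the target) are assumed to come from a mapping step that is not shown here.

```python
def fold_enrichment(target_fraction_post, target_fraction_pre):
    """Ratio of target abundance after enrichment vs before enrichment."""
    if target_fraction_pre <= 0:
        raise ValueError("pre-enrichment fraction must be positive")
    return target_fraction_post / target_fraction_pre


def capture_specificity(reads_on_target, reads_mapped):
    """Fraction of mapped sequence reads that map to the target."""
    if reads_mapped <= 0:
        raise ValueError("need at least one mapped read")
    return reads_on_target / reads_mapped


def completeness(per_base_depth, min_depth=1):
    """Fraction of target bases covered by at least min_depth reads."""
    if not per_base_depth:
        raise ValueError("empty target")
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return covered / len(per_base_depth)


# Illustrative (made-up) numbers: a target that makes up 1% of the genome
# and receives 60% of the reads after pulldown.
enrichment = fold_enrichment(0.60, 0.01)
specificity = capture_specificity(600_000, 1_000_000)
covered = completeness([0, 3, 5, 0], min_depth=1)
```

Uniformity is left out on purpose: it needs a choice of summary statistic (for example the spread of per-target depth), which the slide does not specify.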


Page 20: Next-generation sequencing course, part 1: technologies

B. Template preparation


Page 21: Next-generation sequencing course, part 1: technologies

Problem: most imaging systems not designed to detect single fluorescent event => need amplified templates

Aim: to produce a representative, non-biased source of nucleic acid material from the genome under investigation => population of identical templates

Steps:

1. shear DNA

2. amplify templates

Options: emulsion PCR (emPCR) or solid phase amplification


Page 22: Next-generation sequencing course, part 1: technologies

Amplification by emulsion PCR

emulsion = mixture of two or more immiscible (unblendable) liquids; e.g. mayonnaise, vinaigrette

emPCR: thousands of microreactors/micro-eppendorfs

one bead + one DNA molecule per microreactor => PCR to 1000s of copies


Page 23: Next-generation sequencing course, part 1: technologies

Metzker et al, 2010

Williams et al, 2006


Page 24: Next-generation sequencing course, part 1: technologies

Solid-phase amplification

Metzker et al, 2010

http://bit.ly/6JYIUz

http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1


Page 25: Next-generation sequencing course, part 1: technologies

C. Sequencing and imaging


Page 26: Next-generation sequencing course, part 1: technologies

Sequencing and imaging

Technologies:

1. cyclic reversible termination

2. sequencing by ligation

3. pyrosequencing

4. real-time sequencing


Page 27: Next-generation sequencing course, part 1: technologies

Cyclic reversible termination

DNA synthesis is terminated after adding single nucleotide

start/stop/start/stop/start/stop/...

Illumina: 4-colour

sequencing steps

Metzker et al, 2010

sequencing result


Page 28: Next-generation sequencing course, part 1: technologies

Helicos: 1-colour

Metzker et al, 2010

sequencing steps

Metzker et al, 2010

sequencing result


Page 29: Next-generation sequencing course, part 1: technologies

Sequencing by ligation

http://bit.ly/fPh22X

sequencing steps


Page 30: Next-generation sequencing course, part 1: technologies

http://bit.ly/fPh22X

sequencing result


Page 31: Next-generation sequencing course, part 1: technologies

Pyrosequencing

Metzker et al, 2010

Metzker et al, 2010


Page 32: Next-generation sequencing course, part 1: technologies

Real-time sequencing

“ZMW” zero-mode waveguide

DNA polymerase

“strobe sequencing”


Page 33: Next-generation sequencing course, part 1: technologies

Platform     Run time   Gb/run
Roche 454    8.5 hr     0.45
Illumina     9 days     35
SOLiD        14 days    50
Helicos      8 days     37
PacBio       ?          ?


Page 34: Next-generation sequencing course, part 1: technologies

Accuracy - base calling error

• base calling errors; accuracy ranking: Sanger > SOLiD > Illumina > 454 > Helicos

• base quality drops along read (“dephasing” within clusters)


Page 35: Next-generation sequencing course, part 1: technologies

Accuracy - homopolymer runs

Issue for Roche 454:

39% of errors occur in homopolymer runs

A5 motifs: 3.3% error rate

A8 motifs: 50% error rate

Reason: use signal intensity as a measure for homopolymer length
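To see why signal intensity makes long homopolymers hard to call, the following Python sketch computes an idealised, noise-free flowgram: one nucleotide is flowed at a time, and the signal for each flow equals the number of bases incorporated. The function name and the flow order "TACG" are assumptions made for this illustration, not taken from the slides.

```python
def ideal_flowgram(sequence, flow_order="TACG"):
    """Return a list of (nucleotide, signal) pairs, one per flow.

    The signal is the length of the homopolymer run incorporated during
    that flow (0 if the flowed nucleotide does not match the next base).
    """
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(sequence):
        base = flow_order[flow_index % len(flow_order)]
        run = 0
        # All consecutive identical bases are incorporated in a single flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
        flow_index += 1
    return flows


print(ideal_flowgram("TTAAAAAC"))
# [('T', 2), ('A', 5), ('C', 1)]
```

In this ideal model an A5 run gives a signal of exactly 5. In real data the signal is noisy, and the relative difference between a run of n and n+1 bases shrinks as n grows, so 4 vs 5 is harder to tell apart than 1 vs 2.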


Page 36: Next-generation sequencing course, part 1: technologies


Page 37: Next-generation sequencing course, part 1: technologies

Ronaghi, Genome Res 11:3-11 (2001)


Page 39: Next-generation sequencing course, part 1: technologies

Is it 4? Is it 5? Is it 4?

http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg


Page 40: Next-generation sequencing course, part 1: technologies

Consensus accuracy

Increase accuracy for SNP calling by increasing coverage:

Illumina: 20X

SOLiD: 12X

454: 7.4X

Sanger: 3X

Factors: raw accuracy + read length

How deep do you have to sequence? => Poisson distribution: “If you sequence at an average of 10X, how much of the genome will be covered at least 5X?”
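The Poisson question can be answered with a short calculation. The sketch below assumes that reads land uniformly at random, so that per-base depth follows a Poisson distribution with the average coverage as its mean; real coverage is usually more uneven, so treat the result as an optimistic estimate. The function name is made up for this example.

```python
import math


def fraction_covered_at_least(mean_depth, min_depth):
    """P(X >= min_depth) for X ~ Poisson(mean_depth)."""
    # Probability of seeing fewer than min_depth reads at a position.
    below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - below


print(f"{fraction_covered_at_least(10, 5):.3f}")
# 0.971
```

Under this model, sequencing at an average of 10X leaves about 3% of the genome covered fewer than 5 times.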


Page 41: Next-generation sequencing course, part 1: technologies

Bentley et al, Nature 456:53-59 (2008)


Page 42: Next-generation sequencing course, part 1: technologies

FASTQ file format

“@” + identifier

sequence

“+” + identifier (optional)

phred-based quality scores

phred quality score encoding

Wikipedia

example fastq entries (n=2)

example fasta entries (n=2)
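The four-line record layout and the phred encoding can be illustrated with a small Python reader. This is a sketch, not a full parser: it assumes each record occupies exactly four lines (no wrapped sequences), the function names are made up, and the example record is invented. A phred score Q corresponds to an error probability of 10^(-Q/10), and the ASCII offset is 33 for Sanger-style files and 64 for older Illumina files.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) from four-line records."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:].strip(), seq, qual


def phred_scores(quality_string, offset=33):
    """Convert quality characters to numeric phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base with phred score q is wrong."""
    return 10 ** (-q / 10)


example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
# read1 ACGT [40, 40, 40, 40]
```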


Page 43: Next-generation sequencing course, part 1: technologies

Sequence quality control

Is this good sequence? (essential!)

E.g.: using FastQC tool (Babraham Institute, UK; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)


Page 44: Next-generation sequencing course, part 1: technologies

Sequence quality control

per base sequence quality

good bad
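As a rough idea of what a per-base quality plot summarises, the sketch below computes the mean phred score at each read position. It is not how FastQC itself is implemented (FastQC draws a box plot per position); the function name and the default offset of 33 are assumptions for this illustration.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean phred score at each read position across all reads."""
    totals = []
    counts = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First read that reaches this position.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


print(mean_quality_per_position(["II5", "I5"]))
# [40.0, 30.0, 20.0]
```

A "good" run keeps these values high along the whole read; a "bad" run shows them dropping towards the end of the read.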


Page 45: Next-generation sequencing course, part 1: technologies

Sequence quality control

per sequence quality scores

good bad


Page 46: Next-generation sequencing course, part 1: technologies

Sequence quality control

per base sequence content

good bad


Page 47: Next-generation sequencing course, part 1: technologies

Sequence quality control

per base GC content

good bad


Page 48: Next-generation sequencing course, part 1: technologies

Sequence quality control

per sequence GC content

good bad
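Per-sequence GC content is simply the fraction of G and C among the called bases of each read. A minimal sketch follows; the function name is made up, and ignoring N and other ambiguous characters is a choice made for this example.

```python
def gc_fraction(sequence):
    """Fraction of G/C among called (A, C, G, T) bases; None if no base is called."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return None
    return (seq.count("G") + seq.count("C")) / called


print(gc_fraction("GGCCAATT"))
# 0.5
```

Plotting a histogram of this value over all reads gives the distribution that is compared with the expected one for the organism being sequenced.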


Page 49: Next-generation sequencing course, part 1: technologies

Sequence quality control

k-mer content

good bad


Page 50: Next-generation sequencing course, part 1: technologies

Intermezzo: Galaxy


Page 51: Next-generation sequencing course, part 1: technologies

Online genome analysis


http://galaxy.psu.edu/

“Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”

Page 52: Next-generation sequencing course, part 1: technologies


Page 53: Next-generation sequencing course, part 1: technologies


Page 54: Next-generation sequencing course, part 1: technologies

Applications of next-generation sequencing


Page 55: Next-generation sequencing course, part 1: technologies


Kahvejian et al, 2008

Page 56: Next-generation sequencing course, part 1: technologies


Kahvejian et al, 2008

DNA-seq

ChIP-seq

RNA-seq

Page 57: Next-generation sequencing course, part 1: technologies


Kahvejian et al, 2008

DNA-seq

ChIP-seq

RNA-seq

identify sequence variations

identify pathogens

Page 58: Next-generation sequencing course, part 1: technologies

Exercises


Page 59: Next-generation sequencing course, part 1: technologies


Try to login to the server mentioned on Toledo with username and password provided there.

There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and s_2_sequence.txt (= paired ends)

• How many sequences are in s_1_sequence.txt?

• What encoding was used for the quality score? Illumina? Sanger?

• What are the numerical quality scores for the first sequence in s_1_sequence.txt (i.e. 7172283/1)?
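One possible way to approach these three questions in Python is sketched below. It assumes four-line FASTQ records, and the function names are made up for this example. The encoding guess is only a heuristic: quality characters below ASCII 59 can only occur with an offset of 33 (Sanger style), whereas a file whose lowest character is 59 or higher is most likely offset 64 (older Illumina/Solexa style), but a very high-quality offset-33 file would be misclassified.

```python
def count_fastq_records(path):
    """Number of records, assuming exactly four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def guess_quality_offset(quality_strings):
    """Heuristic guess of the ASCII offset (33 or 64) from the lowest character."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters given")
    return 33 if min(codes) < 59 else 64


def first_read_scores(path, offset):
    """Numeric quality scores of the first record in the file."""
    with open(path) as handle:
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
    return [ord(ch) - offset for ch in lines[3]]
```

On the Linux server the record count can also be obtained by dividing the line count reported by `wc -l` by four, under the same four-lines-per-record assumption.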

Page 60: Next-generation sequencing course, part 1: technologies

• Create an account on the Galaxy server

• Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload them into Galaxy. These files are also available on the linux server

• Have a look at the contents of s_1_sequence.txt.

• Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ Groomer”)

• Draw the quality score boxplot for s_1_sequence.txt

• Draw the nucleotide distribution chart for s_1_sequence.txt


Page 61: Next-generation sequencing course, part 1: technologies

References

Bentley DR et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59 (2008)

Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could sequence everything? Nature Biotechnology 26:1125-1133 (2008)

Korbel JO et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318:420-426 (2007)

Mardis ER. A decade’s perspective on DNA sequencing technology. Nature 470:198-203 (2011)

Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics 11:31-46 (2010)

Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology 26:1135-1145 (2008)

Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics and Human Genetics 10 (2009)