
Class 11 – 2nd “next generation” seq. method

Review other sequencing methods: Sanger, Ion Torrent

New method

Begin to consider uses of DNA seq.

Old method – Sanger chain termination (Nobel Prize)
Clone target DNA in bac. to get the ~10^11 copies needed for 4 seq rxns: DNA template + primer + pol + dNTP + ddATP (or ddCTP etc., each in separate tube). ddNTPs lack 3’ OH, so they are incorporated normally but can’t be extended. Run gel w/ 4 lanes; bands in G lane show size of frags. ending in G, etc.

Dideoxy NTPs lack 3’ OH group

They are incorporated normally, but the next base can’t be chemically attached because it attaches thru the 3’ O

missing OH

More elegant later method: label each ddNTP with a diff. colored fluor; run electrophoresis products in a single lane

camera records color of products as they run off the bottom of the gel


Ion Torrent Method

Part A: produce ~10^7 copies of individual DNA fragments on µm-sized beads because sequencing method requires multiple identical target molecules/bead

Part B: read sequence by primer extension synthesis, 1 base at a time, detecting pH change when dNTPs are incorporated in individual wells containing single beads, using array of ion-sensitive field effect transistors (ISFETs)

Part A - method to put many copies of single short piece of DNA on micron-size bead; diff. DNAs on diff. beads

Shear target DNA; select pieces ~200 bp in length (how?)

Ligate forked adapter oligos to ends of sheared DNA

Note this allows all pieces to be amplified with oligos F and R (R = the reverse complement of R’)

(without fork, F and F’ would be at 5’ and 3’ ends and their annealing on single templates would impede pcr)


Make water-in-oil emulsion containing:
1) pcr reagents to amplify DNA using primers F and R
2) hydrophilic micron-size beads with lots of oligos F attached via their 5’ ends
3) bead and DNA concentration adjusted such that there is ~1 DNA fragment and 1 bead/water droplet

Each droplet acts like a test tube to isolate an individ. DNA species. Because many copies of F are on each bead, many product strands (~10^7) starting with F get attached to each bead

Break emulsion with soap, spin down beads, melt off non-covalently attached strand, spin down beads - most now have single-stranded DNA starting with F and ending with R’

Centrifuge enriched beads into wells just big enough to hold a single bead

http://upload.wikimedia.org/wikipedia/commons/1/10/DNTP_nucleotide_incorporation_reaction.svg

Part B: to get sequence, add primer R, DNA pol and a single dNTP, e.g. dATP; if T is next base on template, A will be incorporated, generating ~10^7 H+ ions as dATP -> dAMP + PP + H+

If T is not the next base, no H+ will be produced
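The read-out logic of Part B can be sketched as a toy simulation. This is only an illustration under stated assumptions: the flow order "TACG" and the function names are hypothetical, and the signal is simply the count of bases incorporated in a flow (a run of identical template bases is copied within a single flow, so it gives a proportionally larger signal).

```python
# Toy model of flow-based sequencing: one dNTP is flowed at a time and the
# signal (H+ released) is proportional to the number of bases incorporated.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_flows(template, flow_order="TACG", n_cycles=10):
    """Return a list of (nucleotide, n_incorporated), one entry per flow.

    `template` is written in the order the polymerase reads it.
    `flow_order` is an assumed order, not taken from the slides.
    """
    pos = 0
    signals = []
    for _ in range(n_cycles):
        for nt in flow_order:
            count = 0
            # Extend while the flowed dNTP pairs with the next template base.
            while pos < len(template) and COMPLEMENT[template[pos]] == nt:
                count += 1
                pos += 1
            signals.append((nt, count))
            if pos == len(template):
                return signals
    return signals


def call_bases(signals):
    """Rebuild the synthesized strand from the per-flow signals."""
    return "".join(nt * count for nt, count in signals)


template = "TTGCA"
signals = simulate_flows(template)
read = call_bases(signals)
print(signals)
print(read)
```

A flow with count 0 corresponds to the "no H+ will be produced" case; the synthesized read is the base-by-base complement of the template.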

Key ideas and innovations in Illumina Method

Biochemistry
• “bridging” pcr to get array of ~10^8 DNA spots on glass slide, each containing ~10^4 copies of an individual ~200 bp DNA species in ~1 µm area

• sequencing by synthesis, 1 base at a time, using dNTPs with removable fluors and 3’ blocking groups

• reading ~35 b from both ends of each DNA species to get seq that should be known distance apart in ref. seq.

Image analysis – automated collection and analysis of ~10^6 microscope images/run

Informatics – mapping short seq. runs to genome

First challenge – how to assemble multiple copies of individual templates on solid surface where sequencing will be done

Shear genomic DNA (nebulizer) into segments ~200 – 2000 bp

“blunt” ends w/ DNA pol

Ligate “forked” adapter oligonucleotide

Pcr w/ oligos complementary to adapter seq forked ends -> A, B at 5’-ends of alternate strands of all fragments


Substrate = glass flow cell, 8 channels ~100 µm height, thin layer of polyacrylamide applied in each channel

Polyacrylamide contains bromo… (BRAPA) which covalently links to phosphorothioate group on 5’ end of new primers

3’ ~20 bases of attached primers match those of oligo A or B used to pcr the genomic fragments, so melted amplified genomic fragments anneal to the attached primers. Primer ext. w/ DNA pol makes copy of 1 strand of particular genomic fragment at some spot on surface

Next challenge – make multiple copies of each fragment in small region on substrate surface (to have enough copies to get a strong sequencing signal)

Now melt off template

Newly synthesized strand anneals at its 3’-end to nearby, 5’-attached oligo A


Repetition grows thicket of both strands of particular genomic fragment in small spot on surface: “bridging” pcr. Note all strands are covalently attached via 5’ ends

For unexplained reason they do this surface pcr by repeated cycles of chemical rather than thermal denaturation


Image of DNA fragments on surface after bridging pcr; each fragment is labeled (during sequencing) with 1 of 4 differently colored fluors by method explained below

Each spot = “polony” or “cluster” of many copies of single DNA fragment

Spot diameters ~1 µm; each spot contains ~10^4 strands -> primers ~10 nm apart; areal density consistent with initial conc. of annealed genomic fragments ~3 pM
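A quick arithmetic check of the "~10 nm apart" estimate, assuming a circular spot of ~1 µm diameter with ~10^4 strands spread evenly over it (the even spacing is a simplifying assumption):

```python
import math

spot_diameter_nm = 1000.0  # ~1 µm spot diameter, from the slide
strands_per_spot = 1e4     # ~10^4 strands per spot, from the slide

spot_area_nm2 = math.pi * (spot_diameter_nm / 2) ** 2
area_per_strand_nm2 = spot_area_nm2 / strands_per_spot
# Side of the square patch of surface available to each strand.
spacing_nm = math.sqrt(area_per_strand_nm2)

print(f"area per strand ~{area_per_strand_nm2:.0f} nm^2, spacing ~{spacing_nm:.1f} nm")
```

The result is a little under 10 nm, in line with the figure on the slide.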

Next challenge – how to make surface pcr’d DNA single-stranded to serve as sequencing template

Clever method – cut one strand of DNA at chemically sensitive site (*) engineered in oligo B, then melt off non-coval. attached DNA, add free primer B that anneals to distal (3’) end of attached template, extend B w/pol


How to make the single-strand cut?

Put diol modified base in attached oligo B; diol can be chemically cleaved by periodate

How to sequence other end of template?


After sequencing 1st strand, melt off primer-ext. product, perform another cycle of bridging pcr (ii), make single- stranded cut in attached oligo A, melt off oligo A extension product, seq. w/ soluble primer A


Note you need a new way to make ss cut in oligo A so you can make the A and B cuts separately; here are 2 ways:

Synthesize oligo A with uracil U instead of T at given position; enzyme uracil glycosylase removes uracil (not normally in DNA); heat or high pH then breaks A strand at site of removed U

Alternative: put oxoG in place of one G in oligo A; enzyme Fpg glycosylase removes abnormal oxoG; heat or high pH then breaks A strand at site of removed G

Novel use of enzymes that remove abnormal bases (repair mutations in vivo) plus ability to insert abnormal bases during oligo synthesis makes this possible

Additional complication: any free 3’ ends on DNA on surface might “fold-back” and serve as primer for competing sequence rxn

They block this by enzymatically adding nucleotide w/blocked 3’OH group to all DNAs before adding seq. primer

How is sequence read biochemically?

3’ azide group N3 blocks extension

T modified with fluor

A, C and G similarly modified but with diff. colored fluors; only one base is added at a time due to 3’ blocking group


They synthesized novel nucleotides!

Treatment with TCEP removes fluor and 3’ blocking group, which allows next nucleotide to be added and its color detected (prev. fluor is removed)

Amazing that bulky, unnatural chemical groups left attached do not inhibit polymerase, or mess up base-pairing

They say they had to engineer (mutate) DNA polymerase to get it to incorporate these modified bases efficiently. This is another innovative step!

Repeated cycles of flowing in polymerase plus 4 modified nucleotides (1 of which gets incorporated in given spot), washing, taking picture, treating with TCEP -> sequence

Picture taken at step n during sequencing run; all strands in a given cluster label with A, C, G or T depending on sequence at nth base in template strand. How does spot density compare to ion torrent?

Image analysis technology and innovations

Note they use TIRF microscopy to reduce background, only see fluors within < 1 µm of surface

Why “custom” objective?


How big is typical microscope field of view (FOV) at 60x magnification? Imagine FOV expanded 60x in each direction and mapped to 3x3 mm CCD

How many images would they need to cover ~10 cm^2 flow cell surface?

How long would it take to collect these images serially if they have to move slide 1 FOV between images?
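One way to work through these questions, as a rough sketch: the 3x3 mm CCD, 60x magnification and ~10 cm^2 area come from the slide, while `seconds_per_image` (exposure plus stage move) is a purely hypothetical value chosen for illustration.

```python
ccd_side_mm = 3.0          # CCD is 3 x 3 mm (from the slide)
magnification = 60         # 60x objective (from the slide)
flow_cell_area_cm2 = 10.0  # ~10 cm^2 flow cell surface (from the slide)
seconds_per_image = 1.0    # hypothetical: exposure + moving slide 1 FOV

# The CCD sees the sample magnified 60x, so the FOV side is CCD side / 60.
fov_side_um = ccd_side_mm * 1000 / magnification
fov_area_um2 = fov_side_um ** 2

# 1 cm = 1e4 µm, so 1 cm^2 = 1e8 µm^2.
flow_cell_area_um2 = flow_cell_area_cm2 * 1e8
n_images = flow_cell_area_um2 / fov_area_um2

hours_per_pass = n_images * seconds_per_image / 3600

print(f"FOV ~{fov_side_um:.0f} x {fov_side_um:.0f} µm")
print(f"images to tile the flow cell once: ~{n_images:.0f}")
print(f"serial time per pass at {seconds_per_image} s/image: ~{hours_per_pass:.0f} h")
```

Under these assumptions a standard 60x FOV needs several hundred thousand images per pass, and one pass is needed for every base addition, which is why a larger FOV matters.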

Their “custom” lens gives them ?? (0.1 mm)^2 FOV

How many sets of images do they need (1 for each base addition)? How long does it take to collect data for 1 run? ~week

Do they need to align the spots in images of the same FOV taken hours apart? Automated spot alignment program

Cross-talk of different fluors – they need to adjust image intensities to correct for “red” fluorescence of “green” fluor, etc to get best estimate of which dNTP was incorp.

If base extension or deblocking is not complete for all strands in cluster, different nucleotides will be incorporated at subsequent steps, purity of fluorescence signal will erode (phasing prob.)
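The erosion of signal purity can be illustrated with a deliberately simple model. Assume each strand independently fails to extend (or deblock) with some small probability per cycle and, once out of step, stays out of step; `fail_rate` below is a hypothetical number, not a measured one.

```python
def in_phase_fraction(cycle, fail_rate):
    """Fraction of strands in a cluster still in step after `cycle` cycles.

    Simplified model: each strand fails independently with probability
    `fail_rate` per cycle and never catches up once it has fallen behind.
    """
    return (1.0 - fail_rate) ** cycle


fail_rate = 0.01  # hypothetical 1% failure per cycle
fractions = {n: in_phase_fraction(n, fail_rate) for n in (1, 12, 35, 100)}

for n, frac in fractions.items():
    print(f"cycle {n:3d}: {frac:.2f} of strands report the correct base")
```

Even a 1% per-cycle failure leaves only about 70% of strands in phase by cycle 35 in this model, which gives a feel for why reliable read length is limited.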

Quality control measures used to decide when base calling is unreliable; e.g. purity filter: intensity of 1 base must be > .6 sum of it plus next brightest base in 1st 12 positions
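A minimal sketch of the purity filter as described on the slide. It reads the rule as "every one of the first 12 positions must pass"; that reading, the function names and the example intensities are assumptions for illustration only.

```python
def purity(intensities):
    """Brightest channel / (brightest + next brightest) for one cycle.

    `intensities` holds the 4 corrected channel intensities (A, C, G, T).
    """
    top, second = sorted(intensities, reverse=True)[:2]
    return top / (top + second)


def passes_purity_filter(cycles, threshold=0.6, n_positions=12):
    """True if purity exceeds `threshold` at each of the first `n_positions` cycles."""
    return all(purity(c) > threshold for c in cycles[:n_positions])


# Hypothetical intensity data: one clean cluster, one with a mixed signal at cycle 6.
clean_cluster = [[100, 10, 5, 5]] * 12
mixed_cluster = clean_cluster[:5] + [[50, 45, 3, 2]] + clean_cluster[6:]

print(passes_purity_filter(clean_cluster))  # clean signal at every position
print(passes_purity_filter(mixed_cluster))  # one ambiguous position among the first 12
```

A cluster whose two brightest channels are nearly equal is probably two overlapping clusters or badly out of phase, so it is discarded rather than base-called.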


# errors determined by sequencing DNA with known seq.

Even with QC criteria to select good reads, get only ~35 b reliable seq.!

# errors/35 bases

How does Illumina method differ from Sanger / ion torrent?

Feature | Sanger | ion torrent | Illumina
How to get many copies of template | clone in bacteria | emulsion pcr -> beads | bridging pcr on glass surface
seq rxn/biochemistry | dye-labeled ddNTP chain terminators | normal dNTPs, 1 at a time | reversibly 3’ blocked dye-labeled dNTPs, 4 at a time
read out | gel electrophoresis of labeled DNA; size => pos. | ISFET detect. of H+ released on base incorp. in each well | sequential photos see order of base addition in each cluster
seq. length | ~1000 | ~100 | ~35

Informatics – mapping short seq. reads to genome

2 programs used to look for matches betw. the ~35b end seq. they obtain for a cluster and ref seq.

ELAND – finds all seq. in reference that match first 32 bases of cluster seq, allowing up to 2 mismatches but no gaps; then sees which of these best match cluster seq at any bases beyond 32

MAQ – more sophisticated in allowing gaps betw. ref. and cluster seq., so picks up more matches with small “indels”, but potentially more errors
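The ELAND idea described above (match a fixed-length seed with up to 2 mismatches and no gaps, then rank candidates using the rest of the read) can be sketched as follows. This is not the real program: it uses a brute-force scan of one strand rather than an index, and the function names and the tiny example (with a shortened seed) are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def seed_matches(read, reference, seed_len=32, max_mismatches=2):
    """Start positions where the read's first `seed_len` bases match the
    reference with at most `max_mismatches` mismatches and no gaps."""
    seed = read[:seed_len]
    hits = []
    for start in range(len(reference) - len(seed) + 1):
        if hamming(seed, reference[start:start + len(seed)]) <= max_mismatches:
            hits.append(start)
    return hits


def best_hit(read, reference, seed_len=32, max_mismatches=2):
    """Among the seed hits, pick the position with fewest mismatches over the full read."""
    scored = []
    for start in seed_matches(read, reference, seed_len, max_mismatches):
        window = reference[start:start + len(read)]
        if len(window) == len(read):
            scored.append((hamming(read, window), start))
    return min(scored)[1] if scored else None


# Toy example with an 8-base seed so it fits on a slide.
reference = "ACGTTGCAAGGCTTACCGATGGA"
read = "GCATGGCTTACC"  # reference[5:17] with one substitution (A -> T)
hits = seed_matches(read, reference, seed_len=8)
position = best_hit(read, reference, seed_len=8)
print(hits, position)
```

Because no gaps are allowed, a read spanning a small indel would pick up many mismatches downstream of it and fail the seed test, which is the case a gap-tolerant aligner is meant to catch.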

If genome seq. were random, what length seq. would be unique (unlikely to occur more than once)?
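A back-of-envelope way to approach this question: in a random sequence, a given L-base sequence is expected at about G / 4^L positions in a genome of G bases, so uniqueness needs 4^L to be comfortably larger than G. The sketch assumes G = 3 x 10^9 and counts one strand only.

```python
genome_size = 3e9  # ~3 Gb human genome


def expected_chance_matches(length, genome_size=3e9):
    """Expected number of positions in a random genome matching a given
    sequence of `length` bases by chance (one strand only)."""
    return genome_size / 4 ** length


# Shortest length for which the number of possible sequences exceeds the genome size.
min_len = next(L for L in range(1, 40) if 4 ** L > genome_size)

print(f"shortest length with 4^L > genome size: {min_len}")
for L in (16, 20, 35):
    print(f"L = {L}: expected chance matches ~{expected_chance_matches(L):.2g}")
```

So in a truly random genome a 35 b read would be overwhelmingly likely to be unique; real genomes are not random, which is the complication taken up next.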

Complication: some sequences >35b occur many times

“selfish” genes have replicated and re-inserted in different positions in the genome, e.g.

short interspersed nuclear elements (SINES, alu) ~300 bp; ~10^6 copies (~10% of genome)

long interspersed nuclear elements (LINES) ~6000 bp; ~10^5 copies (~20% of genome)

Two features help assignment of 35b reads to correct position in genome

they know the paired end read should map to other DNA strand about 200 bp away in reference sequence

each region of DNA is read many times, so they can just map consensus sequence for any segment

Tests of quality

How uniformly does their data cover the ref. seq.?

If some DNA segments don’t amplify well (? due to high GC content) they might be absent in their seq.

If cluster seq. is random sample of ref seq., Poisson dist. predicts how many times, n, a ref. seq. base should appear in cluster seq.

p_n = e^{-m} m^n / n!   where m = aver. # times

m=130Gb of cluster seq/3Gb per genome = 43
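To get a feel for what the Poisson formula predicts at m ~ 43, the probabilities can be evaluated directly. The cutoff of 20 reads used below is an arbitrary illustration, not a threshold from the paper.

```python
import math


def poisson_pmf(n, m):
    """Probability that a base is sequenced exactly n times when the mean is m."""
    return math.exp(-m) * m ** n / math.factorial(n)


m = 130 / 3  # 130 Gb of cluster seq / 3 Gb per genome, ~43

p_at_mean = poisson_pmf(43, m)
p_zero = poisson_pmf(0, m)
p_below_20 = sum(poisson_pmf(n, m) for n in range(20))  # arbitrary cutoff

print(f"P(n = 43)  ~ {p_at_mean:.3f}")
print(f"P(n = 0)   ~ {p_zero:.1e}")
print(f"P(n < 20)  ~ {p_below_20:.1e}")
```

With purely random sampling almost no base should be covered fewer than 20 times, so an excess of low-coverage bases in the real histogram points to non-random sampling.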

Fig. 2 Take every 50th base of ref seq.; how many times is an overlapping frag. found in a cluster seq. mapped to the ref seq.? Make a histogram of the # of such bases found n times in the cluster seq data set. For interest, consider separately bases that don’t occur in repetitive elements like SINES and LINES (unique only)

The dist. is pretty close to Poisson (only slt. more samples in tails), so the method seems to sample pretty randomly

Plot # times a particular base is sequenced in the data set as function of GC content of seq. in which it occurs. Only cluster sequences with most extreme GC contents were sequenced less than the average ~40 times

So what? If a seq. (with extreme GC content) is under-sampled, you might get only the maternal or paternal copy (allele) in the seq data set and so miss finding a polymorphism (false negative)
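The under-sampling risk can be put in numbers with a simple model: at a heterozygous site, assume each read comes from the maternal or paternal copy with probability 1/2, independently. The chance that all n reads come from the same copy is then 2 x (1/2)^n. Sequencing errors and mapping bias are ignored in this sketch.

```python
def prob_one_allele_only(n_reads):
    """Chance that n reads at a heterozygous site all come from the same allele,
    assuming each read samples either allele with probability 1/2."""
    if n_reads < 1:
        raise ValueError("need at least one read")
    return 2 * 0.5 ** n_reads


for n in (2, 5, 10, 40):
    print(f"{n:2d} reads: P(only one allele seen) = {prob_one_allele_only(n):.2g}")
```

At ~40x coverage the chance is negligible, but at 5 reads it is already about 6%, so poorly covered regions are where heterozygous SNPs get missed.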

Does GC content affect how often a region is sampled?

Next evaluation – compare how often SNPs are identified in the seq. vs. SNP hybridization assays (“GT, genotyping”)

Note this company makes SNP hybrid. assay, so it is working hard on technology that may replace its current platform!

latest version of hybrid. assay w/ >3M SNPs

std version of hybrid. assay (GT) w/ 0.5M SNPs

<1% discordant calls; most often the array assay (GT) finds a SNP missed by seq.

Using ELAND program:

Same table, using MAQ program, seq. does slt. better, but in general GT and seq. have similar fail-to-detect rates

Their new, favorite set of SNPs with least ambiguity

Most GT failures-to-detect are due to person carrying so variant a seq. that it fails to hybridize to anything on the chip

Most seq. failures-to-detect are due to low sampling rate of one allele

But seq. picked up ~1M new SNPS in this person!

Why? Std SNP panels selected for SNPs that occur fairly frequently in population. This individual of African ancestry - ?underrepresented in std SNP panel. Maybe most of us carry lots of “private” SNPs that are very rare in the population

How can you get information about structural changes larger than 35bases from 35base long reads?

Use info from paired end reads!

Idea – label ends of genomic DNA segments w/ biotin nucleotides (B) using DNA pol

circularize DNA segments (ligate diluted sample); re-shear DNA; purify biotinylated DNA; make clusters as before and read seq of ends of junction frags.

Now sequence at opposite ends of small frags comes from genomic DNA regions separated by length of circularized fragments; also, orientation wrt each other is flipped

If you can map both end sequences to genome, you can find deletions (end seq. further apart in ref. seq. than circularized fragment length), insertions (end seq. closer together in ref. seq. than circ. frag. length), inversions (orientation reversed)
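The classification rule in the paragraph above can be written out as a small function. It is a sketch only: the `tolerance` allowed for normal variation in fragment length, the argument names and the example numbers are hypothetical.

```python
def classify_pair(ref_distance, orientation_ok, fragment_length, tolerance):
    """Classify one read pair from where its two ends map in the reference.

    ref_distance    -- distance between the two end sequences in the reference (bp)
    orientation_ok  -- True if the ends map in the expected relative orientation
    fragment_length -- expected separation of the ends in the sample (bp)
    tolerance       -- hypothetical allowance for normal size variation (bp)
    """
    if not orientation_ok:
        return "inversion"
    if ref_distance > fragment_length + tolerance:
        return "deletion"   # ends further apart in ref than in the sample
    if ref_distance < fragment_length - tolerance:
        return "insertion"  # ends closer together in ref than in the sample
    return "concordant"


# Hypothetical pairs from a library with ~2000 bp separation between ends.
examples = [
    classify_pair(2050, True, 2000, 200),
    classify_pair(3500, True, 2000, 200),
    classify_pair(900, True, 2000, 200),
    classify_pair(2000, False, 2000, 200),
]
print(examples)
```

In practice a call would rest on several independent pairs agreeing, since a single mis-mapped read can mimic any of these patterns.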

They identified 1000s of >50 bp deletions, many of which were known selfish DNA elements present in reference seq. but not in the seq. of the person whose DNA they analyzed

90% of these are SINES present inreference but not in this individual

60% are LINES

They also found 2345 insertions

How many are in coding sequences? How many are homozygous?

Map of a region containing an inversion flanked by 2 small deletions. What do symbols represent?

Note ~2kb region of ref seq. with no read pairs (green); “short insert” pairs flanking this region (orange) map to sites ~2kb apart in ref. but ~0.5kb in this sample (i.e. span deletion)

Last level of complexity – bio-medical interpretation of seq. information

Example - variability greater in certain areas of genome, e.g. parts of X chromosome - why?

Potentially medically relevant findings – your DNA is likely similar!

26,140 SNPs in protein-coding regions
5,361 encode non-conservative amino acid changes
153 encode premature terminations, “many of which are expected to affect protein function”
(excerpt of Table 9)

Summary - Impressive accomplishment!

Innovations in many fields – all needed for useful product

molecular biology:
• bridging pcr to get ~10^4 copies of individual fragments arrayed on surface
• nicking tricks to convert pcr products to ss for sequencing, and to get the complementary ss for sequencing
• new dNTPs with reversibly blocked 3’ ends and chemically removable fluors, to seq. 1 base at a time
• engineered DNA pols that use these new dNTPs

photonics, data acquisition, informatics …

Lots of detail -> fuller explanation than ion-torrent

Major challenges remaining

quantifying errors

methods for resequencing variants for confirmation

identifying structural variants larger than the pieces of DNA sequenced – e.g. deletions, insertions, duplications, inversions

speeding up (parallelization of) data acquisition

interpretation – clinical significance of variants; implications for human biology

Some key ideas you should take away from today:

How they get array of spots, each with many copies of a DNA to sequence

How they get sequence, 1 base at a time, using reversible dye terminator chem.

How they get information about structural variants larger than the 35 bp runs (paired end reads)

How over sequencing (fold-coverage) helps

How they evaluate seq. accuracy

What kind of mutational load are we all likely to carry in our DNA

Next topic - Why sequence? Clinical issues assoc. w/ seq.

1. basic biology
determine amino acid seq. of proteins
learn role of non-coding seq.
study evolutionary relationships

2. medical applications
inherited disease (e.g. CF) diagnosis, prenatal dx, disease risk prediction, dis. mechanisms
drug sensitivity (“personalized med”)
sequence variants assoc w/ disease but not causal
cancer mutations – identify drugs to use/not use
diagnose microbial infections

3. non-medical applications
e.g. plant engineering, forensics

Problems/challenges with interpreting seq. info.

accuracy – even error rate 0.001% -> ~3 x 10^4 errors in 3 x 10^9 bp human genome seq.; how to re-check possible mutations?

predictive value – how reliable are clinical assoc.? how useful if you can’t (for now at least) change outcome (Alzheimer’s)? will it lead to unnecessary additional testing?

cancer – how rapidly do cancers develop new mutations to become resistant to rx?
