Class 11 – 2nd “next generation” seq. method
Review other sequencing methods: Sanger, Ion Torrent
New method
Begin to consider uses of DNA seq.
Old method – Sanger chain termination (Nobel):
clone target DNA in bac. to get ~10^11 copies needed for 4 seq rxns:
DNA template + primer + pol + dNTP + ddATP (or ddCTP etc., each in separate tube);
ddNTPs lack 3’OH, incorporate normally but can’t be extended;
run gel w/ 4 lanes; bands in G lane show size of frags. ending in G, etc.
Dideoxy NTPs lack 3’OH group
They are incorporated normally, but next base can’t be chemically attached because it attaches thru 3’ O
missing OH
More elegant later method: label each ddNTP with a diff. colored fluor; run electrophoresis products in single lane
camera records color of products as they run off the bottom of the gel
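A minimal sketch of how a 4-lane Sanger gel is read, assuming each lane is given as a list of fragment lengths (bases added past the primer). The function name and the toy data are made up for illustration.

```python
def read_sanger_gel(lanes):
    """Reconstruct the synthesized strand from fragment lengths in four lanes.

    lanes maps the terminating base (the ddNTP used in that tube) to the
    lengths of the fragments seen in that lane.
    """
    base_at_length = {}
    for base, lengths in lanes.items():
        for n in lengths:
            base_at_length[n] = base
    # Shortest fragment = first base added after the primer, and so on.
    return "".join(base_at_length[n] for n in sorted(base_at_length))


# Hypothetical gel: the G lane has bands of length 1 and 6, etc.
lanes = {"A": [2, 5], "C": [3], "G": [1, 6], "T": [4]}
sequence = read_sanger_gel(lanes)
print(sequence)
```

The single-lane fluorescent version is the same idea: the color of each product, in order of size, gives the base at that position.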
Ion Torrent Method
Part A: produce ~10^7 copies of individual DNA fragments on µm-sized beads because sequencing method requires multiple identical target molecules/bead
Part B: read sequence by primer extension synthesis, 1 base at a time, detecting pH change when dNTPs are incorporated in individual wells containing single beads, using array of ion-sensitive field effect transistors (ISFETs)
Part A - method to put many copies of single short piece of DNA on micron-size bead; diff. DNAs on diff. beads
Shear target DNA; select pieces ~200 bp in length (how?)
Ligate forked adapter oligos to ends of sheared DNA
Note this allows all pieces to be amplified with oligos F and R (R = the reverse complement of R’)
(without fork, F and F’ would be at 5’ and 3’ ends and their annealing on single templates would impede pcr)
Make water-in-oil emulsion containing:
1) pcr reagents to amplify DNA using primers F and R
2) hydrophilic micron-size beads with lots of oligos F attached via their 5’ ends
3) bead and DNA concentration adjusted such that ~1 DNA fragment and 1 bead/water droplet
Each droplet acts like test tube to isolate individ. DNA species. Because many copies of F are on each bead, many product strands (~10^7) starting with F get attached to each bead
Break emulsion with soap, spin down beads, melt off non-covalently attached strand, spin down beads - most now have single-stranded DNA starting with F and ending with R’
Centrifuge enriched beads into wells just big enough to hold a single bead
http://upload.wikimedia.org/wikipedia/commons/1/10/DNTP_nucleotide_incorporation_reaction.svg
Part B: to get sequence, add primer R, DNA pol and a single dNTP, e.g. dATP; if T is next base on template, A will be incorporated, generating ~10^7 H+ ions as dATP -> dAMP + PP + H+
If T is not the next base, no H+ will be produced
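A rough sketch of the flow-by-flow read-out, assuming dNTPs are flowed one at a time in a fixed repeating order (the order "TACG" here is arbitrary) and that a run of identical template bases incorporates several nucleotides in one flow, giving a proportionally larger H+ signal. The template is written in the order the polymerase reads it.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_flows(template, flow_order="TACG", max_flows=100):
    """Return (dNTP, number incorporated) for each flow.

    The count stands in for the relative pH signal in one well:
    0 means no H+ released, 2 means two bases added in that flow.
    """
    pos = 0
    flows = []
    for i in range(max_flows):
        if pos >= len(template):
            break
        dntp = flow_order[i % len(flow_order)]
        n = 0
        # Incorporate as long as the next template base pairs with this dNTP.
        while pos < len(template) and COMPLEMENT[template[pos]] == dntp:
            n += 1
            pos += 1
        flows.append((dntp, n))
    return flows


def call_bases(flows):
    """Turn per-flow counts back into the newly synthesized sequence."""
    return "".join(dntp * n for dntp, n in flows)


flows = simulate_flows("TTGCA")
print(flows)
print(call_bases(flows))
```

Here the first flow (dTTP) gives no signal, and the dATP flow gives a double signal because two T's are next on the template.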
Key ideas and innovations in Illumina Method
Biochemistry
• “bridging” pcr to get array of ~10^8 DNA spots on glass slide, each containing ~10^4 copies of an individual ~200 bp DNA species in ~1 µm area
• sequencing by synthesis, 1 base at a time, using dNTPs with removable fluors and 3’ blocking groups
• reading ~35b from both ends of each DNA species to get seq that should be known distance apart in ref. seq.
Image analysis – automated collection and analysis of ~10^6 microscope images/run
Informatics – mapping short seq. runs to genome
First challenge – how to assemble multiple copies of individual templates on solid surface where sequencing will be done
Shear genomic DNA (nebulizer) into segments ~200 – 2000 bp
“blunt” ends w/ DNA pol
Ligate “forked” adapter oligonucleotide
Pcr w/ oligos complementary to adapter seq forked ends A, B -> at 5’-ends of alternate strands of all fragments
Substrate = glass flow cell, 8 channels ~100 µm height, thin layer of polyacrylamide applied in each channel
Polyacrylamide contains bromo… (BRAPA) which covalently links to phospho-thioate group on 5’ end of new primers
3’ ~20 bases of attached primers match those of oligo A or B used to pcr the genomic fragments, so melted amplified genomic fragments anneal to the attached primers. Primer ext. w/ DNA pol makes copy of 1 strand of particular genomic fragment at some spot on surface
Next challenge – make multiple copies of each fragment in small region on substrate surface (to have enough copies to get a strong sequencing signal)
Now melt off template
Newly synthesized strand anneals at its 3’-end to nearby, 5’-attached oligo A
Repetition grows thicket of both strands of particular genomic fragment in small spot on surface: “bridging” pcr; note all strands are covalently attached via 5’ ends
For unexplained reason they do this surface pcr by repeated cycles of chemical rather than thermal denaturation
Image of DNA fragments on surface after bridging pcr; each fragment is labeled (during sequencing) with 1 of 4 differently colored fluors by method explained below
Each spot = “polony” or “cluster” of many copies of single DNA fragment
Spot diameters ~1 µm; each spot contains ~10^4 strands; -> primers ~10 nm apart; areal density c/w initial conc. of annealed genomic fragments ~3pM
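Quick check of the ~10 nm spacing, assuming a circular spot of ~1 µm diameter (micron scale, as for the beads) with ~10^4 strands spread evenly over it. The even-spacing assumption is mine, for illustration only.

```python
import math

spot_diameter_nm = 1000.0   # ~1 µm spot, in nm
strands_per_spot = 1e4      # ~10^4 strands per cluster

spot_area_nm2 = math.pi * (spot_diameter_nm / 2) ** 2
area_per_strand_nm2 = spot_area_nm2 / strands_per_spot
# Side of the square patch of surface each strand would occupy.
spacing_nm = math.sqrt(area_per_strand_nm2)

print(round(spacing_nm, 1))
```

The result is just under 10 nm, which agrees with the spacing quoted on the slide.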
Next challenge – how to make surface pcr’d DNA single-stranded to serve as sequencing template
Clever method – cut one strand of DNA at chemically sensitive site (*) engineered in oligo B, then melt off non-coval. attached DNA, add free primer B that anneals to distal (3’) end of attached template, extend B w/pol
How to make the single-strand cut?
Put diol modified base in attached oligo B; diol can be chemically cleaved by periodate
How to sequence other end of template?
After sequencing 1st strand, melt off primer-ext. product, perform another cycle of bridging pcr (ii), make single-stranded cut in attached oligo A, melt off oligo A extension product, seq. w/ soluble primer A
Note you need a new way to make ss cut in oligo A so you can make the A and B cuts separately; here are 2 ways:
Synthesize oligo A with uracil U instead of T at given position; enzyme uracil glycosylase removes uracil (not normally in DNA); heat or high pH then breaks A strand at site of removed U
Alternative: put oxoG in place of one G in oligo A; enzyme Fpg glycosylase removes abnormal oxoG; heat or high pH then breaks A strand at site of removed G
Novel use of enzymes that remove abnormal bases (repair mutations in vivo) plus ability to insert abnormal bases during oligo synthesis makes this possible
Additional complication: any free 3’ ends on DNA on surface might “fold-back” and serve as primer for competing sequence rxn
They block this by enzymatically adding nucleotide w/blocked 3’OH group to all DNAs before adding seq. primer
How is sequence read biochemically?
3’ azide group N3 blocks extension
T modified with fluor
A, C and G similarly modified but with diff. colored fluors; only one base is added at a time due to 3’ blocking group
They synthesized novel nucleotides!
Treatment with TCEP removes fluor and 3’ blocking group, which allows next nucleotide to be added and its color detected (prev. fluor is removed)
Amazing that bulky, unnatural chemical groups left attached do not inhibit polymerase, or mess up base-pairing
They say they had to engineer (mutate) DNA polymerase to get it to incorporate these modified bases efficiently. This is another innovative step!
Repeated cycles of flowing in polymerase plus 4 modified nucleotides (1 of which gets incorporated in given spot), washing, taking picture, treating with TCEP -> sequence
Picture taken at step n during sequencing run; all strands in a given cluster are labeled with A, C, G or T depending on sequence at nth base in template strand. How does spot density compare to ion torrent?
Image analysis technology and innovations
Note they use TIRF microscopy to reduce background, only see fluors within < 1 µm of surface
Why “custom” objective?
How big is typical microscope field of view (FOV) at 60x magnification? Imagine FOV expanded 60x in each direction and mapped to 3x3 mm CCD
How many images would they need to cover ~10 cm^2 flow cell surface?
How long would it take to collect these images serially if they have to move slide 1 FOV between images?
Their “custom” lens gives them ?? (0.1 mm)^2 FOV
How many sets of images do they need (1 for each base addition)? How long does it take to collect data for 1 run? ~week
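A back-of-envelope sketch for the FOV questions, assuming a 3 mm CCD side, 60x magnification, a ~10 cm^2 flow cell surface, and non-overlapping tiles. The (0.1 mm)^2 custom-lens FOV is the uncertain figure from the slide; time per image is not given, so only image counts are computed.

```python
ccd_side_mm = 3.0
magnification = 60

# FOV on the slide is the CCD size divided by the magnification.
fov_side_mm = ccd_side_mm / magnification

flow_cell_area_mm2 = 10 * 100.0   # 10 cm^2 expressed in mm^2

images_at_60x = flow_cell_area_mm2 / fov_side_mm ** 2
images_custom_lens = flow_cell_area_mm2 / 0.1 ** 2   # assumed (0.1 mm)^2 FOV

print(round(fov_side_mm * 1000), "µm FOV side at 60x")
print(round(images_at_60x), "images per pass at 60x")
print(round(images_custom_lens), "images per pass with the custom lens")
```

Each pass has to be repeated for every base addition (and for each fluor color), which is how the total climbs toward the ~10^6 images/run scale.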
Do they need to align the spots in images of the same FOV taken hours apart? Automated spot alignment program
Cross-talk of different fluors – they need to adjust image intensities to correct for “red” fluorescence of “green” fluor, etc to get best estimate of which dNTP was incorp.
If base extension or deblocking is not complete for all strands in cluster, different nucleotides will be incorporated at subsequent steps, purity of fluorescence signal will erode (phasing prob.)
Quality control measures used to decide when base calling is unreliable; e.g. purity filter: intensity of 1 base must be > .6 sum of it plus next brightest base in 1st 12 positions
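A minimal sketch of the purity filter as worded above: at each of the first 12 positions, the brightest base must exceed 0.6 of the sum of itself plus the next brightest. Function names and the intensity tuples are invented; the production filter may differ in detail.

```python
def purity(intensities):
    """Brightest channel divided by (brightest + second brightest)."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_purity_filter(cycles, threshold=0.6, n_positions=12):
    """cycles is a list of 4-channel intensity tuples, one per base position."""
    return all(purity(c) > threshold for c in cycles[:n_positions])


# Hypothetical clusters: one clean, one with two fragments mixed together.
clean_cluster = [(900, 100, 50, 30)] * 12
mixed_cluster = [(500, 450, 20, 10)] * 12

print(passes_purity_filter(clean_cluster))
print(passes_purity_filter(mixed_cluster))
```

A mixed or badly phased cluster has two channels of similar brightness, so its purity sits near 0.5 and the read is dropped.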
# errors determined by sequencing DNA with known seq.
Even with QC criteria to select good reads, get only ~35 b reliable seq.!
# errors/35 bases
How does Illumina method differ from Sanger / ion torrent?
How to get many copies of template: Sanger – clone in bacteria; ion torrent – emulsion pcr -> beads; Illumina – bridging pcr on glass surface
Seq rxn/biochemistry: Sanger – dye-labeled ddNTP chain terminators; ion torrent – normal dNTPs, 1 at a time; Illumina – reversibly 3’ blocked dye-labeled dNTPs, 4 at a time
Read out: Sanger – gel electrophoresis of labeled DNA, size => pos.; ion torrent – ISFET detect. of H+ released on base incorp. in each well; Illumina – sequential photos see order of base addition in each cluster
Seq. length: Sanger ~1000; ion torrent ~100; Illumina ~35
Informatics – mapping short seq. reads to genome
2 programs used to look for matches betw. the ~35b end seq. they obtain for a cluster and ref seq.
ELAND – finds all seq. in reference that match first 32 bases of cluster seq, allowing up to 2 mismatches but no gaps; then sees which of these best match cluster seq at any bases beyond 32
MAQ – more sophisticated in allowing gaps betw. ref. and cluster seq., so picks up more matches with small “indels”, but potentially more errors
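A brute-force sketch of the ELAND-style rule described above (match the first 32 bases with up to 2 mismatches and no gaps, then rank hits by the full read). This is only an illustration of the rule, not the actual ELAND algorithm, which uses indexing to avoid scanning every position; the random reference and all names are made up.

```python
import random


def mismatches(a, b):
    """Count positions where two aligned strings differ (no gaps)."""
    return sum(x != y for x, y in zip(a, b))


def seed_hits(read, reference, seed_len=32, max_mismatches=2):
    """Reference positions where the first seed_len bases of the read match."""
    seed = read[:seed_len]
    return [
        pos
        for pos in range(len(reference) - len(read) + 1)
        if mismatches(seed, reference[pos:pos + seed_len]) <= max_mismatches
    ]


def best_hit(read, reference):
    """Among seed hits, pick the one with fewest mismatches over the whole read."""
    hits = seed_hits(read, reference)
    if not hits:
        return None
    return min(hits, key=lambda pos: mismatches(read, reference[pos:pos + len(read)]))


# Toy reference and a 35-base read taken from position 100 with 2 substitutions.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
read_bases = list(reference[100:135])
for i in (5, 20):
    read_bases[i] = swap[read_bases[i]]
read = "".join(read_bases)

hit = best_hit(read, reference)
print(hit)
```

Because gaps are not allowed, a read spanning a small indel would fail this test, which is the case the gap-tolerant approach is meant to pick up.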
If genome seq. were random, what length seq. would be unique (unlikely to occur more than once)?
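A worked estimate for the question above, assuming a random genome of 3*10^9 bases with all 4 bases equally likely and counting one strand only: a sequence of length L is expected to occur about G / 4^L times.

```python
genome_size = 3e9


def expected_occurrences(length, genome=genome_size):
    """Expected number of times a given sequence appears in a random genome."""
    return genome / 4 ** length


# Smallest length whose expected count drops below 1.
unique_length = 1
while expected_occurrences(unique_length) >= 1:
    unique_length += 1

print(unique_length)
print(expected_occurrences(35))
```

So ~16 bases would already be enough in a random genome, and a 35-base read is far beyond that; the real complication is that the genome is not random.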
Complication: some sequences >35b occur many times
“selfish” genes have replicated and re-inserted in different positions in the genome, e.g.
short interspersed nuclear elements (SINES, alu) ~300 bp; ~10^6 copies (~10% of genome)
long interspersed nuclear elements (LINES) ~6000 bp; ~10^5 copies (~20% of genome)
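The genome fractions follow from length times copy number, assuming a 3*10^9 bp genome and copy numbers of ~10^6 (SINES) and ~10^5 (LINES):

```python
genome_bp = 3e9


def genome_fraction(element_bp, copies, genome=genome_bp):
    """Fraction of the genome occupied by a repeated element."""
    return element_bp * copies / genome


sine_fraction = genome_fraction(300, 1e6)
line_fraction = genome_fraction(6000, 1e5)

print(f"SINES: {sine_fraction:.0%}, LINES: {line_fraction:.0%}")
```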
Two features help assignment of 35b reads to correct position in genome
they know the paired end read should map to other DNA strand about 200 bp away in reference sequence
each region of DNA is read many times, so they can just map consensus sequence for any segment
Tests of quality
How uniformly does their data cover the ref. seq.?
If some DNA segments don’t amplify well (? due to high GC content) they might be absent in their seq.
If cluster seq. is random sample of ref seq., Poisson dist. predicts how many times, n, a ref. seq. base should appear in cluster seq.
$p_n = e^{-m} m^n / n!$ where m = aver. # times
m = 130 Gb of cluster seq / 3 Gb per genome = 43
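A small sketch evaluating the Poisson prediction for m = 130/3 (about 43). The formula is computed in log space to avoid huge factorials; the cutoff of 20 for "poorly covered" is an arbitrary choice for illustration.

```python
import math


def poisson_pmf(n, m):
    """p_n = e^(-m) * m^n / n!, computed via logs for numerical stability."""
    return math.exp(-m + n * math.log(m) - math.lgamma(n + 1))


m = 130 / 3   # average fold-coverage, ~43

p_never_sampled = poisson_pmf(0, m)
p_under_20 = sum(poisson_pmf(n, m) for n in range(20))

print(p_never_sampled)
print(p_under_20)
```

With random sampling at ~43-fold coverage, essentially no base should be missed entirely, so bases with very low coverage point to non-random effects such as GC content or repeats.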
Fig. 2 Take every 50th base of ref seq.; how many times is an overlapping frag. found in a cluster seq. mapped to the ref seq.? Make a histogram of the # of such bases found n times in the cluster seq data set. For interest, consider separately bases that don’t occur in repetitive elements like SINES and LINES (unique only)
The dist. is pretty close to Poisson (only slt. more samples in tails), so the method seems to sample pretty randomly
Plot # times a particular base is sequenced in the data set as function of GC content of seq. in which it occurs. Only cluster sequences with most extreme GC contents were sequenced less than the average ~40 times
So what? If a seq. (with extreme GC content) is under-sampled, you might get only the maternal or paternal copy (allele) in the seq data set and so miss finding a polymorphism (false negative)
Does GC content affect how often a region is sampled?
Next evaluation – compare how often SNPs are identified in the seq. vs. SNP hybridization assays (“GT, genotyping”)
Note this company makes SNP hybrid. assay, so it is working hard on technology that may replace its current platform!
latest version of hybrid. assay w/ >3M SNPs
std version of hybrid. assay (GT) w/ .5M SNPs
<1% discordant calls most often the array assay (GT) finds a SNP missed by seq.
Using ELAND program:
Same table, using MAQ program, seq. does slt. better, but in general GT and seq. have similar fail-to-detect rates
Their new, favorite set of SNPs with least ambiguity
Most GT failures-to-detect are due to person carrying so variant a seq. that it fails to hybridize to anything on the chip
Most seq. failures-to-detect are due to low sampling rate of one allele
But seq. picked up ~1M new SNPS in this person!
Why? Std SNP panels selected for SNPs that occur fairly frequently in population. This individual of African ancestry - ?underrepresented in std SNP panel. Maybe most of us carry lots of “private” SNPs that are very rare in the population
How can you get information about structural changes larger than 35 bases from 35-base-long reads?
Use info from paired end reads!
Idea – label ends of genomic DNA segments w/ biotin nucleotides (B) using DNA pol
circularize DNA segments (ligate diluted sample); re-shear DNA; purify biotinylated DNA; make clusters as before and read seq of ends of junction frags.
Now sequence at opposite ends of small frags comes from genomic DNA regions separated by length of circularized fragments; also, orientation wrt each other is flipped
If you can map both end sequences to genome, you can find deletions (end seq. further apart in ref. seq. than circularized fragment length), insertions (end seq. closer together in ref. seq. than circ. frag. length), inversions (orientation reversed)
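The mapping rules above can be sketched as a simple classifier. The tolerance parameter and all names are assumptions for illustration; a real caller would need many concordant read pairs before calling a variant.

```python
def classify_pair(ref_distance_bp, fragment_length_bp, orientation_ok, tolerance_bp=200):
    """Classify one read pair by how its ends map to the reference.

    ref_distance_bp    : separation of the two end reads in the reference
    fragment_length_bp : expected separation (length of the sequenced fragment)
    orientation_ok     : True if the ends have the expected relative orientation
    """
    if not orientation_ok:
        return "inversion"
    if ref_distance_bp > fragment_length_bp + tolerance_bp:
        # Ends further apart in reference than in sample: sample lacks DNA.
        return "deletion"
    if ref_distance_bp < fragment_length_bp - tolerance_bp:
        # Ends closer in reference than in sample: sample has extra DNA.
        return "insertion"
    return "consistent"


# Hypothetical pairs, sizes in bp.
print(classify_pair(2000, 500, True))
print(classify_pair(1200, 2000, True))
print(classify_pair(2000, 2000, False))
print(classify_pair(2050, 2000, True))
```

The first call mirrors a pair whose ends map ~2 kb apart in the reference but come from a ~0.5 kb fragment, which is the signature of a deletion in the sample.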
They identified 1000s of >50bp deletions, many of which were known selfish DNA elements present in reference seq. but not in the seq. of the person whose DNA they analyzed
90% of these are SINES present in reference but not in this individual
60% are LINES
They also found 2345 insertions
How many are in coding sequences? How many are homozygous?
Map of a region containing an inversion flanked by 2 small deletions. What do symbols represent?
Note ~2kb region of ref seq. with no read pairs (green) “short insert” pairs flanking this region (orange) map to sites ~2kb apart in ref. but ~.5kb in this sample (i.e. span deletion)
Last level of complexity – bio-medical interpretation of seq. information
Example - variability greater in certain areas of genome, e.g. parts of X chromosome - why?
Potentially medically relevant findings – your DNA is likely similar!
26,140 SNPs in protein-coding regions
5,361 encode non-conservative amino acid changes
153 encode premature terminations, “many of which are expected to affect protein function” (excerpt of Table 9)
Summary - Impressive accomplishment!
Innovations in many fields – all needed for useful product
molecular biology: bridging pcr to get ~10^4 copies of individual fragments arrayed on surface; nicking tricks to convert pcr products to ss for sequencing and getting the complementary ss for sequencing; new dNTPs with reversibly blocked 3’ ends and chemically removable fluors, to seq. 1 base at a time; engineered DNA pols that use these new dNTPs
photonics, data acquisition, informatics …
Lots of detail -> fuller explanation than ion-torrent
Major challenges remaining
quantifying errors
methods for resequencing variants for confirmation
identifying structural variants larger than the pieces of DNA sequenced – e.g. deletions, insertions, duplications, inversions
speeding up (parallelization of) data acquisition
interpretation – clinical significance of variants; implications for human biology
Some key ideas you should take away from today:
How they get array of spots, each with many copies of a DNA to sequence
How they get sequence, 1 base at a time, using reversible dye terminator chem.
How they get information about structural variants larger than the 35 bp runs (paired end reads)
How over sequencing (fold-coverage) helps
How they evaluate seq. accuracy
What kind of mutational load are we all likely to carry in our DNA
Next topic - Why sequence? Clinical issues assoc. w/seq.
1. basic biology
determine amino acid seq. of proteins
learn role of non-coding seq.
study evolutionary relationships
2. medical applications
inherited disease (e.g. CF) diagnosis, prenatal dx, disease risk prediction, dis. mechanisms
drug sensitivity (“personalized med”)
sequence variants assoc w/disease but not causal
cancer mutations – identify drugs to use/not use
diagnose microbial infections
3. non-medical applications
e.g. plant engineering, forensics
Problems/challenges with interpreting seq. info.
accuracy – even error rate 0.001% -> ~3*10^4 errors in 3*10^9 bp human genome seq.; how to re-check possible mutations
predictive value – how reliable are clinical assoc.? how useful if you can’t (for now at least) change outcome (Alzheimer’s)? will it lead to unnecessary additional testing?
cancer – how rapidly do cancers develop new mutations to become resistant to rx
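The accuracy point is simple arithmetic, sketched here with the slide's numbers (0.001% per-base error rate, 3*10^9 bp genome):

```python
error_rate_percent = 0.001
genome_bp = 3e9

expected_errors = (error_rate_percent / 100) * genome_bp
print(round(expected_errors))
```

That is about 3*10^4 wrong bases, i.e. on the order of 10^4, which is why candidate mutations need to be re-checked before being interpreted.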