cse182-l16 - university of california, san diego · • the signal decays after 1000 bases....


Page 1: CSE182-L16 - University of California, San Diego · • The signal decays after 1000 bases. November 13! Bafna! Sequencing Genomes: Clone by Clone • Clones are constructed to span

CSE182-L16

LW statistics/Assembly


Silly Quiz

•  Who are these people, and what is the occasion?


Genome Sequencing and Assembly


Sequencing

•  A break at T is shown here.

•  Measuring the lengths using electrophoresis allows us to get the position of each T

•  The same can be done with every nucleotide. Fluorescent labeling can help separate different nucleotides

November 13 Bafna


•  Automated detectors ‘read’ the terminating bases.

•  The signal decays after 1000 bases.


Sequencing Genomes: Clone by Clone

•  Clones are constructed to span the entire length of the genome.

•  These clones are ordered and oriented correctly (Mapping)

•  Each clone is sequenced individually


Shotgun Sequencing

•  Shotgun sequencing of clones was considered viable

•  However, researchers in 1999 proposed shotgunning the entire genome.


Library

•  Create vectors of the sequence and introduce them into bacteria. As bacteria multiply you will have many copies of the same clone.


Whole Genome Shotgun

•  Break up the entire genome into pieces

•  Sequence ends, and assemble using a computer

•  LW statistics & Repeats argue against the success of such an approach

Alternative: build a roadmap of the genome, with physical clones mapped for each region. Sequence each of the clones, and put them together




Shotgun Sequencing

•  Shotgun sequencing of clones was considered viable for small genomes

•  However, researchers in 1999 proposed shotgunning the entire genome.


Massively parallel sequencing

•  Sanger sequencing allows us to sequence <=1000 bp in one lane, up to 96 lanes, in one run.

•  Today, we can sequence many Mbp in a single run


Questions

•  Algorithmic: How do you put the genome back together from the pieces?

•  Statistical: How many pieces do you need to sequence, etc.?

–  The answer to the statistical questions had already been given in the context of mapping, by Lander and Waterman.


Lander Waterman Statistics

[Figure: fragments of length L placed on a genome of length G]

•  The fragments are falling randomly on the genome

•  Overlapping fragments form islands of contiguous sequence.

•  Ideally, we want one island for each chromosome. How many fragments should we sequence?


Lander Waterman Statistics

•  G = Genome Length

•  L = Fragment Length

•  N = Number of Fragments

•  T = Required Overlap

•  c = Coverage = LN/G

•  α = N/G

•  θ = T/L

•  σ = 1-θ


LW statistics: questions

•  As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island of overlapping contigs.

•  Q1: What is the expected number of islands?

•  The number should increase at first, and gradually decrease.


Analysis: Expected Number Islands

•  Computing Expected # islands.

•  Let $X_i = 1$ if an island ends at position i, $X_i = 0$ otherwise.

•  Number of islands = $\sum_i X_i$

•  Expected # islands = $E(\sum_i X_i) = \sum_i E(X_i)$


Prob. of an island ending at i

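•  $E(X_i)$ = Prob(island ends at position i)

•  = Prob(clone began at position i-L+1 AND no clone began in the next L-T positions)

[Figure: a clone of length L ending at position i, with overlap T marked]

$$E(X_i) = \alpha (1-\alpha)^{L-T} \approx \alpha e^{-c\sigma}$$

The approximation uses $(1-\alpha)^{L-T} \approx e^{-\alpha (L-T)}$ and $\alpha (L-T) = \frac{N}{G} L \left(1 - \frac{T}{L}\right) = c\sigma$.

$$\text{Expected \# islands} = \sum_i E(X_i) = G \alpha e^{-c\sigma} = N e^{-c\sigma}$$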
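As a sanity check on the island formula, here is a minimal simulation sketch. It assumes fragments start uniformly at random on a linear genome, and the function names and parameter values (G = 100,000, L = 500, N = 1000, T = 50) are made up for illustration; they are not from the lecture. The simulated mean should land near $N e^{-c\sigma}$, with a small excess from end effects.

```python
import math
import random

def expected_islands(G, L, N, T):
    """Lander-Waterman estimate N * exp(-c * sigma)."""
    c = L * N / G          # coverage
    sigma = 1 - T / L      # sigma = 1 - theta
    return N * math.exp(-c * sigma)

def simulate_islands(G, L, N, T, rng):
    """Drop N fragments uniformly at random and count the islands."""
    starts = sorted(rng.randrange(G - L + 1) for _ in range(N))
    islands = 1
    for prev, cur in zip(starts, starts[1:]):
        # Consecutive fragments overlap by L - (cur - prev) bases;
        # a new island begins when that overlap is below T.
        if cur - prev > L - T:
            islands += 1
    return islands

rng = random.Random(0)                      # fixed seed for repeatability
G, L, N, T = 100_000, 500, 1000, 50         # illustrative values only
trials = 200
mean_simulated = sum(simulate_islands(G, L, N, T, rng)
                     for _ in range(trials)) / trials
predicted = expected_islands(G, L, N, T)
print(f"predicted {predicted:.1f}, simulated {mean_simulated:.1f}")
```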


Computing # islands

•  As the coverage c increases, more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island.

•  Q1: What is the expected number of islands?

•  Ans: $N e^{-c\sigma}$

•  The number increases at first, and gradually decreases.


Expected # of clones in an island

•  Expected # of clones in an island = $e^{c\sigma}$

Q: How? Why do we care?

Often, at the beginning of a genome project, we do not know the length of the genome. This equation helps us determine the length.


Problem 1: size of contigs

•  Islands might simply be too small in length

•  σ = (1-T/L) = (1-50/500) = 0.9, c = 8.

•  #Islands = $N e^{-c\sigma}$ = 36K

•  Size of an island = 54K

•  Not enough to make it an acceptable assembly!

•  PLUS, there is the problem of Repeats, Chimerism etc.
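The island count on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes a genome length of G = 3000 Mb (the value used on the overlap-detection slide; this slide does not state it) and derives N from the coverage c = LN/G. The variable names are illustrative.

```python
import math

G = 3_000_000_000     # assumed genome length, 3000 Mb
L = 500               # fragment length
T = 50                # required overlap
c = 8                 # coverage

sigma = 1 - T / L                          # 0.9
N = c * G / L                              # number of fragments, from c = LN/G
islands = N * math.exp(-c * sigma)         # expected number of islands
clones_per_island = math.exp(c * sigma)    # expected clones in an island

print(f"N = {N:.2e}")
print(f"expected islands = {islands:,.0f}")
print(f"expected clones per island = {clones_per_island:,.0f}")
```

With these inputs the expected island count comes out at roughly 36K, matching the slide.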


Assembly Basics

•  Three main components:

–  Overlap

–  Layout

–  Consensus


Overlap

•  Given a pair of fragments s1 and s2, do they belong together?

•  Yes, if a prefix of s2 matches a suffix of s1

•  How would you compute such a match?


Overlap

•  S[i,j] = optimum score of an alignment of s1[1..i] against a substring of s2 that starts anywhere, but ends at j: s2[*..j]

•  The best prefix-suffix alignment is given by:

•  $\max_i \{ S[i,n] \}$, where n is the length of s2
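The recurrence behind S[i,j] can be sketched as a small dynamic program. This is a minimal illustration, assuming a simple linear scoring scheme (match +1, mismatch -1, gap -1) that the slides do not specify; the function name is hypothetical. As in the definition of S above, it aligns a prefix of s1 against a suffix of s2, so S[0,j] = 0 for every j (the alignment may start anywhere in s2) and the answer is the maximum over i of S[i,n].

```python
def best_prefix_suffix_overlap(s1, s2, match=1, mismatch=-1, gap=-1):
    """Return (score, i): best alignment of s1[:i] against a suffix of s2."""
    m, n = len(s1), len(s2)
    prev = [0] * (n + 1)              # row S[0][*]: free start anywhere in s2
    best_score, best_i = prev[n], 0
    for i in range(1, m + 1):
        cur = [i * gap] + [0] * n     # S[i][0]: s1[:i] aligned to nothing
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        if cur[n] > best_score:       # the alignment must end at the end of s2
            best_score, best_i = cur[n], i
        prev = cur
    return best_score, best_i

# The suffix "GATT" of s2 matches the prefix "GATT" of s1.
print(best_prefix_suffix_overlap("GATTACA", "CCCGATT"))
```

The table has (m+1)(n+1) cells, so one comparison costs time proportional to the product of the two fragment lengths.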


Overlap Detection

•  Compute the best prefix-suffix alignments between each pair of fragments.

•  Keep the “high-scoring” ones as evidence of true overlap.

•  What is the problem?


Overlap detection problem

•  Consider the number of fragments. The LW statistics say that we need good coverage (c = 8 to 10) to get most of the base-pairs.

–  G = 3000Mb, L = 500

–  Coverage LN/G = 10

–  N = $10 \cdot 3 \cdot 10^9 / 500 = 6 \cdot 10^7$

–  Number of comparisons needed = $N^2 = 3.6 \cdot 10^{15}$

–  Number of alignments per minute = 6

–  Number of compute nodes = 100

–  Time needed (number of years) =

$$\frac{3.6 \cdot 10^{15}}{6 \cdot 60 \cdot 24 \cdot 365 \cdot 100} \approx 11\text{M}$$

•  Not good! (Only a small fraction are true overlaps)
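The estimate above is easy to re-derive. The short sketch below uses only the figures on the slide (G, L, coverage, 6 alignments per minute, 100 nodes); the variable names are illustrative.

```python
G = 3_000_000_000        # genome length, 3000 Mb
L = 500                  # fragment length
coverage = 10
alignments_per_minute = 6
nodes = 100

N = coverage * G // L                    # number of fragments
comparisons = N * N                      # all-against-all fragment pairs
minutes_per_year = 60 * 24 * 365
years = comparisons / (alignments_per_minute * minutes_per_year * nodes)

print(f"N = {N:.1e}, comparisons = {comparisons:.1e}, years = {years:.2e}")
```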


k-mer based overlap (Pigeonhole principle again)

•  Consider a 25bp sequence.

–  Expected number of occurrences in the genome

–  $3 \cdot 10^9 \cdot 4^{-25} \approx 2.7 \cdot 10^{-6}$

•  A given 25-bp sequence is expected to be unique in the genome!

•  Two overlapping sequences should share a 25-mer

•  Two non-overlapping sequences should not!
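The uniqueness argument rests on one number: the expected count of a fixed k-mer in a random genome. A minimal sketch of that calculation follows, assuming all four bases are equally likely and independent (a simplification; real genomes are not uniform, and repeats break the assumption). The function name is hypothetical.

```python
def expected_kmer_occurrences(genome_length, k):
    """Expected number of times a fixed k-mer appears in a uniform random genome."""
    return genome_length * 4.0 ** (-k)

G = 3_000_000_000
for k in (10, 15, 20, 25):
    print(k, expected_kmer_occurrences(G, k))
```

Under this model, the expected count first drops below 1 at k = 16 for a 3000 Mb genome, and k = 25 leaves a wide margin.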


Sorting k-mers

•  Build a list of k-mers that appear in the sequences and their reverse complements

•  Create a record with 4 entries:

–  K-mer

–  Sequence number

–  Position in the sequence

–  Reverse complementation flag

•  Sort a vector of these according to k-mer

•  How many records per k-mer are expected?

•  If number of records exceeds threshold, discard (why?)

[Figure: table of records with columns K-mer, S.id, Pos.]
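The record-and-sort scheme can be sketched as follows. This is an illustrative implementation, not the one used by any particular assembler: the function names, the tuple layout (k-mer, sequence number, position, reverse-complement flag), and the `max_records` threshold are assumptions. Reads are assumed to contain only A, C, G, T.

```python
from itertools import combinations, groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def kmer_records(reads, k):
    """One record per k-mer of every read and of its reverse complement."""
    records = []
    for read_id, read in enumerate(reads):
        for is_rc, seq in ((False, read), (True, reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                records.append((seq[pos:pos + k], read_id, pos, is_rc))
    return records

def candidate_pairs(reads, k, max_records):
    """Read pairs that share at least one k-mer that is not over-represented."""
    records = sorted(kmer_records(reads, k))    # sorting groups equal k-mers
    pairs = set()
    for _, group in groupby(records, key=lambda r: r[0]):
        group = list(group)
        if len(group) > max_records:
            continue                            # too frequent: likely a repeat
        for a, b in combinations(group, 2):
            if a[1] != b[1]:
                pairs.add((min(a[1], b[1]), max(a[1], b[1])))
    return pairs

reads = ["ACGTTGCAAG", "TGCAAGGCTA", "CCCCCCCCCC"]
print(candidate_pairs(reads, k=5, max_records=8))
```

Only the pairs returned here would be passed to the expensive alignment step, which is what avoids the all-against-all comparison.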


Alignment module

•  Coalesce k-mer hits into longer, gap-free partial alignments.

•  These extended k-mer hits are saved.

•  For each pair of sequences, form a directed graph.

•  For each maximal path in the graph, construct an alignment.

•  Refine alignment via banded DP
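The first step, coalescing k-mer hits into gap-free segments, can be sketched by grouping hits on the same diagonal. This is a minimal illustration under the assumption that each hit is a pair of start positions (one in each sequence) of an exact k-mer match; the function name and the output layout are hypothetical, and the later graph and banded-DP steps are not shown.

```python
from collections import defaultdict

def coalesce_hits(hits, k):
    """Merge k-mer hits into gap-free segments (start_a, start_b, length)."""
    by_diagonal = defaultdict(list)
    for pos_a, pos_b in hits:
        # Hits with equal pos_a - pos_b lie on the same alignment diagonal.
        by_diagonal[pos_a - pos_b].append(pos_a)

    segments = []
    for diagonal, starts in sorted(by_diagonal.items()):
        starts.sort()
        seg_start, seg_end = starts[0], starts[0] + k
        for s in starts[1:]:
            if s <= seg_end:                    # overlapping or touching hits
                seg_end = max(seg_end, s + k)
            else:                               # unmatched stretch: close segment
                segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
                seg_start, seg_end = s, s + k
        segments.append((seg_start, seg_start - diagonal, seg_end - seg_start))
    return segments

hits = [(4, 0), (5, 1), (6, 2), (20, 3)]
print(coalesce_hits(hits, k=5))
```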


Problem2: Size

•  Islands might simply be too small in length

•  σ = (1-T/L) = (1-50/500) = 0.9, c = 8.

•  #Islands = $N e^{-c\sigma}$ = 36K

•  Size of an island = 54K

•  Not enough to make it an acceptable assembly!

•  PLUS, there is the problem of Repeats, Chimerism etc.


Solution 2: Clones can have mate-pairs

•  Recall that we sequence about 1000bp of the end of a clone

•  If we sequenced both ends, we get extra information, particularly if we know the length of the original clone.


Mate Pairs

•  Mate-pairs allow you to merge islands (contigs) into super-contigs


Super-contigs are quite large

•  Make clones of truly predictable length. EX: 3 sets can be used: 2Kb, 10Kb and 50Kb. The variance in these lengths should be small.

•  Use the mate-pairs to order and orient the contigs, and make super-contigs.
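One simple way to start building super-contigs is to count how many mate pairs link each pair of contigs and keep only well-supported links. The sketch below shows just that counting step; it does not order or orient the contigs. The function name, the read-to-contig mapping, and the `min_links` threshold are all illustrative assumptions rather than part of the lecture.

```python
from collections import Counter

def contig_links(read_to_contig, mate_pairs, min_links=2):
    """Count mate pairs whose two reads fall in different contigs."""
    links = Counter()
    for left, right in mate_pairs:
        contig_a = read_to_contig.get(left)
        contig_b = read_to_contig.get(right)
        if contig_a is None or contig_b is None or contig_a == contig_b:
            continue    # unplaced read, or both mates inside one contig
        links[tuple(sorted((contig_a, contig_b)))] += 1
    # A single link may come from a chimeric clone, so require several.
    return {pair: n for pair, n in links.items() if n >= min_links}

read_to_contig = {"r1": "A", "r2": "B", "r3": "A", "r4": "B",
                  "r5": "A", "r6": "C", "r7": "A", "r8": "A"}
mate_pairs = [("r1", "r2"), ("r3", "r4"), ("r5", "r6"), ("r7", "r8")]
print(contig_links(read_to_contig, mate_pairs))
```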


Problem 3: Repeats


Repeats & Chimerisms

•  40-50% of the human genome is made up of repetitive elements.

•  Repeats can cause great problems in the assembly!

•  Chimerism causes a clone to be from two different parts of the genome. This can again give a completely wrong assembly.


Repeat detection

•  Lander Waterman strikes again!

•  The expected number of clones in a Repeat containing island is MUCH larger than in a non-repeat containing island (contig).

•  Thus, every contig can be marked as Unique, or non-unique. In the first step, throw away the non-unique islands.

[Figure: Repeat]


Detecting Repeat Contigs 1: Read Density

•  Compute the log-odds ratio of two hypotheses:

•  H1: The contig is from a unique region of the genome.

•  H2: The contig is from a region that is repeated at least twice
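A minimal sketch of such a log-odds score, assuming the number of reads starting in a contig is Poisson distributed and comparing "unique" against "repeated exactly twice" (a simplification of "at least twice"). With λ the expected read count for a unique contig, the ratio of the two Poisson probabilities for k observed reads is $e^{\lambda} / 2^k$, so the natural-log odds are $\lambda - k \ln 2$. The function and parameter names are hypothetical.

```python
import math

def unique_log_odds(contig_length, reads_in_contig, total_reads, genome_length):
    """ln[ P(k reads | unique) / P(k reads | two copies) ] under a Poisson model."""
    lam = contig_length * total_reads / genome_length   # expected reads if unique
    # Pois(k; lam) / Pois(k; 2*lam) = exp(lam) / 2**k
    return lam - reads_in_contig * math.log(2)

# Illustrative numbers: a 10 kb contig, 1e6 reads, 100 Mb genome (lam = 100).
print(unique_log_odds(10_000, 100, 1_000_000, 100_000_000))   # positive: looks unique
print(unique_log_odds(10_000, 200, 1_000_000, 100_000_000))   # negative: looks repeated
```

A positive score favors the unique hypothesis and a negative score favors the repeat hypothesis; where to place the cutoff is a design choice.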


Detecting Chimeric reads

•  Chimeric reads: Reads that contain sequence from two genomic locations.

•  Good overlaps: G(a,b) if a,b overlap with a high score

•  Transitive overlap: T(a,c) if G(a,b), and G(b,c)

•  Find a point x across which only transitive overlaps occur. Then x is a point of chimerism.


Whole genome shotgun

•  Input:

–  Shotgun sequence fragments (reads)

–  Mate pairs

•  Output:

–  A single sequence created by consensus of overlapping reads

•  First generation of assemblers did not include mate-pairs (Phrap, CAP..)

•  Second generation: CA, Arachne, Euler


Assembly

•  Use k-mers to detect potential overlaps

•  Use alignments to build contig graphs

•  Decide the unique contigs based on LW statistics

–  Discard repeat contigs

–  Break chimeric contigs

•  Use mate-pairs to build scaffolds

•  Fill gaps




Consensus Derivation

•  Consensus sequence is created by converting pairwise read alignments into multiple-read alignments
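Once reads are laid out in a multiple alignment, the simplest consensus rule is a per-column majority vote. The sketch below assumes the alignment is already given as equal-coordinate rows, with leading spaces for a read's offset and "-" for a gap; building that alignment from the pairwise ones is the hard part and is not shown. The function name is hypothetical, and real assemblers typically also weigh base quality.

```python
from collections import Counter

def column_consensus(rows):
    """Majority-vote consensus over the columns of a multiple-read alignment."""
    width = max(len(row) for row in rows)
    consensus = []
    for col in range(width):
        # Collect the symbols of the reads that actually cover this column.
        column = [row[col] for row in rows if col < len(row) and row[col] != " "]
        if not column:
            continue
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != "-":            # a majority gap contributes no base
            consensus.append(symbol)
    return "".join(consensus)

rows = ["ACGTAC",
        "  GTACGG",
        "ACCTA"]
print(column_consensus(rows))
```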


Summary

•  Whole genome shotgun is now routine:

–  Human, Mouse, Rat, Dog, Chimpanzee..

–  Many Prokaryotes (One can be sequenced in a day)

–  Plant genomes: Arabidopsis, Rice

–  Model organisms: Worm, Fly, Yeast

•  A lot is not known about genome structure, organization and function.

–  Comparative genomics offers low hanging fruit


Final exam syllabus

•  Take home

•  The entire course, but emphasis will be given to post-midterm lectures

•  HMMs,

•  Gene-finding,

•  mass spectrometry,

•  Micro-array analysis,

•  genome sequencing and assembly


What we did not cover