The Science of Information: From Communication to DNA Sequencing
TRANSCRIPT
![Page 1: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/1.jpg)
The Science of Information: From Communication to DNA Sequencing
David Tse
U.C. Berkeley
CUHK
December 14, 2012
Research supported by NSF Center for Science of Information.
![Page 2: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/2.jpg)
Communication: the beginning
• Prehistoric: smoke signals, drums
• 1837: telegraph
• 1876: telephone
• 1897: radio
• 1927: television
Communication design tied to the specific source and specific physical medium.
![Page 3: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/3.jpg)
Grand Unification
Shannon 48
source entropy rate: H bits/source sym
channel capacity: C bits/sec
Theorem: max. rate of reliable communication = C/H source sym/sec.
Model all sources and channels statistically.
A unified way of looking at all communication problems in terms of information flow.
[Diagram labels: source, reconstructed source]
![Page 4: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/4.jpg)
60 Years Later
• All communication systems are designed based on the principles of information theory.
• A benchmark for comparing different schemes and different channels.
• Suggests totally new ways of communication (e.g. MIMO, opportunistic communication).
![Page 5: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/5.jpg)
Secrets of Success
• Information, then computation.
It took 60 years, but we got there.
• Simple models, then complex.
The discrete memoryless channel
………… is like the Holy Roman Empire.
![Page 6: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/6.jpg)
Looking Forward
Can the success of this way of thinking be broadened to other fields?
![Page 7: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/7.jpg)
Information Theory of DNA Sequencing
![Page 8: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/8.jpg)
DNA sequencing
A basic workhorse of modern biology and medicine.
Problem: to obtain the sequence of nucleotides.
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
courtesy: Batzoglou
![Page 9: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/9.jpg)
Impetus: Human Genome Project
1990: Start
2001: Draft
2003: Finished
3 billion nucleotides
courtesy: Batzoglou
3 billion $$$$
![Page 10: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/10.jpg)
Sequencing gets cheaper and faster
Cost of one human genome
• HGP: $3 billion
• 2004: $30,000,000
• 2008: $100,000
• 2010: $10,000
• 2011: $4,000
• 2012-13: $1,000
• ???: $300
Time to sequence one genome: years → days
Massive parallelization.
courtesy: Batzoglou
![Page 11: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/11.jpg)
But many genomes to sequence
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
courtesy: Batzoglou
![Page 12: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/12.jpg)
Whole Genome Shotgun Sequencing
Reads are assembled to reconstruct the original DNA sequence.
read length L ≈ 100 - 1000
number of reads N ≈ 10^8
genome length G ≈ 10^9
![Page 13: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/13.jpg)
A Gigantic Jigsaw Puzzle
![Page 14: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/14.jpg)
Many Sequencing Technologies
• HGP era: single technology (Sanger)
• Current: multiple “next generation” technologies (e.g. Illumina, SOLiD, PacBio, Ion Torrent, etc.)
• Each technology has different read length, noise statistics, etc.
E.g.: Illumina: L = 50 to 200, error ~ 1% substitution
PacBio: L = 2000 to 4000, error ~ 10-15% indels
![Page 15: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/15.jpg)
Many assembly algorithms
Source: Wikipedia
![Page 16: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/16.jpg)
And many more…….
A grand total of 42!
![Page 17: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/17.jpg)
Computational View
“Since it is well known that the assembly problem is NP-hard, …………”
• algorithm design based largely on heuristics
• no optimality or performance guarantees
But NP-hardness does not mean it is hopeless to be close to optimal.
Can we first define optimality without regard to computational complexity?
![Page 18: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/18.jpg)
Information theoretic view
• Given a statistical model, what is the read length L and number of reads N needed to reconstruct with probability 1-ε ?
• Are there computationally efficient assembly algorithms that perform close to the fundamental limits?
Open questions!
![Page 19: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/19.jpg)
A basic read model
• Reads are uniformly sampled from the DNA sequence.
• Read process is noiseless.
Impact of noise: later.
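A minimal sketch of this read model, assuming a linear (non-circular) toy genome and fixed-length reads. The names `sample_reads` and `coverage_depth` and the toy parameters are hypothetical, chosen only for illustration.

```python
import random

def sample_reads(genome, read_len, num_reads, seed=0):
    """Draw noiseless reads whose start positions are uniform over the genome."""
    rng = random.Random(seed)
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(num_reads)]
    return [genome[s:s + read_len] for s in starts]

def coverage_depth(genome_len, read_len, num_reads):
    """Average number of reads covering a base: N * L / G."""
    return num_reads * read_len / genome_len

# Toy example: a random 1000-base genome, 200 reads of length 50.
_rng = random.Random(1)
toy_genome = "".join(_rng.choice("ACGT") for _ in range(1000))
reads = sample_reads(toy_genome, read_len=50, num_reads=200)
depth = coverage_depth(len(toy_genome), 50, 200)
```

With L = 100, N = 10^8 and G = 10^9, the same `coverage_depth` arithmetic gives an average depth of 10.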
![Page 20: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/20.jpg)
Coverage Analysis
• Pioneered by Lander-Waterman in 1988.
• What is the number of reads needed to cover the entire DNA sequence with probability 1-ε?
• Ncov only provides a lower bound on the number of reads needed for reconstruction.
• Ncov does not depend on the DNA statistics!
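A rough numerical sketch of the coverage requirement, assuming the standard Poisson approximation in which the expected number of uncovered gaps is about N·exp(-N·L/G). Setting this equal to ε gives the fixed-point equation N = (G/L)·ln(N/ε). The formula and the names `expected_gaps` and `n_cov` are assumptions for illustration; the slide's own expression did not survive extraction.

```python
import math

def expected_gaps(num_reads, read_len, genome_len):
    """Poisson approximation to the expected number of uncovered gaps."""
    return num_reads * math.exp(-num_reads * read_len / genome_len)

def n_cov(read_len, genome_len, eps, iters=100):
    """Solve N = (G/L) * ln(N/eps) by fixed-point iteration."""
    ratio = genome_len / read_len
    n = ratio  # starting guess: one read per read-length of genome
    for _ in range(iters):
        n = ratio * math.log(n / eps)
    return n

# Example with the orders of magnitude quoted earlier: L = 100, G = 10^9.
n_needed = n_cov(read_len=100, genome_len=10**9, eps=0.01)
```

Note that only L, G and ε enter the computation, which is the sense in which Ncov does not depend on the DNA statistics.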
![Page 21: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/21.jpg)
Repeat statistics do matter!
easier jigsaw puzzle harder jigsaw puzzle
How exactly do the fundamental limits depend on repeat statistics?
![Page 22: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/22.jpg)
Simple model: I.I.D. DNA, G → ∞
(Motahari, Bresler & T. 12)
[Figure: normalized # of reads vs. read length L. Region labels: many repeats of length L; no repeats of length L; no coverage; coverage; reconstructable by greedy algorithm.]
What about for finite real DNA?
![Page 23: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/23.jpg)
I.I.D. DNA vs real DNA
Example: human chromosome 22 (build GRCh37, G = 35M)
(Bresler, Bresler & T. 12)
[Figure: log(# of ℓ-repeats) vs. ℓ. Curves: i.i.d. fit, data.]
Can we derive performance bounds directly in terms of empirical repeat statistics?
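A small sketch of how empirical repeat statistics could be tabulated. It uses a simplified definition (an ℓ-repeat is any length-ℓ substring that occurs at least twice), which is an assumption and may differ from the definition behind the plotted curves. The function names are hypothetical.

```python
from collections import Counter

def repeated_kmers(seq, ell):
    """Map each length-ell substring occurring at least twice to its count."""
    counts = Counter(seq[i:i + ell] for i in range(len(seq) - ell + 1))
    return {kmer: c for kmer, c in counts.items() if c >= 2}

def longest_repeat_length(seq):
    """Largest ell for which some length-ell substring occurs twice.

    Any repeat of length ell + 1 contains a repeat of length ell,
    so scanning ell upward until no repeat is found is sufficient.
    """
    ell = 0
    while ell + 1 <= len(seq) - 1 and repeated_kmers(seq, ell + 1):
        ell += 1
    return ell

example = "ATAGCCCTAGCGAT"
repeats_of_length_4 = repeated_kmers(example, 4)
longest = longest_repeat_length(example)
```

This brute-force scan is quadratic in the worst case; for chromosome-scale data a suffix array or suffix tree would be the usual choice.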
![Page 24: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/24.jpg)
Lower bound: Interleaved repeats
Necessary condition: all interleaved repeats are bridged.
In particular: L > longest interleaved repeat length (Ukkonen)
[Figure: interleaved repeats m and n; read length L.]
![Page 25: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/25.jpg)
Lower bound: Triple repeats
Necessary condition: all triple repeats are bridged.
In particular: L > longest triple repeat length (Ukkonen)
[Figure: triple repeat; read length L.]
![Page 26: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/26.jpg)
Chromosome 22 (Lower Bound)
GRCh37 Chr 22 (G = 35M)
[Figure: plot with axis labels ℓ and log(# of ℓ-repeats); curves labeled triple repeat, interleaved repeat, coverage.]
What is achievable?
![Page 27: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/27.jpg)
Greedy algorithm (TIGR Assembler, phrap, CAP3...)
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads
2. Find two contigs with largest overlap and merge them into a new contig
3. Repeat step 2 until only one contig remains
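The three steps above can be sketched as follows. This is a brute-force illustration under the noiseless read model, not the implementation used by TIGR Assembler, phrap or CAP3; the names `overlap` and `greedy_assemble` and the toy genome are hypothetical. Contigs contained in a newly merged contig are dropped, which is an added simplification.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Repeatedly merge the two contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # step 1: contigs = distinct reads
    while len(contigs) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # step 2: merge the best pair into one contig
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        # drop contigs already contained in the merged contig
        contigs = [c for c in rest if c not in merged] + [merged]
    return contigs[0]  # step 3: stop when one contig remains

# Toy genome with no repeated 5-mer; every length-6 read, in reverse order.
toy_genome = "ACGGTCATTGCA"
toy_reads = [toy_genome[i:i + 6] for i in range(len(toy_genome) - 5)][::-1]
assembled = greedy_assemble(toy_reads)
```

On this toy input every overlap of length 5 is a true overlap, so each greedy merge is correct. With longer repeats that are not bridged by a read, the largest overlap can be spurious and the merge can be wrong.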
![Page 28: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/28.jpg)
Greedy algorithm: first error at overlap
A sufficient condition for reconstruction: all repeats are bridged.
[Figure: repeat; bridging read; already merged contigs; read length L.]
![Page 29: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/29.jpg)
Chromosome 22
GRCh37 Chr 22 (G = 35M)
[Figure: plot with axis labels ℓ and log(# of ℓ-repeats); curves labeled greedy algorithm, lower bound.]
![Page 30: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/30.jpg)
Chromosome 19
GRCh37 Chr 19 (G = 55M)
[Figure: plot with axis label log(# of ℓ-repeats); curves labeled greedy algorithm, lower bound; annotations: longest interleaved repeats at length 2248; longest repeat at]
Non-interleaved repeats are resolvable!
![Page 31: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/31.jpg)
de Bruijn graph
[Idury-Waterman 95]
[Pevzner et al 01]
ATAGCCCTAGCGAT (K = 4)
[Figure: graph with nodes ATAG, TAGC, AGCC, AGCG, GCCC, GCGA, CCCT, CCTA, CTAG, CGAT.]
1. Add a node for each K-mer in a read
2. Add edges for adjacent K-mers
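The two construction steps above can be sketched as follows, using the example string ATAGCCCTAGCGAT with K = 4 as a single read. The function name `de_bruijn_graph` and the adjacency-list representation are assumptions made for illustration.

```python
def de_bruijn_graph(reads, k):
    """Nodes are K-mers; an edge joins K-mers that are adjacent in a read."""
    successors = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            successors.setdefault(kmer, set())      # step 1: add a node
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)             # step 2: add an edge
    return {node: sorted(nxt) for node, nxt in successors.items()}

graph = de_bruijn_graph(["ATAGCCCTAGCGAT"], k=4)
```

In this example the node TAGC has two successors, AGCC and AGCG, because TAGC is a repeat in the sequence.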
![Page 32: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/32.jpg)
Resolving non-interleaved repeats
non-interleaved repeat
Unique Eulerian path.
![Page 33: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/33.jpg)
Resolving bridged interleaved repeats
interleaved repeat
bridging read
Bridging read resolves one repeat and the unique Eulerian path resolves the other.
![Page 34: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/34.jpg)
Resolving triple repeats
[Figure: triple repeat with all copies bridged; neighborhood of triple repeat.]
All copies bridged: resolve repeat locally.
![Page 35: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/35.jpg)
Multibridging de Bruijn
Theorem:
Original sequence is reconstructable if:
1. triple repeats are all-bridged
2. interleaved repeats are (single) bridged
3. coverage
Necessary conditions for ANY algorithm:
1. triple repeats are (single) bridged
2. interleaved repeats are (single) bridged
3. coverage
(Bresler, Bresler & T. 12)
![Page 36: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/36.jpg)
Chromosome 19
GRCh37 Chr 19 (G = 55M)
[Figure: plot with axis label log(# of ℓ-repeats); curves labeled lower bound, triple repeat; annotations: longest interleaved repeats at length 2248; longest repeat at]
De Bruijn algorithm close to optimal.
![Page 37: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/37.jpg)
GAGE Benchmark Datasets
http://gage.cbcb.umd.edu/
• Rhodobacter sphaeroides: G = 4,603,060
• Staphylococcus aureus: G = 2,903,081
• Human Chromosome 14: G = 88,289,540
[Figure: one plot per dataset; curves: i.i.d. fit, data.]
![Page 38: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/38.jpg)
Gap
Sulfolobus islandicus, G = 2,655,198
[Figure: curves labeled triple repeat lower bound, interleaved repeat lower bound, de Bruijn algorithm.]
![Page 39: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/39.jpg)
Read Noise
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTTATACTTA
Illumina noise profile
Each symbol corrupted by a noisy channel.
![Page 40: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/40.jpg)
Erasures on i.i.d. uniform DNA
Theorem:
If the erasure probability is less than 1/3, then noiseless performance can be achieved.
A separation architecture is optimal: error correction, then assembly.
(Ma, Motahari, Ramchandran & T. 12)
![Page 41: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/41.jpg)
Why?
• Coverage means most positions are covered by many reads.
• Aligning noisy reads locally is easier than assembling noiseless reads globally for p_erasure < 1/3.
noise averaging
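A toy sketch of noise averaging under the erasure model, assuming the reads have already been aligned (their start positions are given). The alignment step itself, which is the hard part, is not shown, and the 1/3 threshold is not demonstrated. The erasure symbol `?` and the names `erase` and `consensus` are hypothetical.

```python
import random

ERASED = "?"

def erase(read, p, rng):
    """Erase each symbol independently with probability p."""
    return "".join(ERASED if rng.random() < p else base for base in read)

def consensus(aligned_reads):
    """Fill each position from any aligned read that did not erase it.

    aligned_reads is a list of (start, read) pairs with start >= 0.
    Positions erased in every covering read stay marked as erased.
    """
    length = max(start + len(read) for start, read in aligned_reads)
    out = [ERASED] * length
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            if base != ERASED:
                out[start + offset] = base
    return "".join(out)

# Three overlapping erased reads of the sequence ACGTCA.
recovered = consensus([(0, "AC?T"), (2, "GT?A"), (1, "C?TC")])
```

Because coverage places many reads over most positions, a symbol is lost only if it is erased in every read that covers it.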
![Page 42: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/42.jpg)
Conclusions
• A systematic approach to assembly design based on information.
• More powerful than just computational complexity considerations.
• Simple models are useful for initial insights but a data-driven approach yields a more complete picture.
![Page 43: The Science of Information: From Communication to DNA Sequencing TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA David Tse U.C. Berkeley CUHK December 14,](https://reader031.vdocument.in/reader031/viewer/2022032312/56649e115503460f94afdd32/html5/thumbnails/43.jpg)
Collaborators
Abolfazl Motahari
Guy Bresler
Ma’ayan Bresler
Nan Ma
Kannan Ramchandran
Acknowledgments
Yun Song
Lior Pachter
Serafim Batzoglou