![Page 1: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/1.jpg)
High Throughput Sequencing: Microscope in the Big Data Era
Sreeram Kannan and David Tse
Tutorial
ISIT 2014
Research supported by NSF Center for Science of Information.
![Page 2: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/2.jpg)
DNA sequencing
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
![Page 3: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/3.jpg)
High throughput sequencing revolution
tech. driver for communications
![Page 4: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/4.jpg)
Shotgun sequencing
read
![Page 5: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/5.jpg)
Technologies
| Sequencer | Mechanism | Read length | Error rate | Output data (per run) |
|---|---|---|---|---|
| Sanger 3730xl | Dideoxy chain termination | 400-900 bp | 0.001% | 100 KB |
| 454 GS | Pyrosequencing | 700 bp | 0.1% | 1 GB |
| Ion Torrent | Detection of hydrogen ion | ~400 bp | 2% | 100 GB |
| SOLiDv4 | Ligation and two-base coding | 50 + 50 bp | 0.1% | 100 GB |
| Illumina HiSeq 2000 | Reversible nucleotides | 100 bp PE | 2% | 1 TB |
| Pac Bio | Single molecule real time | 1000~10000 bp | 10-15% | 10 GB |
![Page 6: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/6.jpg)
High throughput sequencing: Microscope in the big data era
Genomic variations, 3-D structures, transcription, translation, protein interaction, etc.
The quantities measured can be dynamic and vary spatially.
Example: RNA expression is different in different tissues and at different times.
![Page 7: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/7.jpg)
Computational problems for high throughput data
measure data
manage data
utilize data
• Assembly (de Novo)
• Variant calling (reference-based assembly)
• Compression
• Privacy
• Genome wide association studies
• Phylogenetic tree reconstruction
• Pathogen detection
Scope of this tutorial
![Page 8: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/8.jpg)
Assembly: three points of view
• Software engineering
• Computational complexity theoretic
• Information theoretic
![Page 9: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/9.jpg)
Assembly as a software engineering problem
• A single sequencing experiment can generate hundreds of millions of reads, amounting to tens to hundreds of gigabytes of data.
• Primary concerns are to minimize time and memory requirements.
• No guarantee on optimality of assembly quality and in fact no optimality criterion at all.
![Page 10: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/10.jpg)
Computational complexity view
• Formulate the assembly problem as a combinatorial optimization problem:
– Shortest common superstring (Kececioglu-Myers 95)
– Maximum likelihood (Medvedev-Brudno 09)
– Hamiltonian path on overlap graph (Nagarajan-Pop 09)
• Typically NP-hard and even hard to approximate.
• Does not address the question of when the solution reconstructs the ground truth.
![Page 11: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/11.jpg)
Information theoretic view
Basic question:
What is the quality and quantity of read data needed to reliably reconstruct?
![Page 12: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/12.jpg)
Tutorial outline
I. De Novo DNA assembly.
II. Reference-based DNA assembly.
III. De Novo RNA assembly
![Page 13: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/13.jpg)
Themes
• Interplay between information and computational complexity.
• Role of empirical data in driving theory and algorithm development.
![Page 14: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/14.jpg)
Part I:
De Novo DNA Assembly
![Page 15: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/15.jpg)
Shotgun sequencing model
Basic model : uniformly sampled reads.
Assembly problem: reconstruct the genome given the reads.
![Page 16: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/16.jpg)
A Gigantic Jigsaw Puzzle
![Page 17: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/17.jpg)
Challenges
Long repeats
[Figure: Human Chr 22 repeat length histogram, log(# of ℓ-repeats) vs. repeat length ℓ]
Read errors
[Figure: Illumina read error profile]
![Page 18: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/18.jpg)
Two-step approach
• First, we assume the reads are noiseless
• Derive fundamental limits and near-optimal assembly algorithms.
• Then, we add noise and see how things change.
![Page 19: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/19.jpg)
Repeat statistics
[Figure: an easier jigsaw puzzle next to a harder jigsaw puzzle]
How exactly do the fundamental limits depend on repeat statistics?
![Page 20: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/20.jpg)
Lower bound: coverage
• Introduced by Lander-Waterman in 1988.
• What is the number of reads, N_LW, needed to cover the entire DNA sequence with probability 1-ε?
• N_LW only provides a lower bound on the number of reads needed for reconstruction.
• N_LW does not depend on the DNA repeat statistics!
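A rough numerical sketch of the coverage bound follows. It rests on an assumption that is not stated on the slide: read start positions are modelled as uniform on a genome of length G, so the expected number of uncovered gaps is about N·exp(−N·L/G), and setting this equal to ε gives N = (G/L)·ln(N/ε). The function name and the fixed-point iteration are illustrative choices.

```python
import math

def lander_waterman_reads(G, L, eps, iters=100):
    """Approximate N_LW by iterating N <- (G / L) * ln(N / eps).

    Assumes the approximation N * exp(-N * L / G) = eps for the
    expected number of uncovered gaps (an assumption of this sketch).
    """
    c = G / L          # number of reads for one-fold coverage
    n = c              # starting guess
    for _ in range(iters):
        n = c * math.log(n / eps)
    return n

# Example: coverage depth N * L / G for a 1 Mbp genome and 100 bp reads.
n_lw = lander_waterman_reads(G=1_000_000, L=100, eps=0.01)
depth = n_lw * 100 / 1_000_000
```

The repeat structure of the genome does not appear anywhere in this computation, which matches the last bullet above.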
![Page 21: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/21.jpg)
Simple model: I.I.D. DNA, G → ∞
(Motahari, Bresler & Tse 12)
[Figure: normalized # of reads vs. read length L. Region labels: "no coverage", "coverage", "many repeats of length L", "no repeats of length L", "reconstructable by greedy algorithm"; axis tick at 1]
What about for finite real DNA?
![Page 22: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/22.jpg)
I.I.D. DNA vs real DNA
[Figure: log(# of ℓ-repeats) vs. ℓ, comparing the i.i.d. fit with the data]
Example: human chromosome 22 (build GRCh37, G = 35M)
(Bresler, Bresler & Tse 12)
Can we derive performance bounds on an individual sequence basis?
![Page 23: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/23.jpg)
Individual sequence performance bounds
Given a genome s
Human Chr 19 Build 37
(Bresler, Bresler, Tse BMC Bioinformatics 13)
[Figure: repeat length and log(# of ℓ-repeats), with Lcritical marked. Curves: GREEDY, DEBRUIJN, SIMPLEBRIDGING, MULTIBRIDGING, Lander-Waterman coverage, ML lower bound]
![Page 24: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/24.jpg)
GAGE Benchmark Datasets
http://gage.cbcb.umd.edu/
• Rhodobacter sphaeroides, G = 4,603,060
• Staphylococcus aureus, G = 2,903,081
• Human Chromosome 14, G = 88,289,540
[Figure: one panel per dataset, each showing MULTIBRIDGING against the lower bound]
![Page 25: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/25.jpg)
Lower bound: Interleaved repeats
Necessary condition:
all interleaved repeats are bridged.
[Figure: a pair of interleaved repeats, labeled m and n, with read length L]
In particular: L > longest interleaved repeat length (Ukkonen)
![Page 26: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/26.jpg)
Lower bound: Triple repeats
Necessary condition:
all triple repeats are bridged
In particular: L > longest triple repeat length (Ukkonen)
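For a toy sequence, the longest triple repeat can be found by brute force. The sketch below uses a simplified reading of "triple repeat" as a substring that occurs at least three times, with overlapping occurrences counted; the function name is hypothetical and the search is far too slow for real genomes.

```python
from collections import Counter

def longest_repeat_length(genome, min_copies):
    """Length of the longest substring occurring at least min_copies times.

    Brute force over all substring lengths, longest first.
    Overlapping occurrences are counted (a simplifying assumption).
    """
    for length in range(len(genome) - 1, 0, -1):
        counts = Counter(
            genome[i:i + length] for i in range(len(genome) - length + 1)
        )
        if max(counts.values()) >= min_copies:
            return length
    return 0

# Longest substring seen at least three times in a toy sequence.
l_triple = longest_repeat_length("ACGACGACG", min_copies=3)
```

Calling the same function with `min_copies=2` gives the longest ordinary repeat, which is what a greedy assembler would need to bridge.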
![Page 27: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/27.jpg)
Individual sequence performance bounds
Human Chr 19 Build 37
(Bresler, Bresler, Tse BMC Bioinformatics 13)
[Figure: length and log(# of ℓ-repeats). Curves: Lander-Waterman coverage, lower bound]
![Page 28: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/28.jpg)
Greedy algorithm (TIGR Assembler, phrap, CAP3...)
Input: the set of N reads of length L
1. Set the initial set of contigs as the reads
2. Find two contigs with largest overlap and merge them into a new contig
3. Repeat step 2 until only one contig remains
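The three steps above can be sketched directly. This is a minimal illustration with a quadratic pair search, assuming noiseless reads; it is not how TIGR Assembler, phrap or CAP3 are implemented, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Merge the pair of contigs with the largest overlap until one remains."""
    contigs = list(dict.fromkeys(reads))  # step 1 (duplicate reads dropped)
    while len(contigs) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(contigs):       # step 2: search all ordered pairs
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)                # step 3: loop until one contig
    return contigs[0]

# Toy usage: reads of length 7 starting at every position of a short genome.
genome = "ATAGACCCTAGACGAT"
reads = [genome[i:i + 7] for i in range(len(genome) - 7 + 1)]
assembled = greedy_assemble(reads)
```

In the toy usage the reads are longer than every repeat in the sequence, so each largest-overlap merge joins true neighbors.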
![Page 29: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/29.jpg)
Greedy algorithm: first error at overlap
A sufficient condition for reconstruction:
all repeats are bridged
[Figure: a repeat with a bridging read, already merged contigs, and read length L]
![Page 30: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/30.jpg)
Back to chromosome 19
GRCh37 Chr 19 (G = 55M)
[Figure: log(# of ℓ-repeats) vs. repeat length. Annotations: "longest interleaved repeats at length 2248", "longest repeat at", "lower bound", "greedy algorithm"]
non-interleaved repeats are resolvable!
![Page 31: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/31.jpg)
Dense Read Model
• As the number of reads N increases, one can recover exactly the L-spectrum of the genome.
• If there is at least one non-repeating L-mer on the genome, this is equivalent information to having a read at every starting position on the genome.
• Key question:
What is the minimum read length L for which the genome is uniquely reconstructable from its L-spectrum?
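As a small illustration of the object in question, the L-spectrum of a known toy sequence can be computed as follows. Treating the spectrum as a multiset (L-mers with their counts) is an assumption of this sketch, chosen to match the edge multiplicities used in the de Bruijn graph construction.

```python
from collections import Counter

def l_spectrum(genome, L):
    """Multiset of all length-L substrings (L-mers) of the genome."""
    return Counter(genome[i:i + L] for i in range(len(genome) - L + 1))

spectrum = l_spectrum("ATAGACCCTAGACGAT", 5)
repeated = [lmer for lmer, count in spectrum.items() if count > 1]
```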
![Page 32: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/32.jpg)
de Bruijn graph
ATAGACCCTAGACGAT
1. Add a node for each (L-1)-mer on the genome.
2. Add k edges between two (L-1)-mers if their overlap has length L-2 and the corresponding L-mer appears k times in genome.
Example (L = 5). Nodes, i.e. the (L-1)-mers of the sequence above:
ATAG, TAGA, AGAC, GACC, ACCC, CCCT, CCTA, CTAG, GACG, ACGA, CGAT
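The two construction steps can be sketched as below for a known toy sequence. The representation (a set of nodes plus a counter of edge multiplicities) and the function name are illustrative choices, not taken from the slides.

```python
from collections import Counter

def de_bruijn_graph(genome, L):
    """Return (nodes, edges) for the de Bruijn graph of the genome.

    nodes: set of (L-1)-mers.
    edges: Counter mapping (u, v) to k, the number of times the
           L-mer spelled by u followed by the last base of v occurs.
    """
    nodes = {genome[i:i + L - 1] for i in range(len(genome) - L + 2)}
    edges = Counter()
    for i in range(len(genome) - L + 1):
        lmer = genome[i:i + L]
        edges[(lmer[:-1], lmer[1:])] += 1
    return nodes, edges

nodes, edges = de_bruijn_graph("ATAGACCCTAGACGAT", 5)
```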
![Page 33: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/33.jpg)
Eulerian path
ATAGACCCTAGACGAT
(L = 5)
[Figure: the de Bruijn graph of the sequence above, traversed by its Eulerian path]
Theorem (Pevzner 95):
If L > max(ℓ_interleaved, ℓ_triple), then the de Bruijn graph has a unique Eulerian path, which is the original genome.
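A minimal sketch of reading a sequence off an Eulerian path, using Hierholzer's algorithm on the graph built from a list of L-mers. The choice of Hierholzer's algorithm is an assumption of this sketch. It returns some Eulerian path and does not check uniqueness, so the output equals the genome only when the condition of the theorem holds.

```python
from collections import defaultdict

def eulerian_genome(lmers):
    """Spell the sequence along an Eulerian path of the de Bruijn graph."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for lmer in lmers:
        u, v = lmer[:-1], lmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n, d in balance.items() if d == 1),
                 next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())             # dead end: emit node
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

genome = "ATAGACCCTAGACGAT"
lmers = [genome[i:i + 5] for i in range(len(genome) - 5 + 1)]
reconstructed = eulerian_genome(lmers)
```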
![Page 34: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/34.jpg)
Resolving non-interleaved repeats
non-interleaved repeat
Unique Eulerian path.
Condensed sequence graph
![Page 35: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/35.jpg)
From dense reads to shotgun reads
[Idury-Waterman 95]
[Pevzner et al 01]
Idea: mimic the dense read scenario by looking at K-mers of the length-L reads.
Construct the K-mer graph and find an Eulerian path.
Success if we have K-coverage of the genome and K > Lcritical.
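The idea of cutting shotgun reads into K-mers can be sketched as below. The coverage check uses one simplified reading of "K-coverage", namely that every K-mer of a known toy genome appears in at least one read; both function names are hypothetical.

```python
def kmer_set(reads, K):
    """All K-mers that occur in at least one read (needs K <= read length)."""
    return {read[i:i + K] for read in reads
            for i in range(len(read) - K + 1)}

def has_k_coverage(genome, reads, K):
    """True if every K-mer of the (known, toy) genome appears in some read."""
    genome_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}
    return genome_kmers <= kmer_set(reads, K)

genome = "ATAGACCCTAGACGAT"
reads = [genome[0:9], genome[5:13], genome[9:16]]
covered = has_k_coverage(genome, reads, K=5)
```

When the check passes, the K-mers extracted from the reads carry the same information as the K-spectrum in the dense read model, apart from multiplicities.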
![Page 36: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/36.jpg)
De Bruijn algorithm: performance
Human Chr 19 Build 37
[Figure: GREEDY and DEBRUIJN curves plotted against length, with the Lander-Waterman coverage and the lower bound]
Loss of info. from the reads!
![Page 37: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/37.jpg)
Resolving bridged interleaved repeats
interleaved repeat
bridging read
Bridging read resolves one repeat and the unique Eulerian path resolves the other.
![Page 38: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/38.jpg)
Simple bridging: performance
Human Chr 19 Build 37
[Figure: GREEDY, DEBRUIJN and SIMPLEBRIDGING curves plotted against length, with the Lander-Waterman coverage and the lower bound]
![Page 39: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/39.jpg)
Resolving triple repeats
[Figure: a triple repeat with all copies bridged, and the neighborhood of the triple repeat]
all copies bridged: resolve repeat locally
![Page 40: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/40.jpg)
Triple Repeats: subtleties
![Page 41: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/41.jpg)
Multibridging de Bruijn
Theorem: (Bresler, Bresler, Tse 13)
Original sequence is reconstructable if:
1. triple repeats are all-bridged
2. interleaved repeats are (single) bridged
3. coverage
Necessary conditions for ANY algorithm:
1. triple repeats are (single) bridged
2. interleaved repeats are (single) bridged
3. coverage
![Page 42: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/42.jpg)
Multibridging: near optimality for Chr 19
Human Chr 19 Build 37
[Figure: GREEDY, DEBRUIJN, SIMPLEBRIDGING and MULTIBRIDGING curves plotted against length, with the Lander-Waterman coverage and the lower bound]
![Page 43: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/43.jpg)
GAGE Benchmark Datasets
http://gage.cbcb.umd.edu/
• Rhodobacter sphaeroides, G = 4,603,060
• Staphylococcus aureus, G = 2,903,081
• Human Chromosome 14, G = 88,289,540
[Figure: one panel per dataset, each showing MULTIBRIDGING against the lower bound, with Lcritical marked]
Lcritical = length of the longest triple or interleaved repeat.
![Page 44: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/44.jpg)
Gap
Sulfolobus islandicus, G = 2,655,198
[Figure: triple repeat lower bound, interleaved repeat lower bound, and the MULTIBRIDGING algorithm]
![Page 45: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/45.jpg)
Complexity: Computational vs Informational
• Complexity of MULTIBRIDGING
– For a genome of length G, O(G^2)
• Alternate formulations of assembly
– Shortest Common Superstring: NP-hard
– Greedy is O(G), but only a 4-approximation to SCS in the worst case
– Maximum Likelihood: NP-hard
• Key differences
– We are concerned only with instances when reads are informationally sufficient to reconstruct the genome.
– Individual sequence formulation lets us focus on issues arising only in real genomes.
![Page 46: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/46.jpg)
Confidence
• When the algorithm obtains an answer, can it be sure?
• Under the dense read model, we can guarantee that when there is a unique Eulerian cycle, the reconstructed answer is correct.
– This happens whenever L > max(ℓ_interleaved, ℓ_triple)
• Conversely, when L < max(ℓ_interleaved, ℓ_triple), there are multiple reconstructions that are consistent with the observed data.
• Under the shotgun read model, there is ambiguity in some scenarios.
![Page 47: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/47.jpg)
Read Errors
The error rate and the nature of the errors depend on the sequencing technology.
Examples:
• Illumina: 0.1 – 2% substitution errors
• PacBio: 10 – 15% indel errors
We will focus on a simple substitution noise model with noise parameter p.
ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTTATACTTA
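A minimal simulation of a substitution noise model with parameter p. One detail is an assumption of this sketch rather than something stated on the slide: when a base is corrupted, it is replaced by one of the other three bases chosen uniformly at random.

```python
import random

def add_substitution_noise(read, p, rng=None):
    """Corrupt each base independently with probability p.

    A corrupted base becomes one of the other three bases, chosen
    uniformly (an assumption of this sketch).
    """
    rng = rng or random.Random()
    noisy = []
    for base in read:
        if rng.random() < p:
            noisy.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            noisy.append(base)
    return "".join(noisy)

noisy_read = add_substitution_noise("ACGTCCTATGCGTATGCG", p=0.02,
                                    rng=random.Random(0))
```

Under this assumption, p = ¾ makes every output base uniform on {A, C, G, T} regardless of the input, so the noisy reads carry no information about the genome.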
![Page 48: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/48.jpg)
Consistency
Basic question:
What is the impact of noise on Lcritical?
This question is equivalent to whether the L-spectrum is exactly recoverable as the number of noisy reads N → ∞.
Theorem (C.C. Wang 13):
Yes, for all p except p = ¾.
![Page 49: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/49.jpg)
What about coverage depth?
Theorem (Motahari, Ramchandran,Tse, Ma 13):
Assume i.i.d. genome model. If read error rate p is less than a threshold, then Lander-Waterman coverage is sufficient for L > Lcritical.
For uniform distribution on {A,G,C,T}, threshold is 19%.
A separation architecture is optimal:
error correction → assembly
![Page 50: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/50.jpg)
Why?
• Coverage means most positions are covered by many reads.
• Multiple aligning overlapping noisy reads is possible if
• Assembly using noiseless reads is possible if
noise averaging
M
![Page 51: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/51.jpg)
From theory to practice
Two issues:
1) Multiple alignment is performed by testing joint typicality of M sequences, computationally too expensive.
Solution: use the technique of fingerprinting.
2) Real genomes are not i.i.d.
Solution: replace greedy by multibridging.
![Page 52: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/52.jpg)
X-phased multibridging
(Lam, Khalak, Tse, RECOMB-Seq 14)
[Figure: Prochlorococcus marinus, substitution errors of rate 1.5%, with Lcritical marked]
![Page 53: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/53.jpg)
More results
[Figure: four panels, each with Lcritical marked: Helicobacter pylori, Methanococcus maripaludis, Mycoplasma agalactiae, Prochlorococcus marinus]
![Page 54: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/54.jpg)
A more careful look
[Figure: Mycoplasma agalactiae, with Lcritical and Lcritical-approx marked]
![Page 55: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/55.jpg)
Approximate repeat example: Yersinia pestis
exact triple repeat, length 1662
approximate triple repeat, length 5608
![Page 56: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/56.jpg)
Application: finishing tool for PacBio reads
raw_reads.fasta → PacBio Assembler (HGAP) → contigs.fasta
raw_reads.fasta + contigs.fasta → our finishingTool → improved_contigs.fasta
https://github.com/kakitone/finishingTool
![Page 57: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/57.jpg)
Experimental results
[Figure: before and after panels for Escherichia coli, Meiothermus ruber and Pedobacter heparinus]
![Page 58: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/58.jpg)
More detail of the result
| Species | Before [N_contigs] | After [N_contigs] | % Match with reference | Time | Size |
|---|---|---|---|---|---|
| Escherichia coli (MG 1655) | 21 | 7 [finisherSC] | 99.60 | < 3 mins (laptop) | ~ 4.6M |
| Meiothermus ruber (DSM 1279) | 3 | 1 [finisherSC] | 99.99 | < 1 min (laptop) | ~ 3.0M |
| Pedobacter heparinus (DSM 2366) | 18 | 5 [finisherSC] | 99.89 | < 3 mins (laptop) | ~ 5.1M |
| S. cerevisiae (fungus) | 252 | 78 [finisherSC] | 95.46 | < 3 hours (laptop) | ~ 12.4M |
| S. cerevisiae (fungus) | 252 | 55 [Greedy] | 53.91 | < 3 hours (laptop) | ~ 12.4M |
![Page 59: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/59.jpg)
Part II:
Reference-Based DNA Assembly
(Mohajer, Kannan, Tse ‘14)
![Page 60: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/60.jpg)
Many genomes to sequence…
100 million species (e.g. phylogeny)
7 billion individuals (SNP, personal genomics)
10^13 cells in a human (e.g. somatic mutations such as HIV, cancer)
courtesy: Batzoglou
… but not all independent
![Page 61: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/61.jpg)
Reference Based Assembly: Formulation
Target: ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC
Reference: ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC
The reference is side information available to the assembler.
![Page 62: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/62.jpg)
Types of Variations
Substitutions (Single Nucleotide Polymorphisms: SNP)
Reference: ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC
Target: ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC
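For the substitution-only case, the variant positions and an empirical SNP rate follow from a position-by-position comparison. The sketch below uses the two sequences from the slide; the function name and the 0-based indexing are illustrative choices.

```python
def snp_positions(reference, target):
    """0-based positions where two equal-length sequences differ."""
    if len(reference) != len(target):
        raise ValueError("substitution-only comparison needs equal lengths")
    return [i for i, (r, t) in enumerate(zip(reference, target)) if r != t]

reference = "ACGTCCCATGCGTATGCATAATGCCACATATGGCTATGCGTAATGAGTACC"
target    = "ACGTCCTATGCGTATGCGTAATGCCACATATTGCTATGCGTAATGCGTACC"

positions = snp_positions(reference, target)
snp_rate = len(positions) / len(reference)
```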
![Page 63: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/63.jpg)
Types of Variations
Small Indels (Insertions and Deletions)
Reference: ACGTCCATGCGTATGCTAATGCCACATATTGAGCTATGCGTAATGCTGTACC
Target: ACGTCCTAGATGCGTATGCGTAATGCCACATATGCTATGCGTAATGGTACC
Aligned, with gaps shown as underscores:
Reference: ACGTCC___ATGCGTATGC_TAATGCCACATATTGAGCTATGCGTAATGCTGTACC
Target: ACGTCCTAGATGCGTATGCGTAATGCCACATAT___GCTATGCGTAATG__GTACC
![Page 64: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/64.jpg)
Types of Variations
Structural Variation
Reference
Inversion
Duplication
Duplication (dispersed)
Copy Number Variation
![Page 65: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/65.jpg)
Mathematical Formulation
Focus on SNP version
Define SNP rate
Noiseless reads
Want exact reconstruction
What is Lcritical for this problem?
[Diagram: r (reference DNA), the SNP rate, and dense reads from target t are the inputs to the algorithm, which outputs an estimate of the target DNA]
![Page 66: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/66.jpg)
Mathematical Formulation
For any given reference DNA and SNP rate, what is the read length required for reconstruction, in the worst case among target DNA sequences?
Lcritical is a function of r and the SNP rate.
[Diagram: r (reference DNA), the SNP rate, and dense reads from target t are the inputs to the algorithm, which outputs an estimate of the target DNA]
![Page 67: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/67.jpg)
Necessary Conditions
Let the reference DNA r have a repeat of size ℓ_rep > 2L
Consider two possible target DNA sequences t1 and t2
Since L < ℓ_rep / 2, the two targets t1 and t2 are indistinguishable from reads
Sanity check: interleaved repeat of length ℓ_rep / 2 in t1 and t2
[Figure: r, t1 and t2, with repeats of length ℓ_rep and reads of length L]
![Page 68: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/68.jpg)
Necessary Conditions
t1
Let the reference DNA have an approximate repeat of size lrep,app > 2L
t2
If L < lrep,app / 2: the two possible targets t1 and t2 are indistinguishable
r’
r
Can create r’ close to r but having exact repeat of size lrep,app
Tolerance for approximate repeat depends on SNP rate
![Page 69: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/69.jpg)
Algorithm
r
Map reads to r
r
Keep only uniquely mapped reads
Estimate t
ť
t
lrep,app lrep,app
Let L > lrep,app / 2
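The three steps on this slide (map reads to r, keep only uniquely mapped reads, estimate t) can be sketched as follows. This is a minimal illustration under stated assumptions, not the tutorial's implementation: mapping is brute-force Hamming matching rather than a fast aligner, the names `map_read` and `call_target` and the parameter `max_mismatches` are hypothetical, and loci covered by no uniquely mapped read simply keep the reference base.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, ref, max_mismatches):
    """All start positions where the read matches ref within max_mismatches."""
    L = len(read)
    return [i for i in range(len(ref) - L + 1)
            if hamming(read, ref[i:i + L]) <= max_mismatches]

def call_target(ref, reads, max_mismatches=1):
    """Estimate the target using only uniquely mapped (noiseless) reads."""
    estimate = list(ref)            # assumption: uncovered loci keep the reference base
    covered = [False] * len(ref)
    for read in reads:
        hits = map_read(read, ref, max_mismatches)
        if len(hits) != 1:          # discard unmapped and multiply mapped reads
            continue
        for offset, base in enumerate(read):
            estimate[hits[0] + offset] = base
            covered[hits[0] + offset] = True
    return "".join(estimate), covered

# Hypothetical toy data: one SNP at position 9, dense reads of length 6.
ref = "ACGTTGCAAGCTTAGGCTAC"
target = ref[:9] + "T" + ref[10:]
reads = [target[i:i + 6] for i in range(len(target) - 5)]
estimate, covered = call_target(ref, reads)
print(estimate)
print(sum(covered), "of", len(ref), "loci covered by uniquely mapped reads")
```

As long as each read contains at most `max_mismatches` SNPs, its true position is always among the hits, so a uniquely mapped read is mapped correctly and every covered locus is called correctly.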
![Page 70: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/70.jpg)
Condition for Success
r
Loci covered by uniquely mapped reads are correctly called.
Algorithm fails at a particular locus =>
None of the (L-1) possible reads uniquely mapped
2L 2L
Second case more typical in real genome =>
2L length approximate repeat in r
L > lrep,app / 2 => The algorithm succeeds.
Case 1 Case 2
![Page 71: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/71.jpg)
Assembly Vs. Alignment: I
Necessary condition L ≥ lrep,app (r) / 2
Sufficient condition L > lrep,app (r) / 2 (subject to the assumption)
=> Alignment near optimal and Lref = lrep,app (r) / 2.
De Novo algorithm achieves Lcrit (t) = max {linterleaved(t), ltriple(t) }
In terms of r, for worst case t Lde-novo = max {linterleaved,app (r), ltriple,app (r)}
![Page 72: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/72.jpg)
Assembly Vs. Alignment: II
1. Clearly Lde-novo ≥ Lref since Lref is necessary.
2. Lde-novo = max {linterleaved,app (r), ltriple,app (r)} ≤ lrep,app(r) = 2 Lref
Thus the gain from the reference is at most a factor of 2 in the read length.
The maximal gain happens when linterleaved,app (r) = lrep,app (r), i.e., when the largest approximate repeat is an interleaved repeat.
This happens, for example, when the DNA is an i.i.d. sequence.
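The two observations on this slide combine into a single chain of inequalities. The block below only restates the slide's quantities in LaTeX notation; the symbol $\ell$ for the repeat lengths is a typesetting choice, not new content.

```latex
\[
L_{\mathrm{ref}} \;=\; \tfrac{1}{2}\,\ell_{\mathrm{rep,app}}(r)
\;\le\; L_{\text{de-novo}}
\;=\; \max\bigl\{\ell_{\mathrm{interleaved,app}}(r),\ \ell_{\mathrm{triple,app}}(r)\bigr\}
\;\le\; \ell_{\mathrm{rep,app}}(r) \;=\; 2\,L_{\mathrm{ref}},
\qquad\text{so}\qquad
1 \;\le\; \frac{L_{\text{de-novo}}}{L_{\mathrm{ref}}} \;\le\; 2 .
\]
```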
![Page 73: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/73.jpg)
Reference based Assembly: Reprise
• Complexity of alignment
– Very fast aligners using fingerprinting are available when the SNP rate is small
• Better than alignment?
– Theory shows alignment is near optimal
– But alignment is what everyone uses anyway
– Nothing better is possible?
• The limitations of the worst-case formulation!
• If we adopt an individual sequence analysis for both reference and target, a better solution is possible.
![Page 74: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/74.jpg)
Part III: RNA (Transcriptome) Assembly
Kannan, Pachter, Tse Genome Informatics ‘13
![Page 75: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/75.jpg)
RNA: The RAM in Cells
• The instructions from DNA are copied to mRNA transcripts by transcription
– RNA transcripts capture the dynamics of the cell
• RNA Sequencing: Importance
– Clinical purposes
– Research: Discovery of novel functions
– Understanding gene regulation
– Most popular *-Seq
DNA → (transcription) → RNA → (translation) → Protein
![Page 76: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/76.jpg)
Alternative splicing
ATAC GAAT CAAT TCAG
ATAC CAAT TCAG GAAT TCAG
DNA
RNA Transcript 1 RNA Transcript 2
Intron
Exon
AC TGAA AGC
Alternative splicing yields different isoforms.
1000’s to 10,000’s symbols long
![Page 77: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/77.jpg)
RNA-Seq
ATAC CAAT TCAG
GAAT TCAG
ATAC CAAT TCAG
GAAT TCAG
GAAT TCAG
TCA
(Mortazavi et al,Nature Methods 08)
ATT
GAA
Reads
Assembler reconstructs
• Existing Assemblers
– Genome guided: Cufflinks, Scripture, IsoLasso, ...
– De novo: Trinity, Oases, Trans-ABySS, ...
![Page 78: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/78.jpg)
RNA Sequencing: Bottleneck
Source: Wei Li et al, JCB 2011, Data from ENCODE project
24243
7553
9741
6457 448216
59647
5588
IsoLasso
Scripture
Cufflinks
Popular assemblers diverge significantly when fed the same input
Is the bottleneck informational or computational or neither?
![Page 79: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/79.jpg)
Informational Limits
• Lcritical for transcriptome assembly
[Figure: read-length axis L starting at 0. Below the lower bound, no algorithm can reconstruct; above the upper bound, the proposed algorithm can reconstruct in linear time.]
On many examples, these two bounds match, establishing Lcritical!
• Mouse transcriptome: Lcritical = 4077 revealing complex transcriptome structure
• What can we do at practical values of L?
![Page 80: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/80.jpg)
Near-Optimality at Practical L
[Plot: fraction of transcripts reconstructable vs. read length]
![Page 81: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/81.jpg)
Near-Optimality at Practical L
[Plot: fraction of transcripts reconstructable vs. read length; curves: upper bound on any algorithm, upper bound without abundance]
![Page 82: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/82.jpg)
Near-Optimality at Practical L
[Plot: fraction of transcripts reconstructable vs. read length; curve: proposed algorithm]
![Page 83: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/83.jpg)
Necessity of Abundance Information
[Plot: fraction of transcripts reconstructable vs. read length; curves: upper bound without abundance diversity, upper bound without abundance]
![Page 84: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/84.jpg)
Transcriptome Assembly: Formulation
• M transcripts s1,..,sM with relative abundances α1,..,αM which are generic (rationally independent).
– Dense read model: Look at Lcrit
– Get all substrings of length L along with their relative weights
[Figure: transcripts s1, s2, ..., sM with abundances α1, α2, ..., αM; a substring common to s1 and s2 carries weight α1+α2]
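The dense read model above (all substrings of length L together with their relative weights) can be sketched as a weighted L-spectrum. This is an illustrative sketch: the function name `weighted_spectrum` and the abundance values are assumptions, the weights are left unnormalized, and the two toy transcripts are the ones from the alternative-splicing example.

```python
from collections import defaultdict

def weighted_spectrum(transcripts, abundances, L):
    """Map each length-L substring to the total abundance of its occurrences."""
    weights = defaultdict(float)
    for seq, alpha in zip(transcripts, abundances):
        for i in range(len(seq) - L + 1):
            weights[seq[i:i + L]] += alpha   # shared substrings accumulate weight
    return dict(weights)

# Hypothetical abundances for the two isoforms from the earlier example.
transcripts = ["ATACCAATTCAG", "GAATTCAG"]
abundances = [0.25, 0.75]
spectrum = weighted_spectrum(transcripts, abundances, L=4)
for substring, weight in sorted(spectrum.items()):
    print(substring, weight)
```

Substrings that occur only in the first transcript carry weight 0.25, those only in the second carry 0.75, and substrings in the shared suffix carry the sum of both abundances.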
![Page 85: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/85.jpg)
What is Lcritical for transcriptome?
• Lcritical is lower bounded by the length of the longest interleaved repeat in any transcript
• It can potentially be much larger due to inter-transcript repeats of exons across isoforms.
ATAC CAAT TCAG
GAAT TCAG
![Page 86: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/86.jpg)
The Information Bottleneck
s1 s3 s4
s2 s3 s5
![Page 87: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/87.jpg)
The Information Bottleneck
s1 s3 s4
s2 s3 s5
![Page 88: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/88.jpg)
The Information Bottleneck
[Figure: two alternative transcriptomes built from segments s1, s2, s3, s4, s5]
Unless L > |s3| (the length of s3), these two transcriptomes are confused
![Page 89: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/89.jpg)
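The Information Bottleneck
[Figure: alternative transcriptomes built from segments s1, s2, s3, s4, s5, including a four-transcript alternative]
Sparsity can help rule out this four-transcript alternative
But the first two possibilities are still confusable unless L > |s3| (the length of s3)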
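The confusion between the two transcriptomes can be checked directly on a toy example. The sketch below is an illustration with hypothetical segment strings: it compares the unweighted sets of length-L substrings of {s1 s3 s4, s2 s3 s5} and of the swapped alternative {s1 s3 s5, s2 s3 s4}. In this toy setup a read only separates the two once it spans all of s3 plus one symbol on each side.

```python
def lmer_set(transcripts, L):
    """Set of all length-L substrings (the unweighted L-spectrum)."""
    return {seq[i:i + L] for seq in transcripts
            for i in range(len(seq) - L + 1)}

# Hypothetical segments; s3 (length 6) is shared by both transcripts.
s1, s2, s3, s4, s5 = "AAAC", "GGGT", "ACGTAC", "TTTG", "CCCA"
original = [s1 + s3 + s4, s2 + s3 + s5]
swapped = [s1 + s3 + s5, s2 + s3 + s4]

for L in (5, 7, 8):
    same = lmer_set(original, L) == lmer_set(swapped, L)
    print(L, "confusable" if same else "distinguishable")
```

For L = 5 and L = 7 every read lies inside s1 s3, s2 s3, s3 s4 or s3 s5, which both transcriptomes contain, so the spectra coincide; at L = 8 a read bridges s3 and the spectra differ.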
![Page 90: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/90.jpg)
How to Distinguish the Two
s1 s3 s5
s2 s3 s4
s1 s3 s4
s2 s3 s5
![Page 91: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/91.jpg)
Abundance diversity
Lymphoblastoid cell line, Geuvadis dataset
![Page 92: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/92.jpg)
Abundance Diversity
[Figure: transcripts over segments s1, s3, s4, s5 drawn with different abundances]
![Page 93: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/93.jpg)
Abundance Diversity
[Figure: transcripts over segments s1, s3, s4, s5 and a candidate alternative, drawn with different abundances]
This transcriptome is not a viable alternative (non-uniform coverage)
Even if L < |s3| (the length of s3), these transcriptomes are distinguishable.
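The role of abundance diversity can be illustrated with weighted spectra. The sketch below is a toy check with hypothetical segments and abundance values, not the tutorial's algorithm: with unequal abundances, the swapped transcriptome cannot reproduce the observed weights under either assignment of the same two abundance values, even though L is shorter than s3. With equal abundances the two remain confusable.

```python
from collections import defaultdict
from itertools import permutations

def weighted_spectrum(transcripts, abundances, L):
    """Map each length-L substring to the total abundance of its occurrences."""
    weights = defaultdict(float)
    for seq, alpha in zip(transcripts, abundances):
        for i in range(len(seq) - L + 1):
            weights[seq[i:i + L]] += alpha
    return dict(weights)

# Hypothetical segments and abundances; L = 5 is shorter than len(s3) = 6.
s1, s2, s3, s4, s5 = "AAAC", "GGGT", "ACGTAC", "TTTG", "CCCA"
original = [s1 + s3 + s4, s2 + s3 + s5]
swapped = [s1 + s3 + s5, s2 + s3 + s4]
L = 5

observed = weighted_spectrum(original, [0.25, 0.75], L)
matches = [ab for ab in permutations([0.25, 0.75])
           if weighted_spectrum(swapped, ab, L) == observed]
print("abundance assignments of the swapped transcriptome that fit:", matches)

equal = (weighted_spectrum(original, [0.5, 0.5], L)
         == weighted_spectrum(swapped, [0.5, 0.5], L))
print("confusable with equal abundances:", equal)
```

In the swapped transcriptome, s1 and s5 sit on the same transcript but are observed at weights 0.25 and 0.75, which is the non-uniform coverage the slide refers to.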
![Page 94: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/94.jpg)
Fooling Set under Abundance Diversity
[Figure: two transcriptomes built from segments s1, s2, s3, s4, s5 with abundances a, b, c]
These two transcriptomes are still confusable if L < |s2| (the length of s2)
![Page 95: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/95.jpg)
Achievability: Algorithm
• From the reads, we construct a transcript graph
[Figure: reads ATCCA, TCCAT, CCATT, CATTC, ATTCG, GATTC as nodes of the graph, with weights 0.3, 0.1, 0.3]
• Weight edges based on relative frequencies
![Page 96: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/96.jpg)
Achievability: Algorithm
• From the reads, we construct a transcript graph
[Figure: reads ATCCA, TCCAT, CCATT, CATTC, ATTCG, GATTC as nodes of the graph, with weights 0.3, 0.1, 0.3]
• Weight edges based on relative frequencies
![Page 97: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/97.jpg)
Achievability: Algorithm
• From the reads, we construct a transcript graph
[Figure: condensed graph on nodes ATC, GAT, CAT, TCG, with weights 0.3, 0.1, 0.3]
• Weight edges based on relative frequencies
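The graph construction on these slides can be sketched as follows, using the six reads shown in the figure. This is a minimal illustration under assumptions: reads all have the same length, an edge u → v is drawn when the last L-1 symbols of u equal the first L-1 symbols of v, and the read multiplicities are made up. The slides weight edges by relative frequency but do not spell out how; here the relative frequency is attached to each read (node), and `build_read_graph` is a hypothetical name.

```python
from collections import Counter

def build_read_graph(reads):
    """Return (relative frequency per distinct read, list of overlap edges)."""
    counts = Counter(reads)
    total = sum(counts.values())
    freq = {read: count / total for read, count in counts.items()}
    nodes = sorted(counts)
    # Edge u -> v when u and v overlap in all but one symbol.
    edges = [(u, v) for u in nodes for v in nodes
             if u != v and u[1:] == v[:-1]]
    return freq, edges

# Reads from the slide; the multiplicities are hypothetical.
reads = (["ATCCA", "TCCAT", "CCATT", "CATTC"] * 3
         + ["GATTC"] * 1
         + ["ATTCG"] * 4)
freq, edges = build_read_graph(reads)
for u, v in edges:
    print(u, "->", v)
```

The result is two branches, ATCCA → TCCAT → CCATT → CATTC and GATTC, which merge at ATTCG.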
![Page 98: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/98.jpg)
Transcripts from Graph
• Paths correspond to transcripts
• Naïve Algorithm: Output all paths from the graph
[Figure: graph on nodes ATC, GAT, CAT, TCG with weights 0.3, 0.1, 0.3; paths: GAT TCG and ATC CAT TCG]
Utility of Abundance
• Consider the following splice-graph
– Not all paths are transcripts
– Node frequencies give abundance information
– First idea: Use continuity of copy counts
[Figure: splice graph on s1, s2, s3, s4, s5 with weights 0.12 and 0.88; transcripts: s1 s3 s4 and s2 s3 s5, with abundances 0.12 and 0.88]
![Page 100: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/100.jpg)
Utility of Abundance: Beyond Continuity
• More complex splice graphs:
[Figure: splice graph on nodes s0 to s6 with copy counts 12, 9, 5, 7, 6, 15]
In general, we are given values on nodes / edges.
Need to find sparsest flow (on fewest paths).
![Page 101: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/101.jpg)
General Splice graphs
• Principle for general splice graphs:
– Find the smallest set of paths that corresponds to the node / edge copy counts
• Network routing, snooping, societal networks
• How to split a flow?
– Edge-flow: Flow value on each edge (satisfying conservation)
– Path-flow: Flow value on each path
– Given an edge-flow, find the sparsest path-flow
[Figure: splice graph from Start through s1, s2, s3, s4, s5 to End, with flow values 0.12 and 0.88]
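Splitting an edge-flow into a path-flow can be sketched with a simple greedy heuristic. This is not the tutorial's algorithm and is not guaranteed to find the sparsest decomposition in general: it repeatedly follows the outgoing edge with the largest remaining flow from source to sink and peels off the bottleneck value. The function name is hypothetical, the graph is assumed acyclic with conserved flow, and the slide's values 0.12 and 0.88 are scaled to the integers 12 and 88 to avoid floating-point residue.

```python
def greedy_flow_decomposition(edge_flow, source, sink):
    """Greedily split an edge-flow into (path, value) pairs."""
    remaining = dict(edge_flow)
    paths = []
    while True:
        path, node = [source], source
        while node != sink:
            out = [(f, v) for (u, v), f in remaining.items()
                   if u == node and f > 0]
            if not out:
                break
            _, node = max(out)       # follow the edge with the largest remaining flow
            path.append(node)
        if node != sink:             # no flow left to route
            break
        hops = list(zip(path, path[1:]))
        bottleneck = min(remaining[hop] for hop in hops)
        for hop in hops:
            remaining[hop] -= bottleneck
        paths.append((path, bottleneck))
    return paths

edge_flow = {
    ("Start", "s1"): 12, ("Start", "s2"): 88,
    ("s1", "s3"): 12, ("s2", "s3"): 88,
    ("s3", "s4"): 12, ("s3", "s5"): 88,
    ("s4", "End"): 12, ("s5", "End"): 88,
}
for path, value in greedy_flow_decomposition(edge_flow, "Start", "End"):
    print(" ".join(path), value)
```

On this example the heuristic returns two paths, Start s2 s3 s5 End with value 88 and Start s1 s3 s4 End with value 12, using the copy counts to pair s1 with s4 and s2 with s5.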
![Page 102: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/102.jpg)
Sparsest Flow Decomposition
• Problem is NP-Hard. [Vatinlen et al. ’08, Hartman et al. ’12]
– Closer look at hard instances: most paths have same flow
– Equivalent to: Most transcripts have same abundance (!)
– This is not characteristic of the biological problem
• Our Result:
– Assume that abundances are generic
– Propose a provably correct algorithm that reconstructs when L > Lsuff
– Algorithm is linear time under this condition
• Approximately satisfied by biological data!
![Page 103: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/103.jpg)
Iterative Algorithm
• The algorithm locally resolves paths using abundance diversity
– Error propagation?
• Decompose a node only when sure
• If unsure, decompose other nodes before coming back to this node
• The algorithm solves paths like a sudoku puzzle
– Solving one node can help uniquely resolve other nodes!
– Can analyze conditions for correct recovery
• L > Lsuff
![Page 104: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/104.jpg)
Algorithm: Example Run
[Figure: step-by-step run of the algorithm on a graph with nodes 1 to 7 and abundance labels a, b, c, a+b, b+c; the recovered paths are 1346, 235 and 2347]
![Page 105: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/105.jpg)
Practical Implementation
• Transcripts as paths: sparsest decomposition of edge-flow into paths; deals with inter-transcript repeats
• Aggregate abundance estimation: node-wise copy count estimates; smoothing CC estimates using min-cost network flow
• Multibridging to construct transcript graph: condensation and intra-transcript repeat resolution; identify and discard sequencing errors
![Page 106: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/106.jpg)
Practical Performance
• Simulated reads from human chromosome 15, Gencode transcriptome
• Hard test case
• 1700 transcripts chosen randomly from Chr 15
• Abundance generated from log-uniform distribution
• Read length = 100, 1 million reads
• 1% error rate
• Single-end reads / stranded protocol
![Page 107: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/107.jpg)
Practical Performance
[Bar charts, Trinity vs. our algorithm: fraction of transcripts missed by coverage depth of transcripts (1 to 10, 10 to 25, 25 to 50, 50 to 100, 100+), and false positives; vertical axes from 0 to 1]
![Page 108: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/108.jpg)
Complexity
• Sparsest flow problem known to be NP-Hard
– Can show using similar reduction that RNA-Seq problem under dense reads is also NP-Hard, assuming arbitrary abundances
• Reasons why our formulation leads to poly-time algorithm:
– Our assumption that abundances are generic
– Only worry about instances where there is enough information
– Individual sequence formulation lets us focus on issues arising only in real genomes.
![Page 109: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/109.jpg)
Confidence
• Can we be sure when the produced solution is correct?
– Assume dense read model
– We are finding the sparsest set of transcripts that satisfy the given L-spectrum
• Under the assumption of genericity
– Theorem: If the sparsest solution is unique, then it is the only generic solution satisfying the L-spectrum (!)
[Figure: splice graph on s1, s2, s3, s4, s5 with weights 0.12 and 0.88]
![Page 110: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/110.jpg)
Summary
• An approach to assembly design based on principles of information theory.
• Driven by and tested on genomics and transcriptomics data.
• Ultimate goal is to build robust, scalable software with performance guarantees.
![Page 111: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/111.jpg)
Problem Landscape
measure data
• Assembly (de novo)
– Noisy reads
– RNA: Finite N
• Variant calling (reference-based assembly)
– Indels
– Large variants
• Metagenomic assembly
manage data
• Compression
– Compress memory?
• Privacy
– Information theoretic methods?
utilize data
• Genome wide association studies
– Information bounds
• Phylogenetic tree reconstruction
• Pathogen detection
![Page 112: High Throughput Sequencing: Microscope in the Big Data Era TexPoint fonts used in EMF: AAAAAAAAAAAAAAAA Sreeram Kannan and David Tse Tutorial ISIT 2014](https://reader038.vdocument.in/reader038/viewer/2022110207/56649d6e5503460f94a4f3a7/html5/thumbnails/112.jpg)
Acknowledgements
DNA Assembly RNA Assembly
Guy Bresler, MIT
Ma’ayan Bresler, Berkeley
Ka Kit Lam, Berkeley
Asif Khalak, Pacific Biosciences
Lior Pachter, Berkeley
Joseph Hui, Berkeley
Kayvon Mazooji, Berkeley
Abolfazl Motahari, Sharif
Soheil Mohajer
Eren Sasoglu