![Page 1: Genome Sequencing Algorithms - Basavaraj Talawarbt.nitk.ac.in/c/14a/co457/notes/GenomeSequencing0.pdf · Genome Sequencing Algorithms William Hamilton (1805 – 1865) Leonhard Euler](https://reader035.vdocument.in/reader035/viewer/2022062311/5f882e7c75f83e314d40d269/html5/thumbnails/1.jpg)
Genome Sequencing Algorithms
William Hamilton (1805 – 1865)
Leonhard Euler (1707 – 1783)
Nicolaas Govert de Bruijn (1918 – 2012)
Phillip Compeau and Pavel Pevzner
Bioinformatics Algorithms: an Active Learning Approach
The Genome Sequencing Problem
● Determining the order of nucleotides in a genome
● Human genome contains about 3 billion nucleotides
– Amoeba dubia and Paris japonica are reported to have genomes up to about 200 times larger!
● Applications in Medicine, Agriculture, Biotechnology, ...
The Genome Sequencing Problem
● There is no technology to read the genome from one end to the other.
– Short snippets, called reads (200-300 nucleotides), can be identified.
– Nothing is known about the location of a read in the genome.
● Assembling individual reads into the entire genome is akin to solving a giant overlapping puzzle.
● The newspaper explosion analogy
History of Genome Sequencing
● 1977: Walter Gilbert and Frederick Sanger developed independent DNA sequencing methods.
● 1990: Human Genome Project, Francis Collins.
● 1997: Celera Genomics, Craig Venter.
● 2000: Human genome is sequenced.
Next Generation Sequencing
● Illumina sequences human genomes for $10,000
● Complete Genomics sequences 100s of genomes per month
● Beijing Genomics Institute has 100s of sequencing machines and is the world's biggest sequencing center.
Next Generation Sequencing
● Identification of mutations in personal genomes for health diagnosis
● Genome 10K project
Genome Assembly – The Computational Problem
ATGTGCATACTAAGCATACTAAGCATACTAAGCATACTAATGTGCATACTAAGCATGCTA
(Figure: the same genome string shown as eight identical copies.)
Genome Assembly – The Computational Problem
Sequencing Machine generates reads
A String Reconstruction Problem
The Genome Sequencing Problem
Reconstruct a genome from reads
Input: A collection of strings, Reads
Output: A string, Genome, reconstructed from all the Reads
k-mer Composition
Composition3(TAATGCCATGGGATGTT) =
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
Lexicographical ordering of k-mers
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
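The k-mer composition above can be computed with a few lines of Python. This is a minimal sketch; the function name `composition` is an illustrative choice and does not come from the slides.

```python
def composition(text, k):
    """Return every k-mer of text in lexicographic order, keeping repeats."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return sorted(kmers)


if __name__ == "__main__":
    # Prints the 15 3-mers of the example genome, sorted.
    print(" ".join(composition("TAATGCCATGGGATGTT", 3)))
```

A string of length n has n - k + 1 k-mers, so the 17-nucleotide example yields 15 3-mers, with ATG appearing three times.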
The String Reconstruction Problem
Reconstruct a string from its k-mer composition.
Input: A collection of k-mers
Output: A Genome, such that Compositionk(Genome) is equal to the collection of k-mers
Naive String Reconstruction Approach
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
TAA → AAT → ATG → TGT → GTT
No 3-mer begins with TT!
Representing a Genome as a Path
Composition3(TAATGCCATGGGATGTT) =
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
The Genome
Nodes in a Graph
Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)
Path turns into a Graph
Connect k-mer1 with k-mer2 if suffix(k-mer1) = prefix(k-mer2)
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
Nodes are ordered lexicographically.
How does one find the genome string?
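The connection rule (join k-mer1 to k-mer2 when the suffix of the first equals the prefix of the second) can be sketched in Python as follows. The name `overlap_graph` and the choice to index nodes by position, so that the three copies of ATG stay distinct nodes, are assumptions made for this illustration.

```python
def overlap_graph(kmers):
    """Map each k-mer occurrence (by index) to the occurrences it connects to.

    An edge i -> j exists when the last k-1 letters of kmers[i]
    equal the first k-1 letters of kmers[j].
    """
    adjacency = {i: [] for i in range(len(kmers))}
    for i, first in enumerate(kmers):
        for j, second in enumerate(kmers):
            if i != j and first[1:] == second[:-1]:
                adjacency[i].append(j)
    return adjacency


if __name__ == "__main__":
    kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
    for i, successors in overlap_graph(kmers).items():
        print(kmers[i], "->", [kmers[j] for j in successors])
```

In this graph GTT has no outgoing edge, which is the dead end the naive approach ran into.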
Genome Path in the Graph
TAA → AAT → ATG → TGC → GCC → CCA → CAT → ATG → TGG → GGG → GGA → GAT → ATG → TGT → GTT
TAATGCCATGGGATGTT
The genome string is a Hamiltonian path in the graph
Hamiltonian Path Problem
Find a Hamiltonian path in the graph
Input: A graph.
Output: A path visiting every node in the graph exactly once.
Hamiltonian Path: A path in a graph that traverses every node exactly once
William R Hamilton (1805 – 1865)
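For a graph as small as the 15-node example, a Hamiltonian path can be found by plain backtracking. The sketch below is only an illustration of the problem statement, not a method from the slides. The names `hamiltonian_path` and `spell` are hypothetical, and the search takes exponential time in the worst case.

```python
def hamiltonian_path(kmers):
    """Backtracking search for an ordering of all k-mers with consecutive overlaps."""
    n = len(kmers)
    # successors[i] lists every other k-mer whose prefix matches the suffix of kmers[i].
    successors = [
        [j for j in range(n) if j != i and kmers[i][1:] == kmers[j][:-1]]
        for i in range(n)
    ]

    def extend(path, used):
        if len(path) == n:
            return path
        for j in successors[path[-1]]:
            if j not in used:
                used.add(j)
                found = extend(path + [j], used)
                if found:
                    return found
                used.remove(j)  # dead end: undo and try the next successor
        return None

    for start in range(n):
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(kmers, path):
    """Spell the string visited by a path: first k-mer plus one letter per step."""
    return kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])


if __name__ == "__main__":
    kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
    print(spell(kmers, hamiltonian_path(kmers)))
```

The string it prints has the given 3-mer composition, but it need not be the original genome, because more than one ordering can satisfy the overlaps.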
A Different Path
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
3-mers as nodes
3-mers as edges
A Different Path
3-mers as edges and nodes as prefixes and suffixes of the corresponding 3-mers
TA → AA → AT → TG → GC → CC → CA → AT → TG → GG → GG → GA → AT → TG → GT → TT
(Figure: each arrow is labeled with a 3-mer, in order: TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT.)
Glue Identical Nodes
TA → AA → AT → TG → GC → CC → CA → AT → TG → GG → GG → GA → AT → TG → GT → TT
(Figure, three frames: the three AT nodes are glued into one node, then the three TG nodes, then the two GG nodes. The edge GGG becomes a loop from GG to itself.)
De Bruijn Graph of the Genome
Nodes: TA AA AT TG GC CC CA GG GA GT TT
Edges: TAA (TA → AA), AAT (AA → AT), ATG ×3 (AT → TG), TGC (TG → GC), GCC (GC → CC), CCA (CC → CA), CAT (CA → AT), TGG (TG → GG), GGG (GG → GG), GGA (GG → GA), GAT (GA → AT), TGT (TG → GT), GTT (GT → TT)
The genome string is an Eulerian path in the De Bruijn graph
TAATGCCATGGGATGTT
Eulerian Path Problem
Leonhard Euler (1707 – 1783)
Find an Eulerian path in a graph
Input: A graph.
Output: A path visiting every edge in the graph exactly once.
Eulerian Path: A path in a graph that traverses every edge exactly once.
Hamiltonian Path vs. Eulerian Path
(Figure: the De Bruijn graph of the genome on nodes TA AA AT TG GC CC CA GG GA GT TT, with the 3-mers as edge labels.)
Euler presented an efficient solution to the Eulerian path problem. No fast (polynomial-time) algorithm is known for the Hamiltonian Path Problem, which is NP-Complete.
The Objective
TAATGCCATGGGATGTT
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
To Do ...
(Figure: the same genome and 3-mer composition, together with the De Bruijn graph on nodes TA AA AT TG GC CC CA GG GA GT TT.)
Constructing De Bruijn Graph
The composition of the genome is known
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
Each 3-mer is drawn as an edge from its prefix to its suffix:
TAA: TA → AA
AAT: AA → AT
ATG: AT → TG
TGC: TG → GC
GCC: GC → CC
CCA: CC → CA
CAT: CA → AT
ATG: AT → TG
TGG: TG → GG
GGG: GG → GG
GGA: GG → GA
GAT: GA → AT
ATG: AT → TG
TGT: TG → GT
GTT: GT → TT
(Figure, later frames: consecutive edges are joined end to end, giving the path TA AA AT TG GC CC CA AT TG GG GG GA AT TG GT TT.)
Glue Identical Nodes
(Figure, three frames: as on the earlier Glue Identical Nodes slides, the AT nodes are glued into one node, then the TG nodes, then the GG nodes.)
De Bruijn Graph of the Genome Composition
(Figure: the De Bruijn graph on nodes TA AA AT TG GC CC CA GG GA GT TT, with the 3-mers as edge labels.)
De Bruijn Graph(Genome Composition) == De Bruijn Graph(Genome)
Constructing the De Bruijn Graph
● De Bruijn graph of a collection of k-mers:
– Represent every k-mer as an edge between its prefix and suffix
– Glue all nodes with identical labels
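The two construction steps can be sketched in Python with a dictionary keyed by node label. Using the label as the key performs the gluing implicitly, since identical labels map to the same entry. The function name `de_bruijn` is an illustrative assumption.

```python
from collections import defaultdict


def de_bruijn(kmers):
    """Build the De Bruijn graph of a collection of k-mers.

    Each k-mer contributes one edge from its (k-1)-letter prefix to its
    (k-1)-letter suffix. Repeated k-mers give repeated (parallel) edges.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
    for node, successors in sorted(de_bruijn(kmers).items()):
        print(node, "->", " ".join(successors))
```

Because the construction reads only the k-mers, it produces the same graph whether it starts from the genome or from its composition.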
Eulerian Cycle Problem
Find an Eulerian cycle in a graph
Input: A graph.
Output: A cycle visiting every edge in the graph exactly once.
The Königsberg Bridges
Eulerian Graph
A graph is Eulerian if it contains an Eulerian cycle
A directed graph is balanced if every node has equal in-degree and out-degree
Every balanced and strongly connected graph is Eulerian
(Figure: an example graph with labels 1 to 11.)
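The balance condition is easy to test in code. The sketch below checks only in-degree against out-degree. It does not check strong connectivity, which the statement above also requires. The names `degree_imbalance` and `is_balanced` are hypothetical.

```python
from collections import Counter


def degree_imbalance(edges):
    """Return {node: out_degree - in_degree} for every node where they differ."""
    difference = Counter()
    for source, target in edges:
        difference[source] += 1
        difference[target] -= 1
    return {node: d for node, d in difference.items() if d != 0}


def is_balanced(edges):
    """A directed graph is balanced when no node has a degree imbalance."""
    return not degree_imbalance(edges)


if __name__ == "__main__":
    kmers = "TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT".split()
    edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
    print(is_balanced(edges), degree_imbalance(edges))
```

For the De Bruijn graph of the example genome only TA (one extra outgoing edge) and TT (one extra incoming edge) are unbalanced, which is why that graph has an Eulerian path from TA to TT rather than an Eulerian cycle.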
Algorithm to Find the Eulerian Cycle
EulerianCycle(Graph)
    form a cycle Cycle by randomly walking in Graph
    while there are unexplored edges in Graph
        select a node newStart in Cycle with unexplored edges
        form Cycle' by traversing Cycle (starting at newStart) and then randomly walking
        Cycle ← Cycle'
    return Cycle
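A runnable Python sketch of this idea follows. It uses the common stack-based variant (often attributed to Hierholzer) instead of literally re-traversing Cycle from newStart. The effect is the same: the walk is extended from any visited node that still has unexplored edges. The input is assumed to be balanced and strongly connected, and the name `eulerian_cycle` is illustrative.

```python
def eulerian_cycle(graph):
    """Return an Eulerian cycle as a list of nodes (first node repeated at the end).

    graph maps each node to a list of successor nodes and is assumed to be
    balanced and strongly connected.
    """
    remaining = {node: list(successors) for node, successors in graph.items()}
    start = next(iter(remaining))
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # Keep walking along an unexplored edge.
            stack.append(remaining[node].pop())
        else:
            # No unexplored edges left here: this node is final in the cycle.
            cycle.append(stack.pop())
    return cycle[::-1]


if __name__ == "__main__":
    example = {0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2],
               5: [4], 6: [5, 8], 7: [9], 8: [7], 9: [6]}
    print(" -> ".join(str(node) for node in eulerian_cycle(example)))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges.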
From Reads to De Bruijn Graph to Genome
AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT
(Figure: the De Bruijn graph on nodes TA AA AT TG GC CC CA GG GA GT TT built from these 3-mers.)
TAATGCCATGGGATGTT
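The whole pipeline (k-mers, De Bruijn graph, Eulerian path, genome string) fits in one short Python function. This is a sketch under the assumption that the k-mers form a connected graph with an Eulerian path. The name `reconstruct` and the stack-based traversal are choices made for this illustration.

```python
from collections import defaultdict


def reconstruct(kmers):
    """Spell a string whose k-mer composition equals the given k-mers."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # An Eulerian path starts at the node with one extra outgoing edge, if any.
    start = next((node for node, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # First node in full, then the last letter of every following node.
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()
    print(reconstruct(kmers))
```

The result always has the requested composition, but when the graph has several Eulerian paths it may differ from the genome the reads came from.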
Multiple Eulerian Paths
(Figure: the De Bruijn graph on nodes TA AA AT TG GC CC CA GG GA GT TT, traced edge by edge.)
TAATGCCATGGGATGTT
The graph has more than one Eulerian path: TAATGGGATGCCATGTT has the same 3-mer composition, so the composition alone does not determine the genome.
DNA Sequencing with Read-pairs
● A read-pair is a pair of reads separated by a fixed distance d.
Genome: TAATGCCATGGGATGTT. AAT-CCA is a (3,1) read-pair: two 3-mers separated by d = 1 nucleotide.
AAT-CCA represents the sequence AATGCCA in the original De Bruijn graph
DNA Sequencing with Read-pairs
Composition3(TAATGCCATGGGATGTT) =
TAA AAT ATG TGC GCC CCA CAT ATG TGG GGG GGA GAT ATG TGT GTT
PairedComposition3,1(TAATGCCATGGGATGTT) =
TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT
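The paired composition can be generated by sliding a window of length 2k + d along the genome. The function name `paired_composition` and the tuple representation of a read-pair are assumptions made for this minimal sketch.

```python
def paired_composition(text, k, d):
    """Return all (k, d) read-pairs of text in genome order.

    Each pair holds two k-mers whose start positions are k + d apart.
    """
    span = 2 * k + d
    return [
        (text[i:i + k], text[i + k + d:i + span])
        for i in range(len(text) - span + 1)
    ]


if __name__ == "__main__":
    pairs = paired_composition("TAATGCCATGGGATGTT", 3, 1)
    print(" ".join(first + "|" + second for first, second in pairs))
```

For the 17-nucleotide example with k = 3 and d = 1 the window has length 7, giving 17 - 7 + 1 = 11 read-pairs.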
Paired Composition
TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT
Lexicographical order:
AAT|CCA ATG|CAT ATG|GAT CAT|GGA CCA|GGG GCC|TGG GGA|GTT GGG|TGT TAA|GCC TGC|ATG TGG|ATG
String Reconstruction from Read-pairs
Input: A collection of paired k-mers
Output: A string Text such that PairedComposition(Text) is equal to the collection of paired k-mers
Paired De Bruijn Graphs
Each paired 3-mer is drawn as an edge from its pair of prefixes to its pair of suffixes:
TAA|GCC: TA|GC → AA|CC
AAT|CCA: AA|CC → AT|CA
ATG|CAT: AT|CA → TG|AT
TGC|ATG: TG|AT → GC|TG
GCC|TGG: GC|TG → CC|GG
CCA|GGG: CC|GG → CA|GG
CAT|GGA: CA|GG → AT|GA
ATG|GAT: AT|GA → TG|AT
TGG|ATG: TG|AT → GG|TG
GGG|TGT: GG|TG → GG|GT
GGA|GTT: GG|GT → GA|TT
Combine nodes with identical labels
Paired De Bruijn graphs obtained from the paired composition and the genome are identical.
Unique genome string: TAATGCCATGGGATGTT
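Reconstruction from read-pairs can be sketched by combining the paired De Bruijn graph with an Eulerian path search. The code below is an illustration under stated assumptions: the pairs form a connected graph with an Eulerian path, and the name `reconstruct_from_pairs` is hypothetical. It returns None when the path it finds does not spell a consistent string.

```python
from collections import defaultdict


def reconstruct_from_pairs(pairs, k, d):
    """Spell a string from (k, d) read-pairs via the paired De Bruijn graph."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for first, second in pairs:
        source = (first[:-1], second[:-1])   # pair of prefixes
        target = (first[1:], second[1:])     # pair of suffixes
        graph[source].append(target)
        balance[source] += 1
        balance[target] -= 1
    start = next((node for node, b in balance.items() if b == 1), next(iter(graph)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell the string of first components and the string of second components.
    upper = path[0][0] + "".join(node[0][-1] for node in path[1:])
    lower = path[0][1] + "".join(node[1][-1] for node in path[1:])
    # The second string is the first one shifted by k + d positions.
    if upper[k + d:] != lower[:-(k + d)]:
        return None
    return upper + lower[-(k + d):]


if __name__ == "__main__":
    text = ("TAA|GCC AAT|CCA ATG|CAT TGC|ATG GCC|TGG CCA|GGG "
            "CAT|GGA ATG|GAT TGG|ATG GGG|TGT GGA|GTT")
    pairs = [tuple(item.split("|")) for item in text.split()]
    print(reconstruct_from_pairs(pairs, 3, 1))  # TAATGCCATGGGATGTT
```

In this example the node TG|AT is the only branching point, and one of its two outgoing edges leads straight to the dead end GA|TT, so only one Eulerian path exists and the genome is recovered uniquely.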