Genome Reconstruction: A Puzzle With a Billion Pieces
Phillip Compeau & Pavel Pevzner, University of California-San Diego
Outline
1. Introduction to Genome Sequencing
2. The Newspaper Problem
3. DNA Chips: A First Shot at Sequencing with Short Reads
4. Two Mathematical Detours
5. Introduction to Graph Theory
6. Euler’s Theorem
7. ECP vs. HCP and Algorithmic Complexity
8. From Euler and Hamilton to Fragment Assembly
9. De Bruijn and a Final Solution to Fragment Assembly
10. Generalizing Fragment Assembly
Section 1: Introduction to Genome Sequencing
What Is Genome Sequencing?
• A genome can be represented as a book written in an alphabet containing only 4 letters, called nucleotides: A, T, G, and C.
• A human genome has roughly 3 billion nucleotides.
• Genome sequencing is the process of determining the sequence of nucleotides that make up a genome.
...CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATATAGCCGAGCGGCTACGATGATGCTAGCTGTACAGCTGATGATCTAGCTATCGATGCGATCGATGCGCGAGTGCGATCGATCACTTCGAGCTAGCTGATCGATCGATGCTAGCTAGCTGACTGATCATGGCGTTAGCTAGCTAGCTGATCGTCGATCGTACGTAGCTGATTACGATCGTCCGATCGTGCTATGACGTACGAGGCGGCTACGTAGCATGCTAGCTGACTGATGTAGCTAGCTATACGATACTATATATTCGATCGATTTATTACCATGACTGACGCGCATCGCTGTACACGTACTAGCTGATCGATGCTAGTCGATCGATCGATCATGTTATATATCGCGGCGCATCGATCGACTGCTCGATTATCGATACGTCGATCGCTGTATATACGTCTTTATAGCTAGGAGCATAGCGACGCGCTATCGATCGATCGTCTAGTCGACTGATCGTACTAGCTGACGCTGACGACTAGCTAGCTATCGACGATCGTAGTGCGATTACTAGCTAGGATCCTACTGTACGTCAGTCAGTCTGATCGATAGCGAGGAAAGCGAGACTGATCGTTCTCTAGATGTAGCTGATGTGACTACTATACTACTGGCAGCGATCGGGA…
• Different people have slightly different genomes: all humans share 99.9% of the same genetic code.
• The 0.1% difference accounts for height, eye color, high cholesterol susceptibility, etc.
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACCACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGGGACTATTATCGACTACAGATAAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
CTGATGATGGACTACGCTACTACTGCTAGCTGTATTACGATCAGCTACAACATCGTAGCTACGATGCATTAGCAAGCTATCGATCGATCGATCGATTATCTACGATCGATCGATCGATCACTATACGAGCTACTACGTACGTACGATCGCGTGACTATTATCGACTACAGATGAAACATGCTAGTACAACAGTATACATAGCTGCGGGATACGATTAGCTAATAGCTGACGATATCCGAT
Species Sequencing vs. Individual Genome Sequencing
• Species Sequencing: Determine the “consensus genome” of an entire species.
• Individual Sequencing: Determine how an individual differs from its species.
Why Would We Want to Sequence a Genome?
• Species genome sequencing:
  • Compare various species (e.g. human and chimpanzee) to understand how their genes function (e.g. which genes are important for brain development).
  • Reveal evolutionary relationships between species.
  • Determine the genetic makeup of our evolutionary ancestors.
• Individual genome sequencing:
  • Unearth the genetic basis of many diseases.
  • Forensics applications.
• Example: In 2010, 6-year-old Nicholas Volker became the first human being to be saved because of genome sequencing.
  • Doctors could not diagnose his condition, which caused strange infections; he went through nearly 100 surgeries.
  • Genome sequencing revealed a rare mutation in a gene linked to a defect in his immune system.
  • This led doctors to use advanced immunotherapy, which saved the child.
Brief History of Genome Sequencing
• Late 1970s: Walter Gilbert and Frederick Sanger develop independent sequencing methods.
• 1980: They share the Nobel Prize in Chemistry.
• Still, their sequencing methods were too expensive for large genomes: with a $1 per nucleotide cost, it would cost $3 billion to sequence the human genome.
Walter Gilbert
Frederick Sanger
• 1990: The public Human Genome Project, headed by Francis Collins, aims to sequence the human genome.
• 1997: Craig Venter founds Celera Genomics, a private firm, with the same goal.
Francis Collins
Craig Venter
Brief History of Mammalian Genome Sequencing
• 2000: The draft of the human genome is simultaneously completed by the (public) Human Genome Consortium and (private) Celera Genomics.
• 2000s: Many more mammalian genomes are sequenced.
The Arrival of Personal Genomics
• 2000s: Many companies launch projects aimed at reducing sequencing costs by orders of magnitude.
• 2010: The market for sequencing machines takes off.
  • Illumina reduces the cost of sequencing an individual human genome from $3 billion to $10,000.
  • Complete Genomics builds a genomic factory in Silicon Valley that sequences hundreds of genomes per month.
  • Beijing Genome Institute orders hundreds of sequencing machines, becoming the world’s largest sequencing center.
  • 23andMe offers partial genome sequencing for $499.
  • Many universities introduce new courses in which students study their own genomes.
The Future of Genome Sequencing
• 2010s?: Genome sequencing will hopefully continue to bloom.
  • The $1,000 human genome may arrive as early as 2012.
  • Hopefully, sequencing an individual genome will soon become as routine as an X-ray.
What Makes Genome Sequencing So Difficult?
• When we read a book, we can read the entire book one letter at a time from the beginning to the end.
• However, modern sequencing machines cannot read an entire genome one nucleotide at a time from beginning to end. They can only shred the genome and read the short pieces.
  • Thus, we can identify very short fragments of DNA (~100 nucleotides long), called reads.
  • But we have no idea which genomic positions these reads come from!
  • We must figure out how to put the reads back together to assemble a genome.
Section 2: The Newspaper Problem and Genome Sequencing
The Newspaper Problem
The Newspaper Problem as an “Overlap Puzzle”
• The newspaper problem is not the same as a jigsaw puzzle:
  • We have multiple copies of the same edition of a newspaper.
  • Plus, some pieces of paper got blown to bits in the explosion.
• Instead, we must use overlapping shreds of paper to reconstruct what the newspaper said.
• This gives us a giant overlap puzzle!
Sequencing Is Harder than the Newspaper Problem
• In the newspaper problem, we have the rules of language and common sense (e.g. “murder” and “suspect” would often appear near each other in a newspaper).
• However, the “language” of DNA remains largely unknown.
• There are lots of repeated substrings in every genome (50% of the human genome is formed by repeats).
  • Example: GCTT occurs 5 times in the following:
AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG
• Analogy: The Triazzle puzzle contains lots of repeated figures. This makes it very difficult to solve (even with just 16 pieces).
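The repeat example can be checked mechanically. The sketch below is only an illustration (the helper name `find_occurrences` is not from the slides); it scans the example string for every occurrence of GCTT, allowing overlaps, and finds five of them.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 0-based start position of every (possibly overlapping) match."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = text.find(pattern, start + 1)
    return positions


example = "AAGCTTCTATTGCTTAATTGGCTTGCTTCGCTTTG"
positions = find_occurrences(example, "GCTT")
print(len(positions), positions)  # 5 [2, 11, 20, 24, 29]
```

In a real genome the same idea applies to much longer repeats, which is why a read that falls inside a repeat cannot be placed unambiguously.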
Sequencing a Genome: Lab + Computation
• Read Generation (Experimental): Generate many reads from multiple copies of the same genome.
• Fragment Assembly (Computational): Use these reads to algorithmically put the genome back together.
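Read generation happens in the lab, but a toy simulation can make the input to fragment assembly concrete. This is a minimal sketch under stated assumptions: reads are error-free, all the same length, and sampled uniformly at random; the function name `generate_reads`, its parameters, and the toy genome string are illustrative choices, not part of the slides.

```python
import random


def generate_reads(genome: str, read_length: int, num_copies: int,
                   reads_per_copy: int, seed: int = 0) -> list[str]:
    """Sample fixed-length substrings from several copies of one genome."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    reads = []
    for _ in range(num_copies):
        for _ in range(reads_per_copy):
            start = rng.randrange(len(genome) - read_length + 1)
            reads.append(genome[start:start + read_length])
    # Shuffle: the assembler never learns where a read came from.
    rng.shuffle(reads)
    return reads


toy_genome = "GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC"  # toy string, far shorter than a real genome
reads = generate_reads(toy_genome, read_length=8, num_copies=5, reads_per_copy=6)
print(len(reads), reads[:3])
```

Fragment assembly then starts from `reads` alone, with the positions discarded.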
Sequencing a Genome: Illustration
[Figure: Multiple (Unsequenced) Genome Copies → Read Generation → Reads → Fragment Assembly → Sequenced Genome]
…GGCATGCGTCAGAAACTATCATAGCTAGATCGTACGTAGCC…
Section 3: DNA Chips: A First Shot at Sequencing with Short Reads
DNA Chips: From an Idea to a New Industry
• 1989: Radoje Drmanac, Andrey Mirzabekov, and Edwin Southern independently invent DNA chips (arrays) for read generation.
• Key Idea: Generate all k-mers (see below) from the genome in the hope that they can be assembled to reconstruct the genome.
• 1989: Science magazine writes, “Using DNA arrays for sequencing would simply be substituting one horrendous task for another.”
• 2000: Arrays are a multi-billion dollar industry.
Southern
Mirzabekov
Drmanac
k-mer: A string of length k (in an alphabet of 4 nucleotides)
DNA Chips: Implementation
1. Synthesize a distinct k-mer in each of 4^k cells in the array.
2. Cover the array with multiple copies of a fluorescently-labeled unknown DNA fragment.
3. DNA will hybridize with a k-mer if it contains the complement of that k-mer.
4. Use a spectroscope to determine which sites emit light … the complements of these sites will reveal the k-mers in the unknown DNA fragment = our reads!
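The four steps above can be mimicked in a few lines. This is a hedged sketch, not a model of the chemistry: it assumes "complement" means the reverse complement (the two strands pair in opposite orientations), it ignores hybridization errors, and the fragment `ATGGCGTGCAAT` and all function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(kmer: str) -> str:
    """Complement every base and reverse the result."""
    return "".join(COMPLEMENT[base] for base in reversed(kmer))


def build_array(k: int) -> list[str]:
    """Step 1: one cell for each of the 4^k possible k-mers."""
    return ["".join(letters) for letters in product("ACGT", repeat=k)]


def lit_cells(fragment: str, k: int) -> list[str]:
    """Steps 2-3: a cell lights up if the fragment contains its reverse complement."""
    present = {fragment[i:i + k] for i in range(len(fragment) - k + 1)}
    return [cell for cell in build_array(k) if reverse_complement(cell) in present]


def reads_from_cells(cells: list[str]) -> list[str]:
    """Step 4: the reverse complements of the lit cells are the reads."""
    return sorted(reverse_complement(cell) for cell in cells)


fragment = "ATGGCGTGCAAT"  # hypothetical unknown fragment
cells = lit_cells(fragment, 3)
reads = reads_from_cells(cells)
print(len(build_array(3)), cells, reads)
```

Note that the array reports only which k-mers occur, not how many times or in what order, which is what makes the assembly step necessary.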
DNA Chips: Illustration
DNA Chips: Example
• What are our reads?
AAA AGA CAA CGA GAA GGA TAA TGA
AAC AGC CAC CGC GAC GGC TAC TGC
AAG AGG CAG CGG GAG GGG TAG TGG
AAT AGT CAT CGT GAT GGT TAT TGT
ACA ATA CCA CTA GCA GTA TCA TTA
ACC ATC CCC CTC GCC GTC TCC TTC
ACG ATG CCG CTG GCG GTG TCG TTG
ACT ATT CCT CTT GCT GTT TCT TTT
• The cells of the array that emit light: CAC, CGC, TGC, CAT, CCA, GCA, GCC, ACG, TTG, ATT.
• The cell CAT hybridizes with the 3-mer ATG.
• So 3-mer ATG must occur in the genome!
Red 3-mers Must Occur in the Genome
• What are our reads? Each pair lists an array cell followed by the 3-mer it reveals.
  • CAC GTG
  • CGC GCG
  • CAT ATG
  • TGC GCA
  • ACG CGT
  • ATT AAT
  • CCA TGG
  • GCA TGC
  • GCC GGC
  • TTG CAA
From Biological Data to Computational Problem
Reads: GTG, GCG, GCA, ATG, TGG, TGC, GGC, CGT, CAA, AAT
• Aim: Construct a shortest possible genome containing all our reads.
• This is now a computational problem!
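For ten reads, the aim stated above can be met by exhaustive dynamic programming. The sketch below is one possible approach, not the method the slides go on to develop: it assumes a linear genome and that no read is a substring of another, and it runs in time exponential in the number of reads, so it is useful only for toy inputs. The function names are illustrative.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def shortest_superstring(reads: list[str]) -> str:
    """Exact shortest superstring by dynamic programming over subsets of reads."""
    n = len(reads)
    ov = [[0 if i == j else overlap(reads[i], reads[j]) for j in range(n)]
          for i in range(n)]
    # best[mask][last]: largest total overlap when the reads in `mask`
    # are chained in some order that ends with read `last` (-1 = unreachable).
    best = [[-1] * n for _ in range(1 << n)]
    parent = [[None] * n for _ in range(1 << n)]
    for i in range(n):
        best[1 << i][i] = 0
    for mask in range(1 << n):
        for last in range(n):
            if best[mask][last] < 0:
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                new_mask = mask | (1 << nxt)
                score = best[mask][last] + ov[last][nxt]
                if score > best[new_mask][nxt]:
                    best[new_mask][nxt] = score
                    parent[new_mask][nxt] = last
    # Walk the parent pointers back from the best final read.
    mask = (1 << n) - 1
    last = max(range(n), key=lambda i: best[mask][i])
    order = []
    while last is not None:
        order.append(last)
        previous = parent[mask][last]
        mask ^= 1 << last
        last = previous
    order.reverse()
    # Glue the reads together, writing each overlap only once.
    result = reads[order[0]]
    for prev, cur in zip(order, order[1:]):
        result += reads[cur][ov[prev][cur]:]
    return result


reads = ["GTG", "GCG", "GCA", "ATG", "TGG", "TGC", "GGC", "CGT", "CAA", "AAT"]
genome = shortest_superstring(reads)
print(len(genome), genome)
```

Ten distinct 3-mers need at least 12 nucleotides, and the search finds a string of exactly that length; several different strings of length 12 exist, so the one printed is just one valid answer. With a billion reads this brute-force search is hopeless, which motivates the graph-theoretic detour that follows.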
Section 4: Two Mathematical Detours
The Bridges of Königsberg
• The people of Königsberg, Prussia (present-day Kaliningrad, Russia) enjoyed taking walks.
• They wondered if they could walk through the city, cross each bridge (blue) exactly once, and return where they started.
• 1735: Leonhard Euler develops an approach to answer this question for any city, even for a “city” with a million islands.
• We will soon discuss Euler’s method as well as how it applies to genome sequencing.
Leonhard Euler
The Icosian Game
• Over a century passes…
• 1857: Irish mathematician William Hamilton designs a game consisting of a board representing 20 “islands” connected by “bridges.”
• Goal: find a walk that visits every island exactly once and returns to where it started.
William Hamilton
Icosian Game
Similar Problems with Very Different Fates
• These two stories have something in common:
  • Find a walk that uses every bridge once (Königsberg Bridges Problem)
  • Find a walk that visits every island once (Hamilton game)
• However, while Euler solved the first problem (even for a city with a million bridges), mathematicians still do not know how to solve the second problem efficiently, even for a city with a thousand islands.
• But where are the genomes???
Section 5: Introduction to Graph Theory
Graphs
• A graph is a network composed of two sets of objects:
  • Vertices: each vertex is represented by a point.
  • Edges: each edge is represented by a segment connecting two vertices.
• Graph theory can be applied to all kinds of different problems.
  • Transportation networks
  • Disease epidemics
  • Computer viruses spreading through the internet.
  • And, yes…genome sequencing!
Königsberg Bridges Graph
• For the Königsberg Bridge Problem, we create a graph:
  • Vertices = 4 land masses of the city
  • Edges = 7 bridges connecting land areas
Note: We don’t need to worry about the exact placement of vertices or the shape of bridges.
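As the note says, only the connections matter, so the graph can be stored as a plain list of vertex pairs. A minimal sketch follows; the labels A through D are hypothetical names for the four land masses (A for the central island, B and C for the two river banks, D for the eastern island).

```python
from collections import Counter

# Each pair is one bridge; repeated pairs are parallel bridges.
# The vertex labels are illustrative, not taken from the slides.
BRIDGES = [
    ("A", "B"), ("A", "B"),  # two bridges from the central island to one bank
    ("A", "C"), ("A", "C"),  # two bridges from the central island to the other bank
    ("A", "D"),              # one bridge between the two islands
    ("B", "D"), ("C", "D"),  # one bridge from each bank to the eastern island
]


def degrees(edges: list[tuple[str, str]]) -> Counter:
    """Count how many bridge ends touch each land mass."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


bridge_degrees = degrees(BRIDGES)
print(len(BRIDGES), dict(bridge_degrees))  # 7 {'A': 5, 'B': 3, 'C': 3, 'D': 3}
```

Every bridge contributes two ends, so the degrees always sum to twice the number of edges (here 14).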
Icosian Game Graph
• For the Icosian Game, we create a graph:
  • Vertices = islands
  • Edges = bridges connecting the islands
Eulerian and Hamiltonian Cycles
• Now, consider an ant standing on a vertex of a graph G.
• The ant can walk from vertex to vertex along the edges of G.
  • If the ant returns where it started, the result of its walk forms a cycle of G.
“Here I go!”
“…He wakes up in the morning…”
“…goes to visit his mommy…”
“…when all the little ants are marching…”
“…they all do it the same way…”
“Oh no! I’m back where I started!”
• Two questions:
1. Is there a cycle of G in which the ant walks through each edge exactly once? (Eulerian cycle)
2. Is there a cycle of G in which the ant walks through each vertex exactly once? (Hamiltonian cycle)
“???!!!”
“I wish someone would name a cycle after me…I’m the one doing all the walking here!”
Eulerian Cycles
• An Eulerian cycle is a cycle that traverses each edge exactly once.
  • A graph containing such a cycle is called Eulerian.
• If there were a solution to the Königsberg Bridge Problem, then we could find an Eulerian cycle in this graph.
• However, no such cycle exists.
• If we add two more edges, there will be such a cycle; see it?
[Figure: the Königsberg graph with two added edges; the edges of an Eulerian cycle are numbered 1 through 9.]
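Both claims can be checked with a standard criterion that is not stated on this slide: a connected undirected graph has an Eulerian cycle exactly when every vertex has even degree. The sketch below applies it to the Königsberg graph; the vertex labels and the particular choice of two extra edges are assumptions made for illustration, and the slide's figure may add different ones.

```python
from collections import Counter, defaultdict


def has_eulerian_cycle(edges: list[tuple[str, str]]) -> bool:
    """Undirected multigraph: connected and every vertex of even degree."""
    degree = Counter()
    neighbors = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    if not degree:
        return False
    # Depth-first search to confirm that every vertex is reachable.
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    connected = len(seen) == len(degree)
    return connected and all(d % 2 == 0 for d in degree.values())


# Hypothetical labels: A = central island, B and C = river banks, D = eastern island.
KONIGSBERG = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
              ("A", "D"), ("B", "D"), ("C", "D")]
EXTRA_BRIDGES = [("A", "B"), ("C", "D")]  # one assumed way to add two edges

print(has_eulerian_cycle(KONIGSBERG))                  # False
print(has_eulerian_cycle(KONIGSBERG + EXTRA_BRIDGES))  # True
```

With the two extra edges the graph has 9 edges and degrees 6, 4, 4, 4, so every vertex can be entered and left the same number of times.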
Hamiltonian Cycles
• A Hamiltonian cycle in a graph is a cycle that uses each vertex exactly once.
• A graph containing such a cycle is called Hamiltonian.
• For example, the graph corresponding to the Icosian game is Hamiltonian.
  • This means that the Icosian game has a solution!
[Figure: the Icosian game graph with the vertices of a Hamiltonian cycle numbered 1 through 20.]
Finding Eulerian Cycles vs Hamiltonian Cycles
• Given a graph G, we now have two questions that we can program a computer to answer about G.
• Eulerian Cycle Problem (ECP): Find an Eulerian cycle in G or prove that G is not Eulerian.
• Hamiltonian Cycle Problem (HCP): Find a Hamiltonian cycle in G or prove that G is not Hamiltonian.
Genome Reconstruction: A Puzzle With a Billion Pieces
Section 6: Euler’s Theorem
Genome Reconstruction: A Puzzle With a Billion Pieces
Euler’s Theorem
• We will now discuss how Euler solved the Königsberg Bridge Problem.
• You might guess: He used graph theory!
• This is not entirely accurate. A better statement would be: He invented graph theory!
Genome Reconstruction: A Puzzle With a Billion Pieces
Directed Graphs
• Directed Graph: A graph in which each edge has a direction (represented by an arrow).
• You might like to think of directed edges as “one-way bridges.”
[Figure: an undirected graph beside a directed graph.]
Genome Reconstruction: A Puzzle With a Billion Pieces
Eulerian Cycles in Directed Graphs
• An Eulerian cycle in a directed graph is simply a cycle that travels down every edge exactly once, in the correct direction.
• A directed graph is Eulerian if it contains an Eulerian cycle.
• Is this graph Eulerian? Why?
Genome Reconstruction: A Puzzle With a Billion Pieces
Balanced Graphs
• indegree(v) = the number of edges leading into vertex v.
• outdegree(v) = the number of edges leading out of v.
• A graph is balanced if indegree(v) = outdegree(v) for every vertex v.
• Label each vertex v with (indegree(v), outdegree(v)).
• This graph isn’t balanced since some vertices don’t have equal indegree and outdegree.
[Figure: a directed graph with vertex labels (1, 2), (2, 1), (1, 0), (2, 1), (1, 1), (0, 2), (1, 1).]
Genome Reconstruction: A Puzzle With a Billion Pieces
Balanced Graphs
• Adding some edges makes the graph balanced.
[Figure: the same graph with edges added; vertex labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1).]
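The labeling step on these slides can be sketched in a few lines of Python. The adjacency-dictionary representation and the names `degree_labels` and `is_balanced` are assumptions made for illustration, and the toy graphs are not the graph pictured on the slide.

```python
def degree_labels(graph):
    """Map each vertex to its (indegree, outdegree) pair.

    `graph` is a hypothetical adjacency dict: vertex -> list of the
    vertices reached by edges leaving that vertex.
    """
    vertices = set(graph) | {w for targets in graph.values() for w in targets}
    indegree = {v: 0 for v in vertices}
    for targets in graph.values():
        for w in targets:
            indegree[w] += 1
    return {v: (indegree[v], len(graph.get(v, []))) for v in vertices}


def is_balanced(graph):
    """True when indegree(v) == outdegree(v) for every vertex v."""
    return all(i == o for i, o in degree_labels(graph).values())


# Toy graphs for illustration only.
unbalanced = {"a": ["b", "c"], "b": ["c"], "c": []}
balanced = {"a": ["b"], "b": ["c"], "c": ["a"]}
```

Calling `degree_labels(unbalanced)` gives vertex "a" the label (0, 2), so that graph fails the balance test, while the three-vertex loop passes it.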
Genome Reconstruction: A Puzzle With a Billion Pieces
Euler’s Theorem
• Euler’s Theorem: A connected directed graph G contains an Eulerian cycle precisely when G is balanced.
• A graph is connected if for every pair of vertices {u, v}, an ant can travel either from u to v or from v to u.
[Figure: a graph that is not connected, beside the balanced example graph with vertex labels (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (2, 2), (1, 1). Connected + Balanced = Eulerian.]
Genome Reconstruction: A Puzzle With a Billion Pieces
Section 7: ECP vs. HCP and Algorithmic Complexity
Genome Reconstruction: A Puzzle With a Billion Pieces
Solving the ECP
• By Euler’s Theorem, to determine whether a connected graph G contains an Eulerian cycle, we only need to check whether G is balanced.
• So we simply go to each vertex v and check whether indegree(v) = outdegree(v):
• If every vertex is balanced, then G must contain an Eulerian cycle.
• If some vertex is not balanced, then G cannot contain an Eulerian cycle.
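As a rough sketch of how this test might be programmed, the function below checks both conditions of Euler's Theorem. The name `has_eulerian_cycle` and the adjacency-dictionary input are assumptions for illustration. The connectivity check relies on the fact that in a balanced graph, following directed edges from any one vertex reaches every vertex of its connected piece.

```python
def has_eulerian_cycle(graph):
    """Decide the ECP for a directed graph given as an adjacency dict
    (vertex -> list of successor vertices). Hypothetical helper."""
    vertices = set(graph) | {w for targets in graph.values() for w in targets}
    if not vertices:
        return True  # nothing to traverse

    # Balance check: indegree(v) must equal outdegree(v) everywhere.
    indegree = {v: 0 for v in vertices}
    for targets in graph.values():
        for w in targets:
            indegree[w] += 1
    if any(indegree[v] != len(graph.get(v, [])) for v in vertices):
        return False

    # Connectivity check: explore from an arbitrary vertex along directed edges.
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for w in graph.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices
```

Both checks touch each vertex and edge a constant number of times, which is why the ECP decision is cheap even for very large graphs.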
Genome Reconstruction: A Puzzle With a Billion Pieces
Connected + Balanced = Eulerian
• Recall our example directed graph from before.
• Before the extra edges were added, the graph was not balanced, and so it clearly wasn’t Eulerian.
• Adding the edges to make the graph balanced will mean that an Eulerian cycle must exist.
• One vital question remains: Where did this Eulerian cycle come from?
[Figure: the balanced example graph with its edges numbered 1 through 11 along an Eulerian cycle.]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• Place an ant on an arbitrary vertex v of the graph and let it walk along any edges it likes.
• The ant cannot walk along any edge that has been previously traversed.
• The ant must always walk along edges in the legal direction.
• At each step, we update the remaining indegree and outdegree of each vertex.
• Cycle! But not Eulerian yet…
[Figure: the ant walks four edges and returns to its starting vertex. The remaining (indegree, outdegree) labels go from (2, 2), (2, 2), (1, 1), (2, 2), (1, 1), (1, 1), (2, 2) to (2, 2), (2, 2), (1, 1), (0, 0), (1, 1), (1, 1), (0, 0).]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• Let’s cut out the cycle that the ant has found.
• Next delete vertices that are no longer connected to anything.
[Figure: with the two (0, 0) vertices deleted, five vertices remain, labeled (2, 2), (2, 2), (1, 1), (1, 1), (1, 1).]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• Again, let the ant walk through the graph however it chooses.
• We always start with a balanced graph, which means that the ant can never “get stuck” at a vertex along the way, because it will always have an unused edge leading out of any vertex that it enters. The only place the walk can end is the vertex where it began.
Ant: “I really don’t see how this is going to give us an Eulerian cycle in the original graph…I knew I shouldn’t have left the house this morning!”
• Cycle! But still not Eulerian…
[Figure: the ant walks four more edges; the remaining labels become (1, 1), (1, 1), (0, 0), (0, 0), (1, 1).]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• Let’s trim out this cycle one more time.
• The ant is stranded, so let’s move it to a vertex that still has unused edges.
Ant: “Hmph! Dragged halfway across the screen…I guess I don’t have any say in the matter…”
• Now there’s only one way that the ant can walk through the graph.
• Cycle! And Eulerian to boot…so we have run out of edges.
• What do we do now?
Ant: “Yes! What DO we do now?”
[Figure: three vertices labeled (1, 1) remain; after the ant walks the last three edges, every label is (0, 0).]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• Let’s bring back our original graph.
• Highlight the three cycles that the ant found.
[Figure: the original graph with the ant’s three cycles highlighted in green, blue, and orange.]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• Start at the ant’s original position, and follow the green cycle.
• Cycle formed: we can continue along the blue cycle.
• Cycle formed; however, we now have no new edges to follow!
Ant: “???”
[Figure: edges 1 through 4 trace the green cycle and edges 5 through 8 trace the blue cycle.]
Genome Reconstruction: A Puzzle With a Billion Pieces
Making an Eulerian Cycle from a Balanced Graph
• To remedy this, let’s start backtracking along the blue cycle until we reach a vertex with an orange edge leaving it.
Ant: “Backtracking? But I’m not evolved to walk backwards!”
• Success! Now let’s follow the orange cycle.
• Rejoin the blue cycle…
Ant: “I smell something good!”
• And we have the same Eulerian cycle from before!
Ant: “Yay! Now can I go home please?”
[Figure: after backtracking over edges 8 and 7, edges 7 through 9 trace the orange cycle and edges 10 and 11 finish the blue cycle.]
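The ant's procedure (walk until stuck, backtrack to a vertex with an unused edge, then splice in the next cycle) is commonly known as Hierholzer's algorithm. Below is a minimal sketch of a standard stack-based formulation of it, not a transcription of the slides: the stack plays the role of the ant's trail, and popping the stack is the backtracking. The function name, the adjacency-dictionary input, and the toy graph are assumptions for illustration, and the input is assumed to be connected and balanced.

```python
def eulerian_cycle(graph, start):
    """Return an Eulerian cycle as a list of vertices (first == last).

    `graph` maps each vertex to a list of successor vertices and is
    assumed to be connected and balanced.
    """
    unused = {v: list(targets) for v, targets in graph.items()}  # copy
    trail = [start]   # the ant's current walk
    cycle = []        # finished part of the cycle, built in reverse
    while trail:
        v = trail[-1]
        if unused.get(v):
            # An unused edge leaves v: the ant walks it.
            trail.append(unused[v].pop())
        else:
            # The ant is stuck at v: backtrack one step and record v.
            cycle.append(trail.pop())
    cycle.reverse()
    return cycle


# Toy balanced graph with five edges (not the graph from the slides).
toy = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["b"]}
```

Each edge is pushed once and popped once, so the work grows in proportion to the number of edges.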
Genome Reconstruction: A Puzzle With a Billion Pieces
What’s the Big Deal?
• The great thing about this method is that it can be easily generalized to any connected balanced graph to give an Eulerian cycle.
• “Yeah, but this Eulerian cycle wasn’t that hard to find anyway! So why should we care about the method?”
• Think about trying to eyeball an Eulerian cycle in a graph containing billions of edges. Not so easy…
[Figure: the example graph with its edges numbered 1 through 11 along the Eulerian cycle.]
Genome Reconstruction: A Puzzle With a Billion Pieces
What’s the Big Deal?
• More profoundly, this method to find an Eulerian cycle in a balanced graph can be implemented extremely efficiently on a computer: each edge is walked only once, so the running time grows in proportion to the number of edges.
• Example: A modern computer can find an Eulerian cycle in a balanced graph containing billions of edges in under a minute!
Genome Reconstruction: A Puzzle With a Billion Pieces
What’s the Big Deal?
• “Yeah, but computers are supermachines! They don’t really need 300-year-old mathematics to help them solve problems. Aren’t they going to take over the world anyway?”
• So let’s examine the case of finding a Hamiltonian cycle…
Genome Reconstruction: A Puzzle With a Billion Pieces
Searching for an Efficient Algorithm for HCP
• Key Point: No one has ever found a similar efficient test to determine whether a graph is Hamiltonian.
• Of course, we could examine every possible (ant) walk through the graph to solve the HCP.
• However, this brute force approach is just not efficient: there are more walks through a graph on just 1,000 vertices than there are atoms in the universe!
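To make the brute-force idea concrete, here is a small sketch that tries every ordering of the vertices and keeps the first one that forms a cycle. The function names and the adjacency-dictionary input are assumptions for illustration. Note that it counts candidate vertex orderings, (n - 1)! of them for n vertices, rather than all possible walks.

```python
from itertools import permutations
from math import factorial


def hamiltonian_cycle_brute_force(graph):
    """Try every vertex ordering; return a Hamiltonian cycle or None.

    `graph` maps each vertex to a list of successor vertices.
    """
    vertices = list(graph)
    if not vertices:
        return None
    first, rest = vertices[0], vertices[1:]
    for order in permutations(rest):
        cycle = [first, *order, first]
        # Keep the ordering only if every consecutive pair is an edge.
        if all(w in graph[v] for v, w in zip(cycle, cycle[1:])):
            return cycle
    return None


def candidate_orderings(n):
    """Number of orderings the search above may have to examine."""
    return factorial(n - 1)
```

For a 1,000-vertex graph, `candidate_orderings(1000)` is a number with well over two thousand digits, which is why this search is hopeless at genome scale.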
Genome Reconstruction: A Puzzle With a Billion Pieces
NP-Complete Problems
• In fact, the HCP has been classified as NP-Complete.
• In layman’s terms, this means that the HCP belongs to a collection containing thousands of computational problems for which no one knows how to solve large inputs quickly.
• NP-Complete problems are all equivalent to each other: find an efficient solution to one, and you have an efficient solution to them all.
Genome Reconstruction: A Puzzle With a Billion Pieces
NP-Complete Problems
• Attempting to solve any NP-Complete problem is difficult.
“I can't find an efficient algorithm, I guess I'm just too dumb.”
• The hope is that you could verify that you failed because an efficient algorithm for an NP-Complete problem doesn’t exist.
“I can't find an efficient algorithm, because no such algorithm is possible.”
• The present state of affairs is somewhere in between.
“I can't find an efficient algorithm, but neither can all these famous people.”
Cartoon captions from Garey and Johnson. Computers and Intractability. 1979
Genome Reconstruction: A Puzzle With a Billion Pieces
The NP-Completeness of the HCP
• The question of whether or not NP-Complete problems (including the HCP) can be solved efficiently is one of seven Millennium Problems in mathematics.
• Find an efficient algorithm for the HCP, or demonstrate that no such algorithm exists, and you will get $1 million.
• However, if you become a mathematician, odds are that you are not in it for the $$$... Recently, Grigory Perelman solved one of these problems but turned down the prize.
Grigory Perelman, True Legend
Genome Reconstruction: A Puzzle With a Billion Pieces
Section 8: From Euler and Hamilton to Fragment Assembly
Genome Reconstruction: A Puzzle With a Billion Pieces
Simplifying Assumptions for Fragment Assembly
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular-shaped chromosome.
• Note: In the final section, we will relax these assumptions.
Genome Reconstruction: A Puzzle With a Billion Pieces
First Try: The Graph H
• Create a vertex for every read detected by our array.
ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG
Genome Reconstruction: A Puzzle With a Billion Pieces
First Try: The Graph H
• Create a vertex for every k-mer detected by our array.
• Prefix: First k – 1 nucleotides of a k-mer (CA in CAA)
• Suffix: Last k – 1 nucleotides of a k-mer (AA in CAA)
• Different 3-mers may share a prefix/suffix: TG is the suffix of ATG and CTG and the prefix of TGA.
ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG
Genome Reconstruction: A Puzzle With a Billion Pieces
First Try: The Graph H
• As for the edges of H, connect vertex v to vertex w with a directed edge if the suffix of v matches the prefix of w.
ATG CGT GGC AAT GTG TGG TGC CAA GCA GCG
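The construction of H can be sketched directly from this rule. The function name `build_graph_h` and the adjacency-dictionary output are assumptions for illustration; the reads are the ten 3-mers shown on the slide.

```python
def build_graph_h(reads):
    """Build H: one vertex per read, with an edge v -> w whenever the
    suffix of v (its last k - 1 letters) equals the prefix of w
    (its first k - 1 letters)."""
    return {v: [w for w in reads if v[1:] == w[:-1]] for v in reads}


reads = ["ATG", "CGT", "GGC", "AAT", "GTG", "TGG", "TGC", "CAA", "GCA", "GCG"]
H = build_graph_h(reads)
```

For example, `H["ATG"]` lists TGG and TGC, because both begin with the suffix TG of ATG. Comparing every read against every other read takes time proportional to the square of the number of reads, which is acceptable for a sketch but not for billions of reads.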
Genome Reconstruction: A Puzzle With a Billion Pieces
Hamiltonian Cycles in H
• Here we have a Hamiltonian cycle in H:
ATG TGG GGC GCG CGT GTG TGC GCA CAA AAT ATG
• Spelling out the reads along the cycle, with each new read adding one nucleotide:
ATGGCGTGCAATG
• The final ATG is the read we started from, so wrapping it around gives the circular genome.
Genome: ATGGCGTGCA
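Turning the cycle of reads into a genome string is a short computation. The sketch below assumes the cycle is given as a list of k-mers whose last entry repeats the first; the function names are hypothetical.

```python
def spell_linear(kmers):
    """Spell a string from overlapping k-mers: the first k-mer in full,
    then the last letter of each following k-mer."""
    return kmers[0] + "".join(kmer[-1] for kmer in kmers[1:])


def spell_circular(cycle):
    """Spell a circular genome from a cycle of k-mers (first == last).

    The repeated final k-mer overlaps the start of the string, so its
    k letters are dropped.
    """
    k = len(cycle[0])
    return spell_linear(cycle)[:-k]


cycle = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG",
         "TGC", "GCA", "CAA", "AAT", "ATG"]
```

Here `spell_linear(cycle)` gives ATGGCGTGCAATG and `spell_circular(cycle)` gives ATGGCGTGCA, matching the strings on the slides.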
Genome Reconstruction: A Puzzle With a Billion Pieces
Problem with H
• Ultimately, we must solve the HCP on H in order to find a candidate DNA sequence…
• This idea motivated the method used for assembling the human genome from 50 million (long and expensive) reads in 2000, but the computational strain was overwhelming: sequencing the human genome took several computers a period of months, working around the clock.
• For that matter, newer sequencing technologies produce billions of (short and inexpensive) reads: we need a new idea.
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:
• Create a vertex for each distinct prefix/suffix from reads.
Reads: TGC GGC CGT CAA AAT GTG GCG GCA ATG TGG
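Collecting the vertices of E amounts to gathering every distinct (k – 1)-mer that begins or ends a read. A minimal sketch, with the function name assumed for illustration:

```python
def vertices_of_e(reads):
    """Return the distinct prefixes and suffixes (length k - 1) of the reads."""
    vertices = set()
    for read in reads:
        vertices.add(read[:-1])  # prefix: first k - 1 letters
        vertices.add(read[1:])   # suffix: last k - 1 letters
    return vertices


reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
print(sorted(vertices_of_e(reads)))
# ['AA', 'AT', 'CA', 'CG', 'GC', 'GG', 'GT', 'TG']
```

Ten reads yield only eight vertices here, because shared prefixes and suffixes collapse into a single vertex.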
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
GT
TG GC
CG
CAAT
GG
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
GTG
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
GCGGTG
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
GCGGTG
GCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
GCGGTG
GCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
TGGGCGGTG
GCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
TGGGCGGTG
TGC GCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
TGG GGCGCGGTG
TGC GCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
TGG GGCGCG
CGT
GTG
TGC GCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
TGG GGCGCG
CGT
GTG
TGC GCA
CAA
Genome Reconstruction: A Puzzle With a Billion Pieces
Second Try: The Graph E
• Form a different graph E as follows:• Create a vertex for each distinct prefix/suffix from reads.• Connect vertex v to vertex
w with a directed edge ifthere is a read whoseprefix is v and whosesuffix is w.
CAGC
CG
TG
GT
GG
AT
AA
TGCGGCCGTCAAAAT
GTGGCGGCAATGTGG
Reads
ATG
TGG GGCGCG
CGT
GTG
TGC GCA
CAAAAT
Second Try: The Graph E
• Form a different graph E as follows:
  • Create a vertex for each distinct prefix/suffix from reads (the first or last k – 1 letters of a read; here k = 3).
  • Connect vertex v to vertex w with a directed edge if there is a read whose prefix is v and whose suffix is w.
Reads: TGC, GGC, CGT, CAA, AAT, GTG, GCG, GCA, ATG, TGG
Vertices: AA, AT, CA, CG, GC, GG, GT, TG
Edges (one per read): AT → TG (ATG), TG → GG (TGG), GG → GC (GGC), GC → CG (GCG), CG → GT (CGT), GT → TG (GTG), TG → GC (TGC), GC → CA (GCA), CA → AA (CAA), AA → AT (AAT)
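The construction of E can be sketched in a few lines of Python. This is only an illustration under the slide's assumptions (error-free reads that are all 3-mers); the function name `build_graph_e` and the adjacency-list representation are choices made here, not part of the original slides.

```python
from collections import defaultdict

def build_graph_e(reads):
    """Map each prefix (k-1)-mer to the list of suffix (k-1)-mers it points to."""
    graph = defaultdict(list)
    for read in reads:
        prefix, suffix = read[:-1], read[1:]
        graph[prefix].append(suffix)   # one directed edge per read
        graph[suffix]                  # register the suffix as a vertex too
    return dict(graph)

# The ten 3-mer reads from the slide.
reads = ["TGC", "GGC", "CGT", "CAA", "AAT", "GTG", "GCG", "GCA", "ATG", "TGG"]
graph_e = build_graph_e(reads)

for vertex in sorted(graph_e):
    print(vertex, "->", sorted(graph_e[vertex]))
```

Each read contributes exactly one edge, so the graph has as many edges as there are reads, while the number of vertices is bounded by the number of distinct (k – 1)-mers.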
Eulerian Cycles in E
• We have an Eulerian cycle in E:
  ATG → TGG → GGC → GCG → CGT → GTG → TGC → GCA → CAA → AAT
• This is the same sequence of 3-mers that we had in H!
• Thus we will obtain the same sequenced genome as before.
• Genome: ATGGCGTGCA (circular)
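As a rough sketch of how an Eulerian cycle in E could be found and turned into a genome, the code below uses Hierholzer's algorithm. The slides do not name a specific algorithm, so this choice, along with the function names `eulerian_cycle` and `spell_circular_genome`, is an assumption made for illustration. The graph is assumed to be balanced and connected.

```python
from collections import defaultdict

def eulerian_cycle(edges):
    """Hierholzer's algorithm on directed edges (v, w).

    Returns the cycle as a list of vertices whose first and last entries match.
    Assumes the graph is balanced and connected.
    """
    out_edges = defaultdict(list)
    for v, w in edges:
        out_edges[v].append(w)
    stack, cycle = [edges[0][0]], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            cycle.append(stack.pop())         # dead end: vertex is finished
    return cycle[::-1]

def spell_circular_genome(cycle):
    """Vertex i of the cycle is the (k-1)-mer starting at genome position i."""
    return "".join(vertex[0] for vertex in cycle[:-1])

reads = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA", "CAA", "AAT"]
edges = [(read[:-1], read[1:]) for read in reads]
cycle = eulerian_cycle(edges)
genome = spell_circular_genome(cycle)
print(genome)
```

The printed genome depends on which unused edge is followed first at vertices with several outgoing edges, so it may differ from ATGGCGTGCA while still using every read exactly once. This is the "more than one Eulerian cycle" issue in practice.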
Analysis of E
• Good News: We now only have to find an Eulerian cycle in the graph E, which could be done on this computer.
• Bad News:
  1. There may be more than one Eulerian cycle in E.
     • We won’t discuss this issue here, but it can be resolved.
  2. How do we know that E even has an Eulerian cycle?
     • By Euler’s Theorem, we only need to show that E is a balanced graph.
     • To do this, we need one more piece of mathematical history…
Section 9: De Bruijn and Fragment Assembly
De Bruijn’s Question
• 1946: The Dutch mathematician Nicolaas de Bruijn asks: can we design a circular superstring of minimal length that contains every binary string of length k?
• Example for k = 3: the circular superstring ‘00011101’ contains all eight binary strings of length 3. For instance, ‘000’ occupies positions 1–3 and ‘110’ occupies positions 5–7.
[Photo: Nicolaas de Bruijn]
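The k = 3 example can be checked mechanically. The sketch below reads every length-3 window of the circular string, wrapping around the end, and compares the result with the full set of binary 3-mers; the helper name `circular_kmers` is an illustrative choice.

```python
from itertools import product

def circular_kmers(s, k):
    """All k-letter windows of the circular string s, in order of start position."""
    wrapped = s + s[:k - 1]
    return [wrapped[i:i + k] for i in range(len(s))]

superstring = "00011101"
k = 3
found = circular_kmers(superstring, k)
all_binary = {"".join(bits) for bits in product("01", repeat=k)}

print(found)
print(set(found) == all_binary and len(found) == len(all_binary))
```

The string has length 8 and there are 2^3 = 8 binary 3-mers, so each must appear exactly once; no shorter circular string could contain them all.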
De Bruijn’s Question
• De Bruijn introduced a special class of graph B(n, k):
  • Vertices = all n^(k – 1) possible (k – 1)-mers in an n-letter alphabet.
  • An edge connects v to w if there is a k-mer whose prefix = v and whose suffix = w.
• At right is B(2, 4), assuming that our alphabet contains 0 and 1.
De Bruijn’s Question
• For any choice of n and k, B(n, k) must be balanced/Eulerian.
• Why? Because both the indegree and the outdegree of every vertex equal the size of the alphabet (n): every (k – 1)-mer occurs as the prefix of n different k-mers and as the suffix of n different k-mers.
• Red numbers show the order of edges in an Eulerian cycle.
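The degree argument can be verified directly for B(2, 4). The sketch below builds one edge per k-mer and tallies indegrees and outdegrees; the function name `de_bruijn_graph` and the edge-list representation are assumptions made for this illustration.

```python
from collections import Counter
from itertools import product

def de_bruijn_graph(alphabet, k):
    """Edges of B(n, k): one edge per k-mer, from its (k-1)-prefix to its (k-1)-suffix."""
    return [("".join(kmer[:-1]), "".join(kmer[1:]))
            for kmer in product(alphabet, repeat=k)]

alphabet = "01"
edges = de_bruijn_graph(alphabet, 4)
outdeg = Counter(v for v, _ in edges)
indeg = Counter(w for _, w in edges)
vertices = sorted(set(outdeg) | set(indeg))

print(len(vertices), "vertices,", len(edges), "edges")
print(all(indeg[v] == outdeg[v] == len(alphabet) for v in vertices))
```

Swapping in the alphabet "ACGT" gives B(4, k), the graph that contains E.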
De Bruijn’s Question
• The graph E we have constructed is contained in the graph B(4, k).
  • We have n = 4 since there are four possible nucleotides.
• E must be balanced/Eulerian too!
  • The indegree and outdegree of any (k – 1)-mer vertex both equal how many times this (k – 1)-mer appears in the genome.
[Figure: graph E with its edges numbered 1–10 in the order of the Eulerian cycle.]
Genome: ATGGCGTGCA
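The claim about degrees can be checked on the running example. The sketch below treats ATGGCGTGCA as circular, derives the edges of E from its 3-mers, and compares each vertex's indegree and outdegree with the number of times that 2-mer occurs in the genome. Variable names are illustrative.

```python
from collections import Counter

genome = "ATGGCGTGCA"  # circular genome from the slide
k = 3
n = len(genome)
wrapped = genome + genome[:k - 1]

kmers = [wrapped[i:i + k] for i in range(n)]          # one edge per k-mer
outdeg = Counter(kmer[:-1] for kmer in kmers)         # edges leaving each prefix
indeg = Counter(kmer[1:] for kmer in kmers)           # edges entering each suffix
occurrences = Counter(wrapped[i:i + k - 1] for i in range(n))

for vertex in sorted(occurrences):
    print(vertex, indeg[vertex], outdeg[vertex], occurrences[vertex])
```

TG and GC each occur twice in the genome and therefore have indegree and outdegree 2; every other vertex has degree 1 in each direction.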
Section 10: Generalizing Fragment Assembly
Simplifying Assumptions for Fragment Assembly
• Recall the assumptions we have already made:
1. Every k-mer occurring in the genome is generated by some read.
2. Reads are error-free.
3. Every k-mer occurring in the genome occurs exactly once.
4. The underlying genome consists of a single circular-shaped chromosome.
• Our aim is to relax each of these assumptions and determine how the problem changes.
Assumption 1: Generating (nearly) all k-mers
• 100-nucleotide reads generated by Illumina sequencing technology capture only a small fraction of the 100-mers in the genome (even for high-coverage sequencing projects), thus violating this key assumption of the de Bruijn graph approach.
• However, if we break these reads into shorter k-mers, the resulting k-mers often represent nearly all k-mers from the genome for sufficiently small k.
• For example, modern assemblers often break every 100-nucleotide read into 46 overlapping 55-mers and then assemble the resulting 55-mers using de Bruijn graphs.
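The count of 46 follows from the fact that a read of length L contains L – k + 1 overlapping k-mers. A minimal sketch, using a randomly generated stand-in for a real read (the read itself and the helper name `split_into_kmers` are assumptions for illustration):

```python
import random

def split_into_kmers(read, k):
    """All overlapping k-mers of a linear read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

random.seed(0)
read = "".join(random.choice("ACGT") for _ in range(100))  # stand-in 100-nt read
kmers = split_into_kmers(read, 55)

print(len(kmers))  # 100 - 55 + 1 = 46
```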
Assumption 1: Generating (nearly) all k-mers
• Example: Say our genome is ATGCAAGCTAGCT, and we generate four reads of length 6:
  Genome: ATGCAAGCTAGCT
  Reads: ATGCAA, CAAGCT, CTAGCT, CTATGC (the last read wraps around the end of the circular genome)
• We didn’t generate all possible 6-mers from the genome as reads (e.g. TGCAAG), but we will have all possible 3-mers by splitting the reads into pieces.
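The example can be checked with a short script. It assumes the genome is circular and takes the fourth read to be CTATGC, the read that wraps from the end of the genome back to its start; that reading of the slide's layout is an assumption, as is the helper name `kmers_of`.

```python
def kmers_of(s, k):
    """Set of all k-mers in the linear string s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

genome = "ATGCAAGCTAGCT"
reads = ["ATGCAA", "CAAGCT", "CTAGCT", "CTATGC"]  # 4th read assumed to wrap around

# Append the first k - 1 letters so that windows crossing the end are included.
genome_6mers = kmers_of(genome + genome[:5], 6)
genome_3mers = kmers_of(genome + genome[:2], 3)
read_3mers = set().union(*(kmers_of(read, 3) for read in reads))

print("TGCAAG" in genome_6mers, "TGCAAG" in reads)  # in the genome, but not a read
print(genome_3mers <= read_3mers)                   # every genomic 3-mer is covered
```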
Assumption 2: Handling Errors in Reads
• What happens to the graph E when some reads have errors?
• Example: Say our graph E for genome ATGGCGTGCAATG should look like this.
  • If read TGGCGTG is mistakenly sequenced as TGGAGTG, then the graph will look like this instead.
  • This is called a bulge in the graph E.
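To see where the bulge comes from, it helps to compare the 3-mers of the correct and the erroneous read. In this sketch (the helper name `kmers` is illustrative), the single substituted nucleotide produces three 3-mers that do not occur in the correct read, and these form a second path between the same two vertices.

```python
def kmers(read, k=3):
    """Overlapping k-mers of a read, left to right."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

correct = kmers("TGGCGTG")
erroneous = kmers("TGGAGTG")
spurious = [kmer for kmer in erroneous if kmer not in correct]

print(correct)    # ['TGG', 'GGC', 'GCG', 'CGT', 'GTG']
print(erroneous)  # ['TGG', 'GGA', 'GAG', 'AGT', 'GTG']
print(spurious)   # ['GGA', 'GAG', 'AGT']

# Both paths leave vertex GG and rejoin at vertex GT:
print([kmer[:-1] for kmer in correct[1:4]] + [correct[3][1:]])      # correct path
print([kmer[:-1] for kmer in erroneous[1:4]] + [erroneous[3][1:]])  # erroneous path
```

The correct path runs GG → GC → CG → GT and the erroneous one runs GG → GA → AG → GT; the two parallel paths are the bulge.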
Assumption 2: Handling Errors in Reads
• Most reads have errors, resulting in millions of bulges in E.
• 2004: Pevzner et al. provide an algorithm for bulge removal.
Assumption 3: Handling Repeated k-mers
• The genome ACGTACGT has only four distinct 3-mers: ACG, CGT, GTA, and TAC.
• We would obtain the graph E below and reconstruct this genome as: ACGT
• In other words, we can’t represent repeated k-mers in the genome!
[Figure: graph E with vertices AC, CG, GT, TA and edges ACG, CGT, GTA, TAC.]
Assumption 3: Handling Repeated k-mers
• Define the multiplicity of a k-mer as the number of times it occurs in a genome.
• We will add edges to E in order to form a new graph E* for which the number of edges connecting two vertices represents the multiplicity of the k-mer on that edge.
• An Eulerian cycle in E* still gives a candidate genome.
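Multiplicities are easy to compute when the genome is known, which makes the ACGTACGT example concrete. The sketch below counts 3-mers of the circular genome; the function name `circular_kmer_counts` is an illustrative choice. In practice the genome is unknown and multiplicities would have to be estimated from the reads, which this sketch does not attempt.

```python
from collections import Counter

def circular_kmer_counts(genome, k):
    """Multiplicity of every k-mer in a circular genome."""
    wrapped = genome + genome[:k - 1]
    return Counter(wrapped[i:i + k] for i in range(len(genome)))

counts = circular_kmer_counts("ACGTACGT", 3)
print(dict(counts))

# E keeps one edge per distinct 3-mer; E* keeps one edge per occurrence.
edges_e = sorted(counts)
edges_e_star = sorted(counts.elements())
print(len(edges_e), len(edges_e_star))
```

E has 4 edges and can only spell ACGT, while E* has 8 edges, enough to spell the full genome.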
Assumption 3: Handling Repeated k-mers
• Say that we have the following read multiplicities:
  • Multiplicity 1: ATG, GGC, AAT, TGG, CAA, GCA
  • Multiplicity 2: GCG, CGT, GTG, TGC
• We reflect multiplicities as multiple edges.
• Candidate genome: ATGCGTGGCGTGCA
• E* is balanced because indegree(v) and outdegree(v) still equal the number of times v appears in the genome.
[Figure: graph E* on vertices AA, AT, CA, CG, GC, GG, GT, TG; the edges GCG, CGT, GTG, and TGC are doubled.]
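A candidate genome can be produced from these multiplicities by expanding each 3-mer into that many parallel edges and walking an Eulerian cycle in the resulting multigraph. The sketch below uses Hierholzer's algorithm, which is an assumption (the slides do not specify an algorithm); the function name and the choice of AT as start vertex are also illustrative.

```python
from collections import Counter, defaultdict

def eulerian_cycle(edges, start):
    """Hierholzer's algorithm; parallel edges are simply repeated in `edges`."""
    out_edges = defaultdict(list)
    for v, w in edges:
        out_edges[v].append(w)
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())
        else:
            cycle.append(stack.pop())
    return cycle[::-1]

multiplicities = {"ATG": 1, "GGC": 1, "AAT": 1, "TGG": 1, "CAA": 1, "GCA": 1,
                  "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2}

# E*: repeat each edge as many times as its k-mer's multiplicity.
edges = [(kmer[:-1], kmer[1:])
         for kmer, count in multiplicities.items() for _ in range(count)]

cycle = eulerian_cycle(edges, "AT")
genome = "".join(vertex[0] for vertex in cycle[:-1])

# Check the candidate: its circular 3-mer counts must match the multiplicities.
wrapped = genome + genome[:2]
recovered = Counter(wrapped[i:i + 3] for i in range(len(genome)))
print(genome, recovered == Counter(multiplicities))
```

E* may admit several Eulerian cycles, so the printed genome is one candidate consistent with the multiplicities and need not equal ATGCGTGGCGTGCA.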
Genome Reconstruction: A Puzzle With a Billion Pieces
Assumption 3: Handling Repeated k-mers
• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA
• Multiplicity 2: GCG, CGT,GTG, TGC
• We reflect multiplicities as
multiple edges • Candidate genome:
• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.
CAGC
CG
TG
GT
GG
AT
AA
ATG
TGG GGCGCG
CGT
GTG
TGCGCA
CAAAAT
ATGCGTGGCGTGCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Assumption 3: Handling Repeated k-mers
• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA
• Multiplicity 2: GCG, CGT,GTG, TGC
• We reflect multiplicities as
multiple edges • Candidate genome:
• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.
CAGC
CG
TG
GT
GG
AT
AA
ATG
TGG GGCGCG
CGT
GTG
TGCGCA
CAAAAT
ATGCGTGGCGTGCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Assumption 3: Handling Repeated k-mers
• Say that we have the following read multiplicities:• Multiplicity 1: ATG, GGC, AAT, TGG, CAA, CAA, GCA
• Multiplicity 2: GCG, CGT,GTG, TGC
• We reflect multiplicities as
multiple edges • Candidate genome:
• E* is balanced becauseindegree(v) and outdegree(v)still equal the # of times v appears.
CAGC
CG
TG
GT
GG
AT
AA
ATG
TGG GGCGCG
CGT
GTG
TGCGCA
CAAAAT
ATGCGTGGCGTGCA
Genome Reconstruction: A Puzzle With a Billion Pieces
Determining k-mer multiplicities
• How can we find the multiplicity of a k-mer in the genome?
• The multiplicity of a k-mer will be directly related to the frequency with which that k-mer occurs in our reads.
• So a k-mer that appears 5 times in the genome is expected to occur about 5 times as often in our reads as a k-mer that appears once.
[Figure: the graph E* from the previous slide, with repeated 3-mers drawn as multiple edges.]
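A minimal sketch of this idea, under strong assumptions that the slides do not state: reads are error-free, coverage is perfectly uniform, and the number of reads covering a single-copy k-mer (`unit_coverage`, a hypothetical parameter) is known. Real data would need a statistical estimate instead.

```python
from collections import Counter

def kmer_counts_in_reads(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_multiplicities(reads, k, unit_coverage):
    """Estimate genome multiplicity as (count in reads) / (count expected for one copy)."""
    counts = kmer_counts_in_reads(reads, k)
    return {kmer: max(1, round(c / unit_coverage)) for kmer, c in counts.items()}

# Idealized reads: one read of length 5 starting at every position of the circular genome.
genome = "ATGCGTGGCGTGCA"
read_len, k = 5, 3
doubled = genome + genome  # lets reads wrap around the circular genome
reads = [doubled[i:i + read_len] for i in range(len(genome))]

# Each genomic 3-mer occurrence is covered by read_len - k + 1 = 3 of these reads.
multiplicities = estimate_multiplicities(reads, k, unit_coverage=read_len - k + 1)
print(multiplicities["GCG"], multiplicities["ATG"])  # 2 1
```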
Assumption 4: From Circular to Linear Genomes
• The genomes for all complex organisms are split across a number of linear chromosomes (46 in humans).
• So in order to sequence the human genome, geneticists simply sequenced all of these linear chromosomes.
• Question: How do we sequence a linear segment of DNA?
• Say our linear DNA segment is ATGCGTGGCGTGCA.
• Then the 3-mers from this segment are the same as for the circular segment before, but the segment doesn’t “wrap around,” so we will lose two 3-mers:
  • CAA
  • AAT
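The two lost 3-mers can be confirmed by comparing the k-mer composition of the segment read circularly and linearly. This is a small sketch; the helper names are illustrative only.

```python
from collections import Counter

def linear_kmers(text, k):
    """Multiset of k-mers in a linear string."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

def circular_kmers(text, k):
    """Multiset of k-mers in a circular string: append the first k-1 letters, then read linearly."""
    return linear_kmers(text + text[:k - 1], k)

segment = "ATGCGTGGCGTGCA"
lost = circular_kmers(segment, 3) - linear_kmers(segment, 3)
print(sorted(lost))  # ['AAT', 'CAA']
```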
• Let’s depict the loss of CAA and AAT from our 3-mers by deleting these edges from E*.
• Get rid of the vertex AA as well.
• So to sequence our segment ATGCGTGGCGTGCA, we need to find a path through E* that starts at AT, ends at CA, and uses every edge exactly once.
[Figure: graph E* after deleting edges CAA and AAT and vertex AA. Remaining vertices: AT, TG, GG, GC, CG, GT, CA. Remaining edges: ATG, TGG, GGC, GCA once each; GCG, CGT, GTG, TGC twice each.]
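One way to find such a path is a stack-based traversal in the style of Hierholzer's algorithm. The slides do not name a specific procedure, so treat this as one possible sketch; `eulerian_path` and `spell` are illustrative names.

```python
from collections import Counter

def eulerian_path(edges, start):
    """Return a vertex sequence using every edge once.

    edges: list of (u, v) pairs, with multi-edges listed repeatedly.
    Assumes an Eulerian path starting at `start` exists.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if adjacency.get(u):          # unused edge leaving u: follow it
            stack.append(adjacency[u].pop())
        else:                         # dead end: u is final in the path built so far
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Glue overlapping (k-1)-mers into one string."""
    return path[0] + "".join(v[-1] for v in path[1:])

# 3-mers of the linear segment, with multiplicities.
kmer_counts = {"ATG": 1, "TGG": 1, "GGC": 1, "GCA": 1,
               "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2}
edges = [(kmer[:-1], kmer[1:]) for kmer, m in kmer_counts.items() for _ in range(m)]

path = eulerian_path(edges, start="AT")
genome = spell(path)
print(genome)
```

This graph has more than one Eulerian path from AT to CA (for instance, ATGCGTGGCGTGCA and ATGGCGTGCGTGCA both use every edge once), so the printed string has the right 3-mer composition but is not guaranteed to be the original segment.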
• An Eulerian path in a directed graph G is a path through the graph that uses every edge exactly once.
• So an Eulerian path is just like an Eulerian cycle, except that we don’t have to start and end at the same vertex.
• Luckily, Euler’s Theorem generalizes, so we can efficiently determine whether a graph has an Eulerian path and then find this path.
• Euler’s Theorem II: A connected directed graph has an Eulerian path precisely when either all vertices are balanced, or exactly two vertices are not balanced: one with outdegree = indegree + 1 (the start) and one with indegree = outdegree + 1 (the end).
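The degree condition in Euler's Theorem II can be tested in one pass over the edges. The sketch below checks degrees only and assumes the graph is connected; it uses the standard refinement that the two unbalanced vertices must differ by exactly one edge (one extra outgoing edge at the start, one extra incoming edge at the end). The function name `classify` is illustrative.

```python
from collections import Counter

def classify(edges):
    """Return ("cycle", None, None), ("path", start, end), or ("none", None, None).

    edges: list of (u, v) pairs, multi-edges listed repeatedly.
    Connectivity is assumed, not checked.
    """
    balance = Counter()              # outdegree minus indegree
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    unbalanced = {v: b for v, b in balance.items() if b != 0}
    if not unbalanced:
        return ("cycle", None, None)
    if sorted(unbalanced.values()) == [-1, 1]:
        start = next(v for v, b in unbalanced.items() if b == 1)
        end = next(v for v, b in unbalanced.items() if b == -1)
        return ("path", start, end)
    return ("none", None, None)

# Edges of E* for the linear segment ATGCGTGGCGTGCA (3-mers with multiplicities).
kmer_counts = {"ATG": 1, "TGG": 1, "GGC": 1, "GCA": 1,
               "GCG": 2, "CGT": 2, "GTG": 2, "TGC": 2}
edges = [(kmer[:-1], kmer[1:]) for kmer, m in kmer_counts.items() for _ in range(m)]
print(classify(edges))  # ('path', 'AT', 'CA')
```

Adding back the edges CAA and AAT restores balance at every vertex, and the same function then reports a cycle.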
• So E* must contain an Eulerian path, because AT and CA (the endpoints of our segment) are the only two vertices that aren’t balanced: AT has one extra outgoing edge and CA has one extra incoming edge.
• Hence in every case we have solved our giant puzzle!
What’s Next?
Personal Genomics: Millions of Human Genomes
• Personal genome sequencing started from sequencing the genomes of a few scientists in 2009 and will soon expand to millions of individuals.
• Thousands of cancer genomes have already been sequenced, and genome sequencing will soon become a routine technique in medicine.
• At the heart of this revolution are bioinformaticians, who must harness precise methods in order to analyze the growing data.
10 scientists and entrepreneurs who made their genomes publicly available in 2009
Genome 10K and Beyond
• 2010: Scientists launch an ambitious project to sequence the genomes of 10,000 species.
• 201x?: We will hopefully be able to reconstruct the “tree of life” and uncover the genomes of ancestors that lived millions of years ago.
• 20xx?: Maybe, just maybe, we will be able to discover why giraffes grew necks and humans grew brains.