lecture 4
DESCRIPTION
Burrows Wheeler TransformTRANSCRIPT
-
Burrows-Wheeler TransformBurrows-Wheeler Transform
-
DNA sequencing
How we obtain the sequence of nucleotides of a species
ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT
-
DNA Sequencing
Goal: Find the complete sequence of A, C, G, Ts in DNA
Challenge:
There is no machine that takes long DNA as an input, and gives the complete sequence as output
Can only sequence ~150 letters at a time
-
Method to sequence longer regions
-
Two main assembly problems
De Novo Assembly
Resequencing
-
Read Mapping
Want ultra fast, highly similar alignment Detection of genomic variation
......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT...... GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......
-
Human Genome Variation
SNP TGCTGAGA TGCCGAGA Novel Sequence TGCTCGGAGA TGC - - - GAGA
Inversion Mobile Element or Pseudogene Insertion
Translocation Tandem Duplication
Microdeletion TGC - - AGA TGCCGAGA Transposition
Large Deletion Novel Sequence at Breakpoint TGC
-
Read Mapping
Modern fast read aligners: BWT, Bowtie, SOAP " Based on Burrows-Wheeler transform
......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT...... GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......
-
Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)
The BWT of a string S is a
reversible permutation of S that enables the search for a
pattern P in S to take linear
time with respect to the
length of P (O(|P|) time, independent of the length of S)
-
Burrows-Wheeler Transform
BANANA
ANA BANANA ANANA NANA ANA NA A
BANANA ANANA NANA ANA NA A
suffixes of BANANA
X =
-
Burrows-Wheeler Transform
BANANA$
ANA BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $
BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $
X =
-
Burrows-Wheeler Transform
BANANA$
ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
X =
-
Burrows-Wheeler Transform
BANANA$
ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
X =
-
Burrows-Wheeler Transform
BANANA$
ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
X =
-
Burrows-Wheeler Transform
BANANA$
ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN
BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
BWT(BANANA) = ANNB$AA
BWT matrix of string BANANA
X =
-
Suffix Arrays
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
1 $BANANA 2 A$BANAN 3 ANA$BAN 4 ANANA$B 5 BANANA$ 6 NA$BANA 7 NANA$BA
Suffixes are sorted in the BWT matrix S(i) = j, where Xj Xn is the i-th suffix lexicographically
S
B A N A N A $ X 1 2 3 4 5 6 7
7 6 4 2 1 5 3
A N N B $ A A BWT(X)
BWT(X) constructed from S: At each position, take the letter to the left of the one pointed by S
-
Reconstructing BANANA
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
A N N B $ A A
$ A A A B N N
A$ NA NA BA $B AN AN
$B A$ AN AN BA NA NA
A$B NA$ NAN BAN $BA ANA ANA
sort append BWT
sort append BWT
$BA A$B ANA ANA BAN NA$ NAN
sort
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column
$BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B
A N N B $ A A
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column
$BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B
A N N B $ A A
A$BANAN ANA$BAN ANANA$B
Same words, same sorted order
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
Lemma. The i-th occurrence of character a in last column is the same text character as the i-th occurrence of a in the first column LF(): Map the i-th occurrence of character a in last column to the first column LF(r): Let row r contain the i-th occurrence of a in last
column Then, LF(r) = r; r: i-th row starting with a
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
LF(r): Let row r be the i-th occurrence of a in last column Then, LF(r) = r; r: i-th row starting with a
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
LF[] = [2, 6, 7, 5, 1, 3, 4]
Row LF(r) is obtained by rotating row r one position to the right
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
LF[] = [2, 6, 7, 5, 1, 3, 4]
Computing LF() is easy: Let C(a): # of characters smaller than a
Example: C($) = 0; C(A) = 1; C(B) = 4; C(N) = 5
Let row r end with the i-th occurrence of a in last column Then, LF(r) = C(a) + i (why?)
-
Reconstructing BANANA - faster
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
A N N B $ A A
C() 1 5 5 4 0 1 1 C() copied for convenience
index i 1 1 2 1 1 2 3 indicating this is i-th occurrence of c
LF() 2 6 7 5 1 3 4 LF() = C() + i
Reconstruct BANANA: S := ; r := 1; c := BWT[r]; UNTIL c = $ {
S := cS; r := LF(r); c := BWT(r); }
Credit: Ben Langmead thesis
-
Searching for ANA
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
L(W): lowest index in BWT matrix where W is prefix U(W): highest index in BWT matrix where W is prefix Example: L(NA) = 6 U(NA) = 7 Lemma (prove as exercise) L(aW) = C(a) + i +1,
where i = # as up to L(W) 1 in BWT(X) U(aW) = C(a) + j,
where j = # as up to U(W) in BWT(X) Example: L(ANA) = C(A) + # As up to (L(NA) 1) + 1
= 1 + (# As up to 5) + 1 = 1 + 1 + 1 = 3
U(ANA) = 1 + # As up to U(NA) = 1 + 3 = 4
-
Searching for ANA
BWT matrix of string BANANA
$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA
Let LFC(r, a) = C(a) + i, where i = #as up to r in BWT ExactMatch(W[1k]) { a := W[k]; low := C(a) +1; high := C(a+1); // a+1: lexicographically next char i := k 1; while (low = 1) {
a = W[i]; low = LFC(low 1, a) + 1; high = LFC(high, a); i := i 1; }
return (low, high); }
Credit: Ben Langmead thesis
-
Summary of BWT algorithm
Suffix array of string X: S(i) = j, where Xj Xn is the j-th suffix lexicographically BWT follows immediately from suffix array
" Suffix array construction possible in O(n), many good O(n log n) algorithms
Reconstruct X from BWT(X) in time O(n)
Search for all exact occurrences of W in time O(|W|)
BWT(X) is easier to compress than X
-
Inexact matching: allow mismatches and gaps during alignment
=> keep track of a set of SA intervals
Heuristics: seeds, bounds on number of allowed differences, scoring (gap open/extend, mismatches)
Memory considerations: sampling the suffix array and the occurrence array, compression
Typical aligner phases
- stage 1: BWT index construction
- stage 2: short-Read Mapping
- stage 3: alignment results reporting/evaluation
BWT-based Aligners In Practice
-
Short Read
Alignment
with
Populations
Of Genomes(ISMB 2013)
-
Next-Generation SequencingNext-Generation SequencingNext-Generation SequencingNext-Generation Sequencing
Sequencing Protocol (e.g. Illumina)
-
10001000 Genomes Project Genomes Project10001000 Genomes Project Genomes Project
-
Many Other Sequencing EffortsMany Other Sequencing EffortsMany Other Sequencing EffortsMany Other Sequencing Efforts
1000 Plant Genomes Project
Genome 10K (vertebrates)
100K Genome Project
(infectious microorganisms)
i5K Project (arthropods)
1001 Genomes Project
(A. thaliana)
. . .
-
TCGTTCGTTTGAAGGGAAGGAACCCTTTAAGCCCCTTTAAGC
Step 1: Short Read AlignmentStep 1: Short Read AlignmentStep 1: Short Read AlignmentStep 1: Short Read Alignment
New
donor
genome
Short
Reads
(~125bp) AAATTCCGAAATTCCGTTCGATCGTTTCGATCGTCCGAAGGGAAGGGGCCCTTTAAGCTAGACCCTTTAAGCTAGACTTTAGTGCTTTAGTG
Known
reference
genome
-
TCGTTCGTTTGAAGGGAAGGAACCCTTTAAGCCCCTTTAAGC
AAATTCCAAATTCCGTTCGATCGTGTTCGATCGTCCGAAGGGAAGGGGCCCTTTAAGCTAGACCCCTTTAAGCTAGACTTTAGTGTTTAGTG
Single Reference AlignmentSingle Reference AlignmentSingle Reference AlignmentSingle Reference Alignment
Reference-Biased
Results
Large
Database of
Genomic Data
-
Nave Multiple Reference AlignmentNave Multiple Reference AlignmentNave Multiple Reference AlignmentNave Multiple Reference Alignment
Highly
Redundant
Data
Collection
Size
Time/
Space
Instead: create a compressed reference representation
(reference multi-genome) that captures all the variations in the genome collection and can be used for short-read alignment
-
Single Reference AlignersSingle Reference AlignersSingle Reference AlignersSingle Reference Aligners
Li H. and Durbin R. Fast and accurate short read alignment with Burrows-
Wheeler Transform. Bioinformatics, 2009 (BWA)
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.
Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009
Use the Burrows-Wheeler Transform (BWT) to efficiently
map reads to a single reference genome
Idea: adapt the BWT to operate on the multi-genome
representation
-
Reference Multi-GenomeReference Multi-GenomeReference Multi-GenomeReference Multi-Genome
AAAA TTTTTTTT GGGG TTTT CCCC CCCC AAAA
AAAA TTTT GGGG TTTT AAAA CCCC AAAA
AAAA TTTT GGGG TTTT CCCC CCCC AAAA
AAAA GGGG TTTT CCCC CCCC AAAA
TTTT TTTT CCCC GGGG AAAA CCCC CCCC
AAAA TTTT AAAA TTTT TTTT TTTT CCCC GGGG AAAA CCCC CCCC
TTTT TTTT GGGG AAAA CCCC CCCC
TTTT TTTT CCCC GGGG AAAA CCCC CCCCAAAA TTTT
GGGG
CCCC
-
Reference Multi-GenomeReference Multi-GenomeReference Multi-GenomeReference Multi-Genome
collapse
SNPs INDEL bubbles
AAAA TTTTTTTT GGGG TTTT CCCC CCCC AAAA
AAAA TTTT GGGG TTTT AAAA CCCC AAAA
AAAA TTTT GGGG TTTT CCCC CCCC AAAA
AAAA GGGG TTTT CCCC CCCC AAAA
TTTT TTTT CCCC GGGG AAAA CCCC CCCC
AAAA TTTT AAAA TTTT TTTT TTTT CCCC GGGG AAAA CCCC CCCC
TTTT TTTT GGGG AAAA CCCC CCCC
TTTT TTTT CCCC GGGG AAAA CCCC CCCCAAAA TTTT
GGGG
CCCC
AAAA GGGG TTTTCCCC
CCCC
CCCCTTTT
CCCC
GGGG AAAA CCCC CCCCTTTT
AAAA GGGGAAAA TTTT
AAAA TTTT
TTTT
AAAA TTTT AAAA TTTT
-
Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome
AAAA GGGG TTTTCCCC
CCCC
CCCCTTTT
CCCC
GGGG AAAA CCCC CCCCTTTT
AAAA GGGGAAAA TTTT
AAAA TTTT
TTTT
AAAA TTTT AAAA TTTT
-
Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome
AAAA GGGG TTTTCCCC
CCCC
CCCCTTTT
CCCC
GGGG AAAA CCCC CCCCTTTT
AAAA GGGGAAAA TTTT
AAAA TTTT
TTTT
AAAA TTTT AAAA TTTT
1. Encode SNPs 1. Encode SNPs
with IUPAC codeswith IUPAC codes
AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT
AAAA TTTT
TTTT
AAAA TTTT AAAA TTTT
IUPACIUPAC -- AA BB CC DD GG HH KK MM NN RR SS TT VV WW YY
basebase ## AA C|G|TC|G|T CC A|G|TA|G|T GG A|C|TA|C|T G|TG|T A|CA|C A|C|G|TA|C|G|T A|GA|G C|GC|G TT A|C|GA|C|G A|TA|T C|TC|T
-
Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome
AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT
AAAA TTTT
TTTT
AAAA TTTT AAAA TTTT
2. Append INDEL branches 2. Append INDEL branches
padded with surrounding contextpadded with surrounding context
AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT
AAAA TTTT
AAAA TTTT AAAA TTTT
CCCC AAAAMMMM SSSS#### TTTTTTTTTTTTCCCC AAAAMMMM SSSS#### TTTTTTTT
####CCCC AAAAMMMM SSSS#### TTTTTTTT
-
Adapted Backwards SearchAdapted Backwards SearchAdapted Backwards SearchAdapted Backwards Search
A read can now match multiple substrings in the reference:
e.g. read R = AT can match AT, RW, AY, etc
Given the SA interval set of W, SA(W) = {[L
1(W), U
1(W)], [L
2(W), U
2(W)], ...},
we can compute SA(W) as:
UNION of SA(sW) for s S (set of IUPAC characters matching )