lecture 4

Burrows-Wheeler TransformBurrows-Wheeler Transform

DNA sequencing

How we obtain the sequence of nucleotides of a species

ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT

DNA Sequencing

Goal: Find the complete sequence of A, C, G, Ts in DNA

Challenge:

There is no machine that takes long DNA as an input, and gives the complete sequence as output

Can only sequence ~150 letters at a time

Method to sequence longer regions

Two main assembly problems

De Novo Assembly

Resequencing

Read Mapping

Want ultra fast, highly similar alignment Detection of genomic variation

......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT...... GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......

Human Genome Variation

SNP TGCTGAGA TGCCGAGA Novel Sequence TGCTCGGAGA TGC - - - GAGA

Inversion Mobile Element or Pseudogene Insertion

Translocation Tandem Duplication

Microdeletion TGC - - AGA TGCCGAGA Transposition

Large Deletion Novel Sequence at Breakpoint TGC

Read Mapping

Modern fast read aligners: BWT, Bowtie, SOAP " Based on Burrows-Wheeler transform

......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT...... GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......

Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)

The BWT of a string S is a

reversible permutation of S that enables the search for a

pattern P in S to take linear

time with respect to the

length of P (O(|P|) time, independent of the length of S)

Burrows-Wheeler Transform

BANANA

ANA BANANA ANANA NANA ANA NA A

BANANA ANANA NANA ANA NA A

suffixes of BANANA

X =


BANANA$

ANA BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $

BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $

X =


BANANA$

ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA


X =


BANANA$

ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA


$BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

X =


BANANA$

ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN

BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN


BWT(BANANA) = ANNB$AA

BWT matrix of string BANANA

X =

Suffix Arrays


1 $BANANA 2 A$BANAN 3 ANA$BAN 4 ANANA$B 5 BANANA$ 6 NA$BANA 7 NANA$BA

Suffixes are sorted in the BWT matrix S(i) = j, where Xj Xn is the i-th suffix lexicographically

S

B A N A N A $ X 1 2 3 4 5 6 7

7 6 4 2 1 5 3

A N N B $ A A BWT(X)

BWT(X) constructed from S: At each position, take the letter to the left of the one pointed by S

Reconstructing BANANA



A N N B $ A A

$ A A A B N N

A$ NA NA BA $B AN AN

$B A$ AN AN BA NA NA

A$B NA$ NAN BAN $BA ANA ANA

sort append BWT

sort append BWT

$BA A$B ANA ANA BAN NA$ NAN

sort

Reconstructing BANANA - faster



Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column






$BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B

A N N B $ A A





$BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B

A N N B $ A A

A$BANAN ANA$BAN ANANA$B

Same words, same sorted order




Lemma. The i-th occurrence of character a in last column is the same text character as the i-th occurrence of a in the first column LF(): Map the i-th occurrence of character a in last column to the first column LF(r): Let row r contain the i-th occurrence of a in last

column Then, LF(r) = r; r: i-th row starting with a




LF(r): Let row r be the i-th occurrence of a in last column Then, LF(r) = r; r: i-th row starting with a



LF[] = [2, 6, 7, 5, 1, 3, 4]

Row LF(r) is obtained by rotating row r one position to the right






LF[] = [2, 6, 7, 5, 1, 3, 4]

Computing LF() is easy: Let C(a): # of characters smaller than a

Example: C($) = 0; C(A) = 1; C(B) = 4; C(N) = 5

Let row r end with the i-th occurrence of a in last column Then, LF(r) = C(a) + i (why?)




A N N B $ A A

C() 1 5 5 4 0 1 1 C() copied for convenience

index i 1 1 2 1 1 2 3 indicating this is i-th occurrence of c

LF() 2 6 7 5 1 3 4 LF() = C() + i

Reconstruct BANANA: S := ; r := 1; c := BWT[r]; UNTIL c = $ {

S := cS; r := LF(r); c := BWT(r); }

Credit: Ben Langmead thesis

Searching for ANA



L(W): lowest index in BWT matrix where W is prefix U(W): highest index in BWT matrix where W is prefix Example: L(NA) = 6 U(NA) = 7 Lemma (prove as exercise) L(aW) = C(a) + i +1,

where i = # as up to L(W) 1 in BWT(X) U(aW) = C(a) + j,

where j = # as up to U(W) in BWT(X) Example: L(ANA) = C(A) + # As up to (L(NA) 1) + 1

= 1 + (# As up to 5) + 1 = 1 + 1 + 1 = 3

U(ANA) = 1 + # As up to U(NA) = 1 + 3 = 4

Searching for ANA



Let LFC(r, a) = C(a) + i, where i = #as up to r in BWT ExactMatch(W[1k]) { a := W[k]; low := C(a) +1; high := C(a+1); // a+1: lexicographically next char i := k 1; while (low = 1) {

a = W[i]; low = LFC(low 1, a) + 1; high = LFC(high, a); i := i 1; }

return (low, high); }

Credit: Ben Langmead thesis

Summary of BWT algorithm

Suffix array of string X: S(i) = j, where Xj Xn is the j-th suffix lexicographically BWT follows immediately from suffix array

" Suffix array construction possible in O(n), many good O(n log n) algorithms

Reconstruct X from BWT(X) in time O(n)

Search for all exact occurrences of W in time O(|W|)

BWT(X) is easier to compress than X

Inexact matching: allow mismatches and gaps during alignment

=> keep track of a set of SA intervals

Heuristics: seeds, bounds on number of allowed differences, scoring (gap open/extend, mismatches)

Memory considerations: sampling the suffix array and the occurrence array, compression

Typical aligner phases

- stage 1: BWT index construction

- stage 2: short-Read Mapping

- stage 3: alignment results reporting/evaluation

BWT-based Aligners In Practice

Short Read

Alignment

with

Populations

Of Genomes(ISMB 2013)

Next-Generation SequencingNext-Generation SequencingNext-Generation SequencingNext-Generation Sequencing

Sequencing Protocol (e.g. Illumina)

10001000 Genomes Project Genomes Project10001000 Genomes Project Genomes Project

Many Other Sequencing EffortsMany Other Sequencing EffortsMany Other Sequencing EffortsMany Other Sequencing Efforts

1000 Plant Genomes Project

Genome 10K (vertebrates)

100K Genome Project

(infectious microorganisms)

i5K Project (arthropods)

1001 Genomes Project

(A. thaliana)

. . .

TCGTTCGTTTGAAGGGAAGGAACCCTTTAAGCCCCTTTAAGC

Step 1: Short Read AlignmentStep 1: Short Read AlignmentStep 1: Short Read AlignmentStep 1: Short Read Alignment

New

donor

genome

Short

Reads

(~125bp) AAATTCCGAAATTCCGTTCGATCGTTTCGATCGTCCGAAGGGAAGGGGCCCTTTAAGCTAGACCCTTTAAGCTAGACTTTAGTGCTTTAGTG

Known

reference

genome

TCGTTCGTTTGAAGGGAAGGAACCCTTTAAGCCCCTTTAAGC

AAATTCCAAATTCCGTTCGATCGTGTTCGATCGTCCGAAGGGAAGGGGCCCTTTAAGCTAGACCCCTTTAAGCTAGACTTTAGTGTTTAGTG

Single Reference AlignmentSingle Reference AlignmentSingle Reference AlignmentSingle Reference Alignment

Reference-Biased

Results

Large

Database of

Genomic Data

Nave Multiple Reference AlignmentNave Multiple Reference AlignmentNave Multiple Reference AlignmentNave Multiple Reference Alignment

Highly

Redundant

Data

Collection

Size

Time/

Space

Instead: create a compressed reference representation

(reference multi-genome) that captures all the variations in the genome collection and can be used for short-read alignment

Single Reference AlignersSingle Reference AlignersSingle Reference AlignersSingle Reference Aligners

Li H. and Durbin R. Fast and accurate short read alignment with Burrows-

Wheeler Transform. Bioinformatics, 2009 (BWA)

Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009

Use the Burrows-Wheeler Transform (BWT) to efficiently

map reads to a single reference genome

Idea: adapt the BWT to operate on the multi-genome

representation

Reference Multi-GenomeReference Multi-GenomeReference Multi-GenomeReference Multi-Genome

AAAA TTTTTTTT GGGG TTTT CCCC CCCC AAAA

AAAA TTTT GGGG TTTT AAAA CCCC AAAA

AAAA TTTT GGGG TTTT CCCC CCCC AAAA

AAAA GGGG TTTT CCCC CCCC AAAA

TTTT TTTT CCCC GGGG AAAA CCCC CCCC

AAAA TTTT AAAA TTTT TTTT TTTT CCCC GGGG AAAA CCCC CCCC

TTTT TTTT GGGG AAAA CCCC CCCC

TTTT TTTT CCCC GGGG AAAA CCCC CCCCAAAA TTTT

GGGG

CCCC

Reference Multi-GenomeReference Multi-GenomeReference Multi-GenomeReference Multi-Genome

collapse

SNPs INDEL bubbles

AAAA TTTTTTTT GGGG TTTT CCCC CCCC AAAA

AAAA TTTT GGGG TTTT AAAA CCCC AAAA

AAAA TTTT GGGG TTTT CCCC CCCC AAAA

AAAA GGGG TTTT CCCC CCCC AAAA

TTTT TTTT CCCC GGGG AAAA CCCC CCCC

AAAA TTTT AAAA TTTT TTTT TTTT CCCC GGGG AAAA CCCC CCCC

TTTT TTTT GGGG AAAA CCCC CCCC

TTTT TTTT CCCC GGGG AAAA CCCC CCCCAAAA TTTT

GGGG

CCCC

AAAA GGGG TTTTCCCC

CCCC

CCCCTTTT

CCCC

GGGG AAAA CCCC CCCCTTTT

AAAA GGGGAAAA TTTT

AAAA TTTT

TTTT

AAAA TTTT AAAA TTTT

Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome

AAAA GGGG TTTTCCCC

CCCC

CCCCTTTT

CCCC


AAAA GGGGAAAA TTTT

AAAA TTTT

TTTT

AAAA TTTT AAAA TTTT


AAAA GGGG TTTTCCCC

CCCC

CCCCTTTT

CCCC


AAAA GGGGAAAA TTTT

AAAA TTTT

TTTT

AAAA TTTT AAAA TTTT

1. Encode SNPs 1. Encode SNPs

with IUPAC codeswith IUPAC codes

AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT

AAAA TTTT

TTTT

AAAA TTTT AAAA TTTT

IUPACIUPAC -- AA BB CC DD GG HH KK MM NN RR SS TT VV WW YY

basebase ## AA C|G|TC|G|T CC A|G|TA|G|T GG A|C|TA|C|T G|TG|T A|CA|C A|C|G|TA|C|G|T A|GA|G C|GC|G TT A|C|GA|C|G A|TA|T C|TC|T



AAAA TTTT

TTTT

AAAA TTTT AAAA TTTT

2. Append INDEL branches 2. Append INDEL branches

padded with surrounding contextpadded with surrounding context


AAAA TTTT

AAAA TTTT AAAA TTTT

CCCC AAAAMMMM SSSS#### TTTTTTTTTTTTCCCC AAAAMMMM SSSS#### TTTTTTTT

####CCCC AAAAMMMM SSSS#### TTTTTTTT

Adapted Backwards SearchAdapted Backwards SearchAdapted Backwards SearchAdapted Backwards Search

A read can now match multiple substrings in the reference:

e.g. read R = AT can match AT, RW, AY, etc

Given the SA interval set of W, SA(W) = {[L

1(W), U

1(W)], [L

2(W), U

2(W)], ...},

we can compute SA(W) as:

UNION of SA(sW) for s S (set of IUPAC characters matching )

lecture 4

Documents