lecture 4

42
Burrows-Wheeler Transform Burrows-Wheeler Transform

Upload: ravikumarsid2990

Post on 28-Sep-2015

215 views

Category:

Documents


3 download

DESCRIPTION

Burrows Wheeler Transform

TRANSCRIPT

  • Burrows-Wheeler TransformBurrows-Wheeler Transform

  • DNA sequencing

    How we obtain the sequence of nucleotides of a species

    ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT

  • DNA Sequencing

    Goal: Find the complete sequence of A, C, G, Ts in DNA

    Challenge:

    There is no machine that takes long DNA as an input, and gives the complete sequence as output

    Can only sequence ~150 letters at a time

  • Method to sequence longer regions

  • Two main assembly problems

    De Novo Assembly

    Resequencing

  • Read Mapping

    Want ultra fast, highly similar alignment Detection of genomic variation

    ......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT...... GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......

  • Human Genome Variation

    SNP TGCTGAGA TGCCGAGA Novel Sequence TGCTCGGAGA TGC - - - GAGA

    Inversion Mobile Element or Pseudogene Insertion

    Translocation Tandem Duplication

    Microdeletion TGC - - AGA TGCCGAGA Transposition

    Large Deletion Novel Sequence at Breakpoint TGC

  • Read Mapping

    Modern fast read aligners: BWT, Bowtie, SOAP " Based on Burrows-Wheeler transform

    ......AGGTGCATGCCGCATCGATCGAGCGCGATGCTAGCTAGCTGATCGT...... GTGCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATC GCATGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT TGCCGCATCGACCGAGCGCGATGCTAGCTAGGTGATCGT... CATCGACCGAGCGCGATGCTAGCTAGGTGATCGT......

  • Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)Burrows-Wheeler Transform (BWT)

    The BWT of a string S is a

    reversible permutation of S that enables the search for a

    pattern P in S to take linear

    time with respect to the

    length of P (O(|P|) time, independent of the length of S)

  • Burrows-Wheeler Transform

    BANANA

    ANA BANANA ANANA NANA ANA NA A

    BANANA ANANA NANA ANA NA A

    suffixes of BANANA

    X =

  • Burrows-Wheeler Transform

    BANANA$

    ANA BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $

    BANANA$ ANANA$ NANA$ ANA$ NA$ A$ $

    X =

  • Burrows-Wheeler Transform

    BANANA$

    ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    X =

  • Burrows-Wheeler Transform

    BANANA$

    ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    X =

  • Burrows-Wheeler Transform

    BANANA$

    ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN $BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    X =

  • Burrows-Wheeler Transform

    BANANA$

    ANA BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN

    BANANA$ ANANA$B NANA$BA ANA$BAN NA$BANA A$BANAN

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    BWT(BANANA) = ANNB$AA

    BWT matrix of string BANANA

    X =

  • Suffix Arrays

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    1 $BANANA 2 A$BANAN 3 ANA$BAN 4 ANANA$B 5 BANANA$ 6 NA$BANA 7 NANA$BA

    Suffixes are sorted in the BWT matrix S(i) = j, where Xj Xn is the i-th suffix lexicographically

    S

    B A N A N A $ X 1 2 3 4 5 6 7

    7 6 4 2 1 5 3

    A N N B $ A A BWT(X)

    BWT(X) constructed from S: At each position, take the letter to the left of the one pointed by S

  • Reconstructing BANANA

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    A N N B $ A A

    $ A A A B N N

    A$ NA NA BA $B AN AN

    $B A$ AN AN BA NA NA

    A$B NA$ NAN BAN $BA ANA ANA

    sort append BWT

    sort append BWT

    $BA A$B ANA ANA BAN NA$ NAN

    sort

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column

    $BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B

    A N N B $ A A

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    Lemma. The i-th occurrence of character c in last column is the same text character as the i-th occurrence of c in the first column

    $BANAN A$BANA ANA$BA ANANA$ BANANA NA$BAN NANA$B

    A N N B $ A A

    A$BANAN ANA$BAN ANANA$B

    Same words, same sorted order

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    Lemma. The i-th occurrence of character a in last column is the same text character as the i-th occurrence of a in the first column LF(): Map the i-th occurrence of character a in last column to the first column LF(r): Let row r contain the i-th occurrence of a in last

    column Then, LF(r) = r; r: i-th row starting with a

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    LF(r): Let row r be the i-th occurrence of a in last column Then, LF(r) = r; r: i-th row starting with a

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    LF[] = [2, 6, 7, 5, 1, 3, 4]

    Row LF(r) is obtained by rotating row r one position to the right

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    LF[] = [2, 6, 7, 5, 1, 3, 4]

    Computing LF() is easy: Let C(a): # of characters smaller than a

    Example: C($) = 0; C(A) = 1; C(B) = 4; C(N) = 5

    Let row r end with the i-th occurrence of a in last column Then, LF(r) = C(a) + i (why?)

  • Reconstructing BANANA - faster

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    A N N B $ A A

    C() 1 5 5 4 0 1 1 C() copied for convenience

    index i 1 1 2 1 1 2 3 indicating this is i-th occurrence of c

    LF() 2 6 7 5 1 3 4 LF() = C() + i

    Reconstruct BANANA: S := ; r := 1; c := BWT[r]; UNTIL c = $ {

    S := cS; r := LF(r); c := BWT(r); }

    Credit: Ben Langmead thesis

  • Searching for ANA

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    L(W): lowest index in BWT matrix where W is prefix U(W): highest index in BWT matrix where W is prefix Example: L(NA) = 6 U(NA) = 7 Lemma (prove as exercise) L(aW) = C(a) + i +1,

    where i = # as up to L(W) 1 in BWT(X) U(aW) = C(a) + j,

    where j = # as up to U(W) in BWT(X) Example: L(ANA) = C(A) + # As up to (L(NA) 1) + 1

    = 1 + (# As up to 5) + 1 = 1 + 1 + 1 = 3

    U(ANA) = 1 + # As up to U(NA) = 1 + 3 = 4

  • Searching for ANA

    BWT matrix of string BANANA

    $BANANA A$BANAN ANA$BAN ANANA$B BANANA$ NA$BANA NANA$BA

    Let LFC(r, a) = C(a) + i, where i = #as up to r in BWT ExactMatch(W[1k]) { a := W[k]; low := C(a) +1; high := C(a+1); // a+1: lexicographically next char i := k 1; while (low = 1) {

    a = W[i]; low = LFC(low 1, a) + 1; high = LFC(high, a); i := i 1; }

    return (low, high); }

    Credit: Ben Langmead thesis

  • Summary of BWT algorithm

    Suffix array of string X: S(i) = j, where Xj Xn is the j-th suffix lexicographically BWT follows immediately from suffix array

    " Suffix array construction possible in O(n), many good O(n log n) algorithms

    Reconstruct X from BWT(X) in time O(n)

    Search for all exact occurrences of W in time O(|W|)

    BWT(X) is easier to compress than X

  • Inexact matching: allow mismatches and gaps during alignment

    => keep track of a set of SA intervals

    Heuristics: seeds, bounds on number of allowed differences, scoring (gap open/extend, mismatches)

    Memory considerations: sampling the suffix array and the occurrence array, compression

    Typical aligner phases

    - stage 1: BWT index construction

    - stage 2: short-Read Mapping

    - stage 3: alignment results reporting/evaluation

    BWT-based Aligners In Practice

  • Short Read

    Alignment

    with

    Populations

    Of Genomes(ISMB 2013)

  • Next-Generation SequencingNext-Generation SequencingNext-Generation SequencingNext-Generation Sequencing

    Sequencing Protocol (e.g. Illumina)

  • 10001000 Genomes Project Genomes Project10001000 Genomes Project Genomes Project

  • Many Other Sequencing EffortsMany Other Sequencing EffortsMany Other Sequencing EffortsMany Other Sequencing Efforts

    1000 Plant Genomes Project

    Genome 10K (vertebrates)

    100K Genome Project

    (infectious microorganisms)

    i5K Project (arthropods)

    1001 Genomes Project

    (A. thaliana)

    . . .

  • TCGTTCGTTTGAAGGGAAGGAACCCTTTAAGCCCCTTTAAGC

    Step 1: Short Read AlignmentStep 1: Short Read AlignmentStep 1: Short Read AlignmentStep 1: Short Read Alignment

    New

    donor

    genome

    Short

    Reads

    (~125bp) AAATTCCGAAATTCCGTTCGATCGTTTCGATCGTCCGAAGGGAAGGGGCCCTTTAAGCTAGACCCTTTAAGCTAGACTTTAGTGCTTTAGTG

    Known

    reference

    genome

  • TCGTTCGTTTGAAGGGAAGGAACCCTTTAAGCCCCTTTAAGC

    AAATTCCAAATTCCGTTCGATCGTGTTCGATCGTCCGAAGGGAAGGGGCCCTTTAAGCTAGACCCCTTTAAGCTAGACTTTAGTGTTTAGTG

    Single Reference AlignmentSingle Reference AlignmentSingle Reference AlignmentSingle Reference Alignment

    Reference-Biased

    Results

    Large

    Database of

    Genomic Data

  • Nave Multiple Reference AlignmentNave Multiple Reference AlignmentNave Multiple Reference AlignmentNave Multiple Reference Alignment

    Highly

    Redundant

    Data

    Collection

    Size

    Time/

    Space

    Instead: create a compressed reference representation

    (reference multi-genome) that captures all the variations in the genome collection and can be used for short-read alignment

  • Single Reference AlignersSingle Reference AlignersSingle Reference AlignersSingle Reference Aligners

    Li H. and Durbin R. Fast and accurate short read alignment with Burrows-

    Wheeler Transform. Bioinformatics, 2009 (BWA)

    Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

    Li et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009

    Use the Burrows-Wheeler Transform (BWT) to efficiently

    map reads to a single reference genome

    Idea: adapt the BWT to operate on the multi-genome

    representation

  • Reference Multi-GenomeReference Multi-GenomeReference Multi-GenomeReference Multi-Genome

    AAAA TTTTTTTT GGGG TTTT CCCC CCCC AAAA

    AAAA TTTT GGGG TTTT AAAA CCCC AAAA

    AAAA TTTT GGGG TTTT CCCC CCCC AAAA

    AAAA GGGG TTTT CCCC CCCC AAAA

    TTTT TTTT CCCC GGGG AAAA CCCC CCCC

    AAAA TTTT AAAA TTTT TTTT TTTT CCCC GGGG AAAA CCCC CCCC

    TTTT TTTT GGGG AAAA CCCC CCCC

    TTTT TTTT CCCC GGGG AAAA CCCC CCCCAAAA TTTT

    GGGG

    CCCC

  • Reference Multi-GenomeReference Multi-GenomeReference Multi-GenomeReference Multi-Genome

    collapse

    SNPs INDEL bubbles

    AAAA TTTTTTTT GGGG TTTT CCCC CCCC AAAA

    AAAA TTTT GGGG TTTT AAAA CCCC AAAA

    AAAA TTTT GGGG TTTT CCCC CCCC AAAA

    AAAA GGGG TTTT CCCC CCCC AAAA

    TTTT TTTT CCCC GGGG AAAA CCCC CCCC

    AAAA TTTT AAAA TTTT TTTT TTTT CCCC GGGG AAAA CCCC CCCC

    TTTT TTTT GGGG AAAA CCCC CCCC

    TTTT TTTT CCCC GGGG AAAA CCCC CCCCAAAA TTTT

    GGGG

    CCCC

    AAAA GGGG TTTTCCCC

    CCCC

    CCCCTTTT

    CCCC

    GGGG AAAA CCCC CCCCTTTT

    AAAA GGGGAAAA TTTT

    AAAA TTTT

    TTTT

    AAAA TTTT AAAA TTTT

  • Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome

    AAAA GGGG TTTTCCCC

    CCCC

    CCCCTTTT

    CCCC

    GGGG AAAA CCCC CCCCTTTT

    AAAA GGGGAAAA TTTT

    AAAA TTTT

    TTTT

    AAAA TTTT AAAA TTTT

  • Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome

    AAAA GGGG TTTTCCCC

    CCCC

    CCCCTTTT

    CCCC

    GGGG AAAA CCCC CCCCTTTT

    AAAA GGGGAAAA TTTT

    AAAA TTTT

    TTTT

    AAAA TTTT AAAA TTTT

    1. Encode SNPs 1. Encode SNPs

    with IUPAC codeswith IUPAC codes

    AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT

    AAAA TTTT

    TTTT

    AAAA TTTT AAAA TTTT

    IUPACIUPAC -- AA BB CC DD GG HH KK MM NN RR SS TT VV WW YY

    basebase ## AA C|G|TC|G|T CC A|G|TA|G|T GG A|C|TA|C|T G|TG|T A|CA|C A|C|G|TA|C|G|T A|GA|G C|GC|G TT A|C|GA|C|G A|TA|T C|TC|T

  • Linear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-GenomeLinear Reference Multi-Genome

    AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT

    AAAA TTTT

    TTTT

    AAAA TTTT AAAA TTTT

    2. Append INDEL branches 2. Append INDEL branches

    padded with surrounding contextpadded with surrounding context

    AAAA GGGG TTTT CCCCYYYY TTTT GGGG AAAA CCCC CCCCMMMM SSSSAAAA TTTT

    AAAA TTTT

    AAAA TTTT AAAA TTTT

    CCCC AAAAMMMM SSSS#### TTTTTTTTTTTTCCCC AAAAMMMM SSSS#### TTTTTTTT

    ####CCCC AAAAMMMM SSSS#### TTTTTTTT

  • Adapted Backwards SearchAdapted Backwards SearchAdapted Backwards SearchAdapted Backwards Search

    A read can now match multiple substrings in the reference:

    e.g. read R = AT can match AT, RW, AY, etc

    Given the SA interval set of W, SA(W) = {[L

    1(W), U

    1(W)], [L

    2(W), U

    2(W)], ...},

    we can compute SA(W) as:

    UNION of SA(sW) for s S (set of IUPAC characters matching )