
Suffix arrays, BWT and FM-index

Alan Medlar Wednesday 16th March 2016

Outline

• Lecture: Technical background for read mapping tools used in this course

• Suffix array

• Burrows-Wheeler transform (BWT)

• FM-index

• Lab session: Using BWA to map paired-end data against the human genome, SAM/BAM files, etc

Read mapping

• Sequencers can generate up to 100 million reads per sample

• Human genome is ~3 billion basepairs

• Need to map reads to the genome to discover variants (SNVs, indels), counts (gene expression)

Preliminaries

• String

• sequence of characters,

• e.g. "banana", "ATGC", "MDLISTFS"

• Alphabet { A, C, G, T, $ }, { A-Z, a-z, $ }

• Lexicographical order

• $ < A < C < G < T

Preliminaries

• Prefix

• non-empty substring that is the beginning of another string (left-to-right)


• Suffix

• non-empty substring that is the ending of another string (right-to-left)
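The two definitions can be made concrete with a minimal sketch. The function names `prefixes` and `suffixes` are chosen here for illustration and do not come from the slides.

```python
def prefixes(s):
    """All non-empty prefixes of s, shortest first."""
    return [s[:i] for i in range(1, len(s) + 1)]


def suffixes(s):
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


print(prefixes("banana"))  # starts with 'b', 'ba', 'ban'
print(suffixes("banana"))  # starts with 'banana', 'anana', 'nana'
```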


Naïve exact search

• Text = "banana"

• Query = "nana"

• Linear search

Naïve exact search

B A N A N A
N A N A
  N A N A
    N A N A

• Text = "banana"

• Query = "nana"

• Linear search

Naïve search is too slow

• Human genome ~3 billion basepairs

• Read 100 basepairs

• Complexity of search scales linearly with the length of the text!
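The linear search can be sketched as follows. This is an illustrative implementation rather than code from the lecture; the name `naive_search` is an assumption. Every offset of the text is tried in turn, which is why the cost grows with the length of the text.

```python
def naive_search(text, query):
    """Return every offset at which query occurs in text (linear scan)."""
    hits = []
    # Slide the query along the text one position at a time.
    for offset in range(len(text) - len(query) + 1):
        if text[offset:offset + len(query)] == query:
            hits.append(offset)
    return hits


print(naive_search("banana", "nana"))  # [2]
```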

Suffix array

• Introduced by Manber and Myers (1990) as a space efficient alternative to suffix tree (independently by Gonnet (1987))

• Sorted array of all suffixes of a given text

• Allows fast search of very large texts (e.g. genomes)

SA: building

• Append $ to the text: B A N A N A $

• $ is lexicographically lower than all other characters in the alphabet and cannot appear in the text otherwise

• List all suffixes with their offsets into the text:

B A N A N A $   0
A N A N A $     1
N A N A $       2
A N A $         3
N A $           4
A $             5
$               6

• Sort the suffixes lexicographically:

$               6
A $             5
A N A $         3
A N A N A $     1
B A N A N A $   0
N A $           4
N A N A $       2
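The construction above can be sketched in a few lines. This is a naive version for illustration only (the name `build_suffix_array` is an assumption); it sorts by comparing whole suffixes, which is far too slow for a genome, where dedicated construction algorithms are used instead.

```python
def build_suffix_array(text):
    """Sort suffix start offsets by the suffix each one points to."""
    # The terminator must be present exactly once, at the end.
    assert text.endswith("$") and text.count("$") == 1
    # In ASCII, '$' sorts before all letters, matching the slides.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(build_suffix_array("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
```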

SA: querying

• Search for prefixes in the suffix array that match our query string

• SA is sorted, so we can use binary search!


Query: N A N A

$               6
A $             5
A N A $         3
A N A N A $     1
B A N A N A $   0
N A $           4
N A N A $       2   <- N A N A
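A minimal sketch of the query step, assuming the text and its suffix array are both held in memory. The name `sa_search` is hypothetical. Two binary searches find the block of suffixes whose first m characters equal the query.

```python
def sa_search(text, sa, query):
    """Return sorted text offsets where query occurs, via binary search on sa."""
    m = len(query)

    # Lower bound: first suffix whose m-character prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana$"
sa = [6, 5, 3, 1, 0, 4, 2]  # suffix array of "banana$" from the slides
print(sa_search(text, sa, "nana"))  # [2]
```

Each comparison inspects at most m characters and there are O(log(n)) comparisons, which is where the O(m log(n)) bound comes from.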

SA vs. naïve search

• Searching the human genome (~3 billion basepairs, n) for a single-end read (100 basepairs, m)

• Naïve search O(mn)

• Suffix array search O(m log(n))

n      O(n)   O(log(n))
8      8      3
16     16     4
32     32     5
64     64     6
128    128    7
256    256    8
512    512    9
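Plugging the lecture's numbers into the two bounds gives a feel for the gap. This is back-of-the-envelope arithmetic on the asymptotic expressions (constants ignored), not a benchmark.

```python
import math

n = 3_000_000_000  # genome length in basepairs
m = 100            # read length in basepairs

naive_steps = m * n                      # O(mn)
binary_steps = math.ceil(math.log2(n))   # binary search depth
sa_steps = m * binary_steps              # O(m log(n))

print(f"naive:        {naive_steps:,} character comparisons")
print(f"suffix array: {sa_steps:,} character comparisons "
      f"({binary_steps} binary search steps)")
```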

Good enough for read mapping?

• Human genome is ~3 billion basepairs

• Assume 5 bytes per basepair (1 byte characters, 4 byte integers) = ~14 GB

• NGS data really hit in 2009 (16 GB RAM at the time was a luxury!)
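The memory estimate follows from simple arithmetic, sketched below under the slide's assumptions (1 byte per character of text, one 4-byte integer per suffix array entry).

```python
n = 3_000_000_000          # basepairs in the human genome

text_bytes = n * 1         # 1-byte characters
sa_bytes = n * 4           # 4-byte integer offsets
total_gib = (text_bytes + sa_bytes) / 2**30

# A 4-byte unsigned integer is just large enough to address every offset.
assert n < 2**32

print(f"{total_gib:.1f} GiB")  # about 14
```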

Burrows-Wheeler transform

• Invented by Burrows and Wheeler (1994) while working at DEC

• Used in compression (.bz2 files)

• Interested in three things:

• how to perform BWT

• why BWT is useful for compression

• how to reverse BWT

BWT

• Start with the text: B A N A N A $

• Write out every rotation of the text:

B A N A N A $
A N A N A $ B
N A N A $ B A
A N A $ B A N
N A $ B A N A
A $ B A N A N
$ B A N A N A

• Sort the rotations lexicographically:

$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A

• BWT(T) is the last column of the sorted matrix: A N N B $ A A


BWT compression

• T = "banana$"

• BWT(T) = "annb$aa"
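The transform can be sketched directly from the rotation matrix. This naive version (the name `bwt_via_rotations` is an assumption) materialises every rotation, so it is only suitable for short example strings.

```python
def bwt_via_rotations(text):
    """Burrows-Wheeler transform: last column of the sorted rotation matrix."""
    assert text.endswith("$") and text.count("$") == 1
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


print(bwt_via_rotations("banana$"))  # annb$aa
```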

BWT compression

• T = "peter_piper_picked_a_peck_of_pickled_peppers_a_peck_of_pickled_peppers_peter_piper_picked_if_peter_piper_picked_a_peck_of_pickled_peppers_wheres_the_peck_of_pickled_peppers_peter_piper_picked"

• BWT(T) = "ddsddkkkkaeaaddddsfsrrrrffffrrrrss___eeeeiiiiiiiieeeeeeeehppppkkkkllllpppppppptttthpppprppppiooootwpppppppp_ppppcccccccccccckkkk____________iiiipppp_______________eeeeeeeeeeeeeeeeerrrereeee__"
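The long runs of identical characters are what a compressor can exploit. As a hedged illustration, the sketch below applies plain run-length encoding; this is an assumption chosen for simplicity and is not the scheme used by .bz2 files. The helper name `run_length_encode` is hypothetical.

```python
from itertools import groupby


def run_length_encode(s):
    """Collapse each run of identical characters into (character, run length)."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


text = "banana$"
bwt = "annb$aa"  # BWT of "banana$" from the slides

print(len(run_length_encode(text)), "runs in the text")  # 7
print(len(run_length_encode(bwt)), "runs in the BWT")    # 5
print(run_length_encode(bwt))
```

Fewer runs means fewer (character, length) pairs to store; the same function can be applied to the longer example above to count its runs.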

Relation to suffix array


• BWT matrix truncated at "$" in each row is the suffix array of the same text

• BWT can be computed directly from the suffix array
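A minimal sketch of that relationship: the BWT character for each row is the character that precedes the corresponding suffix in the text, wrapping around to $ for the suffix at offset 0. The function name is an assumption.

```python
def bwt_from_suffix_array(text, sa):
    """BWT[i] is the character just before suffix sa[i] in the text."""
    # For sa[i] == 0, text[-1] wraps around to the final '$'.
    return "".join(text[i - 1] for i in sa)


sa = [6, 5, 3, 1, 0, 4, 2]  # suffix array of "banana$"
print(bwt_from_suffix_array("banana$", sa))  # annb$aa
```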

Reverse BWT

• It is not very useful to compress something if we cannot get the original text back!

• BWT'(BWT(T)) = T

LF mapping (T-rank)

• T-rank: number each character by its order of occurrence in T

B0 A0 N0 A1 N1 A2 $

F                 L
$  B0 A0 N0 A1 N1 A2
A2 $  B0 A0 N0 A1 N1
A1 N1 A2 $  B0 A0 N0
A0 N0 A1 N1 A2 $  B0
B0 A0 N0 A1 N1 A2 $
N1 A2 $  B0 A0 N0 A1
N0 A1 N1 A2 $  B0 A0

• Ns in the L column are sorted by their "right context", same as Ns in F column!


LF mapping (B-rank)

• B-rank: number each character by its order of occurrence in the BWT (L column)

B0 A2 N1 A1 N0 A0 $

F                 L
$  B0 A2 N1 A1 N0 A0
A0 $  B0 A2 N1 A1 N0
A1 N0 A0 $  B0 A2 N1
A2 N1 A1 N0 A0 $  B0
B0 A2 N1 A1 N0 A0 $
N0 A0 $  B0 A2 N1 A1
N1 A1 N0 A0 $  B0 A2

• F column contains very little information, just counts of each character

LF mapping (B-rank)

• Which row contains N1 in the F column?

• Character counts: { $:1, A:3, B:1, N:2 }

• Skip $ (+1) • Skip As (+3) • Skip Bs (+1) • Skip first N (+1) = 6

row  F   L
0    $   A0
1    A0  N0
2    A1  N1
3    A2  B0
4    B0  $
5    N0  A1
6    N1  A2
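The skip-counting can be sketched as a small function. The names `first_column_starts` and `lf` are assumptions for illustration, and the rank is computed here by a plain scan of the BWT, which is simple but slow.

```python
from collections import Counter


def first_column_starts(bwt):
    """Row at which each character's block begins in F, from counts alone."""
    counts = Counter(bwt)
    starts, row = {}, 0
    for ch in sorted(counts):  # '$' sorts before letters in ASCII
        starts[ch] = row
        row += counts[ch]
    return starts


def lf(bwt, starts, row):
    """Row in F holding the same character occurrence as L[row]."""
    ch = bwt[row]
    b_rank = bwt[:row].count(ch)  # occurrences of ch above this row in L
    return starts[ch] + b_rank


bwt = "annb$aa"
starts = first_column_starts(bwt)
print(starts)              # {'$': 0, 'a': 1, 'b': 4, 'n': 5}
print(lf(bwt, starts, 2))  # N1 sits in L at row 2 and in F at row 6
```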


Reverse BWT

• Use B-ranking to reverse BWT, recreating the text T from right-to-left

• Character counts: { $:1, A:3, B:1, N:2 }

row  F   L
0    $   A0
1    A0  N0
2    A1  N1
3    A2  B0
4    B0  $
5    N0  A1
6    N1  A2

• Start at the row with $ in F, read the character in L, jump to that character's row in F, and repeat:

                  $
               A0 $
            N0 A0 $
         A1 N0 A0 $
      N1 A1 N0 A0 $
   A2 N1 A1 N0 A0 $
B0 A2 N1 A1 N0 A0 $
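The right-to-left reconstruction can be sketched as below. This is an illustrative implementation under the slides' conventions (a single $ terminator that sorts first); the name `inverse_bwt` is an assumption.

```python
from collections import Counter


def inverse_bwt(bwt):
    """Rebuild the original text from its BWT by following the LF mapping."""
    # Start row of each character's block in the F column.
    counts = Counter(bwt)
    starts, row = {}, 0
    for ch in sorted(counts):
        starts[ch] = row
        row += counts[ch]

    # B-rank of every character in L: how many equal characters precede it.
    seen = Counter()
    b_rank = []
    for ch in bwt:
        b_rank.append(seen[ch])
        seen[ch] += 1

    # Row 0 starts with '$'; its L character precedes '$' in the text.
    row = 0
    out = ["$"]
    while bwt[row] != "$":
        ch = bwt[row]
        out.append(ch)
        row = starts[ch] + b_rank[row]  # LF step
    return "".join(reversed(out))


print(inverse_bwt("annb$aa"))  # banana$
```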


FM-index

• All BWT allows us to do is compress text

• Ferragina and Manzini (2000) "Full-text index in Minute space"

• Combine BWT with other auxiliary data structures to get an index

• Space savings: e.g. Human genome (3 billion bp)

• SA = ~14 GB (5 bytes/bp)

• FM = ~1.5 GB (2 bits/bp)

Cannot search BWT like SA


• Rotation matrix contains the suffix array

• But we only store F and L columns, so binary search of prefixes not possible

BWT search

• In SA, we matched successively longer prefixes (left-to-right) of query string (binary search)

• In BWT, we will match successively longer suffixes (right-to-left) of query string (reverse BWT transform)
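A minimal sketch of the right-to-left matching, often called backward search. The function names are assumptions, and ranks are computed by scanning the BWT string, which a real index would replace with a faster auxiliary structure. The function narrows a range of rows in the sorted rotation matrix, one query character at a time, starting from the last character.

```python
from collections import Counter


def first_column_starts(bwt):
    """Row at which each character's block begins in the F column."""
    counts = Counter(bwt)
    starts, row = {}, 0
    for ch in sorted(counts):
        starts[ch] = row
        row += counts[ch]
    return starts


def backward_search(bwt, query):
    """Half-open row range [top, bottom) of rotations that start with query."""
    starts = first_column_starts(bwt)
    top, bottom = 0, len(bwt)
    for ch in reversed(query):  # match successively longer suffixes
        if ch not in starts:
            return (0, 0)
        top = starts[ch] + bwt[:top].count(ch)
        bottom = starts[ch] + bwt[:bottom].count(ch)
        if top >= bottom:
            return (0, 0)       # no row matches
    return (top, bottom)


bwt = "annb$aa"  # BWT of "banana$"
print(backward_search(bwt, "nana"))  # (6, 7): one match, in row 6
print(backward_search(bwt, "ana"))   # (2, 4): two matches
```

The size of the returned range is the number of occurrences; it says which rows match but not where they lie in T.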

BWT search

Query: N A N A

row  F                 L    SA
0    $  B0 A2 N1 A1 N0 A0   6
1    A0 $  B0 A2 N1 A1 N0   5
2    A1 N0 A0 $  B0 A2 N1   3
3    A2 N1 A1 N0 A0 $  B0   1
4    B0 A2 N1 A1 N0 A0 $    0
5    N0 A0 $  B0 A2 N1 A1   4
6    N1 A1 N0 A0 $  B0 A2   2   <- N A N A

• We know BWT contains the query, but unlike SA, we do not know the location of the match in T!

• Idea: just store SA as well? (6 5 3 1 0 4 2)

• Idea 2: store part of SA? Keep only the offsets 6, 3 and 0 (rows 0, 2 and 4) ...

• ... and walk backwards through the BWT! From the matching row (row 6), two LF steps (+1 +1) reach row 4, whose stored offset is 0, so the match is at offset 0 + 2 = 2

BWT search

• Finding location takes constant time if the offsets into T are evenly spaced in T, not in the SA!

• Make tradeoff between space (RAM) and time (how long lookups take)
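The lookup with a partial suffix array can be sketched as follows. All names are hypothetical, and the sketch assumes offsets are sampled every k positions in T, so offset 0 is always stored and the walk never passes the start of the text. With k = 3 this keeps offsets 6, 3 and 0 for "banana$".

```python
from collections import Counter


def first_column_starts(bwt):
    """Row at which each character's block begins in the F column."""
    counts = Counter(bwt)
    starts, row = {}, 0
    for ch in sorted(counts):
        starts[ch] = row
        row += counts[ch]
    return starts


def sample_suffix_array(sa, k):
    """Keep only entries whose text offset is a multiple of k (row -> offset)."""
    return {row: off for row, off in enumerate(sa) if off % k == 0}


def locate(bwt, starts, sampled, row):
    """Text offset for a row: walk LF until a stored offset is found."""
    steps = 0
    while row not in sampled:
        ch = bwt[row]
        row = starts[ch] + bwt[:row].count(ch)  # one step backwards in T
        steps += 1
    return sampled[row] + steps


bwt = "annb$aa"
sa = [6, 5, 3, 1, 0, 4, 2]            # full SA, only used to build the sample
sampled = sample_suffix_array(sa, 3)  # {0: 6, 2: 3, 4: 0}
starts = first_column_starts(bwt)
print(locate(bwt, starts, sampled, 6))  # 2
```

A larger k stores fewer offsets but needs up to k - 1 LF steps per lookup, which is the space and time tradeoff described above.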

Things we left out

• Rank calculations in the BWT need to be fast! Needs another auxiliary data structure

• Only covered exact matching, read alignment requires mismatches (e.g. SNP in read, not in genome)

• Other details: store forwards and backwards indices of genome due to sequencing error profile
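For the first point, one common idea is to store running character counts at regular checkpoints and scan only the short stretch after the nearest checkpoint. The sketch below is an assumption about how such a structure might look, not the layout used by any particular tool; the class name `CheckpointedRank` and the `step` parameter are hypothetical.

```python
class CheckpointedRank:
    """Answer rank(ch, i) = occurrences of ch in bwt[:i] using checkpoints."""

    def __init__(self, bwt, step=4):
        assert bwt, "expects a non-empty BWT string"
        self.bwt = bwt
        self.step = step
        self.checkpoints = []  # checkpoints[k][ch] = count of ch in bwt[:k*step]
        running = {ch: 0 for ch in set(bwt)}
        for i, ch in enumerate(bwt):
            if i % step == 0:
                self.checkpoints.append(dict(running))
            running[ch] += 1

    def rank(self, ch, i):
        # Nearest checkpoint at or before position i.
        k = min(i // self.step, len(self.checkpoints) - 1)
        count = self.checkpoints[k].get(ch, 0)
        # Scan the characters between the checkpoint and position i.
        for j in range(k * self.step, i):
            if self.bwt[j] == ch:
                count += 1
        return count


occ = CheckpointedRank("annb$aa", step=4)
print(occ.rank("n", 2))  # 1: one 'n' in "an"
print(occ.rank("a", 7))  # 3: three 'a' in the whole BWT
```

Smaller steps mean faster queries and more memory, the same kind of tradeoff as for the sampled suffix array.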

Lab exercises

BWA, SAM/BAM format, samtools

Reference Data

• Reference genome

• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.fa.gz

• md5sum chromosome22.fa.gz (168c78298e731128ee622cf422e70f1e)

• gunzip chromosome22.fa.gz

• du -h chromosome22.fa (49 MB, genome is 2.9 GB)

• less chromosome22.fa (where's the DNA?)

• grep -nv "^N" chromosome22.fa | head (line 175169!)

bwa index

• Indexing options:

• bwa index

• Index human chromosome 22 (~1.5 mins, genome takes ~1.5 hours):

• bwa index chromosome22.fa

bwa index output

bash-3.2$ bwa index chromosome22.fa

[bwa_index] Pack FASTA... 0.52 sec

[bwa_index] Construct BWT for the packed sequence...

[BWTIncCreate] textLength=101636936, availableWord=19151484

[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.




[bwt_gen] Finished constructing BWT in 40 iterations.

[bwa_index] 74.38 seconds elapse.

[bwa_index] Update BWT... 0.38 sec

[bwa_index] Pack forward-only FASTA... 0.37 sec

[bwa_index] Construct SA from BWT and Occ... 13.67 sec

[main] Version: 0.7.12-r1039

[main] CMD: bwa index chromosome22.fa

[main] Real time: 93.689 sec; CPU: 89.320 sec

Read Data

• Paired-end reads

• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_1.fastq.gz

• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_2.fastq.gz

• md5sum chromosome22.reads_1.fastq.gz chromosome22.reads_2.fastq.gz

de1cd26056c61571de5cdf246ede60d3 chromosome22.reads_1.fastq.gz

2be64fb5848c2997af0ab8fab416d539 chromosome22.reads_2.fastq.gz

• gunzip chromosome22.reads_1.fastq.gz (and the other file)

• less chromosome22.reads_1.fastq

bwa mapping options

• Several alignment options:

• bwa mem (70bp+ Illumina, 454, IonTorrent, Sanger)

• bwa bwasw (Smith-Waterman, frequent gaps)

• bwa aln/samse/sampe (short reads, original)

bwa mem

• Mapping paired-end data

• bwa mem [options] <idxbase> <in1.fq> <in2.fq>

• bwa mem -t 4 chromosome22.fa chromosome22.reads_1.fastq chromosome22.reads_2.fastq > chromosome22.sam

• -t specifies the number of threads to use

Sequence Alignment/Map format (SAM)

• SAM format is a TAB-delimited text file that we can inspect with a pager:

• less -S chromosome22.sam

• Each row represents an alignment, at least 11 fields

• Specification: https://samtools.github.io/hts-specs/SAMv1.pdf


SAM fields

• Column 1: read name

• Column 3: reference sequence name (in our case "22")

• Column 4: reference sequence position (reads were extracted from a 2 Mbase region)
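The columns can also be pulled apart programmatically. The sketch below splits one alignment line into the 11 mandatory fields named in the SAM specification; the function name `parse_sam_line` and the example record are made up for illustration and are not taken from chromosome22.sam.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one alignment line (not an '@' header line) into named fields."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = cols[11:]  # optional fields, if any
    return record


# Hypothetical record, for illustration only.
example = ("read_001\t99\t22\t10732771\t60\t10M\t=\t10732900\t139\t"
           "ACGTACGTAC\tIIIIIIIIII\tNM:i:0")
rec = parse_sam_line(example)
print(rec["QNAME"], rec["RNAME"], rec["POS"])  # columns 1, 3 and 4
```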

SAM flags

• SAM flags in column 2 describe mapping result

• https://broadinstitute.github.io/picard/explain-flags.html
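The flag is a bit field, so it can be decoded with bitwise tests. The bit values below follow the SAM specification; the short labels and the function name `explain_flag` are chosen here for illustration.

```python
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def explain_flag(flag):
    """List the labels of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(explain_flag(99))   # 99 = 1 + 2 + 32 + 64
print(explain_flag(147))  # 147 = 1 + 2 + 16 + 128
```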

SAM post-processing

• Convert SAM file to BAM format:

• samtools view -Sb -o chromosome22.unsorted.bam chromosome22.sam

• Sort BAM file:

• samtools sort -o chromosome22.bam chromosome22.unsorted.bam

• Index BAM file:

• samtools index chromosome22.bam

samtools tview

• View alignment in console (in pileup format https://en.wikipedia.org/wiki/Pileup_format ):

• samtools tview chromosome22.bam chromosome22.fa

• Scroll with arrow keys (but remember beginning of chr22 is all Ns)

• Type "g" (without quotes) and type "22:10732771" to get to a region where reads are mapped

• Get to help screen by typing "?"

• Exit with "q"

