Suffix arrays, BWT and FM-index
Alan Medlar Wednesday 16th March 2016
Outline
• Lecture: Technical background for read mapping tools used in this course
• Suffix array
• Burrows-Wheeler transform (BWT)
• FM-index
• Lab session: Using BWA to map paired-end data against the human genome, SAM/BAM files, etc
Read mapping
• Sequencers can generate up to 100 million reads per sample
• Human genome is ~3 billion basepairs
• Need to map reads to the genome to discover variants (SNVs, indels), counts (gene expression)
Preliminaries
• String
• sequence of characters,
• e.g. "banana", "ATGC", "MDLISTFS"
• Alphabet { A, C, G, T, $ }, { A-Z, a-z, $ }
• Lexicographical order
• $ < A < C < G < T
Preliminaries
• Prefix
• non-empty substring that is the beginning of another string (left-to-right)
• e.g. "banana", "ATGC", "MDLISTFS"
• Suffix
• non-empty substring that is the ending of another string (right-to-left)
• e.g. "banana", "ATGC", "MDLISTFS"
Naïve exact search
• Text = "banana"
• Query = "nana"
• Linear search
B A N A N A
N A N A
  N A N A
    N A N A
(the query slides along the text one position at a time; it matches at offset 2)
Naïve search is too slow
• Human genome ~3 billion basepairs
• Read 100 basepairs
• Complexity of search scales linearly with the length of the text!
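The linear search above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture; the function name `naive_search` is made up for the example.

```python
def naive_search(text, query):
    """Return every offset at which query occurs in text (O(m*n) in the worst case)."""
    hits = []
    for start in range(len(text) - len(query) + 1):
        # Compare the query against the window of the text starting at this offset.
        if text[start:start + len(query)] == query:
            hits.append(start)
    return hits


print(naive_search("banana", "nana"))  # [2]
print(naive_search("banana", "ana"))   # [1, 3]
```

Every offset of the text is visited, which is why the cost grows linearly with the length of the text.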
Suffix array
• Introduced by Manber and Myers (1990) as a space-efficient alternative to the suffix tree (and independently by Gonnet (1987))
• Sorted array of all suffixes of a given text
• Allows fast search of very large texts (e.g. genomes)
SA: building
• $ is lexicographically lower than all other characters in the alphabet and cannot appear in the text otherwise
• All suffixes of the text, with their start offsets:
B A N A N A $ 0
A N A N A $ 1
N A N A $ 2
A N A $ 3
N A $ 4
A $ 5
$ 6
• Sorted lexicographically (the offsets, read top to bottom, are the suffix array):
$ 6
A $ 5
A N A $ 3
A N A N A $ 1
B A N A N A $ 0
N A $ 4
N A N A $ 2
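The construction shown on these slides can be sketched as follows. This is a simple sort of suffix offsets, assuming the text already ends in "$"; it is not one of the specialised construction algorithms real tools use, and `build_suffix_array` is a name invented for this example.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, sorted by suffix."""
    if not text.endswith("$"):
        raise ValueError("text must end with the sentinel '$'")
    # Sort the offsets 0..n-1 by the suffix each one points at.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(build_suffix_array("BANANA$"))  # [6, 5, 3, 1, 0, 4, 2]
```

The sort works here because "$" has a lower character code than the letters, matching the ordering the slides assume.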
SA: querying
• Search for suffixes in the suffix array that start with our query string (the query is a prefix of each matching suffix)
• SA is sorted, so we can use binary search!
$ 6
A $ 5
A N A $ 3
A N A N A $ 1
B A N A N A $ 0
N A $ 4
N A N A $ 2
Query: N A N A (binary search narrows down to the last row, offset 2)
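A minimal sketch of the binary search over the suffix array, assuming the simple sort-based construction; `build_suffix_array` and `sa_search` are illustrative names, not functions from any tool used in the course.

```python
def build_suffix_array(text):
    """Offsets of all suffixes of text, sorted by suffix (simple, not optimised)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_search(text, sa, query):
    """Return the sorted text offsets where query occurs, using two binary searches."""
    m = len(query)
    # Lower bound: first suffix whose first m characters are >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose first m characters are > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "BANANA$"
sa = build_suffix_array(text)
print(sa_search(text, sa, "NANA"))  # [2]
print(sa_search(text, sa, "ANA"))   # [1, 3]
```

Each comparison looks at up to m characters and there are O(log n) comparisons, which gives the O(m log(n)) bound quoted on the next slide.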
SA vs. naïve search
• Searching the human genome (~3 billion basepairs, n) for a single-end read (100 basepairs, m)
• Naïve search O(mn)
• Suffix array search O(m log(n))
n O(n) O(log(n))
8 8 3
16 16 4
32 32 5
64 64 6
128 128 7
256 256 8
512 512 9
Good enough for read mapping?
• Human genome is ~3 billion basepairs
• Assume 5 bytes per basepair (1 byte characters, 4 byte integers) = ~14 GB
• NGS data really hit in 2009 (16 GB RAM at the time was a luxury!)
Burrows-Wheeler transform
• Invented by Burrows and Wheeler (1994) while working at DEC
• Used in compression (.bz2 files)
• Interested in three things:
• how to perform BWT
• why BWT is useful for compression
• how to reverse BWT
BWT
• Write out every rotation of the text:
B A N A N A $
A N A N A $ B
N A N A $ B A
A N A $ B A N
N A $ B A N A
A $ B A N A N
$ B A N A N A
• Sort the rotations lexicographically; the last column, read top to bottom, is the BWT:
$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A
BWT compression
• T = "banana$"
• BWT(T) = "annb$aa"
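The rotate-and-sort construction can be written directly, as a sketch. It builds the full rotation matrix, so it is only practical for small texts; the name `bwt_by_rotations` is invented for this example.

```python
def bwt_by_rotations(text):
    """Return the BWT of text: the last column of the sorted rotation matrix."""
    # All rotations of the text, sorted lexicographically ("$" sorts first).
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # Take the last character of every sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt_by_rotations("banana$"))  # annb$aa
```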
BWT compression
• T = "peter_piper_picked_a_peck_of_pickled_peppers_a_peck_of_pickled_peppers_peter_piper_picked_if_peter_piper_picked_a_peck_of_pickled_peppers_wheres_the_peck_of_pickled_peppers_peter_piper_picked"
• BWT(T) = "ddsddkkkkaeaaddddsfsrrrrffffrrrrss___eeeeiiiiiiiieeeeeeeehppppkkkkllllpppppppptttthpppprppppiooootwpppppppp_ppppcccccccccccckkkk____________iiiipppp_______________eeeeeeeeeeeeeeeeerrrereeee__"
Relation to suffix array
$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A
• BWT matrix truncated at "$" in each row is the suffix array of the same text
• BWT can be computed directly from the suffix array
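One way to see the last point: the BWT character for each row is the text character just before that row's suffix. The sketch below assumes a sort-based suffix array; both function names are illustrative.

```python
def build_suffix_array(text):
    """Offsets of all suffixes of text, sorted by suffix (simple, not optimised)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT[row] is the character preceding the suffix that starts at sa[row]."""
    # For offset 0, text[-1] wraps around to the final "$".
    return "".join(text[i - 1] for i in sa)


text = "banana$"
print(bwt_from_suffix_array(text, build_suffix_array(text)))  # annb$aa
```

This avoids materialising the rotation matrix, since only the suffix offsets are needed.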
Reverse BWT
• It is not very useful to compress something if we cannot get the original text back!
• BWT'(BWT(T)) = T
LF mapping (T-rank)
• T-rank: number each character by its order of occurrence in the text T
B A N A N A $
B0 A0 N0 A1 N1 A2 $ (T-rank)
• Sorted rotation matrix; F is the first column and L is the last column (the BWT):
$ B0 A0 N0 A1 N1 A2
A2 $ B0 A0 N0 A1 N1
A1 N1 A2 $ B0 A0 N0
A0 N0 A1 N1 A2 $ B0
B0 A0 N0 A1 N1 A2 $
N1 A2 $ B0 A0 N0 A1
N0 A1 N1 A2 $ B0 A0
• Ns in the L column are sorted by their "right context", same as Ns in F column!
LF mapping (B-rank)
• B-rank: number each character by its order of occurrence in the BWT (L column) instead of in T
B A N A N A $
B0 A0 N0 A1 N1 A2 $ (T-rank)
B0 A2 N1 A1 N0 A0 $ (B-rank)
• Sorted rotation matrix with B-ranks; F is the first column and L is the last column:
$ B0 A2 N1 A1 N0 A0
A0 $ B0 A2 N1 A1 N0
A1 N0 A0 $ B0 A2 N1
A2 N1 A1 N0 A0 $ B0
B0 A2 N1 A1 N0 A0 $
N0 A0 $ B0 A2 N1 A1
N1 A1 N0 A0 $ B0 A2
• F column contains very little information, just counts of each character
LF mapping (B-rank)
• L column (rows 0 to 6): A0 N0 N1 B0 $ A1 A2
• Character counts: { $:1, A:3, B:1, N:2 }
• Which row contains N1 in the F column?
• Skip $ (+1)
• Skip As (+3)
• Skip Bs (+1)
• Skip first N (+1) = 6
• F column with row numbers:
0 $
1 A0
2 A1
3 A2
4 B0
5 N0
6 N1
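The "skip" arithmetic can be sketched as a small function. The helper names (`first_column_offsets`, `b_ranks`, `lf`) are made up for this example, and the ranks are computed with a plain scan rather than any fast auxiliary structure.

```python
from collections import Counter


def first_column_offsets(bwt):
    """Map each character to the row where its block starts in the F column."""
    counts = Counter(bwt)
    offsets, total = {}, 0
    for ch in sorted(counts):
        offsets[ch] = total
        total += counts[ch]
    return offsets


def b_ranks(bwt):
    """B-rank of each L entry: how many times the same character occurred earlier in L."""
    seen = Counter()
    ranks = []
    for ch in bwt:
        ranks.append(seen[ch])
        seen[ch] += 1
    return ranks


def lf(bwt, row, offsets, ranks):
    """Row of the F column that holds the same character occurrence as L[row]."""
    return offsets[bwt[row]] + ranks[row]


bwt = "ANNB$AA"
offsets = first_column_offsets(bwt)  # {'$': 0, 'A': 1, 'B': 4, 'N': 5}
ranks = b_ranks(bwt)                 # [0, 0, 1, 0, 0, 1, 2]
print(lf(bwt, 2, offsets, ranks))    # 6: N1 sits in row 6 of F
```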
Reverse BWT
• Use B-ranking to reverse BWT, recreating the text T from right-to-left
• L column (rows 0 to 6): A0 N0 N1 B0 $ A1 A2
• F column (rows 0 to 6): $ A0 A1 A2 B0 N0 N1
• Character counts: { $:1, A:3, B:1, N:2 }
• Start at row 0 (F = $) and repeatedly prepend the L character, then jump to that character's row in F:
$
A0 $
N0 A0 $
A1 N0 A0 $
N1 A1 N0 A0 $
A2 N1 A1 N0 A0 $
B0 A2 N1 A1 N0 A0 $
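A minimal sketch of the reversal, assuming the BWT contains exactly one "$" that sorts before every other character. The name `inverse_bwt` is invented for the example.

```python
from collections import Counter


def inverse_bwt(bwt):
    """Rebuild the original text (ending in '$') from its BWT, right to left."""
    # Start row of each character's block in the F column.
    counts = Counter(bwt)
    offsets, total = {}, 0
    for ch in sorted(counts):
        offsets[ch] = total
        total += counts[ch]
    # B-rank of every position in L.
    seen = Counter()
    ranks = []
    for ch in bwt:
        ranks.append(seen[ch])
        seen[ch] += 1

    row = 0        # row 0 starts with "$", so L[0] is the character just before "$"
    out = ["$"]
    while bwt[row] != "$":
        out.append(bwt[row])
        row = offsets[bwt[row]] + ranks[row]  # LF mapping: follow the character into F
    return "".join(reversed(out))


print(inverse_bwt("ANNB$AA"))  # BANANA$
```

The characters are collected in the same order as on the slides ($, then A, N, A, N, A, B) and reversed at the end.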
FM-index
• All BWT allows us to do is compress text
• Ferragina and Manzini (2000) "Full-text index in Minute space"
• Combine BWT with other auxiliary data structures to get an index
• Space savings: e.g. Human genome (3 billion bp)
• SA = ~14 GB (5 bytes/bp)
• FM = ~1.5 GB (2 bits/bp)
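The per-basepair figures can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: the 3 billion bp genome size is the rounded figure from the slides, and sizes are reported in GiB (1024^3 bytes), which is an assumption about the units.

```python
GENOME_BP = 3_000_000_000
GIB = 1024 ** 3

# Suffix array plus text: 1 byte per character and a 4-byte integer per suffix.
sa_bytes = GENOME_BP * (1 + 4)
# Sequence packed at 2 bits per basepair (4 bases fit in one byte).
packed_bytes = GENOME_BP * 2 / 8

print(f"SA + text:     {sa_bytes / GIB:.1f} GiB")      # 14.0 GiB
print(f"2 bits per bp: {packed_bytes / GIB:.2f} GiB")  # 0.70 GiB
```

The 2-bit packing on its own comes to roughly 0.7 GiB. The ~1.5 GB quoted on the slide is about twice that, and the slides do not break down what the remainder covers.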
Cannot search BWT like SA
$ B A N A N A
A $ B A N A N
A N A $ B A N
A N A N A $ B
B A N A N A $
N A $ B A N A
N A N A $ B A
• Rotation matrix contains the suffix array
• But we only store F and L columns, so binary search of prefixes not possible
BWT search
• In SA, we matched successively longer prefixes (left-to-right) of query string (binary search)
• In BWT, we will match successively longer suffixes (right-to-left) of query string (reverse BWT transform)
BWT search
• Sorted rotation matrix with B-ranks; F is the first column and L is the last column:
$ B0 A2 N1 A1 N0 A0
A0 $ B0 A2 N1 A1 N0
A1 N0 A0 $ B0 A2 N1
A2 N1 A1 N0 A0 $ B0
B0 A2 N1 A1 N0 A0 $
N0 A0 $ B0 A2 N1 A1
N1 A1 N0 A0 $ B0 A2
• Query: N A N A
• We know BWT contains the query, but unlike SA, we do not know the location of the match in T!
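The right-to-left matching can be sketched as a backward search over a range of matrix rows. This is an illustrative toy, not BWA's implementation: `backward_search` is an invented name, and the rank is computed by scanning the BWT, which a real index would replace with a fast lookup.

```python
from collections import Counter


def backward_search(bwt, query):
    """Return the half-open range [lo, hi) of sorted-matrix rows that start with query."""
    # Start row of each character's block in the F column.
    counts = Counter(bwt)
    offsets, total = {}, 0
    for ch in sorted(counts):
        offsets[ch] = total
        total += counts[ch]

    def rank(ch, i):
        # Occurrences of ch in bwt[:i]; a linear scan, fine for a toy example.
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt)
    for ch in reversed(query):  # match successively longer suffixes of the query
        if ch not in offsets:
            return (0, 0)
        lo = offsets[ch] + rank(ch, lo)
        hi = offsets[ch] + rank(ch, hi)
        if lo >= hi:
            return (0, 0)  # empty range: the query does not occur
    return (lo, hi)


print(backward_search("ANNB$AA", "NANA"))  # (6, 7): one match, in row 6
print(backward_search("ANNB$AA", "ANA"))   # (2, 4): two matches
```

For "NANA" the row range narrows from rows 1 to 3 (A), to rows 5 to 6 (NA), to rows 2 to 3 (ANA), and finally to row 6 (NANA). The result is a row range, not a position in T, which is the problem the following slides address.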
BWT search
• Idea: just store SA as well?
$ 6
A $ 5
A N A $ 3
A N A N A $ 1
B A N A N A $ 0
N A $ 4
N A N A $ 2
• Idea 2: store part of SA?
$ 6
A $
A N A $ 3
A N A N A $
B A N A N A $ 0
N A $
N A N A $
• ... and walk backwards through the BWT! Each step back adds +1, until a row with a stored offset is reached
• For the match N A N A $ (row 6): two steps back (+1 +1) reach B A N A N A $ with stored offset 0, so the match starts at 0 + 2 = 2
• Full SA for comparison: 6 5 3 1 0 4 2
BWT search
• Finding location takes constant time if the offsets into T are evenly spaced in T, not in the SA!
• Make tradeoff between space (RAM) and time (how long lookups take)
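A sketch of the lookup with a partial suffix array, assuming the stored offsets are the ones shown on the slides (text offsets 0, 3 and 6, evenly spaced in T). The dictionary `sampled_sa` and the function `locate` are invented for this example.

```python
from collections import Counter


def locate(bwt, row, sampled_sa):
    """Walk backwards with LF steps until a sampled row is hit; return the text offset."""
    counts = Counter(bwt)
    offsets, total = {}, 0
    for ch in sorted(counts):
        offsets[ch] = total
        total += counts[ch]

    steps = 0
    while row not in sampled_sa:
        ch = bwt[row]
        # LF mapping: jump to the row whose suffix starts one character earlier in T.
        row = offsets[ch] + bwt[:row].count(ch)
        steps += 1
    return sampled_sa[row] + steps


bwt = "ANNB$AA"
sampled_sa = {0: 6, 2: 3, 4: 0}    # matrix row -> text offset, every 3rd offset of T
print(locate(bwt, 6, sampled_sa))  # 2: two steps back, then stored offset 0
```

Because the samples are evenly spaced in T, the walk never needs more than two steps here. Storing fewer samples saves memory at the cost of longer walks.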
Things we left out
• Rank calculations in the BWT need to be fast! Needs another auxiliary data structure
• Only covered exact matching, read alignment requires mismatches (e.g. SNP in read, not in genome)
• Other details: store forwards and backwards indices of genome due to sequencing error profile
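The slides do not say which auxiliary structure is used for fast rank calculations. One common approach, shown here purely as an assumption, is to store cumulative character counts at regular checkpoints and scan only the short stretch after the nearest checkpoint. The names `build_checkpoints` and `rank`, and the spacing `step`, are hypothetical.

```python
def build_checkpoints(bwt, step):
    """Store cumulative character counts before every step-th position of the BWT."""
    checkpoints = []
    running = {ch: 0 for ch in set(bwt)}
    for i, ch in enumerate(bwt):
        if i % step == 0:
            checkpoints.append(dict(running))  # counts of each character in bwt[:i]
        running[ch] += 1
    return checkpoints


def rank(bwt, checkpoints, step, ch, i):
    """Occurrences of ch in bwt[:i], using the nearest checkpoint at or before i."""
    block = min(i // step, len(checkpoints) - 1)
    start = block * step
    # Checkpoint count plus a scan of at most a few characters after it.
    return checkpoints[block].get(ch, 0) + bwt[start:i].count(ch)


bwt = "ANNB$AA"
cps = build_checkpoints(bwt, 3)
print(rank(bwt, cps, 3, "A", 7))  # 3
print(rank(bwt, cps, 3, "N", 2))  # 1
```

As with the sampled suffix array, the checkpoint spacing trades memory for lookup time.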
Lab exercises
BWA, SAM/BAM format, samtools
Reference Data
• Reference genome
• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.fa.gz
• md5sum chromosome22.fa.gz (168c78298e731128ee622cf422e70f1e)
• gunzip chromosome22.fa.gz
• du -h chromosome22.fa (49 MB, genome is 2.9 GB)
• less chromosome22.fa (where's the DNA?)
• grep -nv "^N" chromosome22.fa | head (line 175169!)
bwa index
• Indexing options:
• bwa index
• Index human chromosome 22 (~1.5 mins, genome takes ~1.5 hours):
• bwa index chromosome22.fa
bwa index output
bash-3.2$ bwa index chromosome22.fa
[bwa_index] Pack FASTA... 0.52 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=101636936, availableWord=19151484
[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.
[bwt_gen] Finished constructing BWT in 40 iterations.
[bwa_index] 74.38 seconds elapse.
[bwa_index] Update BWT... 0.38 sec
[bwa_index] Pack forward-only FASTA... 0.37 sec
[bwa_index] Construct SA from BWT and Occ... 13.67 sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa index chromosome22.fa
[main] Real time: 93.689 sec; CPU: 89.320 sec
Read Data
• Paired-end reads
• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_1.fastq.gz
• wget http://wasabiapp.org/vbox/data/session_2/chromosome22.reads_2.fastq.gz
• md5sum chromosome22.reads_1.fastq.gz chromosome22.reads_2.fastq.gz
de1cd26056c61571de5cdf246ede60d3 chromosome22.reads_1.fastq.gz
2be64fb5848c2997af0ab8fab416d539 chromosome22.reads_2.fastq.gz
• gunzip chromosome22.reads_1.fastq.gz (and the other file)
• less chromosome22.reads_1.fastq
bwa mapping options
• Several alignment options:
• bwa mem (70bp+ Illumina, 454, IonTorrent, Sanger)
• bwa bwasw (Smith-Waterman, frequent gaps)
• bwa aln/samse/sampe (short reads, original)
bwa mem
• Mapping paired-end data
• bwa mem [options] <idxbase> <in1.fq> <in2.fq>
• bwa mem -t 4 chromosome22.fa chromosome22.reads_1.fastq chromosome22.reads_2.fastq > chromosome22.sam
• -t specifies the number of threads to use
Sequence Alignment/Map format (SAM)
• SAM format is a TAB-delimited text file, which we can inspect with a pager:
• less -S chromosome22.sam
• Each row represents an alignment, at least 11 fields
• Specification: https://samtools.github.io/hts-specs/SAMv1.pdf
SAM fields
• Column 1: read name
• Column 3: reference sequence name (in our case "22")
• Column 4: reference sequence position (reads were extracted from 2Mbase region)
SAM flags
• SAM flags in column 2 describe mapping result
• https://broadinstitute.github.io/picard/explain-flags.html
SAM post-processing
• Convert SAM file to BAM format:
• samtools view -Sb -o chromosome22.unsorted.bam chromosome22.sam
• Sort BAM file:
• samtools sort -o chromosome22.bam chromosome22.unsorted.bam
• Index BAM file:
• samtools index chromosome22.bam
samtools tview
• View alignment in console (in pileup format https://en.wikipedia.org/wiki/Pileup_format ):
• samtools tview chromosome22.bam chromosome22.fa
• Scroll with arrow keys (but remember beginning of chr22 is all Ns)
• Type "g" (without quotes) and type "22:10732771" to get to a region where reads are mapped
• Get to help screen by typing "?"
• Exit with "q"