Genomics Method Seminar - BWA
October 15, 2014
Sora Kim, Researcher (sora15@yuhs.ac)
Yonsei Biomedical Science Institute, Yonsei University College of Medicine
Today’s paper
• Heng Li, PhD
– a research scientist at the Broad Institute, working with David Reich and David Altshuler.
– principal developer of several projects including SAMtools, BWA, MAQ, TreeSoft and TreeFam, most of which were started when he was a postdoctoral fellow with Richard Durbin at the Wellcome Trust Sanger Institute.
Software information
• Purpose
– BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human.
• Category
– aligner
• Software URL
– http://bio-bwa.sourceforge.net/
• License
– Free, Open Source under Artistic License
RNA-seq
ChIP-seq
WGS, WES
Previous work
• Bowtie
– BWT + FM index
– LF mapping
– Backtracking
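The BWT, FM-index and LF-mapping ideas listed above can be illustrated with a small exact-match search. This is only a minimal sketch: the suffix array is built by naive sorting, the occurrence table is stored in full, and the function names are made up for illustration. It is not how Bowtie or BWA are implemented, and it leaves out backtracking for mismatches.

```python
from collections import Counter


def bwt_and_suffix_array(text):
    """Return (BWT, suffix array) of text + '$', using a naive suffix sort."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    return bwt, sa


def build_fm_index(bwt):
    """C[c] = number of characters smaller than c; occ[c][i] = count of c in bwt[:i]."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix-array interval [lo, hi) of exact matches."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # one LF-mapping step per pattern character
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_all(pattern, text):
    """Return the sorted start positions of pattern in text."""
    bwt, sa = bwt_and_suffix_array(text)
    c_table, occ = build_fm_index(bwt)
    lo, hi = backward_search(pattern, c_table, occ, len(bwt))
    return sorted(sa[i] for i in range(lo, hi))
```

For example, `find_all("ACG", "ACGTACGA")` narrows the suffix-array interval one character at a time, from the last pattern character to the first, and then reports the text positions stored in that interval.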
Conceptual Overview
• BWA: for short reads
• BWA-SW: for long reads
• BWA-MEM: for both
CUSHAW2 - MEMs
• Long read alignment based on maximal exact match seeds, Yongchao Liu and Bertil Schmidt, Bioinformatics (2012) 28 (18): i318-i324
• CUSHAW2 is a parallelized, accurate, and memory-efficient long read aligner. It is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments.
CUSHAW2 - MEMs
CUSHAW2 - MEMs
1. Estimation of the minimal seed size
2. Generation of maximal exact matches
1. Estimation of the minimal seed size
• The q-gram lemma states that two strings P and S with an edit distance of e share at least t q-grams (substrings of length q), where t = max(|P|,|S|) - q + 1 - q*e (Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming, Bioinformatics, Vol. 27 no. 10 2011, pages 1351–1358).
• That is, each error may destroy up to q overlapping q-grams, so e errors may destroy up to q*e of them.
• For non-overlapping q-grams, one error can destroy only the q-gram in which it is located.
• Given this assumption, we define the length q of the q-grams as the largest value below such that
1. Estimation of the minimal seed size
• A = ACGT
• B = ACTT
• Assume q = 2, e = 1
q(A) = {AC, CG, GT}
q(B) = {AC, CT, TT}
• t = max(|A|,|B|) - q + 1 - q*e
t = max(4, 4) - 2 + 1 - 2*1 = 1
• q(A) and q(B) must therefore share at least t = 1 q-gram (here, AC).
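The worked example can be checked mechanically. The sketch below enumerates overlapping q-grams, evaluates the lower bound t from the q-gram lemma, and counts the q-grams the two strings actually share. The function names are illustrative only.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_lower_bound(p, s, q, e):
    """t = max(|P|, |S|) - q + 1 - q*e from the q-gram lemma."""
    return max(len(p), len(s)) - q + 1 - q * e


def shared_qgrams(p, s, q):
    """Number of q-grams common to both strings, counted with multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(s, q))
    return sum(common.values())
```

With A = ACGT, B = ACTT, q = 2 and e = 1, `qgram_lower_bound` gives 1 and `shared_qgrams` also gives 1 (the q-gram AC), so the bound is met exactly.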
1. Estimation of the minimal seed size
• The estimation is based on the pigeonhole principle for non-overlapping q-grams: at least one q-gram of length Q is shared by S and its aligned substring mate on the genome.
• QL: global lower bound (default 13)
• QH: global upper bound (default 49)
• A simplified error model for ungapped alignments is employed to estimate e, in which w, the number of substitutions in the alignment, follows a binomial distribution.
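A rough sketch of how these pieces might fit together follows. It assumes that e is chosen as the smallest number of errors whose binomial tail probability falls below a cutoff, and that Q is the pigeonhole value floor(|S| / (e + 1)) clamped to [QL, QH]. The per-base error rate `p_err` and the cutoff `tail` are illustrative assumptions, not values taken from the slides.

```python
from math import comb


def estimate_errors(read_len, p_err=0.02, tail=0.05):
    """Smallest e with P(w > e) < tail, where w ~ Binomial(read_len, p_err).

    p_err and tail are assumed, illustrative parameters.
    """
    cumulative = 0.0
    for e in range(read_len + 1):
        cumulative += comb(read_len, e) * p_err**e * (1 - p_err) ** (read_len - e)
        if 1.0 - cumulative < tail:
            return e
    return read_len


def minimal_seed_size(read_len, e, q_low=13, q_high=49):
    """Pigeonhole seed length: e errors leave one of e + 1 equal pieces intact."""
    q = read_len // (e + 1)
    return min(q_high, max(q_low, q))
```

For a 100 bp read with e = 4, the read splits into five non-overlapping pieces of 20 bp, at least one of which is error-free, so the seed size is 20. With e = 0 the raw value of 100 is capped at QH = 49.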
2. Generation of maximal exact matches
• To identify MEMs between S and T, we advance the starting position p in S, from left to right, to find the longest exact matches (LEMs) using the BWT and the FM-index.
• An LEM is both left- and right-maximal, and hence a MEM, if it is not part of any previously identified MEM.
• Discard the MEMs whose lengths are less than Q.
– For each remaining MEM, keep only its first h (h=1024 by default) occurrences and discard the others.
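The left-to-right scan can be sketched in a few lines. To keep the example short, the sketch below replaces the BWT/FM-index lookup with plain substring search, which is far slower and is purely an illustrative substitution. The parameter names `min_len` (for Q) and `max_occ` (for h) are assumptions.

```python
def find_occurrences(text, pattern, limit):
    """First `limit` start positions of pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1 and len(hits) < limit:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def longest_match_from(query, text, p):
    """Length of the longest exact match (LEM) starting at query position p."""
    length = 0
    while p + length < len(query) and query[p:p + length + 1] in text:
        length += 1
    return length


def generate_mems(query, text, min_len, max_occ=1024):
    """Return (query_start, length, reference_positions) for each MEM kept."""
    mems, covered_end = [], 0
    for p in range(len(query)):
        length = longest_match_from(query, text, p)
        end = p + length
        # An LEM ending no further than an earlier one lies inside that MEM.
        if length > 0 and end > covered_end:
            covered_end = end
            if length >= min_len:
                hits = find_occurrences(text, query[p:end], max_occ)
                mems.append((p, length, hits))
    return mems
```

For example, `generate_mems("ACGTTTGCA", "GGACGTAATTGCAGG", min_len=4)` keeps the matches ACGT and TTGCA, while the LEMs starting inside them are skipped because they are contained in an earlier MEM.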
BWA-MEM
1. Aligning a single query sequence
a. Seeding and re-seeding
b. Chaining and chain filtering
c. Seed extension
2. Paired-end mapping
a. Rescuing missing hits
b. Pairing
SE. Seeding and re-seeding
• BWA-MEM follows the canonical seed-and-extend paradigm.
• It seeds an alignment with SMEMs (supermaximal exact matches), found by essentially taking, at each query position, the longest exact match covering the position.
• To reduce mismappings caused by missing seeds, we introduce re-seeding.
• Suppose we have a SMEM of length l with k occurrences in the reference genome. If l is too large (over 28bp by default), BWA-MEM re-seeds with the longest exact matches that cover the middle base of the SMEM and occur at least k+1 times in the genome.
SE. Chaining and chain filtering
• We call a group of seeds that are colinear and close to each other a chain.
• We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain).
• Chain filtering aims to reduce unsuccessful seed extension at a later step.
• Chains detected here do not need to be accurate.
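Greedy chaining and chain filtering can be sketched as below. Seeds are represented as `(query_start, ref_start, length)` tuples and chain weight is taken as the sum of seed lengths, which is a simplification. The closeness limits `max_gap` and `max_diag_diff` and the "largely contained" cutoff `min_overlap` are assumed values for illustration. Only the 50% and 38bp criteria come from the slide.

```python
def chain_seeds(seeds, max_gap=100, max_diag_diff=10):
    """Greedily group seeds that are colinear and close to each other."""
    chains = []
    for seed in sorted(seeds, key=lambda s: (s[1], s[0])):
        qbeg, rbeg, _ = seed
        for chain in chains:
            last_q, last_r, last_len = chain[-1]
            dq, dr = qbeg - last_q, rbeg - last_r
            colinear = dq >= 0 and dr >= 0 and abs(dq - dr) <= max_diag_diff
            close = dq - last_len < max_gap and dr - last_len < max_gap
            if colinear and close:
                chain.append(seed)
                break
        else:
            chains.append([seed])  # no existing chain accepted the seed
    return chains


def weight(chain):
    """Simplified chain weight: total seed length."""
    return sum(length for _, _, length in chain)


def query_span(chain):
    """Half-open query interval covered by the chain."""
    return chain[0][0], chain[-1][0] + chain[-1][2]


def filter_chains(chains, min_overlap=0.5, drop_ratio=0.5, min_diff=38):
    """Drop chains largely contained in, and much worse than, a better chain."""
    kept = []
    for chain in sorted(chains, key=weight, reverse=True):
        beg, end = query_span(chain)
        dominated = False
        for better in kept:
            b_beg, b_end = query_span(better)
            overlap = max(0, min(end, b_end) - max(beg, b_beg))
            if (overlap >= min_overlap * (end - beg)
                    and weight(chain) < drop_ratio * weight(better)
                    and weight(better) - weight(chain) >= min_diff):
                dominated = True
                break
        if not dominated:
            kept.append(chain)
    return kept
```

Because extension later recomputes the alignment, the chains only need to be good enough to decide which seeds are worth extending, which is why a coarse greedy pass is acceptable here.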
SE. Seed extension
• Rank each seed by the length of the chain it belongs to, and then by the seed length.
• Drop the seed if it is already contained in an alignment found before; otherwise, if it potentially leads to a new alignment, extend it with banded affine-gap-penalty dynamic programming (DP).
SE. Seed extension
• banded affine-gap-penalty dynamic programming
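A minimal sketch of banded affine-gap DP for extending from a seed is given below. It fills only the cells within `band` of the main diagonal and returns the best score reached anywhere, so the extension may stop before the end of either sequence. The scoring values (match 1, mismatch 4, gap open 6, gap extend 1) and the band width are assumptions chosen for illustration. The real implementation is vectorized and applies further heuristics not shown here.

```python
NEG_INF = float("-inf")


def banded_extend(query, ref, band=5, match=1, mismatch=4,
                  gap_open=6, gap_ext=1, h0=0):
    """Best extension score starting from (0, 0) with initial score h0.

    A gap of length k costs gap_open + k * gap_ext.
    """
    n, m = len(query), len(ref)
    H = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in the query
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in the reference
    H[0][0] = h0
    for j in range(1, min(m, band) + 1):
        E[0][j] = H[0][j] = h0 - gap_open - gap_ext * j
    for i in range(1, min(n, band) + 1):
        F[i][0] = H[i][0] = h0 - gap_open - gap_ext * i
    best = h0
    for i in range(1, n + 1):
        # Cells outside the band stay at -inf and are never used.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            E[i][j] = max(H[i][j - 1] - gap_open - gap_ext, E[i][j - 1] - gap_ext)
            F[i][j] = max(H[i - 1][j] - gap_open - gap_ext, F[i - 1][j] - gap_ext)
            sub = match if query[i - 1] == ref[j - 1] else -mismatch
            H[i][j] = max(H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With a band of 0 the routine can only follow the diagonal, so a single inserted base in the reference ends the useful extension. A band of 1 or more lets it pay the gap penalty once and continue matching.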
SE. Seed extension
• BWA-MEM’s seed extension differs from the standard seed extension in two aspects.
1. Suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y.
2. While extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence.
PE. Rescuing missing hits
• BWA-MEM estimates the mean and the variance of the insert size distribution from reliable single-end hits.
• For the top 100 hits (by default) of either end, if the mate is unmapped in an insert-size window around each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment for the mate within the window.
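One way to picture the rescue window is sketched below. The window bounds did not survive in the slide text, so the sketch assumes a symmetric window of `n_sigma` standard deviations around the mean insert size. The value 4 and the function names are assumptions made only for illustration.

```python
from statistics import mean, pstdev


def insert_size_window(insert_sizes, n_sigma=4.0):
    """Return (mean, std, (low, high)) estimated from reliable insert sizes.

    n_sigma is an assumed window half-width, in standard deviations.
    """
    mu = mean(insert_sizes)
    sigma = pstdev(insert_sizes)
    return mu, sigma, (mu - n_sigma * sigma, mu + n_sigma * sigma)


def mate_in_window(observed_insert, window):
    """True if an observed insert size falls inside the rescue window."""
    low, high = window
    return low <= observed_insert <= high
```

A hit whose mate has no alignment inside this window would be a candidate for Smith-Waterman rescue restricted to the window.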
PE. Rescuing missing hits
• Hits found from both the single-sequence alignment and SW rescuing will be used for pairing.
PE. Pairing
• Given the i-th hit for the first read and the j-th hit for the second read:
• BWA-MEM computes their distance if the two hits are in the right orientation, or sets it to infinity otherwise.
• It then scores the pair (i, j):
– P(d) gives the probability of observing an insert size larger than d, assuming a normal distribution.
– ‘log4’ arises when we interpret the SW score as an odds ratio.
– U is a threshold that controls pairing: if the distance is small enough that the insert-size penalty stays below U, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments.
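The scoring formula itself is not preserved in the slide text. The sketch below assumes the pair score has the form S_i + S_j - min(-a * log4 P(d), U), where S_i and S_j are the SW scores of the two hits and a is the match score. This form is consistent with the terms described above, but it should be read as an assumption, as should the values a = 1 and U = 17 and the use of an upper-tail normal probability for P(d).

```python
import math


def upper_tail(d, mu, sigma):
    """P(insert size > d) under Normal(mu, sigma)."""
    return 0.5 * math.erfc((d - mu) / (sigma * math.sqrt(2)))


def pair_score(s_i, s_j, d, mu, sigma, a=1, u=17):
    """Assumed pair score: s_i + s_j - min(-a * log4 P(d), U)."""
    if math.isinf(d):
        penalty = u  # wrong orientation: distance was set to infinity
    else:
        p = upper_tail(d, mu, sigma)
        penalty = u if p <= 0.0 else min(-a * math.log(p, 4), u)
    return s_i + s_j - penalty
```

Under these assumptions, a pair at the mean insert size pays a penalty of only 0.5, while a pair with an implausible distance pays the full U, which is the point at which leaving the ends unpaired becomes equally attractive.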
Results
Running Operation
• MEM mode
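A typical MEM-mode run takes an indexed reference (built beforehand with `bwa index`) plus one FASTQ file for single-end reads or two for paired-end reads, and writes SAM to standard output. The sketch below only assembles the command line. The helper name and the file names in the usage note are hypothetical, and `bwa` is assumed to be on the PATH.

```python
def bwa_mem_command(reference, reads1, reads2=None, threads=1):
    """Build the argument list for `bwa mem`; reads2 switches to paired-end mode."""
    cmd = ["bwa", "mem", "-t", str(threads), reference, reads1]
    if reads2 is not None:
        cmd.append(reads2)
    return cmd
```

The list can be passed to `subprocess.run(cmd, stdout=out_file, check=True)` with `out_file` opened for writing, for example to produce `aln.sam` from `ref.fa`, `read1.fq` and `read2.fq`.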
SAM format - spec
SAM format - example
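Each SAM alignment line has eleven mandatory tab-separated fields followed by optional tags. The sketch below splits one line into a dictionary. The record in the usage note is a made-up illustration, not output from the slides.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record = {name: (int(value) if name in INT_FIELDS else value)
              for name, value in zip(SAM_FIELDS, cols)}
    record["TAGS"] = cols[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record
```

For instance, parsing `"read1\t99\tchr1\t1000\t60\t4M\t=\t1200\t204\tACGT\tIIII\tNM:i:0"` yields POS 1000, MAPQ 60, and a FLAG of 99 whose 0x1 bit marks the read as paired.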
Discussion
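• For reads that are clearly long (100bp or more), the MEM mode is mainly used; for short reads of 100bp or less, aln is recommended.
• Because of seed extension and local alignment, results can contain alignments that are split into unnecessarily many pieces; options need to be tuned to correct or post-process such results.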