SNAP: Fast, accurate sequence alignment enabling biological applications
DESCRIPTION
SNAP: Fast, accurate sequence alignment enabling biological applications. Ravi Pandya, Microsoft Research, ASHG 10/19/2014.
TRANSCRIPT
![Page 1: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/1.jpg)
SNAP: Fast, accurate sequence alignment enabling biological applications
Ravi Pandya, Microsoft Research
ASHG 10/19/2014
![Page 2: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/2.jpg)
SNAP
SNAP is fast*
- Align 50x genome in 1.2 hours (BWA-MEM = 11.75 hours)
- Sort + index + markdup BAM in 2 hours (samtools + sambamba = 4.25 hours)
SNAP is as accurate as BWA-MEM, Bowtie2, etc.
- ROC on simulated data
- % aligned on real data
- Variant calls on real data
\* NA12878: ERR194147, Azure D14 (16 cores, 112GB RAM, 800GB SSD)
![Page 3: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/3.jpg)
Sequence alignment
The problem:
- Given a read R and a reference genome G
- Find the position p in G that minimizes EditDistance(R, G[p .. p + |R|])
SNAP solves this quickly and accurately because of:
- Efficient system architecture
- Reducing the number of comparisons
- Reducing the cost of comparisons
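The problem statement above can be written down directly as a brute-force search. The sketch below is only a baseline for illustration, not SNAP's method: it scores every window of G with a plain dynamic-programming edit distance, which is exactly the cost the later slides work to avoid. The function names `edit_distance` and `best_position` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance in O(len(a) * len(b)) time, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def best_position(read: str, genome: str) -> tuple:
    """Return (p, distance) minimizing EditDistance(read, genome[p : p + len(read)]).

    Ties are broken in favor of the leftmost position.
    """
    best_p, best_d = -1, len(read) + 1
    for p in range(len(genome) - len(read) + 1):
        d = edit_distance(read, genome[p:p + len(read)])
        if d < best_d:
            best_p, best_d = p, d
    return best_p, best_d


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCA"
    print(best_position("TTGAC", genome))  # exact match
    print(best_position("TTGTC", genome))  # one substitution
```

For a 3 Gbp genome and 100 bp reads this is hopeless (billions of windows per read), which motivates the seed index and bounded scoring described on the following slides.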
![Page 4: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/4.jpg)
System architecture
[Diagram labels: full, empty, async read, align, sort, async write, temp file, merge sort, mark duplicates, index, compress]
![Page 5: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/5.jpg)
The sequence alignment problem
The easy part:
- 97% of 20-mers in the human genome occur only once, but at only 75% of locations
The hard part:
- The other 3% of 20-mers and 25% of locations
[Figure: CDF of per-read/pair alignment time, NA18705, 169M pairs (using deeper search parameters than current defaults). Axis: 0% to 100%. Series: Single Equally Weighted, Paired Equally Weighted, Single Time Weighted, Paired Time Weighted. Annotation: 10% of reads, 95% of time.]
Bill Bolosky, MSR
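The two percentages on this slide measure different things: the share of distinct k-mers that occur once, and the share of genome locations covered by those k-mers. A minimal sketch of how both could be computed on a small string follows; `kmer_uniqueness` is a hypothetical helper, and counting every 20-mer of a human genome this way would need far more memory than this toy approach suggests.

```python
from collections import Counter


def kmer_uniqueness(genome: str, k: int) -> tuple:
    """Return (fraction of distinct k-mers occurring once,
               fraction of genome locations whose k-mer occurs once)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    total_locations = sum(counts.values())
    return unique / len(counts), unique / total_locations


if __name__ == "__main__":
    by_kmer, by_location = kmer_uniqueness("AAAAACGT", 2)
    print(f"{by_kmer:.0%} of distinct 2-mers are unique, "
          f"covering {by_location:.0%} of locations")
```

A repeat-rich sequence shows the same gap as the slide: a few heavily repeated k-mers are a small share of distinct k-mers but a large share of locations.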
![Page 6: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/6.jpg)
Hash table lookup
Build a multi-valued map (~30GB for hg19) from all seeds S in G to all locations of S in G
- For all seeds in read, all locations of seed in genome: score implied alignment of read, keep the best (330 reads/s)
- Ignore frequent seeds (>300 occurrences), only use a few seeds/read (14k reads/s, 42x)
Bill Bolosky, MSR
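The lookup described on this slide can be sketched with an ordinary Python dictionary. This is an illustration under stated assumptions, not SNAP's hash table: seeds are taken at non-overlapping offsets, and the names `build_index`, `candidate_locations`, `max_hits`, and `max_seeds` are hypothetical. Only the threshold of 300 occurrences comes from the slide.

```python
from collections import defaultdict


def build_index(genome: str, seed_len: int) -> dict:
    """Multi-valued map: seed string -> list of every position where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return index


def candidate_locations(read, index, seed_len, max_hits=300, max_seeds=None):
    """Map each implied read start position to the number of seeds that hit it.

    Seeds with more than max_hits occurrences are ignored, and at most
    max_seeds seeds are looked up (None means use all of them).
    """
    offsets = list(range(0, len(read) - seed_len + 1, seed_len))
    if max_seeds is not None:
        offsets = offsets[:max_seeds]
    votes = defaultdict(int)
    for off in offsets:
        hits = index.get(read[off:off + seed_len], ())
        if len(hits) > max_hits:
            continue  # frequent seed: too many locations to be worth scoring
        for pos in hits:
            start = pos - off  # where the read would begin if this hit is real
            if start >= 0:
                votes[start] += 1
    return dict(votes)


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCAGT"
    index = build_index(genome, 4)
    print(candidate_locations("TTGACCAG", index, 4))
```

Each candidate start position would then be scored against the read, and the best one kept. The seed-hit counts are also what a later step could sort on.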
![Page 7: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/7.jpg)
Fast scoring
- O(n²) → Ukkonen O(nd), n = len, d = min(limit, actual); use limit = best score so far + 2 (for MAPQ): 92k reads/s (6.6x)
- Sort candidates by # of seed hits, and skip locations with # seed misses > limit: 113k reads/s (1.2x), then 154k reads/s (1.4x; 470x overall)
Bill Bolosky, MSR
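The idea of bounding the edit distance computation by a limit can be illustrated with a banded dynamic program. This is a simpler relative of the O(nd) algorithm named on the slide, not SNAP's implementation: it evaluates only cells within `limit` of the main diagonal and gives up as soon as the whole band exceeds the limit. The names `bounded_edit_distance` and `best_candidate` are hypothetical; the "best score so far + 2" rule is taken from the slide.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, limit: int) -> Optional[int]:
    """Return the edit distance of a and b if it is <= limit, else None.

    Evaluates only the O(len(a) * limit) cells with |i - j| <= limit
    (rows are allocated at full width to keep the sketch short).
    Assumes limit >= 0.
    """
    n, m = len(a), len(b)
    if abs(n - m) > limit:
        return None
    inf = limit + 1  # stands for "more than limit"
    prev = [j if j <= limit else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        lo, hi = max(0, i - limit), min(m, i + limit)
        curr = [inf] * (m + 1)
        if lo == 0:
            curr[0] = i
        for j in range(max(1, lo), hi + 1):
            curr[j] = min(prev[j] + 1,
                          curr[j - 1] + 1,
                          prev[j - 1] + (a[i - 1] != b[j - 1]),
                          inf)
        if min(curr[lo:hi + 1]) > limit:
            return None  # no alignment within the limit can pass through this row
        prev = curr
    return prev[m] if prev[m] <= limit else None


def best_candidate(read, genome, candidates, max_distance):
    """Score candidate start positions, tightening the limit as better hits appear."""
    best = None  # (distance, position)
    limit = max_distance
    for pos in candidates:
        d = bounded_edit_distance(read, genome[pos:pos + len(read)], limit)
        if d is not None and (best is None or d < best[0]):
            best = (d, pos)
            limit = min(max_distance, d + 2)  # best score so far + 2, per the slide
    return best


if __name__ == "__main__":
    genome = "ACGTACGTTTGACCAGT"
    print(best_candidate("TTGTCCAG", genome, [0, 8, 4], max_distance=4))
```

Once a good alignment has been found, most remaining candidates are rejected after a few rows, so the cost per comparison drops as well as the number of comparisons.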
![Page 8: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/8.jpg)
Paired-end alignment
Find & score candidate location pairs
- C(R1:R2) = C(R1) ∩ C(R2) {± insert size}
- Enumerate in O(h log n), where h = |C(R1) ∩ C(R2)| and n = |C(R1)| + |C(R2)|
- Increases accuracy by allowing a much higher limit on seed occurrences (e.g. 4k vs 300)
Bill Bolosky, MSR
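The "intersection within ± insert size" can be sketched with binary search over sorted candidate lists. This is a simplified variant that costs O(|C(R1)| log |C(R2)| + h), not the O(h log n) enumeration claimed on the slide, and it ignores strand and orientation. The name `candidate_pairs` and the parameter `max_insert` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def candidate_pairs(c1, c2, max_insert):
    """Return all (p1, p2) with p1 in c1, p2 in c2 and |p1 - p2| <= max_insert."""
    c2_sorted = sorted(c2)
    pairs = []
    for p1 in sorted(c1):
        lo = bisect_left(c2_sorted, p1 - max_insert)   # first p2 >= p1 - max_insert
        hi = bisect_right(c2_sorted, p1 + max_insert)  # one past last p2 <= p1 + max_insert
        pairs.extend((p1, p2) for p2 in c2_sorted[lo:hi])
    return pairs


if __name__ == "__main__":
    mate1 = [100, 5000, 90000]
    mate2 = [350, 5400, 40000]
    print(candidate_pairs(mate1, mate2, max_insert=500))
```

Because a candidate survives only when its mate also has a nearby candidate, each read can afford many more seed locations, which is consistent with the slide's higher occurrence limit for paired reads.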
![Page 9: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/9.jpg)
Results: simulated data
Mason-generated paired-end 100bp reads
![Page 10: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/10.jpg)
Results: real data
NA18507 (Illumina HiSeq 50x)
* AWS cr1.8xlarge (32 cores, 244GB RAM, 2x120GB SSD)
![Page 11: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/11.jpg)
Results: GATK variant calls
Broad GATK pipeline, curated NA12878 variant calls
![Page 12: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/12.jpg)
Results: NIST Genome-in-a-Bottle
Appistry GATK pipeline, GIAB highly confident calls
Longer seeds are much faster, similar precision/recall
[Chart; surviving value label: 11.75]
ERR194147*.fastq.gz, Azure D14 (16 cores, 112GB RAM, 800GB SSD)
![Page 13: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/13.jpg)
Results: NIST Genome-in-a-Bottle
Lower confidence calls (qual>20, 2 platforms)

Highly confident

| Aligner | Indel recall | Indel precision | SNP recall | SNP precision |
| --- | --- | --- | --- | --- |
| bwa-mem | 97.24% | 97.15% | 99.57% | 99.65% |
| snap-20 | 97.04% | 97.48% | 99.51% | 99.57% |
| snap-24 | 97.04% | 97.46% | 99.52% | 99.57% |
| snap-28 | 97.04% | 97.45% | 99.53% | 99.57% |
| snap-32 | 97.00% | 97.41% | 99.51% | 99.57% |

Lower confidence

| Aligner | Indel recall | Indel precision | SNP recall | SNP precision |
| --- | --- | --- | --- | --- |
| bwa-mem | 96.38% | 96.30% | 99.00% | 99.32% |
| snap-20 | 96.17% | 96.68% | 98.94% | 99.25% |
| snap-24 | 96.17% | 96.67% | 98.95% | 99.23% |
| snap-28 | 96.16% | 96.62% | 98.96% | 99.21% |
| snap-32 | 96.11% | 96.55% | 98.94% | 99.17% |

![Page 14: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/14.jpg)
Pathogen ID: SURPI (Charles Chiu, UCSF)
“This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete, Chiu said.”
![Page 15: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/15.jpg)
SURPI
SNAP enables SURPI with:
- Fast filtering mode
- 64-bit index for >40GB ntDB
- Secondary mapping output
Charles Chiu, UCSF
![Page 16: SNAP: Fast, accurate sequence alignment enabling biological applications](https://reader035.vdocument.in/reader035/viewer/2022062422/56813848550346895d9ff58a/html5/thumbnails/16.jpg)
Acknowledgements
Microsoft Research: Bill Bolosky, Ravi Pandya
UC San Francisco: Taylor Sittler
Broad Institute: Christopher Hartl
UC Berkeley AMPLab: Matei Zaharia, Kristal Curtis, Armando Fox, Scott Shenker, Ion Stoica, David Patterson
Binaries, source, documentation (Apache 2.0 licensed): http://snap.cs.berkeley.edu