Jan 2016, Fritz Sedlazeck: Mapping and SV calling from PacBio
GIAB workshop
Fritz Sedlazeck, CSHL, JHU
Previous meetings: Utilizing long reads
1. How to predict the breakpoints?
2. How to assess genotype?
3. Complex SVs?
1. Breakpoint prediction
• Over BWA-MEM alignments
  – First version had a bug…
• Redesigning Sniffles
  – Improved speed
  – Improved accuracy on noisy alignments
  – Improved read filtering -> reducing FDR
  – Optional realignment step
• Improved breakpoint accuracy
• Improving genotyping
Sniffles v01 error
Sniffles v02
Current limitations
• Linear: gap cost always the same
• Affine: separate penalties for opening and extending a gap
• Using a single gap cost model is considered state of the art
• Problem with PacBio/ONT: two different gap models required
  – Sequencing error: high number of 1 bp indels
  – Real indels: extending a gap is more likely than opening a new one
  – Sequencing error + repeats cause a single gap cost to fail even for real indels
AAAGAATTCAA-A-A-T-CA
vs.
AAAGAATTCAAAA----TCA
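To make the difference between the two gap models concrete, the following sketch scores the two example alignment rows under a linear and an affine gap cost. The function names and the penalty values (`per_base`, `gap_open`, `gap_extend`) are assumptions chosen for illustration. They are not taken from the talk or from any particular aligner.

```python
import re

def gap_lengths(aligned_row):
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_row)]

def linear_cost(lengths, per_base=1.0):
    """Linear model: every gap base costs the same (illustrative penalty)."""
    return sum(per_base * k for k in lengths)

def affine_cost(lengths, gap_open=2.0, gap_extend=0.5):
    """Affine model: the first base of a gap pays gap_open,
    every further base pays gap_extend (illustrative penalties)."""
    return sum(gap_open + gap_extend * (k - 1) for k in lengths)

scattered = "AAAGAATTCAA-A-A-T-CA"   # four separate 1 bp gaps
single    = "AAAGAATTCAAAA----TCA"   # one 4 bp gap

for name, row in (("scattered", scattered), ("single", single)):
    lengths = gap_lengths(row)
    print(name, lengths, linear_cost(lengths), affine_cost(lengths))
```

Under the linear model both rows cost the same, so it cannot tell the two alignments apart. Under the affine model the single 4 bp gap is cheaper than four 1 bp gaps. That suits a real indel but works against reads whose sequencing errors really are many isolated 1 bp indels.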
Convex gap costs
• Costs for a gap follow a convex function of gap length
• Close to linear gap costs for 1 - 2 bp gaps
• As the gap gets longer, the penalty for "splitting" gaps increases
• Problem: the optimal approach is O(nm^2 + n^2m)
• Heuristic implementation: O(nm)
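One way to picture such a cost function is a per-base extension penalty that decays as the gap grows, so each additional gap base costs a little less than the previous one, down to a floor. The sketch below is a minimal illustration under that assumption. The function names and the parameters `ext_max`, `ext_min` and `decay` are hypothetical and are not claimed to be the NGM-LR implementation or its defaults.

```python
def convex_gap_cost(k, ext_max=5.0, ext_min=1.0, decay=0.15):
    """Cost of one gap of length k.

    Gap base i (0-based) costs ext_max - decay * i, but never
    less than ext_min. All parameter values are illustrative.
    """
    return sum(max(ext_min, ext_max - decay * i) for i in range(k))

def split_penalty(k):
    """Extra cost of representing one gap of length k as two gaps
    of roughly half the length each."""
    half = k // 2
    return convex_gap_cost(half) + convex_gap_cost(k - half) - convex_gap_cost(k)

for k in (2, 10, 50):
    print(k, round(convex_gap_cost(k), 2), round(split_penalty(k), 2))
```

With these values a 2 bp gap costs almost the same as two 1 bp gaps, which keeps the model close to linear for the short indels typical of sequencing error. Splitting a 50 bp gap in two is far more expensive than keeping it whole, which favors reporting a long real indel as a single event.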
NGM-LR workflow
NGM-LR reconcile
Read within inversion
Read within duplication
Deletion
Deletion
Insertions
Inversions
Translocations
Nested SV (SKBR3)
Outlook
• Finish new version of Sniffles
  – Assessment of noisy alignments
• NGM-LR:
  – MQ calculation
  – Runtime
• Visual inspection and comparison of SV calls