Variant (SNPs/Indels) calling in DNA sequences, Part 2

The Queensland Brain Institute | Variant calling for disease association (2/2) Searching the haystack 6/22/22 [www.absolutefab.com]

Upload: denis-bauer

Post on 10-May-2015


DESCRIPTION

Abstract: This session will focus on the steps involved in identifying genomic variants after an initial mapping has been achieved: improving the mapping, SNP and indel calling, and variant filtering/recalibration will be introduced.

TRANSCRIPT

Page 1: Variant (SNPs/Indels) calling in DNA sequences, Part 2

The Queensland Brain Institute |

Variant calling for disease association (2/2): Searching the haystack

April 11, 2023

[www.absolutefab.com]

Page 2: Variant (SNPs/Indels) calling in DNA sequences, Part 2

The Queensland Brain Institute | April 11, 2023

Quick recap: DNA sequence read mapping

• Sequencing->FASTQ->alignment to reference genome

• Resulting file type: BAM

• Visualized in Genome Viewer

• “What genomic regions were sequenced?”

[Figure: workflow stages: Projects, Quality Control, Fastq, Bam]

Page 3: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Production Informatics and Bioinformatics

• Basic Production Informatics: produce raw sequence reads. Product: fastq. Time: 5 days

• Advanced Production Informatics: map to genome and generate raw genomic features (e.g. SNPs). Product: bam, vcf,… Time: 3 weeks

• Bioinformatics Research: analyze the data; uncover the biological meaning. Product: paper. Time: >6 months

Times are per one-flowcell project.

Page 4: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Good mapping is crucial

• Mapping tools compromise accuracy for speed: approximate mapping.

• Identifying exactly where the reads map is the foundation for all subsequent analyses.

• The exact alignment of each read is especially important for variant calling.

by neilalderney123

Page 5: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Mapping challenges

• Incorrect mapping
  – Amongst 3 billion bp (human) a 100-mer can occur by chance

• Multi-mappers
  – The genome has non-unique regions (e.g. repeats), so one read can map to multiple sites

• Duplicates
  – PCR duplicates can introduce artifacts.

Turner DJ, Keane TM, Sudbery I, Adams DJ. Next-generation sequencing of vertebrate experimental organisms. Mamm Genome. 2009 Jun;20(6):327-38. PMID: 19452216

[Figure: duplicate reads supporting a mismatch]
Reference: ACGATATTACACGTACACTCAAGTCGTTCGGAACCT
Reads: TTACACGTACA, TACACGTACAC, ACACGTACACT, CACGTACTCTC, CACGTACTCTC, CACGTACTCTC, CACGTACTCTC

[Figure legend: Streptococcus suis (squares), Mus musculus (triangles)]
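
The effect of duplicates can be sketched with the reference and reads from the slide. This is a minimal illustration, not how production tools work: reads are placed by gapless best match, and duplicates are identified by identical start and sequence (both simplifications; real duplicate marking works on mapping coordinates in the BAM file). The function names are hypothetical.

```python
from collections import Counter

REFERENCE = "ACGATATTACACGTACACTCAAGTCGTTCGGAACCT"
READS = [
    "TTACACGTACA", "TACACGTACAC", "ACACGTACACT",
    "CACGTACTCTC", "CACGTACTCTC", "CACGTACTCTC", "CACGTACTCTC",
]

def best_offset(read, ref):
    """Return the gapless start position with the fewest mismatches."""
    return min(
        range(len(ref) - len(read) + 1),
        key=lambda i: sum(a != b for a, b in zip(read, ref[i:i + len(read)])),
    )

def mismatch_support(reads, ref):
    """Count, per reference position, how many reads disagree with the reference."""
    support = Counter()
    for read in reads:
        start = best_offset(read, ref)
        for k, base in enumerate(read):
            if base != ref[start + k]:
                support[start + k] += 1
    return support

def remove_duplicates(reads, ref):
    """Keep one read per (start, sequence) pair: a crude stand-in for duplicate marking."""
    seen, kept = set(), []
    for read in reads:
        key = (best_offset(read, ref), read)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept

with_dups = mismatch_support(READS, REFERENCE)
without_dups = mismatch_support(remove_duplicates(READS, REFERENCE), REFERENCE)
print(with_dups[16], "of", len(READS), "reads support the mismatch before de-duplication")
print(without_dups[16], "of", len(remove_duplicates(READS, REFERENCE)), "after")
```

With the duplicates included, the single mismatching fragment looks like a well-supported variant; after collapsing them it is one read against three.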

Page 6: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Methods for ensuring a good alignment

• Biological: Using paired end reads to increase coverage

• Bioinformatically:
  – Local realignment
  – Base pair quality score recalibration

[Figure: paired-end reads ~200 bp apart; one mate (?) falls in a repeat region]

Page 7: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Local Realignment (GATK)

• Local realignment of all reads at a specific location simultaneously to minimize mismatches to the reference genome

• Reduces erroneous SNPs and refines the location of indels
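
The idea of judging all reads at a locus together can be sketched as follows. This is a toy illustration and not the GATK algorithm: the reference, the candidate deletion, and the reads are made up, and reads are placed without gaps on each haplotype. The haplotype that explains all reads with the fewest total mismatches wins.

```python
def min_mismatches(read, haplotype):
    """Fewest mismatches over all gapless placements of the read on the haplotype."""
    return min(
        sum(a != b for a, b in zip(read, haplotype[i:i + len(read)]))
        for i in range(len(haplotype) - len(read) + 1)
    )

def total_mismatches(reads, haplotype):
    """Sum of per-read best mismatch counts: the quantity realignment tries to minimize."""
    return sum(min_mismatches(read, haplotype) for read in reads)

# Hypothetical data: the sample carries a 4 bp deletion ("CTTC") relative to the reference.
REFERENCE = "GATTACAGGCTTCAAGTCCGATAGC"
CANDIDATE = REFERENCE[:9] + REFERENCE[13:]  # reference with the deletion applied
READS = ["TTACAGGAA", "ACAGGAAGT", "CAGGAAGTC"]

scores = {
    "reference": total_mismatches(READS, REFERENCE),
    "deletion": total_mismatches(READS, CANDIDATE),
}
best = min(scores, key=scores.get)
print(scores, "->", best)
```

Read by read, each mismatch against the reference could pass for a SNP; considered jointly, one deletion explains all of them without any mismatch.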

[Figure: reads before (original) and after (realigned) local realignment]

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

QBI data

Page 8: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Quality score recalibration (GATK)

• PHRED scores are predicted

• Looking at all reads at a specific location allows a better estimate of the base pair quality score.
  – Excludes all known dbSNP sites
  – Assume all other mismatches are sequencing errors
  – Compute a new calibration table based on mismatch rates per position on the read

• Important for variant calling
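
The three steps above can be sketched as a small calibration table keyed on the position in the read. This is a simplified illustration under stated assumptions, not the GATK implementation: the input tuple layout, the function names, and the +1/+2 smoothing (used to avoid taking the logarithm of zero) are all hypothetical, and only one covariate is modelled.

```python
import math

def phred(error_rate):
    """Convert an error probability into a Phred-scaled quality."""
    return -10 * math.log10(error_rate)

def recalibration_table(observations, known_sites):
    """Build {read_cycle: empirical quality}.

    observations: iterable of (genome_position, read_cycle, is_mismatch)
    known_sites:  set of genome positions with known variation (e.g. dbSNP)
    """
    counts = {}
    for position, cycle, is_mismatch in observations:
        if position in known_sites:
            continue  # known variant sites are excluded
        mismatches, total = counts.get(cycle, (0, 0))
        # every remaining mismatch is treated as a sequencing error
        counts[cycle] = (mismatches + int(is_mismatch), total + 1)
    # smoothed mismatch rate per cycle, converted to a Phred score
    return {c: phred((m + 1) / (t + 2)) for c, (m, t) in counts.items()}

# Hypothetical input: 998 bases at read cycle 0, 9 of them mismatching,
# plus one mismatch at a known variant site that must not count as an error.
observations = [(i, 0, i < 9) for i in range(998)] + [(5000, 0, True)]
table = recalibration_table(observations, known_sites={5000})
print(round(table[0], 2))
```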

Thomas Keane 9th European Conference on Computational Biology 26th September, 2010

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Page 9: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Recalibration of quality score

All bases are called with Q25

In reality not all are that good: “bases actually mismatch the reference at a 1 in 100 rate, so are actually Q20” (GATK)
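
The arithmetic behind this statement follows from the standard Phred definition, where p is the probability that a base call is wrong:

```latex
\begin{align*}
Q &= -10 \log_{10} p
  &&\Longleftrightarrow & p &= 10^{-Q/10} \\
p &= \tfrac{1}{100}
  &&\Longrightarrow & Q &= -10 \log_{10}\!\left(10^{-2}\right) = 20 \\
Q &= 25
  &&\Longrightarrow & p &= 10^{-2.5} \approx 0.0032 \approx \tfrac{1}{316}
\end{align*}
```

A reported Q25 therefore claims roughly one error in 316 bases, while the observed rate of one in 100 corresponds to Q20.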

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Page 10: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Variant calling methods

• > 15 different algorithms

• Three categories
  – Allele counting
  – Probabilistic methods, e.g. Bayesian model
    • to quantify statistical uncertainty
    • Assign priors, e.g. by taking the observed allele frequency of multiple samples into account
  – Incorporating linkage disequilibrium (LD)
    • Specifically helpful for low coverage and common variants
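
A probabilistic caller can be sketched with a binomial read model and a prior on the genotypes. This is a minimal sketch under stated assumptions, not any specific tool's model: the per-base error rate, the Hardy-Weinberg prior, and the population allele frequency are illustrative values, and the function name is hypothetical.

```python
import math

def genotype_posteriors(n_ref, n_alt, error_rate=0.01, alt_allele_freq=0.1):
    """Posterior probability of genotypes 0/0, 0/1 and 1/1 given allele counts."""
    # Probability that a single read shows the ALT allele under each genotype
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    # Hardy-Weinberg prior from an assumed population ALT allele frequency
    f = alt_allele_freq
    prior = {"0/0": (1 - f) ** 2, "0/1": 2 * f * (1 - f), "1/1": f ** 2}
    # Log posterior up to a constant (the binomial coefficient cancels)
    log_post = {
        g: math.log(prior[g]) + n_alt * math.log(p_alt[g]) + n_ref * math.log(1 - p_alt[g])
        for g in p_alt
    }
    top = max(log_post.values())
    unnormalized = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(unnormalized.values())
    return {g: u / total for g, u in unnormalized.items()}

posteriors = genotype_posteriors(n_ref=173, n_alt=141)
print(max(posteriors, key=posteriors.get))
```

Unlike plain allele counting, the output is a probability for each genotype, so the uncertainty of a call at low coverage is explicit.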

Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011 Jun;12(6):443-51. PMID: 21587300.

http://seqanswers.com/wiki/Software/list

[Figure: SNP variant example. Ref: A; Ind1: G/G; Ind2: A/G]

Page 11: Variant (SNPs/Indels) calling in DNA sequences, Part 2


VCF format

• Individual statistics
  – GT – genotype – 0/1
  – AD – total number of REF/ALT seen – 173 T, 141 G
  – DP – depth MAPQ > 17 – 282
  – GQ – Genotype Quality – 99
  – PL – genotype likelihood – 0/0: 10^-25.5 = unlikely, 0/1: 10^0 = likely, and 1/1: 10^-25.5 = unlikely

• Location statistics, e.g.
  – Strand bias
  – How many reads have a deletion at this site

[HEADER LINES]
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255
chr1 877664 rs3828047 A G 3931.66 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 1/1:0,105:94:99:255,255,0
chr1 899282 rs28548431 C T 71.77 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:1,3:4:25.92:103,0,26
chr1 974165 rs9442391 T C 29.84 LowQual [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:14,4:14:60.91:61,0,255
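
Reading the per-sample fields can be sketched as below, using the first record of the example. The FORMAT column names the keys and the sample column carries the values in the same order. The helper name is hypothetical, and the record is split on whitespace because the example is shown that way (real VCF files are tab-separated).

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with the sample's values and convert the common fields."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    ref_count, alt_count = (int(x) for x in values["AD"].split(","))
    phred_likelihoods = [int(x) for x in values["PL"].split(",")]
    return {
        "GT": values["GT"],
        "AD": (ref_count, alt_count),
        "DP": int(values["DP"]),
        "GQ": float(values["GQ"]),
        "PL": phred_likelihoods,
        # PL is Phred-scaled: likelihood = 10^(-PL/10), for 0/0, 0/1 and 1/1
        "likelihoods": [10 ** (-pl / 10) for pl in phred_likelihoods],
    }

record = "chr1 873762 . T G 5231.78 PASS [ANNOTATIONS] GT:AD:DP:GQ:PL 0/1:173,141:282:99:255,0,255"
fields = record.split()
sample = parse_sample(fields[8], fields[9])
print(sample["GT"], sample["AD"], sample["DP"], sample["likelihoods"])
```

A PL of 0 marks the most likely genotype, and a PL of 255 corresponds to a relative likelihood of 10^-25.5.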

Page 12: Variant (SNPs/Indels) calling in DNA sequences, Part 2


When to call a variant

• ???: REF: 77% ALT: 23%

• Het: REF: 50% ALT: 50%

• Hom: REF: 0% ALT: 100%

[Figures: QBI data]

Page 13: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Hard Filtering

• Reducing false positives by e.g. requiring
  – Sufficient depth
  – Variant to be in >30% reads
  – High quality
  – Strand balance
  – …

• Subjective and dangerous in this high dimensional search space
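
A hard filter of this kind might look as follows. Only the 30% read fraction comes from the list above; the depth, quality, and strand thresholds are hypothetical defaults chosen for illustration, which is exactly the subjectivity the slide warns about.

```python
def passes_hard_filter(ref_fwd, ref_rev, alt_fwd, alt_rev, qual,
                       min_depth=10, min_alt_fraction=0.30,
                       min_qual=30.0, min_strand_fraction=0.10):
    """Return True if a candidate variant survives all hard thresholds.

    ref_fwd/ref_rev/alt_fwd/alt_rev are read counts per allele and strand.
    All thresholds except min_alt_fraction are illustrative assumptions.
    """
    depth = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt = alt_fwd + alt_rev
    if depth == 0 or depth < min_depth:
        return False  # insufficient depth
    if alt / depth <= min_alt_fraction:
        return False  # variant must be in more than 30% of reads
    if qual < min_qual:
        return False  # low variant quality
    if min(alt_fwd, alt_rev) / alt < min_strand_fraction:
        return False  # ALT reads come almost entirely from one strand
    return True

print(passes_hard_filter(40, 45, 35, 30, qual=500.0))  # balanced, deep, high quality
print(passes_hard_filter(10, 10, 20, 0, qual=500.0))   # ALT seen on one strand only
```

Each threshold cuts independently along one axis, so a true variant that is slightly off on a single criterion is discarded regardless of how strong the other evidence is.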

Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008). Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).

[Figure: strand bias example (QBI data)]

Page 14: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Gaussian mixture model

• Train on trusted variants and require the new variants to live in the same hyperspace

• Potential problem: Overfitting and biasing to features of known SNPs !!!
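
The principle can be sketched in one dimension: fit a two-component Gaussian mixture to a single annotation of trusted variants with EM, then score new variants by their log-density under that model. This is a deliberately reduced illustration, not the GATK recalibration model, which works on several annotations at once. The annotation values, the initialization, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(values, iterations=50, min_var=1e-3):
    """Fit a two-component 1-D Gaussian mixture with expectation-maximization."""
    means = [min(values), max(values)]  # simple deterministic initialization
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, min_var)  # floor avoids a collapsing component
    return weights, means, variances

def log_density(x, model):
    """Log-density of x under the fitted mixture; higher means more like the trusted set."""
    weights, means, variances = model
    return math.log(sum(w * normal_pdf(x, m, v)
                        for w, m, v in zip(weights, means, variances)) + 1e-300)

# Hypothetical annotation values of trusted variants, forming two clusters
trusted = [18.0, 19.5, 20.0, 20.5, 21.0, 22.0, 29.0, 30.0, 30.5, 31.0, 32.0]
model = fit_two_gaussians(trusted)
for candidate in (20.0, 25.0, 40.0):
    print(candidate, round(log_density(candidate, model), 2))
```

The sketch also makes the caveat concrete: a real variant whose annotations differ from those of the trusted set scores poorly simply because nothing like it was in the training data.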

Page 15: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Indel calling

• First local realignment might not be sufficient to confidently determine the beginning and end of indels

• Dindel algorithm
  – Local realignment for every indel candidate

Albers CA, Lunter G, Macarthur DG, McVean G, Ouwehand WH, Durbin R. Dindel: Accurate indel calls from short-read data. Genome Res. 2011 Jun;21(6):961-73. PMID: 20980555.

Page 16: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Recap

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Page 17: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Outcome: How many variants will I find ?

DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. PMID: 21478889

Hiseq: whole genome; mean coverage 60; 101PE; (NA12878)
Exome: agilent capture; mean coverage 20; 76/101PE; (NA12878)

Page 18: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Three things to remember

1. Getting the mapping right is critical
2. Variant calling is not merely counting the differences
3. Just listing the variants does not tell you anything biologically relevant.

by Яick Harris

Fernald GH, Capriotti E, Daneshjou R, Karczewski KJ, Altman RB. Bioinformatics challenges for personalized medicine. Bioinformatics. 2011 Jul 1;27(13):1741-8. PMID: 21596790

Page 19: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Next week:

Abstract: This seminar aims at answering the question of what to make of the identified variants, specifically how to evaluate the quality, prioritize and functionally annotate the variants.

Page 20: Variant (SNPs/Indels) calling in DNA sequences, Part 2


Walk-in-clinic