Annotation of variants
Variant Detection and Interpretation in a Diagnostic Context
K. Joeri van der Velde, PhD
(bioinformatics, research, diagnostics)
Genomics Coordination Center
Dept. of Genetics, UMC Groningen
The Netherlands
VKGL / VKGN course: “NGS in DNA diagnostics”
Tuesday September 4th 2018
CQ-F1, Erasmus MC, Rotterdam
http://rnasilencing.files.wordpress.com/2010/11/caduceus-with-dna-helix4.jpg
Data formats: FASTQ → BAM → VCF (e.g. G>T, A>G, T>CA)
Genome diagnostics in a nutshell
P: Pathogenic
B: Benign
V: Variant of unknown significance
Sample prep & track WGS/WES/Targeted
Quality control
Read mapping
Call variants and genotypes
Variant annotation and interpretation
Counsel patient
Counsel patient, consent
Draw sample
This talk
How many variants?
1 patient, WES
~100,000 variants
...which one is causal?
Some keywords
Filtering: narrow down to candidates
Prioritization: rank from most likely to least likely
Identification: find the pathogenic variant (P) among the benign ones (B)
Filter & prioritize… on what?
We must first annotate the variants with additional context, then use this information to filter & prioritize.
CHR POS REF ALT + GoNL + 1000G + ClinVar
2 179575511 C T .23 .21
2 179577870 T C
2 179578704 G A .01 Pathogenic
2 179578730 G A Benign
2 179578891 T C .001
2 179579093 T C Pathogenic
2 179579212 T C .02
2 179579694 T A
2 179579822 T A 0.02 Benign
2 179579977 G A .25 .3
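Annotation amounts to a lookup keyed on CHR, POS, REF and ALT, followed by a filter on the attached values. A minimal Python sketch of that idea, assuming small in-memory lookup tables; the function names and the 5% cutoff are illustrative assumptions, and only slide rows with an unambiguous column are used as data.

```python
from typing import Dict, List, Tuple

Variant = Tuple[str, int, str, str]  # (CHR, POS, REF, ALT)

# Lookup tables standing in for real annotation sources (assumption:
# real sources would be queried from files or databases instead).
GONL_AF: Dict[Variant, float] = {
    ("2", 179575511, "C", "T"): 0.23,
    ("2", 179579977, "G", "A"): 0.25,
}
CLINVAR: Dict[Variant, str] = {
    ("2", 179578704, "G", "A"): "Pathogenic",
    ("2", 179578730, "G", "A"): "Benign",
}


def annotate(variant: Variant) -> dict:
    """Attach population frequency and prior classification to one variant."""
    return {
        "variant": variant,
        "gonl_af": GONL_AF.get(variant),  # None: not seen in the population
        "clinvar": CLINVAR.get(variant),  # None: never assessed
    }


def is_candidate(row: dict, max_af: float = 0.05) -> bool:
    """Keep variants that are neither common nor already known to be benign."""
    common = row["gonl_af"] is not None and row["gonl_af"] > max_af
    return not common and row["clinvar"] != "Benign"


variants: List[Variant] = [
    ("2", 179575511, "C", "T"),
    ("2", 179577870, "T", "C"),
    ("2", 179578704, "G", "A"),
    ("2", 179578730, "G", "A"),
]
candidates = [v for v in variants if is_candidate(annotate(v))]
print(candidates)
```

Variants without any annotation are kept on purpose: absence from a database is not evidence of being benign.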
Many different types of annotations
1. Gene panels
2. Population alleles
3. Protein impact
4. Assessed variants
5. Inheritance match
6. Phenotype match
7. Deleteriousness
For each: What? How? + pitfalls
1. Gene panels: in the genome
The human genome: 3,000,000,000 base pairs
50% repetitive / transposons
20-40% regulatory
10% intron & UTR
1% pseudogenes (~15,000) & ncRNA (~25,000)
1% exome (~20,000 genes)
0.2% ‘Clinome’ (~3,600 genes): known Mendelian disorder genes, may be clinically actionable
0.002% a ‘gene panel’ (1 to ~100 genes): genes specific for a disorder or spectrum
1. Gene panels: in practice
Jongbloed (2011) Expert Opin Med Diagn 5:9-24
Clinicians: “Which genes am I allowed to look at?”
Researchers: “Which genes am I interested in?”
Pitfalls:
❖ Genes can be shared across panels
❖ Panels for same disorder differ per lab
❖ Tunnel vision: this must be it
❖ Panels often updated with latest genes
1. Gene panels: clinical “superpanel”
Special highlight:
>3,600 monogenic disease genes, actionable to a degree
2. Population alleles: databases
Population variation reference databases: bigger = better
Database  Individuals       Sequencing type  Variant info
GoNL      ~500 Dutch        WGS              AF, GTC, genotypes (by request)
1000G     ~2500 various     WGS              AF, genotypes
ExAC      ~60,000 various   WES              AF, GTC
ESP6500   ~6,500 various    WES              AF, GTC
gnomAD    ~140,000 various  WES and WGS      AF, GTC
2. Population alleles: practical example
❖ Check how often an alternative allele is observed in the general population
ExAC contains ~60,000 individuals, i.e. ~120,000 alleles
With 10 heterozygous and 2 homozygous carriers: AC = 10 + 2×2 = 14
Therefore: AF = 14/120,000 = 0.000117 ≈ 0.01%
❖ Allele frequency filtering threshold? It depends:
Common or rare disorder? Dominant or recessive?
Founder mutation? Consanguinity?
Thresholds in use range from 0.01% to 5%
Pitfalls:
❖ Assumptions of disorder rarity and penetrance
❖ Population may not be representative of patient ethnicity
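The allele frequency arithmetic from the ExAC example can be written out as a short Python sketch. It assumes diploid (autosomal) sites, so each individual contributes two alleles; the function names are illustrative.

```python
def allele_frequency(n_het: int, n_hom: int, n_individuals: int) -> float:
    """Allele frequency at a diploid site from genotype counts."""
    allele_count = n_het + 2 * n_hom    # AC: one allele per het, two per hom
    allele_number = 2 * n_individuals   # AN: two alleles per individual
    return allele_count / allele_number


def is_rare(af: float, threshold: float) -> bool:
    """True when the variant passes an allele frequency filter."""
    return af <= threshold


af = allele_frequency(n_het=10, n_hom=2, n_individuals=60_000)
print(f"AF = {af:.6f}")
```

The same variant passes a 5% filter but not a strict 0.01% filter, which is why the threshold has to follow from the disorder and inheritance model.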
2. Population alleles: false discovery
Gene  Variant class  Cases with  Cases without  Controls with  Controls without  Odds ratio (all)
DSP   Truncating     12          415            42             59570             41.01
❖ Disease genes usually enriched with pathogenic variants
❖ Pitfall: some (often newer) disease genes are NOT enriched at all!
Special highlight:
Do patients have more “pathogenic variants” than healthy individuals?
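The odds ratio in the DSP row follows directly from the four counts. A small Python sketch of the calculation, with an illustrative function name:

```python
def odds_ratio(cases_with: int, cases_without: int,
               controls_with: int, controls_without: int) -> float:
    """Odds of carrying a variant in cases divided by the odds in controls."""
    return (cases_with * controls_without) / (cases_without * controls_with)


dsp = odds_ratio(cases_with=12, cases_without=415,
                 controls_with=42, controls_without=59570)
print(round(dsp, 2))
```

An odds ratio far above 1 means the variant class is enriched in patients; a value near 1 is the warning sign mentioned in the pitfall.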
3. Protein impact
Effect Impact
STOP_GAINED HIGH
FRAME_SHIFT HIGH
NON_SYNONYMOUS_CODING MODERATE
CODON_INSERTION MODERATE
SYNONYMOUS_CODING LOW
NON_SYNONYMOUS_START LOW
INTRON MODIFIER
MICRO_RNA MODIFIER
+34 more effects
Pitfalls:
❖ Transcript definitions (there could be dozens!)
❖ Overlapping genes (need to deal with this)
❖ “Impact” very handy but may oversimplify
Predict the effect of a variant on the transcription of a gene
Easy to use filter: each effect falls in 1 of 4 impact severity categories
Tools: SnpEff, VEP & others
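Filtering on impact is a lookup from effect to severity category plus a rank comparison. A minimal Python sketch using only the eight effects listed on the slide; the function name, the default cutoff and the choice to keep unknown effects are assumptions.

```python
# Effect -> impact, as listed on the slide (the full list has 34 more effects).
IMPACT_BY_EFFECT = {
    "STOP_GAINED": "HIGH",
    "FRAME_SHIFT": "HIGH",
    "NON_SYNONYMOUS_CODING": "MODERATE",
    "CODON_INSERTION": "MODERATE",
    "SYNONYMOUS_CODING": "LOW",
    "NON_SYNONYMOUS_START": "LOW",
    "INTRON": "MODIFIER",
    "MICRO_RNA": "MODIFIER",
}
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def filter_by_impact(variants, minimum="MODERATE"):
    """Return ids of variants whose impact is at least `minimum`."""
    threshold = IMPACT_RANK[minimum]
    kept = []
    for variant_id, effect in variants:
        impact = IMPACT_BY_EFFECT.get(effect)
        # Effects missing from the table are kept so nothing is dropped silently.
        if impact is None or IMPACT_RANK[impact] >= threshold:
            kept.append(variant_id)
    return kept


annotated = [
    ("var_1", "STOP_GAINED"),
    ("var_2", "INTRON"),
    ("var_3", "NON_SYNONYMOUS_CODING"),
    ("var_4", "SYNONYMOUS_CODING"),
]
print(filter_by_impact(annotated))
```

A variant annotated against several transcripts would need one decision per transcript, which this sketch does not handle.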
4. Assessed variants: databases
“Has my variant been interpreted before?”
❖ LOVD instances ( http://grenada.lumc.nl/LSDB_list/lsdbs )
❖ The Human Gene Mutation Database ( www.hgmd.cf.ac.uk )
❖ HGVS list of LSDBs ( http://www.hgvs.org/locus-specific-mutation-databases )
❖ VKGL data sharing of Dutch clinical variants
   ❖ Via Cartagenia bench lab ( http://cartagenia.com )
   ❖ Joint national database ( www.molgenis.org/vkgl )
VKGL database current release: 97,801 variants
Lead: Marielle van Gijn (UMCU)
4. Assessed variants: ClinVar
❖ ClinVar ( www.ncbi.nlm.nih.gov/clinvar )
❖ Big & growing: 8,103 genes with 440,386 variants (previous years: 5,449 and 6,550 genes; 163,816 and 318,819 variants)
❖ ‘Star status’ based on submitter(s)
❖ Low confidence? OK, good to know!
Pitfalls:
❖ Many false positives in some databases (various papers on this)
❖ Legacy data (e.g. variants reported before 2000)
5. Inheritance match: CGD
“Could this heterozygous variant cause the disorder?”
Disorder inheritance mode Number of genes
Autosomal recessive (AR) ~1650
Autosomal dominant (AD) ~1000
AD or AR ~350
X-linked ~200
Other (digenic, blood group, Y-linked etc.) ~100
Pitfalls:
❖ CGD data is brilliant and easy to get, but not totally “clean”; beware when using these data in a script/program
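The question “could this heterozygous variant cause the disorder?” can be sketched as a check of zygosity against the inheritance mode of the gene. The Python below is a deliberately simplified assumption: it ignores phase, sex, de novo status and penetrance, and the mode labels are illustrative rather than the exact strings found in CGD.

```python
def fits_inheritance(mode: str, zygosity: str,
                     other_variants_in_gene: int = 0) -> bool:
    """Return True when the genotype is compatible with the inheritance mode."""
    if mode == "AR":
        # Recessive: homozygous, or heterozygous with a possible second hit
        # (compound heterozygosity; phase is not checked here).
        return zygosity == "hom" or other_variants_in_gene > 0
    # AD, AD or AR, X-linked and other modes are not excluded automatically.
    return True


print(fits_inheritance("AR", "het"))
print(fits_inheritance("AR", "het", other_variants_in_gene=1))
```

Because the source data is not fully clean, a real script would first normalize the inheritance labels and decide what to do with genes that have none.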
6. Phenotype match: when and why?
0.2% ‘Clinome’ (~3,500 genes)
❖ Filter the genome, based on patient symptoms
❖ How? We know many gene-disorder associations
TECPR2: Spastic paraplegia 49 (Dysarthria, Brachycephaly, Seizures, Broad neck, Areflexia)
GALNS: Mucopolysaccharidosis IVA (Cervical subluxation, Genu valgum, Hypoplasia of the odontoid process, Keratan sulfate excretion in urine, Metaphyseal widening)
Microcephaly, Dental crowding, Hypotonia, Central apnea, Gastroesophageal reflux disease: ?
….
https://guideimg.alibaba.com/images/shop/88/11/20/2/close-up-of-a-baby-with-an-ice-pack-on-his-head-poster-print-18-x-24_1882252.jpg
6. Phenotype match: not so easy
There are >100,000 gene-disease-symptom associations!
...
OMIM:615031 TECPR2 9895 HP:0001260 Dysarthria
OMIM:615031 TECPR2 9895 HP:0000248 Brachycephaly
OMIM:615031 TECPR2 9895 HP:0001250 Seizures
OMIM:615031 TECPR2 9895 HP:0001284 Areflexia
OMIM:615031 TECPR2 9895 HP:0000475 Broad neck
OMIM:253000 GALNS 2588 HP:0003308 Cervical subluxation
OMIM:253000 GALNS 2588 HP:0002857 Genu valgum
OMIM:253000 GALNS 2588 HP:0003311 Hypoplasia of the odontoid process
OMIM:253000 GALNS 2588 HP:0012069 Keratan sulfate excretion in urine
OMIM:253000 GALNS 2588 HP:0003016 Metaphyseal widening
...
ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt by HPO team
6. Phenotype match: how?
❖ Naive
   Considers terms separately
   Terms directly linked to diseases/genes
❖ Smart
   Considers terms as one collection
   Terms algorithmically matched to diseases/genes
6. Phenotype match: naive method
❖ Naive, using specific terms
Example:
"Keratan sulfate excretion in urine"
OMIM:253000 GALNS 2588 HP:0012069 Keratan sulfate excretion in urine
OMIM:253010 GLB1 2720 HP:0012069 Keratan sulfate excretion in urine
OMIM:601492 HYAL1 3373 HP:0012069 Keratan sulfate excretion in urine
"Cervical subluxation"
OMIM:253000 GALNS 2588 HP:0003308 Cervical subluxation
OMIM:253010 GLB1 2720 HP:0003308 Cervical subluxation
OMIM:607095 RMRP 6023 HP:0003308 Cervical subluxation
Prime candidates? GALNS, GLB1 (linked to both terms)
Also interesting? HYAL1, RMRP (linked to one term)
Pitfall: Chances of an exact match are very small
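The naive method is a set operation on the gene lists per term. A minimal Python sketch using the six rows from the example; it assumes the HPO file is tab-separated with the five columns shown (disease id, gene symbol, gene id, HPO id, term name).

```python
from collections import defaultdict

ROWS = [
    "OMIM:253000\tGALNS\t2588\tHP:0012069\tKeratan sulfate excretion in urine",
    "OMIM:253010\tGLB1\t2720\tHP:0012069\tKeratan sulfate excretion in urine",
    "OMIM:601492\tHYAL1\t3373\tHP:0012069\tKeratan sulfate excretion in urine",
    "OMIM:253000\tGALNS\t2588\tHP:0003308\tCervical subluxation",
    "OMIM:253010\tGLB1\t2720\tHP:0003308\tCervical subluxation",
    "OMIM:607095\tRMRP\t6023\tHP:0003308\tCervical subluxation",
]


def genes_per_term(rows):
    """Index gene symbols by the phenotype term they are linked to."""
    index = defaultdict(set)
    for row in rows:
        _disease_id, gene, _gene_id, _hpo_id, term = row.split("\t")
        index[term].add(gene)
    return index


index = genes_per_term(ROWS)
terms = ["Keratan sulfate excretion in urine", "Cervical subluxation"]
gene_sets = [index[term] for term in terms]

prime_candidates = set.intersection(*gene_sets)            # linked to every term
also_interesting = set.union(*gene_sets) - prime_candidates  # linked to some terms
print(sorted(prime_candidates), sorted(also_interesting))
```

A term that is spelled differently from the file yields an empty set, so the intersection becomes empty as well.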
6. Phenotype match: naive method
❖ Naive, using broad terms
Pitfall: too many genes to investigate!
Seizures: 1181 genes
Microcephaly: 644 genes
Muscular hypotonia: 928 genes
Hearing impairment: 1231 genes
Developmental delay: 639 genes
Short stature: 994 genes
Nystagmus: 706 genes
Cognitive impairment: 894 genes
Counts from a substring search on the July 2015 release of ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt by HPO team
6. Phenotype match: smart method
Input symptom set A
Symptoms of disease B
❖ Smart matching using any terms
❖ HPO symptoms = graph of terms
❖ “Semantic similarity” between collections of symptoms (Resnik, 1999)
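Resnik similarity scores two terms by the information content of their most informative shared ancestor, so related but non-identical symptoms still match. The Python sketch below uses a hypothetical six-term toy ontology and four made-up diseases; combining terms with a best-match average is one common choice and an assumption here, not necessarily what any of the tools does.

```python
import math

# Hypothetical toy ontology (child -> parents); the real HPO graph is far larger.
PARENTS = {
    "Phenotypic abnormality": [],
    "Nervous system abnormality": ["Phenotypic abnormality"],
    "Skeletal abnormality": ["Phenotypic abnormality"],
    "Seizures": ["Nervous system abnormality"],
    "Areflexia": ["Nervous system abnormality"],
    "Genu valgum": ["Skeletal abnormality"],
}
# Hypothetical disease annotations.
DISEASES = {
    "disease_1": ["Seizures", "Areflexia"],
    "disease_2": ["Genu valgum"],
    "disease_3": ["Seizures"],
    "disease_4": ["Genu valgum"],
}


def ancestors(term):
    """The term itself plus everything above it in the graph."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """-log of the fraction of diseases annotated with the term or a descendant.
    Assumes the term covers at least one disease."""
    hits = sum(any(term in ancestors(t) for t in terms)
               for terms in DISEASES.values())
    return -math.log(hits / len(DISEASES))


def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    return max(information_content(t)
               for t in ancestors(term_a) & ancestors(term_b))


def set_similarity(query, disease_terms):
    """Best-match average: each query term is scored against its best partner."""
    return sum(max(resnik(q, d) for d in disease_terms) for q in query) / len(query)


def rank_diseases(query):
    scores = {name: set_similarity(query, terms) for name, terms in DISEASES.items()}
    return sorted(scores, key=scores.get, reverse=True)


print(rank_diseases(["Areflexia"]))
```

In this toy example a query of only “Areflexia” still ranks disease_3 above the skeletal diseases, although disease_3 has no exact term in common with the query.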
6. Phenotype match: smart tools
Phenotips ( http://phenotips.org )
Comprehensive matching in nice GUI, including ‘negative’ symptoms and suggestions for differential diagnosis
6. Phenotype match: smart tools
AMELIE ( https://amelie.stanford.edu/ )
Enter symptoms and candidate genes, uses automated literature search to prioritize candidate genes
Why: truly very good performance for finding known disease genes (unpublished benchmark)
6. Phenotype match: smart tools
GADO ( https://genenetwork.nl/gado )
Enter symptoms and optional candidate genes, prioritizes any expressed gene based on massive RNA-seq gene network
Why: perfect for discovering new disease genes
6. Phenotype match: wrap-up
Pitfall:
❖ Not every ‘phenotype matching’ software is smart!
So be wary of how the matching is performed
For ‘smart’ software:
Enter as many symptoms as possible, as specific as possible;
this will only improve the match and your chances of success!
For ‘naive’ software:
Try a few specific terms first, but don’t get stuck. Broaden
your terms when needed, and don’t enter too many at once;
there is no point in getting thousands of hits.
7. Deleteriousness: traditional tools
Many conservation-based tools that predict how damaging a
variant will be:
SIFT, PolyPhen, PROVEAN, PON-P2, MutationAssessor, FATHMM-MKL, Condel, PhyloP, UMD-Pred, Grantham, DeepPVP, ENTPRISE, PHAST, FitCons, DANN, MutPred, GERP, Xvar, VAAST, AlignGVGD, MAPP, MutationTaster, Alamut, MVP, EIGEN, REVEL, DIVAN, DeepSEA, REMM, GWAVA, DeltaSVM, DANQ, and so on ...
Good tools! But there are some issues:
• Scope difference (types of variants, genomic regions)
• Results often correlated with each other
• Training bias, or even overfitting
So which to use?
7. Deleteriousness: next gen tools
LINSIGHT (http://www.nature.com/ng/journal/v49/n4/full/ng.3810.html)
EIGEN (http://www.nature.com/ng/journal/v48/n2/full/ng.3477.html)
CADD (https://www.nature.com/ng/journal/v46/n3/full/ng.2892.html)
Machine-learned model using 60+ tools & sources, trained to find the difference between:
variants stabilized in evolution since primates, and
variants randomized across the genome
Low CADD score means: this variant looks like a stabilized variant, safe
High CADD score means: this variant looks like a non-stabilized variant, could be dangerous
Great for variant prioritization
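Prioritizing on CADD is a sort on the score. The Python sketch below also converts a scaled (PHRED-like) CADD score into the fraction of most deleterious substitutions it corresponds to, e.g. 20 means top 1%. The variant ids and scores are made up for illustration.

```python
def top_fraction(scaled_cadd: float) -> float:
    """Fraction of all possible substitutions scoring at least this high,
    for a PHRED-like scaled score (10 -> 10%, 20 -> 1%, 30 -> 0.1%)."""
    return 10 ** (-scaled_cadd / 10)


def prioritize(variants):
    """Order variants from highest to lowest CADD score."""
    return sorted(variants, key=lambda v: v["cadd"], reverse=True)


# Hypothetical variants and scores.
variants = [
    {"id": "var_1", "cadd": 3.2},
    {"id": "var_2", "cadd": 27.5},
    {"id": "var_3", "cadd": 14.0},
]
ranked = [v["id"] for v in prioritize(variants)]
print(ranked)
```

Ranking needs no cutoff, which is why a score like this suits prioritization better than hard classification.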
http://3.bp.blogspot.com/-yeR87n827N8/TkagxSc9OuI/AAAAAAAAAyI/bYwDJjym8lA/s1600/Rise-of-the-Planet-the-Apes-James-Franco.jpg
7. Deleteriousness: CADD scores
COL3A1, Ehlers-Danlos Syndrome; data from Raymond Dalgleish, https://eds.gene.le.ac.uk/home.php?select_db=COL3A1
Legend: patient causal variant, GoNL variant, 1000G variant
http://www.reluctantcontortionist.co.uk/wp-content/uploads/2014/06/James-Morris-The-Elastic-Skin-Man-2-264x300.png
✔ Threshold between pathogenic & benign?
7. Deleteriousness: CADD scores
MEFV, Familial Mediterranean Fever; data from INSAID Consortium, http://molgenis.org/said
https://i.pinimg.com/originals/4e/b2/e6/4eb2e6e6c3e42b80ac363b3bc7dce572.jpg
Legend: patient causal variant, GoNL variant, 1000G variant
✗ Threshold between pathogenic & benign?
Pitfall: does not work for every gene!
7. Deleteriousness: GAVIN method
GAVIN - Gene-Aware Variant INterpretation for medical sequencing
Van der Velde et al., Genome Biology (2017) 18 (1): 6.
❖ Automated variant classification
❖ Based on CADD calibrations
❖ Web app: molgenis.org/gavin
❖ UMC Groningen genome diagnostics: 2x faster
Special highlight:
Plot: specificity vs. sensitivity (plot by Kristin Abbott)
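The idea of calibrating a CADD cutoff per gene, rather than using one cutoff for the whole genome, can be illustrated in a few lines of Python. This is not the GAVIN algorithm: the gene names, thresholds, fallback value and labels are all hypothetical, and only the CADD part of the idea is shown.

```python
# Hypothetical per-gene cutoffs; a real calibration would be derived from
# known pathogenic and population variants in each gene.
GENE_THRESHOLDS = {"GENE_A": 25.0, "GENE_B": 12.0}
GENOME_WIDE_THRESHOLD = 15.0  # hypothetical fallback for uncalibrated genes


def classify(gene: str, cadd: float) -> str:
    """Label a variant using the gene-specific cutoff when one exists."""
    threshold = GENE_THRESHOLDS.get(gene, GENOME_WIDE_THRESHOLD)
    return "likely pathogenic" if cadd > threshold else "likely benign"


print(classify("GENE_A", 20.0))
print(classify("GENE_B", 20.0))
```

The same score of 20 leads to different labels in the two genes, which is the point of being gene-aware.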
Bonus: Excel gene names @$%&#-ups
Pitfall: Gene name errors caused by Excel
Zeeberg et al. 2004, Ziemann et al. 2016
http://www.winbeta.org/news/20-of-scientific-papers-on-genes-contain-gene-name-conversion-errors-caused-by-excel
This will happen to you
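A gene list that has passed through Excel can be screened for the typical damage: symbols turned into dates (SEPT2 becomes 2-Sep, MARCH1 becomes 1-Mar) and identifiers turned into floating point numbers (2310009E13 becomes 2.31E+13). A minimal Python sketch; the two patterns are assumptions that cover these common cases only.

```python
import re

# Date-like values such as "2-Sep" or "1-Mar".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)
# Scientific notation such as "2.31E+13".
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$", re.IGNORECASE)


def suspicious_gene_names(names):
    """Return the entries that look like Excel-converted gene names."""
    flagged = []
    for name in names:
        value = name.strip()
        if DATE_LIKE.match(value) or FLOAT_LIKE.match(value):
            flagged.append(value)
    return flagged


column = ["2-Sep", "TECPR2", "1-Mar", "GALNS", "2.31E+13"]
print(suspicious_gene_names(column))
```

Detection does not recover the original symbol; the safer fix is to import gene columns as text in the first place.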
In summary...
sequence → filter → prioritize → identify the pathogenic variant (P) among the benign ones (B)
PS: Solve-RD: huge patient re-analysis
21 international partners + ERNs
19,000 unsolved WES re-analysis
2,000 cases selected for WGS
800 ultra-RD patients WES/WGS
500 long-read genomes (PacBio)
120 ‘unsolvables’ into multi-omics
Methods and tools: Structural variation, GeneNetwork, A.S.E., Methylation network, Exon skipping, Literature match, GAVIN+, InterVar, SilVA, SPIDEX, Kumar NC, VIPUR, “UTR tool”, “TFBS tool”
dept. of Genetics
Diagnostics & research
Testbeds
Your help needed:
Fresh ideas/methods/tools we can try out to diagnose more
Samples of undiagnosed patients (i.e. FASTQ/BAM/VCF files)
Project highlight
Some useful hyperlinks
1. Clinical Genomics Database http://research.nhgri.nih.gov/CGD
2. Genome of the Netherlands http://nlgenome.nl
2. 1000 Genomes Project http://1000genomes.org
2. gnomAD database http://gnomad.broadinstitute.org
3. Variant Effect Predictor http://www.ensembl.org/vep
3. SnpEff tool http://snpeff.sourceforge.net
4. MOLGENIS for scientific data http://molgenis.org
5. Online Mendelian Inheritance in Man http://omim.org
6. HPO terms http://human-phenotype-ontology.github.io
6. Phenomizer http://compbio.charite.de/phenomizer
6. PhenoTips diagnosis https://phenotips.org
7. CADD scores http://cadd.gs.washington.edu
7. GAVIN method http://molgenis.org/gavin
8. European Spreadsheet Risk Interest Group http://www.eusprig.org
9. Solving the Unsolved Rare Diseases http://solve-rd.eu/
Thank you! Questions & acknowledgements
• Special thanks to:
• Morris Swertz, Richard Sinke, Kristin Abbott, Rolf Sijmons
• Collaborators from UMCG, RUG, (inter)national projects
• Colleagues & students:
contact: [email protected]