
Annotation of variants

Variant Detection and Interpretation in a Diagnostic Context

K. Joeri van der Velde, PhD

(bioinformatics, research, diagnostics)

Genomics Coordination Center

Dept. of Genetics, UMC Groningen

The Netherlands

VKGL / VKGN course: “NGS in DNA diagnostics”

Tuesday September 4th 2018

CQ-F1, Erasmus MC, Rotterdam

http://rnasilencing.files.wordpress.com/2010/11/caduceus-with-dna-helix4.jpg

Genome diagnostics in a nutshell

FASTQ → BAM → VCF (e.g. G>T, A>G, T>CA)

Variant classes: Pathogenic (P), Benign (B), Variant of unknown significance (V)

Sample prep & track WGS/WES/Targeted

Quality control

Read mapping

Call variants and genotypes

Variant annotation and interpretation

Counsel patient

Counsel patient, consent

Draw sample

This talk

How many variants?

1 patient, WES

~100,000 variants

...which one is causal?

Some keywords

Filtering: candidates

Prioritization: most likely → least likely

Identification: the pathogenic variant (P) among the benign variants (B)

Filter & prioritize… on what?

We must first annotate the variants with additional context, then use this information to filter & prioritize.

CHR POS REF ALT + GoNL + 1000G + ClinVar
2 179575511 C T .23 .21
2 179577870 T C
2 179578704 G A .01 Pathogenic
2 179578730 G A Benign
2 179578891 T C .001
2 179579093 T C Pathogenic
2 179579212 T C .02
2 179579694 T A
2 179579822 T A 0.02 Benign
2 179579977 G A .25 .3
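Annotation can be thought of as a lookup keyed on (CHR, POS, REF, ALT). The sketch below is only an illustration: the three dictionaries are hypothetical in-memory stand-ins for GoNL, 1000G and ClinVar, filled with a few values from the table above, and the function name `annotate` is an assumption, not an existing tool.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical stand-ins for the real annotation sources (values from the table).
GONL_AF = {Variant("2", 179575511, "C", "T"): 0.23}
THOUSAND_G_AF = {Variant("2", 179575511, "C", "T"): 0.21}
CLINVAR = {
    Variant("2", 179578730, "G", "A"): "Benign",
    Variant("2", 179579093, "T", "C"): "Pathogenic",
}


def annotate(variant: Variant) -> dict:
    """Attach whatever each source knows about this exact variant (None if unknown)."""
    gonl: Optional[float] = GONL_AF.get(variant)
    thousand_g: Optional[float] = THOUSAND_G_AF.get(variant)
    clinvar: Optional[str] = CLINVAR.get(variant)
    return {"variant": variant, "gonl_af": gonl, "1000g_af": thousand_g, "clinvar": clinvar}


common = annotate(Variant("2", 179575511, "C", "T"))
unknown = annotate(Variant("2", 179577870, "T", "C"))
```

A variant that is absent from every source simply gets `None` for each annotation, which is itself useful information when filtering.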

Many different types of annotations

1. Gene panels
2. Population alleles
3. Protein impact
4. Assessed variants
5. Inheritance match
6. Phenotype match
7. Deleteriousness

What? How? + pitfalls

1. Gene panels: in the genome

The human genome: 3,000,000,000 base pairs

50% repetitive / transposons
20-40% regulatory
10% intron & UTR
1% pseudogenes (~15,000) & ncRNA (~25,000)
1% exome (~20,000 genes)
0.2% ‘Clinome’ (~3,600 genes): known Mendelian disorder genes, may be clinically actionable
0.002% a ‘gene panel’ (1 to ~100 genes): genes specific for a disorder or spectrum

1. Gene panels: in practice

Jongbloed (2011) Expert Opin Med Diagn 5:9-24

Clinicians: “Which genes am I allowed to look at?”

Researchers: “Which genes am I interested in?”

Pitfalls:

❖ Genes can be shared across panels
❖ Panels for same disorder differ per lab
❖ Tunnel vision: this must be it
❖ Panels often updated with latest genes
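In code, a gene panel is just a set of gene symbols used as a filter. A minimal sketch, assuming each variant is a dict with a hypothetical `gene` field; the panel contents are made up for illustration and do not represent a real diagnostic panel.

```python
def filter_by_panel(variants, panel):
    """Keep only the variants whose gene symbol is on the panel."""
    panel_upper = {gene.upper() for gene in panel}
    return [v for v in variants if v["gene"].upper() in panel_upper]


# Hypothetical panel and variants, for illustration only.
example_panel = {"GALNS", "GLB1"}
example_variants = [
    {"id": "var1", "gene": "GALNS"},
    {"id": "var2", "gene": "TECPR2"},
    {"id": "var3", "gene": "glb1"},
]
on_panel = filter_by_panel(example_variants, example_panel)
```

Because panels are updated often and differ per lab, it helps to store the panel name and version next to any result produced this way.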

1. Gene panels: clinical “superpanel”

Special highlight:

>3,600 monogenic disease genes, actionable to a degree

2. Population alleles: databases

Population variation reference databases: bigger = better

Database | Individuals | Sequencing type | Variant info
GoNL | ~500, Dutch | WGS | AF, GTC, genotypes (by request)
1000G | ~2,500, various | WGS | AF, genotypes
ExAC | ~60,000, various | WES | AF, GTC
(not named) | ~6,500, various | WES | AF, GTC
gnomAD | ~140,000, various | WES and WGS | AF, GTC

2. Population alleles: practical example

❖ Check how often an alternative allele is observed in the general population

ExAC contains ~60,000 individuals

With 10 heterozygous and 2 homozygous carriers: AC = 10 + (2 × 2) = 14

Therefore: AF = 14 / 120,000 alleles = 0.000117 ≈ 0.01%

❖ Allele frequency filtering threshold? It depends...

Common or rare disorder? Dominant or recessive?

Founder mutation? Consanguinity?

...anywhere from 0.01% to 5%

Pitfalls:

❖ Assumptions of disorder rarity and penetrance
❖ Population may not be representative of patient ethnicity
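The allele frequency arithmetic from the ExAC example can be sketched as follows. This assumes a diploid autosomal site at which all ~60,000 individuals were genotyped; the function names and the choice to keep variants without any frequency are assumptions for illustration.

```python
def allele_frequency(n_het: int, n_hom_alt: int, n_individuals: int) -> float:
    """AF = allele count / allele number for a diploid autosomal site."""
    allele_count = n_het + 2 * n_hom_alt   # each homozygote carries 2 alt alleles
    allele_number = 2 * n_individuals      # 2 alleles per individual
    return allele_count / allele_number


def passes_af_filter(af, threshold: float) -> bool:
    """Keep variants that are rare enough; None means 'never observed', so keep."""
    return af is None or af <= threshold


exac_af = allele_frequency(n_het=10, n_hom_alt=2, n_individuals=60_000)  # 14 / 120,000
```

With these numbers the variant passes a 5% threshold but fails a strict 0.01% threshold (0.0117% > 0.01%), which shows how much the outcome depends on the threshold chosen.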

2. Population alleles: false discovery

Gene | Variant class | Cases with | Cases without | Controls with | Controls without | Odds ratio (all)
DSP | Truncating | 12 | 415 | 42 | 59570 | 41.01

❖ Disease genes usually enriched with pathogenic variants
❖ Pitfall: Some (often newer) disease genes NOT enriched at all!

Special highlight:

Do patients have more “pathogenic variants” than healthy individuals?
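The odds ratio in the DSP row can be reproduced from the four counts. A minimal sketch using the standard odds ratio definition (odds in cases divided by odds in controls); the function name is an assumption.

```python
def odds_ratio(cases_with: int, cases_without: int,
               controls_with: int, controls_without: int) -> float:
    """Odds of carrying a variant in cases divided by the odds in controls."""
    return (cases_with / cases_without) / (controls_with / controls_without)


# Counts from the DSP truncating-variant row.
dsp_odds_ratio = odds_ratio(12, 415, 42, 59570)
```

An odds ratio far above 1 means the gene is enriched for this variant class in patients; a value near 1 means that carrying such a variant says little by itself.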

3. Protein impact

Effect Impact

STOP_GAINED HIGH

FRAME_SHIFT HIGH

NON_SYNONYMOUS_CODING MODERATE

CODON_INSERTION MODERATE

SYNONYMOUS_CODING LOW

NON_SYNONYMOUS_START LOW

INTRON MODIFIER

MICRO_RNA MODIFIER

+34 more effects

Predict the effect of a variant on the transcripts and protein product of a gene

Easy to use filter: each effect falls in 1 of 4 impact severity categories

Tools: SnpEff, VEP & others

Pitfalls:

❖ Transcript definitions (there could be dozens!)
❖ Overlapping genes (need to deal with this)
❖ “Impact” very handy but may oversimplify
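Filtering on impact category can be sketched like this. The effect-to-impact mapping is copied from the table above (only the 8 effects shown, not the 34 others); ranking MODIFIER below LOW and taking the most severe effect across transcripts are assumptions made for this illustration.

```python
# Effect -> impact, as listed in the table above (partial).
IMPACT_BY_EFFECT = {
    "STOP_GAINED": "HIGH",
    "FRAME_SHIFT": "HIGH",
    "NON_SYNONYMOUS_CODING": "MODERATE",
    "CODON_INSERTION": "MODERATE",
    "SYNONYMOUS_CODING": "LOW",
    "NON_SYNONYMOUS_START": "LOW",
    "INTRON": "MODIFIER",
    "MICRO_RNA": "MODIFIER",
}

# Assumed severity order for comparison.
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def most_severe_impact(effects):
    """A variant can have one effect per transcript; report the worst one."""
    impacts = [IMPACT_BY_EFFECT[effect] for effect in effects]
    return max(impacts, key=lambda impact: IMPACT_RANK[impact])


def keep_variant(effects, minimum="MODERATE"):
    """Keep the variant if its worst effect is at least `minimum`."""
    return IMPACT_RANK[most_severe_impact(effects)] >= IMPACT_RANK[minimum]
```

Taking the worst effect over all transcripts is the cautious choice, but it is exactly where the transcript-definition pitfall shows up: an effect on a rarely expressed transcript can dominate the result.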

4. Assessed variants: databases

“Has my variant been interpreted before?”

❖ LOVD instances ( http://grenada.lumc.nl/LSDB_list/lsdbs )
❖ The Human Gene Mutation Database ( www.hgmd.cf.ac.uk )
❖ HGVS list of LSDBs ( http://www.hgvs.org/locus-specific-mutation-databases )
❖ VKGL data sharing of Dutch clinical variants
    ❖ Via Cartagenia bench lab ( http://cartagenia.com )
    ❖ Joint national database ( www.molgenis.org/vkgl )

VKGL database current release: 97,801 variants

Lead: Marielle van Gijn (UMCU)


4. Assessed variants: ClinVar

❖ ClinVar ( www.ncbi.nlm.nih.gov/clinvar )
❖ Big & growing: 8,103 genes with 440,386 variants (previous years: 5,449 and 6,550 genes; 163,816 and 318,819 variants)
❖ ‘Star status’ based on submitter(s)
❖ Low confidence? OK, good to know!

Pitfalls:

❖ Many false positives in some databases (various papers on this)
❖ Legacy data (e.g. variants reported before 2000)


5. Inheritance match: CGD

“Could this heterozygous variant cause the disorder?”

Disorder inheritance mode Number of genes

Autosomal recessive (AR) ~1650

Autosomal dominant (AD) ~1000

AD or AR ~350

X-linked ~200

Other (digenic, bloodgroup, Y-linked etc) ~100

Pitfalls:

❖ CGD data is brilliant and easy to get, but not totally “clean”: beware when using these data in a script/program
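The question “could this heterozygous variant cause the disorder?” can be sketched as a check of zygosity against the inheritance mode of the gene. This is a deliberately simplified illustration: it only handles the autosomal labels AD, AR and "AD or AR", ignores penetrance, phasing and de novo status, and the function and parameter names are assumptions.

```python
from typing import Optional


def could_be_causal(inheritance: str, zygosity: str,
                    other_het_in_gene: int = 0) -> Optional[bool]:
    """Return True/False for autosomal modes, None for modes not handled here."""
    modes = set(inheritance.replace(" or ", "/").split("/"))
    if not modes <= {"AD", "AR"}:
        return None  # X-linked, digenic, etc.: needs dedicated logic
    if zygosity == "hom":
        return True  # two affected alleles fit both AD and AR
    if zygosity == "het":
        if "AD" in modes:
            return True  # one affected allele is enough
        # AR: only plausible with a second hit (possible compound heterozygote)
        return other_het_in_gene >= 1
    raise ValueError(f"unknown zygosity: {zygosity!r}")
```

Returning `None` for unhandled modes, rather than guessing, fits the warning above: free-text inheritance labels in the source data will not always match the expected values.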


6. Phenotype match: when and why?

0.2% ‘Clinome’ (~3,500 genes)

❖ Filter the genome, based on patient symptoms
❖ How? We know many gene-disorder associations

TECPR2 → Spastic paraplegia 49: Dysarthria, Brachycephaly, Seizures, Broad neck, Areflexia

GALNS → Mucopolysaccharidosis IVA: Cervical subluxation, Genu valgum, Hypoplasia of the odontoid process, Keratan sulfate excretion in urine, Metaphyseal widening

….

Patient: Microcephaly, Dental crowding, Hypotonia, Central apnea, Gastroesophageal reflux disease → which gene?

https://guideimg.alibaba.com/images/shop/88/11/20/2/close-up-of-a-baby-with-an-ice-pack-on-his-head-poster-print-18-x-24_1882252.jpg

6. Phenotype match: not so easy

There are >100,000 gene-disease-symptom associations!

…

OMIM:615031 TECPR2 9895 HP:0001260 Dysarthria

OMIM:615031 TECPR2 9895 HP:0000248 Brachycephaly

OMIM:615031 TECPR2 9895 HP:0001250 Seizures

OMIM:615031 TECPR2 9895 HP:0001284 Areflexia

OMIM:615031 TECPR2 9895 HP:0000475 Broad neck

OMIM:253000 GALNS 2588 HP:0003308 Cervical subluxation

OMIM:253000 GALNS 2588 HP:0002857 Genu valgum

OMIM:253000 GALNS 2588 HP:0003311 Hypoplasia of the odontoid process

OMIM:253000 GALNS 2588 HP:0012069 Keratan sulfate excretion in urine

OMIM:253000 GALNS 2588 HP:0003016 Metaphyseal widening

…

ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt by HPO team
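A file like this can be loaded into a gene-to-symptoms mapping in a few lines. The sketch assumes tab-separated columns in the order shown above (disease ID, gene symbol, gene ID, HPO ID, HPO term name) and that comment lines start with `#`; the sample text stands in for the real file, and the function name is an assumption.

```python
import csv
import io
from collections import defaultdict

# A few rows from the slide, standing in for the real file.
SAMPLE = (
    "OMIM:615031\tTECPR2\t9895\tHP:0001260\tDysarthria\n"
    "OMIM:615031\tTECPR2\t9895\tHP:0001250\tSeizures\n"
    "OMIM:253000\tGALNS\t2588\tHP:0003308\tCervical subluxation\n"
    "OMIM:253000\tGALNS\t2588\tHP:0002857\tGenu valgum\n"
)


def load_gene_to_terms(handle):
    """Map each gene symbol to the set of HPO term IDs linked to it."""
    gene_to_terms = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        _disease_id, gene, _gene_id, hpo_id, _hpo_name = row[:5]
        gene_to_terms[gene].add(hpo_id)
    return dict(gene_to_terms)


gene_to_terms = load_gene_to_terms(io.StringIO(SAMPLE))
```

For the real file, pass `open(path, newline="")` instead of the `io.StringIO` sample.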

6. Phenotype match: how?

❖ Naive
    Considers terms separately
    Terms directly linked to diseases/genes

❖ Smart
    Considers terms as one collection
    Terms algorithmically matched to diseases/genes

6. Phenotype match: naive method

❖ Naive, using specific terms

Example: "Keratan sulfate excretion in urine" and "Cervical subluxation"

OMIM:253000 GALNS 2588 HP:0012069 Keratan sulfate excretion in urine
OMIM:253010 GLB1 2720 HP:0012069 Keratan sulfate excretion in urine
OMIM:601492 HYAL1 3373 HP:0012069 Keratan sulfate excretion in urine

OMIM:253000 GALNS 2588 HP:0003308 Cervical subluxation
OMIM:253010 GLB1 2720 HP:0003308 Cervical subluxation
OMIM:607095 RMRP 6023 HP:0003308 Cervical subluxation

Matching both terms: GALNS, GLB1. Prime candidates?

Matching one term: HYAL1, RMRP. Also interesting?

Pitfall: Chances of an exact match are very small
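The naive method amounts to counting, per gene, how many of the patient's terms are directly linked to it. A minimal sketch using only the associations listed above; the function name is an assumption.

```python
from collections import Counter

# Direct term -> gene links, taken from the rows above.
TERM_TO_GENES = {
    "HP:0012069": {"GALNS", "GLB1", "HYAL1"},  # Keratan sulfate excretion in urine
    "HP:0003308": {"GALNS", "GLB1", "RMRP"},   # Cervical subluxation
}


def rank_genes(patient_terms):
    """Rank genes by the number of patient terms linked to them (exact matches only)."""
    hits = Counter()
    for term in patient_terms:
        hits.update(TERM_TO_GENES.get(term, ()))
    # Most hits first; alphabetical order breaks ties.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_genes(["HP:0012069", "HP:0003308"])
```

A term that is not linked to any gene contributes nothing here, which is the pitfall in practice: a slightly different but related term gives no credit at all.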

6. Phenotype match: naive method

❖ Naive, using broad terms

Pitfall: too many genes to investigate!

Seizures: 1181 genes
Microcephaly: 644 genes
Muscular hypotonia: 928 genes
Hearing impairment: 1231 genes
Developmental delay: 639 genes
Short stature: 994 genes
Nystagmus: 706 genes
Cognitive impairment: 894 genes

Counted by substring search in the July 2015 release of ALL_SOURCES_ALL_FREQUENCIES_diseases_to_genes_to_phenotypes.txt by HPO team

6. Phenotype match: smart method

❖ Smart matching using any terms
❖ HPO symptoms = graph of terms
❖ “Semantic similarity” between collections of symptoms (Resnik, 1999)

Compare: input symptom set A versus symptoms of disease B
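To make “semantic similarity” concrete, here is a toy sketch of Resnik similarity: two terms are as similar as the information content (IC) of their most informative common ancestor, where rare terms carry more information. The tiny ontology and disease annotations below are invented for illustration and are not real HPO data; scoring a symptom set by the average best match is one common choice, not necessarily what any particular tool does.

```python
import math

# Toy ontology (child -> parents) and toy disease annotations. NOT real HPO data.
PARENTS = {
    "root": [],
    "neuro": ["root"],
    "skeletal": ["root"],
    "seizures": ["neuro"],
    "hypotonia": ["neuro"],
    "genu_valgum": ["skeletal"],
}
DISEASE_TERMS = {
    "disease_1": {"seizures", "hypotonia"},
    "disease_2": {"genu_valgum"},
    "disease_3": {"seizures"},
    "disease_4": {"genu_valgum", "hypotonia"},
}


def ancestors(term):
    """The term itself plus everything above it in the graph."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """-log of the fraction of diseases annotated with the term or a descendant.

    Every term in the toy data is annotated at least once, so the count is never 0.
    """
    count = sum(
        1 for terms in DISEASE_TERMS.values()
        if any(term in ancestors(t) for t in terms)
    )
    return -math.log(count / len(DISEASE_TERMS))


def resnik(term_a, term_b):
    """IC of the most informative common ancestor of two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


def set_similarity(patient_terms, disease_terms):
    """Average, over patient terms, of the best match among the disease terms."""
    best = [max(resnik(p, d) for d in disease_terms) for p in patient_terms]
    return sum(best) / len(best)


patient = {"seizures", "hypotonia"}
scores = {name: set_similarity(patient, terms) for name, terms in DISEASE_TERMS.items()}
best_disease = max(scores, key=scores.get)
```

Unlike the naive method, related terms still earn partial credit through their shared ancestor: here "seizures" versus "hypotonia" scores above zero because both fall under "neuro".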

6. Phenotype match: smart tools

PhenoTips ( http://phenotips.org )

Comprehensive matching in nice GUI, including ‘negative’ symptoms and suggestions for differential diagnosis

AMELIE ( https://amelie.stanford.edu/ )

Enter symptoms and candidate genes, uses automated literature search to prioritize candidate genes

Why: truly very good performance for finding known disease genes (unpublished benchmark)

GADO ( https://genenetwork.nl/gado )

Enter symptoms and optional candidate genes, prioritizes any expressed gene based on massive RNA-seq gene network

Why: perfect for discovering new disease genes


6. Phenotype match: wrap-up

Pitfall:

❖ Not every ‘phenotype matching’ software is smart! So be wary of how the matching is performed.

For ‘smart’ software:

Enter as many symptoms as possible, as specific as possible. This will only improve the match and chances of success!

For ‘naive’ software:

Try a few specific terms first, but don’t get stuck. Broaden your terms when needed, and don’t enter too many at once. There is no point in getting thousands of hits.

7. Deleteriousness: traditional tools

Many conservation-based tools that predict how damaging a variant will be:

SIFT, PolyPhen, PROVEAN, PON-P2, MutationAssessor, FATHMM-MKL, Condel, PhyloP, UMD-Pred, Grantham, DeepPVP, ENTPRISE, PHAST, FitCons, DANN, MutPred, GERP, Xvar, VAAST, AlignGVGD, MAPP, MutationTaster, Alamut, MVP, EIGEN, REVEL, DIVAN, DeepSEA, REMM, GWAVA, DeltaSVM, DANQ, and so on ...

Good tools! But there are some issues:

• Scope difference (types of variants, genomic regions)
• Results often correlated with each other
• Training bias or even overfitting

So which to use?

7. Deleteriousness: next gen tools

LINSIGHT (http://www.nature.com/ng/journal/v49/n4/full/ng.3810.html)

EIGEN (http://www.nature.com/ng/journal/v48/n2/full/ng.3477.html)

CADD (https://www.nature.com/ng/journal/v46/n3/full/ng.2892.html)

Machine-learned model using 60+ tools & sources, trained to find the difference between:
variants stabilized in evolution since primates, and
variants randomized across the genome

Low CADD score means: this variant looks like a stabilized variant, safe

High CADD score means: this variant looks like a non-stabilized variant, could be dangerous

Great for variant prioritization
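Prioritizing on a deleteriousness score is essentially a sort. A minimal sketch, assuming each candidate is a dict with a hypothetical `cadd` field that is `None` when no score is available; the variant IDs and scores are invented.

```python
def prioritize(variants):
    """Order candidates from highest to lowest CADD score; unscored variants go last."""
    return sorted(
        variants,
        key=lambda v: (v["cadd"] is None, -(v["cadd"] or 0.0)),
    )


# Invented example candidates.
candidates = [
    {"id": "var1", "cadd": 12.3},
    {"id": "var2", "cadd": None},
    {"id": "var3", "cadd": 33.0},
    {"id": "var4", "cadd": 24.1},
]
ordered_ids = [v["id"] for v in prioritize(candidates)]
```

Sorting keeps every candidate in view, in contrast to a hard cut-off, which is useful when it is not yet clear where a threshold should lie.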

http://3.bp.blogspot.com/-yeR87n827N8/TkagxSc9OuI/AAAAAAAAAyI/bYwDJjym8lA/s1600/Rise-of-the-Planet-the-Apes-James-Franco.jpg

7. Deleteriousness: CADD scores

COL3A1, Ehlers-Danlos Syndrome; data from Raymond Dalgleish, https://eds.gene.le.ac.uk/home.php?select_db=COL3A1

Legend: patient causal variant, GoNL variant, 1000G variant

http://www.reluctantcontortionist.co.uk/wp-content/uploads/2014/06/James-Morris-The-Elastic-Skin-Man-2-264x300.png

✔ Threshold between pathogenic & benign?

7. Deleteriousness: CADD scores

MEFV, Familial Mediterranean Fever; data from INSAID Consortium, http://molgenis.org/said

https://i.pinimg.com/originals/4e/b2/e6/4eb2e6e6c3e42b80ac363b3bc7dce572.jpg

Legend: patient causal variant, GoNL variant, 1000G variant

✘ Threshold between pathogenic & benign?

Pitfall: does not work for every gene!

7. Deleteriousness: GAVIN method

GAVIN - Gene-Aware Variant INterpretation for medical sequencing Van der Velde et al., Genome Biology (2017) 18 (1): 6.

❖ Automated variant classification
❖ Based on CADD calibrations
❖ Web app: molgenis.org/gavin
❖ UMC Groningen genome diagnostics: 2x faster

Special highlight:

[Plot: sensitivity versus specificity, by Kristin Abbott]
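The general idea of gene-aware classification can be illustrated with per-gene score thresholds. This is a loose sketch of that idea only, not GAVIN's actual algorithm or calibration: the gene names and all threshold values below are invented, and the real method uses more information than a CADD score.

```python
# Hypothetical per-gene CADD thresholds: (benign_below, pathogenic_above).
# Gene names and numbers are invented for illustration only.
GENE_CADD_THRESHOLDS = {
    "GENE_A": (15.0, 25.0),
    "GENE_B": (22.0, 30.0),
}


def classify(gene, cadd):
    """Classify with the gene's own thresholds; stay undecided when in doubt."""
    thresholds = GENE_CADD_THRESHOLDS.get(gene)
    if thresholds is None or cadd is None:
        return "VUS"  # no calibration for this gene, or no score
    benign_below, pathogenic_above = thresholds
    if cadd > pathogenic_above:
        return "likely pathogenic"
    if cadd < benign_below:
        return "likely benign"
    return "VUS"
```

The same score can lead to different calls in different genes, which is the point of the COL3A1 and MEFV examples: one genome-wide threshold does not fit every gene.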

Bonus: Excel gene names @$%&#-ups

Pitfall: Gene name errors caused by Excel

Zeeberg et al. 2004, Ziemann et al. 2016

http://www.winbeta.org/news/20-of-scientific-papers-on-genes-contain-gene-name-conversion-errors-caused-by-excel

This will happen to you
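A quick check for Excel damage in a gene list can be sketched with two regular expressions. It looks for the well-known symptoms: gene symbols converted to dates (for example SEPT2 becoming 2-Sep, MARCH1 becoming 1-Mar) and identifiers converted to scientific notation (such as 2.31E+13). The patterns and function name are assumptions and will not catch every locale-specific date format.

```python
import re

# Symbols turned into day-month dates, e.g. "2-Sep" or "1-Mar".
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)
# Identifiers turned into floating point numbers, e.g. "2.31E+13".
SCIENTIFIC = re.compile(r"^\d(\.\d+)?E\+\d+$", re.IGNORECASE)


def looks_excel_mangled(symbol: str) -> bool:
    """True if the value looks like a date or number Excel made out of a gene name."""
    value = symbol.strip()
    return bool(DATE_LIKE.match(value) or SCIENTIFIC.match(value))


suspicious = [s for s in ["GALNS", "2-Sep", "SEPT2", "1-Mar", "2.31E+13"]
              if looks_excel_mangled(s)]
```

Running such a check on every gene column before analysis is cheap; recovering the original symbol afterwards is not always possible, because several genes can map to the same date.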

in summary...

sequence → filter → prioritize → identify (the pathogenic variant, P, among the benign ones, B)

PS: Solve-RD: huge patient re-analysis

Project highlight

21 international partners + ERNs
19,000 unsolved WES re-analysis
2,000 cases selected for WGS
800 ultra-RD patients WES/WGS
500 long-read genomes (PacBio)
120 ‘unsolvables’ into multi-omics

Dept. of Genetics: diagnostics & research testbeds

Structural variation, GeneNetwork, A.S.E., Methylation network, Exon skipping, Literature match, GAVIN+, InterVar, SilVA, SPIDEX, Kumar NC, VIPUR, “UTR tool”, “TFBS tool”

Your help needed:

Fresh ideas/methods/tools we can try out to diagnose more
Samples of undiagnosed patients (i.e. FASTQ/BAM/VCF files)

Some useful hyperlinks

1. Clinical Genomics Database http://research.nhgri.nih.gov/CGD

2. Genome of the Netherlands http://nlgenome.nl

2. 1000 Genomes Project http://1000genomes.org

2. gnomAD database http://gnomad.broadinstitute.org

3. Variant Effect Predictor http://www.ensembl.org/vep

3. SnpEff tool http://snpeff.sourceforge.net

4. MOLGENIS for scientific data http://molgenis.org

5. Online Mendelian Inheritance in Man http://omim.org

6. HPO terms http://human-phenotype-ontology.github.io

6. Phenomizer http://compbio.charite.de/phenomizer

6. PhenoTips diagnosis https://phenotips.org

7. CADD scores http://cadd.gs.washington.edu

7. GAVIN method http://molgenis.org/gavin

8. European Spreadsheet Risk Interest Group http://www.eusprig.org

9. Solving the Unsolved Rare Diseases http://solve-rd.eu/


Thank you! Questions & acknowledgements

• Special thanks to: Morris Swertz, Richard Sinke, Kristin Abbott, Rolf Sijmons

• Collaborators from UMCG, RUG, (inter)national projects

• Colleagues & students:

contact: [email protected]
