LS-SNP: Large-scale annotation of coding non-synonymous SNPs based on multiple information sources

-Bioinformatics April 2005

Motivation

• Over 9 million SNPs in dbSNP with little functional annotation

• nsSNPs are of critical importance for disease and drug sensitivity

• Prediction of functional SNPs enables targeting of SNPs to be genotyped in candidate gene studies

• Helps identify the causative SNP among SNPs that are in LD

Aims

• Identify candidate functional SNPs in a:
  – gene
  – haplotype
  – pathway

• Map nsSNPs onto protein sequences, functional pathways, comparative structure models

Predictions of SNP function

• Predict positions where nsSNPs:
  – rule based:
    • destabilize proteins
    • interfere with formation of domain-domain interfaces
    • affect protein-ligand binding
  – supervised learning (SVM):
    • severely affect human health

Methods - pipeline

• SNP-protein mapping

• Sequence to structure (exp derived)
  – genomic seq, protein seq, protein structure

• SNP prediction annotations combine:
  – rule based
  – supervised learning (SVM)

SNP Annotations-rule based

• destabilizing (Sunyaev et al., 2001) if:
  – RSA (relative solvent accessibility) < 25% and difference in accessible surface propensities (knowledge-based hydrophobic potentials) > 0.75
  – RSA > 50% and difference in accessible surface propensities > 2
  – RSA < 25% and charge change
  – variant involves a proline in a helix
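The four rules above can be sketched as a small function. This is only an illustration of the thresholds listed on the slide, not the pipeline's actual code: the function and parameter names are hypothetical, and `propensity_diff` is assumed to be the magnitude of the difference in accessible surface propensity.

```python
def destabilizing_rules(rsa_percent, propensity_diff, charge_change, proline_in_helix):
    """Return the numbers (1-4) of the destabilization rules that fire.

    All names are hypothetical. rsa_percent is the relative solvent
    accessibility in percent; propensity_diff is assumed to be the
    magnitude of the change in accessible surface propensity.
    """
    fired = []
    if rsa_percent < 25 and propensity_diff > 0.75:
        fired.append(1)  # buried position, large propensity change
    if rsa_percent > 50 and propensity_diff > 2:
        fired.append(2)  # exposed position, very large propensity change
    if rsa_percent < 25 and charge_change:
        fired.append(3)  # charge change at a buried position
    if proline_in_helix:
        fired.append(4)  # variant involves a proline in a helix
    return fired


def is_destabilizing(rsa_percent, propensity_diff, charge_change, proline_in_helix):
    """A variant is flagged if at least one of the four rules fires."""
    return bool(destabilizing_rules(rsa_percent, propensity_diff,
                                    charge_change, proline_in_helix))


# Buried residue with a propensity change of 1.0 and a charge change.
print(destabilizing_rules(10.0, 1.0, True, False))  # [1, 3]
```

Returning the list of fired rules rather than a single flag keeps the evidence for each prediction visible.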

rule based (cont.)

• Interference with a domain-domain interface is predicted if:
  – any of the 4 rules applies and
  – the residue is within 6 Å of an atom in an adjacent domain

• An effect on protein-ligand binding is predicted if:
  – any of the 4 rules applies and
  – the residue is within 5 Å of a HETATM
    • (not covalently bonded to the protein, not one of the 20 amino acids and not in a water molecule)
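A minimal sketch of the two distance checks, assuming atoms are given as (x, y, z) coordinates in Å. The function names are hypothetical, and the ligand atom list is assumed to have been pre-filtered to HETATM records that are not covalently bound, not amino acids and not water; that filtering is not shown.

```python
import math

DOMAIN_CUTOFF = 6.0  # Å, cutoff for atoms in an adjacent domain (from the slide)
LIGAND_CUTOFF = 5.0  # Å, cutoff for HETATM ligand atoms (from the slide)


def within_cutoff(residue_atoms, other_atoms, cutoff):
    """True if any residue atom lies within `cutoff` Å of any other atom."""
    return any(math.dist(a, b) <= cutoff
               for a in residue_atoms
               for b in other_atoms)


def annotate_contacts(residue_atoms, adjacent_domain_atoms, ligand_atoms, destabilizing):
    """Combine the rule-based flag with the two distance criteria.

    `destabilizing` is the outcome of the four structural rules; both
    annotations require it in addition to proximity.
    """
    return {
        "domain_interface": bool(destabilizing) and within_cutoff(
            residue_atoms, adjacent_domain_atoms, DOMAIN_CUTOFF),
        "ligand_binding": bool(destabilizing) and within_cutoff(
            residue_atoms, ligand_atoms, LIGAND_CUTOFF),
    }


# One residue atom at the origin; the other atoms are 5.5 Å away.
result = annotate_contacts([(0.0, 0.0, 0.0)], [(0.0, 0.0, 5.5)], [(0.0, 0.0, 5.5)], True)
print(result)  # {'domain_interface': True, 'ligand_binding': False}
```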

SNP Annotations-supervised learning (svm)

• train an SVM to discriminate between monogenic disease nsSNPs (from OMIM) and neutral SNPs (from dbSNP)

(measure of strain)

(chemical similarity)

svm – training dataset

• 1457 disease-associated
  – variants in Swiss-Prot and OMIM

• 2504 neutral
  – neutral variants according to rules 1-4

• 3-fold cross validation
  – train on subsets 1 and 2, test on 3
  – repeated 10 times
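The cross-validation scheme can be sketched with the standard library alone. This is an assumption-laden illustration: it assumes each of the three subsets is held out in turn and that the data are reshuffled for each of the 10 repeats, which the slide does not state. The function name and seed are hypothetical.

```python
import random

N_DISEASE = 1457  # disease-associated variants (from the slide)
N_NEUTRAL = 2504  # neutral variants (from the slide)


def three_fold_splits(n_items, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated 3-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # Deal the shuffled indices into three near-equal subsets.
        folds = [indices[i::3] for i in range(3)]
        for held_out in range(3):
            test = folds[held_out]
            train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
            yield train, test


splits = list(three_fold_splits(N_DISEASE + N_NEUTRAL))
print(len(splits))  # 30 train/test pairs: 3 folds x 10 repeats
```

Each pair would be used to fit the SVM on `train` and score it on `test`; averaging over the 30 runs gives the kind of mean and spread reported for the classifier.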

svm – training dataset

• the absolute value of the SVM score gives the confidence

• exclude low-confidence predictions
  – accuracy of 80.5% (±0.3%)
  – false positives 19.7% (±0.2%)
  – false negatives 18.7% (±0.8%)
  – 122 rejected on low confidence
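A minimal sketch of rejecting low-confidence predictions before scoring. The threshold, the function name and the rate definitions (false positive rate as a fraction of truly neutral variants, false negative rate as a fraction of truly disease-associated variants) are assumptions; the slide does not define them. The example scores are made up for illustration.

```python
def evaluate_with_rejection(scores, labels, threshold):
    """Score SVM outputs after dropping predictions with |score| < threshold.

    scores: signed SVM outputs (positive = predicted disease-associated).
    labels: +1 for disease-associated, -1 for neutral.
    """
    kept = [(s, y) for s, y in zip(scores, labels) if abs(s) >= threshold]
    tp = sum(1 for s, y in kept if s > 0 and y == 1)
    tn = sum(1 for s, y in kept if s <= 0 and y == -1)
    fp = sum(1 for s, y in kept if s > 0 and y == -1)
    fn = sum(1 for s, y in kept if s <= 0 and y == 1)
    return {
        "rejected": len(scores) - len(kept),
        "accuracy": (tp + tn) / len(kept) if kept else None,
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,
        "false_negative_rate": fn / (fn + tp) if fn + tp else None,
    }


# Made-up scores: two of the six fall below the confidence threshold.
report = evaluate_with_rejection(
    scores=[1.5, -0.1, -2.0, 0.8, -1.2, 0.05],
    labels=[1, 1, -1, -1, 1, -1],
    threshold=0.5,
)
print(report["rejected"], report["accuracy"])  # 2 0.5
```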

Results-mapping

• SNP to protein mapping
  – 28,043 (21,255 dbSNP) validated coding nsSNPs
  – 70,147 (54,048 dbSNP) incl. non-validated

Results-structure

• 13,391 (53%) proteins have modelled domains with equivalent residues

• 13,062 (19%) nsSNPs (all)

• 8725 (31%) nsSNPs (validated)
  – 67 nsSNPs appear in more than one protein (alt splicing)

Results -function

• 1886 destabilizing nsSNPs (structural rules 1-4)

• 1317 monogenic disease-associated nsSNPs by SVM
  – comparative models
  – conservation
  – sub properties

Web resource

http://alto.compbio.ucsf.edu/LS-SNP/

KEGG pathway, SNP ID (rs), HUGO, Swiss-Prot

filter

– SCOP
– Swiss-Prot
– KEGG
– UCSC
– PDBsum
– ModBase

genomic seq

protein seq

structure

snp prediction annotations

Discussion-data quality

• validated/non-validated SNPs?
  – multiple independent submissions
  – submitter confirmation
  – alleles observed in at least 2 chromosomes
  – submission to HapMap

• report non-validated and validated SNPs with option to filter
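The reporting option can be sketched as a filter over SNP records. This is a hypothetical illustration: the record fields and function names are invented here, and it assumes that meeting any one of the four criteria is enough for a SNP to count as validated, which the slide does not state. The rs IDs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    """Hypothetical record holding the four validation criteria from the slide."""
    rs_id: str
    independent_submissions: int = 1
    submitter_confirmed: bool = False
    alleles_seen_in_two_chromosomes: bool = False
    submitted_to_hapmap: bool = False


def is_validated(snp):
    """Assumption: any single criterion is sufficient for validation."""
    return (snp.independent_submissions >= 2
            or snp.submitter_confirmed
            or snp.alleles_seen_in_two_chromosomes
            or snp.submitted_to_hapmap)


def select_snps(snps, validated_only=False):
    """Report all SNPs by default; optionally keep only validated ones."""
    return [s for s in snps if is_validated(s) or not validated_only]


snps = [
    SnpRecord("rs_example_1"),
    SnpRecord("rs_example_2", independent_submissions=2),
    SnpRecord("rs_example_3", submitted_to_hapmap=True),
]
print([s.rs_id for s in select_snps(snps, validated_only=True)])
# ['rs_example_2', 'rs_example_3']
```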

Discussion -ligands

• local structural environment of each SNP-ligand contact cannot be evaluated by the pipeline

• all contacts reported
  – some will not be biologically interesting

• e.g. a SNP in proximity to glycerol will have no functional effect

• but, in glycerol kinase, the SNP could be important

Discussion -structural annotations

• ModSNP: 4109 structural annotations, 70% sequence identity cutoff

• LS-SNP: structural annotations for 13,062 dbSNP rsIDs (4907 validated), no sequence identity cutoff
  – instead, a score (0-1) is given based on sequence identity and model assessment (avg identity ~28%)


• ‘…because structure annotations are models, use properties that depend on correct fold assignment and a good target-template alignment, as opposed to atomic-level structural details such as loss of salt bridges, hydrogen bonds or disulphide bonds.’


• not possible to model effects such as changes in backbone geometry

• or small side chain alterations

Case study-Glutathione S-Transferase

• GSTs play a key role in cellular detoxification
  – domain interface
  – buried charge change
  – unfavourable change in accessible surface potential at buried position
  – conserved in mouse, rat, chicken

• combination of info sources builds a convincing case

Caveats

• only updated twice a year

• dependent on structure (comparative modelling)
  – allowing predictions without structure data would have increased numbers

• no option to add your own SNPs

• no idea as to which predictors are best
  – combinations of predictors

• domain-domain or ligand binding, but no indication of how damaging this might be

• next version will have HapMap SNPs

• SVM – monogenic

• only chose a small subset of Sunyaev's rules - conservation?
