LS-SNP: Large-scale annotation of coding non-synonymous SNPs based on multiple information sources
Bioinformatics, April 2005
Motivation
• Over 9 million SNPs in dbSNP with little functional annotation
• nsSNPs are of critical importance for disease and drug sensitivity
• Prediction of functional SNPs enables targeting of SNPs to be genotyped in candidate gene studies
• Helps identify the causative SNP among SNPs that are in linkage disequilibrium (LD)
Aims
• Identify candidate functional SNPs in
  – gene
  – haplotype
  – pathway
• Map nsSNPs onto protein sequences, functional pathways, comparative structure models
Predictions of SNP function
• Predict positions where nsSNPs
  – rule based:
    • destabilize proteins
    • interfere with formation of domain-domain interfaces
    • interfere with protein-ligand binding
  – supervised learning (SVM):
    • severely affect human health
Methods - pipeline
• SNP-protein mapping
• Sequence to structure (exp. derived)
  – genomic seq, protein seq, protein structure
• SNP prediction annotations combine:
  – rule based
  – supervised learning (SVM)
SNP Annotations-rule based
• destabilizing (Sunyaev et al., 2001) if:
  – RSA (relative solvent accessibility) < 25% and difference in accessible surface propensities (knowledge-based hydrophobic potentials) > 0.75
  – RSA > 50% and difference in accessible surface propensities > 2
  – RSA < 25% and charge change
  – variant involves a proline in a helix
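The four destabilization rules above can be sketched as a small rule check. This is only an illustration, not the LS-SNP implementation: the `Substitution` record, its field names, and the numbering of the rules 1 to 4 in slide order are assumptions, and the slide does not say whether the propensity difference is signed or absolute.

```python
from dataclasses import dataclass


@dataclass
class Substitution:
    """Hypothetical per-variant record; field names are illustrative."""
    rsa: float               # relative solvent accessibility, in percent (0-100)
    propensity_diff: float   # difference in accessible surface propensity
    charge_change: bool      # True if the substitution changes the charge
    proline_in_helix: bool   # True if the variant involves a proline in a helix


def destabilizing_rules(s: Substitution) -> list[int]:
    """Return the numbers of the rules that fire (1-4, in slide order)."""
    fired = []
    if s.rsa < 25 and s.propensity_diff > 0.75:
        fired.append(1)      # buried site, moderate propensity difference
    if s.rsa > 50 and s.propensity_diff > 2:
        fired.append(2)      # exposed site, large propensity difference
    if s.rsa < 25 and s.charge_change:
        fired.append(3)      # buried charge change
    if s.proline_in_helix:
        fired.append(4)      # proline in a helix
    return fired


def is_destabilizing(s: Substitution) -> bool:
    """A variant is flagged if at least one rule fires."""
    return bool(destabilizing_rules(s))
```

Returning the list of fired rules rather than a single boolean keeps the reason for each flag visible, which matters when several information sources are later combined.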
rule based (cont.)
• Interference with domain-domain interfaces if:
  – any of the 4 rules applies and
  – within <= 6 Å of an atom in an adjacent domain
• Effect on protein-ligand binding is predicted if:
  – any of the 4 rules applies and
  – within <= 5 Å of a HETATM
    • (not covalently bonded to the protein, not one of the 20 amino acids, and not in a water molecule)
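The two proximity conditions can be sketched with plain coordinate tuples. This is a rough sketch under assumptions: the function names are hypothetical, `rule_fired` stands for "any of the 4 rules applies", and the caller is assumed to have already filtered the HETATM atoms (no covalently bonded groups, no standard amino acids, no water).

```python
import math

Atom = tuple[float, float, float]  # x, y, z in Å


def min_distance(atoms_a: list[Atom], atoms_b: list[Atom]) -> float:
    """Smallest pairwise distance; infinity if either atom list is empty."""
    return min(
        (math.dist(a, b) for a in atoms_a for b in atoms_b),
        default=math.inf,
    )


def annotate_contacts(
    rule_fired: bool,
    residue_atoms: list[Atom],
    adjacent_domain_atoms: list[Atom],
    ligand_atoms: list[Atom],
    interface_cutoff: float = 6.0,  # Å, from the slide
    ligand_cutoff: float = 5.0,     # Å, from the slide
) -> dict[str, bool]:
    """Flag domain-interface and ligand-binding interference for one residue."""
    near_domain = min_distance(residue_atoms, adjacent_domain_atoms) <= interface_cutoff
    near_ligand = min_distance(residue_atoms, ligand_atoms) <= ligand_cutoff
    return {
        "domain_interface": rule_fired and near_domain,
        "ligand_binding": rule_fired and near_ligand,
    }
```

For example, a residue atom 5.5 Å from both an adjacent-domain atom and a ligand atom would be flagged for the domain interface only, since 5.5 Å exceeds the 5 Å ligand cutoff.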
SNP Annotations-supervised learning (SVM)
• Train an SVM to discriminate between monogenic disease nsSNPs from OMIM and neutral SNPs from dbSNP
(measure of strain)
(chemical similarity)
SVM – training dataset
• 1457 disease-associated
  – variants in Swiss-Prot and OMIM
• 2504 neutral
  – neutral variants according to rules 1-4
• 3-fold cross-validation
  – train on subsets 1 and 2, test on 3
  – repeated 10 times
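The repeated 3-fold scheme can be sketched as an index generator. This is a minimal sketch, assuming the held-out subset rotates within each repeat and that the data are reshuffled between repeats; the slide states neither, and the function name and seed are illustrative.

```python
import random
from typing import Iterator


def repeated_kfold(
    n_samples: int, k: int = 3, repeats: int = 10, seed: int = 0
) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                         # new random partition per repeat
        folds = [indices[i::k] for i in range(k)]    # k near-equal subsets
        for held_out in range(k):
            test = folds[held_out]
            train = [
                j
                for f, fold in enumerate(folds)
                if f != held_out
                for j in fold
            ]
            yield train, test


# 1457 disease-associated + 2504 neutral variants, as on the slide
n_variants = 1457 + 2504
splits = list(repeated_kfold(n_variants))
```

With the defaults this produces 30 train/test splits (3 folds times 10 repeats), each training on roughly two thirds of the 3961 variants.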
SVM – training dataset (cont.)
• The absolute value of the SVM output gives the prediction confidence
• Exclude low-confidence predictions
  – accuracy of 80.5% (±0.3%)
  – false positives 19.7% (±0.2%)
  – false negatives 18.7% (±0.8%)
  – 122 rejected on low confidence
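Rejecting low-confidence predictions and then scoring the rest might look like the sketch below. Several things here are assumptions: the threshold value is a free parameter (the slide gives none), the sign of the score is taken as the predicted class, and the false positive and false negative figures are computed as rates over the true neutral and true disease variants respectively.

```python
def evaluate_with_rejection(
    scores: list[float], labels: list[int], min_confidence: float
) -> dict[str, float]:
    """Score SVM outputs after dropping predictions with |score| below a cutoff.

    scores: SVM decision values (positive = predicted disease-associated)
    labels: +1 for disease-associated, -1 for neutral
    """
    kept = [(s, y) for s, y in zip(scores, labels) if abs(s) >= min_confidence]
    tp = sum(1 for s, y in kept if s > 0 and y == 1)
    tn = sum(1 for s, y in kept if s <= 0 and y == -1)
    fp = sum(1 for s, y in kept if s > 0 and y == -1)
    fn = sum(1 for s, y in kept if s <= 0 and y == 1)
    nan = float("nan")
    return {
        "rejected": len(scores) - len(kept),
        "accuracy": (tp + tn) / len(kept) if kept else nan,
        "false_positive_rate": fp / (fp + tn) if fp + tn else nan,
        "false_negative_rate": fn / (fn + tp) if fn + tp else nan,
    }
```

Raising `min_confidence` trades coverage for reliability: more variants are left unclassified, but the remaining calls sit further from the decision boundary.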
Results-mapping
• SNP to protein mapping
  – 28,043 (21,255 dbSNP) validated coding nsSNPs
  – 70,147 (54,048 dbSNP) including non-validated
Results-structure
• 13,391 (53%) proteins have modelled domains with equivalent residues
• 13,062 (19%) nsSNPs (all)
• 8725 (31%) nsSNPs (validated)
  – 67 nsSNPs appear in more than one protein (alternative splicing)
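The nsSNP percentages can be checked against the mapping totals from the previous slide. This short check assumes the denominators are the 70,147 (all) and 28,043 (validated) mapped coding nsSNPs, which the slide does not state explicitly; the variable names are illustrative.

```python
# Totals from the mapping slide (assumed denominators)
mapped_all = 70_147
mapped_validated = 28_043

# nsSNPs that fall in modelled domains, from this slide
modelled_all = 13_062
modelled_validated = 8_725


def percent(part: int, whole: int) -> float:
    """Share of `whole` covered by `part`, in percent."""
    return 100.0 * part / whole


pct_all = percent(modelled_all, mapped_all)
pct_validated = percent(modelled_validated, mapped_validated)
print(f"all nsSNPs with structure: {pct_all:.1f}%")
print(f"validated nsSNPs with structure: {pct_validated:.1f}%")
```

Both values round to the figures quoted on the slide, which supports reading the percentages as fractions of the mapped nsSNPs.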
Results-function
• 1886 destabilizing nsSNPs (structural rules 1-4)
• 1317 monogenic disease-associated nsSNPs by SVM
  – comparative models
  – conservation
  – substitution properties
Web resource
http://alto.compbio.ucsf.edu/LS-SNP/
KEGG pathway, SNP ID (rs), HUGO, Swiss-Prot
filter
  – SCOP
  – Swiss-Prot
  – KEGG
  – UCSC
  – PDBsum
  – ModBase
genomic seq
protein seq
structure
snp prediction annotations
Discussion-data quality
• Validated/non-validated SNPs?
  – multiple independent submissions
  – submitter confirmation
  – alleles observed in at least 2 chromosomes
  – submission to HapMap
• Report non-validated and validated SNPs with an option to filter
Discussion-ligands
• The local structural environment of each SNP-ligand contact cannot be evaluated by the pipeline
• All contacts are reported
  – some will not be biologically interesting
• e.g. a SNP in proximity of glycerol will have no functional effect
• but in glycerol kinase, the SNP could be important
Discussion-structural annotations
• ModSNP: 4109 structural annotations; 70% sequence identity cutoff
• LS-SNP: 13,062 dbSNP rsIDs (4907 validated) with structural annotations; no sequence identity cutoff
  – instead, a score (0-1) is given based on sequence identity and model assessment (avg identity ~28%)
Discussion -structural annotations
• ‘…because structure annotations are models, use properties that depend on correct fold assignment and a good target-template alignment, as opposed to atomic-level structural details such as loss of either salt bridges or hydrogen or disulphide bonds.’
Discussion -structural annotations
• not possible to model effects such as changes in backbone geometry
• or small side chain alterations
Case study-Glutathione S-Transferase
• GSTs play a key role in cellular detoxification
  – domain interface
  – buried charge change
  – unfavourable change in accessible surface potential at buried position
  – conserved in mouse, rat, chicken
• The combination of information sources builds a convincing case
Caveats
• only updated twice a year
• dependent on structure (comparative modelling)
  – allowing predictions without structure data would have increased numbers
• no option to add your own SNPs
• no idea as to which predictors are best
  – combinations of predictors
• domain-domain or ligand binding flagged, but no indication of how damaging this might be
• next version will have HapMap SNPs
• SVM – monogenic
• only chose a small subset of Sunyaev's rules - conservation?