combinatorial methods for disease association search and susceptibility prediction alexander...
TRANSCRIPT
![Page 1: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/1.jpg)
Combinatorial Methods for Disease Association
Search and Susceptibility Prediction
Alexander Zelikovsky
joint work with Dumitru BrinzaDepartment of Computer Science
![Page 2: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/2.jpg)
2
Outline
SNPs, Haplotypes and Genotypes Disease Association Search
Genome-wide association search challenges Problem formulation Exhaustive & Combinatorial Search Optimization formulation & complimentary greedy search
Predicting susceptibility to complex diseases Problem formulation/cross-validation Previous methods: SVM, RF, LP Optimum clustering and prediction via model-fitting
Conclusions
![Page 3: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/3.jpg)
3
Length of Human Genome 3 109 #Single nucleotide polymorphism (SNPs) 1 107 SNPs are mostly biallelic, e.g., AC Minor allele frequency should be considerable e.g. >.1% Difference b/w ALL people 0.25% (b/w any 2 0.1%) Diploid = two different copies of each chromosome Haplotype = description of a single copy (expensive)
example: 00110101 (0 is for major, 1 is for minor allele) Genotype = description of the mixed two copies
example 01122110 (0=00, 1=11, 2=01) International Hapmap project: www.hapmap.org
SNP, Haplotypes, Genotypes
![Page 4: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/4.jpg)
4
Challenges of Disease Association Monogenic disease
A mutated gene is entirely responsible for the disease . Typically rare in population: < 0.1%.
Complex disease Interaction of multiple genes
2-SNP interaction analysis for a genome-wide scan with 1 million SNPs (3 kb coverage) has 1012 pairwise tests
Multiple independent causes Each cause explains < 10-20% of cases
Common: > 0.1%. In NY city, 12% of the population has Type 2 Diabetes
Multiple testing adjustment Reason for non-reproducible findings
![Page 5: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/5.jpg)
5
Disease Association Search Problem
01012010201022100220110210120021020012001222111000200110022121011101202020100110012012001010001102102200020211120021011000212120
-1-1-1-11111
Disease Status
Non-diseased genotypes: H
Sample population Sof individual genotypes
Risk/resistance factor = multi-SNP combination (MSC) a subset of SNP-columns of S the values of these SNPs, 0, 1, or 2
Cluster C= subset of S with an MSC, d(C) = diseased, h(C) = non-diseased
PROBLEM: Find all MSCs significantly associated with the disease
Diseased genotypes: D
SNPs
![Page 6: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/6.jpg)
6
Significance of Risk/Resistance Factors Measured by
Relative risk (RR) Odds ratio (OR) Their p-values
Unadjusted p-value: Probability of case/control distribution among exposed to risk factor, computed by binomial distribution
Multiple-testing adjustement: Bonferroni
easy to compute overly conservative
Randomization computationally expensive more accurate
![Page 7: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/7.jpg)
7
Exhaustive & Combinatorial Search Exhaustive search is infeasible
sample with n genotypes/m SNPs requires O(n3m)
Combinatorial search Definition: Disease-closure of a multi-SNP combination C is a multi-
SNP combination C’, with maximum number of SNPs, which consists of the same set of disease individuals and minimum number of nondisease individuals.
Searches only closed clusters Closure of cluster C = C’ d(C’)=d(C) and h(C’) is minimized
Avoids checking of trivial MSCs Small d(C) implies not looking in subclusters
Finds faster associated MSCs but still too slow
Tagging: compress S by extracting most informative SNPs restore other SNPs from tag SNPs multiple regression method
![Page 8: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/8.jpg)
8
MLR Tagging
Stepwise Index SNP Algorithm: Choose as a tag the SNP which best predicts all other SNPs Choose the next one which together with a first best predicts all
other SNPs and so on. Prediction method is based on Multiple Linear Regression
(MLR)
So far beats in quality other methods (STAMPA)
![Page 9: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/9.jpg)
9
Data Sets Crohn's disease (Daly et al ): inflammatory bowel disease (IBD). Location: 5q31 Number of SNPs: 103 Population Size: 387 case: 144 control: 243
Autoimmune disorders (Ueda et al) : Location: containing gene CD28, CTLA4 and ICONS Number of SNPs: 108 Population Size: 1024 case: 378 control: 646
Tick-borne encephalitis dataset of (Barkash et al) : Location: containing gene TLR3, PKR, OAS1, OAS2, and OAS3. Number of SNPs: 41 Population Size: 75 case: 21 control: 54
![Page 10: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/10.jpg)
10
Disease association search results IES(30):
exhaustive search 30 indexed SNPs with
MLR based tagging method
ICS(30): combinatorial
search 30 indexed SNPs
with MLR based tagging method.
![Page 11: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/11.jpg)
11
Disease Association Search Optimum Association Search
Problem: Find MSC that is the most associated
with the disease Measure: positive predictive value
= find (non-)diseased-free cluster of maximum size
Bad news: Generalization of max independent set NP complete and cannot be well
approximated Hope: sample S is not arbitrary
![Page 12: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/12.jpg)
12
Complimentary Greedy Algorithm Algorithm
Start with C=S (resp. MCS is empty) Repeat until h(C)=0 (non-diseased-free)
Find 1-SC s maximizing (h(C)-h(C {s})) / (d(C) – d(C {s})) = minimize payment with diseased for removal of non-diseased
Add s to SNPs of C’s MSC
Analogy: finding independent set by greedy removing highest degree vertecies
Extremely fast but inaccurate Can be used in susceptibility prediction
![Page 13: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/13.jpg)
13
Most disease-associated & disease-resistant MSC
Comparison of three methods for searching the disease-associated and disease-resistant multi-SNPs combinations with the largest PPV. The starred values refer toresults of the runtime-constrained exhaustive search
![Page 14: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/14.jpg)
14
Genetic Susceptibility Prediction
Given: Genotypes of diseased and non-diseased individuals, Genotype of a testing person.Find: The disease status of the testing person
01012010201022100220110210120021020012001222111000200110022121011101202020100110012012001010001102102200020211120021011000212120
-1-1-1-11111
Genotype Disease Status
healthy
sick
testing - gt 0110211101211201 s(gt)
![Page 15: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/15.jpg)
15
Cross-validation Leave-one-out test: The disease status of each genotype
in the data set is predicted while the rest of the data is regarded as the training set.
-10101201020102210022011021012002102001200122211100020011002212101
Leave-many-out test: Repeat randomly picking 2/3 of the population as training set and predict the other 1/3.
-1-11
Genotype Real Disease Status
-1-1
1
PredictedDisease Status
1
0020011002212101 1 1
Accuracy = 80%
![Page 16: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/16.jpg)
16
Quality Measures of Prediction (confusion table)
Sensitivity: The ability to correctly detect disease. sensitivity = TP/(TP+FN) Specificity: The ability to avoid calling normal as disease.
specificity = TN/(FP+TN) Accuracy = (TP +TN)/(TP+FP+FN+TN) Risk Rate: Measurements for risk factors.
PredictionDisease
+ -
Test
+True Positive False Positive
(TP) (FP)
-False Negative True Negative
(FN) (TN)
![Page 17: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/17.jpg)
17
Prediction Methods Support vector machine Random forest LP-based prediction
![Page 18: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/18.jpg)
18
Prediction via Clustering Drawback of the prediction problem formulation =
need of cross-validation no optimization Clustering P = partition into clusters defined by
MSC’s Minimizing number of errors S.t. bounded information entropy –∑(Si/S) log(Si/S)
Model-fitting prediction Set status of testing genotype to diseased Find number of errors Set status of testing genotype to diseased Find number of errors Predict status that implies lesser number of errors
![Page 19: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/19.jpg)
19
Leave-1-out cross-validation results
Leave-one-out cross-validation for combinatorial search-based prediction (CSP) and complimentary greedy search-based prediction (CGSP) are given when 20, 30, or all SNPs are chosen as informative SNPs.
![Page 20: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/20.jpg)
20
ROC curve
Comparison of 5 prediction methods on (Barkash et. al,2006 ) data on all SNPs.Area under the CSP’s curve is 0.81 vs 0.52 under the SVM’s curve.
![Page 21: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/21.jpg)
21
Conclusions
Combinatorial search is able to find statistically significant multi-gene interactions, for data where no significant association was detected before
Complimentary greedy search can be used in susceptibility prediction
Optimization approach to prediction New susceptibility prediction is by 8%
higher than the best previously known MLR-tagging efficiently reduces the
datasets allowing to find associated multi-SNP combinations and predict susceptibility
![Page 22: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/22.jpg)
22
International Symposium on Bioinformatics Research and Applications
May 6-9, 2007, Georgia State University, Atlanta, Georgia http://www.cs.gsu.edu/ISBRA/
Submissions must and must not exceed 12 pages in Springer LNCS style The proceedings of ISBRA 2007 will be published in LNBI
Important Dates Submission Deadline Notification of Acceptance Final Version Submission
December 20, 2006 January 31, 2007 February 21, 2007
Symposium Organizers • General Chairs: Dan Gusfield (University of California, Davis) and Yi Pan (Georgia State University) • Program Chairs: Ion Mandoiu (University of Connecticut) and
Alexander Zelikovsky (Georgia State University)
ISBRA provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications
![Page 23: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/23.jpg)
23
History of ISBRA ISBRA is the successor of the International Workshop on Bioinformatics Research and Applications (IWBRA), held on
- May 22-25, 2005 in Atlanta, GA and
- May 28-31, 2006 in Reading, UK
in conjunction with the International Conference on Computational Science
The two editions of IWBRA have enjoyed a great success, with special issues devoted to full versions of selected papers in
Springer LNCS Transactions on Computational Systems Biology and
IEEE/ACM Transactions on Computational Biology and Bioinformatics
![Page 24: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/24.jpg)
24
Support Vector Machine (SVM) Algorithm Learning Task
Given: Genotypes of patients and healthy persons. Compute: A model distinguishing if a person has the
disease. Classification Task
Given: Genotype of a new patient + a learned model Determine: If a patient has the disease or not.
Linear SVM Non-Linear SVM
![Page 25: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/25.jpg)
25
Random Forest Algorithm
Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down to each tree in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Growing Tree, Split selection and Prediction.
Random sub-sample of training data, Random splitter selection.
![Page 26: Combinatorial Methods for Disease Association Search and Susceptibility Prediction Alexander Zelikovsky joint work with Dumitru Brinza Department of Computer](https://reader030.vdocument.in/reader030/viewer/2022032709/56649ea25503460f94ba6bfa/html5/thumbnails/26.jpg)
26
LP-based Prediction Algorithm Model:
Certain haplotypes are susceptible to the disease while others are resistant to the disease.
The genotype susceptibility is assumed to be a sum of susceptibilities of its two haplotypes.
Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of weights of its haplotypes is negative and for any case genotype it is positive.
For each vertex-haplotype hi assign the weight pi,
such that for any genotype-edge ei,j =(hi,,hj )
where s(ei,j ) {-1,1} is the disease status of genotype ei,j. The sum of absolute values of genotype weights is maximized.