Download - Evaluation of methods in gene association studies: yet another case for Bayesian networks
Evaluation of Evaluation of methods in gene methods in gene
association studies: association studies: yetyet
another case for another case for Bayesian networksBayesian networks
Department of MeasurementDepartment of Measurement and Information Systemsand Information Systems
Budapest University of Budapest University of Technology and EconomicsTechnology and Economics
Gábor HullámGábor Hullám, , Péter AntalPéter Antal,,András Falus and Csaba SzalaiAndrás Falus and Csaba Szalai
Department of GeneticsCell and
ImmunobiologySemmelweis University
Genetic association studies Genetic association studies (GAS)(GAS)
A Bayesian approach to GASA Bayesian approach to GAS Bayesian networks in GASBayesian networks in GAS Evaluation of methodsEvaluation of methods
3
MotivationMotivation:: Exploring the Exploring the vvariomeariome
VariomeVariome Single-Nucleotide Polymorphisms (SNPs)Single-Nucleotide Polymorphisms (SNPs) Copy-Number Variations (CNVs).Copy-Number Variations (CNVs). Genome rearrangGenome rearrangeements ments MethylomeMethylome
SNPsSNPs Number of SNPs (10Number of SNPs (1077 ->10 ->1066)) Correlation structure: the HAPMAP projectCorrelation structure: the HAPMAP project
GWAS Data analysis candidate regions.
genes
PGAS s11 PGAS s12 PGAS s13
Study design SNP datasets
PGAS Data analysis confirmations
refutations
384x48384x48384x48
Genome wide association study (GWAS)
GAS phases
GAS Facts Publications: ~40K SNPs on plate: 100K-2M Sample size: 30K
Confirmed associations: <1000 Small attributable risk
Why? Common disease – common variance
hypothesis - multifactorial diseases, many weak interactions
Rare haplotype hypothesis (Minor allele freq. <1%)
Number of gene association studies GWAS: ~100 PGAS-: ~10K
Current challenge: the discovery of epistasis
Statistical epistasis: non-linear Statistical epistasis: non-linear interaction of genes interaction of genes
The goal is the exploration ofThe goal is the exploration of…… explanatory variables explanatory variables ofof the target the target
variablevariable((ss)) the the interactioninteraction of explanatory variables of explanatory variables
Genetic association concepts can be Genetic association concepts can be formalized (partially) as machine formalized (partially) as machine learning concepts and as learning concepts and as Bayesian Bayesian network conceptsnetwork concepts
The model class: Bayesian The model class: Bayesian networksnetworks
directed acyclic graph (DAG)directed acyclic graph (DAG) nodes – domain entitiesnodes – domain entities edges – direct probabilistic relationsedges – direct probabilistic relations
conditional probability models P(X|Pa(X))conditional probability models P(X|Pa(X)) interpretations:interpretations:
effectiverepresentationof the distribution
N
i iiN XPaXPXXP11 ))(|(),..(
DAG structure:dependency map(d-separation)
edges:direct causalrelations
8
Y
BBayesian network features ayesian network features representing relevancerepresenting relevance
Markov Blanket (Markov Blanket (subsub)Graphs (MBGs))Graphs (MBGs)
(1) parents of the node(1) parents of the node
(2) its children(2) its children
(3) parents of the children(3) parents of the children
Markov BlanketMarkov Blanket Set Sets (MBs (MBSSs)s) the set of nodes which the set of nodes which
probabilisticallyprobabilistically isolate isolate the target from the restthe target from the rest of the model of the model
Markov Blanket Membership (MBM)Markov Blanket Membership (MBM) pairwise relationshippairwise relationship
9
GA-to-BNGA-to-BN
(model-based) pairwise association(model-based) pairwise association Markov Blanket Memberhsips (MBM)Markov Blanket Memberhsips (MBM)
MultivariaMultivariatte analysis e analysis Markov Blanket Markov Blanket sets (MB)sets (MB)
MultivariaMultivariatte analysise analysis with interactions with interactions Markov Blanket Subgraphs (MBG)Markov Blanket Subgraphs (MBG)
CCausal ausal relations/relations/models models Partially Partially directed directed BayesBayesianian networknetwork (PDAG) (PDAG)
HierarchyHierarchy DAG=>DAG=>PPDAG=>DAG=>MBGMBG=>=>MBMB=>MBM=>MBM
AdvantagesAdvantages of of GA-to-BN - GA-to-BN - 11
SStrong relevancetrong relevance - direct association: - direct association: Clear semantics and dedicated goal for Clear semantics and dedicated goal for the explicit. faithful representation of the explicit. faithful representation of strongly relevant (e.g. non-transitive)strongly relevant (e.g. non-transitive) relationrelationss
GGraphical representationraphical representation:: It offers It offers better overview of the dependence-better overview of the dependence-independence structure. e.g. about independence structure. e.g. about interactions and conditionalinteractions and conditional relevance.relevance.
MultiMultiple ple targetstargets:: It inherently works for It inherently works for multiple targets.multiple targets.
AdvantagesAdvantages of of GA-to-BN – GA-to-BN – 22
IIncomplete datancomplete data:: It offers integrated It offers integrated management of incomplete data management of incomplete data within Bayesian inference.within Bayesian inference.
Causality:Causality: Model-based causal Model-based causal interpretation of associationsinterpretation of associations
Haplotype level: Haplotype level: Offers integrated Offers integrated approach to haplotype reconstruction approach to haplotype reconstruction and association analysis (assuming and association analysis (assuming unphased genotype data)unphased genotype data)
Challenges of applying BNs in GAS
High computational complexity High sample complexity Bayesian statistics
Bayesian model averaging Feature posterior
GoalGoal: approximat: approximate e the full-scale the full-scale summationsummation (integral) (integral)
A solutionA solution: Metropolis coupled : Metropolis coupled Markov chain Monte Carlo (MCMCMC)Markov chain Monte Carlo (MCMCMC)
fFG G GPfFP:
)()(
Uncertainty in Uncertainty in multivariate analysismultivariate analysis
Entropy of the MBS posteriors
0
2
4
6
8
10
12
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
20_full
50_full
100_full
116
250
Entropy of the MBS posteriors
0
2
4
6
8
10
12
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
20_full
50_full
100_full
116
250
Advantages of the Bayesian framework
Automated Automated correction for correction for “multiple “multiple testing”testing” The measure of uncertainty at a given level The measure of uncertainty at a given level
automatically indicates its applicabilityautomatically indicates its applicability Prior incorporationPrior incorporation: better prior incorporation : better prior incorporation
both at parameter and structural levels.both at parameter and structural levels. Post fusion:Post fusion: better semantics for the better semantics for the
construction of meta probabilistic knowledge construction of meta probabilistic knowledge basesbases
Normative uncertainty for model properties (cf. bootstrap)
The basis for comparison
Our approach is a model based exploration of the underlying
structure(note: multiple targets, causal and
direct aspects)
≠Prediction of class labels
Comparison of GAS toolsComparison of GAS toolsDedicated GAS toolsDedicated GAS tools
BEAMBEAM BIMBAMBIMBAM SNPAssocSNPAssoc SNPMstatSNPMstat PowermarkerPowermarker
General purpose General purpose FSS toolsFSS tools
MDRMDR Causal ExplorerCausal Explorer
……
moderate number of clinical variables (in the moderate number of clinical variables (in the range of 50) range of 50)
hundreds of genotypic SNP variables for each hundreds of genotypic SNP variables for each patient patient
thousands of gene expression measurements thousands of gene expression measurements
Asthma Asthma Complex disease mechanismComplex disease mechanism Half of the patients do not respond well to current Half of the patients do not respond well to current
treatmentstreatments Unknown pathways in the asthmatic processUnknown pathways in the asthmatic process
Application domain: The Application domain: The genomic background of genomic background of
asthnaasthna
Evaluation on an artificial data set
Artificial model based on a real-world domain: the genomic background of asthma
The real data set consists of: 113 SNPs 1117 samples
The reference model
Reference MBG
Asthma
SNP23
SNP69
SNP110
SNP27
SNP11
SNP64
SNP109
SNP17
SNP95
SNP61
SNP81
Results - 1
Software (Parameters) Sensitivity Specificity Accuracy
BMLA (CH) 1 0.99 0.99115044
BMLA (BD) 0.92307692 1 0.99115044
HITON MB (k=1) 0.76923077 0.98 0.95575221
HITON MB (k=2) 0.76923077 0.99 0.96460177
HITON MB (k=3) 0.69230769 0.99 0.95575221
MDR – TurF 0.61538462 0.97 0.92920354
MDR – Relief 0.53846154 0.96 0.91150442
interIAMB (MI) 0.46153846 0.96 0.90265487
Results – 2.Results – 2.
Results – 3. Results – 3.
SummarySummary General BN repreGeneral BN representation is feasible sentation is feasible
and gives superior performance for and gives superior performance for PGASPGAS
Bayesian statistics allows the Bayesian statistics allows the quantification of applicability of BNsquantification of applicability of BNs
Special extensions are necessary forSpecial extensions are necessary for Multiple targetsMultiple targets Combined discovery of relevance and Combined discovery of relevance and
interactions (MBM, MBS, MBG) interactions (MBM, MBS, MBG) Scalable multivariate analysis (k-MBS concept) Scalable multivariate analysis (k-MBS concept) Feature aggregationFeature aggregationAntal et al.: A Bayesian View of Challenges in Feature Antal et al.: A Bayesian View of Challenges in Feature
Selection: Multilevel Analysis, Feature Aggregation, Selection: Multilevel Analysis, Feature Aggregation, Multiple Targets, Redundancy and InteractionMultiple Targets, Redundancy and Interaction , JMLR , JMLR Workshop and Conference ProceedingsWorkshop and Conference Proceedings
Future work
Specific local models (GA –specific local models)
Integrated missing data management and GA analysis (cf. imputation)
Noisy genotyping probabilistic data (see poster)
Integrated haplotype reconstruction (see poster) Integrated study design and analysis (see
poster) Scaling computation up to ~1000 variables