Learning classifiers from discretized expression quantitative trait loci
A. Masegosa, M. M. Abad-Grau, S. Moral and F. Matesanz. CITIC, Universidad de Granada & I. P. López Neyra, CSIC, Granada, Spain
Outline
Introduction
Methods: Classification Algorithms, SNPs Processing
Data Sets
Experimental Results
Conclusions & Future Works
Introduction Genetic variants and gene expression
Genotypes and mRNA transcript levels. Basis to understand complex diseases [18].
Gene variants modify the expression of some genes (eQTLs). Hard to identify. Linkage disequilibrium with the real cause [23].
Associations between single SNPs and gene expression [7,21]. No multiple SNPs. Statistical inference and computational problems.
Introduction Our approach
SNP-GeneExpression data association. Pre-discretization of expression data: low expression and high expression.
Alternative statistical inference approach. From regression to classification. Supervised classification machinery.
Different assumptions. Hidden binary variable (non-observable mechanism): SNPs → Hidden Variable → Gene Expression.
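The pre-discretization step can be illustrated with a minimal sketch. The slides do not state which cut-off separates low from high expression, so the median threshold used here is an assumption, and `binarize_expression` and the sample values are hypothetical.

```python
from statistics import median

def binarize_expression(levels, threshold=None):
    """Map continuous expression levels to 'low' / 'high' labels.

    The threshold defaults to the sample median. This is an assumption
    made for this sketch; the slides do not give the cut-off used.
    """
    if threshold is None:
        threshold = median(levels)
    return ["high" if x > threshold else "low" for x in levels]

# Hypothetical expression values for six individuals.
levels = [6.1, 6.3, 6.0, 9.8, 10.2, 9.9]
labels = binarize_expression(levels)
```

Once the labels exist, the regression problem (predict a continuous level) becomes a two-class problem that any supervised classifier can handle.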
Introduction Our approach
Gene HLA-DRB5 (DRB5). Encodes β chains for the DR HLA class II receptor. Associated with susceptibility to immune-related diseases [5].
[Figure: low-expressed vs. high-expressed groups]
Methods Classification algorithms
Classification function: f: X → Y
X is a subset of SNPs.
Y is the output variable: Low expression vs High expression.
Learning Machines: Supervised Classifiers. Learn a function “f” from a set of labeled data samples. Different models: Naïve Bayes, SVM, C4.5…
Evaluate the prediction capacity of a subset of SNPs: If there is prediction capacity then there is association.
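As an illustration of learning “f” from labeled samples, here is a minimal categorical naïve Bayes sketch. It assumes genotypes are coded 0/1/2 and uses Laplace smoothing; the coding, the function names and the toy data are assumptions for this example, not the implementation behind the reported results.

```python
import math
from collections import Counter

def train_naive_bayes(X, y, alpha=1.0, n_values=3):
    """Fit a categorical naive Bayes model on genotypes coded 0/1/2."""
    class_counts = Counter(y)
    # counts[c][j][v] = number of class-c samples with value v at SNP j
    counts = {c: [Counter() for _ in X[0]] for c in class_counts}
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            counts[c][j][v] += 1
    return class_counts, counts, alpha, n_values

def predict_proba(model, row, positive="high"):
    """Posterior probability of the `positive` class for one individual."""
    class_counts, counts, alpha, n_values = model
    if positive not in class_counts:
        return 0.0
    n = sum(class_counts.values())
    log_post = {}
    for c, nc in class_counts.items():
        lp = math.log(nc / n)  # class prior
        for j, v in enumerate(row):
            # Laplace-smoothed conditional probability of genotype v
            lp += math.log((counts[c][j][v] + alpha) / (nc + alpha * n_values))
        log_post[c] = lp
    top = max(log_post.values())
    norm = sum(math.exp(lp - top) for lp in log_post.values())
    return math.exp(log_post[positive] - top) / norm

# Toy data: two SNPs, six individuals (hypothetical values).
X = [[0, 2], [0, 1], [0, 0], [2, 1], [2, 2], [1, 0]]
y = ["low", "low", "low", "high", "high", "high"]
model = train_naive_bayes(X, y)
p_carrier = predict_proba(model, [2, 1])      # 0.75
p_non_carrier = predict_proba(model, [0, 1])  # 0.2
```

In this toy set only the first SNP separates the classes, so the posterior moves with it while the second SNP contributes nothing.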
Methods SNPs Processing
Genotypes from Chromosome 6. DRB5 is on this chromosome. Cis association.
SNPs grouped in blocks of low recombination. SNPs with high LD among them. Pairwise computations of confidence intervals of LD [6].
Analyze association between DRB5 expression and: single SNPs; blocks of SNPs.
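A simplified sketch of grouping adjacent SNPs by LD follows. It swaps in a point estimate (squared genotype correlation, r²) and a fixed threshold in place of the confidence-interval procedure of [6], so it only illustrates the idea; `genotype_r2`, `group_into_blocks` and the 0.8 threshold are assumptions.

```python
def genotype_r2(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)

def group_into_blocks(snps, threshold=0.8):
    """Start a new block whenever r^2 with the previous SNP drops below threshold."""
    blocks = [[0]]
    for j in range(1, len(snps)):
        if genotype_r2(snps[j - 1], snps[j]) >= threshold:
            blocks[-1].append(j)
        else:
            blocks.append([j])
    return blocks

# Four hypothetical SNPs typed on six individuals.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 1, 0, 2, 1, 0],
    [0, 0, 1, 2, 2, 1],
]
blocks = group_into_blocks(snps)  # [[0, 1, 2], [3]]
```

Each block is a list of SNP indices; a block (or a single SNP) can then serve as the input X of a classifier.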
Data Set
107 unrelated individuals (parents). Yoruba (Nigeria) population
6593 SNPs from Chromosome 6 345 non-overlapping blocks of low recombination
[Figure: number of SNPs per block, by block ID]
Results All Blocks
Classification Models (predict the binarized expression of DRB5): Naïve Bayes, C4.5 and SVM
Regression Models (predict the continuous expression of DRB5): SVM-Reg [20] & Gaussian processes [25]
Evaluation: Train models with 90% of the data, test on the other 10%, and repeat (10-fold cross-validation).
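The evaluation loop can be sketched as below: a 10-fold split of the sample indices and an AUC computed from pairwise comparisons of scores. The helper names, the random seed and the unstratified split are assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k disjoint test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def auc(scores, labels):
    """Area under the ROC curve; labels are 1 (high) or 0 (low).

    Equals the fraction of (high, low) pairs in which the high-expression
    sample gets the larger score; ties count as one half.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

folds = k_fold_indices(107)  # 107 individuals, as in the data set
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # 1.0
partial = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])  # 0.75
```

Each fold serves once as the test set while the remaining nine folds train the model; pooling the test-set scores gives one AUC per SNP or block.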
Results SNPs with maximum prediction capacity
Table: SNPs with perfect prediction (AUC=1.0)
Histogram: homozygous mutant allele (left bar), heterozygous (central bar) and homozygous wild type (right bar). High Expression (red) and Low Expression (blue).
Conclusions & Future Works By discretizing gene expression:
GeneExpression-SNPs associations with classification learning. Simplify the hypothesis: low vs high expression. Many variables (relevant, noisy, redundant) can be considered.
Gene DRB5 has been studied with the YRI population. Perfect prediction (AUC=1.0) of DRB5 expression by some SNPs.
Future Works: Automated discretization approach (Gaussian mixture model). Extend this analysis to other genes.
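One way the automated discretization could look is a two-component, one-dimensional Gaussian mixture fitted by EM. This is only a sketch of the general technique under simple assumptions (two components, initialization from the lower and upper halves of the sorted data, hypothetical function names), not the authors' planned method.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_two_gaussians(xs, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture with EM."""
    n = len(xs)
    ordered = sorted(xs)
    lower, upper = ordered[: n // 2], ordered[n // 2:]
    mu = [sum(lower) / len(lower), sum(upper) / len(upper)]
    grand_mean = sum(xs) / n
    var = [max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-6)] * 2
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in xs:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s] if s > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return w, mu, var

def discretize(xs):
    """Label each value by the mixture component that most likely produced it."""
    w, mu, var = fit_two_gaussians(xs)
    high = 0 if mu[0] > mu[1] else 1  # component with the larger mean
    labels = []
    for x in xs:
        p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in (0, 1)]
        labels.append("high" if p[high] > p[1 - high] else "low")
    return labels

# Hypothetical expression values with two well-separated groups.
gmm_labels = discretize([6.1, 6.3, 6.0, 9.8, 10.2, 9.9])
```

Unlike a fixed cut-off, the mixture places the boundary where the two fitted densities cross, so it adapts to unequal group sizes and spreads.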