predictive analysis of metabolomics data in …...predictive analysis of metabolomics data in...
TRANSCRIPT
PREDICTIVE ANALYSIS OF METABOLOMICS DATA IN CHRONIC KIDNEY DISEASE USING A SUBGROUP DISCOVERY ALGORITHM
Margaux Luck1, Gildas Bertho2,3,4, Eric Thervet2,5, Philippe Beaune2,6,7, François d’Ormesson1, Mathilde Bateson1, Anastasia Yartseva1, Cécilia Damon1, Nicolas Pallet2,4,6,7
1 Hypercube Institute, Paris, France; 2 Paris Descartes University, Paris, France; 3 CNRS UMR 8601, Paris, France; 4 Magnetic Nuclear Resonance Platform “MetaboParis-Santé”, Paris, France; 5 Renal Division, Georges PompidouEuropean Hospital, Paris, France; 6 INSERM U1147, Paris, France; 7 Clinical Chemistry Department, Georges Pompidou European Hospital, Paris, France
CONTEXT
Chronic kidney disease (CKD) Renal function decrease : decrease of glomerular filtration rate (GFR) Often detected quite late & irreversible complications 1.7 to 2.5 million patients Heavy and invasive treatment (dialysis, graft)
MATERIALS AND METHODS
RESULTS
eGFR ≥ 60 ml/min/1.73 m2
low to mild CKD24 subjects
eGFR < 60 ml/min/1.73 m2
moderate to established CKD86 subjects
FourierTransform
Frequency spectrumTemporal spectrum
Supervised analysis
2) Feature selection
b) Rule mining
3) Predictive model
Normalized Mutual Information (nMI)Measures of all types of features’ mutual dependence.
*eGFR is for estimated glomerular filtration rate.
ObjectiveIdentification of the metabolomic profiles predictive of CKD stagesthrough an exhaustive local data mining approach with rule generation.
Dataset : 110 subjects (Features : urine 1H NMR spectra, Target : CKD stages, 2 modalities)
1H NMR
1) Dimension reduction
High dimension32K points
- Equidistant binning of 0.04 ppm (AUC)- Discretization of the features with a 10 bin quantization
Feature selection
Rule mining
Predictive model
References:[1] Loucoubar C et al. PLoS One (2011) 6(9):e24085[2] N Pallet et al. Kidney International (2014) 85(5):1239-40
Supervised feature selection.Features (blue points) according to : χ2 test : -log(p-value)≥3 => Xχ2
nMI ≥ min[nMI(Xχ2)] = 0.14Number of features selected = 36
a) Data mining
Data mining
CKD Stages
fR
Model of prediction based on Rule generation
Features : signal acquisition and pre-processing
Target : CKD Stages
Complex mixture of metabolites
Overlaps of some peaks
Our local data mining approach with rule generation combined withlogistic regression supports the discriminant metabolites obtained ina previous study [2].
It also provides information on the distribution of the CKD stageswith respect to range of spectral values.
Perspectives Include multi-source dataset Refine the GFR classes Make a predictive model on subgroups defined by rules (i.e. local
predictive model)
z-scores and normalized lift measure of significative 1D rules generated for thedescriptive features 3.56 and 2.72 ppm for both output modalities GFR < 60 andGFR ≥ 60 ml/min/1.73 m2. A segment is a rule characterized by its z-score (y-axis),normalized lift level (colorscale), and its covered variable values (x-axis). The moreinteresting the rule is the redder and the higher on the y-axis it will be.Interpretation: Bucket 3.56 : number of subjects eGFR < 60 when AUC values.
Bucket 2.72 : Opposite effect.
Pearson’s chi-squared test (χ2)Statistical test of goodness of fit and independence of the observeddistribution of 2 discrete variables with respect to a theoreticaldistribution.
Oi and Ei are respectively the observed andexpected frequencies.r and c are resp. nb of rows and columns inthe contingency table.
p(x), p(y), p(x,y) are resp. X-marginal, Y-marginal, jointprobability distribution.H is for entropy.
Local distribution of the patients’ subgroups with identicaldegrees of CKD severity on each feature, using HyperCube® [1].
1D rules : {Feature condition} {Target modality}range of values CKD stage
Tests if the target modality proportion inthe rule (p) and the total population (p0)are different (z-score ≥ 1.96).
CONCLUSIONRules quality measures
Measures the over-density (Lift>1).
OXOXXXOOXOXOOOOXOXXO
Dis
cret
izat
ion
in 1
0 b
ins
Values
ppm features
OXOOOXXOOXOXOOOOXXXX
XOOXOXOOOOXXXXOXOOOX
XOOXOXOXOXOOOXOOOXXO
OOXXXOOXOXOXOXOOOXOX
OXOOXOOXOXOXOOOXXOXX
XOOXOXOOXOXXXXOXOXOX
XOOXOXOXOXOOOXOOOXXX
OOXXXOOXOXOXOXOOOXOX
OXOOXOOXOXOXOOOXOOXX
O eGFR < 60
X eGFR ≥ 60
10 -
9 -
8 -
7 -
6 -
5 -
4 -
3 -
2 -
1 -
1D rules
- Data split into a train (80%) and a test (20%) set- Regularization parameter optimization (10 folds CV)- Classification score (Leave One Out, weighted average f1 measure) - Permutation test, 100 runs
Binary class L2 penalized logistic regression
Model evaluation:
Classification score : 0.79 (p-value : 0.001)
The 10 features selected with ourmethod (except the bucket 2.48)were in the first 16 features obtainedwith the Orthogonal Partial LeastSquare (O-PLS) model [2].
Our method provided informationon the distribution of the targetmodalities with respect to range ofspectral values (buckets AUC).
Metabolomic profiles of the CKDstages : GFR => [citrate](metabolic disorders) ; [dimethylsulfone] (clearance withoutcompensation) ; [trigonelline] (protection against oxidative stressand apoptosis) …
Increase the number of subjectsyi is the target modality, Xi the explanatory features, w the weight and C the regression coefficient.
p-value=0.05
18 featureswith χ2
36 featuresWith nMI
Relevant generated 1D rules were based on 10 out of the 36 previously selected features.