predictive analysis of metabolomics data in …...predictive analysis of metabolomics data in...

PREDICTIVE ANALYSIS OF METABOLOMICS DATA IN CHRONIC KIDNEY DISEASE USING A SUBGROUP DISCOVERY ALGORITHM

Margaux Luck1, Gildas Bertho2,3,4, Eric Thervet2,5, Philippe Beaune2,6,7, François d’Ormesson1, Mathilde Bateson1, Anastasia Yartseva1, Cécilia Damon1, Nicolas Pallet2,4,6,7

1 Hypercube Institute, Paris, France; 2 Paris Descartes University, Paris, France; 3 CNRS UMR 8601, Paris, France; 4 Magnetic Nuclear Resonance Platform “MetaboParis-Santé”, Paris, France; 5 Renal Division, Georges PompidouEuropean Hospital, Paris, France; 6 INSERM U1147, Paris, France; 7 Clinical Chemistry Department, Georges Pompidou European Hospital, Paris, France

CONTEXT

Chronic kidney disease (CKD) Renal function decrease : decrease of glomerular filtration rate (GFR) Often detected quite late & irreversible complications 1.7 to 2.5 million patients Heavy and invasive treatment (dialysis, graft)

MATERIALS AND METHODS

RESULTS

eGFR ≥ 60 ml/min/1.73 m2

low to mild CKD24 subjects

eGFR < 60 ml/min/1.73 m2

moderate to established CKD86 subjects

FourierTransform

Frequency spectrumTemporal spectrum

Supervised analysis

2) Feature selection

b) Rule mining

3) Predictive model

Normalized Mutual Information (nMI)Measures of all types of features’ mutual dependence.

*eGFR is for estimated glomerular filtration rate.

ObjectiveIdentification of the metabolomic profiles predictive of CKD stagesthrough an exhaustive local data mining approach with rule generation.

Dataset : 110 subjects (Features : urine 1H NMR spectra, Target : CKD stages, 2 modalities)

1H NMR

1) Dimension reduction

High dimension32K points

- Equidistant binning of 0.04 ppm (AUC)- Discretization of the features with a 10 bin quantization

Feature selection

Rule mining

Predictive model

References:[1] Loucoubar C et al. PLoS One (2011) 6(9):e24085[2] N Pallet et al. Kidney International (2014) 85(5):1239-40

Supervised feature selection.Features (blue points) according to : χ2 test : -log(p-value)≥3 => Xχ2

nMI ≥ min[nMI(Xχ2)] = 0.14Number of features selected = 36

a) Data mining

Data mining

CKD Stages

fR

Model of prediction based on Rule generation

Features : signal acquisition and pre-processing

Target : CKD Stages

Complex mixture of metabolites

Overlaps of some peaks

Our local data mining approach with rule generation combined withlogistic regression supports the discriminant metabolites obtained ina previous study [2].

It also provides information on the distribution of the CKD stageswith respect to range of spectral values.

Perspectives Include multi-source dataset Refine the GFR classes Make a predictive model on subgroups defined by rules (i.e. local

predictive model)

z-scores and normalized lift measure of significative 1D rules generated for thedescriptive features 3.56 and 2.72 ppm for both output modalities GFR < 60 andGFR ≥ 60 ml/min/1.73 m2. A segment is a rule characterized by its z-score (y-axis),normalized lift level (colorscale), and its covered variable values (x-axis). The moreinteresting the rule is the redder and the higher on the y-axis it will be.Interpretation: Bucket 3.56 : number of subjects eGFR < 60 when AUC values.

Bucket 2.72 : Opposite effect.

Pearson’s chi-squared test (χ2)Statistical test of goodness of fit and independence of the observeddistribution of 2 discrete variables with respect to a theoreticaldistribution.

Oi and Ei are respectively the observed andexpected frequencies.r and c are resp. nb of rows and columns inthe contingency table.

p(x), p(y), p(x,y) are resp. X-marginal, Y-marginal, jointprobability distribution.H is for entropy.

Local distribution of the patients’ subgroups with identicaldegrees of CKD severity on each feature, using HyperCube® [1].

1D rules : {Feature condition} {Target modality}range of values CKD stage

Tests if the target modality proportion inthe rule (p) and the total population (p0)are different (z-score ≥ 1.96).

CONCLUSIONRules quality measures

Measures the over-density (Lift>1).

OXOXXXOOXOXOOOOXOXXO

Dis

cret

izat

ion

in 1

0 b

ins

Values

ppm features

OXOOOXXOOXOXOOOOXXXX

XOOXOXOOOOXXXXOXOOOX

XOOXOXOXOXOOOXOOOXXO

OOXXXOOXOXOXOXOOOXOX

OXOOXOOXOXOXOOOXXOXX

XOOXOXOOXOXXXXOXOXOX

XOOXOXOXOXOOOXOOOXXX

OOXXXOOXOXOXOXOOOXOX

OXOOXOOXOXOXOOOXOOXX

O eGFR < 60

X eGFR ≥ 60

10 -

9 -

8 -

7 -

6 -

5 -

4 -

3 -

2 -

1 -

1D rules

- Data split into a train (80%) and a test (20%) set- Regularization parameter optimization (10 folds CV)- Classification score (Leave One Out, weighted average f1 measure) - Permutation test, 100 runs

Binary class L2 penalized logistic regression

Model evaluation:

Classification score : 0.79 (p-value : 0.001)

The 10 features selected with ourmethod (except the bucket 2.48)were in the first 16 features obtainedwith the Orthogonal Partial LeastSquare (O-PLS) model [2].

Our method provided informationon the distribution of the targetmodalities with respect to range ofspectral values (buckets AUC).

Metabolomic profiles of the CKDstages : GFR => [citrate](metabolic disorders) ; [dimethylsulfone] (clearance withoutcompensation) ; [trigonelline] (protection against oxidative stressand apoptosis) …

Increase the number of subjectsyi is the target modality, Xi the explanatory features, w the weight and C the regression coefficient.

p-value=0.05

18 featureswith χ2

36 featuresWith nMI

Relevant generated 1D rules were based on 10 out of the 36 previously selected features.

predictive analysis of metabolomics data in …...predictive analysis of metabolomics data in...

Documents