Representation of metabolomic data with wavelets

Post on 11-May-2015


DESCRIPTION

June 5th, 2009 : Representation of metabolomic data with wavelets, Groupe de travail BioPuces, INRA d’Auzeville.

TRANSCRIPT

Representation of metabolomic data with wavelets

Nathalie Villa-Vialaneix
http://www.nathalievilla.org

Toulouse School of Economics

Workgroup BioPuces, INRA de Castanet
June 5th, 2009

BioPuces (05/06/09) Nathalie Villa Metabolomic data 1 / 16

Outline

1 Database presentation

2 Wavelet representation

3 Perspective of work


Database presentation

Basics about the database

The database was provided by Alain Paris (INRA) and consists of metabolomic recordings (¹H NMR spectra) of mouse urine: 950 variables from 0.505 ppm to 9.995 ppm.

The baseline has been removed and the peaks have been aligned.
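As a small illustration of what "950 variables from 0.505 ppm to 9.995 ppm" implies, the sketch below rebuilds the chemical-shift axis. It assumes the 950 variables lie on a uniform grid, which the slides do not state explicitly; the names `ppm_axis` and `step` are illustrative only.

```python
# Hypothetical reconstruction of the chemical-shift axis.
# Assumption (not stated in the slides): the 950 variables are equally spaced.
N_VARIABLES = 950
PPM_MIN, PPM_MAX = 0.505, 9.995

# Spacing between two consecutive variables on a uniform grid.
step = (PPM_MAX - PPM_MIN) / (N_VARIABLES - 1)
ppm_axis = [PPM_MIN + i * step for i in range(N_VARIABLES)]

print(f"step = {step:.4f} ppm")
print(f"first = {ppm_axis[0]:.3f} ppm, last = {ppm_axis[-1]:.3f} ppm")
```

Under this assumption each variable would correspond to a bucket of about 0.01 ppm.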


Purpose of the work

Study the effects of the ingestion of Hypochoeris radicata (HR) on the metabolism: the inflorescences of this plant are known to be responsible for a horse disease, Australian stringhalt.

Since it is hard to obtain dozens of horses to sacrifice, the experiments were conducted on 72 mice.


Description of the experiment

72 mice, split according to:

• 2 sexes: 36 males, 36 females;
• 3 HR doses: 0 (control): 24 mice, 3%: 24 mice, 9%: 24 mice;
• 3 sacrifice dates: 8th day: 24 mice, 15th day: 24 mice, 21st day: 24 mice.

⇒ 18 groups.
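The count of 18 groups is simply the product of the three factors (2 × 3 × 3). A minimal sketch that enumerates them is given below; the labels are illustrative, and the figure of 4 mice per group holds only if the design is fully balanced, which the marginal counts suggest but the slides do not state.

```python
from itertools import product

# Factor levels as listed on the slide (labels are illustrative).
sexes = ["male", "female"]
doses = ["0 (control)", "3%", "9%"]
sacrifice_days = [8, 15, 21]

# Every combination of sex, dose and sacrifice date defines one group.
groups = list(product(sexes, doses, sacrifice_days))

n_mice = 72
# Assumption: fully balanced design, so every group has the same size.
mice_per_group = n_mice // len(groups)

print(len(groups), "groups,", mice_per_group, "mice per group if balanced")
```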


Measurement days

The urine was collected on the following days:

Day                  0   1   4   8   11  15  18  21
Nb of observations   68  68  68  66  46  44  19  18

For each mouse, from 2 to 22 measurements were made.

In conclusion: 397 observations for 950 variables.
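The total of 397 observations can be checked directly from the per-day counts. The dictionary below only transcribes the table; the variable names are illustrative.

```python
# Number of urine samples available on each collection day (from the table).
observations_per_day = {0: 68, 1: 68, 4: 68, 8: 66, 11: 46, 15: 44, 18: 19, 21: 18}

total_observations = sum(observations_per_day.values())
n_variables = 950

# Shape of the resulting data matrix: one row per spectrum, one column per variable.
data_shape = (total_observations, n_variables)
print(data_shape)
```

The drop after days 8 and 15 is consistent with 24 mice being sacrificed at each of those dates.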


Wavelet representation

Basic principle of wavelets

For a given integer J, the spectra can be expressed at level J as:

$$
f(x) = \underbrace{\sum_k \alpha_k\, 2^{-J/2}\, \Psi(2^{-J}x - k)}_{\text{Trend: based on the father wavelet } \Psi} + \underbrace{\sum_{j=1}^{J} \sum_k \beta_{jk}\, 2^{-j/2}\, \Phi(2^{-j}x - k)}_{\text{Details at levels } 1,\dots,J\text{: based on the mother wavelet } \Phi}
$$
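To make the trend/details split concrete, here is a minimal sketch using the Haar wavelet, chosen only because it can be written in a few lines; the slides do not say which wavelet family was used. The function names are illustrative. The returned `alpha` plays the role of the trend coefficients α_k at level J and `beta[j - 1]` the role of the detail coefficients β_jk at level j.

```python
import math

S = 1 / math.sqrt(2)  # normalisation of the orthonormal Haar filters

def haar_step(signal):
    """One level of the Haar transform: returns (trend, details)."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    trend = [S * (a + b) for a, b in pairs]
    details = [S * (a - b) for a, b in pairs]
    return trend, details

def haar_inverse_step(trend, details):
    """Undo one level: interleave the two values rebuilt from each pair."""
    signal = []
    for t, d in zip(trend, details):
        signal.extend([S * (t + d), S * (t - d)])
    return signal

def haar_decompose(signal, levels):
    """Return (trend at the coarsest level, [details at level 1, ..., level `levels`])."""
    trend, all_details = list(signal), []
    for _ in range(levels):
        trend, details = haar_step(trend)
        all_details.append(details)
    return trend, all_details

def haar_reconstruct(trend, all_details):
    """Rebuild the signal from the coarsest trend and all detail levels."""
    signal = list(trend)
    for details in reversed(all_details):
        signal = haar_inverse_step(signal, details)
    return signal

# Toy "spectrum" of dyadic length 8, decomposed at level J = 2.
f = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
alpha, beta = haar_decompose(f, levels=2)
f_rebuilt = haar_reconstruct(alpha, beta)
print(len(alpha), [len(b) for b in beta])
```

Summing the trend and all detail levels recovers the original values up to rounding, which is what the decomposition formula expresses.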


Hierarchical decomposition

We add 74 zero values at the end of the spectra to obtain a dyadic discrete sampling.

Original data: f observed at t_1, ..., t_1024 (equally spaced)
        ↓            ↘
  Level 1 trend      Level 1 details
        ↓            ↘
  Level 2 trend      Level 2 details
       ...
        ↓            ↘
  Level 9 trend      Level 9 details

⇒ At level 9 (the maximum level for a discrete sampling of length 1024), we obtain 1025 coefficients.
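The padding and the pyramid above can be sketched as follows. The 74 zeros come from 950 + 74 = 1024 = 2^10. The bookkeeping below assumes a non-redundant transform that halves the trend at each level (as the Haar transform does); the helper name `pad_with_zeros` is illustrative.

```python
SPECTRUM_LENGTH = 950

# Next power of two at or above the spectrum length.
padded_length = 1 << (SPECTRUM_LENGTH - 1).bit_length()
n_zeros = padded_length - SPECTRUM_LENGTH

def pad_with_zeros(spectrum, target_length):
    """Append zeros at the end of the spectrum up to the target length."""
    return list(spectrum) + [0.0] * (target_length - len(spectrum))

# Sizes of the trend and details at each level of the pyramid,
# assuming each level halves the length of the previous trend.
sizes = []
length = padded_length
for level in range(1, 10):
    length //= 2
    sizes.append((level, length, length))  # (level, trend size, details size)

# Coefficients kept at level 9: the last trend plus the details of all levels.
n_coefficients = sizes[-1][1] + sum(details for _, _, details in sizes)

padded = pad_with_zeros([1.0] * SPECTRUM_LENGTH, padded_length)
print(n_zeros, padded_length, sizes[-1], n_coefficients)
```

With this convention the total is 1024 coefficients. The exact count depends on the wavelet family and on the boundary handling of the software used, which the slides do not specify.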


Examples

[Figure: two panels, "Trend" and "Details"]


Denoising

For the coefficients corresponding to details at levels greater than J (with J large enough), a filtering is applied:

$$
c^* = \begin{cases} 0 & \text{if } |c| < 2\sqrt{\log 10}\,\hat{\sigma} \\ c & \text{if } |c| \geq 2\sqrt{\log 10}\,\hat{\sigma} \end{cases}
$$

(Donoho and Johnstone)

Two parameters have to be tuned:
• Which wavelet should be used?
• Which J should be used?

They are chosen as a trade-off between the quality of the reconstruction (how close are the functions rebuilt from the filtered coefficients to the original ones?) and the number of non-zero coefficients.

This is done by minimizing an empirical (ad hoc) quality criterion:

$$
\sqrt{\frac{1}{n}\sum_i \frac{1}{D}\sum_j \left(f_i(t_j) - \hat{f}_i(t_j)\right)^2} + \frac{\text{Nb of non-zero coefficients}}{\text{Nb of coefficients}}
$$
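The thresholding rule and the quality criterion can be sketched as below. Several points are assumptions rather than facts from the slides: "log" is read as the natural logarithm, the noise level `sigma_hat` is taken as given (the slides do not say how it is estimated), and the coefficient counts are passed in as plain numbers. The function names are illustrative.

```python
import math

def hard_threshold(coefficients, sigma_hat):
    """Set to zero every coefficient below 2 * sqrt(log 10) * sigma_hat in absolute value.

    Assumption: `log` is the natural logarithm.
    """
    threshold = 2 * math.sqrt(math.log(10)) * sigma_hat
    return [c if abs(c) >= threshold else 0.0 for c in coefficients]

def quality_criterion(originals, reconstructions, n_nonzero, n_coefficients):
    """Root mean squared reconstruction error plus the fraction of non-zero coefficients.

    `originals` and `reconstructions` are lists of spectra (lists of floats)
    of identical shapes: n spectra, each sampled at D points.
    """
    n = len(originals)
    total = 0.0
    for f, f_hat in zip(originals, reconstructions):
        d = len(f)
        total += sum((a - b) ** 2 for a, b in zip(f, f_hat)) / d
    return math.sqrt(total / n) + n_nonzero / n_coefficients

# Toy example: with sigma_hat = 1 the threshold is slightly above 3.
filtered = hard_threshold([0.5, -4.0, 3.0, 3.1], sigma_hat=1.0)
score = quality_criterion([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 2.0]],
                          n_nonzero=1, n_coefficients=4)
print(filtered, score)
```

A candidate pair (wavelet, J) would then be scored by filtering, reconstructing, and evaluating `quality_criterion`; the pair with the smallest score is retained.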


Final reconstruction of the data

274 non-zero coefficients


Boxplots

[Figure: boxplots of the original coefficients]

[Figure: boxplots of the scaled coefficients (centered by the mean and divided by the standard deviation)]


Perspective of work

Using random forests

The idea is to use random forests both to make predictions and to extract the main coefficients responsible for explaining the target variables.

Proposed regression: the scaled coefficients will be the explanatory variables. The variable of interest could be:

• the dose (either as a number or as a class, which leads to a classification problem);

• the total dose ingested (i.e., the dose multiplied by the number of days of ingestion);

• any other interesting idea?

The idea is then to rebuild the individuals from the main coefficients (setting the others to zero) to see which peaks differ from one group to another.
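The last step, rebuilding a spectrum from a few selected coefficients, can be sketched as follows. This is only an illustration under stated assumptions: it uses the Haar wavelet (not necessarily the one used in the study), and the set `important` is a hypothetical stand-in for the coefficients that a random forest importance ranking would select. All function names are illustrative.

```python
import math

S = 1 / math.sqrt(2)  # normalisation of the orthonormal Haar filters

def haar_decompose(signal, levels):
    """Return (trend, [details at level 1, ..., details at level `levels`])."""
    trend, all_details = list(signal), []
    for _ in range(levels):
        pairs = [(trend[i], trend[i + 1]) for i in range(0, len(trend), 2)]
        all_details.append([S * (a - b) for a, b in pairs])
        trend = [S * (a + b) for a, b in pairs]
    return trend, all_details

def haar_reconstruct(trend, all_details):
    """Rebuild the signal from the coarsest trend and all detail levels."""
    signal = list(trend)
    for details in reversed(all_details):
        signal = [v for t, d in zip(signal, details)
                  for v in (S * (t + d), S * (t - d))]
    return signal

def keep_only(trend, all_details, kept):
    """Zero every coefficient whose (level, position) key is not in `kept`.

    Level 0 denotes the trend; levels 1, 2, ... denote the details.
    """
    new_trend = [c if (0, k) in kept else 0.0 for k, c in enumerate(trend)]
    new_details = [
        [c if (j, k) in kept else 0.0 for k, c in enumerate(details)]
        for j, details in enumerate(all_details, start=1)
    ]
    return new_trend, new_details

# Toy spectrum with one large peak (positions 2-3) and one small peak (position 6).
spectrum = [0.0, 0.0, 5.0, 5.0, 0.0, 0.0, 1.0, 0.0]
trend, details = haar_decompose(spectrum, levels=2)

# Hypothetical selection: one trend coefficient and one level-2 detail coefficient.
important = {(0, 0), (2, 0)}
partial = haar_reconstruct(*keep_only(trend, details, important))
print([round(v, 6) for v in partial])
```

In this toy case the two retained coefficients are enough to recover the large peak while the small one disappears, which is the kind of comparison between groups that the slide has in mind.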
