Computational Discrete Mathematics and Statistics for Molecular Array Data

Bill Shannon

Washington University School of Medicine

Molecular Biology

“How Genes Work”, http://www.nigms.nih.gov

Microarrays

[Figure: messenger RNA levels* of genes A, B, and C in a normal cell and in a tumor cell]

*Brenner, Jacob, Meselson (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190:576-581.

Microarrays (Leukemia PPG): 35 Probes Selected from ~50,000

Array Data Present New Data Analysis Challenges (Curse of Dimensionality)

• Inaccuracy, or error, of a model grows very quickly as the number of dimensions increases

– sparseness (describing the data becomes impossible)

– model complexity (too many interaction terms, non-linear effects, etc. to consider)

– random multicollinearity (spurious correlations)
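
The spurious-correlation point can be illustrated with a small simulation. This is a sketch that is not part of the original slides: the sample size, feature count, and seed are arbitrary assumptions chosen so that P >> N. It draws pure-noise features and an unrelated random outcome, then reports the largest absolute Pearson correlation found.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n_samples=20, n_features=2000, seed=0):
    """Largest |r| between a random outcome and pure-noise features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Every feature is independent noise, so any correlation is spurious.
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    print(round(max_spurious_correlation(), 3))
```

With only 20 samples and 2,000 noise features, the best-looking feature will typically correlate strongly with the outcome even though no real relationship exists.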

Regression (Curse of Dimensionality)

• y = f(x) + error

• sparseness = little local signal

– model parameters not estimated accurately

– unstable models over-fit data (not generalizable)

• Non-parametric methods (e.g., CART, neural nets)

– require a lot of model searching

– use up degrees of freedom rapidly

– little or no information left to determine significance

Cluster Analysis (Curse of Dimensionality)

• Find structure in data

• Many cluster results with same goodness-of-fit

• Deciding among the models is impossible.

Classification Models (Curse of Dimensionality)

• Predict group membership (e.g., tumor versus normal)

• Three broad categories

– geometric methods (discriminant analysis, CART)

– probabilistic methods (Bayesian)

– algorithmic methods (neural networks, k-NN)

• Require training/validation datasets

Other Methods (Curse of Dimensionality)

• Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting)

• Multiple testing adjustment such as false discovery rate or permutation testing
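
As one concrete instance of a false discovery rate adjustment, the sketch below implements the Benjamini-Hochberg step-up procedure. The slides do not say which FDR procedure was intended, so the choice of Benjamini-Hochberg, the function name, and the default level q = 0.05 are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected
```

For example, with the ten p-values 0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216 and q = 0.05, only the first two fall under their rank-dependent thresholds (0.005 and 0.010), so only those two are rejected.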

Mantel Statistics

• Transform standard N×P data matrices into N×N matrices of pairwise distances or similarities between subjects

• Instead of analyzing the N×P data matrix (P >> N), analyze the N×N matrix and avoid the curse of dimensionality
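
The transformation from an N×P data matrix to an N×N distance matrix can be sketched as follows. Euclidean distance is assumed here as the pairwise measure; the function name and the list-of-rows layout are illustrative choices, not something the slides specify.

```python
import math


def euclidean_distance_matrix(data):
    """Turn an N x P data matrix (a list of N rows) into an N x N distance matrix."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            dist[i][j] = dist[j][i] = d  # the matrix is symmetric
    return dist


if __name__ == "__main__":
    samples = [[0, 0, 0], [3, 4, 0], [3, 4, 12]]  # N = 3 samples, P = 3 genes
    for row in euclidean_distance_matrix(samples):
        print(row)
```

However large P becomes, the result always has N rows and N columns, which is what makes the later analysis independent of the number of genes.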

Mantel Statistics

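Distance between samples $i$ and $i'$ over all $P$ genes:

\[
d_{i,i'} = \sqrt{(x_{i,1} - x_{i',1})^2 + (x_{i,2} - x_{i',2})^2 + \cdots + (x_{i,P} - x_{i',P})^2}
\]

The $N \times P$ data matrix (samples in rows, genes $G_1, \ldots, G_P$ in columns) is converted to the $N \times N$ distance matrix $D_P$:

\[
\begin{array}{c|cccc}
\text{Sample} & G_1 & G_2 & \cdots & G_P \\ \hline
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,P} \\
2 & x_{2,1} & x_{2,2} & \cdots & x_{2,P} \\
3 & x_{3,1} & x_{3,2} & \cdots & x_{3,P} \\
\vdots & \vdots & \vdots & & \vdots \\
N & x_{N,1} & x_{N,2} & \cdots & x_{N,P}
\end{array}
\;\longrightarrow\;
D_P =
\begin{pmatrix}
0 & & & & \\
d_{1,2} & 0 & & & \\
d_{1,3} & d_{2,3} & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d_{1,N} & d_{2,N} & d_{3,N} & \cdots & 0
\end{pmatrix}
\]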

Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.

Mantel Statistics

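Restricting the columns to a subset of $k$ genes $G_{(1)}, G_{(2)}, \ldots, G_{(k)}$, with $k \ll P$, and applying the same distance calculation gives the $N \times N$ distance matrix $D_{k \ll P}$:

\[
D_{k \ll P} =
\begin{pmatrix}
0 & & & & \\
d_{1,2} & 0 & & & \\
d_{1,3} & d_{2,3} & 0 & & \\
\vdots & \vdots & \vdots & \ddots & \\
d_{1,N} & d_{2,N} & d_{3,N} & \cdots & 0
\end{pmatrix}
\]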

Mantel Statistics

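Signal + noise genes (all $P$ genes) give $D_P$; signal genes only ($k \ll P$ genes) give $D_{k \ll P}$. Both are $N \times N$ matrices of pairwise distances $d_{i,j}$ with zeros on the diagonal.

The Mantel correlation compares corresponding entries of the two matrices over all sample pairs $i < j$:

\[
\rho(D_P, D_{k \ll P}) =
\frac{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)\left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)}
{\sqrt{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)^2 \; \sum_{i<j} \left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)^2}}
\]

where $\bar{d}^{P}$ and $\bar{d}^{k \ll P}$ are the means of the off-diagonal distances in each matrix.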

Mantel Statistics

Correlating $D_P$ with $D_{k \ll P}$ avoids the curse of dimensionality!

A positive Mantel correlation indicates that the genes in $D_{k \ll P}$ contain the same information as the genes in $D_P$

Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genetic Epidemiology 23: 87-96.
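
A minimal sketch of the Mantel correlation between two distance matrices is given below, assuming it is computed as the Pearson correlation of the entries above the diagonal. The function names and the nested-list matrix layout are illustrative assumptions.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def mantel_correlation(d_a, d_b):
    """Pearson correlation between the off-diagonal entries of two distance matrices."""
    x, y = upper_triangle(d_a), upper_triangle(d_b)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    d_full = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
    d_subset = [[0, 2, 8], [2, 0, 6], [8, 6, 0]]  # same structure, rescaled
    print(mantel_correlation(d_full, d_subset))
```

Because the correlation is invariant to rescaling, a gene subset that preserves the relative distances between samples scores close to 1 even if the absolute distances shrink.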

GA-Mantel

• Search algorithm to find signal genes

• Solution representation

– list of genes (10 123 456 798 835 888 923)

– binary vector {0000100110000….00010}

• Each solution maps to a Mantel correlation value

– Assumption: the larger the correlation, the more signal genes in the solution

• Selection keeps solutions with high Mantel correlation

Grefenstette, Thompson, Shannon, and Steinmeyer (2005): Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering 37th Symposium on the Interface. St. Louis, MO

Recombination

Mutation
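
The two solution representations and the two genetic operators can be sketched on binary vectors as follows. The slides do not state which recombination or mutation operators were used, so one-point crossover and bit-flip mutation are assumptions, as are all function names. Note that these operators do not hold the subset length fixed.

```python
import random


def genes_to_mask(gene_indices, n_genes):
    """Convert a list of selected gene indices into a 0/1 vector of length n_genes."""
    mask = [0] * n_genes
    for g in gene_indices:
        mask[g] = 1
    return mask


def mask_to_genes(mask):
    """Convert a 0/1 vector back into the sorted list of selected gene indices."""
    return [i for i, bit in enumerate(mask) if bit]


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length masks at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def bit_flip_mutation(mask, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in mask]


if __name__ == "__main__":
    a = genes_to_mask([0, 1, 2], 8)
    b = genes_to_mask([5, 6, 7], 8)
    child_1, child_2 = one_point_crossover(a, b)
    print(mask_to_genes(child_1), mask_to_genes(bit_flip_mutation(child_2, 0.1)))
```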

Gene Subset Selection

• Given:

– a data set comprising N microarray experiments with g genes

• Find:

– a subset of genes that captures relevant relationships among the experiments

• Goal:

– reduce data for further analysis

– identify meaningful biological markers for diagnosis

Genetic Algorithm

1. Randomly generate an initial population

2. Do until stopping criterion is met:

Select individuals to be parents (biased by fitness).

Produce offspring by recombination/mutation.

Select individuals to die (biased by fitness).

End Do.

3. Return a result.
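
The three numbered steps can be sketched as a small steady-state genetic algorithm over 0/1 vectors. Everything specific here is an assumption made for illustration: binary tournaments for both parent and death selection, one-point crossover, bit-flip mutation, a fixed step count as the stopping criterion, and a caller-supplied `fitness` function standing in for the Mantel correlation.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, n_steps=500, mutation_rate=0.02, seed=0):
    """Steady-state GA over 0/1 vectors; returns (best_individual, best_fitness)."""
    rng = random.Random(seed)

    # 1. Randomly generate an initial population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]

    def tournament(pick_best):
        """Draw two distinct individuals; return the index of the fitter (or weaker) one."""
        i, j = rng.sample(range(pop_size), 2)
        if (scores[i] >= scores[j]) == pick_best:
            return i
        return j

    # 2. Loop until the stopping criterion (here, a fixed number of steps) is met.
    for _ in range(n_steps):
        # Select parents, biased toward high fitness.
        mother = population[tournament(True)]
        father = population[tournament(True)]
        # Produce one offspring by recombination and mutation.
        cut = rng.randint(1, n_bits - 1)
        child = mother[:cut] + father[cut:]
        child = [1 - b if rng.random() < mutation_rate else b for b in child]
        # Select an individual to die, biased toward low fitness, and replace it.
        loser = tournament(False)
        population[loser] = child
        scores[loser] = fitness(child)

    # 3. Return the best individual found.
    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best]


if __name__ == "__main__":
    # Toy fitness: the number of ones in the vector.
    print(run_ga(sum, n_bits=20))
```

Because the individual removed at each step always loses a tournament against a survivor that is at least as fit, the best fitness in the population never decreases under this scheme.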

Fitness Evaluation for Gene Selection

• Calculate $D_P$ using all genes

• For each Subset(k) in current population:

– Calculate $D_{k \ll P}$

– Correlate $D_P$ with $D_{k \ll P}$

• Use Mantel correlation as fitness to select next population of solutions

• Permute to compute p-values

[Figure: each gene column of the $N \times P$ data matrix carries a 0/1 weight (Wt). With every weight set to 1, all $P$ genes enter the distance calculation and give $D_P$. With a weight set to 0, that gene is dropped, and the remaining genes give $D_{k \ll P}$.]
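
The permutation step can be sketched as below. The slides do not describe the permutation scheme, so this assumes the usual Mantel approach of relabelling the samples of one distance matrix (rows and columns together) and counting how often the permuted correlation reaches the observed one. The function names, the default of 199 permutations, and the one-sided p-value are assumptions.

```python
import math
import random


def _offdiag_corr(d_a, d_b, perm):
    """Correlate d_a[i][j] with d_b[perm[i]][perm[j]] over all pairs i < j."""
    n = len(d_a)
    x = [d_a[i][j] for i in range(n) for j in range(i + 1, n)]
    y = [d_b[perm[i]][perm[j]] for i in range(n) for j in range(i + 1, n)]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel_permutation_test(d_a, d_b, n_perm=199, seed=0):
    """Return (observed correlation, one-sided permutation p-value)."""
    rng = random.Random(seed)
    identity = list(range(len(d_a)))
    observed = _offdiag_corr(d_a, d_b, identity)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the samples of d_b
        if _offdiag_corr(d_a, d_b, perm) >= observed:
            hits += 1
    # Count the observed labelling itself so the p-value is never exactly 0.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    points = [0, 1, 3, 7, 15, 31, 63, 127]
    d = [[abs(a - b) for b in points] for a in points]
    print(mantel_permutation_test(d, d))
```

The smallest p-value this estimator can return is 1 / (n_perm + 1), so reporting values such as p < 0.0001 requires on the order of ten thousand permutations.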

GA on Artificial Data

• Simulated data:

– 100 experiments with 10,000 genes

• 100 signal genes

• 9900 noise genes

– Two groups

• Group 1 has signal genes sampled from N(0, 1)

• Group 2 has signal genes sampled from N(1, 1)

• GA Parameters

– population size 200

– generations 200

• Outcome measures (averaged over 10 runs of the GA)

– prevalence (signal, noise) – number of signal and noise genes in GA answer

– correlation (signal, noise) – correlation of best subset distance matrix with the ‘full’ distance matrix

– coverage – number of signal genes identified over all GA runs
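
A data set of this shape could be generated as in the sketch below. Several details are not given on the slide and are assumed here: the two groups are equal in size, the signal genes sit in randomly chosen columns, and the noise genes are N(0, 1) in both groups. The function name and return layout are also illustrative.

```python
import random


def simulate_arrays(n_samples=100, n_genes=10000, n_signal=100, shift=1.0, seed=0):
    """Return (data, groups, signal_genes) for a two-group simulated array data set."""
    rng = random.Random(seed)
    # Assumption: the two groups are equal in size.
    groups = [1 if i < n_samples // 2 else 2 for i in range(n_samples)]
    # Assumption: signal genes occupy randomly chosen columns.
    signal_genes = sorted(rng.sample(range(n_genes), n_signal))
    is_signal = [False] * n_genes
    for g in signal_genes:
        is_signal[g] = True
    data = []
    for group in groups:
        row = []
        for g in range(n_genes):
            # Signal genes: N(0, 1) in group 1 and N(shift, 1) in group 2.
            # Assumption: noise genes are N(0, 1) in both groups.
            mean = shift if (is_signal[g] and group == 2) else 0.0
            row.append(rng.gauss(mean, 1.0))
        data.append(row)
    return data, groups, signal_genes


if __name__ == "__main__":
    data, groups, signal_genes = simulate_arrays()
    print(len(data), len(data[0]), len(signal_genes))
```

Keeping the list of signal gene indices alongside the data is what makes the prevalence and coverage measures computable afterwards.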

GA on Artificial Data

Length = 30

Prevalence: mean number of signal genes = 22.9 (0.7)

76.3% (std 0.53%)

Correlation: mean rho for best subsets = 0.787 (0.009)

p-value < 0.0001

Coverage: total signal genes identified across 10 runs = 65/100

Observation: solutions tend to converge to similar subsets. The same 4 signal genes appear in 90% of runs

GA on Golub Data Set

• Data set: Golub training set (38 x 7129)

• Two groups:

– 27 samples from ALL patients

– 11 samples from AML patients

• GA searched for subsets of fixed length (10 to 50)

• population = 200, generations = 200

• Mantel correlation tends to increase with subset size

Length   Final Mantel Corr   p-value
10       0.926 (0.005)       < 0.00001
20       0.954 (0.004)       < 0.00001
30       0.967 (0.002)       < 0.00001
40       0.975 (0.002)       < 0.00001
50       0.979 (0.002)       < 0.00001

Significant Feature Subsets

[Figure, two panels: clustering of samples using all genes; clustering of samples using 50 genes from GA]

Letting GA Select Subset Length

• Data set: Golub training set (38 x 7129)

• GA searched over variable length subsets (min=5 max=50)

• Fitness penalty = d * length / 50

• population = 200, generations = 200

• Tradeoff between length of subsets and correlation score

Length Penalty d   Pop Final Len   Best Final Length   Final Mantel Corr
0.00               48.6            49.0                0.979 (0.001)
0.25               17.4            25.0                0.954 (0.005)
0.50               10.7            16.1                0.939 (0.009)
1.00               7.7             10.6                0.922 (0.009)
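
The length penalty can be written as a one-line adjustment to the fitness. The slide gives only the penalty term, d * length / 50; that it is subtracted from the Mantel correlation is an assumption, as are the function and parameter names.

```python
def penalized_fitness(mantel_corr, subset_length, d, max_length=50):
    """Mantel correlation minus the length penalty d * length / max_length."""
    penalty = d * subset_length / max_length
    return mantel_corr - penalty


if __name__ == "__main__":
    # Values taken from the d = 0.25 row of the table above.
    print(penalized_fitness(0.954, 25.0, 0.25))
```

Under this assumption, a subset of length 25 with correlation 0.954 and d = 0.25 would score 0.954 - 0.25 * 25 / 50 = 0.829, so larger values of d make long subsets progressively less competitive.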

Data Reduction

• Observation: GA appears to repeatedly converge to same regions of feature space

– In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets

• GA can also be used to find feature subsets that minimize rho

• pop = 200

• length = 50

• data set = Golub

• GA finds subsets with rho = 0 within 50 generations

GA in Experimental Data Analysis

• Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia)

• T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs

• Regulatory T-cells (Treg) suppress immune response

• Choi and DiPersio are studying the genetic mechanisms of Treg regulation

Mouse Array Experiment

GROUP   TREATMENT            ARRAYS
1       Naïve Treg           dec1, dec5
2       Activated Treg       dec2, dec6, dec10
3       PBST (Control)       dec3, dec7, dec11
4       Decitabine treated   dec4, dec8, dec12

~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work

Mouse Array Experiment

• Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)

Act+Dec Vs Naïve Vs Control

Naïve+Dec Vs Act Vs Control

Summary

• GA-Mantel effective at identifying signal genes

• Longer gene subsets associated with higher scores

– tradeoff: higher correlations vs. smaller subsets

– requires constraining growth of subsets in GA

• GA effective at identifying noise genes

• GA-Mantel can find genes associated with phenotype

Future Directions

• RFA CA-08-005 (under review)

– Optimize algorithm to improve coverage of solution space

– Multiple solutions

– Combine solutions (weak hierarchies)

• Lung disease R01 (to be submitted)

– Microarrays to identify disease subgroups across the bronchitis/emphysema continuum

Weak Hierarchies

Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.
