computational discrete mathematics and statistics for molecular array data

Computational Discrete Mathematics and Statistics for Molecular Array Data Bill Shannon Washington University School of Medicine

Page 1: Computational Discrete Mathematics and Statistics for Molecular Array Data

Computational Discrete Mathematics and Statistics for

Molecular Array Data

Bill Shannon

Washington University

School of Medicine

Page 2: Computational Discrete Mathematics and Statistics for Molecular Array Data

Molecular Biology

“How Genes Work”, http://www.nigms.nih.gov

Page 3: Computational Discrete Mathematics and Statistics for Molecular Array Data

Microarrays

[Figure: messenger RNA* levels for genes A, B, and C in a normal cell and in a tumor cell]

*Brenner, Jacob, Meselson (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190:576-581.

Page 4: Computational Discrete Mathematics and Statistics for Molecular Array Data

Microarrays (Leukemia PPG)

35 Probes Selected from ~50,000

Page 5: Computational Discrete Mathematics and Statistics for Molecular Array Data

Array Data Present New Data Analysis Challenges

(Curse of Dimensionality)

• Inaccuracy, or error, of a model becomes large very fast

– sparseness (describing the data becomes impossible)

– model complexity (too many interaction terms, non-linear effects, etc. to consider)

– random multicollinearity (spurious correlations)
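The spurious-correlation point can be illustrated with a small simulation. This is a minimal sketch, not taken from the talk: the function names, the sample size of 20, and the feature counts are all illustrative assumptions. It draws a random outcome and pure-noise features, then reports the largest absolute Pearson correlation found.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_abs_noise_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a random outcome and n_features noise features.

    Hypothetical helper: every feature is independent N(0, 1) noise, so any
    large correlation it reports is spurious by construction.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(x, y)))
    return best


if __name__ == "__main__":
    for p in (5, 500, 5000):
        print(p, round(max_abs_noise_correlation(20, p), 3))
```

With the sample size held fixed, the reported maximum can only grow as more noise features are scanned, which is the P >> N situation of array data.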

Page 6: Computational Discrete Mathematics and Statistics for Molecular Array Data

Regression (Curse of Dimensionality)

• y = f(x) + error

• sparseness = little local signal

– model parameters not estimated accurately

– unstable models over-fit data (not generalizable)

• Non-parametric methods (e.g., CART, neural nets)

– require a lot of model searching

– use up degrees of freedom rapidly

– little or no information left to determine significance

Page 7: Computational Discrete Mathematics and Statistics for Molecular Array Data

Cluster Analysis (Curse of Dimensionality)

• Find structure in data

• Many cluster results with same goodness-of-fit

• Deciding among the models is impossible.

Page 8: Computational Discrete Mathematics and Statistics for Molecular Array Data

Classification Models (Curse of Dimensionality)

• Predict group membership (e.g., tumor versus normal)

• Three broad categories

– geometric methods (discriminant analysis, CART)

– probabilistic methods (Bayesian)

– algorithmic methods (neural networks, k-NN)

• Require training/validation datasets

Page 9: Computational Discrete Mathematics and Statistics for Molecular Array Data

Other Methods (Curse of Dimensionality)

• Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting)

• Multiple testing adjustment such as false discovery rate or permutation testing
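As one concrete instance of a false discovery rate adjustment, the Benjamini-Hochberg step-up procedure can be sketched as follows. The slide does not say which FDR procedure is meant, so the choice of Benjamini-Hochberg and the function name are assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Assumed procedure (not named on the slide): sort the m p-values, find
    the largest rank k with p_(k) <= alpha * k / m, and reject the k
    hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            largest_rank = rank
    reject = [False] * m
    for index in order[:largest_rank]:
        reject[index] = True
    return reject


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Because the procedure is step-up, a p-value that misses its own threshold is still rejected when a larger p-value passes a later threshold.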

Page 10: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mantel Statistics

• Transform standard NxP data matrices into NxN subject pairwise distances or similarities

• Instead of analyzing NxP data matrix (P >> N) avoid the curse of dimensionality problem and analyze the NxN matrix

Page 11: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mantel Statistics

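$$
d_{i,i'} = \sqrt{(x_{i,1} - x_{i',1})^2 + (x_{i,2} - x_{i',2})^2 + \cdots + (x_{i,P} - x_{i',P})^2}
$$

$$
\begin{array}{c|cccc}
\text{Sample} & G_1 & G_2 & \cdots & G_P \\ \hline
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,P} \\
2 & x_{2,1} & x_{2,2} & \cdots & x_{2,P} \\
3 & x_{3,1} & x_{3,2} & \cdots & x_{3,P} \\
\vdots & \vdots & \vdots & & \vdots \\
N & x_{N,1} & x_{N,2} & \cdots & x_{N,P}
\end{array}
$$

$$
D_P = \begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &   & 0
\end{pmatrix}
$$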

Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.
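The transformation from an N x P data matrix to an N x N matrix of pairwise Euclidean distances can be sketched in a few lines. The function name `distance_matrix` is illustrative. The sketch returns the full symmetric matrix, whereas the slide shows only the upper triangle.

```python
import math


def distance_matrix(X):
    """N x N Euclidean distances between the rows (samples) of an N x P list.

    Entry [i][j] is the square root of the sum over genes of the squared
    difference between sample i and sample j; the diagonal is zero.
    """
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d
    return D


if __name__ == "__main__":
    # Three samples measured on two genes (made-up numbers).
    for row in distance_matrix([[0, 0], [3, 4], [6, 8]]):
        print(row)
```

However many genes P there are, the result has N(N-1)/2 distinct entries, so its size depends only on the number of samples.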

Page 12: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mantel Statistics

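Distances computed from a subset of $k \ll P$ genes, $G_{(1)}, G_{(2)}, \ldots, G_{(k)}$:

$$
\begin{array}{c|cccc}
\text{Sample} & G_{(1)} & G_{(2)} & \cdots & G_{(k)} \\ \hline
1 & x_{1,(1)} & x_{1,(2)} & \cdots & x_{1,(k)} \\
2 & x_{2,(1)} & x_{2,(2)} & \cdots & x_{2,(k)} \\
3 & x_{3,(1)} & x_{3,(2)} & \cdots & x_{3,(k)} \\
\vdots & \vdots & \vdots & & \vdots \\
N & x_{N,(1)} & x_{N,(2)} & \cdots & x_{N,(k)}
\end{array}
$$

$$
D_{k \ll P} = \begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &   & 0
\end{pmatrix}
$$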

Page 13: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mantel Statistics

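$D_P$ (Signal + Noise Genes) and $D_{k \ll P}$ (Signal Genes Only) are both $N \times N$ distance matrices of the form

$$
D = \begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &   & 0
\end{pmatrix}
$$

$$
r(D_P, D_{k \ll P}) = \frac{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)\left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)}{\sqrt{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)^2 \, \sum_{i<j} \left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)^2}}
$$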

Page 14: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mantel Statistics

Correlating $D_P$ with $D_{k \ll P}$ avoids the curse of dimensionality!

A positive Mantel correlation indicates that the genes in $D_{k \ll P}$ contain the same information as the genes in $D_P$

Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genet Epidemiology 23: 87-96.
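A minimal sketch of the Mantel correlation, taken here to be the Pearson correlation between the corresponding off-diagonal entries of two distance matrices. The function names are illustrative and the example matrices are made up.

```python
import math


def upper_triangle(D):
    """Entries d[i][j] with i < j, read row by row."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]


def mantel_correlation(D1, D2):
    """Pearson correlation between the upper triangles of D1 and D2."""
    a, b = upper_triangle(D1), upper_triangle(D2)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den


if __name__ == "__main__":
    D_full = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
    D_subset = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]  # same structure, rescaled
    print(mantel_correlation(D_full, D_subset))
```

Only the N(N-1)/2 pairwise distances enter the calculation, so the cost of the correlation itself does not depend on the number of genes.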

Page 15: Computational Discrete Mathematics and Statistics for Molecular Array Data

GA-Mantel

• Search algorithm to find signal genes

• Solution representation

– list of genes (10 123 456 798 835 888 923)

– binary vector {0000100110000….00010}

• Each solution maps to a Mantel correlation value

– Assumption: the larger the correlation, the more signal genes in the solution

• Selection keeps solutions with high Mantel correlation

Grefenstette, Thompson, Shannon, and Steinmeyer (2005): Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering 37th Symposium on the Interface. St. Louis, MO
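The two solution representations carry the same information and can be converted into each other. A small sketch, assuming 0-based gene indices and an arbitrary total of 1,000 genes for the example; neither detail is stated on the slide.

```python
def to_binary_vector(gene_indices, n_genes):
    """Gene list -> 0/1 vector of length n_genes (1 = gene is selected)."""
    selected = set(gene_indices)
    return [1 if g in selected else 0 for g in range(n_genes)]


def to_gene_list(binary_vector):
    """0/1 vector -> sorted list of selected gene indices."""
    return [g for g, bit in enumerate(binary_vector) if bit]


if __name__ == "__main__":
    genes = [10, 123, 456, 798, 835, 888, 923]  # gene list from the slide
    vector = to_binary_vector(genes, n_genes=1000)
    print(sum(vector), to_gene_list(vector) == genes)
```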

Page 16: Computational Discrete Mathematics and Statistics for Molecular Array Data

Recombination

Page 17: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mutation
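The transcript preserves only the titles of the Recombination and Mutation slides, so the operators actually used are not known. As a generic illustration on the binary-vector representation, one-point crossover and bit-flip mutation might look like this (both operator choices are assumptions):

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the vector
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2


def bit_flip_mutation(solution, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in solution]


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = [0] * 8, [1] * 8
    print(one_point_crossover(a, b, rng))
    print(bit_flip_mutation(a, rate=0.25, rng=rng))
```

Note that these operators change the number of selected genes, so a fixed-length search would need operators that preserve the count of ones.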

Page 18: Computational Discrete Mathematics and Statistics for Molecular Array Data

Gene Subset Selection

• Given:

– a data set comprising N microarray experiments with g genes

• Find:

– a subset of genes that captures relevant relationships among the experiments

• Goal:

– reduce data for further analysis

– identify meaningful biological markers for diagnosis

Page 19: Computational Discrete Mathematics and Statistics for Molecular Array Data

Genetic Algorithm

1. Randomly generate an initial population

2. Do until stopping criterion is met:

Select individuals to be parents (biased by fitness).

Produce offspring by recombination/mutation.

Select individuals to die (biased by fitness).

End Do.

3. Return a result.
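The loop above can be sketched as a small steady-state genetic algorithm. Everything specific here is an assumption made for illustration: tournament selection for both parents and deaths, one-point crossover, bit-flip mutation, a fixed iteration budget as the stopping criterion, and a toy fitness (the number of ones) in place of the Mantel correlation.

```python
import random


def genetic_algorithm(fitness, n_bits, pop_size=20, generations=50,
                      mutation_rate=0.02, seed=0):
    """Maximize `fitness` over 0/1 lists of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    # 1. Randomly generate an initial population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]

    def tournament(pick_best):
        """Index of the better (or worse) of two random individuals."""
        a, b = rng.sample(range(pop_size), 2)
        if fitness(population[a]) < fitness(population[b]):
            a, b = b, a  # now a is at least as fit as b
        return a if pick_best else b

    # 2. Repeat until the stopping criterion (a fixed budget here) is met.
    for _ in range(generations * pop_size):
        # Select parents, biased by fitness.
        mother = population[tournament(True)]
        father = population[tournament(True)]
        # Produce one offspring by recombination and mutation.
        cut = rng.randrange(1, n_bits)
        child = mother[:cut] + father[cut:]
        child = [1 - bit if rng.random() < mutation_rate else bit
                 for bit in child]
        # Select an individual to die, biased by fitness.
        population[tournament(False)] = child

    # 3. Return a result: the fittest individual found.
    return max(population, key=fitness)


if __name__ == "__main__":
    best = genetic_algorithm(sum, n_bits=20)
    print(best, sum(best))
```

Because the loser of a two-way tournament is the one replaced, the best individual in the population is never removed in this sketch, so the best fitness cannot decrease.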

Page 20: Computational Discrete Mathematics and Statistics for Molecular Array Data

Fitness Evaluation for Gene Selection

• Calculate $D_P$ using all genes

• For each Subset(k) in current population:

– Calculate $D_{k \ll P}$

– Correlate $D_P$ with $D_{k \ll P}$

• Use Mantel Correlation as fitness to select next population of solutions

• Permute to compute P-values
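The fitness evaluation can be sketched end to end for a single gene subset. The slide does not spell out the permutation scheme, so this sketch assumes the usual Mantel approach of relabeling the samples of one distance matrix (rows and columns together). The function names, the number of permutations, and the toy data are illustrative.

```python
import math
import random


def distance_matrix(X, genes):
    """Euclidean distances between samples (rows of X) using only `genes`."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((X[i][g] - X[j][g]) ** 2 for g in genes))
            D[i][j] = D[j][i] = d
    return D


def mantel_r(D1, D2):
    """Pearson correlation of the upper-triangle entries of D1 and D2."""
    n = len(D1)
    a = [D1[i][j] for i in range(n) for j in range(i + 1, n)]
    b = [D2[i][j] for i in range(n) for j in range(i + 1, n)]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a)
                    * sum((y - mb) ** 2 for y in b))
    return num / den


def subset_fitness(X, subset, n_permutations=199, seed=0):
    """Return (Mantel correlation of D_P with D_k, permutation p-value)."""
    rng = random.Random(seed)
    n = len(X)
    D_full = distance_matrix(X, range(len(X[0])))  # D_P: all genes
    D_sub = distance_matrix(X, subset)             # D_k: subset only
    observed = mantel_r(D_full, D_sub)
    hits = 0
    for _ in range(n_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        # Relabel the samples of D_sub: permute rows and columns together.
        D_perm = [[D_sub[perm[i]][perm[j]] for j in range(n)]
                  for i in range(n)]
        if mantel_r(D_full, D_perm) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    X = [[0.0, 1.0, 5.0], [1.0, 0.5, 2.0], [2.0, 3.0, 7.0], [4.0, 1.5, 1.0],
         [7.0, 2.5, 3.0], [11.0, 0.0, 6.0], [16.0, 4.0, 4.0], [22.0, 2.0, 0.0]]
    print(subset_fitness(X, subset=[0]))
```

The p-value counts the observed labeling as one of the permutations, so it can never be exactly zero.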

Page 21: Computational Discrete Mathematics and Statistics for Molecular Array Data

[Figure: the $N \times P$ data matrix shown twice, with a 0/1 weight (Wt) above each gene column. With weights 1 1 1 1, all $P$ genes enter the distance calculation and give $D_P$. With weights 1 1 0 1, only the $K$ genes with weight 1 enter and give $D_{k \ll P}$.]

Page 22: Computational Discrete Mathematics and Statistics for Molecular Array Data

GA on Artificial Data

• Simulated data:

– 100 experiments with 10,000 genes

  • 100 signal genes

  • 9900 noise genes

– Two groups

  • Group 1 has signal genes sampled from N(0, 1)

  • Group 2 has signal genes sampled from N(1, 1)

• GA Parameters

– population size 200

– generations 200

• Outcome measures (averaged over 10 runs of the GA)

– prevalence (signal, noise) – number of signal and noise genes in GA answer

– correlation (signal, noise) – correlation of best subset distance matrix with the ‘full’ distance matrix

– coverage – number of signal genes identified over all GA runs
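A data set of this kind could be simulated along the following lines. The slide leaves several details open, so this sketch assumes an even 50/50 split of the 100 experiments between the two groups, noise genes drawn from N(0, 1) in both groups, and signal genes stored in the first columns. The function name is hypothetical.

```python
import random


def simulate_expression(n_per_group=50, n_signal=100, n_noise=9900, seed=0):
    """Return (data, labels): one row per experiment, one column per gene.

    Assumptions not stated on the slide: equal group sizes, N(0, 1) noise
    genes in both groups, and signal genes placed in the first columns.
    """
    rng = random.Random(seed)
    data, labels = [], []
    for group, signal_mean in ((1, 0.0), (2, 1.0)):
        for _ in range(n_per_group):
            signal = [rng.gauss(signal_mean, 1.0) for _ in range(n_signal)]
            noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
            data.append(signal + noise)
            labels.append(group)
    return data, labels


if __name__ == "__main__":
    data, labels = simulate_expression(n_per_group=5, n_signal=3, n_noise=7)
    print(len(data), len(data[0]), labels)
```

With the default arguments the function produces the 100 x 10,000 layout described above; the smaller call in the usage example only keeps the demonstration fast.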

Page 23: Computational Discrete Mathematics and Statistics for Molecular Array Data

GA on Artificial Data

Length = 30

Prevalence: mean number of signal genes = 22.9 (0.7), or 76.3% (std 0.53%) of the 30 selected genes

Correlation: mean rho for best subsets = 0.787 (0.009), p-value < 0.0001

Coverage: total signal genes identified across 10 runs = 65/100

Observation: solutions tend to converge to similar subsets. The same 4 signal genes appear in 90% of runs.

Page 24: Computational Discrete Mathematics and Statistics for Molecular Array Data

GA on Golub Data Set

• Data set: Golub training set (38 x 7129)

• Two Groups:

– 27 samples from ALL patients

– 11 samples from AML patients

• GA searched for subsets of fixed length (10 to 50)

• population = 200, generations = 200

• Mantel correlation tends to increase with subset size

Length   Final Mantel Corr   p-value
10       0.926 (0.005)       < 0.00001
20       0.954 (0.004)       < 0.00001
30       0.967 (0.002)       < 0.00001
40       0.975 (0.002)       < 0.00001
50       0.979 (0.002)       < 0.00001

Page 25: Computational Discrete Mathematics and Statistics for Molecular Array Data

Significant Feature Subsets

[Figure, two panels: clustering of samples using all genes; clustering of samples using 50 genes from GA]

Page 26: Computational Discrete Mathematics and Statistics for Molecular Array Data

Letting GA Select Subset Length

• Data set: Golub training set (38 x 7129)

• GA searched over variable length subsets (min=5 max=50)

• Fitness penalty = d * length / 50

• population = 200, generations = 200

• Tradeoff between length of subsets and correlation score

Length Penalty d   Pop Final Length   Best Final Length   Final Mantel Corr
0.00               48.6               49.0                0.979 (0.001)
0.25               17.4               25.0                0.954 (0.005)
0.50               10.7               16.1                0.939 (0.009)
1.00                7.7               10.6                0.922 (0.009)
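The length penalty can be written out as a short function. The slide gives only the penalty term, d * length / 50; that the penalty is subtracted from the Mantel correlation to form the fitness is an assumption, as is the function name.

```python
def penalized_fitness(mantel_corr, length, d, max_length=50):
    """Mantel correlation minus the length penalty d * length / max_length.

    Subtracting the penalty is an assumption; the slide states only the
    penalty term itself.
    """
    return mantel_corr - d * length / max_length


if __name__ == "__main__":
    # Illustrative values read from the table above, scored with d = 1.00.
    print(penalized_fitness(0.922, 10.6, d=1.0))  # short subset
    print(penalized_fitness(0.979, 49.0, d=1.0))  # long subset
```

Under this assumption and with d = 1.00, a subset of about 10.6 genes with correlation 0.922 scores roughly 0.710, while a subset of 49 genes with correlation 0.979 scores roughly -0.001, which is consistent with larger penalties driving the GA toward shorter subsets.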

Page 27: Computational Discrete Mathematics and Statistics for Molecular Array Data

Data Reduction

• Observation: GA appears to repeatedly converge to same regions of feature space

– In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets

• GA can also be used to find feature subsets that minimize rho

• pop = 200

• length = 50

• data set = Golub

• GA finds subsets with rho = 0 within 50 generations

Page 28: Computational Discrete Mathematics and Statistics for Molecular Array Data

GA in Experimental Data Analysis

• Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia)

• T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs

• Regulatory T-cells (Treg) suppress immune response

• Choi and DiPersio are studying the genetic mechanisms of Treg regulation

Page 29: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mouse Array Experiment

GROUP   TREATMENT            ARRAYS
1       Naïve Treg           dec1, dec5
2       Activated Treg       dec2, dec6, dec10
3       PBST (Control)       dec3, dec7, dec11
4       Decitabine treated   dec4, dec8, dec12

~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work

Page 30: Computational Discrete Mathematics and Statistics for Molecular Array Data

Mouse Array Experiment

• Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)

Page 31: Computational Discrete Mathematics and Statistics for Molecular Array Data
Page 32: Computational Discrete Mathematics and Statistics for Molecular Array Data
Page 33: Computational Discrete Mathematics and Statistics for Molecular Array Data
Page 34: Computational Discrete Mathematics and Statistics for Molecular Array Data

Act+Dec Vs Naïve Vs Control

Page 35: Computational Discrete Mathematics and Statistics for Molecular Array Data
Page 36: Computational Discrete Mathematics and Statistics for Molecular Array Data

Naïve+Dec Vs Act Vs Control

Page 37: Computational Discrete Mathematics and Statistics for Molecular Array Data
Page 38: Computational Discrete Mathematics and Statistics for Molecular Array Data

Summary

• GA-Mantel effective at identifying signal genes

• Longer gene subsets associated with higher scores

– tradeoff: higher correlations vs. smaller subsets

– requires constraining growth of subsets in GA

• GA effective at identifying noise genes

• GA-Mantel can find genes associated with phenotype

Page 39: Computational Discrete Mathematics and Statistics for Molecular Array Data

Future Directions

• RFA CA-08-005 (under review)

– Optimize algorithm to improve coverage of solution space

– Multiple solutions

– Combine solutions (weak hierarchies)

• Lung disease R01 (to be submitted)

– Microarrays to identify disease subgroups across the bronchitis/emphysema continuum

Page 40: Computational Discrete Mathematics and Statistics for Molecular Array Data

Weak Hierarchies

Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.