

Data Analysis for Gene Chip Data

Part I: One-gene-at-a-time methods

Min-Te Chao




• Simple description of gene chip data

• Earlier works

• Multiple t-tests and SAM

• Lee’s ANOVA

• Wong’s factor models

• Efron’s empirical Bayes



• Most of this work is statistical analysis, not really of the machine-learning type

• Training samples are very small, not to mention test samples

• Medical research needs scientific rigor wherever we can provide it


Arthritis and Rheumatism

• Guidelines for the submission and reviews of reports involving microarray technology

v.46, no. 4, 859-861



• Should document the accuracy and precision of data, including run-to-run variability of each gene

• No arbitrary setting of threshold (e.g., 2-fold)

• Careful evaluation of false discovery rate


Statistical Analysis

• Statistical analysis is absolutely necessary to support claims of an increase or decrease of gene expression

• Such rigor requires multiple experiments and analysis with standard statistical instruments.


Sample Heterogeneity

• … Strongly recommends that investigators focus studies on homogeneous cell populations until other methodological and data analysis problems can be resolved.


Independent Confirmation

• It is important that the findings be confirmed using an independent method, preferably with separate samples rather than retesting of the original mRNA.



• Other terms:

DNA array

DNA chips


Gene chips


• The underlying principle is the same for all microarrays, no matter how they are made

• Gene function is the key element researchers want to extract from the sequence

• DNA array is one of the most important tools

(Nature, v.416, April 2002 885-891)


2 types of microarray

• cDNA

• Oligonucleotides

• DIY type


• A microarray allows researchers to determine which genes are being expressed in a given cell type at a particular time and under particular conditions



Basic data form

• On each array, there are p “spots” (p>1000, sometimes 20000). Each spot has k probes (k=20 or so). There are usually 2k measurements (expressions) per spot, and the k differences, or the difference of logs, are used.

• Sometimes they only give you a summary statistic per spot, e.g. the median or the mean


• Each spot corresponds to a “gene”

• For each study, we can arrange the chips so that the i-th spot represents the i-th gene (genes close in index may not be physically close at all)

• This means that when we read the i-th spot of all chips in one study, we know we get different measurements of the same i-th gene


• Data of one chip can be arranged in a matrix form,

Y; X_1, X_2, …, X_p

This is just as in a regression setup, but in practice n (the number of chips used) is small compared with p.

Y is the response: cell type, experimental condition, survival time, …


• For a spot with 20 probes, see Efron et al. (2001, JASA, p.1153).


Earlier works

• Cluster analysis

• Fold methods

• Multiple t with Bonferroni correction


Multiple t with Bonferroni correction

• It is too conservative

• Family-wise error rate (FWER)

Among G tests, the probability of at least one false rejection; without correction it goes to 1 at an exponential rate in G
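A short worked illustration of why this probability approaches 1. It assumes, as an idealization not stated on the slide, that the G tests are independent, that every null hypothesis is true, and that each test is run at level \alpha:

```latex
\mathrm{FWER} = P(\text{at least one false rejection}) = 1 - (1 - \alpha)^G
```

For example, with \alpha = 0.05 and G = 100 this gives 1 - 0.95^{100} \approx 0.994, so a false rejection is almost certain even for a modest number of genes.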


Sidak’s single-step adjusted p-value


Bonferroni’s single-step adjusted p-value


All are very conservative
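The two single-step adjustments named above have standard textbook forms: Bonferroni uses min(G p, 1) and Sidak uses 1 - (1 - p)^G. A minimal sketch follows. The function names and the raw p-values are made up for illustration.

```python
import numpy as np

def bonferroni_adjust(pvals):
    """Single-step Bonferroni adjusted p-values: min(G * p, 1)."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p.size * p, 1.0)

def sidak_adjust(pvals):
    """Single-step Sidak adjusted p-values: 1 - (1 - p)^G."""
    p = np.asarray(pvals, dtype=float)
    return 1.0 - (1.0 - p) ** p.size

raw = [0.0001, 0.004, 0.03, 0.2]  # hypothetical raw p-values for G = 4 genes
print(bonferroni_adjust(raw))
print(sidak_adjust(raw))
```

With G in the thousands, both adjustments multiply each raw p-value by roughly G, which is why few genes survive.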


FDR – false discovery rate

• Roughly: among all rejected cases, what proportion are rejected wrongly?

(Benjamini and Hochberg 1995 JRSSB, 289-300) “Sequential p-method”


Sequential p-method

• Using the observed data, it estimates the rejection region so that FDR < alpha

Order all p-values from smallest to largest, and obtain a k such that the first k hypotheses (those with the k smallest p-values) are rejected.
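The ordering rule can be sketched with the usual Benjamini-Hochberg step-up form: choose the largest k with p_(k) <= k * alpha / G. This is a minimal sketch. The function name and the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean array marking the hypotheses rejected by the BH step-up rule."""
    p = np.asarray(pvals, dtype=float)
    G = p.size
    order = np.argsort(p)                         # indices of p-values, small to large
    thresholds = alpha * np.arange(1, G + 1) / G  # i * alpha / G for i = 1..G
    below = p[order] <= thresholds
    reject = np.zeros(G, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max() + 1        # largest i with p_(i) <= i * alpha / G
        reject[order[:k]] = True                  # reject the k smallest p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]  # hypothetical
print(benjamini_hochberg(pvals, alpha=0.05))
```

Because k is the largest index that satisfies the inequality, a hypothesis can be rejected even when its own p-value exceeds its own threshold, provided a later ordered p-value falls below its threshold.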


• Since we have a different definition for error to control, it will increase the “power”

• For modifications, see Storey (2002, JRSSB, 479-498)

• These are criteria specifically designed to handle risk assessment when G is large


Role of permutation

• For tests (multiple or not), it is important to use a null distribution

• It is generated by a well-designed permutation of the columns of the data matrix; a column refers to an observation (array), not a gene.


One simple example

• Let us say we look at the first gene, with n_1 arrays for treatment and n_2 arrays for control

• We use a t-statistic, t_1, say. What is the p-value corresponding to this observed t_1?


• Permute the n = n_1 + n_2 columns of the data matrix. Look at the first row (it corresponds to the first gene)

• Treat the first n_1 numbers as a fake “treatment” and the last n_2 numbers as a fake “control”, and compute a t-value; say we get s_1


• Permute again and do the same thing and we get s_2, ….

• Do it B times and get s_1, s_2, …., s_B

• Treat these s’s as a (bootstrap) sample for the null distribution of the t_1 statistic

• The p-value of the earlier t_1 is found from the ecdf of the s_j, j=1,2,…,B
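The recipe above can be sketched for a single gene as follows. Several choices here are assumptions for illustration, not taken from the slides: the unequal-variance form of the t-statistic, the two-sided comparison, the (1 + count) / (B + 1) convention for the p-value, and all function names. Shuffling one row is equivalent to permuting the columns of the data matrix and then reading off that row.

```python
import numpy as np

def t_stat(x, n1):
    """Two-sample t-statistic (unequal variances): first n1 values vs. the rest."""
    a, b = x[:n1], x[n1:]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

def permutation_pvalue(row, n1, B=1000, seed=0):
    """Two-sided permutation p-value for one gene (one row of the data matrix)."""
    rng = np.random.default_rng(seed)
    row = np.asarray(row, dtype=float)
    t_obs = t_stat(row, n1)
    # s_1, ..., s_B: t-values recomputed after shuffling the column labels
    s = np.array([t_stat(rng.permutation(row), n1) for _ in range(B)])
    # share of permuted statistics at least as extreme as the observed one
    return (1 + np.sum(np.abs(s) >= abs(t_obs))) / (B + 1)

gene = [5.1, 4.9, 5.3, 5.0, 1.0, 1.2, 0.9, 1.1]  # hypothetical: 4 treatment, 4 control
print(permutation_pvalue(gene, n1=4))
```

With only 4 + 4 arrays there are just 70 distinct splits, so the smallest attainable two-sided p-value is about 2/70. This is one reason small n limits what a single-gene permutation test can show.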


• Permutation plays a major role: it provides a reference measure of variation in various situations

• For a well-designed microarray experiment, DOE techniques will play an important role in determining how to do proper permutations.


SAM – significance analysis of microarrays

• A standard method of microarray analysis, taught many times in Stanford short courses on data mining

• Modified multiple t-tests

• Using the permutation of certain data columns to evaluate variation of data in each gene


• Original paper is hard to read:

(Tusher, Tibshirani and Chu, PNAS 2001, v.98, no.9, 5116-5121)

But the SAM manual is a lot easier for statisticians to read (the software is free for academic use)


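• D(i) = \frac{\bar{X}_{treatment}(i) - \bar{X}_{control}(i)}{s(i) + s_0}

where s(i) is the standard error of the difference for gene i and s_0, as used in SAM, is a carefully determined constant > 0.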


• D(i)* are computed under a certain group of permutations of the columns; the D(i)* are also ordered

• Plot D vs. D*; points that deviate from the 45-degree line by more than a threshold Delta signal a significant expression change.

• Control the value of Delta to get different FDR.
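A simplified sketch of the D vs. D* comparison, not the SAM software itself. It fixes s_0 by hand rather than determining it from the data, flags genes by pointwise distance from the 45-degree line, and omits the FDR estimate. All names and data are made up for illustration.

```python
import numpy as np

def d_stat(X, is_treat, s0):
    """Relative difference D(i) for every gene (row) of X."""
    a, b = X[:, is_treat], X[:, ~is_treat]
    n1, n2 = a.shape[1], b.shape[1]
    # s(i): pooled standard error of the mean difference, gene by gene
    ss = a.var(axis=1) * n1 + b.var(axis=1) * n2   # within-group sums of squares
    s = np.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (a.mean(axis=1) - b.mean(axis=1)) / (s + s0)

def sam_sketch(X, is_treat, s0=0.1, delta=1.0, B=200, seed=0):
    """Return D(i) and a boolean flag per gene (simplified SAM-style rule)."""
    rng = np.random.default_rng(seed)
    is_treat = np.asarray(is_treat, dtype=bool)
    d = d_stat(X, is_treat, s0)
    # D*: sorted statistics under B permutations of the column labels, averaged
    d_star = np.mean(
        [np.sort(d_stat(X, rng.permutation(is_treat), s0)) for _ in range(B)],
        axis=0,
    )
    order = np.argsort(d)                      # ranks genes by observed D(i)
    flagged = np.zeros(d.size, dtype=bool)
    # compare the j-th smallest D with the j-th smallest expected D*
    flagged[order] = np.abs(d[order] - d_star) > delta
    return d, flagged

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))                   # hypothetical data: 50 genes, 8 arrays
X[0, :4] += 5.0                                # gene 0 is shifted in the treatment arrays
labels = np.array([True] * 4 + [False] * 4)
d, flagged = sam_sketch(X, labels)
print(np.nonzero(flagged)[0])
```

Raising delta flags fewer genes and lowering it flags more, which is the knob the last bullet refers to.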


Other model-based methods

• Wong’s model

PM - MM = \theta \phi + \epsilon

Outlier detection

Model validation

Li and Wong (2001, PNAS v.98, no.1, 31-36)
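One way to read the model: for a single gene, the matrix of PM - MM differences (arrays by probes) is approximately rank one, with \theta indexed by array and \phi indexed by probe. The sketch below fits such a model by alternating least squares. The indexing, the scaling constraint sum(phi^2) = J, and the function name are assumptions for illustration, not a reproduction of the authors' software.

```python
import numpy as np

def fit_rank_one(Y, n_iter=50):
    """Alternating least squares for Y[i, j] ~ theta[i] * phi[j]."""
    I, J = Y.shape
    phi = np.ones(J)                              # start from equal probe sensitivities
    for _ in range(n_iter):
        theta = Y @ phi / (phi @ phi)             # least squares for theta given phi
        phi = Y.T @ theta / (theta @ theta)       # least squares for phi given theta
        phi *= np.sqrt(J) / np.linalg.norm(phi)   # rescale so that sum(phi**2) == J
    theta = Y @ phi / (phi @ phi)                 # final theta consistent with scaled phi
    resid = Y - np.outer(theta, phi)              # large residuals hint at outliers
    return theta, phi, resid

# hypothetical noise-free example: 3 arrays, 4 probes
Y = np.outer([1.0, 2.0, 3.0], [0.5, 1.0, 1.5, 2.0])
theta, phi, resid = fit_rank_one(Y)
print(np.abs(resid).max())
```

The residual matrix is what makes outlier detection and model validation possible: a probe or array that the rank-one structure cannot explain shows up as a large residual.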


Lee’s work

• ANOVA based

• Can handle unbalanced data, e.g., 7 microarray chips

(Lee et al. 2000, PNAS, v.97, 9834-9839)


Empirical Bayes

• Efron et al. (2001), JASA, v.96, 1151-1160

• Use a mixture model

f(z) = p_0 f_0(z) + p_1 f_1(z)

with f_0, f_1 estimated from the data.

p_1 = prior probability that a gene's expression is affected (by a treatment)


• A key idea is to use permuted (columns) data to estimate f_0

• Use a tricky logistic regression method

• Eventually found

p_1(z) = the a posteriori probability that a gene at expression level z is affected
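The posterior probability follows from Bayes' rule applied to the mixture. A short derivation, assuming p_0 = 1 - p_1 is the prior probability that a gene is unaffected:

```latex
p_1(z) = \frac{p_1 f_1(z)}{f(z)} = 1 - \frac{p_0 f_0(z)}{f(z)}, \qquad p_0 = 1 - p_1
```

Read this way, only the ratio f_0(z)/f(z) and p_0 are needed. This is presumably why estimating f_0 from the permuted data matters so much: f_1 never has to be estimated on its own.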


Part I conclusion

• Earlier methods are relatively easy to understand, but getting familiar with the biological language takes time

• More powerful data analytic methods will continue to develop

• It is important to first understand the basic problems of biologists before we jump in with fancy statistical methods


• We may solve the wrong problem …

• But if the problem is relevant, even simple methods can get good recognition

• All methods so far are “first moment only”, i.e., not much different from multiple t-tests; put another way, they are all one-gene-at-a-time methods.


• We did not address issues of data cleaning, outlier detection, normalization, etc. Microarray data are highly noisy; these problems are by no means trivial.

• As the cost per chip goes down, the number of chips per problem may grow. But well-designed experiments, e.g., fractional factorial designs, still have a role to play in this game


• Statistical methods, as compared with machine-learning-based methods, will play a more important role for this type of data since, with a model, parametric or not, one can attach a measure of confidence to the claimed result. This is crucial for scientific development.



• The statistical literature for microarrays, still in its infancy and with much of it unpublished, has tended to focus on frequentist data-analytical devices, such as cluster analysis, bootstrapping and linear models. (Efron, B. 2001)
