
Post on 28-Jan-2016


Chapter 212

T-Test – Two Groups

Introduction

This chapter describes how to obtain false discovery rate or experiment-wise error rate (Bonferroni) adjusted Prob Levels (P-values) for a two-sample experiment using the T-Test – Two Groups procedure. Before running this procedure, output (.ges) files containing a single expression value for each gene on each array must be obtained using the appropriate pre-processing procedure in GESS.

Background

Two microarray experimental designs lead to analysis with the GESS Two-Sample T-Test procedure: two-sample designs and two-sample reference designs.

Two-Sample Design

In a two-sample design, two groups are compared, which we will call Treatment 1 and Treatment 2. Several experimental units are randomly assigned to each of the two treatment groups. A single mRNA or cDNA sample is obtained from each experimental unit of both groups. Each sample is exposed to a single microarray, resulting in a single expression value for each gene for each unit of each treatment group (see examples in Tables 1 and 2). The goal is to determine for each gene whether there is evidence that the expression is different between the two groups.

Log2 Pre-processing Example, One Array

Prior to computing test statistics, each array should be appropriately pre-processed. The Log (Base 2) transformation is commonly used in pre-processing. The final row of the table below is an example of intensity values to be used for each gene from a single array.

Gene                            1         2         3         4         5    • • •
Median Pixel Intensity:       598      3418      5662     13762       699    • • •
Log2(Median Intensity):    9.2240   11.7389   12.4670   13.7484    9.4491    • • •
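The transformation in the table can be reproduced with a few lines of code. The following is a minimal sketch in Python (not part of GESS); the function name log2_transform is a hypothetical helper chosen for illustration, and the intensities are the five values shown in the table.

```python
import math

# Median pixel intensities for genes 1-5 of the single array in the table.
median_intensity = [598, 3418, 5662, 13762, 699]


def log2_transform(values):
    """Return the base-2 logarithm of each (strictly positive) intensity."""
    return [math.log2(v) for v in values]


log2_intensity = log2_transform(median_intensity)

for gene, (raw, logged) in enumerate(zip(median_intensity, log2_intensity), start=1):
    print(f"Gene {gene}: {raw:>6} -> {logged:.4f}")
```

The printed values should agree with the Log2(Median Intensity) row to about three decimal places; small last-digit differences can arise from rounding.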


Pre-processing Example, Multiple Arrays

Multiple intensity values for each gene obtained from replicate arrays (10 in each group) are used to compute the T-statistic for each two-sample T-test. The Array 1 values in the table below are the Log2 values of the previous table.

Gene                        1          2          3          4          5    • • •

Group 1 (Treatment 1)
  Array 1:             9.2240    11.7389    12.4670    13.7484     9.4491    • • •
  Array 2:            13.6385    12.5858    11.0552    13.6523    12.9738    • • •
  • • •
  Array 10:           13.6880    13.4530    8.21391     9.8417    12.6842    • • •

Group 2 (Treatment 2)
  Array 1:             9.3083    13.2412    13.7344    11.5823    13.3627    • • •
  Array 2:            12.5562    13.8051    11.9091    13.6056    13.4300    • • •
  • • •
  Array 10:            13.482    10.9060    13.2019    13.6082     9.9156    • • •

T-Statistic:         -4.36410    1.61144   -2.57941    1.14711   -0.30963    • • •

Two-Sample Reference Design (Two-Channel Arrays Only)

A two-sample reference design, or common reference design, employs an outside source of cDNA that is used as a reference for all samples in the experiment. Reference cDNA may be purchased separately or may be a combination of all cDNAs in the compared samples. (The pros and cons of the choice of reference cDNA are beyond the scope of this manual.)

Suppose a treatment and control are to be compared. One group of experimental units serves as the control group. The other group of experimental units receives the treatment. Following treatment, cDNA is isolated for each of the experimental units. The cDNA for the treatment and control groups may be termed target cDNA. The target cDNA from both groups is labeled with Cyanine 5 (Cy5, red) dye. An outside source of cDNA, with (hopefully) most genes of interest expressed, is labeled with Cyanine 3 (Cy3, green) dye. This cDNA is the common reference, and is used as a baseline for all arrays of both groups. The intensity value for each gene of each array is the relative expression of the target cDNA to the reference cDNA at each spot (see data examples in the tables that follow).

The goal of the reference cDNA is to remove additional variation that may have been introduced in the experimental procedure. Array differences may be particularly pronounced when large periods of time pass between array hybridizations of a single experiment. Reference designs may also be employed in repeated measures/time-course designs.


[Figure: Two-Sample Reference Design, Six Arrays]

Two-Sample Reference Design Pre-processing

In the pre-processing stage of two-sample reference designs, reference sample expression values are typically subtracted from target sample expression values after the Log (Base 2) transformation.

Two-Sample Reference Design Pre-processing Example, One Array

The final row contains the reference adjusted values for each gene.

Gene                                                          1         2         3         4         5    • • •
Median Pixel Intensity, Cy5 Dye, Target (Treatment):        674      3498      6412     17899       773    • • •
Median Pixel Intensity, Cy3 Dye (Reference):                658      4562      2689       398       621    • • •
Log-adjusted Target, Log2(Cy5 Median):                   9.3966   11.7723   12.6465   14.1275    9.5943    • • •
Log-adjusted Reference, Log2(Cy3 Median):                9.3619   12.1554   11.3928    8.6366    9.2784    • • •
Target – Reference, Log2(Cy5 Median) - Log2(Cy3 Median): 0.0346   -0.3831    1.2537    5.4909    0.3158    • • •
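As an illustration of the reference adjustment, the sketch below recomputes the final row of the table in Python. It is not the GESS implementation; the function name reference_adjust is a hypothetical helper, and only the five genes shown in the table are used.

```python
import math

# Median pixel intensities for genes 1-5 on one two-channel array (from the table).
cy5_target = [674, 3498, 6412, 17899, 773]    # Cy5 dye, target (treatment) sample
cy3_reference = [658, 4562, 2689, 398, 621]   # Cy3 dye, common reference sample


def reference_adjust(target, reference):
    """Return Log2(target) - Log2(reference) for each gene (spot)."""
    return [math.log2(t) - math.log2(r) for t, r in zip(target, reference)]


adjusted = reference_adjust(cy5_target, cy3_reference)

for gene, value in enumerate(adjusted, start=1):
    print(f"Gene {gene}: {value:.4f}")
```

The results should match the Target – Reference row to about three decimal places. A positive value means the target intensity exceeds the reference intensity at that spot, and a negative value means the opposite.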


Two-Sample Reference Design Pre-processing Example, Multiple Arrays

Multiple difference values for each gene obtained from replicate arrays from each treatment group are used to compute the T-statistic for each two-sample T-test. The values for Array 1 of Group 1 are the same as the last row of the previous table.

Gene                        1          2          3          4          5    • • •

Group 1 (Treatment 1)
  Array 1:             0.0346    -0.3831     1.2537     5.4909     0.3158    • • •
  Array 2:             1.8717     2.2729    -1.2524     3.7579    -3.6397    • • •
  • • •
  Array 10:           -3.3008    -0.2392     3.0214     0.0467    -0.1932    • • •

Group 2 (Treatment 2)
  Array 1:            -1.5104    -0.5153    0.72928    -0.7236     3.3831    • • •
  Array 2:             2.0021     1.8478     0.8429     3.3310     2.6293    • • •
  • • •
  Array 10:            2.5649     2.5001     3.2999    -1.7088    -0.8759    • • •

T-Statistic:         -2.51354    3.60459   -1.71165    1.86741   -3.35956    • • •

Two-Sample T-Test

This section describes the technical details of a single two-sample T-test. Adjusting for multiple tests is discussed in the next section.

Null and Alternative Hypotheses

The two-sample null and alternative hypotheses are described here in terms of treatment groups: Treatment 1 and Treatment 2. These groups could equally be labeled Treatment A and Treatment B, Control and Treatment, etc. The two-sample null hypothesis for each gene is H0: μ1 = μ2, where μ1 is the true mean expression for that particular gene in the Treatment 1 environment, and μ2 is the true mean expression for that particular gene following Treatment 2. The alternative hypothesis may be any one of the following: Ha: μ1 < μ2, Ha: μ1 > μ2, or Ha: μ1 ≠ μ2. The choice of the alternative hypothesis depends upon the goals of the research. For example, if the goal of the experiment is only to determine which genes are up-regulated (increase in expression) over Treatment 1 when Treatment 2 is imposed, the alternative hypothesis would be Ha: μ1 < μ2, since up-regulation under Treatment 2 means that μ2 exceeds μ1. If the goal instead is to determine which genes are differentially expressed (up-regulated or down-regulated) when compared to the other treatment, the alternative hypothesis is Ha: μ1 ≠ μ2.


T-Test Formula

There are two common T-tests used for comparing groups, one that assumes the underlying variances of the two groups are equal, and another that assumes unequal variance.

Equal Variance

The formula for the T-statistic when assuming equal variance, based on samples $y_{11}, \ldots, y_{1n_1}$ and $y_{21}, \ldots, y_{2n_2}$, is

$$T = \frac{\bar{y}_1 - \bar{y}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

with

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$

where $s_1^2$ and $s_2^2$ are the usual sample variances. If samples $y_{11}, \ldots, y_{1n_1}$ and $y_{21}, \ldots, y_{2n_2}$ come from two normal populations with equal variances then the statistic T is known to follow "Student's" t distribution with degrees of freedom equal to $n_1 + n_2 - 2$ (Fisher, 1925). An appropriate P-value for the test may then be calculated as the probability of t being as, or more, extreme than the one obtained, based on this distribution. Problems may arise, however, when the two underlying distributions are not normally distributed and/or have differing variances. These problems are often amplified when the sample sizes also differ. Unfortunately, in practice, little is usually known about the true underlying distributions from which the two samples come, particularly when sample sizes are small.
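The equal-variance statistic can be sketched in Python as follows. This is an illustration only, not the GESS code; the function name pooled_t_statistic and the two small samples are hypothetical and were chosen so the arithmetic is easy to follow by hand.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(y1, y2):
    """Equal-variance two-sample T-statistic and its degrees of freedom."""
    n1, n2 = len(y1), len(y2)
    s1_sq = variance(y1)  # usual sample variance (n - 1 in the denominator)
    s2_sq = variance(y2)
    # Pooled variance s^2 from the two sample variances.
    pooled_var = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    t_stat = (mean(y1) - mean(y2)) / standard_error
    return t_stat, n1 + n2 - 2


# Hypothetical expression values for one gene (not taken from the manual).
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, df = pooled_t_statistic(group1, group2)
print(f"T = {t_stat:.6f}, df = {df}")
```

For these samples the means are 3 and 6, the sample variances are 2.5 and 10, and the pooled variance is 6.25, so T = -3 / (2.5 × sqrt(0.4)) with 8 degrees of freedom.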

Unequal Variance

When the populations compared have unequal variance, but are both normally distributed, the resulting test of H0: μ1 = μ2 is known as the Behrens-Fisher problem. The statistic usually recommended for testing in this scenario is one developed by Welch (1947, 1949). The statistic is

$$T_W = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

or Welch's T-statistic. The use of this statistic relies on the asymptotic convergence of the sample variances to the true variances, and is certainly appropriate for large samples. For small or moderate samples $T_W$ approximates 'Student's' t-distribution, with estimated degrees of freedom

$$\hat{v} = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{s_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{s_2^2}{n_2}\right)^2}.$$


The statistic $T_W$ usually outperforms T (higher power when nominal α is preserved) when the variances of the sampled populations differ considerably (Welch, 1947, 1949). When $n_1 = n_2$, T and $T_W$ are equivalent, except for the degrees of freedom used in the test.
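A minimal Python sketch of Welch's statistic and its estimated degrees of freedom is given here for illustration. It is not the GESS implementation; the function name welch_t_statistic and the sample values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t_statistic(y1, y2):
    """Welch's T-statistic and its estimated degrees of freedom."""
    n1, n2 = len(y1), len(y2)
    a = variance(y1) / n1  # s1^2 / n1
    b = variance(y2) / n2  # s2^2 / n2
    t_w = (mean(y1) - mean(y2)) / math.sqrt(a + b)
    df_hat = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t_w, df_hat


# Hypothetical expression values for one gene, with equal group sizes.
group1 = [1.0, 2.0, 3.0, 4.0, 5.0]
group2 = [2.0, 4.0, 6.0, 8.0, 10.0]

t_w, df_hat = welch_t_statistic(group1, group2)
print(f"T_W = {t_w:.6f}, estimated df = {df_hat:.4f}")
```

Because the two hypothetical groups have the same size, T_W here equals the equal-variance statistic for the same data (about -1.897), while the estimated degrees of freedom (about 5.88) are smaller than the n1 + n2 - 2 = 8 used by the equal-variance test.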

Two-Sample Unadjusted P-Values (Probability Levels) Example

The table below contains the unadjusted P-values (Probability Levels) for the T-statistics of the Two-Sample Design Example, assuming equal variance. P-values were calculated based on the two-sided alternative hypothesis Ha: μ1 ≠ μ2. The t distribution with 18 degrees of freedom was used.

Gene                        1          2          3          4          5    • • •

Group 1 (Treatment 1)
  Array 1:             9.2240    11.7389    12.4670    13.7484     9.4491    • • •
  Array 2:            13.6385    12.5858    11.0552    13.6523    12.9738    • • •
  • • •
  Array 10:           13.6880    13.4530    8.21391     9.8417    12.6842    • • •

Group 2 (Treatment 2)
  Array 1:             9.3083    13.2412    13.7344    11.5823    13.3627    • • •
  Array 2:            12.5562    13.8051    11.9091    13.6056    13.4300    • • •
  • • •
  Array 10:            13.482    10.9060    13.2019    13.6082     9.9156    • • •

T-Statistic:         -4.36410    1.61144   -2.57941    1.14711   -0.30963    • • •

P-Value (Prob Level): 0.00037    0.12448    0.01890    0.26636    0.76040    • • •
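To show how a two-sided P-value follows from a T-statistic and its degrees of freedom, the sketch below integrates the Student's t density numerically using only the Python standard library. This is an assumption-laden illustration rather than the method GESS uses: the function names t_density and two_sided_p_value are hypothetical, and composite Simpson integration is simply one convenient way to obtain the tail area.

```python
import math


def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t, df, intervals=2000):
    """P(|T| >= |t|), via composite Simpson integration on [0, |t|].

    `intervals` must be even for Simpson's rule.
    """
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / intervals
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central_area)


# T-statistics for genes 1-5 from the table, with 10 + 10 - 2 = 18 df.
t_statistics = [-4.36410, 1.61144, -2.57941, 1.14711, -0.30963]
p_values = [two_sided_p_value(t, 18) for t in t_statistics]

for gene, p in enumerate(p_values, start=1):
    print(f"Gene {gene}: P = {p:.5f}")
```

The printed values should be close to the P-Value (Prob Level) row of the table. In practice a statistics library would normally be used for the t distribution instead of hand-written integration.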

Multiple Testing Adjustment

When the two-sample T-test is run for a replicated microarray experiment, the result is a list of P-values (Probability Levels) that reflect the evidence of difference in expression. When hundreds or thousands of genes are investigated at the same time, many 'small' P-values will occur by chance, due to the natural variability of the process. It is therefore requisite to make an appropriate adjustment to the P-value (Probability Level), such that the likelihood of a false conclusion is controlled.

Benjamini and Hochberg's (1995) False Discovery Rate Table

The following table (adapted to the subject of microarray data) is found in Benjamini and Hochberg's (1995) false discovery rate article.

                                                   Declared         Declared
                                                   Not Different    Different    Total
A true difference in expression does not exist     U                V            m0
There exists a true difference in expression       T                S            m – m0
Total                                              m – R            R            m

In the table, m is the total number of hypotheses tested (or total number of genes) and is assumed to be known in advance. Of the m null hypotheses tested, m0 is the number of tests for which there is no difference in expression, R is the number of tests for which a difference is declared, and U, V, T, and S are defined by the combination of the declaration of the test and whether or not a difference exists, in truth. The random variables U, V, T, and S are unobservable.

Need for Multiple Testing Adjustment

Following the calculation of a raw P-value (Probability Level) for each test, P-value adjustments need to be made to account in some way for the multiplicity of tests. It is desirable that these adjustments minimize the number of genes that are falsely declared different (V) while maximizing the number of genes that are correctly declared different (S). To address this issue the researcher must weigh the value of finding a gene against the price of a false positive. If a false positive is very expensive, a method that focuses on minimizing V should be employed. If the value of finding a gene is much higher than the cost of additional false positives, a method that focuses on maximizing S should be used.

Error Rates – P-Value Adjustment Techniques

Below is a brief description of three common error rates that are used for control of false positive declarations. The commonly used P-value adjustment technique for controlling each error rate is also described.

Per-Comparison Error Rate (PCER) – No Multiple Testing Adjustment

The per-comparison error rate (PCER) is defined as

$$\mathrm{PCER} = E(V)/m,$$

where E(V) is the expected number of genes that are falsely declared different, and m is the total number of tests. Preserving the PCER is tantamount to ignoring multiple testing altogether. If a method is used which controls a PCER of 0.05 for 1,000 tests, approximately 50 out of 1,000 tests will falsely be declared significant. Using a method that controls the PCER will produce a list of genes that includes most of the genes for which there exists a true difference in expression (i.e., maximizes S), but it will also include a very large number of genes which are falsely declared to have a true difference in expression (i.e., does not appropriately minimize V). Controlling the PCER should be viewed as overly weak control of Type I error.

To obtain P-values (Probability Levels) that control the PCER, no adjustment is made to the P-value. To determine significance, the P-value is simply compared to the designated alpha.


Family-Wise Error Rate (FWER) – Bonferroni Adjustment

The family-wise error rate (FWER) is defined as

$$\mathrm{FWER} = \Pr(V > 0),$$

where V is the number of genes that are falsely declared different. Controlling FWER is controlling the probability that at least one null hypothesis is falsely rejected. If a method is used which controls a FWER of 0.05 for 1,000 tests, the probability that one or more of the 1,000 tests (collectively) is falsely rejected is at most 0.05. Using a method that controls the FWER will produce a list of genes that includes a small (depending also on sample size) number of the genes for which there exists a true difference in expression (i.e., limits S, unless the sample size is very large). However, the list of genes will include very few or no genes that are falsely declared to have a true difference in expression (i.e., stringently minimizes V). Controlling the FWER should be considered very strong control of Type I error.

Assuming the tests are independent, the well-known Bonferroni P-value adjustment produces adjusted P-values (Probability Levels) for which the FWER is controlled. The Bonferroni adjustment is applied to all m unadjusted P-values ($p_j$) as

$$\tilde{p}_j = \min(m p_j, 1).$$

That is, each P-value (Probability Level) is multiplied by the number of tests, and if the result is greater than one, it is set to the maximum possible P-value of one.
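The Bonferroni adjustment is short enough to sketch directly. The Python below is illustrative only; the function name bonferroni_adjust is hypothetical, and treating the five example P-values from the earlier table as the complete set of tests (m = 5) is an assumption made purely to keep the arithmetic visible. A real experiment would use all m genes.

```python
def bonferroni_adjust(p_values):
    """Multiply each P-value by the number of tests and cap the result at 1."""
    m = len(p_values)
    return [min(m * p, 1.0) for p in p_values]


# Unadjusted P-values for genes 1-5, treated here as the whole experiment (m = 5).
raw_p = [0.00037, 0.12448, 0.01890, 0.26636, 0.76040]
bonferroni_p = bonferroni_adjust(raw_p)

for gene, (raw, adj) in enumerate(zip(raw_p, bonferroni_p), start=1):
    print(f"Gene {gene}: raw = {raw:.5f}, Bonferroni = {adj:.5f}")
```

With m = 5, gene 1 moves from 0.00037 to 0.00185, while genes 4 and 5 are capped at 1 because five times their raw P-value exceeds one.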

False Discovery Rate (FDR) – Benjamini and Hochberg Adjustment

The false discovery rate (FDR) (Benjamini and Hochberg, 1995) is defined as

$$\mathrm{FDR} = E\!\left(\frac{V}{R}\,1_{\{R>0\}}\right) = E\!\left(\frac{V}{R}\,\Big|\,R>0\right)\Pr(R>0),$$

where R is the number of genes that are declared significantly different, and V is the number of genes that are falsely declared different. Controlling FDR is controlling the expected proportion of falsely declared differences (false discoveries) to declared differences (true and false discoveries, together). If a method is used which controls a FDR of 0.05 for 1,000 tests, and 40 genes are declared different, it is expected that 40*0.05 = 2 of the 40 declarations are false declarations (false discoveries). Using a method that controls the FDR will produce a list of genes that includes an intermediate (depending also on sample size) number of genes for which there exists a true difference in expression (i.e., moderate to large S). However, the list of genes will include a small number of genes that are falsely declared to have a true difference in expression (i.e., moderately minimizes V). Controlling the FDR should be considered intermediate control of Type I error.

Assuming the tests are independent, the Benjamini and Hochberg P-value adjustment produces adjusted P-values (Probability Levels) for which the FDR is controlled. These adjusted P-values are found as

$$\tilde{p}_{r_i} = \min_{k = i, \ldots, m}\left\{\min\left(\frac{m}{k}\,p_{r_k},\, 1\right)\right\},$$

where $p_{r_1} \le p_{r_2} \le \cdots \le p_{r_m}$ are the observed ordered unadjusted P-values. The procedure is defined in Benjamini and Hochberg (1995). The corresponding adjusted P-value definition given here is found in Dudoit, Shaffer, and Boldrick (2003).
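The adjusted P-value definition can be sketched in Python as a single pass from the largest P-value to the smallest, keeping a running minimum. This is an illustration under stated assumptions, not the GESS code: the function name benjamini_hochberg_adjust is hypothetical, and the five example P-values are again treated as the complete set of tests (m = 5).

```python
def benjamini_hochberg_adjust(p_values):
    """Benjamini and Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    # order[k - 1] is the index of the k-th smallest P-value (rank k).
    order = sorted(range(m), key=lambda j: p_values[j])
    adjusted = [0.0] * m
    running_min = 1.0  # starting at 1 implements the inner min(., 1)
    for rank in range(m, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, m / rank * p_values[j])
        adjusted[j] = running_min
    return adjusted


# Unadjusted P-values for genes 1-5, treated here as the whole experiment (m = 5).
raw_p = [0.00037, 0.12448, 0.01890, 0.26636, 0.76040]
fdr_p = benjamini_hochberg_adjust(raw_p)

for gene, (raw, adj) in enumerate(zip(raw_p, fdr_p), start=1):
    print(f"Gene {gene}: raw = {raw:.5f}, FDR-adjusted = {adj:.5f}")
```

For example, gene 3 has the second smallest P-value, so its adjusted value is (5/2) × 0.01890 = 0.04725, which is smaller than the Bonferroni value of 5 × 0.01890 = 0.0945 for the same gene.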


Multiple Testing Adjustment Comparison

The following table gives a summary of the multiple testing adjustment procedures and error rate control. The power to detect differences also depends heavily on sample size.

Adjustment                Error Rate    Control of      Power to
Technique                 Controlled    Type I Error    Detect Differences
None                      PCER          Minimal         High
Bonferroni                FWER          Strict          Low
Benjamini and Hochberg    FDR           Moderate        Moderate/High

Type I Error: Rejection of a null hypothesis that is true.

Analysis Steps

Following are the recommended steps for running a two-sample T-Test on microarray data.

Step 1 – Pre-Processing

Run the appropriate pre-processing procedure (e.g., GenePix Pre-processing or Affymetrix Pre-processing) to prepare data (.ges) files for statistical analysis. The .ges files are created when a variable name is entered in the Output File Names Variable box on the Variables tab of the pre-processing procedure window.

Step 2 – Spreadsheet Setup

Because the analysis for hundreds or thousands of genes may be time-consuming, it is recommended that an initial run be made on fictitious data to ensure the spreadsheet is set up properly. In the output for the run on fictitious data, check, for example, that the sample sizes of each group are correct. The importance of this step increases as the complexity of the statistical analysis increases. This step is also useful for getting ideas for follow-up statistical analyses of specific genes.

Step 3 – Run the Analysis

Carefully select the desired H0 Value, Prob Level Cutoff, and direction of the alternative hypothesis. If follow-up experiments are to be run, the False Discovery Rate Control adjustment is recommended. If there will be no follow-up experiments, the Bonferroni adjustment is recommended. The pre-processed data for the most significant genes should be stored in the spreadsheet for detailed follow-up analysis.

Examine the output to determine if the number of hypothesis tests conducted is as expected, and to see if the appropriate number of replicates was used. It may also help to look at the Prob Level histogram and/or the Prob Level vs Mean Difference plots to understand the distribution of statistics across the entire experiment.


Step 4 – Follow-Up Analysis

Run individual follow-up statistical analyses on the genes for which pre-processed data was stored using the T-Test – Two-Sample procedure in NCSS. These individual analyses are useful for examining test assumptions and specific trends in greater detail. Note, however, that statistical tests are not adjusted for multiple testing across genes in the NCSS two-sample procedure.

Procedure Options

This section describes the options available in this procedure.

Variables Tab

These options specify the variables that will be used in the analysis.

GES Files Specifications

These variables are used to identify the .ges files for T-Test analysis.

Response GES Files Variable

Specify the variable containing the column of input files on the spreadsheet. These input files will usually be those files that were output as a result of a pre-processing procedure. The files of this column contain the intensity summaries on which the T-tests will be run.

Group Specifications

These variables are used to identify the groups for T-Test analysis.

Group Variable

Specify the name of the variable that divides the array files into two groups. These two groups are those compared using the two-sample T-test.

Hypotheses Specifications

These options determine the null and alternative hypotheses.

H0 Value

This is the hypothesized difference between the two population means. It is usually assumed to be zero.

Alternative Hypothesis

Specify the alternative hypothesis test to be used in Probability Level calculation. Mean1 corresponds to the mean of the group that is first if the group names are sorted. Mean2 corresponds to the mean of the group that is second if the group names are sorted. That is, if the group names are B and A, Mean1 will be the mean of Group A and Mean2 will be the mean of Group B.

• Mean1 > Mean2
  Probability Levels for this alternative hypothesis are based solely on the probability that the true Mean1 is greater than the true Mean2.


• Mean1 < Mean2
  Probability Levels for this alternative hypothesis are based solely on the probability that the true Mean1 is less than the true Mean2.

• Mean1 <> Mean2
  Probability Levels for this alternative hypothesis are based on the probability that the true Mean1 is less than or greater than the true Mean2.
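The rule that decides which group supplies Mean1 and which supplies Mean2 can be illustrated with a short Python sketch. The group labels and expression values are hypothetical, and the use of Python's default string ordering is an assumption; the exact sort rule applied by the procedure (for example, its treatment of upper and lower case) is not stated in this chapter.

```python
from statistics import mean

# Hypothetical group labels and expression values for one gene.
group_labels = ["B", "A", "B", "A"]
expression = [2.0, 1.0, 4.0, 3.0]

# Assumption: group names are ordered with Python's default string sort.
first_group, second_group = sorted(set(group_labels))

mean1 = mean(v for g, v in zip(group_labels, expression) if g == first_group)
mean2 = mean(v for g, v in zip(group_labels, expression) if g == second_group)

print(f"Mean1 is the mean of group {first_group}: {mean1}")
print(f"Mean2 is the mean of group {second_group}: {mean2}")
```

Here the labels appear in the order B, A, yet Mean1 is taken from Group A, so an alternative such as Mean1 < Mean2 is read as "Group A is lower than Group B".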

Adjustment for Multiple Testing

Multiple Test Correction

When several tests are performed on the same set of data, the probability levels of the individual tests should be corrected. This option lets you specify the type of multiple test correction.

• None
  No correction is done.

• Bonferroni
  The Bonferroni correction preserves the experiment-wise error rate.

• False Discovery Rate Control
  False Discovery Rate Control controls the proportion of falsely declared significant differences.

Recommendation: If you will be doing follow-up testing, False Discovery Rate Control should be used. If not, the Bonferroni correction should be used.

Variance Assumption

Variances of Two Groups Assumed to be

Specify whether the variances of the two groups are assumed to be equal or unequal.

• Equal
  If the variances are assumed to be equal, the traditional two-sample T-test with n1 + n2 - 2 df is used to calculate Probability Levels.

• Unequal
  If the variances are assumed to be unequal, Welch's two-sample T-test is used to calculate Probability Levels.

Recommendation: Welch's test is better for preserving the error rate, but results in slightly lower power.


Reports Tab

The options on this panel control which reports and plots are generated.

Select Reports

The following options are used to determine the reports that will be displayed.

Test Detail Sorted by Prob Level

Check this box to obtain a list of the most significant differences, sorted by the probability level. Associated names or IDs, unadjusted probability levels, means, standard deviations, and test statistics are also shown.

Prob Level Cutoff

Specify the cutoff for the multiple test corrected probability levels. When the Test Detail Sorted by Prob Level box is checked, all adjusted probability levels below this value will be reported.

Test Detail Sorted by Gene Within Subset

Check this box to obtain a list of all genes that are in subset lists. Associated corrected probability levels, unadjusted probability levels, means, standard deviations, and test statistics are also shown.

A separate list is produced for each subset, sorted alphabetically.

Report Options

These options determine the format of the reports.

Precision

Specifies whether unformatted numbers are displayed as single (7-digit) or double (13-digit) precision numbers.

• Single
  Unformatted numbers are displayed with 7 digits. This is the default setting. All reports have been formatted for single precision.

• Double
  Unformatted numbers are displayed with 13 digits. This option is most often used when extremely accurate results are needed for further calculation. Double precision numbers will require more space than allotted, potentially resulting in unaligned output. This option is provided for those instances when accuracy is more important than format alignment.

COMMENTS:

This option does not affect formatted numbers such as probability levels.

This option only influences the format of the numbers as they are output. All calculations are performed in double precision regardless of selection.

Prob Decimals

Specify the number of decimal places to be used for displaying probability levels on the reports. The number chosen here does not affect the internal precision of the data.


Stat Decimals

Specify the number of decimal places to be used for displaying differences in means and standard errors on the reports. The number chosen here does not affect the internal precision of the data.

Test Decimals

Specify the number of decimal places to be used for displaying T-statistics on the reports. The number chosen here does not affect the internal precision of the data.

Select Prob Level vs Mean Plots

The following options are used to determine which Prob Level vs Mean Difference (P vs M, Volcano) plots will be displayed.

Prob Level vs Mean Difference (P vs M, Volcano) Plot

Check this box to obtain a Prob Level vs Mean Difference plot. This plot displays the difference in means on the X-axis and the probability level (P-Value) on the Y-axis. Occasionally, a standard error of zero occurs, producing an undefined Prob Level. When the standard error is zero, the point is plotted with the Prob Level set to 0.00001.

Corrected Prob Level vs Mean Difference (P vs M, Volcano) Plot

Check this box to obtain a Corrected Prob Level vs Mean Difference plot. This plot displays the difference in means on the X-axis and the corrected probability level (corrected P-Value) on the Y-axis. Occasionally, a standard error of zero occurs, producing an undefined Prob Level. When the standard error is zero, the point is plotted with the Prob Level set to 0.00001.

Select Histograms

The following options are used to determine which histograms will be displayed.

Histogram of Prob Level

Check this box to obtain a histogram of the unadjusted (raw) probability levels.

Histogram of Corrected Prob Level

Check this box to obtain a histogram of all corrected probability levels.

Histogram of Log10(Prob Level)

Check this box to obtain a histogram of the Log(base 10) transformed, unadjusted (raw) probability levels. Occasionally, a standard error of zero occurs, producing an undefined Prob Level. When the standard error is zero, the Log10(Prob Level) is put in the bin at -10.

Histogram of Log10(Corrected Prob Level)

Check this box to obtain a histogram of all Log(base 10) transformed, corrected probability levels. Occasionally, a standard error of zero occurs, producing an undefined Prob Level. When the standard error is zero, the Log10(Corrected Prob Level) is put in the bin at -5.

Histogram of Z(Prob Level)

Check this box to obtain a histogram of Z-transformed unadjusted (raw) probability levels. The Z-transformation converts the probability level into the corresponding standard normal distribution value using the probability integral transform. Occasionally, a standard error of zero occurs, producing an undefined Prob Level. When the standard error is zero, the Z(Prob Level) is put in the bin at -9 or 9.


Histogram of Z(Corrected Prob Level)

Check this box to obtain a histogram of Z-transformed corrected probability levels. The Z-transformation converts the probability level into the corresponding standard normal distribution value using the probability integral transform. Occasionally, a standard error of zero occurs, producing an undefined Prob Level. When the standard error is zero, the Z(Corrected Prob Level) is put in the bin at -9 or 9. The Bonferroni and False Discovery Rate corrections usually result in many Corrected Prob Levels of 1.0. For these, Z(Corrected Prob Level) is also set to -9 or 9.

Histogram of Difference of Means

Check this box to obtain a histogram of all differences in means.

Histogram of SE

Check this box to obtain a histogram of all standard errors.

Histogram of T Value

Check this box to obtain a histogram of all T-statistic values. Occasionally, a standard error of zero occurs, producing an undefined T Value. When the standard error is zero, the T Value is put in the bin at -20 or 20.

Storage Tab

The options on this panel control the storage of pre-processed data values on the spreadsheet for further analysis.

Spreadsheet Storage of the NAMES of Significant Genes

These options determine whether the names of significant genes will be stored and where.

Store the names of the most significant genes on the spreadsheet

Check this box to store a list of names of the most significant genes into the variable (column) specified under Store Gene Names in Variable.

Store Gene Names in Variable

If the box immediately below is checked, the names of the most significant genes will be stored in the column associated with this variable.

Any data that is already in this variable will be overwritten.

Spreadsheet Storage of the EXPRESSION VALUES of Significant Genes

These options determine whether the expression values of significant genes will be stored and where.

Store the data values of the most significant genes on the spreadsheet

Check this box to store the pre-processed data values of all genes for which the corrected probability level is below the cutoff value.

This allows the user to utilize other procedures to obtain follow-up analyses and graphics for the significant genes.


Store Expression Values Beginning with Variable The values of the most significant gene will be stored in this variable. The values for each additional significant gene are stored in the variables immediately to the right of this variable.

Leave this value blank if you want the data storage to begin in the first blank column on the right-hand side of the data.

WARNING: Use caution when selecting this variable, since existing data is automatically replaced when the storage variables are created.

Maximum Storage Variables Used Specify the maximum number of variables (columns) for which you want the gene intensity data stored on the spreadsheet. This choice may be particularly important when the number of significant genes is large.

Note that NCSS spreadsheets are limited to 255 variables, so if you want to store more values, you will have to add more sheets.

Subsets 1 - 9 Tabs The options on this panel control the names and lists of subsets.

Subset 1 – 9

Name The name of the gene subset is entered here.

Separate reports may be generated to show all genes of a subset (see Reports tab). This may be useful for examining probability levels of specific genes you are interested in that do not make the cutoff.

Genes in this Subset Enter a list of genes that are to be in this subset. The genes may be entered directly, or the * character may be used to specify all genes with a particular beginning. The gene names or IDs entered in this list must be in the column specified in Gene Name From box on the Variables tab.

EXAMPLES:

Blank

spike1

spike3

spike* (all names beginning with spike)

AA44719

NM_00582

NM_04762

cntrl* (all names beginning with cntrl)

file(C:\Microarray\genelist.txt) (all names in the genelist.txt file)

var(OutputGenes) (all names in the spreadsheet variable with the variable name OutputGenes)
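The direct-name and * entries above can be read as a simple matching rule. The following Python sketch illustrates that rule under stated assumptions: the function names are hypothetical, matching is assumed to be case-sensitive, the * is assumed to appear only at the end of an entry, and the file(...) and var(...) forms are not handled. The gene name cntrl_a is made up for the illustration.

```python
def gene_matches(entry, gene_name):
    """Return True if gene_name is selected by one subset list entry."""
    if entry.endswith("*"):
        # "spike*" selects every name that begins with "spike".
        return gene_name.startswith(entry[:-1])
    return gene_name == entry


def genes_in_subset(entries, gene_names):
    """Keep the gene names selected by at least one entry, in input order."""
    return [g for g in gene_names if any(gene_matches(e, g) for e in entries)]


entries = ["spike*", "AA44719", "cntrl*"]
genes = ["spike1", "spike3", "AA44719", "NM_00582", "cntrl_a", "40515_at"]
subset = genes_in_subset(entries, genes)
print(subset)
```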


These Genes are Specify here whether the genes of this subset are to be included or excluded from the list of genes that are analyzed. Probability levels will not be calculated for the genes of this subset when 'Excluded' is entered here.

(Plotting) Symbol Click on the symbol or on the button to its right to display a window that allows you to change the characteristics of the plotting symbol. This plotting symbol will be used in the Prob Level vs Mean Difference plots.

Non-Subset (Ungrouped) Genes

Name of Ungrouped Set Enter the subset name to be used for all genes that are not included in any of the nine subsets.

Ungrouped Genes are Specify here whether the genes not listed in any other subset are to be included or excluded from the list of genes that are analyzed. Probability levels will not be calculated for these genes when 'Excluded' is entered here.

Excluding the genes of the ungrouped subset may be useful when you want to analyze only a small subset of the genes on the array.

(Plotting) Symbol Click on the symbol or on the button to its right to display a window that allows you to change the characteristics of the plotting symbol. This plotting symbol will be used in the Prob Level vs Mean Difference plots.

P vs M Plot Tab The options on this panel control the appearance of the Prob Level and Corrected Prob Level vs Mean Difference (Volcano) plots.

Vertical and Horizontal Axes These options control the vertical and horizontal axes attributes.

Label Enter text here for the designated label. If {Y} is entered here for the vertical axis, an appropriate default label will be used.

Minimum Specify the value to be displayed as the minimum on this axis. Data values less than this amount will be ignored.

If this value is left blank, the minimum will be determined from the data.

Maximum Specify the value to be displayed as the maximum on this axis. Data values greater than this amount will be ignored.

If this value is left blank, the maximum will be determined from the data.


Tick Label Settings… This option specifies the characteristics of the reference numbers. It displays a window that edits the font size and color of the reference numbers that appear next to the text along the axis of the plot. It also allows you to set the number of digits in the reference numbers as well as their vertical/horizontal orientation.

Note that in some cases, the format specified here is overridden by the variable's format as specified on the database in the Variable Info Sheet.

Major Ticks Specify the number of large tickmarks and optional grid lines along this axis. A set of minor tickmarks will be generated between each pair of major tickmarks. A reference number is displayed adjacent to each major tickmark.

Minor Ticks Select the number of small tickmarks to be displayed between each pair of major (large) tickmarks along this axis.

Show Grid Lines Check this option to display grid lines at the major tickmarks along this axis.

NOTE: Since the grid lines are drawn out from the tickmarks, they appear perpendicular to the axis. Thus, checking the Y Grid Lines will actually cause horizontal grid lines to appear.

P vs M Plot Settings These options are used to specify the appearance of the P vs M plots.

Plot Style File Designate a scatter plot style file. This file sets all scatter plot options that are not set directly on this panel. Unless you choose otherwise, the ChipPlot style file is used. Scatter plot style files are created in the Scatter Plots procedure.

Log Scale (Y) Select an optional log scaling (base 10) for the vertical axis. The options are:

• No Use regular scaling along this axis.

• Yes: Numbers Use logarithmic scaling (base 10) in which the tickmark reference numbers are displayed as numbers (e.g., .1, .01, .001, .0001).

• Yes: Powers of Ten Use logarithmic scaling (base 10) in which the tickmark reference numbers are displayed as the exponents of ten (e.g., -3, -2, -1, 0).

Legend Enter text here for the Prob Level vs Mean Difference plot legend title.


H0 Value Line Check this box to display a vertical line on the P vs M plot at the H0 Value. The H0 Value is specified under the Variables tab.

Prob Cutoff Line Check this box to display a horizontal line on the plot at the Prob Level Cutoff value. The Prob Level Cutoff value is specified under the Reports tab.

Interior Color Specify the interior color of the plot.

Background Color Specify the background color of the plot.

Plot Title

Plot Title Enter text here for the Prob Level vs Mean Difference plot title.

Histograms Tab The options on this panel control the appearance of the histograms.

Vertical and Horizontal Axes These options are used to format the histogram axes.

Label Enter text here for the designated label.

REPLACEMENT CODES:

The following code is replaced by the appropriate name when the plot is generated.

{X} is replaced by the statistic that is reported in the histogram.

Minimum Specify the value to be displayed as the minimum on this axis. Data values less than this amount will be ignored.

If this value is left blank, the minimum will be determined from the data.

Maximum Specify the value to be displayed as the maximum on this axis. Data values greater than this amount will be ignored.

If this value is left blank, the maximum will be determined from the data.

Tick Label Settings… This option specifies the characteristics of the reference numbers. It displays a window that edits the font size and color of the reference numbers that appear next to the text along the axis of the plot. It also allows you to set the number of digits in the reference numbers as well as their vertical/horizontal orientation.


Note that in some cases, the format specified here is overridden by the variable's format as specified on the database in the Variable Info Sheet.

Major Ticks Specify the number of large tickmarks and optional grid lines along this axis. A set of minor tickmarks will be generated between each pair of major tickmarks. A reference number is displayed adjacent to each major tickmark.

Minor Ticks Select the number of small tickmarks to be displayed between each pair of major (large) tickmarks along this axis.

Show Grid Lines Check this option to display grid lines at the major tickmarks along this axis.

NOTE: Since the grid lines are drawn out from the tickmarks, they appear perpendicular to the axis. Thus, checking the Y Grid Lines will actually cause horizontal grid lines to appear.

Histogram Settings These options are used to specify the appearance of the histograms.

Style File Designate a histogram style file. This file sets all histogram options that are not set directly on this panel. Unless you choose otherwise, the HistoBox style file is used. Histogram style files are created in the Histograms procedure.

Number of Bars Specify the number of bars (bins) to be displayed. Select '0 - Automatic' to direct the program to select an appropriate number based on the number of values.

Interior Color Specify the histogram interior color.

Background Color Specify the histogram background color.

Bar Fill Color Specify the color of the inside of the bars.

Bar Border Color Specify the color of the lines around the bars.

Horizontal Axis Minimums and Maximums

Horizontal Axis Minimum Specify the value to be displayed as the minimum on this axis. Data values less than this amount will be ignored.

If this value is left blank, the minimum will be determined from the data.


Horizontal Axis Maximum Specify the value to be displayed as the maximum on this axis. Data values greater than this amount will be ignored.

If this value is left blank, the maximum will be determined from the data.

Histogram Title

Title Enter text here for the histogram title.

REPLACEMENT CODES:

The following code is replaced by the appropriate name when the plot is generated.

{X} is replaced by the statistic that is reported in the histogram.

Template Tab The options on this panel allow various sets of options to be loaded (File menu: Load Template) or stored (File menu: Save Template). A template file contains all the settings for this procedure.

Specify the Template File Name

File Name Designate the name of the template file either to be loaded or stored.

Select a Template to Load or Save

Template Files A list of previously stored template files for this procedure.

Template Id’s A list of the Template Id’s of the corresponding files. This id value is loaded in the box at the bottom of the panel.


Example 1 – Two-Sample T-Test with Five Arrays per Group This section presents an example of the two-sample T-test with five arrays in each group, with 345 genes probed per array. The ten arrays used in the example have already been pre-processed using one of the pre-processing procedures. In this example, the two-sided two-sample T-test is used to determine which genes are differentially expressed. The spreadsheet data used are recorded in the TTest2G_Ex1 dataset. You may follow along here by making the appropriate entries, or you may load the completed template Example 1 from the Template tab of the T-Test – Two Groups window.

1 Open the TTest2G_Ex1 dataset.
• From the File menu of the NCSS Data window, select Open.
• Select the Data subdirectory of your NCSS directory.
• Open the GESS folder.
• Click on the file TTest2G_Ex1.S0.
• Click Open.

2 Open the GESS T-Test – Two Groups window.
• On the menus, select GESS, then T-Test Routines, then Two Groups. The T-Test – Two Groups procedure will be displayed.
• On the menus, select File, then New Template. This will fill the procedure with the default template. Alternatively, load the Example 1 Template, which generates the specifications described below.

3 Specify the variables and hypothesis test details.
• On the T-Test – Two Groups window, select the Variables tab.
• Set the Response GES Files Variable to OutputFile.
• Set the Group Variable to Group.
• Set the H0 Value to 0.0.
• Set the Alternative Hypothesis to Mean1 <> Mean2.
• Set the Multiple Test Correction to False Discovery Rate Control.
• Set Variances of Two Groups Assumed to be to Equal.

4 Specify the reports.
• Select the Reports tab.
• Check the box next to Test Detail Sorted by Prob Level.
• Set the Prob Level Cutoff to 0.05.
• Check all the other boxes except Test Detail Sorted by Gene Within Subset.

5 Run the procedure.
• From the Run menu, select Run Procedure. Alternatively, just click the Run button (the left-most button on the button bar at the top).


T-Test Detail in Probability Level Order

T-Test Detail in Probability Level Order
Alternative Hypothesis: Mean of A - Mean of B <> 0

                    FDR Adjusted
                    Multiple      Single
Gene       Subset   Tests         Test                   Counts    Mean         Standard
Name       Name     Prob Level    Prob Level   T Value   (N1/N2)   Difference   Error
40515_at   Other    0.0000006     0.0000000     30.138   5/5        3.1832      0.1056
39425_at   Other    0.0000008     0.0000000     25.420   5/5        3.2847      0.1292
100084_at  Other    0.0000008     0.0000000    -24.276   5/5       -3.7710      0.1553
38730_at   Other    0.0000008     0.0000000    -24.007   5/5       -3.8015      0.1583
94766_at   Other    0.0000023     0.0000000     20.531   5/5        2.8419      0.1384
37725_at   Other    0.0000057     0.0000001    -17.870   5/5       -2.8102      0.1573
37029_at   Other    0.0000656     0.0000013    -12.771   5/5       -2.3668      0.1853
37001_at   Other    0.0004204     0.0000097      9.816   5/5        1.9633      0.2000
101482_at  Other    0.0004998     0.0000130     -9.439   5/5       -2.2982      0.2435
31962_at   Other    0.0012867     0.0000373      8.177   5/5        2.5772      0.3152

Total number of hypothesis tests conducted = 345

This report displays the genes for which the False Discovery Rate adjusted Prob Level is less than 0.05. That is, no more than about 5% of the genes on this list are expected to be false discoveries.

Gene Name This is the name or ID of the genes for which the False Discovery Rate adjusted Prob Level is less than 0.05.

Subset Name This is the name of the specified subset to which this gene belongs. If the gene is not a member of a subset list, the default subset name is Other.

FDR Adjusted Multiple Tests Prob Level This is the Prob Level for the specified hypothesis test following a False Discovery Rate correction.

Single Test Prob Level This is the Prob Level of the individual test, before multiple test correction is done.

T Value This is the value of the T Statistic used to conduct the hypothesis test of interest.

Counts This is the number of intensity values from each group used in the calculation of the T Statistic. If there are no missing values, this is the number of arrays in each group.

Mean Difference This is the difference in average intensity values between the two groups for each of the significant genes. It is compared to the H0 value in the hypothesis test.

Standard Error This is the standard error of the difference in average intensity values for each of the significant genes. It is the denominator of the T Statistic.
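The relationship between the Single Test Prob Level and the two kinds of corrected Prob Level can be illustrated with a short Python sketch. It assumes the Benjamini-Hochberg step-up adjustment for the False Discovery Rate correction, which may differ in detail from what GESS computes. The function names and the four input values are made up for the illustration.

```python
def bonferroni(p_values):
    """Experiment-wise adjustment: multiply each p-value by the test count."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """FDR adjustment (assumed Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up single-test Prob Levels
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
print(bonf)
print(fdr)
```

In this sketch the False Discovery Rate adjusted values are never larger than the Bonferroni adjusted values, which is why the False Discovery Rate correction typically leaves more genes below a given cutoff.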


Histograms and Plots Section

[Figure: six histograms, each with Count on the vertical axis: Histogram of Prob Level; Histogram of Corrected Prob Level; Histogram of Log10(Prob Level); Histogram of Log10(Corrected Prob Level); Histogram of Z(Prob Level); Histogram of Z(Corrected Prob Level).]

These six plots are used to examine the distribution of the P-Values (Prob Levels) of all genes in the experiment, before and after the multiple testing correction. The Log (Base 10) and Z (Normal) transformations aid in examining the distribution of the P-Values (Prob Levels) that are extremely close to zero.


[Figure: two histograms, each with Count on the vertical axis: Histogram of Difference in Means; Histogram of SE.]

The distributions of the difference in means and standard errors give a feel for the components (numerator and denominator) of the calculated T Values. Often these plots will be omitted.

[Figure: Histogram of T Value (horizontal axis: T Value; vertical axis: Count).]

The distribution of the T Statistics can show the position of extreme T Values. Often this plot will be omitted.

[Figure: two Prob Level vs Mean Difference Plots, each with Difference in Means on the horizontal axis and a log-scaled vertical axis: the first shows Prob Level, the second shows Corrected Prob Level.]

The Prob Level and Corrected Prob Level vs Mean Difference plots allow the user to assess practical significance (X axis) as well as statistical significance (Y axis). The vertical line shows the H0 Value. The Prob Level (P-Value) cutoff is included on the Corrected Prob Level plot.


Example 2 – Analysis Steps This section presents an example of the two-sample T-test involving 7 arrays in each group, with 345 genes probed per array. The purpose of the experiment is to determine which genes are expressed higher in Group 2 than in Group 1.

Step 1 – Pre-Processing The 14 arrays used in the example have already been pre-processed using one of the pre-processing procedures. The spreadsheet containing the paths to these files is the TTest2G_Ex2 dataset. To open the TTest2G_Ex2 dataset, use the following steps.

1 Open the TTest2G_Ex2 dataset.
• From the File menu of the NCSS Data window, select Open.
• Select the Data subdirectory of your NCSS directory.
• Open the GESS folder.
• Click on the file TTest2G_Ex2.S0.
• Click Open.

Step 2 – Spreadsheet Setup The TTest2G_Ex2 dataset should appear as

TTest2G_Ex2 dataset

Group   OutputFile
1       %p%\data\gess\TTest2G\TTest2G_Ex2_1.ges
1       %p%\data\gess\TTest2G\TTest2G_Ex2_2.ges
1       %p%\data\gess\TTest2G\TTest2G_Ex2_3.ges
1       %p%\data\gess\TTest2G\TTest2G_Ex2_4.ges
1       %p%\data\gess\TTest2G\TTest2G_Ex2_5.ges
1       %p%\data\gess\TTest2G\TTest2G_Ex2_6.ges
1       %p%\data\gess\TTest2G\TTest2G_Ex2_7.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_8.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_9.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_10.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_11.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_12.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_13.ges
2       %p%\data\gess\TTest2G\TTest2G_Ex2_14.ges

Random numbers may be entered into a vacant column to verify that the setup is correct. The title for the column may be named Random. The spreadsheet should now look like the following.


TTest2G_Ex2 dataset

Group   OutputFile                                  Random
1       %p%\data\gess\TTest2G\TTest2G_Ex2_1.ges     5
1       %p%\data\gess\TTest2G\TTest2G_Ex2_2.ges     8
1       %p%\data\gess\TTest2G\TTest2G_Ex2_3.ges     7
1       %p%\data\gess\TTest2G\TTest2G_Ex2_4.ges     3
1       %p%\data\gess\TTest2G\TTest2G_Ex2_5.ges     4
1       %p%\data\gess\TTest2G\TTest2G_Ex2_6.ges     6
1       %p%\data\gess\TTest2G\TTest2G_Ex2_7.ges     4
2       %p%\data\gess\TTest2G\TTest2G_Ex2_8.ges     8
2       %p%\data\gess\TTest2G\TTest2G_Ex2_9.ges     10
2       %p%\data\gess\TTest2G\TTest2G_Ex2_10.ges    7
2       %p%\data\gess\TTest2G\TTest2G_Ex2_11.ges    9
2       %p%\data\gess\TTest2G\TTest2G_Ex2_12.ges    11
2       %p%\data\gess\TTest2G\TTest2G_Ex2_13.ges    13
2       %p%\data\gess\TTest2G\TTest2G_Ex2_14.ges    10

Alternatively, open the TTest2G_Ex2a dataset.

1 Open the TTest2G_Ex2a dataset.
• From the File menu of the NCSS Data window, select Open.
• Select the Data subdirectory of your NCSS directory.
• Open the GESS folder.
• Click on the file TTest2G_Ex2a.S0.
• Click Open.

To analyze the random column using the NCSS: T-Test – Two-Sample procedure, take the following steps.

2 Open the NCSS: T-Test – Two-Sample window.
• On the menus, select Analysis, then T-Tests, then T-Test – Two-Sample. The NCSS: T-Test – Two-Sample procedure will be displayed.
• On the menus, select File, then New Template. This will fill the procedure with the default template.

3 Specify the variables.
• Select the Variables tab.
• Set Response Variable(s) to Random.
• Set Group Variables to Group.
• Set the H0 Value to 0.

4 Run the procedure.
• From the Run menu, select Run Procedure. Alternatively, just click the Run button (the left-most button on the button bar at the top).


T-Test Output The Descriptive Statistics through Aspin-Welch Unequal-Variance Test Sections should appear as follows.

Descriptive Statistics Section Standard Standard 95% LCL 95% UCL Variable Count Mean Deviation Error of Mean of Mean Group=1 7 5.285714 1.799471 0.680136 3.621481 6.949947 Group=2 7 9.714286 1.976047 0.7468756 7.886747 11.54182 Note: T-alpha (Group=1) = 2.4469, T-alpha (Group=2) = 2.4469 Confidence-Limits of Difference Section Variance Mean Standard Standard 95% LCL 95% UCL Assumption DF Difference Deviation Error of Mean of Mean Equal 12 -4.428571 1.889822 1.010153 -6.629505 -2.227638 Unequal 11.90 -4.428571 2.672612 1.010153 -6.631633 -2.22551 Note: T-alpha (Equal) = 2.1788, T-alpha (Unequal) = 2.1809 Equal-Variance T-Test Section Alternative Prob Decision Power Power Hypothesis T-Value Level (5%) (Alpha=.05) (Alpha=.01) Difference <> 0 -4.3841 0.000890 Reject Ho 0.979922 0.881602 Difference < 0 -4.3841 0.000445 Reject Ho 0.993348 0.938335 Difference > 0 -4.3841 0.999555 Accept Ho 0.000000 0.000000 Difference: (Group=1)-(Group=2) Aspin-Welch Unequal-Variance Test Section Alternative Prob Decision Power Power Hypothesis T-Value Level (5%) (Alpha=.05) (Alpha=.01) Difference <> 0 -4.3841 0.000908 Reject Ho 0.979758 0.880518 Difference < 0 -4.3841 0.000454 Reject Ho 0.993303 0.937774 Difference > 0 -4.3841 0.999546 Accept Ho 0.000000 0.000000 Difference: (Group=1)-(Group=2)

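The Count is 7 per group, as expected. Because the experiment asks whether expression is higher in Group 2 than in Group 1, the relevant T-Test (under either the equal-variance or the unequal-variance assumption) would be the second one, Difference < 0. The test result itself carries no meaning here because these numbers are fictitious. The check confirms that the setup is appropriate.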
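The figures in this output can be checked by hand. The following Python sketch recomputes the mean difference, the standard error, the T-Value, and the unequal-variance degrees of freedom from the 14 values in the Random column. The standard pooled-variance and Welch-Satterthwaite formulas are assumed to be what NCSS uses, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, variance

# Values of the Random column, split by the Group column.
group1 = [5, 8, 7, 3, 4, 6, 4]
group2 = [8, 10, 7, 9, 11, 13, 10]

n1, n2 = len(group1), len(group2)
v1, v2 = variance(group1), variance(group2)  # sample variances
diff = mean(group1) - mean(group2)           # (Group=1)-(Group=2)

# Equal-variance test: pooled variance with n1 + n2 - 2 degrees of freedom.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_equal = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_equal = diff / se_equal
df_equal = n1 + n2 - 2

# Unequal-variance test: Welch-Satterthwaite degrees of freedom.
se_unequal = sqrt(v1 / n1 + v2 / n2)
t_unequal = diff / se_unequal
df_unequal = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

print(round(diff, 6), round(se_equal, 6), round(t_equal, 4), df_equal)
print(round(se_unequal, 6), round(t_unequal, 4), round(df_unequal, 2))
```

With equal group sizes the two standard errors coincide, which is why both T-Test sections report the same T-Value and differ only in their degrees of freedom and Prob Levels.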

Step 3 – Run the Analysis The following steps should be taken to run the analysis. You may follow along here by making the appropriate entries or load the completed template Example 2 from the Template tab of the T-Test – Two Groups window.

1 Open the GESS T-Test – Two Groups window.
• On the menus, select GESS, then T-Test Routines, then Two Groups. The T-Test – Two Groups procedure will be displayed.
• On the menus, select File, then New Template. This will fill the procedure with the default template. Alternatively, load the Example 2 Template, which generates the specifications described below.

2 Specify the variables and hypothesis test details.
• On the T-Test – Two Groups window, select the Variables tab.
• Set the Response GES Files Variable to OutputFile.
• Set the Group Variable to Group.
• Set the H0 Value to 0.
• Set the Alternative Hypothesis to Mean1 < Mean2.
• Set the Multiple Test Correction to False Discovery Rate Control.
• Set Variances of Two Groups Assumed to be to Unequal.

3 Specify the reports.
• Select the Reports tab.
• Check the box next to Test Detail Sorted by Prob Level.
• Set the Prob Level Cutoff to 0.05.
• Check the box next to Corrected Prob Level vs Mean Difference.
• Check the box next to Z(Prob Level).

4 Specify the storage options.
• Select the Storage tab.
• Check the box next to Store the data values of the most significant genes on the spreadsheet.
• Set Store Expression Values Beginning with Variable to C4.
• Set Maximum Storage Variables Used to 5.

5 Run the procedure.
• From the Run menu, select Run Procedure. Alternatively, just click the Run button (the left-most button on the button bar at the top).

T-Test Detail in Probability Level Order

T-Test Detail in Probability Level Order
Alternative Hypothesis: Mean of 1 - Mean of 2 < 0

                    FDR Adjusted
                    Multiple      Single
Gene       Subset   Tests         Test                   Counts    Mean         Standard
Name       Name     Prob Level    Prob Level   T Value   (N1/N2)   Difference   Error
41237_at   Other    0.0000000     0.0000000    -43.313   7/7       -3.1017      0.0716
31962_at   Other    0.0000028     0.0000000    -13.371   7/7       -2.9648      0.2217
38730_at   Other    0.0000074     0.0000001    -12.019   7/7       -2.4023      0.1999
93822_at   Other    0.0000708     0.0000008    -11.972   7/7       -2.0463      0.1709
100084_at  Other    0.0002342     0.0000034     -8.429   7/7       -1.1437      0.1357
39425_at   Other    0.0005160     0.0000090     -6.903   7/7       -1.9447      0.2817

Total number of hypothesis tests conducted = 345

This report displays the genes for which the False Discovery Rate adjusted Prob Level is less than 0.05.


Histograms and Plots Section

[Figure: Histogram of Z(Prob Level) (horizontal axis: Z(Prob Level); vertical axis: Count) and Prob Level vs Mean Difference Plot (horizontal axis: Difference in Means; vertical axis: Corrected Prob Level, log scale).]

The histogram shows a small group of genes at the left of the distribution, indicating perhaps a small group of significant genes. The Corrected Prob Level vs Mean Difference plot shows six genes below the Prob Level cutoff.

Step 4 – Follow-Up Analysis Fourteen pre-processed values should have been saved for 5 genes, beginning with X41237_at and ending with X100084_at (X is added at the beginning to avoid a variable name beginning with a number). More specific analyses of these 5 genes may be obtained using the NCSS: T-Test – Two-Sample procedure.
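The reason that 5 of the 6 significant genes are stored follows from the Maximum Storage Variables Used setting. The Python sketch below illustrates that selection using the FDR adjusted Prob Levels from the report above. It is an assumption that genes are stored in Prob Level order and that the X prefix is added only to names beginning with a digit. The function and variable names are hypothetical.

```python
# (gene name, FDR adjusted Prob Level) pairs taken from the Example 2 report.
report = [
    ("41237_at", 0.0000000),
    ("31962_at", 0.0000028),
    ("38730_at", 0.0000074),
    ("93822_at", 0.0000708),
    ("100084_at", 0.0002342),
    ("39425_at", 0.0005160),
]
cutoff = 0.05
max_storage_variables = 5


def variable_name(gene_name):
    """Prefix X so that a variable name never begins with a number."""
    return "X" + gene_name if gene_name[0].isdigit() else gene_name


# Keep genes below the cutoff, most significant first, up to the maximum.
significant = sorted(
    (row for row in report if row[1] < cutoff), key=lambda row: row[1]
)
stored = [variable_name(name) for name, _ in significant[:max_storage_variables]]
print(stored)
```

The sixth gene, 39425_at, is significant but is not stored because the maximum of 5 storage variables has been reached.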

1 Open the NCSS: T-Test – Two-Sample window.
• On the menus, select Analysis, then T-Tests, then T-Test – Two-Sample. The NCSS: T-Test – Two-Sample procedure will be displayed.
• On the menus, select File, then New Template. This will fill the procedure with the default template.

2 Specify the variables.
• Select the Variables tab.
• Set the Response Variable(s) to X41237_at-X100084_at.
• Set Group Variables to Group.
• Set the H0 Value to 0.

3 Run the procedure.
• From the Run menu, select Run Procedure. Alternatively, just click the Run button (the left-most button on the button bar at the top).


NCSS Two Sample T-Test Output

Descriptive Statistics Section

                              Standard    Standard       95% LCL    95% UCL
Variable   Count   Mean       Deviation   Error          of Mean    of Mean
Group=1    7       2.869166   0.1061587   4.012422E-02   2.770986   2.967346
Group=2    7       5.970853   0.1569322   5.931481E-02   5.825715   6.115992
Note: T-alpha (Group=1) = 2.4469, T-alpha (Group=2) = 2.4469

Confidence-Limits of Difference Section

Variance             Mean         Standard    Standard       95% LCL     95% UCL
Assumption   DF      Difference   Deviation   Error          of Mean     of Mean
Equal        12      -3.101687    0.1339728   7.161146E-02   -3.257715   -2.945659
Unequal      10.54   -3.101687    0.1894661   7.161146E-02   -3.260146   -2.943229
Note: T-alpha (Equal) = 2.1788, T-alpha (Unequal) = 2.2127

Equal-Variance T-Test Section

Alternative                  Prob       Decision    Power         Power
Hypothesis        T-Value    Level      (5%)        (Alpha=.05)   (Alpha=.01)
Difference <> 0   -43.3127   0.000000   Reject Ho   1.000000      1.000000
Difference < 0    -43.3127   0.000000   Reject Ho   1.000000      1.000000
Difference > 0    -43.3127   1.000000   Accept Ho   0.000000      0.000000
Difference: (Group=1)-(Group=2)

Aspin-Welch Unequal-Variance Test Section

Alternative                  Prob       Decision    Power         Power
Hypothesis        T-Value    Level      (5%)        (Alpha=.05)   (Alpha=.01)
Difference <> 0   -43.3127   0.000000   Reject Ho   1.000000      1.000000
Difference < 0    -43.3127   0.000000   Reject Ho   1.000000      1.000000
Difference > 0    -43.3127   1.000000   Accept Ho   0.000000      0.000000
Difference: (Group=1)-(Group=2)

Tests of Assumptions Section

Assumption                            Value    Probability   Decision(5%)
Skewness Normality (Group=1)          0.0000
Kurtosis Normality (Group=1)                   1.000000      Cannot reject normality
Omnibus Normality (Group=1)
Skewness Normality (Group=2)          0.0000
Kurtosis Normality (Group=2)                   1.000000      Cannot reject normality
Omnibus Normality (Group=2)
Variance-Ratio Equal-Variance Test    2.1853   0.364013      Cannot reject equal variances
Modified-Levene Equal-Variance Test   0.6885   0.422876      Cannot reject equal variances

Median Statistics

                              95% LCL     95% UCL
Variable   Count   Median     of Median   of Median
Group=1    7       2.847067   2.709467    3.036104
Group=2    7       5.975441   5.812292    6.262713

Mann-Whitney U or Wilcoxon Rank-Sum Test for Difference in Medians

           Mann        W           Mean   Std Dev
Variable   Whitney U   Sum Ranks   of W   of W
Group=1    0           28          52.5   7.826238
Group=2    49          77          52.5   7.826238
Number Sets of Ties = 0, Multiplicity Factor = 0

              Exact Probability      Approximation Without Correction   Approximation With Correction
Alternative   Prob       Decision               Prob       Decision                Prob       Decision
Hypothesis    Level      (5%)        Z-Value    Level      (5%)         Z-Value    Level      (5%)
Diff<>0       0.000583   Reject Ho   -3.1305    0.001745   Reject Ho    -3.0666    0.002165   Reject Ho
Diff<0        0.000291   Reject Ho   -3.1305    0.000873   Reject Ho    -3.0666    0.001083   Reject Ho
Diff>0        0.999709   Accept Ho   -3.1305    0.999127   Accept Ho    -3.1944    0.999299   Accept Ho


Kolmogorov-Smirnov Test For Different Distributions

Alternative   Dmn               Reject Ho if   Test Alpha   Decision       Prob
Hypothesis    Criterion Value   Greater Than   Level        (Test Alpha)   Level
D(1)<>D(2)    1.000000          0.6556         .050         Reject Ho      0.0006
D(1)<D(2)     1.000000          0.6556         .025         Reject Ho
D(1)>D(2)     0.000000          0.6556         .025         Accept Ho

Plots Section

[Figure: Histogram of X41237_at when Group=1 and Histogram of X41237_at when Group=2 (vertical axis: Count); Normal Probability Plot of X41237_at when Group=1 and Normal Probability Plot of X41237_at when Group=2 (horizontal axis: Expected Normals); Box Plot of X41237_at by Groups (G1, G2).]

Assumptions and nonparametric tests along with graphics can be used to further study the results of each particular gene that is found to be statistically significant. Notice, however, that no correction is made for multiple testing across genes in NCSS.
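As an example of reading the nonparametric output, the Mann-Whitney figures reported for X41237_at can be reproduced from the group sizes alone. A Mann-Whitney U of 0 for Group 1 means that every Group 1 value lies below every Group 2 value. The Python sketch below assumes that complete separation, no ties, and the usual normal approximation with a 0.5 continuity correction. The variable names are illustrative.

```python
from math import comb, sqrt

n1, n2 = 7, 7

# Complete separation: Group 1 holds ranks 1..7 and Group 2 holds ranks 8..14.
w1 = sum(range(1, n1 + 1))
w2 = sum(range(n1 + 1, n1 + n2 + 1))
u1 = w1 - n1 * (n1 + 1) // 2
u2 = w2 - n2 * (n2 + 1) // 2

# Mean and standard deviation of the rank sum W under H0 (no ties).
mean_w = n1 * (n1 + n2 + 1) / 2
sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

z_without_correction = (w1 - mean_w) / sd_w
z_with_correction = (w1 - mean_w + 0.5) / sd_w

# Exact test: only one of the C(14, 7) equally likely rank assignments
# is as extreme as the observed one in each direction.
p_exact_one_sided = 1 / comb(n1 + n2, n1)
p_exact_two_sided = 2 * p_exact_one_sided

print(u1, w1, u2, w2, mean_w, round(sd_w, 6))
print(round(z_without_correction, 4), round(z_with_correction, 4))
print(round(p_exact_one_sided, 6), round(p_exact_two_sided, 6))
```

With 7 arrays per group, 0.000583 is the smallest two-sided exact Prob Level the rank-sum test can produce, so nonparametric tests on small groups cannot reach the very small Prob Levels that the T-test reports.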
