JMP Genomics
Version 3.1
User Guide
“Creativity involves breaking out of established patterns in order to look at things in a different way.” Edward de Bono
JMP. A Business Unit of SAS
SAS Campus Drive
Cary, NC 27513 www.jmp.com
The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. JMP® Genomics User Guide. Cary, NC: SAS Press.
JMP® Genomics User Guide
Copyright © 2007, SAS Institute Inc., Cary, NC, USA
All rights reserved. Produced in the United States of America.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.
U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227–19 Commercial Computer Software-Restricted Rights (June 1987). SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.
1st printing, August 2006
SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.
JMP®, SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
Table of Contents
Chapter Title
1 Introduction
2 Designing New Experiments
3 Creating Data Sets for Analysis in JMP Genomics
4 Data Set Utilities
5 Genetic Marker Case-Control Data
6 Genetic Marker Family or Pedigree Data
7 Microarray Case Study I: The Drosophila Aging Experiment
8 Microarray Case Study II: Affymetrix Latin Square Data
9 Proteomics Spectral Preprocessing: The Prostate Cancer Example
10 Predictive Modeling
11 Annotation Analysis
12 Troubleshooting
References
Appendix
Chapter 1
Introduction
Welcome to JMP Genomics, a powerful desktop software system for integrated statistical analysis of genetic marker, microarray, and spectral (proteomics and metabolomics, for example) data. The purpose of this manual is to provide you with informative examples of how to use JMP Genomics to extract the maximum amount of useful information from genomics data. You should be familiar with the terminology and technology associated with modern genomics analyses and standard JMP functionality. The JMP Introductory Guide provides information on getting started with JMP. This manual is organized as a set of tutorials. Follow along with the JMP Genomics software as you read the manual. The conventions illustrated in Table 1.1 are used throughout this manual.
Table 1.1: Text conventions
Symbol/Font/Style      Used to designate:
[symbol]               Instruction or task to be performed
A > B > C              Navigation path from A to B to C, used for paths through nested directories
A > B > C              Navigation path from A to B to C, used for paths through menus
Choose or Run          Buttons or commands
General or Options     Names of data tables, column headings, and other text generated by JMP (set in a different font)
General or Options     Text to be typed by the user

This chapter provides an overview of the primary functional aspects of the JMP Genomics system, descriptions of some important differences between standard JMP functionality and JMP Genomics functionality, and descriptions of the included sample data sets.
Genomics Main Menu
JMP Genomics is a fully functional version of JMP plus a collection of analytical process dialogs in the Genomics main menu (Figure 1.1). It provides access to more than 100 analytical processes.
Figure 1.1: The JMP Genomics main menu is organized into submenus
Some Important Differences Between JMP and JMP Genomics
JMP Genomics Dialogs
JMP Genomics dialogs function differently from standard JMP dialogs. Standard JMP dialogs invoke calculations in compiled code, whereas JMP Genomics dialogs generate a SAS program (with suffix .sas), execute it in the background, and then return results. The results typically consist of SAS data sets (also known as SAS data tables, with suffix .sas7bdat) along with a JMP scripting language file (with suffix .jsl) that automatically invokes standard JMP platforms. Small Java programs facilitate some of the calculations. The interaction between JMP, SAS, and Java can be depicted as follows:
Figure 1.2: Interaction between JMP Genomics, SAS and Java
Data Sets
An important distinction of most JMP Genomics dialogs is that they do not process open JMP data tables. Instead, they prompt you to specify one or more SAS data sets that have been created and saved in your file system. This characteristic enables you to work with very large data sets without having to open them as JMP data tables and to specify multiple SAS data sets in one process. The creation and use of JMP Genomics data sets is described more fully in Chapter 3.
Deciding Which Processes to Run
An initial challenge in using JMP Genomics is deciding which processes to run and in what order. The software does not provide detailed guidance on constructing a workflow, and there is a wide variety of possible workflow combinations depending upon your discovery objectives. The Genomics menu organizes the JMP Genomics processes into groups. The groups are arranged in the order typically followed by bioinformaticians, statisticians, and data analysts. However, you are free to rearrange the menus to your liking. Refer to the JMP Genomics Programmers Guide for details on customizing menus. JMP Genomics processes are modular, so they can be run in any order. Over time, you develop expertise with the system and form favorite workflows. The sample case studies outlined in this manual illustrate some typical, frequently used workflows.
Running a Process
To run a JMP Genomics process, select the process from one of the JMP Genomics menus, specify the parameters on all tabbed panes in the process dialog, and then click Run. The following example, which invokes the ArrayTrackInput Engine, illustrates a typical JMP Genomics process.
Select Genomics > Import > Other Expression > ArrayTrack. The following dialog opens.
Figure 1.3: A typical JMP Genomics Dialog (callouts: description box, parameter panes, functional buttons)
Asterisks (*) are used to indicate required parameters.
Each dialog has three main sections: a description box, one or more tabbed parameter panes, and functional buttons (illustrated in Figure 1.3). The description box on the top of the dialog describes the purpose of the process. The tabbed panes are the main area to specify input parameters. The six functional buttons, common to all of the JMP Genomics dialogs, are described in Table 1.2.
Table 1.2: Functional buttons
Functional Button Used to:
Run the process using the specified parameters
Save the specified parameters
Load selected, saved parameters into the dialog
Apply the specified parameters as default settings to all relevant JMP Genomics dialogs
Clear all the parameter settings and return the dialog to its default state
Cancel the process and close the dialog
Use these buttons to load, save, or clear specified parameters, run the process using the specified parameters, or apply those parameters to other JMP Genomics processes. There is a defined order to the specification of some parameters. Such parameters are disabled and grayed until their dependency requirements are fulfilled. Many processes contain multiple tabbed panes with numerous optional parameters. As you develop expertise with particular processes, make sure to investigate the often rich collection of parameters available.
Click the icon to the right of any parameter entry field to obtain help about its specification.
The General tab for each dialog typically contains the most important parameters for the process. For example, most processes require specific types of input files or data sets and an output folder. For our example, we want to open the AT_exp2.txt file. This Experimental Design File, which contains information about the experiment, is needed to import raw data into JMP Genomics and is discussed more fully in Chapter 3.
Click Choose (circled in Figure 1.4).
Figure 1.4: Click Choose to select a file or folder
When you installed JMP Genomics, a folder named Sample Data was also installed. Navigate to this folder and then to a file named AT_exp2.txt by following the path Sample Data > Microarray > ArrayTrack.
Click on the AT_exp2.txt file.
Click Open to select the file (circled in Figure 1.5).
Figure 1.5: Click Open to select the file
The file is added to the dialog, as shown in Figure 1.6.
Figure 1.6: The Experimental Design File has been specified
Our next step is to select the folder containing the raw data files.
Click Choose (circled in Figure 1.7).
Figure 1.7: Click Choose to select a file or folder
Navigate to the Sample Data folder and then to a folder named ArrayTrack by following the path Sample Data > Microarray > ArrayTrack.
Click the Select button (circled in Figure 1.8) at the bottom of the Choose directory window. Note: to select a folder in JMP Genomics, you must first open the folder.
Figure 1.8: Selecting the ArrayTrack folder
The next step is to choose a folder in which to place and store output. You may choose any folder you like. For this example, select the ProcessResults folder that came with JMP Genomics.
Repeat the selection process to specify the Output Folder.
The completed dialog is shown in Figure 1.9.
Figure 1.9: The completed ArrayTrackImport Engine dialog
Once you have specified the parameters for a process, click Save to save the parameters for later recall, if needed.
Click Run to run the process. JMP Genomics dialogs generate and run a SAS program each time you click Run. Depending upon the size of your data sets and capacities of your computer, some analyses can take several minutes or, for very large and complex runs, several hours. While a program is running, the message SAS Connected is displayed in the JMP status bar located in the lower left corner of your JMP window (circled in Figure 1.10).
Figure 1.10: Display during Analysis
While a process is running, it is a good idea to monitor progress using an application that displays statistics such as CPU, memory, and disk usage, like the Windows Task Manager. This can be informative for troubleshooting a hung process. You can only run one process at a time. If you attempt to run a second process while another one is running, you are prompted to disconnect from SAS and stop the current process, to view the current SAS log, or to wait until it completes. The location of each SAS data set generated by your analysis is listed in a new window (shown in Figure 1.11). You can view each of the data sets by clicking Open.
Figure 1.11: The SAS Message generated by our analysis
Saving and Loading Settings
JMP Genomics dialogs allow you to save and load parameter settings. This enables you to save, recall, modify, and exchange analyses without having to re-enter specifications each time you run a process. You can save and load settings using the Save and Load buttons at the bottom of each dialog. Most of the processes in JMP Genomics come with one or more example settings that use the example data sets that come with the system. A good way to learn about a new process is to load one of the example settings, study its parameter values, run the process, and explore the results.
SAS Variable Names and Labels
Each variable/column in a SAS data set must have a unique name. SAS variable names must adhere to the following conventions:
1) The first character must be a letter (A, B, C, …) or underscore ( _ ).
2) Subsequent characters can be letters, numeric digits (0, 1, 2, …) or underscores ( _ ).
3) Blank spaces are not allowed.
4) Special characters, except for underscore, are not allowed.
5) Names must not exceed 32 characters.
SAS variable names are not case-sensitive. SAS variables can be either character or numeric. In either case, a fixed length is assigned to store each observation of that variable. Optionally, SAS variables can have a label. Labels have much less restrictive creation rules. For example, SAS labels can be up to 256 characters in length and can contain blanks and special characters. When JMP opens a SAS data set, it reads the labels (when they exist) and uses them as JMP data table column names. If you want information on the variable names and labels for a SAS data set, run the Column Contents process under the Data Set Utilities menu. There are other processes available for changing SAS variable names, labels, and lengths.
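The naming conventions above can be checked before column names are handed to a JMP Genomics process. The following is a minimal Python sketch, not part of JMP Genomics; the function names is_valid_sas_name and find_name_clashes are hypothetical and introduced only for illustration.

```python
import re

# Pattern for the five rules listed above: a leading letter or underscore,
# followed by letters, digits, or underscores, 32 characters at most.
SAS_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_sas_name(name):
    """Return True if name satisfies the naming conventions (hypothetical helper)."""
    return SAS_NAME.fullmatch(name) is not None


def find_name_clashes(names):
    """Return pairs of names that collide when case is ignored.

    SAS variable names are not case-sensitive, so 'Age' and 'AGE'
    cannot both be used in one data set.
    """
    seen = {}
    clashes = []
    for name in names:
        key = name.upper()
        if key in seen:
            clashes.append((seen[key], name))
        else:
            seen[key] = name
    return clashes


for candidate in ["_gene1", "1gene", "gene name", "Cy3_Cy5_ratio"]:
    print(candidate, is_valid_sas_name(candidate))
print(find_name_clashes(["Age", "AGE", "Sex"]))
```

A check of this kind is most useful when column headings come from spreadsheets or instrument exports, where blanks and special characters are common.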
Sample Case Studies
The data sets included with JMP Genomics, which are detailed below, allow you to work through many of the analytical processes in JMP Genomics. In addition to the data sets, each case study includes experimental design files and other needed files. These case studies are referred to throughout this manual.
Drosophila Aging Experimental Data
This data set represents a small subset of the Drosophila aging experiment data (Jin, Riley et al. 2001). The experiment consisted of 24 two-color cDNA microarrays, 6 for each experimental combination of 2 lines (Oregon and Samarkand), 2 sexes (Female and Male), and 2 ages (1 week and 6 weeks). The Cy3 and Cy5 dyes were flipped for two of the 6 replicates for each genotype and sex combination. The design is a split-plot, with Age and Dye as subplot factors, and Line and Sex as whole-plot factors. A total of 4256 clones were spotted on the arrays, but this example uses a subset containing 100 randomly selected genes from the original data set.
Affymetrix Latin Square Data
The spike-in data set used in this example was originally generated by Affymetrix Corporation to develop and validate their U95A GeneChip and Microarray Suite (MAS) 5.0 algorithm over a range of known concentrations (Affymetrix, 2001). The experiment consists of 59 arrays. There are 14 experimental groups, designated with the letters a, b, c, d, e, f, g, h, i, j, k, l, m, and q. (Group m and group q each have 4 within-chip replicates. Group m replicates were originally designated n, o, and p, and group q replicates were originally designated r, s, and t. The extra letters are not needed because they are replicates of m and q, respectively.) Each experiment was repeated in triplicate using Affymetrix chips cut from different wafers. The last four digits of the wafer numbers are 1521, 1532, and 2353. Chip c from wafer 2353 was defective, so it is not included in the data set. For wafers 1521 and 1532, 20 .CEL files were generated, and for wafer 2353, 19 .CEL files were generated. Each group contains a pool of nonspecific RNA as well as a set of 14 distinct human transcripts spiked in at known concentrations of 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pM. The spike-in concentrations for each pool are staggered in a Latin square design. For purposes of rapid demonstration, the data have been trimmed to only 100 genes (including the 14 spike-ins), and trimmed versions of .CEL files containing just these 100 genes are available in your JMP Genomics Sample Data folder.
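To illustrate what "staggered in a Latin square design" means, the following Python sketch builds a 14 x 14 cyclic Latin square from the concentrations listed above. The cyclic shift is an assumption chosen for illustration, and the transcript positions 0 to 13 are hypothetical indices; the actual assignment of concentrations to transcripts in the Affymetrix data may differ.

```python
# Spike-in concentrations (pM) and group letters as listed in the text.
CONCENTRATIONS = [0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
GROUPS = list("abcdefghijklm") + ["q"]


def cyclic_latin_square(levels):
    """Row g, column t holds levels[(g + t) % n] (assumed cyclic layout)."""
    n = len(levels)
    return [[levels[(g + t) % n] for t in range(n)] for g in range(n)]


def is_latin_square(square):
    """Check that every row and every column contains each level exactly once."""
    n = len(square)
    levels = set(square[0])
    rows_ok = all(len(row) == n and set(row) == levels for row in square)
    cols_ok = all({row[t] for row in square} == levels for t in range(n))
    return len(levels) == n and rows_ok and cols_ok


square = cyclic_latin_square(CONCENTRATIONS)
design = dict(zip(GROUPS, square))  # group letter -> concentration per transcript

print(is_latin_square(square))
for group in GROUPS[:3]:
    print(group, design[group])
```

In a layout of this kind, every transcript is observed at every concentration across the 14 groups, and every group contains every concentration exactly once.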
Prostate Cancer Biomarkers
This data set was obtained by surface-enhanced laser desorption/ionization (SELDI). This method allows an investigator to detect and resolve multiple proteins bound to protein chip arrays (Merchant and Weinberger 2000). This approach was used by Qu et al. (2002) to discriminate prostate cancer from non-prostate cancer patients. The promise of this approach is that a panel of multiple biomarkers can be used to distinguish important phenotypes such as cancer status; however, great care must be taken to pre-process and analyze the data appropriately to ensure generalizability of results. The example data set consists of serum samples collected from 165 men, 84 of whom had prostate cancer. The remaining 81 men are considered to be controls. The primary goal is to determine differences in protein expression between these groups.

Sample Genetic Marker Data
These data are computer-simulated. The data are in wide form, with the 1000 rows corresponding to individuals and 130 columns corresponding to various data on these individuals. These data contain family, genotype, and phenotype information. The disease column contains the binary trait of primary interest, with 1 indicating individuals affected with the disease and 0 indicating unaffected individuals. There are also four quantitative traits and sixty markers, with two possible alleles (designated 1 and 2) per marker for each individual. The marker data occur in pairs, so that the genotype at the first marker comprises columns ma1 and ma2, the genotype at the second marker comprises columns ma3 and ma4, and so on. The analyses performed on this data set aim to locate the gene or genes that affect susceptibility to this disease. Accompanying this data set is a map data set that provides information about the 60 markers, which are spread across two hypothetical candidate gene regions. The variable that identifies the candidate gene on which each marker resides can be used to group analyses, and the Location variable is useful for accurately displaying distances in base pairs between markers along the x-axis of plots containing various association p-values.

Affected Sib-Pair (ASP) Data
Two hundred families, each containing an affected sib-pair and the siblings’ parents, were genotyped at 20 markers from a single chromosome in simulated data provided by Gonçalo Abecasis at the University of Michigan Center for Statistical Genetics. MERLIN was used to estimate identical-by-descent (IBD) allele-sharing probabilities at these markers for all pairs of related individuals. The 400 offspring are also measured for a quantitative trait of interest.
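For the Sample Genetic Marker Data, the paired allele columns can be mapped to markers programmatically. The Python sketch below assumes the naming pattern ma1, ma2, ma3, ma4 simply continues through ma120 for the 60 markers; that continuation, and the helper names marker_columns and genotype, are assumptions made for illustration.

```python
def marker_columns(n_markers=60, prefix="ma"):
    """Map each 1-based marker index to its pair of allele column names.

    Assumes marker m uses columns <prefix>(2m-1) and <prefix>(2m).
    """
    return {
        m: (f"{prefix}{2 * m - 1}", f"{prefix}{2 * m}")
        for m in range(1, n_markers + 1)
    }


def genotype(row, marker, columns):
    """Return the unordered genotype of one individual, such as '1/2'."""
    first, second = columns[marker]
    alleles = sorted((row[first], row[second]))
    return f"{alleles[0]}/{alleles[1]}"


columns = marker_columns()
individual = {"ma1": 1, "ma2": 1, "ma3": 2, "ma4": 1}  # made-up example row
print(columns[1], genotype(individual, 1, columns))
print(columns[2], genotype(individual, 2, columns))
```

Sorting the two alleles treats 1/2 and 2/1 as the same genotype, which is the usual convention when phase is unknown.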
Chapter 2
Designing New Experiments
Designed experiments are at the very heart of scientific discovery. However, much scientific experimentation is conducted in a haphazard fashion, and many experiments are less efficient and less informative than they could be merely because of a lack of planning. Many designs have a large degree of confounding, in which the effects of two or more factors or their interactions are indistinguishable. If an experiment has too much confounding, it can lead to inconclusive or potentially misleading results. Taking a little time to properly design each experiment helps avoid confounding and leads to maximal information gain for the research costs incurred. This chapter guides you through features in JMP Genomics that help you plan efficient experiments. Our starting point is the JMP DOE (Design of Experiments) menu found in the main menu bar.
Figure 2.1: DOE Main Menu
JMP offers a wide range of DOE functionality, including classical designs. For scientific discovery and genomics purposes, we focus only on the first item in this menu: Custom Design (Figure 2.1). For in-depth details and background on all items in this menu, refer to the JMP Design of Experiments guide.
Note: Some of the terminology in the JMP Design of Experiments guide is derived from the statistical and engineering literature, which chronicles a long, rich, and successful history of highly efficient experimental designs. Many of the best designs are not widely known or utilized in genomics research, but JMP enables you to rapidly find and customize them for your laboratory’s needs.
Example: A Two-Way Design for Single Channel Instrumentation
This example uses 12 biological samples to study the effects of a chemical agent versus a chemical control. The study examines the expression of a large set of genes, proteins, or metabolites at 1 hour, 6 hours, and 24 hours after dosing the samples with the chemical. Because of the destructive nature of the expression protocol, each sample can be treated with only one chemical and observed at only one time. Expression is measured with a single-channel instrument, which excludes two-channel microarrays, considered later in this chapter. A standard two-way design is appropriate in this case.
The JMP Custom Design platform allows you to interactively create designs of any complexity, but let’s begin with this simple case.
Select DOE > Custom Design. The dialog illustrated in Figure 2.2 appears.
Figure 2.2: JMP’s Custom Design Dialog
The two main fields of this Custom Design dialog allow entry of responses and factors. Responses are the numerical measurements taken during the experiment. In genomics research, thousands of responses are collected simultaneously, so JMP Genomics has special conventions for loading large response data files. These conventions are explained later.
For now, leave this field as it is, with response variable Y.
Factors are the variables that are controlled during the experiment. They are the effects of interest. For our two-way experiment, we have two factors: Treatment and Time. Treatment has two levels, Agent and Control. Time has three levels, 1h, 6h and 24h.
To add these factors to the design, select Add Factor > Categorical > 2 Level, as shown in Figure 2.3.
Figure 2.3: Adding a categorical factor
Double-click on X1 (Figure 2.4) and change its name to Treatment.
Under the Values column, click on L1 and change it to Agent.
Press the Tab key to L2 and change it to Control.
Figure 2.4: Specifying the first factor
To add the second factor, select Add Factor > Categorical > 3 Level.
Double-click on X2 and change it to Time.
Under the Values column, click on L1 and change it to 01h.
Tab to L2, change it to 06h.
Tab to L3 and change it to 24h.
Tip: Use zero-padding to code numerically-based values with varying lengths so that alphabetical sorting order matches numerical order during later analytical processing in SAS.
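The effect of zero-padding on sort order can be seen in a short Python sketch. The helper name pad_time_label is hypothetical, and the sketch only mimics alphabetical sorting; it does not call SAS.

```python
def pad_time_label(hours, width=2):
    """Format an hour count as a zero-padded label such as '06h' (hypothetical helper)."""
    return f"{hours:0{width}d}h"


unpadded = ["1h", "6h", "24h"]
padded = [pad_time_label(h) for h in (1, 6, 24)]

# Alphabetical sorting compares character by character, so '24h' sorts
# before '6h' unless the labels are padded to a common width.
print(sorted(unpadded))  # ['1h', '24h', '6h']
print(sorted(padded))    # ['01h', '06h', '24h']
```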
Note: You could optionally define Time as a Continuous factor, if you plan to directly model linear or quadratic trends over time. For this example, we define Time as categorical in order to allow each time level to have an arbitrary mean response.
The Factors section is shown in Figure 2.5.
Figure 2.5: Specific factors
Click Continue to proceed to the next design step, shown in Figure 2.6.
Figure 2.6: JMP’s DOE Custom Design Dialog Window (part II)
There are no constraints, so skip the Define Factor Constraints section. The Model section specifies which effects the design must be able to estimate; here we want to estimate the interaction between Treatment and Time. Note the Design Generation section at the bottom of Figure 2.6: with no interaction terms specified, the default number of runs is 6.
Click Interactions > 2nd, as shown in Figure 2.7.
Figure 2.7: Adding Two-Level Interactions.
A Treatment*Time row is added to the model, as seen in Figure 2.8.
Figure 2.8: The Complete Model
In the Design Generation section (Figure 2.9), note the default number of runs has changed to 12. A run is one specific combination of factors applied to obtain one set of responses.
Figure 2.9: Design Generation
Since there are 12 samples budgeted for the runs, leave this field as is and click Make Design to generate the design shown in Figure 2.10.
Figure 2.10: Custom Design Dialog Window (part III)
Note in the Design section that the 12 runs are listed sequentially. Whenever possible, randomize the order in which you collect experimental data. This helps avoid any unwanted trends that may creep into the data over time. If you are unable to randomize the order of one or more factors, you should consider more complex designs such as Randomized Block or Split-Plot designs, described later in this chapter. To do this randomization, complete the following steps.
In the Output Options box (Figure 2.11), leave Run Order as Randomize.
Figure 2.11: The Output Options Box
Do not change the Number of Replicates, since the design already uses all 12 available samples.
Click Make Table to obtain the table shown in Figure 2.12.
Figure 2.12: Experimental Design Table
The treatment each biological sample is subjected to and the time at which each sample is observed are both listed in Figure 2.12. Note that the run order (1-12) in your table may differ from Figure 2.12 because the order is produced by a random number generator.
This is an example of a completely randomized design. The levels of both Treatment and Time are arranged in a random order. When collecting expression data on only one gene, protein, or metabolite, simply enter the data in the Y column and then analyze them directly in JMP using any number of different methods. But to work with thousands of expression measurements simultaneously, JMP Genomics requires you to construct a table like this one as a way to link the experimental design information to a collection of raw response data files, each of which contains thousands of measurements. Construction of this table, known as an Experimental Design File, requires adding two columns to this table, called File and Array, that are described more fully in Chapter 3. The File column lists the names of the raw data files containing the expression measurements corresponding to the factor levels for the run in its same row. The Array column contains a unique index for each array in the experiment.
For now, the table is ready to use in the lab to run the design in the random order specified.
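The structure of the completely randomized design above can be sketched outside JMP. The following Python example is only an illustration of the idea, not the algorithm used by the Custom Design platform: it lists every Treatment by Time combination twice to fill the 12 budgeted runs and then shuffles the run order. The function name and the copies_per_combination parameter are hypothetical.

```python
import itertools
import random

TREATMENTS = ["Agent", "Control"]
TIMES = ["01h", "06h", "24h"]


def completely_randomized_design(copies_per_combination=2, seed=None):
    """Return run dictionaries in random order (illustrative sketch only)."""
    combos = list(itertools.product(TREATMENTS, TIMES))  # 2 x 3 = 6 combinations
    runs = combos * copies_per_combination               # 6 x 2 = 12 runs
    rng = random.Random(seed)
    rng.shuffle(runs)                                    # randomize the run order
    return [
        {"Run": i, "Treatment": trt, "Time": time}
        for i, (trt, time) in enumerate(runs, start=1)
    ]


for run in completely_randomized_design(seed=1):
    print(run)
```

Passing a seed makes the random order reproducible, which is convenient when the run sheet must be regenerated later.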
Blocking Factors
Experimental designs are often difficult to conduct in a completely randomized fashion because of the presence of one or more additional factors that can induce correlation in the observed responses. In these situations, define one or more blocking factors to better control unwanted experimental variation. Examples of blocking factors include batch, animal, day of processing, technology lot number, machine, location, laboratory, technician, or operator. Blocking factors are typically considered random because they can be viewed as arising from a population of effects having a probability distribution, usually a normal distribution.
To continue the two-way design example, suppose that the 12 samples are not totally independent, but that 3 samples each were taken from 4 distinct batches. The batches could consist of any number of things, including the day of initial sample collection or the mode of processing them. In this case, it is important to control for the effect of batches on the experimental outcomes. To do this, add Batch as a blocking factor to the design. Defining such a blocking factor lets you model a correlation between samples from the same batch and provides a more accurate assessment of true batch-to-batch variability. Ignoring the batch effect when it is significant can lead to biased conclusions about expression differences.
To add Batch to the previous design, complete the following steps.
Begin a new design. Select DOE > Custom Design.
Define Treatment and Time factors, as previously described.
Click Add Factor > Blocking > 3 runs per block. Double-click on X3 and change it to Batch. The Factors section should now appear as shown in Figure 2.13.
Figure 2.13: Design with a blocking factor
Click Continue to specify which terms need to be modeled.
Click Interactions > 2nd.
Click Continue in any message windows.
Click Make Design to make the design shown in Figure 2.14.
Figure 2.14: Custom Design Dialog Window with 1 blocking factor
Note the Batch factor has four levels, with three runs for each level.
Click Make Table to obtain the table shown in Figure 2.15.
Figure 2.15: Experimental Design Table (with 1 blocking factor)
This is an example of an Incomplete Block Design. The blocks corresponding to Batch are incomplete because not all combinations of treatment and time are observed within a block; however, there is a form of partial balance in the experiment, because each unique combination of treatment and time is observed exactly twice across the whole experiment.
Tip: Good designs often have some form of balance in terms of number of treatment combinations observed. Balancing the number of factor levels helps break confounding among factors and ensures approximately equal information gain on all relevant differences.
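The two properties described for the incomplete block design (incomplete blocks, and each combination observed exactly twice overall) can be verified with a short Python sketch. The layout below is a hand-built illustration and is not the table that JMP generates in Figure 2.15; the function names are hypothetical.

```python
from collections import Counter

# Hand-built illustrative layout (an assumption, not JMP output):
# 4 batches, 3 runs per batch, 6 Treatment x Time combinations.
DESIGN = {
    1: [("Agent", "01h"), ("Agent", "06h"), ("Control", "24h")],
    2: [("Control", "01h"), ("Control", "06h"), ("Agent", "24h")],
    3: [("Agent", "01h"), ("Control", "06h"), ("Control", "24h")],
    4: [("Control", "01h"), ("Agent", "06h"), ("Agent", "24h")],
}


def combination_counts(design):
    """Count how often each Treatment x Time combination appears overall."""
    return Counter(run for runs in design.values() for run in runs)


def is_incomplete(design, n_combinations=6):
    """True if no block contains every treatment combination."""
    return all(len(set(runs)) < n_combinations for runs in design.values())


counts = combination_counts(DESIGN)
print("each combination observed twice:", set(counts.values()) == {2})
print("all blocks incomplete:", is_incomplete(DESIGN))
```

In this particular layout each batch also sees every time point once, which is one way of spreading the Time comparison evenly across batches.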
Split-Plot Designs
Continuing our two-way experiment example with factors Treatment and Time on a single-channel instrument, suppose that instead of the need for the batch blocking factor, the actual constraint is that samples need to be processed immediately after collection at the 1 hour, 6 hour, and 24 hour time points. In other words, it is not feasible to conduct the experimental runs in a completely randomized fashion; rather, they must be processed in time order. This is a situation calling for a Split-Plot Design, in which certain factors are easy to change in the lab and others are hard to change. You can easily generate a split-plot design in JMP DOE by changing values in the Changes column in the Factors section.
Begin a new design. Select DOE > Custom Design.
Define Treatment and Time factors as described previously.
In the Changes column, click Easy in the Time row.
Select Hard from the menu that appears (Figure 2.16).
Figure 2.16: Changing Time from Easy to Hard
Click Continue to define the model.
Click Interactions > 2nd in the Model section.
Click Make Design to proceed to the next step (shown in Figure 2.17).
Figure 2.17: Custom Design Dialog Window (split-plot design)
Notice the automatic creation of the Whole Plots column in the Design section.
Note: The term whole plot derives from agricultural field research where split-plot designs were originally popularized. Imagine a two-way design in a field trial in which the effects of plant variety and different fertilizers are to be studied. The fertilizers can only be applied to large sections of the field via large machinery or airplanes, but varieties can be planted in smaller sections. The split-plot design consists of dividing the field into fertilizer-level sections called Whole Plots, and the varieties are planted in subplots within each whole plot.
There are six whole plots in this design. Note that levels of Time are constant within any particular whole plot. In contrast, the Agent and Control levels of Treatment change within whole plots. Treatment is known as a subplot factor, and Time as a whole-plot factor. By their nature, split-plot designs provide more precision in estimating effects of subplot factors than they do for effects of whole-plot factors. This is perhaps intuitive given the constraints placed on the whole-plot factors. Many experimenters employ split-plot designs without realizing it when they process samples in a grouped order, but then analyze the data as if they were completely randomized. This practice can lead to badly biased conclusions, especially when the whole-plot effect is substantial. The appropriate way to analyze a split-plot design involves specifying whole plots as a random effect in the analysis, thereby modeling a correlation among measurements taken within the same whole plot.
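The whole-plot structure described above can be sketched in Python. This is an illustration under stated assumptions rather than the JMP algorithm: whole plots are listed in time order, each Time level is given two whole plots (for six in total), and only the order of Treatment within each whole plot is randomized. The function name and the plots_per_time parameter are hypothetical.

```python
import random

TREATMENTS = ["Agent", "Control"]
TIMES = ["01h", "06h", "24h"]


def split_plot_design(plots_per_time=2, seed=None):
    """Build whole plots with one Time level each and both Treatment levels inside."""
    rng = random.Random(seed)
    runs = []
    whole_plot = 0
    for time in TIMES:                      # whole-plot factor: hard to change
        for _ in range(plots_per_time):
            whole_plot += 1
            order = TREATMENTS[:]           # subplot factor: easy to change
            rng.shuffle(order)              # randomize only within the whole plot
            for trt in order:
                runs.append(
                    {"WholePlot": whole_plot, "Time": time, "Treatment": trt}
                )
    return runs


for run in split_plot_design(seed=1):
    print(run)
```

The WholePlot column produced here plays the role of the Whole Plots column that JMP creates, and it is the variable that would be declared as a random effect in the analysis.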
Two-Channel Microarrays
Two-channel microarrays are characterized by the fact that two measurements for each gene are obtained from each microarray. This is because two different samples are tagged with different dyes, competitively hybridized to one array, and then measured under two different laser frequencies. This technology therefore offers an additional layer of complexity for experimental design beyond the one-channel designs described previously.
Several papers have discussed different two-channel design options in detail, including Kerr and Churchill (2001) and Dobbin and Simon (2002). Arguably the most popular design is the Reference Sample Design, in which a common reference sample (typically a pool of samples that is not of direct experimental interest) is tagged with one dye and hybridized on every array, while the various treated samples are tagged with the other dye. This design is easy to set up and effectively reduces design considerations to the single channel that is changing. However, the reference sample design can be two to four times less efficient than designs that hybridize samples of interest directly together on microarrays. The keys to higher efficiency are to pair samples together on arrays in a way that optimizes experimental interests and then to make sure the analysis of the data is conducted appropriately.
The previous discussion of blocking factors and split-plot designs has direct bearing here. If we narrow our focus to all the data from a single gene, and assume there is only one spot for that gene on each array, then the data come in pairs corresponding to the two measurements from each array. Each array can therefore be considered as a block of size two. Alternatively, in a split-plot scenario where certain factors are hard to change, or where you desire more precise information on some factors than on others, you can consider arrays to be whole plots and assign certain factors to change within whole plots (subplot factors) and others to stay constant on the whole plots (whole-plot factors).
Example: Split-Plot Design for Two-Channel Microarrays
Here we use the Drosophila aging experiment described by Jin et al. (Jin, Riley et al. 2001) to illustrate experimental design options for two-channel microarrays. A subset of these data is included with your JMP Genomics installation and is described in Chapter 1 of this manual.
This design has three experimental factors with two levels each: Age (1 week, 6 weeks), Sex (Female, Male), and Line (Oregon, Samarkand).
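As a quick check on the size of this design, the treatment combinations can be enumerated directly. The following Python sketch is only an illustration outside of JMP; it lists every combination of one level per factor.

```python
from itertools import product

# Factors and levels as stated for the Drosophila aging experiment.
factors = {
    "Age": ["1 week", "6 weeks"],
    "Sex": ["Female", "Male"],
    "Line": ["Oregon", "Samarkand"],
}

# One level per factor in every possible way: 2 * 2 * 2 = 8 combinations.
combinations = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for combo in combinations:
    print(combo)
print(len(combinations), "treatment combinations")
```

Three factors with two levels each give eight distinct treatment combinations.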
Note: For higher-level factorial arrangements, experimental design experts often use exponential notation as a shorthand description. The Drosophila example would be called a 2^3 design, which designates 3 factors with 2 levels each.
The primary experimental factor of interest is Age, and for this experiment it was desirable to obtain more precise information on the effects of Age at the expense of the Sex and Line effects. The latter two are still included to provide a higher degree of generalization for conclusions. These considerations call for a split-plot design.
To create a split-plot design for this example,
Click DOE > Custom Design.
Define the three categorical factors and a fourth factor indicating the Channel. Specify Sex and Line as Hard in the Changes column, and leave Age and Channel as Easy. The completed dialog should look like the one in Figure 2.18.
Figure 2.18: The Factors have been defined
Click Continue.
There are 24 arrays available for experimentation, so in the Design Generation section (Figure 2.19), specify 24 in the Number of Whole Plots box.
Figure 2.19: The Design Generation box
Click Make Design and then Make Table to generate a table like the one partially shown in Figure 2.20.
Figure 2.20: A portion of the Experimental Design Table
Note how Age and Channel change within whole plots, whereas Sex and Line stay constant for each whole plot. To convert this table to a valid JMP Genomics Experimental Design File (EDF), change the name of the Whole Plot column to Array by double-clicking on the column header and typing in Array as the new column name. Also, delete the Y column, since it will be replaced by a column named File. See Chapter 3 for specific instructions on building EDFs. To compare this design with the original design in Jin et al. (Jin, Riley et al. 2001), open the file AgingExperimentTable.txt located in the Sample Data folder. Note the run order and randomization schemes are different, but the designs are similar in terms of their split-plot structure.
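The table edits described above (rename the whole-plot column to Array, delete Y, add File) can also be sketched outside of JMP. The following Python example is a hedged illustration only: the two-row table excerpt, the exported column name Whole Plot, and the file naming pattern are assumptions, not output from JMP.

```python
import csv
import io

# Hypothetical two-row excerpt of a design table exported as tab-delimited text.
DESIGN_TEXT = (
    "Whole Plot\tSex\tLine\tAge\tChannel\tY\n"
    "1\tFemale\tOregon\t1 week\tCy3\t.\n"
    "1\tFemale\tOregon\t6 weeks\tCy5\t.\n"
)

def design_table_to_edf(design_text, file_pattern="array{array}.dat"):
    """Rename 'Whole Plot' to 'Array', drop 'Y', and add a 'File' column."""
    reader = csv.DictReader(io.StringIO(design_text), delimiter="\t")
    edf_rows = []
    for row in reader:
        edf_row = {"Array": row["Whole Plot"]}
        for name, value in row.items():
            if name not in ("Whole Plot", "Y"):
                edf_row[name] = value
        # The file naming pattern is an assumption made for this sketch.
        edf_row["File"] = file_pattern.format(array=edf_row["Array"])
        edf_rows.append(edf_row)
    return edf_rows

def write_tab_delimited(rows):
    """Serialize the rows as tab-delimited text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

edf_rows = design_table_to_edf(DESIGN_TEXT)
print(write_tab_delimited(edf_rows))
```

A complete EDF needs further columns, such as ColumnName, as described in Chapter 3.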
Example: Randomized Block Design for Two-Channel Microarrays
Suppose that instead of the split-plot design just considered, equal information about the Age, Sex, and Line factors is needed and they need to be randomly allocated to the arrays in a randomized block design. A somewhat different approach in JMP illustrates a few more of its features.
Click DOE > Custom Design.
Define Dye as a 2-level categorical factor and Array as a 2-runs-per-block blocking factor as shown in Figure 2.21.
Figure 2.21: The completed Factors panel
Click Continue.
In the Design Generation section (Figure 2.22), enter 48 runs.
Figure 2.22: The Design Generation box
Click Make Design and then Make Table to generate a table like the one shown in Figure 2.23.
Figure 2.23: The Experimental Design Table
This table establishes the static portion of the design and ensures that Cy3 and Cy5 always appear once in each array.
Make sure this table is the active JMP table, and then open a new Custom Design window with DOE > Custom Design.
In the Factors section, click Add Factors > Covariate, select Dye, and click OK, as shown in Figure 2.24.
Figure 2.24: Selecting the first covariate
Click Add Factors > Covariate again, select Array, and click OK to generate the Factors section shown in Figure 2.25.
Figure 2.25: Both covariates have been selected.
Note: JMP considers a Covariate to be a factor describing fixed characteristics of the samples that do not change. Also note the levels of the two covariates Dye and Array are automatically read from the active JMP table because it is a previously created JMP table. In addition to loading factors from an active JMP table, you can save and load factors by clicking on the small red triangle beside Custom Design.
Next, define the three experimental factors Age, Sex, and Line. Since all three of these factors have two levels, they can be added to the design at the same time.
Type a 3 into the Add N Factors box, and then click Add Factor > Categorical > 2 Level. This creates three new rows in the Factors section.
Change each row to match the Factors section shown in Figure 2.26.
Figure 2.26: Completed Factors dialog
Click Continue.
In the Factors section, highlight Age, Sex, and Line.
In the Model section, select Interactions > 3rd.
This produces a Model section like the one shown in Figure 2.27.
Figure 2.27: Model Section with all factors and interactions defined
Click Make Design and then Make Table to create the final design, shown in Figure 2.28.
Figure 2.28: The Randomized Block Design
This Randomized Block Design allocates 2 of the 8 possible treatment combinations to each array.
Note: The previous design is also known as a kind of loop design (Kerr and Churchill 2001), and is illustrated in Figure 2.29. The term loop derives from the fact that the design can be depicted as a closed loop of nodes, each node indicating a sample treated with one particular experimental factor combination. Aliquots of RNA from each sample are labeled with either the Cy3 (green) or Cy5 (red) fluorescent dye. Two labeling reactions are required for each sample. Pairs of alternately labeled samples are pooled and hybridized to identical arrays. Each spot is probed with each sample, labeled with either dye, allowing the experimenter to control for confounding biases resulting from either dye or array effects.
Figure 2.29: Loop design for 4 experimental conditions (diagram: Sample 1 through Sample 4 as nodes, with each connection between nodes labeled Cy5 and Cy3)
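The pairing pattern of a loop design can be sketched in a few lines of Python. This is a minimal illustration, not a JMP feature; the function name and the order in which samples are paired are assumptions and may differ from the arrangement drawn in Figure 2.29.

```python
def loop_design(samples):
    """Return one (array, cy5_sample, cy3_sample) tuple per array.

    Array i pairs sample i (Cy5) with the next sample in the loop (Cy3);
    the last sample is paired with the first to close the loop.
    """
    n = len(samples)
    return [(i + 1, samples[i], samples[(i + 1) % n]) for i in range(n)]

design = loop_design(["Sample 1", "Sample 2", "Sample 3", "Sample 4"])
for array, cy5, cy3 in design:
    print(f"Array {array}: Cy5 = {cy5}, Cy3 = {cy3}")
```

In this arrangement every sample is labeled once with each dye, which corresponds to the two labeling reactions per sample mentioned in the note.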
Microarrays with Three or More Channels
With microarrays having three or more channels, the previous discussion for two-channel designs can be extended. For incomplete block designs, set the number of runs per block equal to the number of channels and set up the other factors as usual. For split-plot designs, set the number of whole plots equal to the number of budgeted arrays.
Microarrays with More than One Spot per Gene on Each Array
Some microarrays, often those manufactured in your local lab, have multiple spots per gene on the array. Such two-color arrays pose no additional concerns from a new experimental design perspective because the samples are applied to the entire array. However, the existence of multiple spots does make a difference during subsequent data analysis, when random effects caused by such things as the nesting of identical spots within an array or differences in dye effects among multiple arrays should be considered, in addition to the usual Array random effect.
Choosing the Overall Number of Runs in a Design
Selecting the number of runs in a design is always a tradeoff between the cost of the experiment and the desired information gain, precision, or power. The latter can be difficult to quantify, considering that tens of thousands of genes or proteins are measured simultaneously. One rule of thumb is to use three biological replicates for each distinct combination of factors. A biological replicate is a biologically unique sample from the population of samples considered for experimentation. This is to be distinguished from a technical replicate, which is a repetitive measurement from biological material already
used in a previous run. Biological replicates tend to be much more variable than technical replicates, but they also provide the best means to make appropriate conclusions about the population of interest.
A more statistical concept for evaluating size of designs is degrees of freedom for error. This represents the fraction of the data that is used to estimate noise instead of signal. It is computed by subtracting the total number of factor combinations from the total number of runs. Another rule of thumb requires at least 10 degrees of freedom for error in the design in order to be able to obtain an accurate estimate of noise and accompanying standard errors for effect differences.
A rigorous statistical approach for determining the number of replicates in a design is to use sample size and power calculations. These require some prior knowledge about anticipated magnitudes of effect sizes as well as desired false positive rates. Some common methods are available under DOE > Sample Size and Power, and a few advanced ones are under Genomics > Power and Sample Size. Refer to the JMP Design of Experiments guide for additional information.
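The degrees-of-freedom rule of thumb is simple arithmetic, sketched here in Python. The function name is hypothetical, and the calculation follows the rule exactly as stated above (total runs minus factor combinations), assuming every combination receives the same number of replicates.

```python
def error_degrees_of_freedom(levels_per_factor, replicates):
    """Return (total runs, degrees of freedom for error) for a replicated factorial."""
    combinations = 1
    for levels in levels_per_factor:
        combinations *= levels
    runs = combinations * replicates
    return runs, runs - combinations

# Three 2-level factors with three biological replicates per combination.
runs, df_error = error_degrees_of_freedom([2, 2, 2], replicates=3)
print(runs, df_error)  # 24 16

# With only two replicates, the design falls below the 10-df rule of thumb.
runs_2, df_error_2 = error_degrees_of_freedom([2, 2, 2], replicates=2)
print(runs_2, df_error_2)  # 16 8
```

For designs with blocking or split-plot structure, the degrees of freedom available for each effect differ from this simple count, so treat the result as a first approximation.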
Chapter 3
Creating Data Sets for Analysis in JMP Genomics
Congratulations! A completed experiment has yielded many data files. Each file consists of hundreds or thousands of rows and columns filled with numbers. Now what? Fortunately, JMP Genomics is available to help you analyze your large and complex data sets, extracting the maximum amount of information from them. Before analyzing your data, however, you must convert the raw files into a readable format. This chapter demonstrates how to prepare data for analysis. Recall from Chapter 1 that, instead of using standard JMP data files, JMP Genomics uses SAS data sets. JMP Genomics provides several commands to create SAS data sets from raw genomics data files, such as text files, Excel spreadsheets, or data from various types of special instruments. These SAS data sets serve as inputs to other JMP Genomics processes. Nearly all JMP Genomics processes generate more SAS data sets as outputs, which then serve as inputs to more processes. This framework provides considerable flexibility for statistical workflows. Make sure to organize and name your SAS data sets in a clear way to avoid confusion. The examples in this chapter demonstrate the processes for creating SAS data sets using JMP Genomics. Before we get to those examples, we should review and clarify some aspects of SAS data sets, particularly as they relate to JMP Genomics.
A Few Words about SAS Data Sets and JMP Genomics
SAS data sets have the extension .sas7bdat. We recommend you associate the extension .sas7bdat with JMP (Control Panel > Folder Options > File Types) so that double-clicking on any .sas7bdat file opens it in JMP as a JMP table. JMP can then produce its native graphics and analyses, in addition to those created by JMP Genomics dialogs. To save a JMP table as a SAS data set, change the File Type in the Save As dialog. Alternatively, you may use the File > Save As SAS Data Set command.
JMP Genomics Requires Specific Types of Data Sets
Many of the processes in JMP Genomics (especially those used for microarray and proteomic analyses) require the specification of two separate SAS input data sets:
1. an input data set in tall format (the tall and wide data formats are defined later in this section), and
2. an appropriate Experimental Design Data Set (EDDS). An EDDS is a SAS data set that provides information about the columns of the tall data set. It describes relevant experimental variables such as treatment conditions and covariates, as well as a variable named ColumnName. Entries in the ColumnName column must exactly match the column names in the input data set. Experimental design data sets have certain constraints that must be followed for the processes to run successfully.
To create these data sets, first construct a third type of file, the Experimental Design File (EDF). An EDF is used to import various kinds of data into JMP Genomics and is a precursor to an EDDS. The EDF is normally saved as a comma separated values (.csv) file, tab delimited text (.txt) file, or Microsoft Excel (.xls) spreadsheet rather than as a SAS data set. A typical JMP data file (.jmp) does not work as an EDF. When designing a new experiment from scratch, refer to Chapter 2 on how to use JMP’s DOE (Design of Experiments) functionality to create an optimal design. After creating a design, one or more columns are usually added to the table to make a valid EDF. Then use the JMP File menu to save it as a text or Excel table.
Note: The advantage to using an EDF is having all of the experimental variables collected in one table that can be reused or modified as needed. An EDF is an excellent way to consolidate, store, and share the critical factors in an experiment, rather than trying to attach them to the raw data manually or adding them into the names of the raw data files. Since an EDF can be used to record corresponding experimental factors of a microarray experiment, it is good practice to construct it during the initial planning of your experiments.
Note: Many of the processes used for genetic analyses make use of wide data sets and do not require an EDDS.
Tall and Wide Data Sets
Most of the processes in JMP Genomics assume that the input SAS data set has a particular data structure. JMP Genomics distinguishes between tall and wide SAS data sets. A tall SAS data set has samples as columns and molecular entities (such as markers, genes, clones, proteins, or metabolites) as rows. A wide SAS data set is the transpose of a tall data set, having the samples as rows and molecular entities as columns. When specifying the input SAS data set for a process, it is important to know the required form. Most of the processes associated with genetic analyses require a wide structure, whereas most of those for microarray and proteomics analyses use a tall structure. The Transpose Tall and Wide and Transpose Rectangular processes under the Data Set Utilities menu transform SAS data sets between tall and wide forms. The use of these commands is discussed in more detail in Chapter 4.
Terminology
The columns in a SAS data set are called variables, and the rows are called observations. This terminology is used frequently in JMP Genomics dialogs and this documentation.
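The relationship between the tall and wide forms can be shown on a toy table. The following Python sketch only illustrates the two layouts; it is not the Transpose Tall and Wide process, and the gene and sample names are made up for the example.

```python
def transpose_tall_to_wide(tall_rows, id_column):
    """Turn rows of molecular entities (tall) into rows of samples (wide).

    tall_rows: list of dicts, one per molecular entity, with one key per sample
    plus an identifier column named by id_column.
    """
    sample_names = [name for name in tall_rows[0] if name != id_column]
    wide_rows = []
    for sample in sample_names:
        wide_row = {"Sample": sample}
        for row in tall_rows:
            wide_row[row[id_column]] = row[sample]   # entity becomes a column
        wide_rows.append(wide_row)
    return wide_rows

# Tall form: one row per gene, one column per sample (hypothetical values).
tall = [
    {"Gene": "geneA", "m_1": 7.1, "m_2": 6.9},
    {"Gene": "geneB", "m_1": 5.2, "m_2": 5.8},
]

wide = transpose_tall_to_wide(tall, id_column="Gene")
for row in wide:
    print(row)
```

In the tall table each sample is a variable, whereas in the wide table each sample is an observation.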
Annotation Data Sets
In addition to an experimental design data set, many JMP Genomics processes also optionally accept an annotation data set. This is a SAS data set containing biological or chemical properties corresponding to the molecular entities in the experiment. Annotation data sets can correspond to either tall or wide data sets. For tall data sets, annotation data sets must share one or more merge key variables with the tall data set so that the two data sets can be joined at run time. For wide data sets, an assumption on the order of the variables is usually in effect.
Annotation data sets are typically created by opening an appropriate text or Excel table in JMP, removing any undesired columns, and then saving it as a SAS data set (with extension .sas7bdat) using the Save As menu. However, if the column names in the data set contain special characters (-, *, #, for example), the columns may be truncated. This problem can be avoided by using the File > Save As SAS Data Set command. Annotation data sets provided by Affymetrix or other suppliers, typically as .txt or .csv files, must first be imported into JMP using the Genomics > Data Set Creation > Text > Import Individual Text, CSV or Excel Files process to convert the .txt, .csv, or Excel file to a .sas7bdat file. See Chapter 11 for more information on Annotation Data Sets.
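The role of a merge key variable can be illustrated with a small join. This Python sketch is an assumption-laden illustration of the idea, not the join JMP Genomics performs at run time; the key name ProbeID and all values are hypothetical.

```python
def merge_annotation(tall_rows, annotation_rows, key):
    """Attach annotation columns to each tall-data row that shares the key value."""
    lookup = {row[key]: row for row in annotation_rows}
    merged = []
    for row in tall_rows:
        extra = lookup.get(row[key], {})   # rows without a match stay unannotated
        annotation = {name: value for name, value in extra.items() if name != key}
        merged.append({**row, **annotation})
    return merged

tall = [
    {"ProbeID": "p1", "m_1": 7.1},
    {"ProbeID": "p2", "m_1": 5.2},
]
annotation = [{"ProbeID": "p1", "GeneSymbol": "geneA"}]

merged = merge_annotation(tall, annotation, key="ProbeID")
for row in merged:
    print(row)
```

A row whose key has no match in the annotation data set simply carries no annotation columns in this sketch.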
Creating the Input Data Sets
There are numerous ways to create input data sets and the EDDSs needed for analysis by JMP Genomics. How you decide which method to use depends on the form in which your raw data is stored, the availability of design files that describe the organization of your experiment and the data, the complexity of your experiment, and the number and types of analyses you plan to conduct. Table 3.1 lists possible scenarios for creating the needed data sets, depending upon the types of files you start with.
Table 3.1: Recommended Procedures for Creating the Needed Data Sets
What you have: Raw data files (device-specific format) and a design file
Recommended procedure:
1. Convert the design file to an EDF (join multiple files if needed).
2. Generate the EDDS and SAS data set using the device-specific import engine under Import.
What you have: Raw data files only (device-specific format)
Recommended procedure:
1. Create an EDF using the Experimental Design File Builder.
2. Generate the EDDS and SAS data set using the device-specific import engine under Import.
What you have: Raw data files only (.txt, .csv, .xls, .sas7bdat)
Recommended procedure:
1. Create an EDF using the Experimental Design File Builder.
2. Generate the EDDS and SAS data set by running the Import a Designed Experiment from Text, CSV, or Excel Files process under Import > Text.
What you have: Raw data file only (one file, in tall form)
Recommended procedure:
1. Read the raw data file into JMP and then save it as a .sas7bdat file using either the File > Save As or the File > Save As SAS Data Set command. The JMP User Guide provides instructions for importing data from .txt, .csv, and .xls files.
2. Run the Experimental Design Data Set Builder process, under Experimental Design, on the newly created .sas7bdat file to create the EDDS.
What you have: Raw data file only (one file, in wide form)
Recommended procedure:
1. Read the raw data file into JMP and then save it as a .sas7bdat file using either the File > Save As or the File > Save As SAS Data Set command. For processes that require a tall data set, run the Transpose Tall and Wide process, under Data Set Utilities, to convert the data set from a wide to a tall form and to generate the EDDS.
2. Run the Experimental Design Data Set Builder process, under Experimental Design, on your newly created .sas7bdat file to create the EDDS.
Note: For processes that do not require an EDDS, you can import the data using the Import Individual Text, CSV or Excel Files command.
Later, this chapter includes several examples of how to create these data sets using the procedures listed in Table 3.1.
The Experimental Design File
Recall the ArrayTrack example from Chapter 1. In this example, we created an input data set, an EDDS, and an Annotation Data Set using parameters specified by an EDF from the Sample Data folder that was installed with JMP Genomics. In most cases, an EDF must be created before you conduct further analyses. EDFs for JMP Genomics must adhere to the following conventions:
1. The first row of the file must contain column header names. The second and subsequent rows must contain data with no blank rows.
2. One column must have the header name Array, Chip, or Spectrum. An optional second column must be named Channel or Dye. The data entries in these two columns must uniquely identify the rows of the file. The Create Array Index process (under Genomics > Experimental Design) generates this column, if needed.
3. One column must have the header name File or FileName. The entries in this column must contain the names of the raw data files that are associated with each row. The Check File Names process (under Genomics > Experimental Design) helps you to check the accuracy of the file names.
4. One column must have the header name ColumnName. The entries in this column must correspond to valid SAS variable names in the tall data set that is associated with this experimental design. The Create ColumnName process (under Genomics > Experimental Design) can generate this column.
5. When raw data files have more than one raw data column, a column named Intensity is required. The names of the columns in the raw data files are listed in this column.
6. When raw data files have a column corresponding to a background signal to be subtracted from the specified Intensity column, include a column named Background. The entries in this column contain the names of the columns in the raw data file that correspond to the background columns.
7. To input other columns, which are shared for all the raw files, such as coordinates of molecular entities on arrays, you may include columns named _X_varname in your EDF, where varname is the name assigned to these columns in the tall data set you are creating. The entries in this column contain the names of the columns in the raw data file that correspond to the extra data.
8. You may include an arbitrary number of additional columns corresponding to such things as treatment, dose, time, or any other experimental variable or covariate of interest. Do not use any of the names described in conventions 2-7 above for these additional columns.
9. The file must be in one of the following formats: tab-delimited with .txt extension, comma-delimited with .csv extension, Microsoft Excel with .xls extension, or a SAS data set, with .sas7bdat extension.
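Several of these conventions can be checked mechanically before an EDF is handed to an import process. The following Python sketch is a hypothetical checker, not one of the JMP Genomics processes named above. It covers only conventions 1 through 4 for delimited text, assumes header names are matched exactly, and assumes the usual SAS naming rule (a letter or underscore followed by letters, digits, or underscores, 32 characters at most).

```python
import csv
import io
import re

ARRAY_NAMES = {"Array", "Chip", "Spectrum"}
CHANNEL_NAMES = {"Channel", "Dye"}
FILE_NAMES = {"File", "FileName"}
# Assumed SAS variable-name rule: letter or underscore first, 32 characters at most.
SAS_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,31}$")

def check_edf(text, delimiter="\t"):
    """Return a list of problems found in EDF text; an empty list means none found."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    problems = []
    header, data = rows[0], rows[1:]

    # Convention 1: no blank rows after the header.
    if any(not any(cell.strip() for cell in row) for row in data):
        problems.append("blank row found")
    records = [dict(zip(header, row)) for row in data if any(c.strip() for c in row)]

    # Convention 2: one Array/Chip/Spectrum column; with Channel/Dye it identifies rows.
    array_cols = [h for h in header if h in ARRAY_NAMES]
    if len(array_cols) != 1:
        problems.append("need exactly one Array, Chip, or Spectrum column")
    key_cols = array_cols + [h for h in header if h in CHANNEL_NAMES]
    keys = [tuple(r.get(c, "") for c in key_cols) for r in records]
    if key_cols and len(set(keys)) != len(keys):
        problems.append("Array and Channel entries do not uniquely identify rows")

    # Convention 3: a File or FileName column.
    if not any(h in FILE_NAMES for h in header):
        problems.append("need a File or FileName column")

    # Convention 4: a ColumnName column holding valid SAS variable names.
    if "ColumnName" not in header:
        problems.append("need a ColumnName column")
    else:
        for r in records:
            name = r.get("ColumnName", "")
            if not SAS_NAME.match(name):
                problems.append(f"invalid SAS variable name: {name!r}")
    return problems

good = "Array\tFile\tColumnName\tTreatment\n1\ta.cel\tm_1\tcontrol\n2\tb.cel\tm_2\tagent\n"
bad = "Array\tFile\tColumnName\n1\ta.cel\tm_1\n1\tb.cel\t2bad\n"
print(check_edf(good))
print(check_edf(bad))
```

The second table is flagged twice: its Array values repeat without a Channel or Dye column to distinguish them, and 2bad starts with a digit.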
EDFs may be built in a variety of ways. The simplest method assumes you have a file that identifies individual raw data files along with the experimental conditions, such as treatment, dosage, time, cell line, animal, sex, age, etc., under which they were generated. Such a file may be created using JMP’s
DOE capabilities, as discussed in Chapter 2. This file is read into JMP and modified such that it functions as an EDF.
Note: If the design information is spread across separate tables, use the Tables > Join command to merge the tables to create the design file. Consult the JMP User Guide for specific instructions on merging tables.
Alternatively, JMP Genomics includes a tool called the Experimental Design File Builder (under Genomics > Experimental Design) that you can use to create a new EDF. Let’s use the Affymetrix Latin Square data set, contained in the Sample Data Folder included with JMP Genomics and described in Chapter 1, as an example to demonstrate both methods.
Converting an Existing Design File into an EDF
This example uses an Excel file called DesignTable.xls that contains information identifying specific raw data files with the experiments from which the data in each file was generated.
Select File > Open.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Specify the file type as an Excel file in the Open Data File box.
Open the DesignTable.xls file.
The file is imported to a JMP table, as shown in Figure 3.1.
Figure 3.1: The Experimental Design File
Note that this table contains the three required elements for an Experimental Design File:
1. An Array column listing the individual array used for each experiment,
2. A File column listing the names of the specific raw data files for each experiment, and
3. A ColumnName column listing the column names within those files that contain the relevant data.
As such, this file can serve as an EDF. There is an additional Experiment column specifying the individual experiment from which the data in each row was collected. Presumably, the experimenter is aware of the variables (treatment, dosage, time, etc.) for each experiment. However, additional columns could be added to the table to specify additional information. To add columns, complete the following steps.
Select Cols > New Column.
Specify the name and characteristics for the new column.
Fill in the contents of the new column, either by typing the information into each cell, or by using JMP’s Tables > Join command to merge this table with another containing the information.
Repeat for each new column that you add.
Select File > Save As to save the EDF. Be sure to specify one of the acceptable file types.
Note: You should use the File > Save As SAS Data Set command to save the file as a .sas7bdat file if the file’s column names contain special characters.
Building a New EDF
To construct an EDF from scratch, use the Experimental Design File Builder command.
Select Genomics > Experimental Design > Experimental Design File Builder. The dialog shown in Figure 3.2 appears.
Figure 3.2: The EDF Builder
Click Choose to specify the folder containing the raw data files. For this example,
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Open the CEL folder and click Select.
The Affymetrix Latin Square folder, which contains the raw data files, is specified in the Experimental Design File Builder dialog as shown in Figure 3.3.
Figure 3.3: The folder containing the raw data files has been selected.
To view only the relevant .cel files, complete the following step.
Select .cel in the File Filter Expression box.
The File Filter Expression box appears as shown in Figure 3.4.
Figure 3.4: The File Filter Expression box
Because the probes were labeled with one dye,
Make sure that 1 channel is selected, as shown in Figure 3.5.
Figure 3.5
Finally, specify a name for the EDF and the folder in which to save the file.
Type AffyLatinSquare_Design in the Output File Name box.
To specify the output folder, complete the following steps.
Click on Choose.
Navigate to ProcessResults.
Open the ProcessResults folder and click Select to select this folder.
The Experimental Design File Builder dialog appears like the one shown in Figure 3.6.
Figure 3.6: The Completed Dialog
Click Run to generate the EDF.
The EDF is shown in Figure 3.7.
Figure 3.7: The Experimental Design File
Compare the EDF illustrated in Figure 3.7 with the EDF displayed in Figure 3.1. Aside from a difference in the column order and the presence of the optional Experiment column in Figure 3.1, the two files are the same.
Note: The ColumnName column is empty. The appropriate experimental data can be entered either by typing the data directly in the column or by defining specific SAS code in the Options tab of the Experimental Design File Builder dialog. Refer to the SAS 9.1.3 User’s Guide (http://support.sas.com/onlinedoc/913/docMainpage.jsp) for additional information.
To enter the data using SAS code, complete the following steps.
Click on the Experimental Design File Builder dialog to reactivate the dialog.
Click on the Options tab.
Type the following SAS code between the parentheses of the %str() SAS macro.
length Experiment $ 1;
Experiment = substr(file, 5, 1);
if Experiment in ("n","o","p") then Experiment = "m";
else if Experiment in ("r","s","t") then Experiment = "q";
ColumnName = Experiment || "_" || trim(left(Array));
Click Run to generate the EDF.
The modified EDF is shown in Figure 3.8.
Figure 3.8: A portion of the modified EDF
Note: Except for the values of ColumnName, the modified EDF is identical to the EDF shown in Figure 3.1.
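For readers less familiar with SAS, the statements entered in the Options tab can be restated in Python. This is only an explanatory sketch, not something JMP Genomics runs: the function name is hypothetical and the file names are illustrative. It takes the fifth character of the file name as the experiment code, collapses n, o, and p into m and r, s, and t into q, and joins the code to the array number with an underscore.

```python
def derive_columns(file_name, array):
    """Mirror the SAS statements: return (Experiment, ColumnName)."""
    experiment = file_name[4]              # SAS substr(file, 5, 1) is 1-based
    if experiment in ("n", "o", "p"):
        experiment = "m"
    elif experiment in ("r", "s", "t"):
        experiment = "q"
    # SAS trim(left(Array)) removes surrounding blanks from the array value.
    column_name = f"{experiment}_{str(array).strip()}"
    return experiment, column_name

# Illustrative file names; the fifth character carries the experiment code.
for file_name, array in [("1521a99hpp_av06.CEL", 1), ("1521n99hpp_av06.CEL", 5)]:
    print(file_name, derive_columns(file_name, array))
```

Note that the position is counted from 1 in SAS and from 0 in Python, which is why the index 4 appears here.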
Select File > Save As to save the EDF. Be sure to specify one of the acceptable file types (.xls, .csv, .txt).
Note: You should use the File > Save As SAS Data Set command to save the file as a .sas7bdat file if the file’s column names contain special characters.
Additional Tools for Creating an EDF
Also available are processes called Create Array Index, Create ColumnName, and Check File Names (under Genomics > Experimental Design). Each works with an open JMP table to help you transform it into a valid EDF. Once you have a complete EDF created in a JMP table, save it as a .txt or .xls file for use as input for one of the Import processes.
Creating Both the Experimental Design Data Set (EDDS) and the SAS Data Set with an EDF
After creating an appropriate EDF, specify it as one of the input parameters in a data-specific process from the Import submenu.
Using a Device-Specific Data Import Engine
Recall the ArrayTrack example from Chapter 1. In this example, we created both an input data set and EDDS using the parameters specified by a sample EDF. The output of this process usually consists of two SAS data sets, one containing the raw data in tall form, and the other a corresponding EDDS. In the following example, we create corresponding data sets using the sample Affymetrix Latin Square data set and the newly created AffyLatinSquare_Design EDF.
Select Genomics > Import > Affymetrix > Affymetrix Expression CEL, as shown in Figure 3.9.
Figure 3.9: Opening the Affymetrix Expression CEL Import Engine
This opens the dialog shown in Figure 3.10.
Figure 3.10: The Affymetrix Input Engine dialog
To select the Experimental Design File included with JMP Genomics,
Click Choose.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Select the DesignTable.txt file and click Open.
To select the folder containing the raw data files,
Click Choose to specify the folder containing the raw data files.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Open the CEL folder and click Select.
A special file, known as the Chip Description File (CDF), must be specified. This file contains information to associate individual probes (extracted from the CEL file) with the corresponding probe set. CDFs are standard files, unique for each chip, and are provided for download by Affymetrix. To select the folder containing the CDF file for this data set,
Click Choose.
Navigate to Sample Data > Microarray.
Open the Affymetrix Latin Square folder and click Select.
Now specify where to save the SAS data set and EDDS.
Click Choose.
Navigate to ProcessResults.
Open the ProcessResults folder and click Select.
The dialog should appear as shown in Figure 3.11.
Figure 3.11: The Affymetrix Expression CEL/CHP Import Engine (II)
Click Run to generate the data sets.
As discussed in Chapter 1, JMP Genomics dialogs generate and run a SAS program each time you click the Run button. Depending on the size of your data sets and the capacities of your computer, some processes can take several minutes or, for very large and complex runs, several hours. While a program is running, the message SAS Connected is displayed in the JMP status bar located in the lower left corner of your JMP window (see Figure 1.10). The Windows Task Manager shows a process named sas.exe, and tracks its CPU and I/O activity. Alternatively, monitor the SAS temporary working directory and the Output Folder for results as they are created.
The SAS data sets generated by this process are listed in a SAS Message dialog (Figure 3.12).
Figure 3.12: The SAS Message dialog
Click Open for each of the data sets to examine their contents and structures.
Using the Import a Designed Experiment from Text, CSV, or Excel Files Command
If the data is stored in a generic (.txt, .csv, .xls, or .sas7bdat) format, build the input data set and EDDS using the Import a Designed Experiment from Text, CSV, or Excel Files command. This example uses data from the Drosophila aging experiment described in Chapter 1.
Select Genomics > Import > Text > Import a Designed Experiment from Text, CSV, or Excel Files, as shown in Figure 3.13.
Figure 3.13: Selecting the Import a Designed Experiment from Text, CSV, or Excel Files Command
The Import a Designed Experiment from Text, CSV or Excel Files dialog opens, as shown in Figure 3.14.
Figure 3.14: The Import a Designed Experiment from Text, CSV or Excel Files dialog
To select the Experimental Design File, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Scanalyze Drosophila.
Select the AgingExperimentTable.txt file and click Open to select the file.
To select the folder containing the raw data files, complete the following steps.
Click Choose to specify the folder containing the raw data files.
Navigate to Sample Data > Microarray > Scanalyze Drosophila.
Open the Scanalyze Drosophila folder and click Select.
If files do not end with the .csv, .sas7bdat, .txt, or .xls extension, specify their file type. The raw data files for the Drosophila Aging experiment, used for this example, end with .dat. These are tab delimited files.
Select Tab Delimited from the Data File Type drop-down menu, as shown in Figure 3.15.
Figure 3.15
The first row of a tall SAS data set always lists the name of the variable or column.
Enter 1 in the Row Number of Variable Names box, as shown in Figure 3.16.
Figure 3.16
The first seven rows in each of the raw data files contain information about the samples. Data entries begin in row 9. Specify 9 in the Data Start Row box, as shown in Figure 3.17.
Figure 3.17
ID Variables are required. For this example, the variable being measured is the intensity of the spots on the microarray.
Type Spot in the ID Variables box, as shown in Figure 3.18.
Figure 3.18
Finally, select a location to save the SAS data set and the EDDS.
Navigate to ProcessResults.
Open the ProcessResults folder and click Select.
The completed dialog appears as shown in Figure 3.19.
Figure 3.19: The completed Import a Designed Experiment from Text, CSV or Excel Files dialog
Click Run to generate the data sets.
The locations of the output data sets generated by this process are listed in a SAS Message dialog, as shown in Figure 3.20.
Figure 3.20: The SAS Message dialog
Click Open for each of the data sets to examine their contents and structures.
Creating the Input Data Set and EDDS from a Single, Tall Data File
Suppose all experimental data are assembled into one Excel spreadsheet like the one illustrated in Figure 3.21.
Figure 3.21: An Excel spreadsheet containing data from the Drosophila aging experiment
In this case, the data set is already in tall form, so a SAS input data set and a corresponding experimental design data set are all that are needed. Create the two data sets separately using the following steps.
For the input data set:
Select File > Open to open an Open Data File dialog.
Navigate to Sample Data > Microarray > Scanalyze Drosophila.
Select Excel Files (*.xls) from the Files of type drop-down menu.
Select the drosophilaaging.xls file and click Open to select the file.
The file opens as a JMP table, as shown in Figure 3.22.
Figure 3.22: A portion of the JMP data table containing data from the Drosophila aging experiment
Select File > Save As SAS Data Set to open the Save As SAS Data Set dialog.
Type drosophilaaging_tall as the name of the output data set.
Choose the ProcessResults folder as the save destination.
Click Save to save the file.
For the EDDS:
Select Genomics > Experimental Design > Experimental Design Data Set Builder.
The dialog shown in Figure 3.23 opens.
Figure 3.23: The EDDS Builder dialog
Follow these steps to select your converted file as the Input Data Set.
Click Choose to select the Input Data Set.
Navigate to ProcessResults.
Select the drosophilaaging_tall.sas7bdat file.
Click Open to select the file.
Examine Figure 3.22. Note that all of the columns except for Spot contain raw data. To select the columns containing raw data, complete the following steps.
Hold the Ctrl key down while clicking on all of the columns listed in the Available Variables box except for Spot, as shown in Figure 3.24. Do not select Spot.
Figure 3.24: Selecting the variables containing raw data
Specify the ProcessResults folder as the output folder.
The completed General tab of the dialog should look like the one illustrated in Figure 3.25.
Figure 3.25: The completed General tab of the EDDS Builder dialog
As described in Chapter 1, the data in this example came from an experiment comparing the effects of age, sex and line on Drosophila gene expression. The data in each of the columns in the raw data file describes channel-specific results from one combination of those experimental conditions. To create additional columns in the EDDS to further describe these conditions, complete the following steps.
Click on the Options tab.
Type the following SAS code in the SAS Code to Create New Design Variables field to create five additional columns: Line, Sex, Age, Channel, and Array.
Line = scan(columnname,1,"_");
Sex = scan(columnname,2,"_");
Age = scan(columnname,3,"_");
Channel = scan(columnname,4,"_");
Array = scan(columnname,5,"_");
The SAS Code to Create New Design Variables field should appear as shown in Figure 3.26.
Figure 3.26: The SAS Code to Create New Design Variables field
Note: Each new column identifies one of the conditions in the experiment. Each column is specified on its own line. The name of each new column is specified on the left side of the equals sign, while the position within the original column name that describes the condition is defined on the right side of the equals sign. Refer to the SAS 9.1.3 User’s Guide for additional information on writing SAS syntax.
Make no other changes to the tab.
Click the Run button to generate the EDDS.
The new EDDS opens, as shown in Figure 3.27.
Figure 3.27: A portion of the EDDS
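The scan statements typed into the SAS Code to Create New Design Variables field can also be tried on their own in a short DATA step. The following is a minimal sketch only; the column name Ore_F_1wk_Cy3_A01 and the data set name design_demo are hypothetical and are not taken from the Drosophila sample data.

```sas
/* Sketch: split one underscore-delimited column name into design variables. */
/* The value assigned to columnname is a made-up example.                    */
data design_demo;
   length columnname $32 Line Sex Age Channel Array $8;
   columnname = "Ore_F_1wk_Cy3_A01";
   Line    = scan(columnname, 1, "_");   /* first piece:  Ore */
   Sex     = scan(columnname, 2, "_");   /* second piece: F   */
   Age     = scan(columnname, 3, "_");   /* third piece:  1wk */
   Channel = scan(columnname, 4, "_");   /* fourth piece: Cy3 */
   Array   = scan(columnname, 5, "_");   /* fifth piece:  A01 */
run;

proc print data=design_demo noobs;
run;
```

The second argument of scan is the position of the piece to extract, and the third argument is the delimiter that separates the pieces.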
Now That You Have Your Data Sets
Keep a tall data set and its corresponding EDDS together in subsequent processes that call for them. If needed, pair the same experimental design data set with updated versions of the input data, such as those created by processes in the Normalization submenu. You can also create subsets of the original data set by deleting rows from the EDDS and saving the result under a new name, to concentrate the focus of your analysis. Tall data sets and EDDSs can also be mixed and matched, depending on your analysis needs. These procedures are discussed in greater detail in later chapters.
Chapter 4: Data Set Utilities
The Data Set Utilities menu provides a collection of processes for managing and modifying SAS data sets. These utilities can be used at any point during your JMP Genomics session. The utilities are divided into four main sections:
• Column Utilities
• Joins and Transpositions
• Statistics and Transforms
• Export
These are shown in Figure 4.1.
Figure 4.1: The Data Set Utilities menu
The purpose of this chapter is to provide descriptions and examples for these commands. Note that a similar set of utilities for JMP tables is available under the Tables menu.
4 Data Set Utilities 50
Column Utilities
The Column Utilities group offers analytical procedures and manipulations frequently used in genomic analyses that:
• display detailed contents about the columns and structures of a SAS data set,
• change the lengths, labels, names, or order of SAS data columns (also known as SAS variables).
Column Contents
The Column Contents command displays the contents of a SAS data set in .html format.
Select Genomics > Data Set Utilities > Column Contents.
The dialog shown in Figure 4.2 opens.
Figure 4.2: The Data Contents dialog
Click Load.
Select the default settings for the AffymetrixLatinSquareExample.
Click OK to bring up the Column Contents dialog shown in Figure 4.3.
Figure 4.3: The completed Column Contents dialog
In the Print Options field, specify whether to print all the data or a subset of the data. In this example, only the first 100 observations are displayed.
Click Run.
JMP displays the results in a series of tables (Figure 4.4).
Figure 4.4: The output of the Column Contents process
Each output table, shown sequentially in the frame on the right, is identified in the Table of Contents column. You can specify certain print options to selectively print all or part of the data set. For further information, see the SAS documentation for the CONTENTS and PRINT procedures.
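For readers who want to see what the CONTENTS and PRINT procedures report, the following is a minimal sketch that runs them on a small made-up data set. The data set work.demo and its variables are hypothetical, and the Column Contents process may call these procedures with different options.

```sas
/* Sketch: describe a small example data set and list part of it. */
data work.demo;
   input Probe_Set_ID $ Array1 Array2;
   datalines;
P001 7.1 7.4
P002 5.3 5.0
P003 9.8 9.6
;
run;

/* Report the variable names, types, lengths, and labels. */
proc contents data=work.demo;
run;

/* List only the first two observations, similar to limiting the print options. */
proc print data=work.demo(obs=2);
run;
```

The obs= data set option plays the same role as printing a subset of the data rather than all of it.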
Change Labels
The Change Labels command lets you modify multiple column labels by writing simple SAS syntax. This command is particularly useful if you want to change the labels of multiple columns in multiple data sets. This example changes the labels of two columns, Unit No and Probe No, in the affylatin.sas7bdat data set included in the Sample Data folder. The original file is shown in Figure 4.5.
Figure 4.5: The original affylatin.sas7bdat data set
Select Genomics > Data Set Utilities > Change Labels.
The dialog shown in Figure 4.6 opens.
Figure 4.6: The Change Labels dialog
Click Load.
Select the default settings for the AffymetrixLatinSquareExample.
Click OK to bring up the Change Labels dialog shown in Figure 4.7.
Figure 4.7: The Completed Change Labels dialog
To remove labels from any number of columns, click on the variable name and then click to add the name to the Remove Labels from these Variables box.
Do not select any column labels to remove.
Specify the SAS syntax for multiple new labels in the New Label Specifications box, as shown in Figure 4.8.
Figure 4.8: The completed New Label Specifications box (callouts: Variable Name, New Label)
Note: In this example, the new syntax has already been entered. The name for each column is on the left side of the equals sign and the new label for each column is on the right side of the equals sign and is contained in quotes. The changes for each column must be entered on a separate line. In your own analyses, specify both the name and the location of the relabeled output file. However, since this is an example, proceed without changing the default specifications.
Do not change either the name or location of the output folder.
Click Run to relabel the columns.
The location of the relabeled data set generated by this process is listed in a SAS Message dialog shown in Figure 4.9.
Figure 4.9: The SAS Message dialog
Click Open to examine the relabeled file.
The relabeled file appears as shown in Figure 4.10.
Figure 4.10: The relabeled data set
Compare the original and modified table labels. Note that the Unit No and Probe No labels in the original table were changed to Affy Internal Unit No. and Probe Sequential No., respectively, in the new table.
Note: Recall that variables in a SAS data set can have both labels and names. A variable must have a variable name that conforms to certain conventions. Data labels are less stringent and can be any ASCII text. JMP automatically uses SAS variable labels as column names.
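The name-equals-quoted-label syntax used in the New Label Specifications box matches the form of the SAS LABEL statement. The following is a minimal sketch on a made-up data set. The variable names Unit_No and Probe_No are assumptions (the actual variable names in affylatin.sas7bdat may differ), and the Change Labels process may apply the labels in a different way.

```sas
/* Sketch: relabel two variables without changing their names or values. */
data work.label_demo;
   input Unit_No Probe_No;
   label Unit_No  = "Unit No"
         Probe_No = "Probe No";
   datalines;
1 10
2 20
;
run;

/* Each line pairs a variable name (left) with its new quoted label (right). */
proc datasets library=work nolist;
   modify label_demo;
   label Unit_No  = "Affy Internal Unit No."
         Probe_No = "Probe Sequential No.";
quit;

/* The LABEL option prints labels instead of variable names as headings. */
proc print data=work.label_demo label noobs;
run;
```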
Change Lengths
The Change Lengths command shortens the lengths of variables in a SAS data set to save space. This command is used only for character variables; it does not change the lengths of numeric variables. This example changes the length of the variables in the Probe Set ID column, in the affylatin.sas7bdat data set included in the Sample Data folder. A portion of the original file is shown in Figure 4.5.
Select Genomics > Data Set Utilities > Change Lengths. The dialog shown in Figure 4.11 opens.
Figure 4.11: The Data Length dialog
Click Load.
Select the default settings for the AffymetrixLatinSquareExample and click OK.
Uncheck the Minimize Lengths of Selected Variables box.
Change the default setting in the New Length for Variables Select above [0, 64] box from 16 to 2, as shown in Figure 4.12.
Figure 4.12
Click Run to change the length of the selected variable.
The location of the modified data set generated by this process is listed in a SAS Message dialog (shown in Figure 4.13).
Figure 4.13: The SAS Message dialog
Click Open to examine the modified file.
The modified file appears as shown in Figure 4.14. Compare the length of variables in the Probe Set ID column in the modified data set with those in the original data set (Figure 4.5).
Figure 4.14: The modified data set.
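In plain SAS code, the length of a character variable can be shortened by declaring the new length before the data set is read. The following is a minimal sketch of that general technique on made-up data; the data set names and values are hypothetical, and the Change Lengths process may work differently internally.

```sas
/* Sketch: shorten a character variable from 16 bytes to 4 bytes. */
data work.len_demo;
   length Probe_Set_ID $16;
   input Probe_Set_ID $;
   datalines;
AFFX-001
AFFX-002
;
run;

data work.len_demo_short;
   length Probe_Set_ID $4;   /* declared before SET, so the new length applies */
   set work.len_demo;        /* values longer than 4 characters are truncated  */
run;

proc contents data=work.len_demo_short;
run;
```

SAS writes a warning to the log about multiple lengths for the variable, which is a reminder that shortening a variable too far truncates its values.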
Rename
The names of the columns in the input data set were initially established by the ColumnName variables in the Experimental Design table. The Rename process systematically changes the names of the columns in the input data table and the corresponding values in the Experimental Design Data Set (EDDS). This example changes the column names in the input data set from the Drosophila aging experiment described in Chapter 1. Portions of the original input data set and EDDS are shown in Figures 4.15 and 4.16, respectively.
Figure 4.15: The Drosophila Aging Input Data Set
Figure 4.16: The Drosophila Aging EDDS
Select Genomics > Data Set Utilities > Rename. The dialog shown in Figure 4.17 opens.
Figure 4.17: The Data Rename dialog
Click Load.
Select the default settings for the DrosophilaAgingExample and click OK.
This dialog lets you select, from the list of available variables, the variable whose values are the current column names. To select this variable, click the desired variable, then click to add it to the Variable Containing Current Column Names box, as shown in Figure 4.18.
Figure 4.18: Selecting the variable
Because CurrColumnName is already selected, complete the following step.
Do not change the default setting for the Variable Containing Current Column Names box.
The dialog also lets you select the variable whose values are used for the new column names. To select this variable, click the desired variable, then click to add it to the Variable Containing New Column Names box, as shown in Figure 4.18. Because ColumnName is already selected, complete the following step.
Do not change the default setting for the Variable Containing New Column Names box.
Click Run to rename the columns.
The location of the data sets generated by this process is listed in a SAS Message dialog, shown in Figure 4.19.
Figure 4.19: The SAS Message dialog
Note: By leaving the Output Data Set and Output Experimental Design Data Set boxes blank in the Data Rename dialog, the file names do not change, except that the abbreviation _drn is appended to each of the output file names.
Click Open next to each file to examine the output files. The first listed file (Figure 4.20) is the output data set with new column names.
Figure 4.20: The output data set
The second listed file (Figure 4.21) is the output EDDS that excludes the OldColumnName column.
Figure 4.21: The output EDDS
Compare the input (Figures 4.15 and 4.16) and output (Figures 4.20 and 4.21) data sets to see the results of the Rename command.
Reorder
The Reorder command sorts the columns according to the order of the values in the ColumnName variable in the EDDS. This example changes the column order in the input data set from the Drosophila aging experiment described in Chapter 1. A portion of the original input data set is shown in Figure 4.15.
Select Genomics > Data Set Utilities > Reorder. The dialog shown in Figure 4.22 opens.
Figure 4.22: The Data Reorder dialog
Click Load.
Select the default settings for the DrosophilaAgingExample and click OK.
Click Run to reorder the columns in the input data set.
The location of the modified data set generated by this process is listed in a SAS Message dialog (shown in Figure 4.23).
Figure 4.23: The SAS Message dialog
Click Open to examine the reordered data set.
The reordered file appears as shown in Figure 4.24.
Figure 4.24: The reordered data set
Compare the original (Figure 4.15) and reordered (Figure 4.24) data sets to see the different column order from the Reorder command.
Joins and Transpositions
The Joins and Transpositions section contains utilities to append, merge, and transpose SAS data sets.
Append
The Append command appends two SAS tables together, end-to-end. This example appends two tables with identical column labels from the Drosophila aging experiment included in the Sample Data folder.
Select Genomics > Data Set Utilities > Append. The dialog shown in Figure 4.25 opens.
Figure 4.25: The Data Append dialog
Click Load.
Select the default settings for the DrosophilaAgingExample and click OK.
The default Base Input Data Set is the Drosophila input data shown in Figure 4.15 and the Append Input Data Set is the normalized Drosophila input data set. Each table has 100 rows.
Click the Run button to append the data sets. The location of the appended data set generated by this process is listed in a SAS Message dialog.
Click the Open button to examine the appended data set.
The appended table (shown in Figure 4.26) has the same number of columns as either of the input data sets, but twice the number of rows (circled).
Figure 4.26: The appended data set
To append two tables with different column labels, check the Force Append checkbox in the Data Append dialog (see Figure 4.27). This forces two tables with different column labels to append together using the base input data set column labels.
Figure 4.27: The Force Append checkbox.
This example appends tables with different column labels.
Select Genomics > Data Set Utilities > Append.
Follow these steps to select the Affymetrix Latin Square Experimental Design File as the Base Input Data Set.
Click Choose to select the Base Input Data Set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin_exp.sas7bdat file and click Open to select the file.
This file is shown in Figure 4.28. Note that there are 59 rows (circled) in this table.
Figure 4.28: The Base Input Data Set
Follow these steps to select the Drosophila Aging Experimental Design File as the Append Input Data Set.
Click Choose to select the Append Input Data Set.
Navigate to Sample Data > MicroArray > Scanalyze Drosophila.
Select the drosophilaaging_exp.sas7bdat file and click Open to select the file.
This file is partially shown in Figure 4.29. Note that there are 48 rows (circled) in this table.
Figure 4.29: The Append Input Data Set
Check the Force Append checkbox (as shown in Figure 4.27).
To select the Output Folder, complete the following steps.
Click Choose to select the Output folder.
Navigate to the Genomics folder.
Select the ProcessResults folder and click Open to select the folder.
Click Select to select the folder.
Click Run to append the data sets. The location of the appended data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the appended data set (partially shown in Figure 4.30).
Figure 4.30: The appended data set (callouts: From Base Input Data Set, From Append Input Data Set)
Compare the appended data set with both input data sets to see the results of the Append command. The 48 rows from the append input data set are added after the 59 rows from the base input data set, for a total of 107 rows (circled). The appended table has the same labels as the Affymetrix Latin Square experimental design file. The columns that are common to both tables (Array, file, and ColumnName) are filled in with their respective values. Note that the Variable Length parameter is retained from the base input data set. The columns that are present in the base input data set but absent in the append input data set (for example, Experiment) are retained in the concatenated table. However, values for these columns are missing in the appended rows. The columns that are absent in the base input data set but present in the append input data set (for example, Sex and Line) are not retained in the concatenated table.
Note: Inverting the roles of the two input data sets results in the table shown in Figure 4.31.
Figure 4.31: The appended data set (inverted) (callouts: From Base Input Data Set, From Append Input Data Set)
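The forced append behavior described above closely resembles the FORCE option of the SAS APPEND procedure. The following is a minimal sketch on two small made-up tables; the data set names and values are hypothetical, and the Append process itself may be implemented differently (for example, it writes a new output data set, whereas PROC APPEND updates the base data set in place).

```sas
/* Sketch: append a table whose columns only partly match the base table. */
data work.base;
   input Array ColumnName $ Experiment $;
   datalines;
1 A1 Latin
2 A2 Latin
;
run;

data work.extra;
   input Array ColumnName $ Sex $;
   datalines;
3 B1 F
4 B2 M
;
run;

/* FORCE lets the append proceed even though Sex is not in the base table. */
proc append base=work.base data=work.extra force;
run;

proc print data=work.base noobs;
run;
```

After the append, work.base has four rows. Experiment is missing in the two appended rows, and Sex is dropped because it does not exist in the base table.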
Merge
The Merge command joins two tables, side-by-side, with matching row variables. This example merges the annotation data set for the Drosophila aging experiment with the input data set for this experiment. Recall from Chapter 3 that annotation data sets contain specific biological or chemical information for each row of a tall data set.
Select Genomics > Data Set Utilities > Merge.
Follow these steps to select the Drosophila Aging Annotation Data Set as the Base Input Data Set.
Click Choose to select the Base Input Data Set.
Navigate to Sample Data > MicroArray > Scanalyze Drosophila.
Select the drosophila_annotation.sas7bdat file and click Open to select the file.
This file is shown in Figure 4.32. Note that there are five columns and 3933 rows.
Figure 4.32: The Drosophila Aging Annotation File
The variables available in this data set are listed in the Available Variables box (Figure 4.33).
Select Spot.
Click to add Spot to the Key Variables from Base Input Data Set box.
Figure 4.33: Selecting the key variable from the Base Input Data Set
Follow these steps to select the Drosophila Aging Input Data Set as the Merge Input Data Set.
Click Choose to select the Merge Input Data Set.
Navigate to Sample Data > MicroArray > Scanalyze Drosophila.
Select the drosophilaaging.sas7bdat file and click Open to select the file.
This file is shown in Figure 4.34. Note that there are 49 columns and 100 rows.
Figure 4.34: The Drosophila Aging Data Set
The variables available in this data set are listed in the Available Variables box (Figure 4.35).
Select Spot.
Click to add Spot to the Corresponding Key Variables from Merge Input Data Set box.
Figure 4.35: Selecting the Key variable from the Merge Input Data Set
Specify an output folder.
Click Run to merge the data sets. The location of the merged data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the merged data set (shown in Figure 4.36).
Figure 4.36: The merged data set (callouts: From Base Input Data Set, From Merge Input Data Set, Common Identifiers)
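A side-by-side merge on a key variable such as Spot can be sketched in a SAS DATA step as follows. The data sets and values are made up for illustration, and keeping only the rows found in both tables is an assumption chosen to mirror this example; the Merge process may offer other behaviors.

```sas
/* Sketch: join an annotation table and an expression table on Spot. */
data work.annot;
   input Spot Gene $;
   datalines;
1 geneA
2 geneB
3 geneC
4 geneD
;
run;

data work.expr;
   input Spot Intensity;
   datalines;
2 8.1
4 6.7
;
run;

/* MERGE with BY requires both inputs to be sorted by the key variable. */
proc sort data=work.annot; by Spot; run;
proc sort data=work.expr;  by Spot; run;

data work.merged;
   merge work.annot(in=inAnnot) work.expr(in=inExpr);
   by Spot;
   if inAnnot and inExpr;   /* keep only rows present in both tables */
run;

proc print data=work.merged noobs;
run;
```

The result has two rows (Spot 2 and Spot 4) and all three columns: Spot, Gene, and Intensity.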
Compare the merged data set with both input data sets to see the results of the Merge command.
Note: Only the 100 rows common to both of the input data sets are found in the merged data set. However, all of the columns present in either of the input data sets are in the merged data set.
Transpose Tall and Wide
The Transpose Tall and Wide command converts a tall data set into a wide data set or vice versa (see Chapter 3 for more detail about the tall and wide formats). This example transforms the Affymetrix Latin Square Input Data Set and its accompanying EDDS from the tall format to the wide format. A portion of the tall input data set is shown in Figure 4.37.
Figure 4.37: The Affymetrix Latin Square Input Data Set (tall)
Note that there are 59 data columns and 1604 data rows.
Select Genomics > Data Set Utilities > Transpose Tall and Wide. The Transpose Tall and Wide dialog opens, as shown in Figure 4.38.
Figure 4.38: The Data Transpose Dialog
Note that there are two tabs in the dialog. Because you are transposing a tall data set into a wide data set,
Make sure that the Tall -> Wide tab is selected.
To select the Affymetrix Latin Square Input Data Set as the Base Input Data Set, complete the following steps.
Click Choose to select the input tall data set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin.sas7bdat file and click Open to select the file.
In this example, there is no need to specify either the variables or prefixes for wide column names. To select the Affymetrix Latin Square EDDS as the EDDS, complete the following steps.
Click Choose to select the EDDS.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin_exp.sas7bdat file and click Open to select the file.
Specify an output folder.
Click Run to transpose the data sets.
The location of the transposed data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the transposed data set.
The transposed data set with wide data format is shown in Figure 4.39.
Figure 4.39: The transposed data set
Compare the transposed (wide) data set (Figure 4.39) with the original tall data set (Figure 4.37). Note that the data has been transposed; there are now 1604 data columns and 59 data rows. In addition, the _wid abbreviation has been added to the transposed file name. A SAS data set in wide format can be transposed into a data set in tall format in a similar manner by selecting the Wide -> Tall tab.
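The tall-to-wide conversion can be illustrated with the SAS TRANSPOSE procedure. The following is a minimal sketch on a made-up table with two probes and three arrays; the names are hypothetical, and the Transpose Tall and Wide process also handles the EDDS, which this sketch does not.

```sas
/* Sketch: turn a tall table (one row per probe) into a wide table (one row per array). */
data work.tall;
   input Probe $ Array1 Array2 Array3;
   datalines;
p1 1.0 1.1 1.2
p2 2.0 2.1 2.2
;
run;

proc transpose data=work.tall out=work.wide name=ColumnName;
   id Probe;                  /* probe identifiers become the new column names */
   var Array1 Array2 Array3;  /* each array column becomes one row             */
run;

proc print data=work.wide noobs;
run;
```

The tall table has two rows and three data columns; the wide table has three rows (Array1, Array2, Array3) and two data columns (p1, p2).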
Transpose Rectangular
The Transpose Rectangular command creates a new SAS data set by transposing a block, or subset, of variables in a SAS data set. The variables (columns) become observations (rows) and observations become variables.
Select Genomics > Data Set Utilities > Transpose Rectangular. The Data Transpose Rectangular dialog opens.
Click Load.
Select the default settings for the AffymetrixLatinSquareExample and click OK.
The completed dialog appears as shown in Figure 4.40.
Figure 4.40: The Data Transpose Rectangular dialog
The input data set is from the Affymetrix Latin Square Example included with JMP Genomics. A portion of it is shown in Figure 4.41.
Figure 4.41: The input data set
Do not change any of the default settings in the dialog.
Click Run to transpose the data set.
The location of the transposed data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the transposed data set.
The transposed data set with the wide data format is partially shown in Figure 4.42.
Figure 4.42: A portion of the transposed data set
Compare the transposed (wide) data set (Figure 4.42) with the original tall data set (Figure 4.41) to see the transposed data. In addition, the identifiers listed in the Probe_Set_ID column in the input data set have been separated into two columns (Array and Treatment), as specified in the SAS Code parameter pane in the dialog.
Unstack
The Unstack command transposes a stacked data set into a tall data set and an EDDS. A stacked data set has the variables of interest stacked into a single column. This example converts a stacked data set (Figure 4.43) to a tall data set and an EDDS.
Figure 4.43: A portion of the stacked data set
Typically, stacked data sets contain a smaller number of columns when compared with the number of rows (circled in Figure 4.43). Note the repetitiveness in the ChipID, Experiment and Series columns.
Select Genomics > Data Set Utilities > Unstack. The Data Unstack dialog opens, as shown in Figure 4.44.
Figure 4.44: The Data Unstack dialog
To select the Affymetrix Latin Square Stacked Data Set as the Input Data Set, complete the following steps.
Click the Choose button to select the Input Data Set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin_stack.sas7bdat file and click Open to select the file.
The variable names for the data set are listed in the Available Variables box (shown in Figure 4.45).
Figure 4.45: Variables
To unstack the data set, first specify the variables containing the numerical data, the variables to transpose by, and the variables to make up the columns in the new data set. The criteria and procedures for this process are outlined in the following sections.
The Response Variable is the variable that contains the actual numeric data to be transposed. In this example, the data is in the log2i column in the input data set.
Click log2i.
Click to add log2i to the Response Variable box.
JMP Genomics uses the unique levels in the Row Variables to form the rows in the output tall data set. To select the Row Variables,
Click Unit.
Click to add Unit to the Row Variables box.
Repeat for AffyID and Probe.
JMP Genomics uses the unique combinations of levels in the Column Variables to form the columns in the output tall data set. These levels must not overlap the Row Variables. To select the Column Variables, complete the following steps.
Click ChipID.
Click to add ChipID to the Column Variables box.
The Array Variable identifies the array, chip or spectrum in the input data set. This variable is typically also identified as a Column Variable. To select the Array Variable, complete the following steps.
Click ChipID.
Click to add ChipID to the Array Variable box.
The Channel Variable identifies the channel or dye column in the input data set. Since this is a one-channel experiment,
Leave the Channel Variable box blank.
Because the values in the Series column offer no valuable information, complete the following steps to drop this column from the output tall data set.
Click Series.
Click to add Series to the Drop Variables box.
To specify a prefix for the names of the Response columns in the output tall data set, complete the following steps.
Type Chip_ in the Prefix for Column Names in Tall Data Set box.
Specify an Output Folder.
The completed dialog should appear like the one shown in Figure 4.46.
Figure 4.46: The completed dialog
Click Run to transpose the data set. The location of the transposed data set and EDDS generated by this process is listed in a SAS Message dialog (shown in Figure 4.9).
Click Open to examine the output data set (shown in Figure 4.47).
Figure 4.47: The output, tall data set
Compare the stacked, input data set (Figure 4.43) with the tall, output data set (Figure 4.47) to see the results of the unstack process. Note that the output data are grouped by probe, chip, and Affymetrix ID. In addition, the output data set has more columns but many fewer rows.
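The roles of the Response, Row, and Column Variables and the column-name prefix can be illustrated with the SAS TRANSPOSE procedure. The following is a minimal sketch on a made-up stacked table with a single row variable; the data are hypothetical, and the Unstack process also builds an EDDS, which this sketch does not.

```sas
/* Sketch: unstack one response column into one column per chip. */
data work.stacked;
   input Probe $ ChipID log2i;
   datalines;
p1 1 7.2
p1 2 7.5
p2 1 5.1
p2 2 5.4
;
run;

/* BY processing requires the data to be sorted by the row variable. */
proc sort data=work.stacked;
   by Probe;
run;

proc transpose data=work.stacked out=work.tall(drop=_name_) prefix=Chip_;
   by Probe;     /* row variable: one output row per probe          */
   id ChipID;    /* column variable: one output column per chip     */
   var log2i;    /* response variable: the values being rearranged  */
run;

proc print data=work.tall noobs;
run;
```

The four stacked rows become two rows, one per probe, with the columns Chip_1 and Chip_2 holding the log2i values.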
Statistics and Transforms
The Statistics and Transforms section includes Data Step, Merge and Transform, Rank Rows, Sort Rows, Statistics for Columns, Statistics for Rows, and Transform.
Data Step
The Data Step command modifies a SAS data set by executing SAS Data Step commands on the data set. You must be familiar with SAS programming to use this utility. The SAS language has an array of statements and functions to perform a vast number of manipulations of a SAS data set. Refer to the DATA STEP documentation for further details. This documentation is available at http://support.sas.com/documentation.
Merge and Transform
The Data Merge and Transform command merges two SAS data sets that share a common set of variables, uses SAS syntax to compute an arbitrary function of each pair of variables having the same name, and generates an output data set consisting of a transformed merge of the two input data sets. You must be familiar with SAS programming to use this utility. Refer to the Base SAS documentation for further details. This documentation is available at http://support.sas.com/documentation.
Rank Rows
The Rank Rows command creates a new table in which each observation within each of the variables in the data set is replaced by that observation’s numerical ranking. This example ranks the responses of the 100 genes observed in the Drosophila aging experiment to the different experimental conditions.
Select Genomics > Data Set Utilities > Rank Rows. The Data Rank dialog opens, as shown in Figure 4.48.
Figure 4.48: The Data Rank dialog
To select the Drosophila Aging Input Data Set as the Input Data Set, complete the following steps.
Click Choose to select the Input Data Set.
Navigate to Sample Data > MicroArray > Scanalyze Drosophila.
Select the drosophilaaging.sas7bdat file and click Open to select the file.
This file is shown in Figure 4.34. The variable names for the data set are listed in the Available Variables box (shown in Figure 4.49).
Select all of the available variables except for Spot.
Click to add the variables to the Rank Variable box.
Figure 4.49: Selecting the variables to rank
The Advanced tab allows specification of the rank order, rank method, and method for handling ties. You may also add new variable names.
Specify an Output Folder.
Click Run to rank the data set.
The location of the ranked data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the ranked data set (shown in Figure 4.50).
Figure 4.50: The ranked data set
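The column-wise ranking performed by Rank Rows can be sketched in a few lines of Python. This is only an illustration, not the code the process runs: the table values and column names are hypothetical, and ascending order with mean ranks for ties is an assumption (the Advanced tab lets you choose the order and tie method).

```python
def rank_column(values):
    """Return 1-based ascending ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def rank_table(table):
    """Rank every column of a dict-of-lists table independently."""
    return {name: rank_column(column) for name, column in table.items()}


# Hypothetical values for four genes under two conditions.
table = {"cond_a": [2.5, 0.3, 1.1, 1.1], "cond_b": [10.0, 40.0, 30.0, 20.0]}
ranked = rank_table(table)
```

Each column is ranked on its own, so a gene can hold rank 1 in one condition and rank 100 in another.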
Compare the input data set (Figure 4.34) with the ranked, output data set (Figure 4.50) to see that the observed values in each of the columns have been replaced with the ranks (from 1 to 100) for the observations within each column.
Sort Rows
The Sort Rows command sorts a data set’s rows by the values in one or more columns. This example sorts the data from the Drosophila aging experiment according to age, line, and sex.
Select Genomics > Data Set Utilities > Sort Rows. The Data Sort dialog opens, as shown in Figure 4.51.
Figure 4.51: The Data Sort dialog
To select the Drosophila Aging EDDS as the Input Data Set, complete the following steps.
Click Choose to select the Input Data Set.
Navigate to Sample Data > MicroArray > Scanalyze Drosophila.
Select the drosophilaaging_exp.sas7bdat file and click Open to select the file. A portion of this file is shown in Figure 4.52.
Figure 4.52: The Drosophila Aging Experiment EDDS
The variable names for the data set are listed in the Available Variables box, shown in Figure 4.53. The output table contains the variables according to the order that you enter in the Sort Variables box. In the following example, the sort order is the same as the order in Sort Variables: age, line and then sex.
Select Age, Line, and Sex.
Click to add the variables to the Sort Variable box.
Figure 4.53: Selecting the variables to sort.
Specify an Output Folder.
Click Run to sort the data set.
The location of the sorted data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the sorted data set, shown in Figure 4.54.
Figure 4.54: The sorted data set
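The effect of listing several sort variables can be illustrated with a small Python sketch. The rows and their values below are hypothetical; only the variable names age, line, and sex come from the example.

```python
# Hypothetical rows of an experimental design data set.
rows = [
    {"array": "a1", "age": 6, "line": "Ore", "sex": "M"},
    {"array": "a2", "age": 1, "line": "Sam", "sex": "F"},
    {"array": "a3", "age": 1, "line": "Ore", "sex": "M"},
    {"array": "a4", "age": 1, "line": "Ore", "sex": "F"},
]

# The key tuple follows the order of the sort variables: ties on age are
# broken by line, and ties on both age and line are broken by sex.
sort_variables = ("age", "line", "sex")
sorted_rows = sorted(rows, key=lambda row: tuple(row[v] for v in sort_variables))
sorted_arrays = [row["array"] for row in sorted_rows]
```

Changing the order of the names in `sort_variables` changes which column dominates the ordering, which mirrors the role of the order in the Sort Variables box.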
Note that the rows are sorted first by age, then by line, and then by sex.
Statistics for Columns
The Statistics for Columns command calculates a variety of statistics for the columns in a SAS data set. This example calculates the mean, median, standard deviation, minimum, and maximum for each column and probe set in the Affymetrix Latin Square data set.
Select Genomics > Data Set Utilities > Statistics for Columns.
The Statistics for Columns dialog opens, as shown in Figure 4.55.
Figure 4.55: The Statistics for Columns dialog
To select the Affymetrix Latin Square Data Set as the Input Data Set, follow these steps.
Click Choose to select the Input Data Set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin.sas7bdat file and click Open to select the file.
The variable names for the data set are listed in the Available Variables box (shown in Figure 4.56). To select the variables to be summarized, complete the following steps.
Select all of the available variables from a_01 through q_59.
Click to add the variables to the Variables to be Summarized box.
Figure 4.56: Selecting the variables to be summarized
To calculate the statistics for all the rows in the columns,
Leave the Variables by Which to Summarize box blank.
Specify an output folder.
Click on the Options tab to select the statistics to run.
Hold the Ctrl key down while clicking Max, Mean, Median, Min, and StdDev.
Click the Run button to summarize the data set.
The location of the output data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the output data set, partially shown in Figure 4.57, which lists the statistics for each column.
Figure 4.57: The summarized data set
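As a rough illustration of what the five selected statistics mean for each column, the following Python sketch summarizes a small table. The values are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption about how StdDev is defined.

```python
import statistics


def column_statistics(table):
    """Summarize each column of a dict-of-lists table with five statistics."""
    summary = {}
    for name, values in table.items():
        summary[name] = {
            "Max": max(values),
            "Mean": statistics.mean(values),
            "Median": statistics.median(values),
            "Min": min(values),
            "StdDev": statistics.stdev(values),  # sample standard deviation (assumed)
        }
    return summary


# Hypothetical intensities for two chips; the column names mimic the example.
table = {"a_01": [1.0, 2.0, 3.0, 4.0], "a_02": [2.0, 2.0, 2.0, 2.0]}
summary = column_statistics(table)
```

The result has one entry per input column, which corresponds to leaving the Variables by Which to Summarize box blank.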
Statistics for Rows
The Statistics for Rows command computes row-wise statistics for a data set. This example computes the standard deviation and standard error for each row in the Affymetrix Latin Square data set and displays the results based on a condition.
Select Genomics > Data Set Utilities > Statistics for Rows.
The Data Row Statistics dialog opens, as shown in Figure 4.58.
Figure 4.58: The Data Row Statistics dialog
To select the Affymetrix Latin Square Data Set as the Input Data Set, complete the following steps.
Click Choose to select the Input Data Set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin.sas7bdat file and click Open to select the file.
The variable names for the data set are listed in the Available Variables box, shown in Figure 4.59. To select the variables to be summarized,
Select all of the available variables from a_01 through q_59.
Click to add the variables to the Variables to be Summarized box.
Figure 4.59: Selecting the variables to be summarized
Specify an output folder.
Click the Statistics tab.
Select STD and STDERR as the statistics to compute, as shown in Figure 4.60.
Figure 4.60: Selecting the statistics
To filter rows based on these statistics, you specify the filtering condition in the Options tab. For example, to eliminate any rows for which the standard deviation value is greater than 2,
Click Options.
Specify the filtering condition by typing the SAS syntax STD>2, as shown in Figure 4.61.
Figure 4.61
Type Affylatin_std2 as the name of the output data set.
Click Run to summarize the data set.
The location of the output data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the output data set, partially shown in Figure 4.62.
Figure 4.62: The summarized data set
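The row-wise statistics and the STD>2 condition can be sketched in Python as follows. The probe names and values are hypothetical, STD is taken to be the sample standard deviation and STDERR the standard deviation divided by the square root of the number of values (both assumptions), and the sketch only identifies which rows satisfy the condition; how matching rows are treated in the output is determined by the dialog.

```python
import math
import statistics


def row_statistics(values):
    """Return the standard deviation and standard error of one row."""
    std = statistics.stdev(values)  # sample standard deviation (assumed)
    return {"STD": std, "STDERR": std / math.sqrt(len(values))}


# Hypothetical rows: one probe per row, one value per chip.
rows = {
    "probe_1": [1.0, 1.0, 1.0, 1.0],
    "probe_2": [0.0, 4.0, 8.0, 12.0],
}
stats = {name: row_statistics(values) for name, values in rows.items()}

# Rows for which the condition STD > 2 holds.
matching = [name for name, s in stats.items() if s["STD"] > 2]
```

The two computed values correspond to the two columns added to the output data set.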
Compare the input data set (Figure 4.37) with the summarized output data set. Note the addition of two columns to the data set, containing the summary statistics for each of the rows. In addition, whereas the original data set had 1604 rows, the output data set has 64 rows. 1540 rows have been filtered out, based on the condition set in the Options tab.
Data Transform
The Transform command performs a mathematical transformation on specified variables. Transformations include exp2, exp, exp10, log2, log, log10, and sqrt (in the Type of Transformation list), or formulas specified in the Transform Expression box.
This example calculates the square root of each data point in the Affymetrix Latin Square Data Set.
Select Genomics > Data Set Utilities > Transform.
The Data Transform dialog opens, as shown in Figure 4.63.
Figure 4.63: The Data Transform dialog
To select the Affymetrix Latin Square Data Set as the Input Data Set, complete the following steps.
Click Choose to select the Input Data Set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin.sas7bdat file and click Open to select the file.
The variable names for the data set are listed in the Available Variables box, shown in Figure 4.64. To select the variables to be transformed, complete the following steps.
Select all of the available variables from a_01 through q_59.
Click to add the variables to the Variables to be Transformed box.
Figure 4.64: Selecting the variables to be transformed
Select sqrt from the Type of Transformation drop-down menu.
Specify an output folder.
Click Run to transform the data set.
The location of the output data set generated by this process is listed in a SAS Message dialog.
Click Open to examine the output data set, shown in Figure 4.65.
Figure 4.65: The transformed data set
Compare the input data set (Figure 4.37) with the transformed output data set (Figure 4.65) to see that each value in the selected columns has been replaced by its square root.
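The named transformations can be pictured as a lookup from the transformation name to a function that is applied to every value in the selected columns. The Python sketch below follows that idea; the table contents are hypothetical, and the mapping of names to functions (for example, log as the natural logarithm and exp2 as 2 raised to the value) is an assumption based on the names alone.

```python
import math

# Assumed meaning of each transformation name.
TRANSFORMS = {
    "exp2": lambda x: 2.0 ** x,
    "exp": math.exp,
    "exp10": lambda x: 10.0 ** x,
    "log2": math.log2,
    "log": math.log,
    "log10": math.log10,
    "sqrt": math.sqrt,
}


def transform_columns(table, columns, kind):
    """Return a copy of the table with `kind` applied to the listed columns."""
    func = TRANSFORMS[kind]
    result = {name: list(values) for name, values in table.items()}
    for name in columns:
        result[name] = [func(value) for value in table[name]]
    return result


# Hypothetical table: one identifier column and one numeric column.
table = {"probe": ["x", "y"], "a_01": [4.0, 9.0]}
transformed = transform_columns(table, ["a_01"], "sqrt")
```

Columns that are not listed, such as the identifier column here, pass through unchanged.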
Export
The Export command exports data from a SAS data set to a file. Supported formats are tab-delimited text (.txt), comma-separated values (.csv), blank-delimited text (.txt), Excel (.xls), and JMP (.jmp).
This example exports the Affymetrix Latin Square Data Set as an Excel file.
Select Genomics > Data Set Utilities > Export.
The Data Export dialog opens, as shown in Figure 4.66.
Figure 4.66: The Data Export dialog
To select the Affymetrix Latin Square Data Set as the Input Data Set, complete the following steps.
Click Choose to select the Input Data Set.
Navigate to Sample Data > MicroArray > Affymetrix Latin Square.
Select the affylatin.sas7bdat file and click Open to select the file.
To specify the format of the output file, complete the following steps.
Choose the Excel format.
Specify an output folder.
Click Run to generate the Excel file, shown in Figure 4.67.
Figure 4.67: The exported Excel file
CHAPTER 5
Genetic Marker Case-Control Data
In addition to a set of general statistical and data processing routines, JMP Genomics offers a collection of processes for analysis of genetic marker data. Access these processes from five submenus of the JMP Genomics menu, as shown in Figure 5.1.
Genetics Core Submenus
Figure 5.1: The Genetics submenus
This chapter focuses on processes appropriate for case-control data. In case-control data, individuals are assumed to be:
• unrelated in recent generations, and
• classifiable according to some phenotype.
The phenotype is typically binary with two generic levels, “case” and “control”, although several of the methods handle multi-category or continuous (quantitative) phenotypes. Analysis of data for which family or pedigree information is available, in addition to markers and phenotypes, is discussed in Chapter 6.
Note: Nearly all of the processes discussed in this and the next chapter call procedures from SAS/Genetics™. Detailed descriptions of these procedures and the computations performed are available in the SAS/Genetics™ 9.1.3 User’s Guide. Refer to this guide for details concerning the usage and computational methods of these SAS procedures.
5 Genetic Marker Case-Control Data 92
The Genetic Marker Example
The example used in the analyses described in this chapter is the Genetic Marker data set described in Chapter 1. The data set and associated files can be found in the Sample Data folder that comes with JMP Genomics. To familiarize yourself with the data set for this example, complete the following steps.
Select File > Open to open an Open Data File dialog.
Navigate to Sample Data > Genetics.
Select the samplegmdata.sas7bdat file and click Open to select the file.
The file opens the JMP table, shown in Figure 5.2.
Figure 5.2: Partial view of the samplegmdata.sas7bdat file
Examine the data contained in Figure 5.2. The data are in wide form, with 1000 rows corresponding to individuals and 130 columns corresponding to various data on these individuals. These data do, in fact, contain family and pedigree information, but this chapter considers only the unrelated individuals for which both father=0 and mother=0 (the founders). The disease column contains the binary trait of primary interest. There are also four quantitative traits and sixty markers for each individual. The marker data occur in pairs, so that the ma1 and ma2 column entries contain the alleles in the first genotype, ma3 and ma4 the second genotype, and so on. The data are computer-simulated.
Genetic Marker Data Format
The genetics processes in JMP Genomics analyze data consisting of individuals that have been genotyped at a set of genetic markers of interest. The required data structure for most of the genetics processes is the wide form, in which rows correspond to individuals and columns correspond to pedigree information, phenotypes, and genotypes. Refer to Chapter 3 for a more thorough discussion of tall and wide data sets. Genotypes can be represented in two different ways, and the two data sets partially shown in Figure 5.3 illustrate these representations of the marker genotypes. JMP Genomics can process either representation.
Figure 5.3: Two different ways of representing marker genotypes
These data sets list the genotypes for the same group of individuals. Each individual is represented in a row. In the data set on the left, the alleles that comprise the genotype at each marker are listed in sequential pairs of columns. Each column in the pair contains one of the two alleles that make up the genotype. For example, the genotype of the first marker is listed in columns ma1 and ma2; the alleles that make up the genotype of the second marker are listed in columns ma3 and ma4, and so on. Alternatively, the alleles that make up the genotype at each locus can be listed in a single column with a delimiter (such as the “/” character used in the data set on the right in Figure 5.3 in columns g1−g3) separating the two alleles observed at the marker for the individual. Each of the genetics processes that contain a Marker Variables field for specifying the marker genotype variables offers a Format of Marker Variables option that indicates whether the variables in the data set correspond to individual alleles, two per marker, or genotypes with the delimiter of your choice.
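The relationship between the two representations can be made concrete with a short Python sketch that converts one into the other. The function names are hypothetical and the allele values are made up; only the column naming (ma1, ma2, and so on) and the "/" delimiter follow the example.

```python
def alleles_to_genotypes(row, allele_columns, delimiter="/"):
    """Join consecutive allele columns (ma1, ma2), (ma3, ma4), ... into genotypes."""
    if len(allele_columns) % 2 != 0:
        raise ValueError("allele columns must come in pairs, two per marker")
    pairs = zip(allele_columns[0::2], allele_columns[1::2])
    return [f"{row[first]}{delimiter}{row[second]}" for first, second in pairs]


def genotypes_to_alleles(genotypes, delimiter="/"):
    """Split delimited genotypes back into a flat list of alleles, two per marker."""
    alleles = []
    for genotype in genotypes:
        first, second = genotype.split(delimiter)
        alleles.extend([first, second])
    return alleles


# Hypothetical individual genotyped at two markers, stored as four allele columns.
row = {"ma1": "A", "ma2": "G", "ma3": "C", "ma4": "C"}
genotypes = alleles_to_genotypes(row, ["ma1", "ma2", "ma3", "ma4"])
round_trip = genotypes_to_alleles(genotypes)
```

Because either form can be derived from the other, the Format of Marker Variables option only needs to know which of the two layouts the data set uses.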
In addition to the main data set containing pedigree, phenotype, and genotype information, there might also be information about the genetic markers in an annotation data set. For the annotation data set, the rows represent markers and they must match the order of the markers in the main data set. Label, chromosome, physical position, GenBank accession number, and dbSNP identifier are examples of the variables that the annotation data set could include. Most of JMP’s Genetics processes provide an Annotation tab that allows you to specify this data set and cast variables into particular roles to be used in the analysis and output.
Importing Genetic Marker Data
There are a number of different ways to prepare genetic marker data for processing with JMP Genomics. Your choice of import methods depends on the format of the raw data files and the types of analyses you want to perform. The goal is to create a wide SAS data set, and optionally a corresponding SAS annotation data set. With SAS programming experience, these data sets can be created directly in SAS before working with them in JMP Genomics. Alternatively, if the data are already in wide form, but are in text or Excel formats, open them directly in JMP, alter them as needed, and then save them as SAS data files (see Chapter 3 for an example of generating a SAS data set from an Excel file). JMP Genomics also offers customized import routines for seven different specialized genetics formats (Affymetrix SNP CHP, Affymetrix SNP CEL, Illumina SNP, Arlequin, HapMap, NEXUS, and Pedigree) divided among the Affymetrix, Illumina and Other Genetics submenus.
Finally, the generic Import Individual Text, CSV, or Excel Files process directly creates a SAS data set from one file. The Import a Designed Experiment from Text, CSV, or Excel Files process does the same if the data are spread across multiple files. The latter requires an accompanying Experimental Design File. See Chapter 3 for examples illustrating the generation of SAS data sets.
Genetic Marker Statistics
The Genetic Marker Statistics submenu offers five analytical processes, as shown in Figure 5.4.
Figure 5.4: The Genetic Marker Statistics submenu.
These processes calculate a variety of measurements and statistics for both phenotypic and genotypic markers and often serve as the starting point for further experiments and analyses.
Marker Properties
A convenient way to explore several properties of all the markers is to use the Marker Properties analytical process. Use the following steps to run this process on the samplegmdata.sas7bdat data set described in Chapter 1.
Select Genomics > Genetic Marker Statistics > Marker Properties.
The dialog shown in Figure 5.5 opens.
Figure 5.5: The Marker Properties dialog
Click Load.
Select the settings for the GeneticMarkerExample.
Click OK to complete the Marker Properties dialog, as shown in Figure 5.6.
Figure 5.6: The completed General tab of the Marker Properties dialog
Recall that this input data set contains variables ma1 – ma120, and that each specifies a single allele. These markers were selected from the list in the Available Variables box and added to the Marker Variables box.
Note: This data set also contains family data. To run the analysis on the subset of unrelated individuals, the Filter to Include Observations field should contain a filter that restricts the analysis to the founders.
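The logic of a founder filter is simple enough to show in a few lines of Python. This sketch only illustrates the condition father=0 and mother=0 described for this data set; the `id` column and the sample rows are hypothetical, and the actual filter is entered in the dialog as a SAS expression.

```python
def is_founder(row):
    """A founder has no recorded parents: father and mother are both 0."""
    return row["father"] == 0 and row["mother"] == 0


# Hypothetical individuals; a nonzero parent value refers to another individual.
individuals = [
    {"id": 1, "father": 0, "mother": 0},
    {"id": 2, "father": 0, "mother": 0},
    {"id": 3, "father": 1, "mother": 2},
]
founders = [row for row in individuals if is_founder(row)]
founder_ids = [row["id"] for row in founders]
```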
Do not make any changes to the General tab.
Click on the Annotation tab to bring up the tab shown in Figure 5.7.
Figure 5.7: The Annotation tab.
Examine the Annotation tab. This tab specifies a separate annotation data set that contains information about the markers being analyzed. The annotations used for the markers in this example are listed in the annotation data set samplemap.sas7bdat, found in the Sample Data folder.
Click Open to examine the annotation data set, as shown in Figure 5.8.
Figure 5.8: The samplemap.sas7bdat file.
Sixty different markers, corresponding to the 60 pairs (ma1 – ma120), are described in this data set. Note that the rows in this data set must be in exactly the same order as the marker columns in the input data set.
There are three columns in the annotation data set. Each of these variables serves a different role in the analysis. The values in the Marker column label the markers in the output data set and any plots. The values in the CandGene column designate the candidate gene in which each marker resides. This variable groups analyses with identical CandGene levels and produces separate plots of the HWE p-values for each group. The values in the Location column list the chromosomal location of each of the markers. The x-axis of these plots uses the values in the Location variable. Each of these variables is specified by default in this example.
The Filter to Include Markers field located at the bottom of this tab allows you to enter text that subsets the annotation data set to restrict the markers from the input data set that you want to analyze. This can be especially useful when selecting marker variables with the List-Style Specification of Marker Variables field on the General tab. When the marker genotypes are in columns that all begin with the same prefix, the list-style specification is a convenient way to select all markers, and the Filter to Include Markers field can then filter out particular marker variables based on values of variables in the annotation data set.
Do not make any changes to the Annotation tab.
Click on the Options tab to bring up the tab shown in Figure 5.9.
Figure 5.9: The Options tab
Examine the Options tab. Because consecutive pairs of these columns make up the genotype at each of the 60 markers, the Alleles radio button is selected for the Format of Marker Variables parameter.
Do not make any changes to the Options tab.
Click the Output tab to bring up the tab shown in Figure 5.10.
Figure 5.10: The Output tab
Note that the Create Frequency Charts box, Create HTML box, and Create Cell Plot box are all checked. With all three boxes checked, the output from this process includes JMP frequency charts for the alleles and genotypes, HTML files containing SAS PROC ALLELE tables summarizing marker information and allele and genotype frequencies, and a cell plot representing marker genotypes.
Note that the Output File prefix box is blank. When this box is left blank, the name of the input data set is used as the prefix when naming the output files. This allows all analyses performed on the same genetic marker data to be named similarly and thus easily identified. Alternatively, you can specify a different prefix to use; for example, a project identifier for the analyses you are running.
Note: If the same prefix is used for multiple runs of the same process and the same output folder is specified, results from the previous run will be overwritten.
Click Run.
Figure 5.11 shows some of the output.
Figure 5.11: Output from the Marker Properties process
Explore the results in the different windows. The cell plot provides a global view of the genotypes and lets you see patterns of homozygosity and heterozygosity using three colors. The histograms of allele and genotype frequencies provide locus-by-locus details. Note that each set of graphs is dynamically associated with a JMP table containing corresponding numerical results.
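To make the per-marker quantities more tangible, the following Python sketch computes allele frequencies, genotype counts, and a Hardy-Weinberg equilibrium (HWE) test for a single biallelic marker. It is a textbook chi-square goodness-of-fit calculation, not the computation performed by the SAS procedure behind this process, and the genotype data are hypothetical.

```python
import math
from collections import Counter


def marker_summary(genotypes):
    """Summarize one biallelic marker given a list of (allele, allele) tuples."""
    n = len(genotypes)
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    allele_freq = {allele: count / (2 * n) for allele, count in allele_counts.items()}
    genotype_counts = Counter(tuple(sorted(genotype)) for genotype in genotypes)

    a, b = sorted(allele_freq)  # assumes exactly two alleles are observed
    p, q = allele_freq[a], allele_freq[b]
    # Genotype counts expected under Hardy-Weinberg equilibrium.
    expected = {(a, a): n * p * p, (a, b): 2 * n * p * q, (b, b): n * q * q}
    chi2 = sum((genotype_counts.get(g, 0) - e) ** 2 / e for g, e in expected.items())
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return allele_freq, genotype_counts, chi2, p_value


# Hypothetical sample of 100 individuals in exact HWE proportions.
genotypes = [("A", "A")] * 25 + [("A", "G")] * 50 + [("G", "G")] * 25
allele_freq, genotype_counts, chi2, p_value = marker_summary(genotypes)
```

A small HWE p-value flags a marker whose genotype counts depart from the proportions implied by its allele frequencies.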
Linkage Disequilibrium
The Linkage Disequilibrium (LD) process offers various displays representing measures of linkage disequilibrium between pairs of markers.
Note: LD measures statistical association between groups of alleles at different loci. This is a different process than linkage analysis, which refers to techniques quantifying genetic distances. Due to the modern availability of fine-scale marker data, JMP Genomics currently focuses more on LD than on linkage analysis, although certain methods available in JMP Genomics provide information on linkage.
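For readers new to LD, the following Python sketch computes three standard pairwise measures (D, D', and r squared) from haplotype counts at two biallelic loci. It assumes the haplotypes are observed directly, which is rarely true for genotype data, and it is not a description of how the process itself estimates LD; the haplotype labels and counts are hypothetical.

```python
def ld_measures(hap_counts):
    """Return (D, D', r^2) from counts of haplotypes 'AB', 'Ab', 'aB', 'ab'."""
    n = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / n
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / n  # frequency of allele A
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / n  # frequency of allele B

    d = p_ab - p_a * p_b  # deviation from independence of the two loci
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denominator if denominator > 0 else 0.0
    return d, d_prime, r2


# Complete association: allele A always travels with allele B.
complete = ld_measures({"AB": 50, "Ab": 0, "aB": 0, "ab": 50})
# Independence: all four haplotypes are equally common.
independent = ld_measures({"AB": 25, "Ab": 25, "aB": 25, "ab": 25})
```

Values of D' and r squared near 1 indicate strong association between the two markers, and values near 0 indicate approximate independence.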
Select Genomics > Genetic Marker Statistics > Linkage Disequilibrium.
The dialog shown in Figure 5.12 opens.
Figure 5.12: The Linkage Disequilibrium dialog
Click Load.
Select the default settings for the GeneticMarkerExample.
Click OK to complete the Linkage Disequilibrium dialog, as shown in Figure 5.13.
Figure 5.13: The General (left) and Annotation (right) tabs of the completed Linkage Disequilibrium dialog
Examine the General and Annotation tabs. As discussed for Marker Properties, the marker variables have been selected, a filter to limit the analysis to the founders has been specified, the annotation data set has been chosen, and the annotation markers have been defined.
Do not make any changes to either the General or the Annotation tabs.
Click the Options tab to bring up the tab shown in Figure 5.14.
Figure 5.14: The Options tab
Examine the Options tab. As discussed for Marker Properties, Alleles is selected for the Format of Marker Variables parameter.
Do not make any changes to the Options tab.
Click on the Output tab to bring up the tab shown in Figure 5.15.
Figure 5.15: The Output tab
Examine the Output tab. This tab specifies parameters for the LD contour plot as well as other output.
Click Run. Figure 5.16 shows some of the results.
Figure 5.16: The output of the Linkage Disequilibrium process
Explore the results in the various windows. Note that each set of graphs is dynamically associated with a JMP table containing numerical results.
Other Processes
Three other processes are available under Genomics > Genetic Marker Statistics. Phenotype Summary provides a means to explore non-genetic variables that you have collected about the sample individuals. LD tagSNP Selection uses an LD measure to define bins of SNPs. Each bin is represented by a single SNP that is used in association studies. This grouping effectively reduces the number of SNPs to a small subset of tagSNPs that need to be considered. Malecot LD Map fits the Malecot model to pairwise marker statistics and constructs an associated one-dimensional map in terms of LD units. Default example settings are available for both the Phenotype Summary and LD tagSNP Selection processes, and you are encouraged to run them to see what functionality these two processes offer. Refer to the JMP Genomics User Guide – Supplement for more details on this process.
Association Testing
There are seven processes available in JMP Genomics for the association mapping of a trait or disease using genetic marker data. These include Case-Control Association, PCA for Population Stratification, Marker-Trait Association, SNP-Trait Association, transmission disequilibrium tests for either quantitative or binary traits (Quantitative TDT and TDT, respectively), and SNP Interaction Testing (experimental), as shown in Figure 5.17.
Figure 5.17: The Association Testing submenu
For a sample of unrelated individuals, Case-Control Association and Marker-Trait Association are appropriate, while the Quantitative TDT and TDT processes are designed for family data. The latter three processes include example data consisting of samples of genotyped parent-offspring trios or sibships, discussed further in Chapter 6.
The processes can be further distinguished by the type of trait on which they perform association testing. Case-Control Association and TDT offer chi-square tests for binary traits such as disease status. The other three processes provide methods for analyzing quantitative traits and can accommodate covariates. Marker-Trait Association can additionally handle binary, count, and survival trait variables and can adjust for strata variables or random effects. Table 5.1 provides a summary of the appropriate process for each type of analysis.
Table 5.1: Selection of Appropriate JMP Genomics Process for Different Types of Analyses
Family Relationship | JMP Genomics Process
Unrelated individuals | Case-Control Association
Unrelated individuals | PCA for Population Stratification
Unrelated individuals | Marker-Trait Association
Unrelated individuals | SNP-Trait Association
Unrelated individuals | SNP Interaction Testing
Individuals grouped in families | Quantitative TDT
Individuals grouped in families | TDT
Type of Trait columns: Binary, Quantitative, Count, Survival, Nominal, Ordinal
The following example uses the Case-Control Association process to analyze the binary variable indicating disease status for the samplegmdata.sas7bdat data set. Default example settings are available for the Marker-Trait Association, SNP-Trait Association, SNP Interaction Testing, Quantitative TDT, and TDT processes. Refer to the JMP Genomics User Guide – Supplement for more details on the SNP-Trait Association process. The TDT process is discussed in detail in Chapter 6. You are encouraged to run the remaining two processes to see what functionality they offer.
Case-Control Association
Select Genomics > Association Testing > Case-Control Association.
The Case-Control Association dialog shown in Figure 5.18 opens.
Figure 5.18: The Case-Control Association dialog
Click Load.
Select the default settings for the GeneticMarkerExample.
Click OK to complete the dialog as shown in Figure 5.19.
Figure 5.19: The completed General (left) and Annotation (right) tabs of the Case-Control Association dialog
Examine the General and Annotation tabs of the completed dialog shown in Figure 5.19. As discussed for Marker Properties, the marker variables have been selected, a filter to limit the analysis to the founders has been specified, the annotation data set has been chosen, and the annotation markers have been defined. When the number of marker variables is large, it is often more convenient to type the list of marker variables into the List-Style Specification of Marker Variables box, rather than entering each variable into the Marker Variables box. For this example, first remove all the variables in the Marker Variables box and type ma1-ma120 in the List-Style Specification of Marker Variables box. Remember, SAS variable names are not case-sensitive. The disease variable is listed in the Trait Variables box.
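A list-style specification such as ma1-ma120 is a SAS numbered range list: a shared prefix with a starting and an ending number. The Python sketch below shows what such a range expands to; the function name is hypothetical, and SAS performs this expansion itself when it reads the list.

```python
import re


def expand_numbered_range(spec):
    """Expand a numbered range such as 'ma1-ma120' into individual names."""
    match = re.fullmatch(r"([A-Za-z_]\w*?)(\d+)-([A-Za-z_]\w*?)(\d+)", spec.strip())
    if match is None:
        raise ValueError(f"not a numbered range: {spec!r}")
    prefix, start, end_prefix, end = match.groups()
    # Variable names are not case-sensitive, so compare prefixes ignoring case.
    if prefix.lower() != end_prefix.lower():
        raise ValueError("both ends of the range must share the same prefix")
    if int(start) > int(end):
        raise ValueError("the starting number must not exceed the ending number")
    return [f"{prefix}{number}" for number in range(int(start), int(end) + 1)]


marker_variables = expand_numbered_range("ma1-ma120")
# With the Alleles format, consecutive variables pair up into 60 markers.
marker_pairs = list(zip(marker_variables[0::2], marker_variables[1::2]))
```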
Do not make any changes to either the General or the Annotation tabs.
Click the Options tab to bring up the tab shown in Figure 5.20.
Figure 5.20: The Options tab
Examine the Options tab. As discussed for Marker Properties, Alleles is selected for the Format of Marker Variables parameter. All three association tests, the Pearson Chi-squared tests for alleles and genotypes, and the linear trend test, are selected in the Association Tests box.
Do not make any changes to the Options tab.
Click the P-Value Plots tab to bring up the tab shown in Figure 5.21.
Figure 5.21: The P-Value Plots tab
Examine the P-Value Plots tab. This tab specifies parameters for conversion, corrections, and adjustments to the analyses. Refer to the PSMOOTH procedure in the SAS/Genetics User’s Guide for more information on these parameters.
Click Run.
The output window illustrated in Figure 5.22 opens.
Figure 5.22: The output of the Case-Control Association process
Examine the overlay plots in the output window. The y-axis in the two plots displays the negative log p-value for three different tests of association. Peaks indicate locations of significant association, and you can mouse-over or click on them to highlight the rows in the corresponding JMP tables. The two different graphs appear because of the specification of CandGene as the Annotation Group Variable in the Annotation tab.
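To show where a point on such a plot comes from, the following Python sketch runs a Pearson chi-square test on a 2 x 2 table of allele counts for one marker and converts the p-value to a negative log. The counts are hypothetical, the base-10 logarithm is an assumption (the plot's log base is not stated here), and the sketch covers only the allele test, not the genotype or trend tests.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square test (1 degree of freedom) on a 2 x 2 allele table."""
    a, b = case_counts      # counts of allele 1 and allele 2 among cases
    c, d = control_counts   # counts of allele 1 and allele 2 among controls
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 100 case alleles and 100 control alleles at one marker.
chi2, p_value = allele_chi_square((60, 40), (40, 60))
neg_log_p = -math.log10(p_value)  # larger values mean stronger evidence
```

Because the transformation is monotone decreasing in the p-value, the tallest peaks in the plot correspond to the smallest p-values.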
Haplotype Analysis
Instead of examining markers individually, it can often be more informative to look at a set of alleles and markers from the same chromosome as a single entity; that is, as a haplotype. Estimates of haplotype frequencies can be used in a variety of ways: to test for multilocus LD, to test for association between a trait and several markers at once, and to infer the parental haplotypes that an individual receives. There are three processes available in JMP Genomics for analyzing haplotypes using genetic marker data. These include Haplotype Estimation, Haplotype Trend Regression, and htSNP Selection, as shown in Figure 5.23.
Figure 5.23: The Haplotype Analysis submenu
When genotype data are collected, the two haplotypes that compose a multilocus genotype are not typically observed. Thus, the alleles passed together from one parent, for each of the set of markers, remain unknown. The expectation-maximization (EM) algorithm can be used to estimate these unobserved haplotype frequencies and can be invoked with the Haplotype Estimation process, generally as the first step in your haplotype analysis.
You can estimate haplotype frequencies for one particular set of markers, or many sets. To perform estimation for multiple marker sets, define a group variable from your annotation data set, a sliding window of markers of a specified width, or both. For each set of markers, you can perform tests for LD and association with a binary trait.
To further determine the particular haplotype from a set of markers that may be influencing a trait (binary, quantitative, or survival), use output data sets from the Haplotype Estimation process as input for the Haplotype Trend Regression process. Output data sets can also feed the htSNP Selection process to determine the subset(s) of markers that explain much of the haplotype diversity within a block of strongly associated markers.
The following example uses the Haplotype Trend Regression process to analyze the quantitative trait Qtrt1 for the samplegmdata.sas7bdat data set. Default example settings are available for the Haplotype Estimation and htSNP Selection processes. Run the remaining two processes to see what functionality they offer.
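The idea behind EM estimation of haplotype frequencies can be illustrated for the simplest case of two biallelic markers. In the Python sketch below, each genotype is coded as the number of copies (0, 1, or 2) of a reference allele at each marker, and only individuals heterozygous at both markers have ambiguous phase. This is a textbook version written for illustration under those assumptions, with hypothetical counts; the Haplotype Estimation process relies on SAS/Genetics procedures and handles many more markers.

```python
from itertools import product


def em_haplotype_frequencies(genotype_counts, iterations=50):
    """Estimate two-locus haplotype frequencies from unphased genotype counts.

    genotype_counts maps (g1, g2) to a number of individuals, where g1 and g2
    count copies of the reference allele (coded 1) at marker 1 and marker 2.
    """
    haplotypes = list(product((0, 1), repeat=2))
    freq = {h: 0.25 for h in haplotypes}  # start from equal frequencies
    total = 2 * sum(genotype_counts.values())  # two haplotypes per individual

    for _ in range(iterations):
        expected = {h: 0.0 for h in haplotypes}
        for (g1, g2), count in genotype_counts.items():
            if g1 == 1 and g2 == 1:
                # E-step: split double heterozygotes between the two phasings,
                # (1,1)+(0,0) versus (1,0)+(0,1), using current frequencies.
                cis = freq[(1, 1)] * freq[(0, 0)]
                trans = freq[(1, 0)] * freq[(0, 1)]
                weight = cis / (cis + trans) if cis + trans > 0 else 0.5
                expected[(1, 1)] += count * weight
                expected[(0, 0)] += count * weight
                expected[(1, 0)] += count * (1 - weight)
                expected[(0, 1)] += count * (1 - weight)
            else:
                # Phase is unambiguous when at most one marker is heterozygous.
                first = (0 if g1 == 1 else g1 // 2, 0 if g2 == 1 else g2 // 2)
                second = (1 if g1 == 1 else g1 // 2, 1 if g2 == 1 else g2 // 2)
                expected[first] += count
                expected[second] += count
        # M-step: new frequencies are the expected haplotype proportions.
        freq = {h: expected[h] / total for h in haplotypes}
    return freq


# Hypothetical sample: 30 + 30 double homozygotes and 40 double heterozygotes.
estimates = em_haplotype_frequencies({(2, 2): 30, (0, 0): 30, (1, 1): 40})
```

In this example the unambiguous individuals carry only the (1, 1) and (0, 0) haplotypes, so the iterations assign nearly all of the ambiguous double heterozygotes to that same pair.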
Haplotype Trend Regression
Select Genomics > Haplotype Analysis > Haplotype Trend Regression. The Haplotype Trend Regression dialog shown in Figure 5.17 opens.
Figure 5.17: The Haplotype Trend Regression dialog
Click Load.
Select the default settings for the GeneticMarkerExample.
Click OK to complete the dialog as shown in Figure 5.18.
Figure 5.18: The completed Haplotype Trend Regression dialog
Click Open to open the samplegmdata_phase.sas7bdat input data set (Figure 5.19).
Figure 5.19: The samplegmdata_phase.sas7bdat file
Note that the data set in this example contains columns from samplegmdata.sas7bdat, shown in Figure 5.2. This is the Phase Assignment data set created by the Haplotype Estimation process. The columns selected as ID variables from the original data set are included in this data set, namely Individual ID, disease, Qtrt1, and Qtrt2. Columns _A_1 through _A_10 contain the two alleles at each of the five markers in the sliding window.

Examine the General tab in the completed dialog (Figure 5.18). All of the columns from the input data set are listed in the Available Variables box. Qtrt1 is selected as the Trait Variable and Qtrt2 is selected as the Covariate. The SAS expression windows=7 is entered in the Where Clause box to perform the haplotype trend regression using the five markers from sliding window 7, which correspond to the first five single nucleotide polymorphisms (SNPs) from candidate gene 2. When the Sliding Window option is specified for the Haplotype Estimation run that creates the input data set for Haplotype Trend Regression, either a single sliding window can be analyzed using the Where Clause as shown here, or Window must be selected as a By Variable.
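The relationship between a sliding window, its markers, and the ten allele columns can be sketched in Python as follows. This is a rough illustration under stated assumptions: the marker names m1 through m12 are hypothetical, and the windows are assumed to advance one marker at a time, which this guide does not state explicitly.

```python
def sliding_windows(markers, width, step=1):
    """Return {window number: list of markers}, numbering windows from 1."""
    starts = range(0, len(markers) - width + 1, step)
    return {i + 1: markers[s:s + width] for i, s in enumerate(starts)}


# Hypothetical marker names; the real data set uses its own names.
markers = [f"m{i}" for i in range(1, 13)]
windows = sliding_windows(markers, width=5)

# Under the one-marker step assumption, window 7 covers markers 7 to 11.
window_7 = windows[7]

# Two allele columns per marker give ten columns for a five-marker window.
allele_columns = [f"_A_{k}" for k in range(1, 2 * len(window_7) + 1)]
```

Selecting a single window with a Where Clause corresponds to picking one entry of `windows`, whereas a By Variable corresponds to looping over all of them.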
Do not make any changes to the General tab.
Click the Options tab to bring up the tab shown in Figure 5.20.
Figure 5.20: The Options tab
Examine the Options tab. The Type of Trait is specified as Continuous to allow for a linear regression of the trait variable (Qtrt1, specified in the General tab) on the haplotypes. The Frequency Cutoff for Combining Haplotypes is set to 0.005. Any haplotypes with a frequency below this value are combined into a single group for analysis. The frequencies are provided by the data set specified as the Haplotype Frequency Data Set, also created as an output data set by the Haplotype Estimation process.
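The effect of the frequency cutoff can be illustrated with a short Python sketch. The haplotype labels, the frequencies, and the name of the pooled group are all made up for this example; the sketch only shows the pooling rule described above and is not the code used by the process.

```python
def combine_rare_haplotypes(freqs, cutoff=0.005, pooled_label="rare"):
    """Pool every haplotype whose frequency is below the cutoff."""
    combined = {}
    for haplotype, freq in freqs.items():
        if freq < cutoff:
            combined[pooled_label] = combined.get(pooled_label, 0.0) + freq
        else:
            combined[haplotype] = freq
    return combined


# Hypothetical estimated haplotype frequencies for one marker set.
estimated = {"A-C-G": 0.600, "A-T-G": 0.394, "G-C-A": 0.004, "G-T-A": 0.002}
pooled = combine_rare_haplotypes(estimated, cutoff=0.005)
```

Here the two haplotypes below 0.005 are merged into one group, so the regression has three haplotype categories instead of four.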
Click Run.
The output window shown in Figure 5.21 opens.
Figure 5.21: The output of the Haplotype Trend Regression process
Be sure to scroll down and examine the entire second table. This table lists the F-statistics and associated probabilities for each of the 14 estimated haplotypes. Haplotypes 14 and 1 are revealed as the most significant.
Chapter 6
Genetic Marker Family or Pedigree Data
While Chapter 5 considered genetic marker data from unrelated individuals, this chapter describes methods in JMP Genomics appropriate when family or pedigree information is available for the individuals. These methods include the Transmission Disequilibrium test for both binary traits (TDT) and quantitative traits (Quantitative TDT) in the Association Testing submenu, as shown in Figure 6.1, and the three processes grouped in the Model-free Linkage submenu, as shown in Figure 6.15.
Figure 6.1: The Association Testing submenu
Note: Nearly all of the processes discussed in this and the previous chapter call procedures from SAS/Genetics™. Detailed descriptions of these procedures and the computations performed are available in the SAS/Genetics™ 9.1.3 User’s Guide. This reference can be accessed from http://support.sas.com/documentation/index.html or viewed in PDF format from http://support.sas.com/documentation/onlinedoc/91pdf/. You should refer to this guide for details concerning the usage and computational methods of these SAS procedures.
The Sample Data Sets
The analyses described in this chapter use two sample data sets. The first data set is the genetic marker data set, samplegmdata.sas7bdat, considered in the previous chapter and described in Chapter 1. To familiarize yourself with the genetic marker data set, complete the following steps.
Select File > Open to open an Open Data File dialog.
Navigate to Sample Data > Genetics.
Select the samplegmdata.sas7bdat file and click Open to select the file.
The file opens as a JMP table, as partially shown in Figure 6.2.
Figure 6.2: The samplegmdata.sas7bdat file
The first four columns describe the family data structure for the 1000 individuals in the samplegmdata.sas7bdat data set. Ped_id is a variable whose values correspond to distinct family units. Ind_id is the individual identifier and is unique within each level of Ped_id. The father and mother columns contain the Ind_id values corresponding to that individual’s father and mother within their specific family. If the individual is a founder in the population (that is, data on that individual’s father and mother are not available), a value of 0 is coded for their father and mother.

See Chapter 5 for further details about the other variables in this data set. Chapter 5 also describes the Marker Properties and Linkage Disequilibrium processes for investigating basic statistics on the markers, and provides an overview of the association testing methods available in JMP Genomics.

The second data set, used for the Model-free Linkage processes, is the affected sib-pair (ASP) data kindly provided by Gonçalo Abecasis (University of Michigan Center for Statistical Genetics). This data set is discussed later in this chapter. Both data sets and associated files are found in the Sample Data folder that came with JMP Genomics.
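The family columns described above (Ped_id, Ind_id, father, and mother, with 0 marking a founder) can be checked with a small Python sketch. The three-person family below is invented for illustration, and the helper names are assumptions of this sketch rather than part of JMP Genomics.

```python
def find_founders(rows):
    """Return (Ped_id, Ind_id) for individuals whose parents are coded 0."""
    return [(r["Ped_id"], r["Ind_id"]) for r in rows
            if r["father"] == 0 and r["mother"] == 0]


def missing_parents(rows):
    """List parent IDs that do not appear as individuals in the same family."""
    known = {(r["Ped_id"], r["Ind_id"]) for r in rows}
    problems = []
    for r in rows:
        for role in ("father", "mother"):
            parent = r[role]
            if parent != 0 and (r["Ped_id"], parent) not in known:
                problems.append((r["Ped_id"], r["Ind_id"], role, parent))
    return problems


# Hypothetical trio: two founders and one child in family 1.
family = [
    {"Ped_id": 1, "Ind_id": 1, "father": 0, "mother": 0},
    {"Ped_id": 1, "Ind_id": 2, "father": 0, "mother": 0},
    {"Ped_id": 1, "Ind_id": 3, "father": 1, "mother": 2},
]
founders = find_founders(family)
problems = missing_parents(family)
```

Because Ind_id is unique only within a level of Ped_id, the lookup key in this sketch is the pair (Ped_id, Ind_id) rather than Ind_id alone.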
Importing Family Data
There are a number of different ways to prepare family genetic marker data for processing with JMP Genomics, depending upon the format of the raw data files. The goal is to create a wide SAS data set as described previously and, optionally, a corresponding annotation SAS data set. See Chapter 3 for more details on generating SAS data sets.
As discussed in Chapter 5, JMP Genomics also offers customized import routines for six different specialized formats (Affymetrix SNP CHP, Arlequin, HapMap, Illumina, NEXUS, and Pedigree). These are found in the Data Set Creation submenu, as shown in Figure 5.4. This example uses the Pedigree process to import family-specific data.
The following steps describe how to use the customized Pedigree import process.
Select Genomics > Data Set Creation > Other Genetics > Pedigree.
The dialog shown in Figure 6.3 opens.
Figure 6.3: The Pedigree Input Engine dialog
The main input file is specified in the Input Pedigree File box. This example uses the ped_all_columns.txt file included with JMP Genomics. To view this file, complete the following steps.
Navigate to Sample Data > Genetics.
Select the ped_all_columns.txt file.
Click Open to open the file shown in Figure 6.4.
Figure 6.4: The ped_all_columns.txt file
Note: This file is formatted as a blank-delimited text file. Each column is separated by a space. The columns, in order, indicate pedigree, individual ID, father’s ID, mother’s ID, sex, disease status, genotypes for five markers, and data for five quantitative traits. In addition to .txt files, the Pedigree process accommodates standard input file formats such as LINKAGE, QTDT, Genehunter, and FBAT.

To choose the ped_all_columns.txt file as the input file, complete the following steps.
Click Load.
Select the settings for PedigreeExample1.
Click OK to complete the Pedigree Input Engine dialog, as shown in Figure 6.5.
Figure 6.5: The completed General tab of the Pedigree Input Engine dialog
Examine the General tab. Note that the ped_all_columns.txt file has been selected as the input file for this example. The destination folder for the output from this process has also been specified.
Do not make any changes to the General tab.
Click the Options tab to bring up the tab shown in Figure 6.6.
Figure 6.6: The Options tab
Examine the Options tab. This tab is where you specify the format of the input data file, the labels and identities of the different variables, and an optional name for the output data set. Note that SPACE is selected in the Column Delimiter field, thus matching the format of the input data set. Also note that specific names are listed, in order, for each of the columns, and that the quantitative variables are identified.

Note: The order of the values listed in the List of Variable Names and in the Quantitative Variables fields must exactly match the order of the columns in the Input Pedigree File.
Do not make any changes to the Options tab.
Click Run.
The location of the output data set generated by this process is listed in a SAS Message dialog, as shown in Figure 6.7.
Figure 6.7: The SAS Message dialog
Click Open to examine the contents and structure of the output data set (partially shown in Figure 6.8).
Figure 6.8: The output data set
The data have been imported into a JMP data table, organized into columns labeled as specified in the dialog. Note that the columns, except those containing quantitative traits, have missing values in place of any 0s that were present in the original text file. This recoding is done automatically for any column not listed in the Quantitative Variables field.
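The import behavior described above (names applied to columns in order, and 0 recoded as missing in every column that is not quantitative) can be sketched in Python. The column names and the two data lines are hypothetical and much shorter than the real ped_all_columns.txt layout; this is an illustration of the rule, not the Pedigree process itself.

```python
def import_pedigree(text, names, quantitative):
    """Parse blank-delimited pedigree text into a list of row dictionaries.

    names must follow the column order of the file. A value of "0" becomes
    None (missing) unless the column is listed as quantitative.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(names):
            raise ValueError(f"expected {len(names)} columns, got {len(fields)}")
        row = {}
        for name, value in zip(names, fields):
            if name in quantitative:
                row[name] = float(value)  # a 0 here is a real measurement
            else:
                row[name] = None if value == "0" else value
        rows.append(row)
    return rows


# Hypothetical, shortened layout: one marker (two alleles), one trait.
names = ["ped", "id", "father", "mother", "sex", "disease",
         "m1_a1", "m1_a2", "qt1"]
sample = "1 1 0 0 1 2 1 2 0.0\n1 3 1 2 2 0 0 0 1.5\n"
table = import_pedigree(sample, names, quantitative={"qt1"})
```

The length check reflects why the order and number of names must match the file exactly: a mismatch would silently shift values into the wrong columns.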
The Transmission Disequilibrium Test (TDT)
The Transmission Disequilibrium Test (TDT) process offers various chi-square tests for binary traits such as disease status for genotyped parent-offspring trios or sibships. Use the following steps to compute TDT statistics for the disease variable in the samplegmdata.sas7bdat data set.
Select Genomics > Association Testing > TDT. The TDT dialog shown in Figure 6.9 opens.
Figure 6.9: The TDT dialog
To load the example settings for this analysis of the samplegmdata.sas7bdat data set, complete the following steps.
Click Load.
Select the settings for the GeneticMarkerExample.
Click OK to complete the TDT dialog, as shown in Figure 6.10.
Figure 6.10: The General tab of the TDT dialog
Examine the General tab of the completed TDT dialog. Note that the marker (ma1 – ma120) and disease variables, as well as the four family variables (Ped_id, Ind_id, father, and mother), are specified in their required fields. The Filter to Include Observations field is left blank because this example uses the entire data set of 1000 individuals.
Do not make any changes to the General tab.
Click the Annotation tab to bring up the tab shown in Figure 6.11.
Figure 6.11: The Annotation Tab
As discussed for the examples in Chapter 5, an annotation data set has been selected and required variables have been specified.
Do not make any changes to the Annotation tab.
Click the Options tab to bring up the tab shown in Figure 6.12.
Figure 6.12: The Options tab
As discussed for the examples in Chapter 5, Alleles is selected for the Format of Marker Variables parameter. The TDT, along with the continuity correction option, is selected for the Family Association test. Information about these parameters can be found in the PROC TDT chapter of the SAS/Genetics User’s Guide.
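For orientation, the classical TDT statistic for a biallelic marker can be written in a few lines of Python. This sketch uses the textbook McNemar-type form, where b and c are the numbers of times heterozygous parents transmit allele 1 and allele 2 to affected offspring. The continuity correction shown is the usual McNemar-style one and is assumed, not confirmed, to correspond to the option in the dialog; consult the PROC TDT documentation for the exact computations.

```python
import math


def tdt_chi_square(b, c, continuity=False):
    """Return (chi-square, p-value) for a biallelic TDT with 1 df.

    b, c: transmissions of allele 1 and allele 2 from heterozygous parents.
    """
    n = b + c
    if n == 0:
        raise ValueError("no informative transmissions")
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # McNemar-style continuity correction
    chi2 = diff ** 2 / n
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: allele 1 transmitted 30 times, allele 2 10 times.
chi2_plain, p_plain = tdt_chi_square(30, 10)
chi2_corrected, p_corrected = tdt_chi_square(30, 10, continuity=True)
```

The correction always shrinks the statistic, so the corrected p-value is never smaller than the uncorrected one.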
Do not make any changes to the Options tab.
Click the P-Value Plots tab to bring up the tab shown in Figure 6.13.
Figure 6.13: The P-Value Plots tab
Examine the P-Value Plots tab. This tab specifies parameters for conversion, corrections and adjustments to the analyses. Refer to the PSMOOTH procedure in the SAS/Genetics User’s Guide for more information on these parameters.
Do not make any changes to the P-Value Plots tab.
Click Run.
The output window illustrated in Figure 6.14 opens.
Figure 6.14: The output of the TDT process
Examine the overlay plots in the output window. The y-axis in the two plots displays the negative log p-value for the TDTs. Peaks indicate locations of significant association and you can mouse-over or click on them to highlight the rows in the corresponding JMP tables. The two different graphs appear because of the specification of CandGene as the Annotation Group Variable in the Annotation tab. The SAS Output window (not shown) contains detailed tabulated statistics from the tests.
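The negative log scale on the y-axis can be reproduced with a short Python sketch. The base of the logarithm is not stated above, so base 10 is assumed here, and the marker names and p-values are made up.

```python
import math


def neg_log10(p_value):
    """Convert a p-value to the negative log10 scale used for plotting."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


def strongest_signal(p_values):
    """Return the marker with the smallest p-value and its plotted height."""
    marker = min(p_values, key=p_values.get)
    return marker, neg_log10(p_values[marker])


# Hypothetical p-values for three markers.
example = {"m1": 0.20, "m2": 0.0004, "m3": 0.03}
peak_marker, peak_height = strongest_signal(example)
```

On this scale smaller p-values plot higher, which is why peaks in the overlay plots mark the most significant markers.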
Model-Free Linkage Tests on IBD Data
Three methods for performing model-free linkage tests are available in the Model-free Linkage submenu, as shown in Figure 6.15. These methods include the Affected Sib-Pair Tests, Haseman-Elston Regression, and Variance Components processes.
Figure 6.15: The Model-free Linkage submenu
When the data contain sibling pairs where both siblings are affected with the disease (or, more generally, possess the trait of interest), the Affected Sib-Pair Tests process can be used for performing simple chi-square tests for linkage between the trait and the available genetic markers. Both the Haseman-Elston Regression and Variance Components processes are designed for quantitative traits and can accommodate covariates. However, the Haseman-Elston Regression utilizes sib-pairs from the pedigrees sampled, whereas the Variance Components process uses any related pairs when testing for linkage of the trait with a marker.

The three Model-free Linkage processes are not applied to genetic marker data as are the other genetic processes; instead, they analyze data containing information about the probabilities of pairs of individuals sharing alleles that are identical-by-descent (IBD) at the markers of interest. The required input IBD data set must contain one row for each pair of related individuals being analyzed at each marker, with variables z0, z1, and z2 representing the probability of the two individuals in the pair sharing 0, 1, or 2 alleles IBD, respectively. All possible pair-wise comparisons within each family should be made. Variables for the pedigree or family, the two individual IDs, and the marker are also required in this data set. Pairs of individuals should be grouped by marker, then by pedigree or family prior to carrying out these processes.

The Identical-by-Descent (IBD) Data Sets
The example illustrated for the Model-free Linkage processes uses the affected sib-pair (ASP) data provided by Gonçalo Abecasis (University of Michigan Center for Statistical Genetics) and described in Chapter 1. This example comprises three associated data sets:
1) the IBD data set that contains the IBD probabilities for 20 markers in 200 families, with 4 individuals in each family
2) a pedigree data set that lists the family relationships, affected status, and marker genotypes for each of the 800 individuals (4 per family) in the data set
3) a map data set that lists the physical location of each of the markers on human chromosome 24.
Note: If you are curious about chromosome 24, recall that these are fictitious data.

To examine the IBD data set, complete the following steps.
Select File > Open to open the Open Data File dialog.
Navigate to Sample Data > Genetics.
Select the asp_ibd.sas7bdat file.
Click Open to open the file partially shown in Figure 6.16.
Figure 6.16: The IBD data file
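The z0, z1, and z2 variables required in an IBD data set can be sanity-checked and summarized with a short Python sketch. The three probabilities for a pair should sum to 1, and the expected proportion of alleles shared IBD is z1/2 + z2. Apart from z0, z1, and z2, the column names and all values below are hypothetical.

```python
def proportion_ibd(z0, z1, z2, tol=1e-6):
    """Expected proportion of alleles shared IBD by a pair at one marker."""
    if min(z0, z1, z2) < 0 or abs(z0 + z1 + z2 - 1.0) > tol:
        raise ValueError("z0, z1, z2 must be probabilities that sum to 1")
    return 0.5 * z1 + z2


def mean_sharing_by_marker(rows):
    """Average the IBD proportion over all pairs, separately per marker."""
    totals = {}
    for r in rows:
        pi_hat = proportion_ibd(r["z0"], r["z1"], r["z2"])
        total, count = totals.get(r["marker"], (0.0, 0))
        totals[r["marker"]] = (total + pi_hat, count + 1)
    return {marker: total / count for marker, (total, count) in totals.items()}


# Hypothetical rows: one pair per family at a single marker.
ibd_rows = [
    {"family": 1, "id1": 3, "id2": 4, "marker": "M1",
     "z0": 0.0, "z1": 0.5, "z2": 0.5},
    {"family": 2, "id1": 3, "id2": 4, "marker": "M1",
     "z0": 0.25, "z1": 0.5, "z2": 0.25},
]
sharing = mean_sharing_by_marker(ibd_rows)
```

Full siblings are expected to share half of their alleles IBD when there is no linkage, so sharing noticeably above 0.5 among affected sib pairs is the kind of signal the model-free tests look for.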
Note: All pair-wise comparisons within each family are listed. MERLIN was used to estimate identical-by-descent (IBD) allele-sharing probabilities at these markers for all pairs of related individuals.

To examine the IBD pedigree data set, complete the following steps.
Select File > Open to open the Open Data File dialog.
Navigate to Sample Data > Genetics.
Select the asp_ped.sas7bdat file.
Click Open to open the file partially shown in Figure 6.17.
Figure 6.17: The pedigree data file
Note: The alleles for each of the 20 markers are listed in successive pairs of marker columns, such that the alleles for the first marker are listed in columns a1 and a2, the alleles for the second marker are listed in columns a3 and a4, and so on. The 400 offspring are also measured for a quantitative trait of interest.

To examine the IBD map data set, complete the following steps.
Select File > Open to open the Open Data File dialog.
Navigate to Sample Data > Genetics.
Select the asp_map.sas7bdat file.
Click Open to open the file shown in Figure 6.18.
Figure 6.18: The map data file
Note: The location of each marker is listed.

The following example uses the Variance Components process to test for linkage between the 20 markers and the quantitative trait for the families in the ASP IBD, pedigree, and map data sets.

Variance Components
Select Genomics > Model-free Linkage > Variance Components.
The Variance Components dialog shown in Figure 6.19 opens.
Figure 6.19: The Variance Components dialog
Click Load.
Select the default settings for the Merlin_asp example.
Click OK to complete the dialog, as shown in Figure 6.20.
Figure 6.20: The completed General tab of the Variance Components dialog
Examine the General tab of the completed dialog. The data set asp_ibd.sas7bdat is specified in the IBD Data Set field, and asp_ped.sas7bdat is specified in the Pedigree Data Set field. The column headings from the pedigree data set are listed in the Available Variables box. The QTrait variable from the pedigree data set is the quantitative trait of interest, and the Family, ID, Parent1, and Parent2 variables specify the family structure. The Filter to Include Observations field is left blank because this example uses the entire data set of 800 individuals.
Do not make any changes to the General tab.
Click the Annotation tab to bring up the tab shown in Figure 6.21.
Figure 6.21: The Annotation Tab
Examine the Annotation tab. The asp_map.sas7bdat is specified as the annotation data set. The column headings from the file (illustrated in Figure 6.18) are listed in the Available Variables box. The variables marker and location are specified in the Annotation Label Variable and Annotation Location Variable boxes, respectively.
Do not make any changes to the Annotation tab.
Click the Options tab to bring up the tab shown in Figure 6.22.
Figure 6.22: The Options Tab
Examine the Options tab. Likelihood Ratio is selected as the test statistic.
Do not make any changes to the Options tab.
Click on the P-Value Plots tab to bring up the tab shown in Figure 6.23.
Figure 6.23: The P-Value Plots tab
This tab specifies parameters for conversion, corrections and adjustments to the analyses. Refer to the PSMOOTH procedure in the SAS/Genetics User’s Guide for more information on these parameters.
Click Run.
The output window shown in Figure 6.24 opens.
Figure 6.24: The output of the Variance Components process
Examine the output window. The y-axis of the plot, labeled ProbChi, displays the negative log p-value for the likelihood ratio test at each locus. The peak occurs at the fourth marker. You can mouse-over or click on the points to highlight the rows in the corresponding JMP tables. The SAS Output window (not shown) contains details for the SAS PROC MIXED runs used to generate the tests.
Chapter 7
Microarray Case Study I: The Drosophila Aging Experiment
In this chapter we use a small subset of the Drosophila aging experiment data from Jin et al. (2001) to work through several analytical processes as a case study. The experiment consisted of 24 two-color cDNA microarrays, six for each experimental combination of two lines (Oregon and Samarkand), two sexes (Female and Male), and two ages (1 week and 6 weeks). The Cy3 and Cy5 dyes were flipped for two of the six replicates for each genotype and sex combination. The design is a split-plot design, with Age and Dye as subplot factors, and Line and Sex as whole-plot factors. A total of 4256 clones were spotted on the arrays, but for this example, we use a subset containing 100 randomly selected genes.
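The arithmetic behind the design can be checked with a few lines of Python. This sketch only enumerates the factor combinations named above and counts channel measurements. It assumes the "six for each experimental combination" figure refers to channel-level measurements, since each two-color array contributes two of them.

```python
from itertools import product

lines = ("Oregon", "Samarkand")
sexes = ("Female", "Male")
ages = ("1 week", "6 weeks")

# Every Line x Sex x Age combination in the experiment.
combinations = list(product(lines, sexes, ages))

n_arrays = 24
channels_per_array = 2  # Cy3 and Cy5 on each two-color array
n_measurements = n_arrays * channels_per_array

# Channel measurements available per experimental combination.
per_combination = n_measurements // len(combinations)
```

With 8 combinations and 48 channel measurements, each combination is observed six times.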
Sample Workflow for Analysis of Microarray Data
The workflow∗ for this example is as follows:
1. Generation of the Data Sets
   i. Experimental Design File Builder
   ii. Data Set Creation
2. Evaluation of the Data Quality
   i. Raw Data Distribution Analysis
   ii. Ratio Analysis (Raw Data)
   iii. Ratio Analysis (Loess Normalization)
3. Comparison of Different Methods for Data Normalization
   i. Data Standardization (Median) & Standardized Distribution Analysis
   ii. Loess Normalization Across Arrays & Distribution Analysis (Loess Normalized Data)
4. Evaluation of Normalized Data Quality
   i. Correlation and Principal Components
   ii. Correlation and Grouped Scatter Plots
5. Primary Data Analysis for Determining Significant Differences in Gene Expression
   i. Analysis of Variance
   ii. Mixed Model Analysis
6. Further Analysis
   i. Transpose Tall and Wide
   ii. K-Means Clustering
   iii. Distance Matrix
7. Predictive Modeling
While this is a fairly standard sequence of processes to run, the order of the processes can change to suit any experimental objectives.
∗ Outline topics correspond to subsections of this chapter.
Generation of the Data Sets
As described in Chapter 3, JMP Genomics requires the generation of specific data sets. The first step in generating these data sets is building the Experimental Design File.
Experimental Design File Builder
Many of the processes in JMP Genomics require an Experimental Design Data Set (EDDS), which contains the corresponding experimental factors for each channel in a multi-channel platform or for each array in a single-channel platform. In order to bulk-load a set of raw data files, you need to prepare a corresponding Experimental Design File (EDF) that contains the file names and all experimental factors. Refer to Chapter 3 for detailed instructions on how to create an EDF. Here, we use the Experimental Design File Builder process to generate an EDF for the trimmed Drosophila Aging Data. The raw data consist of 24 .DAT files located in the Sample Data folder. To build an EDF using these files, complete the following steps.
Select Genomics > Experimental Design > Experimental Design File Builder.
The Experimental Design File Builder dialog appears, as shown in Figure 7.1.
Figure 7.1: The Experimental Design File Builder dialog
Click Choose to select the folder containing the raw data files.
Navigate to Sample Data > Microarray.
Open the Scanalyze Drosophila folder.
Click Select (circled in Figure 7.2) to select the folder.
Figure 7.2: Selecting the folder that contains the raw data files
When selecting folders in JMP Genomics, navigate into the folder containing the raw data files and select it.
Because the raw data files are in the .DAT file format, filter out all file types but .DAT files.
Select .dat from the File Filter Expression drop-down menu.
The File Filter Expression box appears as shown in Figure 7.3.
Figure 7.3: Selecting the file filter
Recall that this is a two-color array using Cy3 and Cy5. Because the probes were labeled with two dyes,
Enter 2 in the Number of Channels in Each File box, as shown in Figure 7.4.
Figure 7.4: Specifying two channels
Type Line, Sex, and Age in the New Variable Names for Experimental Design box, as shown in Figure 7.5.
Figure 7.5: Entering new variable names
Note: These variable names may be entered on the same line, but must be separated by a space.
Specifying a name for the output file is optional and you may specify any name you like here. For this example, DrosophilaAging_Exp.txt is the name used for the output file.
Type DrosophilaAging_Exp.txt in the Output File Name box, as shown in Figure 7.6.
Figure 7.6: Specifying the output file
To specify the output folder, complete the following steps.
Click on Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The Experimental Design File Builder dialog should appear like the one shown in Figure 7.7.
Figure 7.7: The completed General tab of the EDF Builder dialog
Click Run to generate the EDF.
The EDF is shown in Figure 7.8.
Figure 7.8: The EDF
The EDF contains several empty columns. You can type the corresponding information into them and use the Create Array Index, Create ColumnName, and Check File Names commands located under the Data Set Creation submenu to add or modify certain columns. Alternatively, since the raw file names contain sufficient information to fill in the empty columns, you can write SAS code to create the values of Line, Sex, Age, and Intensity.
Click the Experimental Design File Builder dialog to make it the active window.
Click the Options tab.
Type the following SAS commands into the SAS Code to Create Columns box.
    Name = scan(File,Channel);
    if substr(Name,1,1) = "O" then Line = "ORE"; else Line = "SAM";
    if substr(Name,2,1) = "M" then Sex = "MAL"; else Sex = "FEM";
    if substr(Name,3,1) = "1" then Age = "WK1"; else Age = "WK6";
    if Channel = 1 then do; Dye = "Cy3"; Intensity = "Ch1i"; end;
    else do; Dye = "Cy5"; Intensity = "Ch2i"; end;
    if Array < 10 then ArrayString = "0" || trim(left(Array));
    else ArrayString = trim(left(Array));
    ColumnName = trim(Line) || "_" || trim(Sex) || "_" || trim(Age) || "_" || trim(Dye) || "_" || ArrayString;
    drop Name Channel ArrayString;
    rename Dye = Channel;
Note: The first part of the File variable (before the first “.”) and the second part (between the first and the second “.”) of the raw file name contain the experimental information associated with the Cy3 and Cy5 channels, respectively. These commands may be modified to fit most experimental conditions. Refer to the SAS 9.1.3 User’s Guide (http://support.sas.com/onlinedoc/913/docMainpage.jsp) for additional information.
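For readers less familiar with the SAS functions used in these commands, the same logic can be mirrored in Python. This is only a reading aid. The file name OM1.SF6.dat is hypothetical, and the sketch splits the name on "." only, whereas the SAS scan function also treats several other characters as delimiters by default.

```python
def edf_columns(file_name, channel, array):
    """Derive the experimental design columns for one channel of one file."""
    # Part 1 of the file name describes channel 1, part 2 describes channel 2.
    name = file_name.split(".")[channel - 1]
    line = "ORE" if name[0] == "O" else "SAM"
    sex = "MAL" if name[1] == "M" else "FEM"
    age = "WK1" if name[2] == "1" else "WK6"
    if channel == 1:
        dye, intensity = "Cy3", "Ch1i"
    else:
        dye, intensity = "Cy5", "Ch2i"
    # Zero-pad single-digit array numbers, as the SAS code does.
    column_name = f"{line}_{sex}_{age}_{dye}_{array:02d}"
    return {"Line": line, "Sex": sex, "Age": age, "Channel": dye,
            "Intensity": intensity, "ColumnName": column_name}


# Hypothetical raw file name for array 3.
cy3 = edf_columns("OM1.SF6.dat", channel=1, array=3)
cy5 = edf_columns("OM1.SF6.dat", channel=2, array=3)
```

As in the SAS version, the dye name replaces the numeric channel in the final Channel column, and ColumnName combines all factors into a unique label for each channel.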
Click Run to generate the modified EDF. The modified EDF is partially shown in Figure 7.9.
Figure 7.9: The modified EDF
The EDF is automatically saved as a text file in the output folder you specified in the Experimental Design File Builder dialog.
Data Set Creation
To use a device-specific import engine to generate, from the raw data files, a SAS data set and an EDDS that JMP Genomics can use for further analysis, complete the following steps.
Select Genomics > Import > Other Expression > ScanAlyze. The ScanAlyze Import Engine dialog opens, as shown in Figure 7.10.
Figure 7.10: The ScanAlyze Import Engine dialog
Make sure the General tab is selected.
To choose the Experimental Design File you created in the previous section, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select Text Import Files (*.TXT; *.CSV; *.DAT) from the File of type drop-down menu.
Select the DrosophilaAging_Exp.txt file and click Open.
To choose the folder containing the raw data files, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray.
Open the Scanalyze Drosophila folder.
Click OK to select the folder.

The first row of each raw .DAT file lists the column names, and the primary numerical data do not start until the 9th row of the file, as shown in the partial view of one of the raw data files in Figure 7.11.
Figure 7.11: A portion of one of the raw data files from the Drosophila Aging experiment
The value of 9 is specified as the default setting for the Data Start Row, as shown in Figure 7.12, because of the structure of the raw data files generated by ScanAlyze. For other ScanAlyze experiments, the setting in the Data Start Row box may need to be changed.
Figure 7.12: Specifying the data start row
Do not change the Data Start Row default setting.
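The role of the Data Start Row setting can be illustrated with a small Python reader that takes column names from row 1 and begins reading data at a configurable row. The tab delimiter, the column names, and the sample content are assumptions made for this sketch and do not describe the exact layout of the ScanAlyze files.

```python
import csv


def read_dat(text, data_start_row=9, delimiter="\t"):
    """Read column names from row 1 and data from data_start_row onward."""
    lines = text.splitlines()
    header = next(csv.reader(lines[:1], delimiter=delimiter))
    data_lines = lines[data_start_row - 1:]  # rows are numbered from 1
    reader = csv.reader(data_lines, delimiter=delimiter)
    return [dict(zip(header, row)) for row in reader if row]


# Hypothetical file: header in row 1, rows 2 to 8 hold non-data remarks,
# and the numerical data begin in row 9.
sample = "\n".join(
    ["SPOT\tCH1I\tCH2I"]
    + [f"REMARK\tfiller {i}" for i in range(2, 9)]
    + ["1\t1200\t950", "2\t800\t1020"]
)
rows = read_dat(sample)
```

If the data in another experiment began on a different row, only the `data_start_row` argument would change, which parallels editing the Data Start Row box.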
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.13.
Figure 7.13: The completed General tab
Select the Options tab.
The Options tab appears as shown in Figure 7.14.
Figure 7.14: The Options tab
There are three output data sets generated by the ScanAlyze Import Engine:
1. Output Experimental Design Data Set
2. Output Data Set
3. Spot Coordinates Output Data Set
Specify drosophilaaging_exp as the name of the output experimental design data set.
Specify drosophilaaging as the name of the output data set.
The Perform Log2 Transform checkbox provides an option to apply a logarithm base 2 transformation to the intensities in the output data.
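As a minimal illustration of the transformation itself, the following Python sketch applies a base 2 logarithm to a list of intensities. Treating zero, negative, or absent intensities as missing is an assumption of this sketch; how the import engine handles such values is not described here.

```python
import math


def log2_transform(intensities):
    """Return log2 of each positive intensity and None for anything else."""
    return [math.log2(v) if v is not None and v > 0 else None
            for v in intensities]


# Hypothetical raw intensities for one channel.
raw = [1024, 1, 0, None]
transformed = log2_transform(raw)
```

On the log2 scale, a one-unit difference corresponds to a two-fold change in raw intensity, which is one reason the transformation is common for microarray data.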
Make sure that the Perform Log2 Transform box is checked.

The third data set, Spot Coordinates Output Data Set, specifies location data for the individual spots on the microarray. This data set is not required for the analyses described in this chapter.
Do not specify a spot coordinates output data set.

Number of Rows to Scan is used to specify the number of rows to be scanned in order to determine the attributes of the variables in the output SAS data set. The default value is set to 100.
Make sure the default value is specified.
The Options tab of the dialog should appear like the one shown in Figure 7.15.
Figure 7.15: The completed Options tab
Click Run to generate the data sets.
As discussed in Chapter 1, JMP Genomics dialogs generate and run a SAS program each time you click Run. Depending upon the size of your data sets and the capacity of your computer, some processes can take several minutes or, for very large and complex runs, several hours. While a program is running, the message SAS Connected is displayed in the JMP status bar located in the lower left corner of your JMP window (see Figure 1.10). The Windows Task Manager shows a process named sas.exe running, and you can track its CPU and I/O activity. You can also monitor the SAS temporary working directory and the Output Folder for results as they are created.

The SAS data sets generated by this process are listed in a SAS Message dialog that is displayed in a new window (shown in Figure 7.16).
Figure 7.16: The SAS Message dialog
The dialog lists the EDDS and the primary data set.
Click Open for each of the data sets to examine their contents and structures. The output EDDS is partially illustrated in Figure 7.17.
Figure 7.17: The drosophilaaging_exp EDDS
Figure 7.18 shows a partial listing of the output data.
Figure 7.18: The drosophilaaging data set
Note: The output data set is formatted in the tall SAS data set form, required for subsequent analyses.
Evaluation of Data Quality
Numerous factors can affect the quality of the data generated in any microarray experiment. These factors may include experimental errors in labeling, gene-specific differences, minor slide defects, differences in hybridization conditions, variability in printing quality, and so forth. Because these factors can interfere with interpretations, the first step in any analysis of microarray data should be to assess the quality of the raw data. Performing quality control (QC) at the beginning of an analysis can save a great deal of time downstream and leads to more reliable results.
Distribution Analysis for Raw Data
For the Drosophila example, we start with a simple distribution analysis to get a feel for overall intensity characteristics for the spots on the arrays.
Select Genomics > Quality Control > Distribution Analysis, as shown in Figure 7.19.
Figure 7.19: Selecting Distribution Analysis
The Data Distribution dialog opens, as shown in Figure 7.20.
Figure 7.20: The Data Distribution dialog
To choose the drosophilaaging.sas7bdat input data set created previously, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging.sas7bdat file and click Open.
Note that the file path and all the column labels from the input data set are listed in the Input SAS Data Set field and the Available Variables field, respectively, as shown in Figure 7.21.
Figure 7.21
Select from the available variables those for which you wish to view the distributions. Leaving the Variables for which to Display Distributions field blank displays distributions for all the available variables.
Leave the Variables for which to Display Distributions field blank.
Leave the ID Variables field blank.
Leave the List-Style Specification field blank.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.22.
Figure 7.22: The General tab of the completed Data Distribution dialog
The Experimental Design tab allows you to specify the experimental design data set (EDDS) and specific variables used to modify the analysis.
Click Experimental Design.
To choose the Experimental Design Data Set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
Note that the file path and all the column labels from the experimental design data set are listed in the Input SAS Data Set field and the Available Variables field, respectively, as shown in Figure 7.23.
Figure 7.23
Leave the Variables Defining Groups, Color Variables and Label Variable fields blank.
The Options tab allows you to specify how the results of this process are displayed.
Click Options to view the default settings.
Do not make any changes to the Options tab.
Click Run to generate the distributions.
Several windows open.
1. The drosophilaaging.sas7bdat data table for creating distribution details
2. A drosophilaaging_stack data set for creating box plots
3. A drosophilaaging_densities data set for creating the Overlay Kernel Density Estimates
4. A Box Plots summary window (Figure 7.24) that shows the distributions and outliers for all the variables in the input data set
Figure 7.24: The Box Plots summary window
5. A Distribution Details window (partially shown in Figure 7.25) that shows histograms, box plots, quantiles, and statistical moments for each row of the experimental design. Note that each row refers to an individual Cy3 or Cy5 channel in this case.
Figure 7.25: The Distribution Details window
6. A Parallel Plot window (Figure 7.26) that shows overlaid kernel density estimate curves
Figure 7.26: The Parallel Plot window
The overlay plot shows the raw univariate distributions of all 48 channels from the 24 arrays. Visually, the estimated distributions vary considerably among the 48 channels. This inherent variability among arrays and dyes indicates that normalization across arrays and channels is essential for effective analysis of these data.
Ratio Analysis and Checking for Dye Effects
Dye effects are often significant for multi-channel microarray data. To investigate dye effects, inspect plots of log ratios versus average (or sum) log intensities of the two channels for each array. Such plots are known as MA plots.
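To make the two axes of an MA plot concrete, the following minimal Python sketch computes the log ratio (M) and the average log intensity (A) for each spot on one array. It is an illustration only, not the code that JMP Genomics runs. The choice of Cy5 as the numerator, the base-2 logarithm, and the intensity values are all assumptions made for this example.

```python
import math

def ma_values(cy3, cy5):
    """Return (M, A) lists for one array from paired channel intensities."""
    m_vals, a_vals = [], []
    for g, r in zip(cy3, cy5):
        log_g, log_r = math.log2(g), math.log2(r)
        m_vals.append(log_r - log_g)          # M: log ratio of the two channels
        a_vals.append((log_r + log_g) / 2.0)  # A: average log intensity
    return m_vals, a_vals

# Hypothetical raw intensities for four spots on a single array.
cy3 = [100.0, 400.0, 1600.0, 6400.0]
cy5 = [200.0, 400.0, 800.0, 6400.0]
m, a = ma_values(cy3, cy5)
```

With no dye effect, the M values scatter around zero at every value of A; a trend in M as A changes is the pattern that the smoothing curve in an MA plot is meant to reveal.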
Select Genomics > Normalization > Ratio Analysis, as shown in Figure 7.27.
Figure 7.27: Selecting the Ratio Analysis process
The Ratio Analysis dialog opens, as shown in Figure 7.28.
Figure 7.28: The Ratio Analysis dialog
Examine the General tab. This example uses the same EDDS, input data set, and output path as used in the Distribution Analysis done previously.
To choose the drosophilaaging.sas7bdat input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging.sas7bdat file and click Open.
Note that the file path and all of the column labels from the input data set are listed in the Input SAS Data Set field and the Available Variables field, respectively, as shown in Figure 7.29.
Figure 7.29
Select from the available variables those for which you wish to view the distributions.
Select Spot.
Click to add Spot to the Feature Variable box, as shown in Figure 7.30.
Figure 7.30: Selecting the Feature Variable from the Input Data Set
The data is presented as hybridization intensity.
Make sure Intensity is selected as the input data type.
To choose the Experimental Design Data Set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.31.
Figure 7.31: The completed General tab of the Ratio Analysis dialog
Click Analysis to open the tab shown in Figure 7.32.
Figure 7.32: The Analysis tab of the Ratio Analysis dialog
Two parameters, the Variable to Define Ratio and the Value of Variable above to Be Used as Denominator, provide options to construct ratios for two channels within a single array.
Leave both parameters unselected.
De-select Perform Loess Normalization.
By leaving both parameters blank and by not performing the Loess Normalization, the Ratio Analysis process creates MA plots with the original raw data.
Click the Options tab to open the tab shown in Figure 7.33.
Figure 7.33: The Options tab of the Ratio Analysis dialog
Four parameters on the Options tab let you specify the output data set and file names. Specify the names or leave them blank to use default names.
Do not specify names for any of the output files.
Click Run to carry out the ratio analysis.
A window with MA plots appears (shown in Figure 7.34).
Figure 7.34: MA plots of raw data
Figure 7.34 shows MA plots for arrays 1 and 2. The red curve in each plot is a smoothing spline fit to the data of that array. A large discrepancy between the spline and the horizontal line at zero indicates a significant dye effect within that array.
Scroll up and down on the MA Plots window to view MA plots for other arrays. All of the MA plots show significant deviation from the zero horizontal line, indicating significant dye effects and necessitating data normalization before further analysis.
Loess Normalization within Arrays
Click the Ratio Analysis dialog to reactivate this window.
Click Analysis.
Click the checkbox to select the Perform Loess Normalization option.
Click Options.
Type drosophilaaging_loess1 in the Output Data Set field.
Click Run to rerun the Ratio Analysis process.
The MA plots that are generated by this process, illustrated in Figure 7.35, are now constructed from data that have been loess normalized within each array (Dudoit, Yang et al. 2002).
Figure 7.35: MA plots of Loess normalized data.
Compare the plots illustrated in Figure 7.35 with those in Figure 7.34. After within-array Loess normalization, the smoothing spline lies much closer to the zero horizontal line in each MA plot.
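The idea behind within-array loess normalization can be sketched in a few lines of Python: fit a smooth curve to M as a function of A, then subtract the fitted curve from M. The sketch below is a simplified local linear regression with tricube weights and no robustness iterations. It is an assumption-laden illustration, not the algorithm that the Ratio Analysis process actually runs; the function name loess_fit, the span value, and the data are hypothetical.

```python
def loess_fit(x, y, span=0.5):
    """Local linear fit with tricube weights, evaluated at each x[i]."""
    n = len(x)
    k = min(n, max(3, int(round(span * n))))  # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: distance to k-th nearest point
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))  # weighted least-squares line at x0
    return fitted

# Hypothetical array whose log ratio M depends only on intensity A (pure dye bias).
a = [6.0 + 0.5 * i for i in range(12)]
m = [0.4 * ai - 3.0 for ai in a]
m_norm = [mi - fi for mi, fi in zip(m, loess_fit(a, m))]
```

Because the simulated bias here is exactly linear in A, the normalized log ratios come out at zero up to rounding error. With real data, the residual scatter around zero is what remains for downstream analysis.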
Note: The within-array Loess normalization performed by the Ratio Analysis process is different from the across-array normalization performed by the Loess Normalization process, which is described later in this chapter.
Comparison of Different Methods for Data Normalization
JMP Genomics provides several methods for normalizing your data set. Deciding which process to use is best done on a case-by-case basis. This example considers two of these methods: median standardization and Loess normalization. Both procedures use the within-array loess normalized data set, drosophilaaging_loess1.sas7bdat, created in the preceding section, as the input data set.
Median Standardization of Data across Arrays
The first example method for normalizing the data across arrays is median standardization.
Select Genomics > Normalization > Data Standardize, as shown in Figure 7.36.
Figure 7.36: Selecting the Data Standardize process
The Data Standardize dialog, shown in Figure 7.37, opens.
Figure 7.37: The Data Standardize dialog
Make sure that the General tab is selected.
To choose the Loess normalized data set generated previously as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1.sas7bdat file and click Open.
To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
To specify the method for standardization, complete the following step.
Click the downward arrow in the Standardization Method box and select MEDIAN.
This type of standardization centers the median of each channel to zero.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.38.
Figure 7.38: The completed General tab of the Data Standardize dialog
Click Run to standardize the data.
The standardized SAS data set, drosophilaaging_loess1_med.sas7bdat, generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 7.39).
Figure 7.39: The SAS Message dialog
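The effect of the MEDIAN standardization method selected in this section can be illustrated with a short Python sketch: each channel's own median is subtracted from its values, so that every channel ends up with a median of zero. The channel names and values below are hypothetical, and the sketch is not the code that the Data Standardize process generates.

```python
from statistics import median

def median_center(channels):
    """Subtract each channel's own median from its values."""
    centered = {}
    for name, values in channels.items():
        med = median(values)
        centered[name] = [v - med for v in values]
    return centered

# Hypothetical log intensities for two channels of one array.
channels = {
    "array1_Cy3": [7.0, 8.5, 9.0, 10.5, 12.0],
    "array1_Cy5": [6.0, 7.0, 8.0, 9.0, 11.0],
}
centered = median_center(channels)
```

Median centering shifts each distribution without changing its shape or spread, which is why the channels can still differ in width after this step.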
Rerun the Distribution Analysis process, using the median adjusted data set we just generated as the Input Data Set.
Select Genomics > Quality Control > Distribution Analysis.
Specify drosophilaaging_loess1_med.sas7bdat and drosophilaaging_exp.sas7bdat as the input data set and EDDS, respectively.
Specify the ProcessResults folder as the output folder.
The Data Distribution dialog should appear as shown in Figure 7.40.
Figure 7.40: The completed General (top) and Experimental Design (bottom) tabs of the Data Distribution dialog
Do not make any changes to the Options tab.
Click Run to generate the distributions.
As before, several results windows open. Compare the overlaid kernel density plot for the normalized and standardized data set, shown in Figure 7.41, with the overlaid kernel density plot for the raw data set, shown in Figure 7.26. Note that the marginal univariate distributions of each channel are now much more consistent than before.
Figure 7.41: The overlaid kernel density plot for the normalized, standardized data set
Loess Normalization Across Arrays
The alternative method for standardizing the data across arrays is Loess normalization.
Select Genomics > Normalization > Loess Model Normalization, as shown in Figure 7.42.
Figure 7.42: Selecting the Loess Normalization process
The Loess Model Normalization dialog opens.
Make sure that the General tab is selected.
To choose the Loess normalized data set, generated previously, as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1.sas7bdat file and click Open.
To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.43.
Figure 7.43: The completed General tab of the Loess Model Normalization dialog
Click Options.
Type drosophilaaging_loess1_loess2 in the Output Data Set field.
Make no other changes to the Options tab.
Note that without specifying a Baseline variable, the process automatically uses the mean across all channels and all arrays as the common baseline.
Click Run to generate the Loess normalized data set.
A Loess Normalization results window appears with scatter plots before and after normalization on the left and right panels, respectively, as illustrated in Figure 7.44.
Figure 7.44: Scatterplots of individual array data before (left) and after (right) normalization
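The scatter plots in Figure 7.44 compare each channel with a common baseline. As a rough illustration of those coordinates, the Python sketch below computes a per-spot baseline as the mean across all channels and then each channel's deviation from it. Treating the baseline as a per-spot mean is an assumption based on the default described for this process; the channel names and values are hypothetical, and the loess fit that removes the trend in the deviations is deliberately omitted.

```python
def baseline_deviations(channels):
    """Return the per-spot baseline and each channel's deviation from it."""
    names = list(channels)
    n_spots = len(channels[names[0]])
    # Baseline: mean of all channels at each spot (x-coordinate of the plots).
    baseline = [sum(channels[c][i] for c in names) / len(names)
                for i in range(n_spots)]
    # Deviation: channel value minus baseline (y-coordinate of the plots).
    deviations = {c: [channels[c][i] - baseline[i] for i in range(n_spots)]
                  for c in names}
    return baseline, deviations

# Hypothetical within-array normalized values for two channels and three spots.
channels = {"chan_a": [1.0, 2.0, 3.0], "chan_b": [3.0, 4.0, 5.0]}
baseline, deviations = baseline_deviations(channels)
```

Across-array normalization then aims to remove any systematic dependence of these deviations on the baseline, so that the smoothed curve in the right-hand panels lies along zero.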
All the scatter plots have a common baseline as the x-coordinate. The y-coordinates in the left graphs are computed as the within-array normalized data minus the corresponding baseline, whereas the y-coordinates on the right are computed as the across-array normalized data minus the corresponding baseline. The red horizontal line, seen in all four plots, is a smoothing spline curve fit to the data nonparametrically.
Rerun the Distribution Analysis process, using the Loess-normalized data set we just generated as the Input Data Set. This data set and the EDDS are found in the ProcessResults folder.
Select Genomics > Quality Control > Distribution Analysis.
Specify drosophilaaging_loess1_loess2.sas7bdat and drosophilaaging_exp.sas7bdat as the input data set and EDDS, respectively.
Specify the ProcessResults folder as the output folder.
Click the Options tab.
Change the number of grid points from 100 to 40.
Reducing the number of grid points smooths out the resulting curves, but does not otherwise change the distributions.
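To show where the number of grid points enters a kernel density estimate, here is a minimal Gaussian kernel density sketch in Python. It is not the estimator that JMP Genomics uses: the Gaussian kernel, the rule-of-thumb bandwidth, the padding of three bandwidths beyond the data range, and the sample values are all assumptions made for illustration.

```python
import math

def kde_on_grid(values, n_grid=100, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate on an evenly spaced grid."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    # Rule-of-thumb bandwidth unless one is supplied (an assumption of this sketch).
    h = bandwidth if bandwidth is not None else 1.06 * sd * n ** (-0.2)
    lo, hi = min(values) - 3 * h, max(values) + 3 * h
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    density = [norm * sum(math.exp(-0.5 * ((g - v) / h) ** 2) for v in values)
               for g in grid]
    return grid, density

# Hypothetical log intensities for one channel.
values = [6.1, 7.4, 7.9, 8.3, 8.8, 9.0, 9.6, 10.2, 11.5, 12.3]
grid, density = kde_on_grid(values, n_grid=40)
step = grid[1] - grid[0]
area = sum(density) * step  # should be close to 1 for a density
```

In a sketch like this one, the grid determines only the points at which the curve is evaluated and drawn, while the bandwidth governs how much the underlying estimate is smoothed. That is consistent with the statement above that changing the grid does not otherwise change the distributions.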
Click Run to generate the distributions.
As before, several results windows open. The overlaid kernel density plot for the within-array and across-array Loess-normalized data set is shown in Figure 7.45.
Figure 7.45: The overlaid kernel density plot for the within-array, across-array Loess-normalized data set
Compare the overlaid kernel density plot for the within-array and across-array Loess-normalized data set, shown in Figure 7.45, with the previous overlaid kernel density plots, shown in Figures 7.26 and 7.41. Note that the curves show even greater consistency than seen previously. The drosophilaaging_loess1_loess2.sas7bdat data set is used in subsequent analyses.
Evaluation of Normalized Data Quality
Correlation and Principal Components
Select Genomics > Quality Control > Correlation and Principal Components, as shown in Figure 7.46.
Figure 7.46: Selecting the Correlation and Principal Components process
The Correlation and Principal Components dialog appears, as shown in Figure 7.47.
Figure 7.47: The Correlation and Principal Components dialog
Make sure that the General tab is selected.
To choose the Loess normalized data set, generated previously, as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.
To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.48.
Figure 7.48: The completed Correlation and Principal Components dialog
Make no changes to the Analysis, Variance Components or Options tabs.
Click Run.
Running this process produces several windows. These windows are linked together, so selecting a point on one graph or table highlights the corresponding point on all of the graphs or tables. Refer to the JMP User’s Guide for more details on this feature. The 3-D principal components scatterplot matrix is shown in Figure 7.49.
Figure 7.49: The 3-D principal components scatterplot matrix
Examine the scatterplots shown in Figure 7.49. The points aggregate into two groups, the identity of which is not yet known. To investigate which experimental factor is driving the segregation of the data into the two groups, change the colors of the points using the Rows > Color or Mark by Column command in the main JMP menu by completing the following steps.
Make sure the principal components scatterplot matrix window is active.
Select Rows > Color or Mark by Column, as illustrated in Figure 7.50.
Figure 7.50: Selecting the Rows > Color or Mark by Column process
The JMP: Color by Mark or Column dialog opens, as shown in Figure 7.51.
Figure 7.51: The JMP: Color by Mark or Column dialog
Select one of the columns by which to set the color.
Because we might expect the points to cluster by arrays or one of the treatments, select one of the columns describing the experimental factors (Line, Sex, Age, and Channel). Coloring the points by sex yields the principal components scatterplot matrix shown in Figure 7.52, indicating a near-perfect correlation between sex and the clustered points.
Figure 7.52: The principal components scatterplot matrix, colored by sex
A second plot produced by this procedure is the Correlation Heat Map, shown in Figure 7.53.
Figure 7.53: The correlation heat map
There are two large blocks apparent in the correlation heat map, corresponding to the same two groups in the principal components display. By studying the labels on the left hand side, it is apparent that the females are all clustered at the top, and the males are all clustered at the bottom. This clustering phenomenon has been made even more obvious because the variables have been differentially colored by sex. Even though the primary focus of this experiment was aging, the initial results show that sex-to-sex differences are much larger overall.
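The block structure described for the correlation heat map comes from pairwise correlations between channels. The Python sketch below computes a Pearson correlation matrix for four hypothetical channels, two labeled as female and two as male, and is meant only to show how two blocks can arise. The channel names and values are invented, and the sketch does not reproduce the computations of the Correlation and Principal Components process.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(channels):
    """Return channel names and the matrix of pairwise correlations."""
    names = list(channels)
    return names, [[pearson(channels[a], channels[b]) for b in names]
                   for a in names]

# Hypothetical normalized expression values for five spots in four channels.
channels = {
    "F_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "F_2": [1.2, 1.9, 3.1, 4.2, 4.8],
    "M_1": [5.0, 3.5, 3.0, 1.0, 2.0],
    "M_2": [4.8, 3.9, 2.7, 1.3, 1.8],
}
names, corr = correlation_matrix(channels)
```

In this toy example, correlations within each sex are high and correlations between the sexes are negative, so ordering the channels by sex produces two visible blocks along the diagonal.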
Correlation and Grouped Scatterplots
The Correlation and Grouped ScatterPlots process computes correlations and scatterplot matrices for expression measurements across groups of arrays. This process also merges annotations for each gene with the measurements to quickly provide information on genes of interest.
Select Genomics > Quality Control > Correlation and Grouped ScatterPlots, as shown in Figure 7.54.
Figure 7.54: Selecting the Correlation and Grouped ScatterPlots process
The Correlation and Group Scatterplots dialog opens, as shown in Figure 7.55.
Figure 7.55: The Correlation and Group Scatterplots dialog
Make sure that the General tab is selected.
To choose the Loess normalized data set, generated previously, as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.
A list of the ColumnLabels from the input data set appears in the Available Variables field. This list is used to choose the variable(s) by which to merge the annotation data. Since we are interested in coordinating expression data with information about the gene represented in each spot on the arrays,
Select Spot from the list of available variables.
Click to add Spot to the Variable By Which to Merge Annotation Data box, as shown in Figure 7.56.
Figure 7.56: Selecting the Variable By Which to Merge Annotation Data from the Input Data Set
To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
A list of the ColumnLabels from the EDDS appears in the Available Variables field. This list is used to choose the variable(s) to plot against each other. Specifying Line, Sex, and Age lets you check for repeatability within each of the treatment groups. Since we are interested in investigating the effects of these experimental conditions on the expression data,
Select Line, Sex, and Age from the list of available variables.
Click to add Line, Sex, and Age to the Variables Defining Groups box, as shown in Figure 7.57.
Figure 7.57: Selecting the Variables Defining Groups from the EDDS
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.58.
Figure 7.58: The completed General tab of the Correlation and Group Scatterplots dialog
Click on the Annotation 1 tab to open the tab shown in Figure 7.59.
Figure 7.59: The Annotation tab of the Correlation and Group Scatterplots dialog
To choose the annotation data set for this experiment complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Scanalyze Drosophila.
Select the drosophilaaging_annotation.sas7bdat file and click Open.
Click Open to examine the annotation data set.
The data set appears as shown in Figure 7.60.
Figure 7.60: The Drosophila Aging experiment annotation data set
This data set lists the gene identity and GenBank accession number for each spot on the arrays and provides a short description of the function (where known) of the gene and its product. A list of the ColumnLabels from the annotation data set appears in the Available Variables field.
Select Spot from the list of available variables.
Click to add Spot to the Annotation Merge Variables box, as shown in Figure 7.61.
Select Accession from the list of available variables.
Click to add Accession to the GenBank Accession Variable box, as shown in Figure 7.61.
Select ShortDescription from the list of available variables.
Click to add ShortDescription to the Annotation Label Variable box, as shown in Figure 7.61.
Figure 7.61: The selected annotation variables
Click Annotation 2.
Select Accession from the list of available variables.
Click to add Accession to the GenBank Accession Variable box, as shown in Figure 7.62.
Figure 7.62: The selected annotation variables
Click Run to produce a Scatterplot Matrix for each of these groups, as partially shown in Figure 7.63.
Figure 7.63: Scatterplot matrix
Most of the arrays fall into an elliptical space along the diagonal axis. In general, a tighter ellipse means a higher correlation, and more circular ellipses indicate increased noise. The spots outside the ellipses may be outlier genes of interest in terms of quality control or an inconsistent signal across replicates. Clicking on one of the genes highlights it across all arrays. In one example, CG10992 falls outside the ellipse across many of the arrays. To obtain more detailed annotation about this gene, select the spot and click on GenBank-Nucleotide, on the top of the Correlation ScatterPlots window, to bring up the GenBank web page for CG10992, as shown in Figure 7.64.
Figure 7.64: A portion of the GenBank web page for spot CG10992
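One way to think about spots that fall outside an ellipse in a replicate-versus-replicate scatterplot is in terms of Mahalanobis distance from the center of the point cloud. The Python sketch below flags points whose squared distance exceeds the 95% quantile of a chi-square distribution with two degrees of freedom. Whether the ellipses drawn by JMP correspond to this particular coverage level is an assumption, and the data values are hypothetical.

```python
def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of each (x, y) point from the bivariate mean."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2            # determinant of the 2x2 covariance matrix
    d2 = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

CHI2_95_2DF = 5.991  # 95% quantile of chi-square with 2 degrees of freedom

# Hypothetical expression values for ten spots on two replicate arrays.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 9.0, 7.2, 7.9, 9.1, 9.8]
d2 = mahalanobis_sq(x, y)
flagged = [i for i, d in enumerate(d2) if d > CHI2_95_2DF]
```

Here only the sixth spot (index 5), which disagrees strongly between the two replicates, is flagged. A flag of this kind is a prompt to inspect the spot, not a decision to discard it.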
You can also mouse over other spots in the scatterplot matrix to see their labels, and you can drag a rectangle around spots to select them in the associated JMP table. When there is a considerable number of outliers, users often check the raw image files for abnormalities at those spots. You may also apply the Pseudo Image and Surface Summary processes under the Quality Control submenu to check the raw image. This can help you decide whether to keep the outliers or filter them out of the analysis. To filter a set of spots, select them in the corresponding JMP table in any of the following ways:
• Click and drag a rectangle in a scatterplot matrix window.
• Use the lasso tool.
• Hold the Shift key and click spots one by one.
• Click Rows > Row Selection > Select Where to define a filtering rule.
To delete the selected rows, complete the following steps.
Select Rows > Row Selection > Invert Row Selection. This command inverts the selection so that the rows you want to keep are selected.
Select Tables > Subset to create a subset table with the desired rows.
Select File > Save As SAS Data Set to save the subset table as a .sas7bdat file.
The new data set can now be used as input for further processes. Refer to the JMP User’s Guide for additional information and directions on row selection and creating subset data sets.
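The logic of the select, invert, and subset steps above can be mirrored in a few lines of Python. This is only a conceptual sketch of what the menu commands accomplish, not JMP scripting code; the row values and the spot names other than CG10992 are hypothetical.

```python
def drop_selected(rows, selected):
    """Keep the rows whose index is NOT in the selected set."""
    keep = [i for i in range(len(rows)) if i not in selected]  # inverted selection
    return [rows[i] for i in keep]                               # subset table

# Hypothetical table with one outlier spot selected (row index 0).
rows = [
    {"Spot": "CG10992", "value": 4.2},
    {"Spot": "spotA", "value": 0.1},
    {"Spot": "spotB", "value": -0.3},
]
subset = drop_selected(rows, selected={0})
```

The original table is left unchanged, which matches the behavior of creating a subset table and saving it under a new name.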
Primary Data Analysis for Determining Significant Differences in Gene Expression
ANOVA
After you perform quality control and normalization on your microarray data set, you can use Analysis of Variance (ANOVA), a popular method for inferring differentially expressed genes. The ANOVA process in JMP Genomics is quite flexible and enables you to specify multi-factor models that might also include random effects. Input data should be normalized, either by the Data Standardize and Loess Normalization processes, as described previously, or by other methods available in the Normalization menu, prior to carrying out the ANOVA process. This example uses the drosophilaaging_loess1_loess2.sas7bdat input data set, described previously.
Select Genomics > Row-by-Row Modeling > ANOVA, as shown in Figure 7.65.
Figure 7.65: Selecting the ANOVA process
The ANOVA dialog opens, as shown in Figure 7.66.
Figure 7.66: The ANOVA dialog
The ANOVA process contains eight tabs: General, Annotation 1, Annotation 2, Model, LSMeans, Multiple Testing, Residuals, and Options.
Make sure that the General tab is selected.
To choose the Loess-normalized data set as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.
A list of the variables from the input data set appears in the Available Variables field. In this example, Spot serves as the gene identifier and is selected as the By Variable to generate a separate model fit for each gene. Typically, the By Variable should be a specific identifier. If an annotation file is included on the Annotation tab, a gene identifier must be listed in the Variable by which to Merge Annotation Data field to link the input file and the annotation file. In this example, choose Spot as this Link Key. The data set contains no chromosome or position information.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the dialog should appear like the one shown in Figure 7.67.
Figure 7.67: The completed General tab of the ANOVA dialog
Click Annotation 1.
To choose the annotation data set for this experiment, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Scanalyze Drosophila.
Select the drosophilaaging_annotation.sas7bdat file and click Open.
A list of the ColumnLabels from the annotation data set appears in the Available Variables field.
Select Spot from the list of available variables.
Click to add Spot to the Annotation Merge Variables box, as shown in Figure 7.68.
Select ShortDescription from the list of available variables.
Click to add ShortDescription to the Annotation Label Variable box, as shown in Figure 7.68.
Click Annotation 2.
Select Accession from the list of available variables.
Click to add Accession to the GenBank Accession Variable box, as shown in Figure 7.68.
Select Drosophila melanogaster from the Organism drop-down menu.
The completed Annotation tabs of the dialog should appear like those shown in Figure 7.68.
Figure 7.68: The completed Annotation 1 (top) and Annotation 2 (bottom) tabs of the ANOVA dialog
Click the Model tab.
To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
A list of the ColumnLabels from the EDDS appears in the Available Variables field. This list is used to choose the Class Variables. Any variable that is selected in Class Variables is used as a non-numerical value or nominal (group) type of numerical value in the model. The order of the variables in the Class Variables determines the way LSMeans and/or interaction effects are sorted.
Select Array, Channel, Line, Sex, and Age from the list of available variables.
Click to add Array, Channel, Line, Sex, and Age to the Class Variables box.
Fixed Effects allow users to set the one-, two-, or multiple-way ANOVA to model the mean of the response variable. Entries in this field are delimited by spaces. Each entry can be either a main effect, such as Line or Sex in this example, or an interaction between effects, such as Line*Sex.
Type Line Sex Age Channel Line*Sex Line*Age Sex*Age Line*Sex*Age in the Fixed Effects field.
LSMeans Effects are used to construct differences and least-squares means profiles. Any effects listed here must also be listed in the Fixed Effects field and all the variables comprising them must be listed as Class Variables.
Type Line*Sex*Age in the LSMeans Effects field.
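The two rules stated above for LSMeans Effects (each effect must also be a fixed effect, and every variable in it must be a class variable) can be checked mechanically. The Python sketch below parses the space-delimited fields used in this example and reports any violations. The function name check_model and the message wording are hypothetical; JMP Genomics performs its own validation, which this sketch does not attempt to reproduce.

```python
def check_model(class_vars, fixed_field, lsmeans_field):
    """Return a list of problems with the LSMeans Effects specification."""
    # Each effect is a set of variable names, so Line*Sex matches Sex*Line.
    fixed = {frozenset(term.split("*")) for term in fixed_field.split()}
    problems = []
    for term in lsmeans_field.split():
        variables = term.split("*")
        if frozenset(variables) not in fixed:
            problems.append(term + " is not listed as a fixed effect")
        for var in variables:
            if var not in class_vars:
                problems.append(var + " is not a class variable")
    return problems

class_vars = ["Array", "Channel", "Line", "Sex", "Age"]
fixed = "Line Sex Age Channel Line*Sex Line*Age Sex*Age Line*Sex*Age"
ok = check_model(class_vars, fixed, "Line*Sex*Age")
bad = check_model(class_vars, fixed, "Line*Channel")
```

For the settings in this example the first call returns an empty list, while the second reports that Line*Channel is not among the fixed effects.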
Estimate statements are arbitrarily complex hypothesis tests of the relative importance of different combinations of different fixed effects on gene expression. They can be constructed using the Estimate Builder AP. Refer to the JMP Genomics User Guide – Supplement for more details on this process.
Leave the File Containing Estimate Statements field blank.
Random Effects are used to construct the covariance structure of the response variable. Random effects are typically composed of class variables and their interactions, but cannot include effects already specified in the Fixed Effects field. In this example, as for most two-color microarray data, Array should be considered a random effect, since the arrays used in the experiment are randomly selected and are not reusable.
Type Array in the Random Effects field.
The completed Model tab of the dialog should appear like the one shown in Figure 7.69.
Figure 7.69: The completed Model tab of the ANOVA dialog
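Written out for a single spot, the fixed and random effects entered on the Model tab correspond to a mixed model of roughly the following form. The subscript notation and the normality assumptions on the random terms are supplied here for illustration, as is conventional for mixed models; they are not taken from the dialog itself.

```latex
\[
\begin{aligned}
y_{ijklm} ={}& \mu + \text{Line}_i + \text{Sex}_j + \text{Age}_k + \text{Channel}_l \\
 &+ (\text{Line}\times\text{Sex})_{ij} + (\text{Line}\times\text{Age})_{ik}
  + (\text{Sex}\times\text{Age})_{jk} + (\text{Line}\times\text{Sex}\times\text{Age})_{ijk} \\
 &+ \text{Array}_m + \varepsilon_{ijklm}, \\
\text{Array}_m \sim{}& N(0, \sigma^2_{\text{Array}}), \qquad
\varepsilon_{ijklm} \sim N(0, \sigma^2)
\end{aligned}
\]
```

Here y is the normalized expression value for the spot, the first two lines of terms are the fixed effects typed into the Fixed Effects field, and the Array term is the random effect. Because Spot is the By Variable, a model of this form is fit separately for each spot.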
Click the LSMeans tab.
On the LSMeans tab, you can choose different preferences for the LSMeans Difference Set for Volcano Plots.
All Pairwise Differences lists all the possible combinations for all the experimental conditions in the volcano plots and significant gene list.
Differences with a Control compares each condition to the control only; you define the control in the LSMeans Control Values.
None results in having no LSMean differences listed.
Make sure that the All Pairwise Differences option is selected. The LSMeans Standardization Method may be selected from among the 17 choices in the drop-down menu.
Select STD as the LSMeans Standardization Method.
The completed LSMeans tab of the dialog should appear like the one shown in Figure 7.70.
Figure 7.70: The completed LSMeans tab of the ANOVA dialog
Click the Multiple Testing tab.
On the Multiple Testing tab, you can define the –log10(p-value) cutoff value and choose one of nine multiple-testing correction methods.
Make sure that Bonferroni is selected from the drop-down menu as the Multiple Testing Method.
Make sure that the Alpha value is set to the default value of 0.05.
The completed Multiple Testing tab of the dialog should appear like the one shown in Figure 7.71.
Figure 7.71: The completed Multiple Testing tab of the ANOVA dialog
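The relationship between Alpha, the number of tests, and the –log10(p-value) cutoff under the Bonferroni method can be sketched in Python. The count of 100 tests is a hypothetical value chosen for illustration; how JMP Genomics counts the tests for a given analysis is not described here.

```python
import math


def bonferroni_cutoff(alpha, n_tests):
    """Return the -log10(p-value) cutoff for a Bonferroni correction."""
    per_test_alpha = alpha / n_tests  # each test must pass this stricter level
    return -math.log10(per_test_alpha)


# Alpha = 0.05 as in the step above; 100 tests is an assumed number.
cutoff = bonferroni_cutoff(0.05, 100)
print(round(cutoff, 3))
```

A p-value is declared significant under this scheme when its –log10 value exceeds the cutoff, which is why the cutoff rises as the number of tests grows.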
Click on the Residuals tab.
On the Residuals tab, you can define several parameters describing how to handle the residuals from the ANOVA model fits. Residuals are statistics useful for quality control and assessment of goodness-of-fit. Selecting a Filtration Method for Data with Large Residuals allows you to set up rules to filter outliers which are statistically far from fitting the model (Chu, Weir et al. 2002).
Make sure the Plot Standardized Residuals checkbox is checked.
The completed Residuals tab of the dialog should appear like the one shown in Figure 7.72.
Figure 7.72: The completed Residuals tab of the ANOVA dialog
Click the Options tab.
On the Options tab, you can specify additional output options including Uniformly Scale Y-Axis in Volcano Plots or even Activate Spoken Description.
Make sure the options are all deselected.
Output model file names may be specified. Otherwise, the program assigns output file names for you based on the name of the Input Data Set.
Do not specify names for the output files.
The completed Options tab of the dialog should appear like the one shown in Figure 7.73.
Figure 7.73: The completed Options tab of the ANOVA dialog
Click Run to run the ANOVA.
Running the ANOVA process produces several windows, including Volcano Plots, Parallel and PCA Plots, Clustering, Variability Estimates, Action Buttons, and a Significant Differences table. The Variability Estimates window, shown in Figure 7.74, displays estimates of sources of variability for each gene and provides a final QC check to ascertain how well the models fit the data.
Figure 7.74: The estimates of variability
Here, the PropVar_Array distribution quantifies the proportion of array-to-array (equivalent to spot-to-spot) variability, and the PropVar_Residual distribution quantifies the proportion of within-spot and unexplained variability. The RSquared distribution displays the proportion of variability explained by the model for each gene. The Volcano Plots window, shown in Figure 7.75, indicates the genes that show significant differential expression.
Figure 7.75: The Volcano plots
The genes above the red dotted line exceed a multiple testing cutoff for significant differential expression. The red dotted line is computed in this case according to the Bonferroni criterion. The Hierarchical Clustering of LSmeans display, shown in Figure 7.76, clusters those genes that are significantly differentially expressed. Zooming in on clusters and comparing them with known biological groups in the annotation table can help interpret the results.
Figure 7.76: Hierarchical clustering of LSmeans
The Action Buttons window, shown in Figure 7.77, offers easy access to multiple biological interpretation websites where you can search for the selected genes. It also provides an opportunity to change the cutoff value without resetting the model.
Figure 7.77: The Action Buttons window
Mixed Model Analysis
While the aforementioned ANOVA process is fairly flexible, even more complex mixed models are available by using the Mixed Model Analysis process. For this process you must be familiar with SAS Proc Mixed syntax. Refer to the SAS 9.1.3 User’s Guide for additional information.
Select Genomics > Row-by-Row Modeling > Mixed Model Analysis, as shown in Figure 7.78.
Figure 7.78: Selecting the Mixed Model Analysis process
The Mixed Model Analysis dialog opens, as shown in Figure 7.79.
Figure 7.79: The Mixed Model Analysis dialog
Make sure that the General tab is selected.
To choose the Loess normalized data set, generated previously, as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.
A list of the variables from the input data set appears in the Available Variables field. In this example, Spot serves as the gene identifier and is selected as the By Variable to generate a separate model fit for each gene. Typically, the By Variable should be a specific identifier. If an annotation file is included on the Annotation tabs, a gene identifier must be listed in the Variable to Keep in Output or By which to Merge Annotation Data field to link the input file and the annotation file. In this example, choose Spot as this link key. The data set contains no chromosome or position information.
Select Spot from the list of available variables.
Click to add Spot to the By Variables box.
Select Spot from the list of available variables.
Click to add Spot to the Variable to Keep in Output or By which to Merge Annotation Data box.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The completed General tab of the dialog should appear like the one shown in Figure 7.80.
Figure 7.80: The completed General tab of the Mixed Model Analysis dialog
Click Annotation 1.
To choose the annotation data set for this experiment, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Scanalyze Drosophila.
Select the drosophilaaging_annotation.sas7bdat file and click Open.
A list of the ColumnLabels from the annotation data set appears in the Available Variables field.
Select Spot from the list of available variables.
Click to add Spot to the Annotation Merge Variables box, as shown in Figure 7.68.
Select ShortDescription from the list of available variables.
Click to add ShortDescription to the Annotation Label Variable box, as shown in Figure 7.68.
Click Annotation 2.
Select Accession from the list of available variables.
Click to add Accession to the GenBank Accession Variable box, as shown in Figure 7.68.
Select Drosophila melanogaster from the Organism drop-down menu.
The completed Annotation tabs of the dialog should appear like those shown in Figure 7.81.
Figure 7.81: The completed Annotation 1 (top) and Annotation 2 (bottom) tabs of the Mixed Model dialog
Click the Model tab. To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
A list of the ColumnLabels from the EDDS appears in the Available Variables field.
To ensure that all of the data from any one row is used in the model, leave the Design-Level by Variables field blank.
The Proc Mixed Statements box contains the primary SAS code. The SAS syntax needed to run the model is illustrated in Figure 7.82.
Figure 7.82: The Proc Mixed Statements box and associated SAS syntax
The SAS code can be divided into a number of distinct statements.
o The CLASS statement specifies all variables whose levels form distinct categories in the model.
o The MODEL statement specifies the dependent variable (always set this to RESPONSE) and the fixed effects.
o The RANDOM statement specifies Array as a random effect, which models spot-to-spot (whole plot) variability for this example.
o The LSMEANS statement requests means for the full three-way interaction. Although not shown here, you can specify the DIFF option in the LSMEANS statement to automatically obtain a set of pairwise differences.
o The ESTIMATE statements specify custom hypothesis tests. Each ESTIMATE statement generates one volcano plot.
Refer to the SAS/Stat Proc Mixed documentation in the SAS 9.1.3 User’s Guide for further information on these and other statements you can use.
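As a rough guide to what these statements fit, the per-gene model can be written out explicitly. This is a sketch under the assumption that the Mixed Model example uses the same fixed and random effects as the ANOVA example above; the actual statements appear only in Figure 7.82. The subscripts and the symbols for the variance components are notation introduced here for illustration, and the normality assumptions are the usual ones for a linear mixed model.

```latex
\[
y_{ijklm} = \mu + L_i + S_j + A_k + C_l
          + (LS)_{ij} + (LA)_{ik} + (SA)_{jk} + (LSA)_{ijk}
          + a_m + \varepsilon_{ijklm},
\qquad
a_m \sim N\!\left(0, \sigma^2_{\mathrm{Array}}\right), \quad
\varepsilon_{ijklm} \sim N\!\left(0, \sigma^2_{e}\right)
\]
```

Here L, S, A, and C denote the fixed effects Line, Sex, Age, and Channel, the parenthesized terms are their interactions, and a_m is the random Array effect specified in the RANDOM statement.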
Specify the same parameters in the Multiple Testing, Residuals, and Options tabs as previously done for the ANOVA process, and illustrated in Figures 7.71, 7.72, and 7.73, respectively.
Click Run to run the Mixed Model process.
The output from the Mixed Model Analysis process (not shown) contains the same displays as the ANOVA process.
Further Analysis
JMP Genomics provides additional procedures for analyzing microarray data. However, while the preceding analyses have all used a tall form of the Drosophila data along with an accompanying Experimental Design Data Set, these additional processes require that the data set be in the wide form. We begin by showing you how to combine these two data sets into one wide data set that can be used for other processes like those found in the Pattern Discovery and Predictive Modeling folders. Refer to Chapter 4 for additional information.
Note: SAS data sets are tuned to work best with a large number of rows rather than a large number of columns. For wide data sets with tens of thousands of columns, execution times may be long. One way to address this issue is to work with the tall data set whenever possible. When use of tall data sets is not possible, reducing the number of genes under consideration can help reduce the execution times. For example, use Data Set Utilities > Statistics for Rows to filter genes that have low overall variance; that is, genes that have a flat profile across the whole experiment. For a more rigorous statistical criterion, you can use either the ANOVA or the Mixed Model Analysis process to select only those genes that have a significant difference somewhere in the experiment. However, keep in mind that such pre-filtering will bias cross-validation rates computed from any of the Predictive Modeling processes. Alternatively, you can use K-Means Clustering on the tall data set to select a representative set of genes that does not depend directly on experimental design variables.
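The variance-based pre-filtering mentioned in the note can be illustrated outside JMP Genomics with a minimal Python sketch. The gene names, values, and the min_variance cutoff are hypothetical, and the sketch is not the Statistics for Rows implementation; it only shows the idea of dropping genes with a flat profile.

```python
from statistics import variance


def filter_flat_genes(expression, min_variance):
    """Keep only genes whose sample variance across arrays reaches min_variance."""
    return {
        gene: values
        for gene, values in expression.items()
        if variance(values) >= min_variance
    }


# Hypothetical expression values: one list of measurements per gene.
expression = {
    "Spot_4": [7.0, 7.1, 6.9, 7.0],   # flat profile
    "Spot_5": [5.0, 9.0, 6.0, 10.0],  # varies across the experiment
}

kept = filter_flat_genes(expression, min_variance=0.5)
print(list(kept))
```

Because this filter ignores the experimental design, it does not use group labels; filtering on ANOVA significance does, which is the source of the cross-validation bias noted above.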
Transpose Tall and Wide
Recall from Chapter 4 that a tall data set and its accompanying EDDS can be transformed into a wide data set using the Transpose Tall and Wide command.
Select Genomics > Data Set Utilities > Transpose Tall and Wide, as shown in Figure 7.83.
Figure 7.83: Selecting the Transpose Tall and Wide process
The Transpose Tall and Wide dialog opens, as shown in Figure 7.84.
Figure 7.84: The Transpose Tall and Wide dialog
Make sure the Tall -> Wide tab is selected.
To choose the Loess normalized data set as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open. A list of the ColumnLabels from the input data set appears in the Available Variables field. In this example, the values contained in the Spot column serve as the column names in the wide data set.
Select Spot from the list of available variables.
Click to add Spot to the Variables Defining Wide Column Names box.
To enter a prefix for the wide column names, type Spot_ into the Prefix for Wide Column Names box.
To choose the EDDS, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_exp.sas7bdat file and click Open.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click on Select to select this folder.
The completed Transpose Tall and Wide dialog should appear like the one shown in Figure 7.85.
Figure 7.85: The completed Transpose Tall and Wide dialog
Click Run to transpose the data set.
The transposed SAS data set, drosophilaaging_loess1_loess2_wide.sas7bdat, generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 7.86).
Figure 7.86: The SAS Message dialog
Open the data set and note how the experimental design data has been combined with the expression data, with individual genes now forming columns in the wide data set. This data set is now ready to serve as input for further analyses.
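The effect of the transposition can be sketched in a few lines of Python. This is an illustration only, assuming a tall layout with one row per spot and one column per array channel, and an EDDS whose ColumnName variable names those columns; the column names and values below are hypothetical, and the sketch is not the Transpose Tall and Wide implementation.

```python
# Hypothetical tall data: one row per spot, one column per array channel.
tall = [
    {"Spot": 4, "a1_cy3": 7.1, "a1_cy5": 6.8},
    {"Spot": 5, "a1_cy3": 9.4, "a1_cy5": 9.9},
]

# Hypothetical EDDS: one row per tall column, plus design variables.
edds = [
    {"ColumnName": "a1_cy3", "Array": 1, "Channel": "Cy3"},
    {"ColumnName": "a1_cy5", "Array": 1, "Channel": "Cy5"},
]


def tall_to_wide(tall, edds, id_var="Spot", prefix="Spot_"):
    """Build one wide row per EDDS row, with one prefixed column per spot."""
    wide = []
    for design_row in edds:
        column = design_row["ColumnName"]
        row = dict(design_row)  # start from the design variables
        for gene_row in tall:
            row[f"{prefix}{gene_row[id_var]}"] = gene_row[column]
        wide.append(row)
    return wide


wide = tall_to_wide(tall, edds)
for row in wide:
    print(row)
```

Each wide row carries the design variables for one array channel followed by one column per spot, which mirrors the role of the Spot_ prefix entered in the dialog.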
K-Means Clustering
K-Means clustering is a standard technique for partitioning data into a set number of similar groups. The K-Means Clustering process clusters the rows of the input data set, so depending on whether you want to cluster samples or genes, you might need to transpose your data as shown previously. This example clusters the genes, in the normalized, wide data set just transposed, that have similar expression profiles.
Select Genomics > Pattern Discovery > K-Means Clustering, as shown in Figure 7.87.
Figure 7.87: Selecting the K-Means Clustering process
The K-Means Clustering dialog opens, as shown in Figure 7.88.
Figure 7.88: The K-Means Clustering dialog
Make sure the General tab is selected.
To choose the wide Loess normalized data set, generated previously, as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_2_wide.sas7bdat file and click Open.
A list of the ColumnLabels from the input data set appears in the Available Variables field.
To choose the variable by which to label points in the plots, complete the following steps.
Select Array from the list of available variables.
Click to add Array to the Label Variable box.
To choose the variables whose observations are to be clustered (in this example, expression data), complete the following steps.
Select all of the numeric variables (Spot_4 through Spot_297) from the list of available variables.
Click to add these variables to the Variables Whose Rows are to be Clustered box.
Alternatively, you could specify variables Spot_4 through Spot_297 by typing Spot_ in the List-Style Specification of Variables Whose Rows are to be Clustered field.
Type 5 in the Number of Clusters box to cluster the genes into 5 groups.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The completed dialog should appear like the one shown in Figure 7.89.
Figure 7.89: The completed K-Means Clustering dialog
Click Run to generate two JMP tables and two graphics.
The drosophilaaging_loess1_2_wide_kmc table (partially shown in Figure 7.90) lists various statistics about the clusters that were generated.
Figure 7.90: The drosophilaaging_loess1_2_wide_kmc table
The drosophilaaging_loess_1_2_wide_kmd table (shown in Figure 7.91) shows various statistics about the arrays. Note the Cluster and Distance to Cluster Seed columns on the far right hand side of the table denoting which cluster each array fit in, and what the distance was to the cluster seed. The cluster seed is the mean of the cluster.
Figure 7.91: The drosophilaaging_loess_1_2_wide_kmd table
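The roles of the cluster seed and the Distance to Cluster Seed column can be illustrated with a minimal k-means sketch in Python. The data, the function name kmeans, and the evenly spaced choice of starting seeds are assumptions made for illustration; the K-Means Clustering process may select seeds and iterate differently.

```python
import math


def kmeans(rows, k, max_iter=100):
    """Cluster rows (lists of numbers) into k groups.

    Returns (clusters, distances, seeds), where clusters[i] is the cluster
    index of rows[i] and distances[i] is its distance to that cluster's seed.
    """
    # Assumption: start from k evenly spaced rows as the initial seeds.
    seeds = [list(rows[i * len(rows) // k]) for i in range(k)]
    for _ in range(max_iter):
        # Assign each row to its nearest seed (Euclidean distance).
        clusters = [
            min(range(k), key=lambda c: math.dist(row, seeds[c])) for row in rows
        ]
        # Move each seed to the mean of the rows assigned to it.
        new_seeds = []
        for c in range(k):
            members = [row for row, a in zip(rows, clusters) if a == c]
            if members:
                new_seeds.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_seeds.append(seeds[c])  # keep the seed of an empty cluster
        if new_seeds == seeds:  # converged: each seed is its cluster mean
            break
        seeds = new_seeds
    distances = [math.dist(row, seeds[a]) for row, a in zip(rows, clusters)]
    return clusters, distances, seeds


# Hypothetical rows, for example four arrays measured on two spots.
rows = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
clusters, distances, seeds = kmeans(rows, k=2)
print(clusters)
print(seeds)
```

On convergence each seed equals the mean of its cluster, so the returned distances correspond to the quantity reported in the Distance to Cluster Seed column.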
The parallel plots show the cluster profiles across all 100 genes.
Right-click on a parallel plot to change the color scheme or make a legend for a particular variable.
Right-click the first parallel plot and select Row Legend to bring up the list of variables.
Choose the Array variable to produce the plots and legends shown in Figure 7.92.
Figure 7.92: Parallel plots for the clusters
Hovering the cursor over a peak gives the array that generated it. (Additional label variables can be specified in the JMP table to add more data to the mouse-over pop-up boxes.) Note that the sharp dip in cluster 4 belongs to array 6, which showed up as an outlier in the principal components analysis described previously (see Figure 7.49).
Use Rows > Color or Mark by Column to color the profiles according to known variables. Try coloring the profiles by Sex to see if the five clusters segregate according to this variable.
In another window, the graphic that displays the cluster frequencies (shown in Figure 7.93) shows four bars instead of five. That is because clusters 2 and 4 have similar frequencies and are represented by a single bar.
Figure 7.93: Cluster frequencies
Because the graphic is linked to the underlying cluster table, highlighting the tallest bar also highlights the two clusters it represents, making this relationship relatively clear.
Distance Matrix
The Distance Matrix process computes various measures of distance or dissimilarity between the observations/rows of a data set.
Select Genomics > Pattern Discovery > Distance Matrix.
The Distance Matrix dialog opens, as shown in Figure 7.94.
Figure 7.94: The Distance Matrix dialog
Make sure the General tab is selected.
To choose the wide Loess normalized data set, generated previously, as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the drosophilaaging_loess1_2_wide.sas7bdat file and click Open.
A list of the ColumnLabels from the input data set appears in the Available Variables field. There are different categories of variables.
o Variables within which to compute differences
o ID variables serve to identify rows in the data set and are not a formal part of the clustering process.
o Copy variables are simply copied to the output data set.
o By variables instruct the process to perform clustering separately for each distinct combination of the By variable levels.
Refer to the DISTANCE Procedure documentation in the SAS 9.1.3 User’s Guide for additional information.
To choose the Variables within which to compute differences (in this example, expression data), complete the following steps.
Select all of the numeric variables (Spot_4 through Spot_297) from the list of available variables.
Click to add these variables to the Variables Within Which to Compute Differences field.
Alternatively, you could specify variables Spot_4 through Spot_297 by typing Spot_ in the List-Style Specification of Variables Within Which to Compute Differences field.
To choose names for the distance variables, complete the following steps.
Select Array, Sex, Age, and Line from the list of available variables.
Click to add these variables to the ID Variable box.
To specify the level of measurement used to compute the distance, select Interval from the drop-down menu.
To specify the method to be used to compute the distance, select DSQCORR from the drop-down menu.
To specify the Output Folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click on Select to select this folder.
The completed General tab of the Distance Matrix dialog should appear like the one shown in Figure 7.95.
Figure 7.95: The completed General tab of the Distance Matrix dialog.
Click the Options tab.
To specify the method for standardization,
Select STD from the drop-down menu.
The completed Options tab of the Distance Matrix dialog should appear like the one shown in Figure 7.96.
Figure 7.96: The completed Options tab
Click Run to generate the distance matrix.
A new data set (drosophilaaging_loess1_2_dm.sas7bdat) and a heat map are generated. The heat map is shown in Figure 7.97.
Figure 7.97: Heat map
This heat map shows that channels from the same array are closest in terms of the DSQCORR metric, and form the 2 × 2 blocks in the heat map.
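To make the DSQCORR metric more concrete, the following Python sketch computes one minus the squared Pearson correlation between two rows, which is how the SAS DISTANCE procedure documentation describes this method. The function names and data are hypothetical, and the STD standardization selected on the Options tab is omitted here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dsqcorr(x, y):
    """Dissimilarity defined as 1 - r**2, where r is the Pearson correlation."""
    return 1.0 - pearson(x, y) ** 2


# Hypothetical expression profiles for three rows (for example, array channels).
row_a = [1.0, 2.0, 3.0, 4.0]
row_b = [8.0, 6.0, 4.0, 2.0]   # perfectly anti-correlated with row_a
row_c = [1.0, -1.0, 1.0, -1.0]

print(dsqcorr(row_a, row_b))
print(dsqcorr(row_a, row_c))
```

Because the correlation is squared, strongly correlated and strongly anti-correlated rows are both treated as close, while weakly correlated rows are far apart.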
Predictive Modeling
In conjunction with, or as an alternative to, the Row-by-Row Modeling and Pattern Discovery processes described previously, you might want to perform exploratory predictive modeling and data mining. See Chapter 10 for a description of relevant processes. Note that Predictive Modeling processes require the data to be in wide format, as created previously with the Transpose Tall and Wide process.
CHAPTER 8
Microarray Case Study II: Affymetrix Latin Square Data
In Chapter 7, we considered an experiment conducted with cDNA microarrays. Here we consider oligonucleotide arrays. Oligonucleotide arrays differ from cDNA arrays since the sequence for each oligonucleotide is shorter and is usually determined a priori using bioinformatics techniques. Oligonucleotides are typically mass-produced and are often used to study model systems. This chapter uses the Affymetrix Latin Square data set described in Chapter 1. Recall that this data set was originally generated by Affymetrix Inc. to develop and validate their U95A GeneChip and Microarray Suite (MAS) 5.0 algorithm over a range of known concentrations. The experiment consists of 14 experimental groups. Each group contains a pool of non-specific RNA as well as a set of 14 distinct human transcripts spiked in at known concentrations. The concentrations are staggered in a Latin Square arrangement. The data have been trimmed to only 100 genes and trimmed versions of .CEL files containing just these 100 genes are available in the JMP Genomics Sample Data folder.
Generation of the Required SAS Data Set and EDDS
The SAS data set and EDDS required for the analyses presented here were generated from the raw .CEL files and an Experimental Design File, as discussed in Chapter 3. If you have not already generated these files, review the instructions for this example and generate the SAS data set and EDDS now. Make sure the output files are saved in the ProcessResults folder. The output consists of three SAS data sets:
o the affyinputengine.sas7bdat input data set,
o the affyinputengine_exp.sas7bdat experimental design data set (EDDS), and
o the probemap_hg_u95a_trim.sas7bdat data set listing the physical x and y array coordinates of each spot.
Note: Importing standard .CEL files generates a fourth output data set, containing the quality control (QC) probe sets. These QC probe sets, normally contained in Affymetrix data sets, are not included in the custom trimmed .CEL files in this example.
Assessing the Quality of the Data
With a new data set, it is advisable to perform quality control analyses before proceeding to other analyses.
Data Standardization and Distribution Analysis
In this example, as in the Drosophila example, we use a univariate distribution analysis to initially assess the quality of the data.
Click Genomics > Quality Control > Distribution Analysis, as shown in Figure 8.1.
Figure 8.1: Selecting the Distribution Analysis process
The Data Distribution dialog opens, as shown in Figure 8.2.
Figure 8.2: The Data Distribution dialog
Click Choose to select the input data set.
Navigate to ProcessResults.
Select the affyinputengine.sas7bdat file.
Click Open to select the file.
Select all of the available variables from a_01 to q_59.
Click to add the selected variables to the Variables for which to Display Distributions box, as shown in Figure 8.3.
Figure 8.3: Selecting the variables for distribution view
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The General tab of the Data Distribution dialog appears like the one shown in Figure 8.4.
Figure 8.4: The completed General tab of the Data Distribution dialog
Click Experimental Design.
Click Choose to select the EDDS.
Navigate to ProcessResults.
Click the affyinputengine_exp.sas7bdat file.
Click Open to select the file.
Select Experiment and ColumnName as the color variable and label variable, respectively, as shown in Figure 8.5.
Figure 8.5: Selecting the Color and Label variables
The Experimental Design tab of the Data Distribution dialog appears like the one shown in Figure 8.6.
Figure 8.6: Selecting the Label Variable
The Options tab shows display options for the results.
Click the Options tab to view the default settings.
Do not make any changes to the Options tab.
Click Run to generate the distributions.
Running this process produces the overlay plot of kernel density estimates shown in Figure 8.7.
Figure 8.7: The overlaid kernel density plot for the affyinputengine.sas7bdat data set
The distributions for these arrays are very similar, indicating that this is a high-quality data set.
Correlation and Principal Components
Now, examine the quality of the data using several variables. Run the Correlation and Principal Components process.
Select Genomics > Quality Control > Correlation and Principal Components, as shown in Figure 8.8.
Figure 8.8: Selecting the Correlation and Principal Components process
The Correlation and Principal Components dialog opens, as shown in Figure 8.9.
Figure 8.9: The Correlation and Principal Components dialog
Click Choose to select the input data set.
Navigate to ProcessResults.
Click the affyinputengine.sas7bdat file.
Click Open to select the file.
Select all of the available variables from a_01 to q_59 to compute their correlations.
Click to add the selected variables to the Variables box.
Click Choose to select the EDDS.
Navigate to ProcessResults.
Click the affyinputengine_exp.sas7bdat file.
Click Open to select the file.
Select Experiment as the color variable.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The completed General tab of the Correlation and Principal Components dialog appears like the one shown in Figure 8.10.
Figure 8.10: The completed General tab of the Correlation and Principal Components dialog
The Analysis tab allows you to transform the data prior to analysis and to specify the type of correlation and number of principal components.
Click Analysis to view the default settings.
Do not make any changes to the Analysis tab.
The Variance Components tab allows you to compute a variance components decomposition of the principal components, partitioning variability in terms of known effects.
Click Variance Components to view the default settings.
Do not make any changes to the Variance Components tab.
The Options tab contains display preferences for the results.
Click the Options tab to view the default settings.
Do not make any changes to the Options tab.
Click Run.
Running this process produces the correlation heat map shown in Figure 8.11.
Figure 8.11: The correlation heat map
The clustered heat map shown in Figure 8.11 displays the correlation matrix of the 59 samples. The samples cluster tightly according to their spike-in profiles and generate a very distinct pattern of correlation. This plot and its dendrogram are linked to a principal components plot (not shown).
Correlation and Grouped Scatterplots
The Correlation and Grouped Scatterplots process is a related multivariate quality control process that computes and annotates correlations and scatterplot matrices for expression measurements across groups of arrays.
Select Genomics > Quality Control > Correlation and Grouped Scatterplots.
The Correlation and Grouped Scatterplots dialog opens, as shown in Figure 8.12.
Figure 8.12: The Correlation and Grouped Scatterplots dialog
Click Choose to select the input data set.
Navigate to ProcessResults.
Click the affyinputengine.sas7bdat file.
Click Open to select the file.
Select Unit from the list of available variables.
Click to add Unit to the Variables By Which to Merge Annotation Data box.
Click Choose to select the EDDS.
Navigate to ProcessResults.
Select the affyinputengine_exp.sas7bdat file.
Click Open to select the file.
Select Experiment from the list of available variables.
Click to add Experiment to the Variables Defining Groups box.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The completed General tab of the Correlation and Grouped Scatterplots dialog appears like the one shown in Figure 8.13.
Figure 8.13: The completed General tab of the Correlation and Grouped Scatterplots dialog
The Annotation tabs allow you to merge information regarding individual genes and experimental groups into your output.
Click Annotation 1.
Click Choose to select the annotation data set.
Navigate to the Sample Data\Microarray\Affymetrix Latin Square folder.
Select the u95a.sas7bdat file.
Click Open to select the file.
Select Unit from the list of available variables.
Click to add Unit to the Annotation Merge Variables box.
Select Description from the list of available variables.
Click to add Description to the Annotation Label Variable box.
Click Annotation 2.
Select Accession from the list of available variables.
Click to add Accession to the GenBank Accession Variable box.
Select Gene Symbol from the list of available variables.
Click to add Gene Symbol to the Gene Symbol Variable box.
Select Description from the list of available variables.
Click to add Description to the Gene Description Variable box.
Select LocusLink from the list of available variables.
Click to add LocusLink to the Gene or LocusLink ID Variable box.
Select Homo sapiens from the Organism drop-down menu.
The completed Annotation tabs of the Correlation and Grouped Scatterplots dialog appear as shown in Figure 8.14.
Figure 8.14: The completed Annotation 1 (top) and Annotation 2 (bottom) tabs of the Correlation and Grouped Scatterplots dialog
The Options tab contains display options for the results.
Click the Options tab to view the default settings.
Do not make any changes to the Options tab.
Click Run.
Running this process produces the correlation scatterplots shown in Figure 8.15.
Figure 8.15: The correlations and scatterplot matrices for the AffymetrixLatinSquare input data example
There is a separate scatterplot matrix for each of the 14 experimental groups. Note the cigar-shaped distribution along the 45-degree diagonal and the very high correlations (shown above the scatterplot matrices). These results indicate very high repeatability within sample groups.
There are a few outlying probes that appear far from the main diagonal. These represent probes whose measurements were inconsistent across the arrays and should be handled carefully. Mouse over them to see their labels, and then drag a rectangle around them to select them in the associated JMP table. If there are a considerable number of outliers, go back and check the raw image files for abnormalities at those spots. This can help you decide whether to keep them or delete them from the analysis.
To filter a set of spots, select them in the corresponding JMP table using one of the tools available in JMP and delete them. To select rows, choose one of these options:
• click and drag a rectangle around the spots in one of the scatterplot matrix windows
• use the lasso tool
• hold the Shift key and click spots one by one
• click Rows > Row Selection > Select Where to define a filtering rule
Refer to the JMP User Guide for more details on selecting rows. With the rows selected, complete the following steps.
Click Rows > Row Selection > Invert Row Selection to invert the selection to include only the rows that should be kept.
Click Tables > Subset to create a subset table with the desired rows.
Click File > Save As SAS Data Set to save the subset table as a .sas7bdat file.
This new subset data set can now be used as input for further analyses.
Note: The ANOVA and Mixed Model Analysis processes described later also provide a means to automatically filter outliers based on the magnitude of discrepancy from a fitted statistical model.
Feature Flagger
This quality control process flags specific probe-level observations that have unusually low signals, as compared to a specified group median.
Select Genomics > Quality Control > Feature Flagger. The Feature Flagger dialog opens, as shown in Figure 8.16.
Figure 8.16: The Feature Flagger dialog
Click Choose to select the input data set.
Navigate to ProcessResults.
Select the affyinputengine.sas7bdat file.
Click Open to select the file.
Select Probe_Set_ID from the list of available variables.
Click to add Probe_Set_ID to the Feature Variable box.
Select Probe from the list of available variables.
Click to add Probe to the Sub-Feature Variable box.
Click Choose to select the EDDS.
Navigate to ProcessResults.
Click the affyinputengine_exp.sas7bdat file.
Click Open to select the file.
Select Experiment from the list of available variables.
Click to add Experiment to the Design-Level Grouping Variables box.
The Threshold is specified as 5 by default. Observations whose intensities differ from the median intensity by more than this value are flagged in the output.
Do not change the threshold value.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click Select to select this folder.
The completed Input tab of the Feature Flagger dialog appears like the one shown in Figure 8.17.
Figure 8.17: The completed Input tab of the Feature Flagger dialog
The Options tab allows you to specify the various types of output from this process.
Make no changes to the Options tab. Click Run to generate the table shown in Figure 8.18.
Figure 8.18: The Flagged Features table
The probes highlighted in red have unusually low signals.
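The flagging rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Feature Flagger implementation: the record layout, the sample values, and the function name `flag_low_features` are hypothetical, and the rule (flag an observation when it falls more than the threshold below its group median) is an assumed reading of the description.

```python
from collections import defaultdict
from statistics import median

def flag_low_features(records, threshold=5.0):
    """Return records whose intensity lies more than `threshold`
    below the median intensity of their group (assumed rule)."""
    by_group = defaultdict(list)
    for group, _probe, value in records:
        by_group[group].append(value)
    medians = {g: median(v) for g, v in by_group.items()}
    return [(g, p, v) for g, p, v in records if medians[g] - v > threshold]

# Hypothetical (group, probe, intensity) records.
records = [
    ("A", "p1", 10.2), ("A", "p2", 9.8), ("A", "p3", 3.1), ("A", "p4", 10.5),
    ("B", "p1", 8.0), ("B", "p2", 8.4), ("B", "p3", 7.9), ("B", "p4", 8.1),
]
flagged = flag_low_features(records)
print(flagged)
```

Only probe p3 in group A is flagged here, because it sits 6.9 units below its group median of 10.0.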
Array PseudoImage
In the event that original images of the arrays are not available, JMP Genomics can generate a pseudo-color representation of the data on a given array. Note that because the trimmed data set used previously to illustrate the processes discussed in this chapter does not have complete information for all of the probes, the image generated with this data set may not accurately reflect the real image of the array. In this example, therefore, we use the default example to generate a pseudo-image of array f_45 from the Affymetrix Latin Square Example data set.
Select Genomics > Quality Control > Pseudo Image.
The Array Pseudo Image dialog opens, as shown in Figure 8.19.
Figure 8.19: The Array PseudoImage dialog
To load the default AffymetrixLatinSquareExample, complete the following steps.
Click Load.
Select the settings for the AffymetrixLatinSquareExample.
Click OK to complete the Array Pseudo Image dialog, as shown in Figure 8.20.
Figure 8.20: The completed Input tab of the Array PseudoImage dialog
Click Run.
Select 45 in the Array Data Library dialog to generate the pseudoimage shown in Figure 8.21.
Figure 8.21: The pseudoimage of array f_45
A data set listing probes, x- and y-coordinates, and the response for each spot is generated in addition to the pseudo image. Highlighting the appropriate gene in the JMP table also highlights it in the pseudo image, and vice versa, thus providing another potential way to filter data.
Surface Summary
Another technique that can be helpful for quality control and normalization constructs a spatially smoothed surface plot of the background intensity of a chip. This process plots the surface data in three dimensions. Anomalies in the surface might indicate areas of poor quality due to technical issues. Note that because the trimmed data set used previously to illustrate the processes discussed in this chapter does not have complete information for all the probes, the image generated with this data set might not accurately reflect the real surface of the array. In this example, therefore, we use the default example to generate surface summaries of arrays f_45 and m_55 from the Affymetrix Latin Square Example data set used previously.
To generate surface plots of the Affymetrix sample data, complete the following steps.
Select Genomics > Quality Control > Surface Summary. The Surface Summary dialog opens, as shown in Figure 8.22.
Figure 8.22: The Surface Summary dialog
To load the default Affymetrix Latin Square Example, complete the following steps.
Click Load.
Select the settings for the AffymetrixLatinSquareExample.
Click OK to complete the Surface Summary dialog, as shown in Figure 8.23.
Figure 8.23: The completed General tab of the Surface Summary dialog
The Analysis tab allows specification of the following parameters:
• the number of blocks in the surface plot
• the range of acceptable z-values
• the summary statistic calculated for z in each x-y block
• the origin
• any subsetting of the data
Click the Analysis tab.
Examine the default settings. Default specifications include a 32 by 32 grid, no minimal/maximal z-values, the Min summary statistic, a bandwidth multiplier of 1 for moderate smoothing, and the top left corner designated as the origin.
Make no changes to the Analysis tab.
The Options tab allows you to specify the various types of output from this process.
Make no changes to the Options tab.
Click Run to generate the surface plots for arrays f_45 and m_55 of the Affymetrix Latin Square Example, as shown in Figure 8.24.
Figure 8.24: The surface plot for array f_45 (left) and m_55 (right)
Note that the background surface of chip f_45 appears fairly smooth whereas that for chip m_55 has a region of unusually high background signal.
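The block summary behind such a surface can be sketched as follows. This is a rough illustration only: it computes the Min statistic per x-y block on a small grid and omits the spatial smoothing step entirely. The function name `block_min_surface`, the coordinate ranges, and the sample spots are hypothetical, and coordinates are assumed to satisfy 0 <= x < x_max and 0 <= y < y_max, with the origin at the top left.

```python
def block_min_surface(spots, n_blocks, x_max, y_max):
    """Return an n_blocks x n_blocks grid (rows index y blocks) holding
    the minimum z observed in each block, or None for empty blocks."""
    grid = [[None] * n_blocks for _ in range(n_blocks)]
    for x, y, z in spots:
        col = min(int(x * n_blocks / x_max), n_blocks - 1)
        row = min(int(y * n_blocks / y_max), n_blocks - 1)
        current = grid[row][col]
        grid[row][col] = z if current is None else min(current, z)
    return grid

# Hypothetical (x, y, intensity) spots on a 4 x 4 array, summarized on a 2 x 2 grid.
spots = [(0, 0, 5.0), (1, 1, 3.0), (3, 0, 7.0), (0, 3, 2.0), (3, 3, 9.0), (2, 2, 4.0)]
surface = block_min_surface(spots, n_blocks=2, x_max=4, y_max=4)
print(surface)
```

With the default settings described above, `n_blocks` would be 32; a block whose minimum stands well above its neighbors would correspond to a region of raised background like the one seen on chip m_55.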
Data Normalization
Since the data set contains high-quality data, prepare it for analysis. To do this, use the Data Standardize process to normalize the affyinputengine.sas7bdat data set used previously.
Select Genomics > Normalization > Data Standardize.
The Data Standardize dialog opens, as shown in Figure 8.25.
Figure 8.25: The Data Standardize dialog
Click Choose to select the input data set.
Navigate to ProcessResults.
Select the affyinputengine.sas7bdat file.
Click Open to select the file.
Click Choose to select the EDDS.
Navigate to ProcessResults.
Select the affyinputengine_exp.sas7bdat file.
Click Open to select the file.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click on Select to select this folder.
The completed Data Standardize dialog appears as shown in Figure 8.26.
Figure 8.26: The completed Data Standardize dialog
Click Run to standardize the data.
The standardized SAS data set, affyinputengine_std.sas7bdat, generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 8.27).
Figure 8.27: The SAS Message dialog
The Data Distribution process was run for the normalized affyinputengine_std.sas7bdat data set. Examination of the resulting overlaid kernel distribution (not shown) indicates that the sample distributions of the normalized data set are even more consistent than those seen in Figure 8.7. The normalized data set is therefore used for subsequent analysis.
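The idea of putting every array on a common scale can be illustrated with a short Python sketch. The text does not state which standardization the default Data Standardize settings apply, so the method shown here (centering each array to mean 0 and scaling to standard deviation 1) is an assumption, and the array names and values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Center values to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical intensities for two arrays measured on different scales.
arrays = {"array_1": [2.0, 4.0, 6.0], "array_2": [10.0, 20.0, 30.0]}
standardized = {name: standardize(col) for name, col in arrays.items()}
print(standardized)
```

After this step both hypothetical arrays have the same distribution, which is the kind of consistency the overlaid kernel density plot is used to check.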
Pattern Discovery
Once you have performed quality control and normalization on your data, you might want to run pattern discovery processes on the data to understand them better. Chapter 7 provides examples of these processes. For this case study, we move directly to statistical modeling of the probe-level data.
Analysis of Variance (ANOVA)
The ANOVA process fits a linear model to each probe set in a normalized data set.
Select Genomics > Row-by-Row Modeling > ANOVA. The ANOVA dialog opens, as shown in Figure 8.28.
Figure 8.28: The ANOVA dialog
Click Choose to select the input data set.
Navigate to ProcessResults.
Select the normalized affyinputengine_std.sas7bdat file.
Click Open to select the file.
Select Probe_Set_ID from the list of available variables.
Click to add Probe_Set_ID to the By Variables box.
Select Unit from the list of available variables.
Click to add Probe_Set_ID to the Variables to Keep in Output or By Which to Merge Annotation Data box.
There is no chromosome or position data in the data set.
Leave both the Chromosome Variable and Position Variables fields blank.
Select Probe from the list of available variables.
Click to add Probe to the Class Variables box.
To specify the output folder, complete the following steps.
Click Choose.
Navigate to the ProcessResults folder.
Open the ProcessResults folder and click on Select to select this folder.
The completed General tab of the ANOVA dialog appears like the one shown in Figure 8.29.
Figure 8.29: The completed General tab of the ANOVA dialog
Click Annotation 1.
Click Choose to select the annotation data set.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Select the u95a_trim.sas7bdat file.
Click Open to select the file.
Select Probe_Set_ID from the list of available variables.
Click to add Probe_Set_ID to the Annotation Merge Variables box.
Select Probe_Set_ID from the list of available variables.
Click to add Probe_Set_ID to the Annotation Label Variable box.
Click Annotation 2.
Select Sequence_Derived_From from the list of available variables.
Click to add Sequence_Derived_From to the GenBank Accession Variable box.
Select Gene_Symbol from the list of available variables.
Click to add Gene_Symbol to the Gene Symbol Variable box.
Select Title from the list of available variables.
Click to add Title to the Gene Description Variable box.
Select LocusLink from the list of available variables.
Click to add LocusLink to the Gene or LocusLink ID Variable box.
Select Homo sapiens from the Organism drop-down menu.
The completed Annotation tabs of the ANOVA dialog appear as shown in Figure 8.30.
Figure 8.30: The completed Annotation 1 tab (top) and Annotation 2 tab (bottom) of the ANOVA dialog
The Model tab allows you to specify the variables and effects from your experimental design that may affect your model. It is important to specify class variables, fixed effects, and random effects appropriately.
Click the Model tab.
To specify the EDDS, complete the following steps.
Click Choose.
Navigate to ProcessResults.
Click the affyinputengine_exp.sas7bdat file.
Click Open to select the file.
Class variables are those whose levels form distinct categories in the model. (They are distinguished from continuous variables whose numeric values are used directly in the model.) Here both Experiment and Array are class variables.
Select both Array and Experiment from the list of available variables.
Click to add Array and Experiment to the Class Variables box.
Fixed effects contain a specific set of levels that are of sole interest for comparison. They are typically the primary variables of interest in the design.
Type Experiment and Probe in the Fixed Effects box.
LSMeans effects are used to construct differences and least-squares means profiles.
Type Experiment in the LSMeans Effects box.
Random effects model correlation patterns in the data and are assumed to arise randomly from a population of observable effects. Those observations in the data which share the same level of a random effect are assumed to be correlated. Here, Array is specified as a random effect to model the correlation between probe-level data from the same array (and also from the same probe set, since Probe_Set_ID is specified as the By Variable on the first tab).
Type Array in the Random Effects box.
The completed Model tab of the ANOVA dialog appears as shown in Figure 8.31.
Figure 8.31: The completed Model tab of the ANOVA dialog
The LS Means tab allows you to specify which LS Means difference set to use for volcano plots and how those means are to be standardized.
Click the LS Means tab.
By default, all pairwise LSMeans differences are selected. In addition, STD is chosen as the default LSMeans test.
Do not make any changes to the LS Means tab.
The Multiple Testing tab allows you to run multiple hypothesis tests across all LSMeans differences to identify a cutoff for determining significant expression differences.
Click the Multiple Testing tab.
The default test is the Bonferroni test. In this example, instead of running multiple hypothesis tests, we define this cutoff value directly using the –log10(p-value) cutoff parameter. To change the default setting, complete the following steps.
Select the blank space in the middle of the Multiple Testing Method drop-down menu.
Type 15 in the –log10(p-value) Cutoff text box.
The Residuals tab allows you to define several parameters describing how to handle the residuals from the ANOVA model fits. Residuals are statistics useful for quality control and assessment of goodness-of-fit. Selecting a Filtration Method for Data with Large Residuals allows you to set up rules to filter outliers that are statistically far from fitting the model (Chu, Weir et al. 2002).
Do not make any changes to the Residuals tab.
On the Options tab, you can choose different preferences for the output of this procedure. The only change you should make to the Options tab is to select a name for the Mixed Model Expression Index Output Data Set.
Type affylatin_mmei in the Mixed Model Expression Index Output Data Set Name field.
The completed Options tab of the ANOVA dialog appears like the one shown in Figure 8.32.
Figure 8.32: The completed Options tab of the ANOVA dialog
Running the ANOVA process produces various graphical displays of statistical results. These graphics are all driven by JMP tables, one of which lists the significant genes and is illustrated in Figure 8.33.
Figure 8.33: A portion of the table listing differentially-expressed genes
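To put the cutoff of 15 entered on the Multiple Testing tab in perspective, the following Python sketch converts between p-values and the -log10 scale and compares the fixed cutoff with a Bonferroni-style threshold. The significance level (0.05) and the number of tests (10,000) are hypothetical values chosen for illustration; neither is taken from this example's settings.

```python
import math

def neglog10(p):
    """Convert a p-value to the -log10 scale."""
    return -math.log10(p)

def bonferroni_cutoff(alpha, n_tests):
    """-log10 threshold for a Bonferroni adjustment at level alpha."""
    return neglog10(alpha / n_tests)

alpha = 0.05        # hypothetical significance level
n_tests = 10_000    # hypothetical number of tests
cutoff_bonf = bonferroni_cutoff(alpha, n_tests)

fixed_cutoff = 15.0                  # value typed in the dialog
p_at_fixed = 10.0 ** -fixed_cutoff   # corresponding raw p-value

print(f"Bonferroni cutoff: {cutoff_bonf:.3f}")
print(f"p-value at fixed cutoff: {p_at_fixed:.1e}")
```

Under these assumed values, a fixed cutoff of 15 is far more stringent than the Bonferroni threshold, so only the most extreme differences are reported as significant.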
Note that there are 17 differentially expressed genes. Fourteen of these genes correspond to the transcripts that were experimentally spiked in as expected. Two sets of genes, probes #36202_at and #546_at, and probes #407 and #37777, respectively, correspond to the same spiked-in genes. Two genes, probe #33818_at and probe #1032_at, are unexpected and warrant further investigation. To highlight these genes, complete the following step.
Hold down the Ctrl key and click each of the selected genes, as shown in Figure 8.34.
Figure 8.34: Highlighting selected genes
Highlighting these genes in the data table allows us to visualize them in other windows as well, such as the Hierarchical Clustering window, shown in Figure 8.35.
Figure 8.35: Clustering window
Examination of the Hierarchical Clustering window reveals that the 33818_at gene, which encodes a valosin-containing protein, clusters with the interleukin receptor-like 40322_at gene. In addition, the 1032_at gene, which encodes the beta-subunit of the interleukin 8 receptor, clusters with the angiotensinogen proteinase inhibitor 684_at gene, as shown in Figure 8.36.
Figure 8.36: Clustering of differentially-expressed genes
With these genes highlighted, we can select the Action Buttons window (shown in Figure 8.37) and use the search options to further explore their relationships.
Figure 8.37: The Action Buttons window
Click Annotation Summary to open a Gene Summary HTML page (shown in Figure 8.38) providing specific links to information on each of the highlighted genes contained in various online databases.
Figure 8.38: The Gene Annotation Summary
From here, connect to public web pages for further analysis. It turns out that the spike-in concentration of the interleukin 1 receptor-like gene (#40322_at) was 0.25pM. A mistake in the experimental setup caused the valosin-containing protein gene (probe set #33818), which was supposed to go into group 12, to be omitted. This gave it a concentration of 0pM, which would intuitively cluster together with a concentration of 0.25pM. The probe set #1032_at contains the motif 5’GCAGCCGTTT3’. In addition to having specificity for the interleukin 8 receptor (beta) gene, this motif also hybridizes to a similar sequence contained in the K02215 gene (target of the 684_at probe set) (Hsieh, Chu et al. 2003). Therefore, it is not surprising that the genes specified by these two probe sets cluster together.
Predictive Modeling
In conjunction with, or as an alternative to, row-by-row modeling as described previously, you might want to perform exploratory predictive modeling and/or data mining. See Chapter 10 for a description of relevant processes available through JMP Genomics.
Chapter 9
Proteomics Spectral Preprocessing: The Prostate Cancer Example
JMP Genomics offers analyses for spectrometry data, including those from mass spectrometers and nuclear magnetic resonance instrumentation. In this example, the data set was obtained by Surface-Enhanced Laser Desorption/Ionization (SELDI). This method allows an investigator to detect and resolve multiple proteins bound to protein chip arrays (Merchant and Weinberger 2000). This approach was used by Qu et al. (2002) to discriminate prostate cancer patients from non-prostate-cancer patients. The promise of this approach is that a panel of multiple biomarkers can be used to distinguish important phenotypes such as cancer status. However, great care must be taken to pre-process and analyze the data appropriately to ensure generalizability of results.
The Prostate Cancer Example
The example data set consists of serum samples collected from 165 men. Eighty-four of the men had prostate cancer; the remaining 81 are considered controls. The primary goal is to determine differences in protein expression between these groups. To examine the primary data set, complete the following steps.
Select File > Open, as shown in Figure 9.1.
Figure 9.1: Opening the data set
The Open Data File window opens.
Navigate to Sample Data > Proteomics, as shown in Figure 9.2.
Figure 9.2: Selecting the data set
Select the wright_tall_2k_10k.sas7bdat file.
Click Open to open the data set.
The wright_tall_2k_10k.sas7bdat data set, partially shown in Figure 9.3, opens.
Figure 9.3: The wright_tall_2k_10k.sas7bdat data set
The primary data set is in tall form, with mass-to-charge (m/z) values as rows and individuals as columns. As with the microarray data, there is an accompanying experimental design file that provides characteristics of the columns. To examine the experimental design for this example, open the wright_design.sas7bdat file in JMP by completing the following steps.
Select File > Open.
Navigate to Sample Data > Proteomics.
Select the wright_design.sas7bdat file.
Click Open to open the design file, as shown in Figure 9.4.
Figure 9.4: The EDF
Note that the format of this file conforms to the EDF specifications described in Chapter 3. The primary variable of interest is status, with values CCD (cancer) and NOR (normal). The Array variable provides a unique numerical indicator for each row, and ColumnName lists the names of the columns in the primary data set.
Preprocessing the Data
JMP Genomics contains a few processes to assist in basic preprocessing of spectral datasets. Running these processes before rigorous statistical analyses typically increases the reliability of these analyses.
2-Dimensional Analysis
A first step in analyzing this kind of dataset is to get a good view of the entire dataset. For two-dimensional spectral data like these SELDI data, this can be done using the 2D Plot process located under the Spectral Preprocessing menu.
Select Genomics > Spectral Preprocessing > 2D Plot, as shown in Figure 9.5.
Figure 9.5: Selecting the 2D Plot process
The Spectral 2D Plot dialog opens, as shown in Figure 9.6.
Figure 9.6: The Spectral 2D Plot dialog
To load the prostate cancer example, complete the following steps.
Click Load, as shown in Figure 9.7.
Figure 9.7: Loading the default example
Select the ProstateCancerExample and click OK, as shown in Figure 9.8.
Figure 9.8: Selecting the default example
The completed dialog appears, as shown in Figure 9.9.
Figure 9.9: The completed Spectral 2D Plot dialog
This process plots the spectra and enables comparisons between two groups, designated A and B. This example compares all of the cancer patients versus all of the non-cancer patients. The variables with CCD in their name are assigned to the A group, and those with NOR in their name are assigned to the B group. The index variable is plotted on the x-axis of the overlay plots.
Click Run to generate the overlay plots.
The overlay plots appear as shown in Figure 9.10. Note: Several additional results windows also open.
Figure 9.10: The overlay plots
This plot shows the mean values of the two groups of spectra plotted against each other (CCD is Group A, in red, and NOR is Group B, in green). The black spectrum along the bottom, which is indexed on the right axis, displays negative log10 p-values from t-tests between the two groups, conducted separately for each m/z value and without any adjustment for multiple testing. The peaks in the black spectrum mark m/z values at which the red and green groups differ significantly.
Use the magnifying tool to select a rectangular region of interest. This shows results in more detail and allows you to explore how and why the peaks were differentiated. This can also be useful for resolving doublet peaks.
This analysis produces a rather large set of plots. It can be informative to consider smaller sets of variables. This can be done by removing variables from the Plot Variables Group boxes.
Note: To shift or scale the axes, click on either the left or right vertical axis until a hand icon appears. Then drag to shift or scale the axis. Double-click on an axis to change its properties. These adjustments can enhance your ability to discern differences between the spectral profiles, as can be seen in Figure 9.11.
Figure 9.11: Portion of the overlay plot between m/z values of 3750 and 4050
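The black spectrum in these plots is built from one two-group comparison per m/z value. The Python sketch below illustrates that idea with a Welch-type t statistic for each row. To stay within the standard library, it converts the statistic to a two-sided p-value with a normal approximation rather than the t distribution, which is a simplification that overstates significance for small groups. The function name, m/z values, and intensities are hypothetical.

```python
import math
from statistics import mean, variance

def neglog10_p(group_a, group_b):
    """-log10 of an approximate two-sided p-value for a difference in means.
    Uses a Welch-type statistic with a normal approximation (assumption)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    t = (mean(group_a) - mean(group_b)) / se
    p = math.erfc(abs(t) / math.sqrt(2.0))   # two-sided normal tail area
    return -math.log10(max(p, 1e-300))       # guard against underflow to 0

# Hypothetical intensities: m/z value -> (Group A values, Group B values).
rows = {
    3800.0: ([5.1, 4.9, 5.3, 5.0, 5.2], [5.0, 5.2, 4.8, 5.1, 5.1]),
    3900.0: ([5.6, 5.4, 5.8, 5.5, 5.7], [5.0, 5.2, 4.8, 5.1, 4.9]),
}
scores = {mz: neglog10_p(a, b) for mz, (a, b) in rows.items()}
for mz, score in sorted(scores.items()):
    print(mz, round(score, 2))
```

Plotting such scores against m/z gives a profile whose peaks mark the m/z values where the two groups differ most, with no adjustment for multiple testing.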
The Overlay Plot by MZ graph (not shown) displays a similar graph of all the individual spectra. This can be useful if something of interest in the mean values plot warrants further exploration. Since all the data are plotted on this graph, manipulating it is memory-intensive and performance may be sluggish. The Cell Plot graph (not shown) displays all the spectra in a gray scale heat map. All of the plots are driven by a single underlying table, wright_tall_2k_10k_s2g. Scrolling to the extreme right side of this table (shown in Figure 9.12) shows various computed statistics. The last column in the table is the NegLog10 PValue column.
Figure 9.12: The wright_tall_2k_10k_s2g table
Click on the column label to select the NegLog10 PValue column.
Select Tables > Subset in the JMP menu, as shown in Figure 9.13.
Figure 9.13: Selecting the Subset process
The Subset dialog opens, as shown in Figure 9.14.
Figure 9.14: The Subset dialog
Click OK to generate a subset table of this data (shown in Figure 9.15).
Figure 9.15: A subset of the wright_tall_2k_10k_s2g table
Select Analyze > Distribution from the JMP menu (Figure 9.16).
Figure 9.16: Selecting the Distribution process
The Report: Distribution dialog opens.
Select the NegLog10 PValue column and click Y, Columns to select this column for distribution analysis.
Figure 9.17: The completed Report: Distribution dialog
Click OK to generate the histogram of the p-values.
Figure 9.18: Histogram of the p-values of the data reported in the wright_tall_2k_10k_s2g table
This is a highly skewed distribution. The negative log10 p-values in the top quartile, those above 3.903, correspond to the interesting peaks. To select these values directly from the distribution display, click and drag a rectangle in either the histogram or the box plot. Then click Tables > Subset to obtain a table of the most significant peaks. Refer to the JMP User Guide for more details on generating subset tables.
Note: Results from this and all JMP Genomics processes, including a re-executable JMP script, are saved in the output folder specified at the bottom of the General tab in the Spectral 2D Plot dialog. The default settings specify this output folder as the ProcessResults folder.
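The same top-quartile selection can be expressed programmatically. The sketch below uses Python's `statistics.quantiles` to find the upper quartile of a list of scores and keeps the values above it. The scores are hypothetical, and the quartile definition used by Python may differ slightly from the one JMP reports.

```python
from statistics import quantiles

def top_quartile(values):
    """Return the upper quartile and the values that exceed it."""
    _q1, _q2, q3 = quantiles(values, n=4)
    return q3, [v for v in values if v > q3]

# Hypothetical negative log10 p-values.
scores = [0.1, 0.3, 0.2, 0.8, 1.5, 0.4, 6.2, 4.1]
q3, selected = top_quartile(scores)
print(q3, selected)
```

In the example above, the equivalent threshold is 3.903, and the rows selected this way would form the subset table of the most significant peaks.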
2D Detrend
Spectral data often contain an unwanted baseline trend that varies from spectrum to spectrum. Removing these trends is recommended to ensure comparability of the spectra. The 2D Detrend process creates a new SAS data set in the same form as the original input data set, except that baseline trends in the dataset are subtracted out for each spectrum.
Select Genomics > Spectral Preprocessing > 2D Detrend, as shown in Figure 9.19.
Figure 9.19: Selecting the 2D Detrend process
The Spectral 2D Detrend dialog opens, as shown in Figure 9.20.
Figure 9.20: The Spectral 2D Detrend dialog
Click Load to load the default example.
Select the ProstateCancerExample settings and click OK.
The completed dialog appears as shown in Figure 9.21.
Figure 9.21: The completed Spectral 2D Detrend dialog
Examine the General tab of the dialog. The input data set is the same data file used previously. Each column in the input file shows as an available variable. The spectral variables are columns containing the numerical data from the spectra. The index variable is mz. This example automatically specifies the output folder and assigns a name for the output data set.
Click the Analysis tab. The Analysis tab appears, as shown in Figure 9.22.
Figure 9.22: The Analysis tab
Examine the Analysis tab. The bandwidth represents the moving m/z value width used to calculate the average baseline for subtraction from the points on the spectra. Peaks are determined using the standard cutoff of 3 above baseline.
Click Run to subtract the baseline. The modified SAS data set generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 9.23).
Figure 9.23: The SAS Message dialog
JMP automatically adds the _dt suffix to the name of the new data set. This data set is available for subsequent analyses.
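Baseline subtraction with a moving window can be sketched as follows. This is a simplified illustration, not the 2D Detrend algorithm: the baseline at each point is assumed to be the plain average of all intensities within half a bandwidth on either side in m/z, and the function name and data are hypothetical.

```python
def detrend(mz, intensity, bandwidth):
    """Subtract a moving-average baseline computed over a window of
    total width `bandwidth` (in m/z units) centered on each point."""
    half = bandwidth / 2.0
    result = []
    for center, value in zip(mz, intensity):
        window = [y for x, y in zip(mz, intensity) if abs(x - center) <= half]
        result.append(value - sum(window) / len(window))
    return result

# Hypothetical spectrum: a rising baseline with one sharp peak at index 5.
mz = [float(x) for x in range(2000, 2010)]
intensity = [10.0 + 0.5 * i for i in range(10)]
intensity[5] += 20.0
detrended = detrend(mz, intensity, bandwidth=4.0)
print([round(v, 2) for v in detrended])
```

After subtraction the rising trend is largely removed while the peak remains clearly above its neighbors, which is what makes detrended spectra easier to compare with one another. A wider bandwidth gives a smoother baseline estimate that is less influenced by individual peaks.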
2D Bin
Spectral data sets can be quite large, and it is often useful for rapid initial exploration of the major features of the data to bin them across groups of m/z values. The 2D Bin process (not shown) performs simple binning in this fashion and reduces the total number of rows in the main data set.
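As a rough illustration of binning, the Python sketch below groups the rows of one spectrum into fixed-width m/z bins and replaces each group with its mean intensity. The choice of the mean as the summary, the bin width, the function name, and the data are all assumptions; the options actually offered by the 2D Bin process are not described here.

```python
from collections import defaultdict

def bin_spectrum(mz, intensity, bin_width):
    """Average intensities within consecutive m/z bins of width `bin_width`.
    Returns (bin start, mean intensity) pairs sorted by bin start."""
    sums = defaultdict(lambda: [0.0, 0])
    for x, y in zip(mz, intensity):
        start = (x // bin_width) * bin_width
        sums[start][0] += y
        sums[start][1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]

# Hypothetical spectrum with six rows, reduced to three bins.
mz = [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0]
intensity = [1.0, 3.0, 2.0, 4.0, 10.0, 20.0]
binned = bin_spectrum(mz, intensity, bin_width=2.0)
print(binned)
```

Six input rows become three, which shows how binning reduces the number of rows in the main data set.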
2D Peak Find
Another way to reduce the size of spectral data is to compute peak locations and their heights or areas. The 2D Peak Find process executes a basic peak-finding algorithm based on a specified number of peaks to be found.
Select Genomics > Spectral Preprocessing > 2D Peak Find. The Spectral 2D Peak Find dialog opens, as shown in Figure 9.24.
Figure 9.24: The Spectral 2D Peak Find dialog
Click Load to load the default example.
Select the ProstateCancerExample settings and click OK.
The completed dialog appears as shown in Figure 9.25.
Figure 9.25: The completed Spectral 2D Peak Find dialog
Examine the General tab of the dialog. Note the automatic specification of the wright_tall_2k_10k.sas7bdat input data file. The x-axis and spectral variables have been specified, as discussed previously. The output folder has been specified by default.
Click on the Noise Estimation tab. The Noise Estimation tab appears, as shown in Figure 9.26.
Figure 9.26: The completed Noise Estimation tab of the Spectral 2D Peak Find dialog
The x-axis value intervals for noise are 2000-2500 and 19500-20000. These regions of the 2-D spectra (Figure 9.10) appear to result from pure noise.
Click the Options tab. The Options tab appears, as shown in Figure 9.27.
Figure 9.27: The completed Options tab of the Spectral 2D Peak Find dialog
Examine the Options tab. Note that the maximum number of peaks is set to 100 by default. Change this number depending upon the resolution of the data.
Click Run to find the peaks. This process invokes SAS/IML and may take several minutes to run for large data sets. Upon completion, several graphs are produced showing various statistics about the peaks, as shown in Figure 9.28.
Figure 9.28: The Peak Finding Statistics plots
The peak-finding process also generates two different output data sets that are listed in a SAS Message dialog as shown in Figure 9.29.
Figure 9.29: The SAS Message dialog
The wright_tall_2k_10k_s2p_det.sas7bdat data set contains peak details that are useful for further exploration. The wright_tall_2k_10k_s2p.sas7bdat data set is useful for subsequent analyses.
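The roles of the noise intervals and the maximum number of peaks can be illustrated with a simple Python sketch. It is not the algorithm run by the 2D Peak Find process: here a noise floor is estimated from the designated noise region as its mean plus a multiple of its standard deviation, local maxima above that floor are treated as peaks, and only the tallest `max_peaks` are kept. The function name, the multiplier `n_sd`, and the data are hypothetical.

```python
from statistics import mean, stdev

def find_peaks(mz, intensity, noise_ranges, max_peaks=100, n_sd=3.0):
    """Return up to `max_peaks` (m/z, height) local maxima that rise above
    a noise floor estimated from the m/z intervals in `noise_ranges`."""
    noise = [y for x, y in zip(mz, intensity)
             if any(lo <= x <= hi for lo, hi in noise_ranges)]
    floor = mean(noise) + n_sd * stdev(noise)
    peaks = [(mz[i], intensity[i]) for i in range(1, len(mz) - 1)
             if intensity[i] > floor
             and intensity[i] > intensity[i - 1]
             and intensity[i] >= intensity[i + 1]]
    peaks.sort(key=lambda p: p[1], reverse=True)   # tallest first
    return sorted(peaks[:max_peaks])               # report in m/z order

# Hypothetical spectrum; the first five points are treated as pure noise.
mz = [2000.0 + 100.0 * i for i in range(12)]
intensity = [1.0, 1.2, 0.9, 1.1, 1.0, 6.0, 2.0, 1.5, 9.0, 3.0, 1.0, 1.1]
peaks = find_peaks(mz, intensity, noise_ranges=[(2000.0, 2400.0)], max_peaks=2)
print(peaks)
```

Small bumps inside the noise region are ignored because they do not clear the estimated floor, which is why the choice of noise intervals matters.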
Proteomics Data Quality Control and Normalization
After pre-processing, run statistical quality control and normalization processes on the spectral data. These capabilities are available in the Quality Control and Normalization submenus of the main Genomics menu. Refer to Chapters 7 and 8 for demonstrations of different Quality Control processes on microarray data.
Note: After appropriate pre-processing, protein or metabolite expression data are, from a statistical perspective, similar to gene expression data. Many of the processes used for analysis of microarray data are, therefore, applicable to proteomic analyses.
Proteomics Pattern Discovery and Row-by-Row Modeling
As with Quality Control and Normalization, the processes available under the Pattern Discovery and Row-by-Row Modeling submenus are useful for protein or metabolite expression data. These processes are illustrated with microarray data in Chapters 7 and 8 and are not shown here.
Preparing Data for Predictive Modeling
Often the goal of a proteomics study is to find a model for prediction of a categorical or continuous characteristic of the samples. Several processes are available for this in the Predictive Modeling submenu. These processes are described in Chapter 10. Before running these processes, the data must be transformed into wide form.
Transpose Tall and Wide
Select Genomics > Data Set Utilities > Transpose Tall and Wide. The Data Transpose dialog opens. To select the wright_tall_2k_10k_s2p.sas7bdat file that was generated previously as the input data set, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Select the wright_tall_2k_10k_s2p.sas7bdat file and click Open.
To select the Experimental Design Data Set, complete the following steps.
Click Choose.
Navigate to Sample Data > Proteomics.
Select the wright_design.sas7bdat file and click Open.
To select the output folder, complete the following steps.
Click Choose.
Navigate into the ProcessResults folder.
Click Select to specify the output folder. The completed Transpose Tall and Wide dialog appears, as shown in Figure 9.30.
Figure 9.30: The completed Transpose Tall and Wide dialog
Click Run to transpose the data.
The transposed SAS data set generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 9.31).
Figure 9.31: The SAS Message dialog
Click Open to examine the transposed data set (Figure 9.32).
Figure 9.32: The transposed data set
The transposed data set has individuals as rows and both experimental design variables and peaks as columns. Note the “_wid” suffix on the end of the name of the data set. The wright_tall_2k_10k_s2p_wid.sas7bdat data set can be used as the input data set for the Predictive Modeling processes described in Chapter 10.
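The tall-to-wide reshaping described above can be pictured with a short Python sketch. It is not the code that JMP Genomics runs; the column names sample, peak, and intensity and the example values are assumptions made only for illustration.

```python
def tall_to_wide(tall_rows, design_rows, id_col="sample",
                 key_col="peak", value_col="intensity"):
    """Return one wide row per sample: design variables first, then peaks."""
    peak_names = {}   # insertion-ordered collection of peak column names
    values = {}       # sample id -> {peak name: value}
    for row in tall_rows:
        peak_names.setdefault(row[key_col], None)
        values.setdefault(row[id_col], {})[row[key_col]] = row[value_col]

    wide_rows = []
    for design in design_rows:
        wide = dict(design)  # copy the experimental design variables
        sample_values = values.get(design[id_col], {})
        for peak in peak_names:
            wide[peak] = sample_values.get(peak)  # None marks a missing peak
        wide_rows.append(wide)
    return wide_rows


# Hypothetical example data (not taken from the wright data sets).
tall = [
    {"sample": "s1", "peak": "mz2000", "intensity": 1.5},
    {"sample": "s1", "peak": "mz2010", "intensity": 0.7},
    {"sample": "s2", "peak": "mz2000", "intensity": 2.1},
]
design = [{"sample": "s1", "status": "CCD"}, {"sample": "s2", "status": "Nor"}]
wide = tall_to_wide(tall, design)
```

Each wide row carries the design variables followed by one column per peak, which mirrors the layout of individuals as rows and peaks as columns.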
Chapter 10
Predictive Modeling
The primary focus of JMP Genomics is scientific discovery and understanding through statistics and graphics. However, the software does offer some basic capabilities for creating predictive models. You can construct predictors of either continuous or categorical outcomes using data from genetic markers, microarrays, or proteomics as predictor variables. These processes, which include Discriminant Analysis, Distance Scoring, General Linear Model Selection, K Nearest Neighbors, Logistic Regression, Partial Least Squares, Partition Trees, Radial Basis Machine, and Binary Response Effect Selection, are grouped under the Predictive Modeling submenu, as shown in Figure 10.1. Additional processes (Binary Response Effect Selection, Cross Validation Model Comparison, and Test Set Model Comparison) help you to select the most appropriate model for your data.
Figure 10.1: The Predictive Modeling submenu
Predictive modeling is also known as exploratory modeling or data mining. This chapter discusses the JMP Genomics functions that target exploratory and basic data mining for genomics data. For advanced, enterprise-scale data mining, SAS Enterprise Miner software offers a full spectrum of methods and a convenient, workflow-style interface.
After the genomics data has been appropriately preprocessed and stored as a wide SAS data set, one or more of the processes described in this chapter can be run to perform exploratory data mining. The same data set can also be used with Enterprise Miner to obtain more rigorous results and scoring rules.
Data Sets
All of the processes described in this chapter require the data to be in wide format, with individual samples as rows and experimental design variables, phenotypes, genetic markers, transcripts, and/or peptides as columns. Genetic marker data is likely already in this form, but any microarray or proteomics data that are in tall form must be converted to the wide format. Use the Transpose Tall and Wide command to convert the tall data set and its accompanying experimental design data set to wide form. See Chapter 4 for detailed instructions on transforming the data set.
With multiple tables containing different forms of data on a set of samples (for example, both genetic marker and microarray data), merge them into one single wide data set using the Data Set Utilities > Merge command, as described in Chapter 4. These data can then be used together to build jointly predictive models. We recommend that you preprocess and analyze the different data types separately and then combine them just prior to predictive modeling.
For large data sets with tens or hundreds of thousands of predictors, computing time for some of the JMP Genomics predictive modeling processes can become prohibitively long. In this situation, perform a preliminary reduction of the predictor set by using the Pattern Discovery > K-Means Clustering process to select a thousand or so representative predictors. (The data must be in tall form to execute this process. Use the Transpose Tall and Wide AP to go back and forth between tall and wide forms.)
When performing variable selection/reduction with an entire data set, it is important to realize that an optimistic bias can be introduced in subsequent analyses. To compensate for this, hold out a fraction of the data from the beginning and use it for subsequent prediction. Many of the processes have built-in cross-validation capabilities to help prevent selection bias.
Alternatively, cross validation can be done manually by creating one or more new columns that are copies of the variables being predicted and then setting subsets of them to missing values. While the ultimate test of generalizability of any predictive model is with new data from an independent laboratory, computer-based cross-validation is invaluable in assessing initial performance of the models.
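The manual cross-validation idea described above (copy the variable being predicted, then set a subset of each copy to missing) can be sketched in Python as follows. This is an illustrative sketch only: the function name, the number of folds, and the random assignment of rows to folds are assumptions, not part of JMP Genomics.

```python
import random


def make_holdout_columns(values, n_folds=5, seed=0):
    """Return n_folds copies of values, each with one fold set to None."""
    indices = list(range(len(values)))
    random.Random(seed).shuffle(indices)
    # Assign each row to exactly one fold, in shuffled order.
    fold_of = {row: position % n_folds for position, row in enumerate(indices)}
    columns = []
    for fold in range(n_folds):
        column = [None if fold_of[row] == fold else value
                  for row, value in enumerate(values)]
        columns.append(column)
    return columns


# Hypothetical 0/1 outcome for six samples.
status = [1, 1, 0, 0, 1, 0]
holdout_columns = make_holdout_columns(status, n_folds=3, seed=1)
```

A model fitted with one of these copies as the dependent variable never sees the rows set to missing, so its predictions for those rows give an estimate of performance on held-out data.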
Predictive Modeling Processes
The proteomics prostate cancer example, described in Chapters 1 and 9, is used to illustrate several of the predictive modeling processes available from JMP Genomics.
Discriminant Analysis
Discriminant Analysis is a traditional method for classifying a categorical variable from a set of continuous responses. To run this process, complete the following steps.
Select Genomics > Predictive Modeling > Discriminant Analysis. The Discriminant Analysis dialog opens, as shown in Figure 10.2.
Figure 10.2: The Discriminant Analysis dialog
Click Load.
Select the ProstateCancerExample settings.
The completed General tab of the Discriminant Analysis dialog appears, as shown in Figure 10.3.
Figure 10.3: The completed General tab of the Discriminant Analysis dialog
Click Open to examine the wright_wide_2k_10k_dt_sig6.sas7bdat input data set.
The input data set contains data from 165 men, 84 men with prostate cancer and 81 cancer-free men considered as controls. Samples are listed in rows, while the responses from a set of mass spectrometry peaks are listed in columns beginning with mz. Note this data set is in the wide format.
Examine the completed General tab. The Dependent Class Variable is the variable to be predicted; in this case, status indicates whether or not the individual is likely to develop cancer. In this example, individuals with cancer are identified as CCD, while members of the control group are identified as Nor. A discriminant prediction model can be built from two types of predictor variables, Continuous and Class.
• Predictor continuous variables must be numeric, and their numeric values are used directly as predictors, as in linear regression.
• Predictor class variables can be numeric or categorical. Their unique values are used to form a set of columns with 0s and 1s indicating class level.
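As a rough illustration of how a class variable becomes a set of 0/1 columns, consider the following Python sketch. The function name and the naming of the generated columns are assumptions for illustration; the exact coding that the process uses internally is not shown in this guide.

```python
def indicator_columns(values, prefix):
    """Build one 0/1 column per unique level of a class variable."""
    levels = sorted(set(values))
    return {
        f"{prefix}_{level}": [1 if value == level else 0 for value in values]
        for level in levels
    }


# Hypothetical class variable with the two levels used in this example.
coded = indicator_columns(["CCD", "Nor", "CCD"], "status")
```

Here coded contains a status_CCD column and a status_Nor column, each marking with a 1 the rows that belong to that level.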
With a large number of predictor variables, it is often more convenient and advisable to use the List-Style specifications rather than selecting and moving variable names to the boxes on the right. For the List-Style specifications, you can use SAS syntax to indicate a range of variables, for example, x1-x12345 specifies the variables x1, x2, x3, …, x12345. For this example, you could clear the Predictor Continuous Variables field and instead specify mz: in the List Style Specification of Predictor Continuous Variables field. This specification is a shorthand syntax that indicates all variables beginning with mz. The variable sample is specified as the Label Variable. The values listed in this variable are used to create labels in the output JMP table and plots.
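To make the two shorthand forms concrete, the following Python sketch expands a list-style specification against a set of column names. It only mimics the behavior described above for ranges such as x1-x3 and prefixes such as mz:; it is not SAS code, and the function name and the choice to silently skip names that do not exist are assumptions.

```python
import re

# prefix + number, a hyphen, then the same prefix + number (for example x1-x12)
RANGE = re.compile(r"([A-Za-z_]\w*?)(\d+)-([A-Za-z_]\w*?)(\d+)")


def expand_variable_list(spec, available):
    """Expand a space-separated list-style specification into column names."""
    by_lower = {name.lower(): name for name in available}
    selected = []
    for token in spec.split():
        match = RANGE.fullmatch(token)
        if token.endswith(":"):
            # Prefix form: every available name starting with the prefix.
            prefix = token[:-1].lower()
            names = [n for n in available if n.lower().startswith(prefix)]
        elif match and match.group(1).lower() == match.group(3).lower():
            # Numbered range form: prefix1, prefix2, ..., prefixN.
            prefix = match.group(1).lower()
            first, last = int(match.group(2)), int(match.group(4))
            candidates = (f"{prefix}{i}" for i in range(first, last + 1))
            names = [by_lower[c] for c in candidates if c in by_lower]
        else:
            # Plain variable name.
            names = [by_lower[token.lower()]] if token.lower() in by_lower else []
        for name in names:
            if name not in selected:
                selected.append(name)
    return selected


# Hypothetical column names.
columns = ["sample", "status", "mz2000", "mz2010", "x1", "x2", "x3"]
mz_columns = expand_variable_list("mz:", columns)
mixed = expand_variable_list("x1-x3 status", columns)
```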
The Predictor Reduction tabs allow you to trim down the number of predictor variables used before modeling, eliminating redundant variables and, potentially, increasing the speed of execution.
Click Predictor Reduction 1.
Examine the Predictor Reduction 1 tab.
Do not make any changes to the Predictor Reduction tab.
Click Predictor Reduction 2.
Examine the Predictor Reduction 2 tab.
Do not make any changes to the Predictor Reduction tab.
The Analysis tab allows you to select and adjust specific analysis parameters.
Click Analysis.
Do not make any changes to the Analysis tab.
The Genetic Algorithm tab allows you to input the algorithm used to complete the analysis.
Click Genetic Algorithm.
Do not make any changes to the Genetic Algorithm tab.
The Options tab allows you to specify the output of the discriminant analysis.
Click the Options tab.
Do not make any changes to the Options tab.
Click Run to launch the JMP Discriminant platform, as shown in Figure 10.4.
Figure 10.4: The JMP Discriminant platform
The JMP Discriminant platform can be used to interactively select a set of predictors for the discriminant model. The Step Forward and Step Backward commands force JMP to select, in a stepwise manner, the predictors according to statistical significance.
Click Step Forward to add the most significant of the non-selected variables to the list of predictors.
Click Step Backward to remove the least significant variable from the selected predictors.
Alternatively, specific predictors can be selected manually by checking the corresponding boxes in the Entered column.
Click Step Forward five times to select the five most significant variables.
Click Apply This Model to obtain the display of the results shown in Figure 10.5.
Results for a model with the first five variables are shown in Figure 10.5.
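The idea behind the first forward step can be sketched in Python under the assumption that significance is judged by a one-way F ratio between the classes. With no predictors entered yet, the most significant candidate is then the one with the largest univariate F ratio. Later steps in a real stepwise discriminant analysis condition on the predictors already entered, which this sketch does not attempt; the data and names are hypothetical.

```python
def f_ratio(values, labels):
    """One-way ANOVA F ratio of a single predictor across class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def rank_predictors(columns, labels):
    """Sort predictor names from largest to smallest F ratio."""
    scored = [(name, f_ratio(values, labels)) for name, values in columns.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical intensities for two peaks in six samples.
labels = ["CCD", "CCD", "CCD", "Nor", "Nor", "Nor"]
peaks = {
    "mz_a": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],   # separates the groups well
    "mz_b": [2.0, 3.0, 1.0, 2.1, 2.9, 1.2],   # barely differs between groups
}
ranking = rank_predictors(peaks, labels)
```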
Figure 10.5: Model derived from the first five variables (scores are for these five variables)
Note that 11 of the 165 samples are misclassified with this model. To further refine this model, complete the following steps.
Select Stepwise Variable Selection from the drop-down menu in the Discriminant Analysis box, as shown in Figure 10.6, to return to the JMP Discriminant platform shown in Figure 10.4.
Figure 10.6: Selecting the Stepwise Variable Selection
Select additional predictors or deselect inappropriate predictors, as warranted by your scientific objectives.
Click Apply This Model to obtain the display of the new results (not shown).
Additionally, refer to the JMP Statistics and Graphics Guide for details on the output displays and further analyses.
General Linear Model Selection
The General Linear Model Selection process performs predictor variable selections in the framework of general linear models for a continuous dependent variable. A variety of model selection methods are available, including forward, backward, stepwise, lasso, and least-angle regression. This process offers a wide variety of selection and stopping methods, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based rules. It also provides graphical summaries of the selection search. It calls the experimental PROC GLMSELECT from SAS/STAT.
Select Genomics > Predictive Modeling > General Linear Model Selection. The GLM Select dialog opens, as shown in Figure 10.7.
Figure 10.7: The GLM Select dialog
Click Load.
Select the ProstateCancerExample settings.
The completed General tab of the GLM Select dialog appears as shown in Figure 10.8.
Figure 10.8: The completed General tab of the GLM Select dialog
The input data set was described previously in the example for the Discriminant Analysis process. The Dependent Variable is the variable to be predicted; in this case, status indicates whether or not the individual is likely to develop cancer. In this example, individuals with cancer are identified with a 1, while members of the control group are labeled with a 0.
All of the candidate predictor variables have names that begin with mz, so they are specified using the List-Style specification. In this case, mz: has been entered. The colon indicates that all variables with the common prefix mz are to be considered. Because they are all continuous, no Predictor Class Variables are specified. No label variables are used, and because each observation represents a single individual, no Weight variables are specified.
The Predictor Reduction 1, Predictor Reduction 2, Analysis, Genetic Algorithm, and Options tabs on the GLM Select dialog are similar to those described for the Discriminant Analysis AP. You should examine the default settings for each tab.
Click Run to run the GLM Select process.
The resulting output (available in either plain text or HTML) describes the details and results of the general linear model selection process. This example uses a stepwise model selection with entry and stay significance levels of 0.01. In addition to an overall mean value (Intercept), the twelve mass-over-charge values listed in the output Parameter Estimates table (shown in Figure 10.9) are selected as predictive.
Figure 10.9: Output of the GLM Select process
These can be considered to be initial candidate prostate cancer biomarkers, and provide starting points for more extensive computational and experimental cross-validation. For full documentation and details on the underlying options and methods available with this process, complete the following step.
Select Help > JMP Genomics Web Links > The GLMSELECT Procedure Documentation.
K Nearest Neighbors
The K Nearest Neighbors process is very similar to the Discriminant Analysis process, but it employs a nonparametric method based on neighboring averages to perform predictions. Example output is not shown.
Logistic Regression
Logistic regression is another classic method used to predict the probability of a response being in a particular categorical class. It models this probability using a link function that transforms a linear function of the predictor variables to a probability scale.
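The role of the link function can be illustrated with a few lines of Python. The sketch assumes the common logit link, in which the linear function of the predictors is mapped to a probability by the inverse logit; the intercept, coefficients, and predictor values are hypothetical and are not results from the prostate cancer example.

```python
import math


def predicted_probability(intercept, coefficients, predictors):
    """Map a linear predictor to a probability with the inverse logit."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    # Two algebraically equal forms avoid overflow for large |linear|.
    if linear >= 0:
        return 1.0 / (1.0 + math.exp(-linear))
    return math.exp(linear) / (1.0 + math.exp(linear))


# Hypothetical model with two peak intensities as predictors.
p_low = predicted_probability(-1.0, [0.8, -0.3], [0.5, 2.0])
p_high = predicted_probability(-1.0, [0.8, -0.3], [4.0, 2.0])
```

A larger value of a predictor with a positive coefficient moves the linear function up and therefore moves the predicted probability closer to 1.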
Select Genomics > Predictive Modeling > Logistic Regression. The Logistic Regression dialog opens, as shown in Figure 10.10.
Figure 10.10: The Logistic Regression dialog
Note that the structure of this dialog is very similar to the Discriminant Analysis dialog shown in Figure 10.3. Variables are specified in the same way as described for Discriminant Analysis.
Click Load.
Select the ProstateCancerExample settings.
The completed General tab of the Logistic Regression dialog appears as shown in Figure 10.11.
Figure 10.11: The completed General tab of the Logistic Regression dialog
Unlike the previous Discriminant Analysis example, predictors are selected in an automated stepwise fashion. Running the example with the default settings invokes SAS PROC LOGISTIC and produces a SAS Output window, a JMP table, and the Logistic Regression Results window shown in Figure 10.12.
Figure 10.12: Results of the Logistic Regression
The Distributions panel shows the distributions of the original samples and the number of correctly classified observations. The Contingency Analysis panel provides a further breakdown of the results. For this run, a total of 9 of the 165 samples are misclassified. To select rows in the corresponding JMP table, click on bars in the histograms or cells in the mosaic plot.
Partial Least Squares
Partial least squares (PLS) is a technique popular in chemometrics. It is different from discriminant and logistic regression methods in that it uses all of the predictor variables at one time. It can be viewed as a supervised principal components analysis, in that it constructs linear combinations of the predictor variables that maximize covariability with the dependent response variables.
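For a single response, the first PLS component has a simple closed form: after centering, the weight vector is proportional to the cross-product of the predictors with the response, which is the unit-length direction whose scores have maximal covariance with the response. The following Python sketch computes that first component only. It is a minimal illustration with hypothetical data and a 0/1 coding of the response assumed; it is not the SAS implementation, and it does not extract later components.

```python
import math


def first_pls_component(X, y):
    """Return (weights, scores) of the first PLS component for one response."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]  # centered X
    yc = [value - y_mean for value in y]                           # centered y
    # Weight for each predictor is its cross-product with the response.
    weights = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0:
        raise ValueError("predictors have no covariance with the response")
    weights = [w / norm for w in weights]
    # Scores are the linear combination of the centered predictors.
    scores = [sum(Xc[i][j] * weights[j] for j in range(p)) for i in range(n)]
    return weights, scores


# Hypothetical data: the first predictor tracks the response, the second does not.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0]]
y = [0, 0, 1, 1]
weights, scores = first_pls_component(X, y)
```

Because every predictor receives a weight, the component is a combination of all the variables, which is the property that distinguishes PLS from the stepwise methods shown earlier.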
Select Genomics > Predictive Modeling > Partial Least Squares. The Partial Least Squares dialog opens, as shown in Figure 10.13.
Figure 10.13: The Partial Least Squares dialog
Click Load.
Select the ProstateCancerExample settings.
The completed General tab of the Partial Least Squares dialog appears as shown in Figure 10.14.
Figure 10.14: The completed General tab of the Partial Least Squares dialog
Variables are specified as previously described for Discriminant Analysis, with the addition of a Color Variable that is used to color the JMP plots. On the Analysis tab, note that three partial least squares components are specified. As with principal components, this number can be changed.
Click Run to perform the partial least squares analysis. Several results windows open. Figure 10.15 shows the tabular SAS output.
Figure 10.15: The tabular SAS Output window
The Model Effects columns show that 88.4% of the variability of the mz predictor variables is explained by the three PLS components, whereas the Dependent Variables columns show that 72.4% of the variability of cancer status is explained. The output also contains both 2D and 3D plots of the multivariate scores (Figures 10.16 and 10.17) from the partial least squares analysis.
Figure 10.16: 2D Plots of row multivariate scores from the partial least squares analysis
Figure 10.17: 3D Plots of row scores (left) and column scores (right) from the partial least squares analysis
In both plots, the cancer samples are colored red and the normal samples are colored blue. The cancer samples are fairly well separated from the control samples and are more heterogeneous. PLS provides a good initial indication of the difficulty in discriminating groups. PLS can be more difficult to interpret than results from other processes because the prediction is a linear combination of all the variables.
Partition Trees
Partition trees provide an intuitive way to hierarchically split data in a way that best predicts a response.
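A single split of a partition tree can be sketched in Python as a search over cut points of one predictor for the cut that leaves the two resulting groups as pure as possible. The sketch uses Gini impurity as the purity measure, which is an assumption chosen for brevity and may differ from the criterion used by the JMP Partition platform; the data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best cut on one predictor."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best


# Hypothetical intensities of one peak for six samples.
values = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
labels = ["Nor", "Nor", "Nor", "CCD", "CCD", "CCD"]
threshold, impurity = best_split(values, labels)
```

Growing a tree repeats this search within each resulting group, across all candidate predictors, which produces the hierarchy of splits.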
Select Genomics > Predictive Modeling > Partition Trees. The Partition Trees dialog opens, as shown in Figure 10.18.
Figure 10.18: The Partition Trees dialog
Click Load.
Select the ProstateCancerExample settings.
The completed General tab of the Partition Trees dialog appears as shown in Figure 10.19.
Figure 10.19: The completed Partition Trees dialog
The Predictor Reduction 1, Predictor Reduction 2, Analysis, and Options tabs on the Partition Trees dialog are similar to those described for the Discriminant Analysis AP. You should examine the default settings for each tab. There is an additional Pruning tab.
Do not make any changes to the Predictor Reduction or Options tabs.
Click Analysis.
Examine the Mode.
Partition tree analysis can be carried out either in automated mode, in which partitions are generated using SAS programming, or in interactive mode, in which you create a partition tree yourself. By default, the mode is set to Automated. You can run the example using this mode or change the mode to Interactive.
Click Interactive to change the mode.
Click Run to launch the JMP Partition platform shown in Figure 10.20.
Figure 10.20: The JMP Partition platform
You can use this platform to interactively create a partition tree.
Click Split to generate a new branch on the tree.
Click Prune to remove the last branch added.
Clicking Split three times produces the tree shown in Figure 10.21.
Figure 10.21: The resulting partition tree
Refer to the JMP Statistics and Graphics Guide for details on how to use and interpret results.
Chapter 11
Annotation Analysis
The Annotation Analysis submenu provides a set of bioinformatic tools that can help scientists incorporate biological meaning with their statistical results. Users can access these tools through the JMP Genomics main menu as shown in Figure 11.1.
Figure 11.1: The Annotation Analysis submenu
Processes available under the Annotation Analysis submenu include:
• Create 0-1 Indicator for Selected Rows creates a new column in the active JMP table whose value equals 1 for all rows that are selected and 0 for those that are not selected. Such columns are needed for the subsequent Column Enrichment process.
• Venn Diagram is a JMP Scripting Language (JSL) script that allows you to examine and compare up to five variables in a data set using Venn diagrams to explore their similarities and differences and to identify observations of special interest. Refer to the JMP Genomics User Guide – Supplement for more details on this process.
• Create Web Link enables easy access to gene information, protein information, pathway information, and so on, that are stored in various biomedical databases such as GenBank, Gene, Pubmed, KEGG pathway, and Genome Map by creating a web link report based on your input annotation table.
• IPA Upload uploads statistical results directly from JMP Genomics to Ingenuity Pathway Analysis software. It creates an HTML form with a button that launches Ingenuity’s multiple observations analysis platform.
• KEGG Pathway Search searches the KEGG pathway database, enabling identification of the molecular interactions, reaction networks and functions that are relevant to genes of interest.
• KEGG Pathway Color colors KEGG pathways with statistical results enabling visualization and interpretation of these results in the context of pathways and biological systems.
• UCSC Genome Browser Link creates an HTML table with links to the UCSC Genome Browser based on locations, gene names or other parameters. This AP allows users to create a custom track for upload to the UCSC Genome Browser by specifying a quantitative variable of interest from an analysis performed in JMP Genomics. Refer to the JMP Genomics User Guide – Supplement for more details on this process.
• Affymetrix > Integrated Genome Browser creates a table with embedded hyperlinks to chromosomal features or locations within the Affymetrix Integrated Genome Browser. Refer to the JMP Genomics User Guide – Supplement for more details on this process.
• Affymetrix > Download NetAffx Files allows you to search out and retrieve annotation, library, map, or other accessory files used with Affymetrix arrays. These files, which are associated with different microarrays produced by Affymetrix, are often required for data analysis. Refer to the JMP Genomics User Guide – Supplement for more details on this process.
• Column Enrichment performs an enrichment analysis by comparing a binary significance column with a set of annotation categories to construct a set of unique categories, based on the annotation, and assigns individual genes to those categories.
• List Enrichment compares a set of curated lists (such as genes, proteins, or metabolites) against a table of significance values and then tests for significant enrichment using Fisher's exact test for association.
• Configure Proxy Settings resets the proxy server name and port number in the genomics.config file. If your computer accesses the Internet through a proxy server, you must specify the proxy server name and port number before JMP Genomics will access the Internet. If your computer does not access the Internet through a proxy server, do not change the default settings.
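Two of the ideas in the list above, the 0-1 indicator column and an enrichment test based on Fisher's exact test, can be sketched together in Python. The sketch computes a one-sided p-value (the probability of an overlap at least as large as the one observed) from the hypergeometric distribution. The one-sided form, the function names, and the example data are assumptions for illustration; the processes themselves may define the test differently.

```python
from math import comb


def selection_indicator(n_rows, selected_rows):
    """Return a 0/1 column with 1 for each selected row index."""
    selected = set(selected_rows)
    return [1 if row in selected else 0 for row in range(n_rows)]


def fisher_enrichment_p(significant, in_category):
    """One-sided Fisher exact p-value for enrichment of a category."""
    n = len(significant)
    n_significant = sum(significant)
    n_category = sum(in_category)
    observed = sum(1 for s, c in zip(significant, in_category) if s and c)
    upper = min(n_category, n_significant)
    # Sum hypergeometric probabilities for overlaps >= the observed overlap.
    favorable = sum(
        comb(n_category, x) * comb(n - n_category, n_significant - x)
        for x in range(observed, upper + 1)
    )
    return favorable / comb(n, n_significant)


# Hypothetical example: 10 genes, rows 0-2 selected as significant,
# and the same three genes annotated to the category of interest.
significant = selection_indicator(10, [0, 1, 2])
in_category = selection_indicator(10, [0, 1, 2])
p_value = fisher_enrichment_p(significant, in_category)
```

In this example all three significant genes fall in the category, so only one of the 120 possible ways to choose three genes out of ten gives an overlap that large.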
Annotation Data Sets
An Annotation Data Set contains biological or chemical information and properties about genes, SNPs, probes, probe sets, or peptides. This annotation information comes from various online bioinformatic resources, including government agencies, academic organizations and commercial entities. It is used to create a custom Annotation Data Set for your analysis.
The structure of the Annotation Data Set for JMP Genomics’ genetics processes differs from that of the microarray and proteomics processes. For genetics, each row in the Annotation Data Set represents a marker or SNP used in the analysis, with variables typically containing the following information: a name or identifier for each marker, the chromosome or candidate gene on which it is located, its location (in terms of kilobases or centiMorgans, for example), and an accession number that can be used to retrieve more information about the locus from a publicly available on-line database. Use this data set in the Create Web Link process to combine web links to the appropriate databases for all the markers into a single report, as shown in an example later in this chapter. This data set can also be specified on the Annotation tab found on most of the process dialogs, where the columns can be assigned to various roles:
• Annotation Label Variable: the name or ID variable that is used to label markers in the output
• Annotation Group Variable: the variable, such as chromosome, that can be used to group the analyses and output
• Annotation Location Variable: the variable containing marker locations to be used to accurately represent distances between markers in p-value plots
• Accession Number Variable: the variable containing the GenBank accession number or dbSNP reference sequence ID, for example, to be used to create buttons on p-value plots that provide direct access to the website for the selected marker from the appropriate on-line database
This tab also allows conditional inclusion of markers in your analysis based on particular values of variables from the Annotation Data Set. The criteria can be entered in the Annotation Where Clause in accordance with SAS syntax for WHERE statements. For the microarray and proteomics processes, the Annotation Data Set must contain a merge key variable whose values exactly match those of some variable in a tall data set. The structure of an Annotation Data Set can vary depending on the application and source(s) of the data. Table 11.1 lists information commonly contained in an Annotation Data Set. Keep in mind that different providers might name annotation information differently.
Table 11.1: Types of information commonly found in an Annotation Data Set
Items or Properties | Description
Probe or Probe Set ID | A unique identifier given to a probe or probe set in a probe array or microarray
GenBank Accession Number | An Accession Number is a unique identifier given to a biological polymer sequence (such as DNA or a protein) when it is submitted to a sequence database (GenBank, EMBL, DDBJ).
UniGene Cluster ID | A unique identifier given to a cluster of sequences in UniGene
Gene ID | A unique identifier assigned to a gene record in Entrez Gene. It is an integer and is species specific. For genomes that had been represented in LocusLink, the Gene ID is the same as the Locus ID.
Gene Symbol | A short-form abbreviation or symbol assigned to a gene by species-specific nomenclature committees. Each symbol is unique and each gene is only given one approved gene symbol.
Description | Description about a gene, probe, or probe set
Chromosomal Location | The physical location of a gene or a sequence on a chromosome
Ensembl ID | A unique identifier assigned to a sequence in Ensembl
Swiss-Prot ID | A unique identifier assigned to a protein sequence in Swiss-Prot, a curated protein sequence database that provides a high level of annotation (such as the description of protein function, domain structures, post-translational modifications, variants, etc.), a minimal level of redundancy, and significant integration with other databases
EC number | A number assigned to an enzyme according to a scheme of standardized enzyme nomenclature developed by the Enzyme Commission of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). The EC number is a unique identifier in ENZYME, the Enzyme nomenclature database, maintained at the ExPASy molecular biology server.
OMIM ID | A unique identifier assigned to a genetic disorder in the Online Mendelian Inheritance in Man. OMIM is a directory of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases.
dbSNP ID | A unique identifier assigned to a single nucleotide polymorphism when it is submitted to the SNP database. Also known as an 'rs' ID.
RefSeq Accession | A unique identifier given to a sequence in the NCBI RefSeq database. The RefSeq database is a curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, and entire chromosomes.
Gene Ontology ID | A unique alphanumerical identifier given to a GO term.
Genomic Location/Coordinate | A location assigned to a gene or a sequence at both the chromosome and sequence levels
Raw annotation data can come in a variety of formats. These include tab-delimited (.txt), comma-separated (.csv), or Excel (.xls) files. You can open any of these file formats in JMP; however, before an Annotation Data Set can be used in JMP Genomics processes, you must first save it as a SAS data set, with the suffix .sas7bdat. The Genomics > Data Set Creation > Import Individual Text, CSV, or Excel File process can also transform an annotation file into a SAS Annotation Data Set (.sas7bdat). When combining data from multiple sources, use the Tables > Join process in JMP to join two JMP tables into one, or the Genomics > Data Set Utilities > Data Merge process to join two SAS data sets. The following example demonstrates how to generate an input annotation data set in the required format.
Annotation Data Set Creation
This example generates an annotation data set for the Affymetrix Latin Square example data described in Chapter 1. The GeneChip® expression array used in the Latin Square experiment is the Human Genome U95 array, described in Chapter 8. The workflow for this process is as follows:
1. Create a separate directory for storing data and files.
2. Download the annotation file from the Affymetrix website, unzip it, and save the file in the directory you created.
3. Use the JMP Genomics data import function to generate the SAS data set.
Create a Separate Directory
Navigate to the ProcessResults folder.
Create a new folder.
Name the folder AnnotationData.
This folder is used for storing data and files.
Download the Annotation File
Go to the Affymetrix web site and browse to the technical support documentation for the Human Genome U95 Set.
At the time of printing, the URL for this web page is http://www.affymetrix.com/support/technical/byproduct.affx?product=hgu95.
Select the HG_U95Av2 Annotations, CSV format link (circled in Figure 11.2) from the list of annotation files.
Figure 11.2: Annotation files available from Affymetrix
Click the link to begin the download process. The File Download window, shown in Figure 11.3, opens.
Figure 11.3: The File Download window
Click Save (circled in Figure 11.3) to bring up the Save As window shown in Figure 11.4.
Figure 11.4: The Save As window
Navigate into the AnnotationData folder you just created and click Save (circled in Figure 11.4) to download the annotation file.
Unzip the downloaded HG_U95Av2.na23.annot.csv.zip file. The file opens in Excel, as shown in Figure 11.5.
Figure 11.5: A portion of the HG_U95Av2.na21.annot.csv file
Copy the following columns into a new Excel workbook:
Probe Set ID, Representative Public ID, UniGene ID, Gene Title, Chromosomal Location, Ensembl, Entrez Gene, SwissProt, EC, OMIM, RefSeq Protein ID, RefSeq Transcript ID, Gene Ontology Biological Process, Gene Ontology Cellular Component, Gene Ontology Molecular Function
Name the workbook my_HG_U95Av2_annot and save it in the AnnotationData folder.
Figure 11.6: A portion of the subset my_HG_U95Av2_annot.xls file
The size of the subset my_HG_U95Av2_annot.xls file is about one third the size of the original file. The column names provided by Affymetrix can be renamed to make them more descriptive. For example, the Representative Public ID column lists the GenBank Accession numbers in the Human Genome U95 Set’s annotation file, but it lists the FlyBase Accession number in the corresponding Drosophila Genome Array’s annotation file.
Rename the Representative Public ID column as Accession.
Some column values contain multiple entries that are separated by an entry delimiter. For example, values in the SwissProt column use three forward slashes (///) as the entry delimiter. Some column values contain entries that consist of both an identifier and a description. In these cases, the identifier and description are separated by an ID delimiter. For example, values in the Gene Ontology Biological Process column use two forward slashes (//) as the ID delimiter. These delimiters are commonly used in Affymetrix’s annotation files. Be aware that different annotation providers might use different entry and ID delimiters.
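The two delimiter levels described above can be illustrated with a short Python sketch. It is not part of JMP Genomics; the function names and the sample cell values are hypothetical and were made up for illustration. The only details taken from the text are the entry delimiter (///) and the ID delimiter (//).

```python
ENTRY_DELIMITER = "///"  # separates multiple entries within one cell
ID_DELIMITER = "//"      # separates the identifier from its description


def split_entries(value, entry_delim=ENTRY_DELIMITER):
    """Split a cell value into its individual entries."""
    return [part.strip() for part in value.split(entry_delim) if part.strip()]


def split_id_and_description(entry, id_delim=ID_DELIMITER):
    """Split one entry into (identifier, list of remaining description fields)."""
    fields = [field.strip() for field in entry.split(id_delim)]
    return fields[0], fields[1:]


# Hypothetical cell values, not taken from the Affymetrix file.
swissprot_cell = "P12345 /// Q67890"
go_cell = "0001 // example process A /// 0002 // example process B"

swissprot_ids = split_entries(swissprot_cell)
go_terms = [split_id_and_description(e) for e in split_entries(go_cell)]
```

The entry delimiter is applied first because /// contains //; splitting on the ID delimiter first would cut the entries in the wrong places.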
Generating the SAS Data Set
Select Genomics > Import > Text > Import Individual Text, CSV, or Excel Files, as shown in Figure 11.7.
Figure 11.7: Selecting the Import Individual Text, CSV, or Excel Files process
The Import Individual Text, CSV, or Excel Files dialog opens, as shown in Figure 11.8.
Figure 11.8: The Import Individual Text, CSV, or Excel Files dialog
To select the annotation file you just created, complete the following steps.
Click Choose to select the folder containing the input file.
Navigate to ProcessResults > AnnotationData.
Click OK to choose the folder.
All of the files contained in the AnnotationData folder are listed in the Available Files box in the dialog.
Select the my_HG_U95Av2_annot.csv.xls file.
Click to add the file to the Files to Import box.
You must indicate both the row in which the column names are listed and the first row containing data.
Examine the my_HG_U95Av2_annot.csv.xls file.
The column names are listed in row 1 and the data starts in row 2.
Type 1 in the Row Number of Variable Names [0, 10000] box.
Type 2 in the Data Start Row [0, 10000] box.
To select the output folder, complete the following steps.
Click Choose to select the output folder.
Navigate to ProcessResults > AnnotationData.
Click Select to select the folder.
The completed dialog appears as shown in Figure 11.9.
Figure 11.9: The completed Import Individual Text, CSV, or Excel Files dialog
Click Run to import the annotation file.
The SAS data set generated by this process is listed in a SAS Message dialog (Figure 11.10).
Figure 11.10: The SAS Message dialog
Click Open to examine the contents and structure of the my_hg_u95av2_annot.sas7bdat annotation data set shown in Figure 11.11.
Figure 11.11: A portion of the my_hg_u95av2_annot.sas7bdat annotation data set
Annotation Analysis Processes
Create Web Link
This example uses the annotation data set my_hg_u95av2_annot.sas7bdat, generated in the Annotation Data Set Creation example, to create a web link report.
Select Genomics > Annotation Analysis > Create Web Link. The Create Web Link dialog opens, as shown in Figure 11.12.
Figure 11.12: The General (left) and Options (right) tabs of the Create Web Link dialog
Make sure the General tab is selected.
To select the input data set, complete the following steps.
Click Choose to select the input file.
Navigate to ProcessResults > AnnotationData.
Select the my_HG_U95Av2_annot.sas7bdat file.
Click Open to select the file.
The column names from the input data set are listed in the Available Variables box. To specify the individual parameter variables for the analysis, complete the following steps.
Select Probe_Set_ID from the list of available variables.
Click to add Probe_Set_ID to the Probe Id box, as shown in Figure 11.13.
Figure 11.13: Specifying the Probe Id variable
Select Accession from the list of available variables.
Click to add Accession to the GenBank Accession box, as shown in Figure 11.14.
Figure 11.14: Specifying the GenBank Accession variable
Select Unigene_ID from the list of available variables.
Click to add Unigene_ID to the Unigene_Id box.
Select Gene_Title from the list of available variables.
Click to add Gene_Title to the Description box.
Select Entrez_Gene from the list of available variables.
Click to add Entrez_Gene to the Gene Id box.
Select Chromosome_Location from the list of available variables.
Click to add Chromosome_Location to the Chromosome Location box.
Select Ensembl from the list of available variables.
Click to add Ensembl to the Ensembl Id box.
Select SwissProt from the list of available variables.
Click to add SwissProt to the Swiss-Prot Id box.
Select EC from the list of available variables.
Click to add EC to the Enzyme Id (EC number) box.
Select OMIM from the list of available variables.
Click to add OMIM to the OMIM Id box.
Select both RefSeq_Protein_ID and RefSeq_Transcript_ID from the list of available variables.
Click to add both RefSeq_Protein_ID and RefSeq_Transcript_ID to the RefSeq Id box.
Select Gene_Ontology_Biological_Process, Gene_Ontology_Cellular_Component, and Gene_Ontology_Molecular_Function from the list of available variables.
Click to add Gene_Ontology_Biological_Process, Gene_Ontology_Cellular_Component and Gene_Ontology_Molecular_Function to the GO Id box.
Leave the Gene Symbol and dbSNP Id boxes blank.
Because the U95Av2 array contains human genome sequences, select Homo sapiens from the Organism pull-down menu.
To specify the U95Av2 array, select HG_U95Av2(Human_Genome_U95Av2_Array) in the Affymetrix GeneChip Array box.
To select the output folder, complete the following steps.
Click Choose to select the output folder.
Navigate to ProcessResults > AnnotationData.
Click Select to select the folder.
The completed General tab of the Create Web Link dialog appears as shown in Figure 11.15.
Figure 11.15: The completed General tab of the Create Web Link dialog
Click the Options tab.
The Options tab allows you to specify delimiters used in the annotation data set. The Entry and Entry ID Delimiters used by Affymetrix are entered by default. The name of the output file is optional. If left blank, JMP Genomics assigns a default name to the output file.
The options for generating links to the various databases are disabled (grayed out) by default. Each option is enabled when the variable it depends on is specified on the General tab. Checkboxes for enabled options are selected by default.
Note: Specifying a single variable on the General tab might enable more than one link option. For example, specifying the Gene Id variable enables both the Entrez Gene Link and the KEGG Gene Database Link.
The completed Options tab of the Create Web Link dialog appears as shown in Figure 11.16.
Figure 11.16: The completed Options tab of the Create Web Link dialog
Make no changes to the Options tab.
Click Run to generate a .html file containing the web links.
Figure 11.17: A portion of the .html file listing the web links for the data contained in the annotation file
Click on the links to explore the information available for each of the genes.
IPA Upload
This process creates either an .xls or .txt export file that can be uploaded to the Ingenuity Pathway Analysis system for contextual analysis of expression and/or functional data for a suite of genes under specific experimental conditions. Up to ten different experimental comparisons can be made for each analysis.
This example uses a normalized expression data set for the 100 genes in the Affymetrix Latin Square Example discussed in Chapter 1, under two experimental conditions. Two different expression statistics are used for the comparison: the simple difference in expression for each of the genes between the two conditions and the p-values for those differences.
Select Genomics > Annotation Analysis > IPA Upload. The IPA Upload dialog opens, as shown in Figure 11.18.
Figure 11.18: The IPA Upload dialog
To load the parameters for the Affymetrix Latin Square example, complete the following steps.
Click Load.
Select AffymetrixLatinSquareExample.
Click OK.
The completed IPA Upload dialog appears as shown in Figure 11.19.
Figure 11.19: The completed General tab
Examine the dialog. The affylatin_norm_amr.sas7bdat file, included in the Sample Data folder, has been selected as the input data set.
Click Open to examine this file.
The column labels in the input data set are listed in the Available Variables box of the dialog. The AffyID column, which contains the probe set IDs, is selected as the gene identifier.
Scroll down the list of available variables.
Variables beginning with the letter d represent differences in expression, between experiments, for individual genes. For example, the values in column da_b_ represent differences in gene expression between experiments a and b. Variables beginning with the letter p represent the –log10 p-values of those differences. For example, the values in column pa_b_ represent the –log10 p-values of the differences in gene expression between experiments a and b. Because the Ingenuity Pathway Analysis system requires p-values rather than –log10 p-values, any selected variables containing –log10 p-values must always be listed in the Negative Log10P-Value Variables box. The variables pa_b_ and da_b_ have been selected as the first and second expression values, respectively (Figure 11.20). Note that the type selected for each expression value matches that described above.
Figure 11.20: Selecting the first (left) and second (right) expression values
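The relationship between a –log10 p-value and the p-value that Ingenuity expects can be written as p = 10^(–x). The following Python sketch shows only that arithmetic; it does not reproduce how the IPA Upload process performs the conversion internally, and the function name and sample values are hypothetical.

```python
def neg_log10_to_p(neg_log10_p):
    """Convert a -log10 p-value back to a p-value: p = 10 ** (-x)."""
    return 10.0 ** (-neg_log10_p)


# Hypothetical -log10 p-values, such as might appear in a column like pa_b_.
neg_log10_values = [0.0, 1.0, 2.0]
p_values = [neg_log10_to_p(v) for v in neg_log10_values]
```

Larger –log10 values correspond to smaller p-values, so a value of 2 corresponds to p = 0.01.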
Click Run to generate an .html output file (Figure 11.21) that can be uploaded to Ingenuity.
Figure 11.21: The output .html file
Click Upload to IPA to upload the file to Ingenuity.
Note: You must have either an Ingenuity Pathway Analysis System license or trial package to run the analysis.
KEGG Pathway Search
The KEGG Pathway Search function allows users to identify the molecular interactions, reaction networks, and functions that are relevant to genes of interest. It searches the KEGG Pathway database by Entrez Gene Id, GenBank Accession, NCBI Protein GI Number, UniGene Cluster Id, UniProt Id, and OMIM Id. Finally, it generates a report listing the search results and links.
Note: This process might take a long time to run, depending on internet traffic, the number of genes specified, and the number of pathways found.
This example illustrates the search process using two human genes from the Affymetrix Latin Square example that show significant expression differences. The first of these genes, LocusLink ID #5787, encodes protein tyrosine phosphatase, receptor type B. The second, LocusLink ID #5602, encodes mitogen-activated protein kinase 10. The data for these and other significant genes are listed in the u95a_significant_differences.sas7bdat file included in the Sample Data folder.
Select Genomics > Annotation Analysis > KEGG Pathway Search.
The KEGG Pathway Search dialog opens, as shown in Figure 11.22.
Figure 11.22: The KEGG Pathway Search dialog
Type the LocusLink gene numbers 5787 and 5602 into the Gene/Protein Ids box, as shown in Figure 11.23.
Figure 11.23: The Gene/Protein Ids box
The gene identifiers can be entered on one or more lines. If more than one gene is entered on a line, the identifiers must be separated by a space.
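The entry rule above (one or more identifiers per line, separated by spaces) can be mimicked with a small Python sketch. This is only an illustration of the input format, not the parser that JMP Genomics uses; the function name is hypothetical.

```python
def parse_gene_ids(text):
    """Return identifiers from text holding space-separated IDs on one or more lines."""
    ids = []
    for line in text.splitlines():
        ids.extend(line.split())  # split() with no argument splits on whitespace
    return ids


# The two LocusLink IDs from this example, entered two different ways.
single_line = "5787 5602"
multi_line = "5787\n5602"

ids_from_single_line = parse_gene_ids(single_line)
ids_from_multi_line = parse_gene_ids(multi_line)
```

Both ways of entering the identifiers yield the same list.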
Note: The same gene can have different identifiers, depending on the species. For example, the Gene Id for the human gene A1BG, which encodes the alpha-1-B glycoprotein, is 1, whereas the Gene Ids for the mouse and rat homologs are 117586 and 140656, respectively. JMP Genomics supports the use of identifiers from different species.
Supported gene/protein identifiers include Entrez Gene ID, GenBank Accession, NCBI Protein GI Number, UniGene Cluster ID, UniProt ID, and OMIM ID. All of the gene/protein identifiers in an analysis must be of the same type (GenBank Accession numbers, for example). To indicate the identifier type, complete the following steps:
Click the downward arrow in the Type of Gene/Protein Ids box, as shown in Figure 11.24.
Figure 11.24: Selecting the identifier type
Select Entrez Gene Id from the drop-down menu.
The selected type appears as shown in Figure 11.23.
If your computer accesses the internet through a proxy server, you must indicate that in the dialog.
Click Yes if you use a proxy server to access the internet.
Specify the name of your proxy server before running either the KEGG Pathway Search process or the KEGG Pathway Color process on your computer for the first time. See Chapter 12 for further instructions on specifying the proxy server.
To select the output folder, complete the following steps.
Click Choose to select the output folder.
Navigate to ProcessResults > AnnotationData.
Click Select to select the folder.
The completed KEGG Pathway Search dialog appears as shown in Figure 11.25.
Figure 11.25: The completed KEGG Pathway Search dialog
Click Run to generate an .html report and two SAS data sets.
The report, shown in Figure 11.26, lists and provides links to information on all the metabolic/regulatory pathways involving each of the subject genes and their products and to information on other genes in those pathways.
Figure 11.26: A portion of the KEGG Pathway Search report
The output SAS data sets are listed in a SAS Message window, shown in Figure 11.27.
Figure 11.27: The SAS Message window
Click Open to examine each of the files.
The first column in the keggpathwaysearch.sas7bdat file (Figure 11.28) lists all the genes involved in the relevant pathways. Each subsequent column identifies one of the pathways relevant to the input genes. In these columns, a “1” indicates that the gene listed in the first column participates in the pathway, and a “0” indicates that it does not.
Figure 11.28: A portion of the keggpathwaysearch.sas7bdat file
The keggpathwaysearch_bypathwayid.sas7bdat file (Figure 11.29) lists the different pathways for each gene and other genes involved in those pathways.
Figure 11.29: The keggpathwaysearch_bypathwayid.sas7bdat file
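The layout of the two output data sets can be pictured as two views of the same gene-to-pathway membership. The Python sketch below builds a 0/1 indicator table of the kind described for keggpathwaysearch.sas7bdat from a membership mapping. The gene and pathway names are hypothetical placeholders, and the sketch makes no claim about how JMP Genomics constructs its tables.

```python
def build_indicator_table(gene_to_pathways):
    """Return (header, rows): one row per gene, one 0/1 column per pathway."""
    pathways = sorted({p for ps in gene_to_pathways.values() for p in ps})
    rows = []
    for gene in sorted(gene_to_pathways):
        members = set(gene_to_pathways[gene])
        rows.append([gene] + [1 if p in members else 0 for p in pathways])
    return ["gene"] + pathways, rows


# Hypothetical memberships, for illustration only.
membership = {"geneA": ["path01", "path02"], "geneB": ["path02"]}
header, rows = build_indicator_table(membership)
```

Here geneA receives a 1 in both pathway columns, and geneB receives a 0 for path01 and a 1 for path02.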
KEGG Pathway Color
The KEGG Pathway Color function allows users to visualize and interpret their statistical results in the context of pathways and biological systems. This process adds color to the gene nodes in the pathway diagrams, if the genes are found in the input data set. The colors are determined according to the values of one or more numeric variables that you specify. A report is generated to display the results and links.
Note: This process can take a long time to run, depending on internet traffic, the number of genes specified, and the number of pathways found.
This example illustrates the KEGG Pathway Color process with the Adherens junction pathway (hsa04520) identified in the example used to illustrate the KEGG Pathway Search process. This example uses the affylatin_norm_amr.sas7bdat file, included in the Sample Data folder, as the input data set.
Select Genomics > Annotation Analysis > KEGG Pathway Color. The KEGG Pathway Color dialog opens, as shown in Figure 11.30.
Figure 11.30: The KEGG Pathway Color dialog
KEGG Pathway IDs can be found using the KEGG Pathway Search process. Single or multiple pathways can be specified. Multiple pathways should either be entered on separate lines or, if entered on one line, be separated by a space. Pathway identifiers should be species specific.
To enter the KEGG Pathway ID for this example, type hsa04520 in the IDs of KEGG Pathways to be Colored box.
To select the input data set, complete the following steps.
Click Choose to select the input file.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Select the affylatin_norm_amr.sas7bdat file.
Click Open to select the file.
The column names of the input data set are listed as available variables. Select specific analysis variables from this list.
Select LocusLink from the list of available variables.
Click to add the variable to the Variables Containing Gene IDs box (Figure 11.31).
Figure 11.31: Selecting the variable containing the gene IDs
The pathways can be colored with one or more variables. If this box is left blank, all genes found in the input data set are colored red. Any numeric variables are valid for this selection, although they should all be of the same type (-log p-values or lsmeans, for example), because a common color scale is used for all variables.
Select lsmExperiment_a, lsmExperiment_g, lsmExperiment_j, and lsmExperiment_q from the list of available variables.
Click to add these variables to the Variables by Which to Color Pathways box (Figure 11.32).
Figure 11.32: Selecting the variables by which to color the pathways
If your computer accesses the internet through a proxy server, you must indicate that in the dialog.
Click Yes if you use a proxy server to access the internet.
Specify the proxy server name before running either the KEGG Pathway Search process or the KEGG Pathway Color process on your computer for the first time. See Chapter 12 for further instructions on specifying the proxy server.
To select the output folder, complete the following steps.
Click Choose to select the output folder.
Navigate to ProcessResults > AnnotationData.
Click Select to select the folder.
The completed General tab of the KEGG Pathway Color dialog appears as shown in Figure 11.33.
Figure 11.33: The completed KEGG Pathway Color dialog
The Options tab allows you to specify how the output is presented.
Click the Options tab.
Type the RGB number #86CDFF in the Low Color RGB box to specify the low end of the spectrum.
Type the RGB number #E3E4DA in the Middle Color RGB box to specify the midpoint of the spectrum.
Type the RGB number #FFB19F in the High Color RGB box to specify the high end of the spectrum.
Specify 0 and 100 as the percentiles to use as the lowest and highest color values, respectively.
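To give a feel for how a three-color spectrum can map values to colors, the Python sketch below blends between the low, middle, and high colors entered above. It assumes simple linear interpolation with the middle color at the midpoint of the value range; the guide does not state the exact scheme JMP Genomics uses, so treat this as one plausible illustration. With percentiles 0 and 100, the low and high values are taken here to be the minimum and maximum of the coloring variables.

```python
def hex_to_rgb(hex_color):
    """Convert '#RRGGBB' to an (r, g, b) tuple of integers."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of integers to '#RRGGBB'."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


def blend(c1, c2, t):
    """Linearly interpolate between two RGB tuples for t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c1, c2))


def value_to_color(value, low_value, high_value,
                   low="#86CDFF", mid="#E3E4DA", high="#FFB19F"):
    """Map a value to a color on the low-mid-high spectrum (assumed linear)."""
    if high_value == low_value:
        return mid
    t = (value - low_value) / (high_value - low_value)
    t = min(1.0, max(0.0, t))  # clamp values outside the range
    if t <= 0.5:
        return rgb_to_hex(blend(hex_to_rgb(low), hex_to_rgb(mid), t / 0.5))
    return rgb_to_hex(blend(hex_to_rgb(mid), hex_to_rgb(high), (t - 0.5) / 0.5))
```

For example, with a hypothetical value range of 0 to 10, a value of 0 maps to the low color, 5 to the middle color, and 10 to the high color.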
Do not specify either a title for the pathway results file or a name for the output file.
Click Run to run the KEGG Pathway Color process.
When the process is completed, the generated report (.html file) opens as shown in Figure 11.34.
Figure 11.34: The KEGG Pathway Color Results report
The information in the report includes the name and ID number of the colored pathway and the definition and ID number of the colored gene. The links include the web links to Entrez Gene and to the colored pathway map output files in the ProcessResults folder.
List Enrichment
The List Enrichment process compares a set of curated lists (of genes, proteins, or metabolites, for example) against a table of significance values and then tests for significant enrichment using Fisher's exact test for association. It generates a report on the results in .rtf, .pdf, or .html format. This example illustrates the List Enrichment process using the following data set and files:
• u95a_anov_amr.sas7bdat: This file contains a subset of the results data set from the Affymetrix Latin Square ANOVA analysis example and functions as the significance input data set.
• Example_List_Description_File.TXT: This list description file contains the names of the files containing ID lists to be compared with the Significance Input Data Set. Note: This table must have two columns with first-row headers Name and File. Name provides the names that are to appear in the output file, and File contains the file names, with extensions, of the files containing the list data. Each row of this table references a different list, and a Fisher exact test is computed for each. This file must be comma-separated, tab-delimited, or Excel, with the corresponding extension: .csv, .txt, or .xls.
• Interleukin_Receptors.TXT and Protein_Kinases.TXT: These files function as the list data files.
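The idea behind the enrichment test can be sketched in a few lines of Python. The sketch computes a one-sided Fisher exact p-value (the hypergeometric probability of seeing at least the observed number of significant IDs in a list). Whether the List Enrichment process uses a one-sided or two-sided test is not stated in this guide, so the one-sided form is an assumption, as are the function name and the made-up IDs and significance values.

```python
from math import comb


def fisher_enrichment_p(sig_in_list, list_size, sig_total, population):
    """One-sided Fisher exact p-value: P(X >= sig_in_list) under the hypergeometric null."""
    upper = min(list_size, sig_total)
    tail = sum(
        comb(list_size, k) * comb(population - list_size, sig_total - k)
        for k in range(sig_in_list, upper + 1)
    )
    return tail / comb(population, sig_total)


# Hypothetical -log10 p-values keyed by ID, and one hypothetical curated list.
significance = {"id1": 12.0, "id2": 3.0, "id3": 15.0, "id4": 0.5}
curated_list = {"id1", "id3"}
cutoff = 10  # IDs with a significance value greater than the cutoff are significant

significant = {i for i, v in significance.items() if v > cutoff}
in_population = curated_list & set(significance)
p_value = fisher_enrichment_p(
    sig_in_list=len(significant & in_population),
    list_size=len(in_population),
    sig_total=len(significant),
    population=len(significance),
)
```

In this toy case both list members are significant out of four IDs in total, which gives a p-value of 1/6. Real data sets have far more IDs, and one such test would be computed per curated list.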
Select Genomics > Annotation Analysis > List Enrichment. The List Enrichment dialog opens as shown in Figure 11.35.
Figure 11.35: The List Enrichment dialog
To select the significance input data set, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Select the u95a_anov_amr.sas7bdat file.
Click Open to select the file.
The column names of the input data set are listed as available variables. Select specific analysis variables from this list. The ID variable must identify entities (genes or proteins, for example) to be compared with the curated lists. The values of this variable must match values in the lists. Only one variable should be selected.
Select Probe_Set_ID from the list of available variables.
Click to add Probe_Set_ID to the ID Variable box, as shown in Figure 11.36.
Figure 11.36: Selecting the ID variable
The significance variable must contain the significance values of the ID Variable values. Values in the significance variable are typically −log10 p-values derived from prior analyses.
Select _Log10_p_value_for_Diff_of_Exp4 from the list of available variables.
Click to add _Log10_p_value_for_Diff_of_Exp4 to the Significance Variable box, as shown in Figure 11.37.
Figure 11.37: Selecting the significance variable
A value between 0 and 100, used to determine the significant difference cutoff, must be specified. For this example, a significance cutoff of 10 is specified. Observations with a Significance Variable greater than 10 are considered significant.
Type 10 in the Significance Cutoff [0,100] box.
To select the list description file, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Select All Files(*.*) from the Files of type drop-down menu.
Select the Example_List_Description_File.TXT file.
Click Open to select the file.
To select the folder of list files, complete the following steps.
Click Choose.
Navigate to Sample Data > Microarray > Affymetrix Latin Square.
Click Select to select the folder.
The default format for the output file is .rtf.
Do not change the output file type.
Leave the Output File Name field blank.
To select the output folder, complete the following steps.
Click Choose to select the output folder.
Navigate to ProcessResults > AnnotationData.
Click Select to select the folder.
The completed List Enrichment dialog appears as shown in Figure 11.38.
Figure 11.38: The List Enrichment dialog
Click Run to run the List Enrichment process.
When the process is completed, the generated report (.rtf file) opens as shown in Figure 11.39.
Figure 11.39: The List Enrichment report
Navigate to ProcessResults > AnnotationData.
All files created by the process are contained in this folder. Output files include the SAS program ListEnrichment_u95a_anov_amr.sas, the result file list_enrichment.rtf, several SAS data sets, and a SAS log.
Troubleshooting
CHAPTER 12
This troubleshooting guide may help you diagnose and resolve problems that arise when running JMP Genomics. If the problem persists after you have checked for solutions here, contact JMP Technical Support at [email protected].
Process Problem Suggested Cause/Resolution
Installation of JMP Genomics
The message: “Existing Client Found” is displayed in the Install Shield Wizard window, indicating that a preexisting copy of SAS has been found on a network server.
A pre-existing copy of SAS has been found that is configured to run as a thin Client from a network server. JMP Genomics will only work with a personal copy of SAS loaded on the same Client machine and configured to work locally. Contact JMP Technical Support ([email protected]) for instructions and assistance in resolving this problem.
A SAS log is displayed in your JMP Genomics session along with a message preceded by ERROR.
The generated SAS code might not complete successfully because of mis-specified parameters. Most of the error messages should be self-explanatory and provide some idea about what to do next. If not, examine the broader context provided by the SAS log to determine the problem. If this fails, consult and search the SAS documentation for the SAS code generating the error by clicking Help > SAS Documentation – Local or Help > SAS Documentation – Web. There is also the possibility of a bug in the SAS macro code. If you have found what appears to be a bug, please send the SAS log and explanation to [email protected]. Please describe your procedure in sufficient detail for us to reproduce the problem. If you are a SAS programmer, you might wish to view and even edit the original SAS code in the ProcessLibrary and/or MacroLib folders. Please also feel free to send suggested changes to the code to [email protected].
Any JMP Genomics process that utilizes one or more SAS programs
A WARNING dialog appears, telling you that SAS is connected and a process is already running.
JMP Genomics can only run one process at a time and does not queue jobs. Click OK in the dialog to wait, disregard the Run you just clicked, and let the current process continue running. Click View Log to view the current SAS log to get information on the current process. Click Disconnect SAS to stop the current process. If the SAS process does not stop in a short period of time, it is okay to kill the sas.exe process directly from Windows Task Manager, and then click Disconnect SAS again.
12 Troubleshooting 300
A process runs longer than expected or produces no output.
In this situation, perform the following steps:
1. Click Run again. A WARNING: SAS is Connected window should appear.
2. Click View Log. If any SAS ERROR messages appear, click Disconnect SAS and follow the steps in the first box of this guide. If not, proceed to the step below.
3. View the SAS log that is displayed in the JMP Log window to see the most recently executed code. You can continue to click View Log as many times as you like to check the status of the SAS program. Alternatively, you can monitor generated file activity in the SAS working folder. The location of this folder is specified in your SASV9.CFG file, which is located in <SAS Installation folder>\nls\en\ . The row beginning with –WORK indicates the folder. Open this folder, sort the files by Date Modified, and navigate into the most recent one. You should see various files being generated as the process runs. On Windows, press F5 to refresh the folder while you are monitoring it.
If these steps do not help, try running the process in the SAS 9.1 Display Manager as described below.
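The manual monitoring described in step 3 could also be scripted. The following Python sketch is a rough illustration that assumes the configuration file holds a line whose first token is -WORK (typed with a plain hyphen) followed by the folder, optionally in double quotes. The function names are hypothetical, and the value may contain symbols that SAS expands itself, which this sketch does not resolve.

```python
import os


def find_work_folder(cfg_path):
    """Return the text following -WORK in a SAS configuration file, or None."""
    with open(cfg_path, encoding="utf-8", errors="replace") as cfg:
        for line in cfg:
            parts = line.strip().split(None, 1)
            if len(parts) == 2 and parts[0].upper() == "-WORK":
                return parts[1].strip().strip('"')
    return None


def newest_entries(folder, count=5):
    """List the most recently modified entries in a folder, newest first."""
    paths = [os.path.join(folder, name) for name in os.listdir(folder)]
    paths.sort(key=os.path.getmtime, reverse=True)
    return paths[:count]
```

Calling newest_entries repeatedly on the most recent subfolder of the working folder plays the same role as pressing F5 in Windows Explorer.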
Any JMP Genomics process
Output of the process does not automatically open.
The output file name might contain one of the characters (, ), @, ^, or & anywhere in the name, or square brackets [] at the beginning of the name ([name], for example). If these characters are present, you can open the output by completing the following steps:
1. Navigate to the specified output folder.
2. Double-click on the sasclean.jsl script in the folder.
All of the output should open.
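A quick way to check an output name against the characters listed above is sketched below in Python. The function name is hypothetical, and the check reflects only the rule as stated in this table: the characters (, ), @, ^, and & anywhere in the name, or an opening square bracket at the start.

```python
PROBLEM_CHARACTERS = set("()@^&")


def may_block_auto_open(output_name):
    """Return True if the name matches the patterns listed in the table above."""
    if output_name.startswith("["):
        return True
    return any(ch in PROBLEM_CHARACTERS for ch in output_name)


flagged = [name for name in ["[name]", "results(1)", "results_1"]
           if may_block_auto_open(name)]
```

In this example, [name] and results(1) are flagged, and results_1 is not.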
Processes that perform repetitive computations
The SAS log gets truncated.
Processes that specify a lot of variables into one macro
The line length can become too long for SAS batch mode.
In either of these cases, an alternative way to debug the process is to open the .sas file in the SAS 9.1 Display Manager (right-click and select Open with SAS 9.1) and run it from there by pressing F3. The SAS Display Manager provides options for saving or deleting sections of long logs. On Windows operating systems, you can alternatively right-click on a .sas file and select Submit to SAS 9.1. SAS will then run in batch mode and produce .log and .lst files.
Processes using wide data sets composed of long lists of variables
Numerous ERROR messages are generated in the SAS log.
The SAS Macro text expression limit of 65534 bytes might have been exceeded. Workarounds for this situation include the following:
1. Recreate the data set or rename the variables to have the shortest possible names.
2. Modify the process specification to have list-style input for long lists of variables, such as Col1-Col20000.
3. Reduce the number of variables using K-Means Clustering, as follows. Transpose the data to tall form using Transpose Rectangular, run K-Means Clustering to generate a few thousand or fewer clusters, retain representatives from each cluster to use as the data, and then transpose back to wide form using Transpose Rectangular.
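To see why list-style input helps with the second workaround, the Python sketch below collapses runs of consecutively numbered variable names into ranges such as Col1-Col20000 and compares the lengths of the two forms against the 65534-byte limit. The helper is hypothetical and is not part of JMP Genomics. It assumes the names share a prefix, end in consecutive numbers without leading zeros, and contain only single-byte characters, so that the character count equals the byte count.

```python
import re

NUMBERED = re.compile(r"^(.*?)(\d+)$")


def to_list_style(names):
    """Collapse consecutive numbered names into Prefix1-PrefixN ranges."""
    runs = []  # each run is [prefix, first_number, last_number]; plain names use None
    for name in names:
        match = NUMBERED.match(name)
        # Names without a numeric suffix, or with leading zeros, are kept as they are.
        if not match or (len(match.group(2)) > 1 and match.group(2)[0] == "0"):
            runs.append([name, None, None])
            continue
        prefix, number = match.group(1), int(match.group(2))
        last = runs[-1] if runs else None
        if last and last[1] is not None and last[0] == prefix and last[2] + 1 == number:
            last[2] = number  # extend the current run
        else:
            runs.append([prefix, number, number])
    parts = []
    for prefix, first, last in runs:
        if first is None:
            parts.append(prefix)
        elif first == last:
            parts.append(f"{prefix}{first}")
        else:
            parts.append(f"{prefix}{first}-{prefix}{last}")
    return " ".join(parts)


names = [f"Col{i}" for i in range(1, 20001)]
explicit = " ".join(names)     # every name spelled out
compact = to_list_style(names)  # list-style form
exceeds_limit = len(explicit) > 65534
```

The explicit list of 20000 names is well over the limit, while the list-style form is a single short token.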
Opening a data file using the File > Open command in any JMP Genomics process
The column names listed in the Available Variables box of a dialog appear different than the original column names in the data set.
SAS employs two ways to name a column: the variable label and the variable name. When a file is opened using the File > Open command from the JMP menu, SAS variable labels will be displayed. These might differ from those displayed in the Available Variables list in the JMP Genomics process dialogs, which display SAS variable names for the available variables. To solve the problem, open the data file using the Open button on the process dialogs. This displays the table with names the same as those in the Available Variables lists. Alternatively, use the File > Open command from the JMP menu and, in the Open Data File dialog, change File of type to SAS Data Sets and click the Use SAS Variable Names for Column Names checkbox.
Changing the name of a column in a SAS data set in JMP
The new column name is not saved when you save the file as a SAS data set (.sas7bdat) using JMP’s File > Save as command.
In the Save JMP File As window, a Preserve SAS Formats and Variable Names check box becomes available when you select SAS V7 Dataset(*.SAS7BDAT) from the pull down menu. You must uncheck this box to save the new column name.
Agilent Import Engine
Running the process generates a long ERROR message along with a SAS Log and a SAS Message dialog indicating the successful generation of the SAS Data Set, EDDS and an Annotation data set.
The process has run successfully despite the appearance of the ERROR message. The likely cause of the ERROR message is the presence of non-numeric character strings in numerical columns. For example, Agilent places the string #IND in empty numeric cells to indicate missing values. When SAS imports the data from these files, it reports an error and replaces the character string with a period (.). Open the resulting data sets to verify they are as you intended. If so, you may safely ignore the ERROR message and proceed with the data analysis.
Bioconductor Expresso for Affymetrix Import Engine
An ERROR message is generated when you try to choose an input data set using the Universal/ Uniform Naming Convention (UNC).
The Bioconductor Expresso wrapper does not accept the Universal/Uniform Naming Convention (UNC) for describing the location of a volume, directory or file. The UNC format is (\\directory\subdirectory\file). To avoid using a UNC formatted path, do not begin navigating to the desired files/folders by clicking on the directories shown in the box on the left side of the Open Data File window, as this will format the resulting path in the UNC. Instead, begin navigating by clicking within the Look in: box at the top of the window. The format of the resulting path (C:\Directory\Subdirectory\file) is acceptable to the Bioconductor Expresso process.
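A path can be checked for the UNC form before it is passed to the process. The Python sketch below uses the simplest possible test, a leading double backslash; the function name is hypothetical, and the sketch is only an illustration of the distinction drawn above.

```python
def is_unc_path(path):
    """Return True when the path begins with two backslashes, as UNC paths do."""
    return path.startswith("\\\\")


unc_example = r"\\directory\subdirectory\file"
drive_example = r"C:\Directory\Subdirectory\file"
results = (is_unc_path(unc_example), is_unc_path(drive_example))
```

Only the first example is reported as a UNC path; the drive-letter form is the one described above as acceptable.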
Any input engine
An ERROR message is generated when you try to use an EDF generated by the EDF Builder and saved as a text file.
JMP's default Text Data File import setting for End of Field is Tab and Comma, and the default export setting for End of Field is Comma. If the EDF is saved as a .txt file and the fields end with commas instead of tabs, the input engines do not recognize the format of the EDF. For JMP Genomics, the default Import and Export settings should both be Tab. To change the preference, select either File > Set Genomics Preference or File > Preferences. Select Text Data Files from the list on the left side of the JMP: Preferences Settings dialog. Change the End of Field default from Comma to Tab in the Data Export box. (Note: you should recheck the preferences after making this change.) Rebuild the EDF. The JMP Genomics installation instructions describe additional preferences that should be changed.
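To confirm whether a saved EDF text file has this problem, it can help to look at how its header line is delimited. The following Python sketch is only a diagnostic aid under simple assumptions (a single header line, with tabs taking precedence over commas); the function name edf_delimiter and the column names are hypothetical. The fix remains changing the preference and rebuilding the EDF as described above.

```python
def edf_delimiter(header_line: str) -> str:
    """Report which End of Field character the header line appears to use."""
    # A tab anywhere in the header means the file is tab-delimited;
    # commas inside tab-delimited fields are then treated as data.
    if "\t" in header_line:
        return "tab"
    if "," in header_line:
        return "comma"
    return "unknown"


# Hypothetical header lines for illustration.
print(edf_delimiter("Array\tFile\tTreatment"))  # tab
print(edf_delimiter("Array,File,Treatment"))    # comma
```

A result of comma indicates that the export preference was still set to Comma when the EDF was saved.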
A SAS log is displayed in your JMP Genomics session along with a message preceded by ERROR or there are notes in the SAS log indicating Invalid data for particular variables.
When importing a file to a SAS data set, SAS determines the type of a variable (character or numeric) based on the first N observations, where N is the value provided in the Number of Rows to Scan parameter on the Options tab of most of the Import processes. If the first N observations are all numeric, the variable is defined as numeric, and an error occurs when SAS later attempts to read a character value in that column. Try increasing the value for N in the Options tab until you no longer see these notes in the log.
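The effect of the Number of Rows to Scan value can be illustrated with a simplified model. The Python sketch below is an assumption-laden imitation of the behavior described above, not the actual SAS import logic; the function names and the sample column are hypothetical.

```python
def is_numeric(value: str) -> bool:
    """Treat blanks and a lone period as missing; otherwise try a numeric parse."""
    if value.strip() in ("", "."):
        return True
    try:
        float(value)
        return True
    except ValueError:
        return False


def first_invalid_row(values, rows_to_scan):
    """Return the 1-based row of the first value that conflicts with a numeric
    type chosen from the first rows_to_scan values, or None if no conflict."""
    scanned = values[:rows_to_scan]
    if not all(is_numeric(v) for v in scanned):
        return None  # the column would be typed as character, so nothing conflicts
    for row, value in enumerate(values[rows_to_scan:], start=rows_to_scan + 1):
        if not is_numeric(value):
            return row
    return None


column = ["1.5", "2", "3", "#IND", "4"]  # hypothetical column contents
print(first_invalid_row(column, 3))  # 4: the character value lies beyond the scan
print(first_invalid_row(column, 4))  # None: the scan now sees the character value
```

In this model, raising the scan depth past the first non-numeric row removes the conflict, which mirrors the advice to increase N.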
Any Import process
The values in one or more columns are truncated.
When importing a file to a SAS data set, SAS determines the length of a variable (character or numeric) based on the first N observations, where N is the value provided in the Number of Rows to Scan parameter on the Options tab of most of the Import processes. When subsequent values are longer than those in the first N observations, SAS truncates them to the length determined from the first N observations. Try increasing the value for N in the Options tab.
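The truncation described above can be sketched in the same simplified way. This Python example assumes that the column width is simply the longest value among the first N rows; it illustrates the idea and is not the SAS implementation. The function name and sample values are hypothetical.

```python
def truncated_values(values, rows_to_scan):
    """List (row, original, truncated) for values longer than the width
    inferred from the first rows_to_scan values."""
    width = max((len(v) for v in values[:rows_to_scan]), default=0)
    return [
        (row, value, value[:width])
        for row, value in enumerate(values, start=1)
        if len(value) > width
    ]


names = ["abc", "de", "abcdefg"]  # hypothetical character column
print(truncated_values(names, 2))  # [(3, 'abcdefg', 'abc')]
print(truncated_values(names, 3))  # []
```

Scanning all three rows yields a width of seven characters, so nothing is cut off.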
Hierarchical Clustering
Heat map/dendrogram containing sample information is not correctly displayed when saved to a journal.
You have saved the heat map to a journal and closed JMP Genomics. When you open the journal, the sample information heat map displayed to the right of the main heat map does not display normal colors. The sample information has been saved to the output table. To see this information displayed correctly, make sure the data table is open before opening the journal.
An ERROR message is generated stating: You selected to use proxy server to access web, but did not specify proxy server name or port number. Please run Configure Proxy Settings to set the value.
The Proxy Server or Proxy Port number could have been incorrectly specified. Select File > Configure Proxy Settings or Genomics > Annotation Analysis > Configure Proxy Settings. Click and follow the instructions to identify your Proxy Server and Proxy Port Number. Make sure the correct name and number are entered in the dialog and click Run to configure your settings.
KEGG Pathway Search and KEGG Pathway Color
ERROR: KEGG throws RemoteException when searching pathways for hsa04520. Please refer to the Java log for further details.
The KEGG API server is either down, very busy, or the connection to the KEGG API server is denied. Retry the process at another time.
Create Web Link, KEGG Pathway Search, and KEGG Pathway Color
An ERROR message is generated stating: ERROR: Could not find class com/sas/genomics/annotation/ErrorMsgGetter at line 10557 column 222. Please ensure that the CLASSPATH is correct.
Check to see if any of the following jar files are missing from the <sasroot>\core\sasmisc directory (the default <sasroot> is C:\Program Files\SAS\SAS 9.1\):
axis.jar
axis-ant.jar
axis-schema.jar
commons-discovery.jar
commons-logging.jar
jaxrpc.jar
keggapi.jar
log4j-1.2.8.jar
log4j.properties
saaj.jar
wsdl4j.jar
sas.genomics.annotation.jar
If these files are missing, reinstall JMP Genomics. The install copies these jar files to: C:\Program Files\SAS\SAS 9.1\core\sasmisc\.
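The file check above can be automated. The following Python sketch takes the file names from the list above and reports which ones are absent from a given directory; the function name missing_files is hypothetical, and the directory must be supplied for your own installation.

```python
from pathlib import Path

# File names taken from the list in this troubleshooting entry.
REQUIRED_FILES = [
    "axis.jar", "axis-ant.jar", "axis-schema.jar", "commons-discovery.jar",
    "commons-logging.jar", "jaxrpc.jar", "keggapi.jar", "log4j-1.2.8.jar",
    "log4j.properties", "saaj.jar", "wsdl4j.jar", "sas.genomics.annotation.jar",
]


def missing_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are not present in directory."""
    folder = Path(directory)
    return [name for name in required if not (folder / name).is_file()]


# Example call, assuming the default <sasroot> given above:
# print(missing_files(r"C:\Program Files\SAS\SAS 9.1\core\sasmisc"))
```

An empty list means all of the files are present; any names returned indicate that a reinstall is needed.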
Create Web Link, KEGG Pathway Search, and KEGG Pathway Color
An ERROR message is generated stating: ERROR: Failed to find genomics.config file.
Check to see if the genomics.config file is missing from the <sasroot>\sds\sasmisc directory (the default <sasroot> is C:\Program Files\SAS\SAS 9.1\). If the config file is missing, reinstall JMP Genomics. The install copies this configuration file to: C:\Program Files\SAS\SAS 9.1\sds\sasmisc\.
KEGG Pathway Color
A black KEGG pathway map results when you click and open a pathway map-link in your KEGG Color Process result.
Upgrade the SAS private JRE 1.4.1 to SAS private JRE 1.4.2_09 (or JRE 1.4.2 and up) as follows:
1. Install the recommended Java JRE.
2. After installing the JRE, verify that it has been installed at the default destination (in C:\Program Files\Java\j2re1.4.2_09, for example).
3. Update the SASV9.CFG file. The typical location for this file is: C:\Program Files\SAS\SAS 9.1\nls\en\SASV9.cfg
4. Use a text editor to change the line
-Dsas.jre.home=C:\PROGRA~1\SAS\SHARED~1\JRE\14267D~1.1
to
-Dsas.jre.home=C:\PROGRA~1\Java\j2re1.4.2_09
5. Save the file.
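The edit in step 4 can also be expressed as a small text transformation. The Python sketch below works on the text of the configuration file and replaces the value on any line that begins with -Dsas.jre.home=; the function name set_jre_home and the surrounding file layout in the example are assumptions for illustration. Back up the original file before writing any changes.

```python
PREFIX = "-Dsas.jre.home="


def set_jre_home(config_text: str, new_home: str) -> str:
    """Return config_text with the -Dsas.jre.home= value replaced by new_home."""
    updated = []
    for line in config_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(PREFIX):
            # Keep the original indentation and swap only the path value.
            indent = line[: len(line) - len(stripped)]
            line = indent + PREFIX + new_home
        updated.append(line)
    return "\n".join(updated) + "\n"


# Hypothetical excerpt of a configuration file.
sample = "-JREOPTIONS=(\n  -Dsas.jre.home=C:\\OLD\\JRE\n)\n"
print(set_jre_home(sample, r"C:\PROGRA~1\Java\j2re1.4.2_09"))
```

Lines that do not start with the option are passed through unchanged, so the rest of the file is preserved.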
Partial Least Squares
An ERROR message is generated stating: Error: The model contains more than 32767 effects.
The message is generated whenever the data set contains more than 32,767 columns due to inherent limitations in SAS PROC PLS. Use Predictor Reduction or some other means to get the number of predictors below the upper bound.
Partial Least Squares Normalization
An ERROR message is generated stating: ERROR: PLS Normalization can be performed on a maximum of 32767 rows, and your data set has XXX.∗ You may wish to summarize, cluster, or subset your data.
The message is generated whenever the data set contains more than 32,767 columns due to inherent limitations in SAS PROC PLS. Use Predictor Reduction or some other means to get the number of predictors below the upper bound.
Workflow
Attempts to run a second AP or new Workflow fail. The JMP Log shows the following message: A second script is attempting to execute, possibly during a nested click event. It may be necessary to press Escape to terminate the previous script.
Press ESC to exit the JMP script.
∗ XXX represents some number greater than 32767.
12 Troubleshooting 306
References
Abecasis, G.R., W.O.C. Cookson, and L.R. Cardon. (2000). Pedigree tests of transmission disequilibrium.
European Journal of Human Genetics 8: 545-551. Allison, D.B. (1997). Transmission-disequilibrium tests for quantitative traits. American Journal of Human
Genetics 66: 279-292. Allison, D.B., M. Heo, et al. (1999) Sibling based tests of linkage and association for quantitative traits.
American Journal of Human Genetics 64: 1754-1764. Benjamini, Y. and Hochberg, Y. (1995). Controlling the False Discovery Rate: A practical and powerful
approach to multiple testing. Journal of the Royal Statistical Society, Series B 57: 289 - 300. Blangero, J., J.T. Williams and L. Almasy. (2001). Variance component methods for detecting complex trait
loci. in Genetic Dissection of Complex Traits, ed. D.C. Rao and M.A. Province, San Diego, CA: Academic Press, 151-181.
Carlson, C.C., M.A. Eberle, et al. (2004). Selecting a maximally informative set of single-nucleotide
polymorphisms for association analyses using linkage disequilibrium. American Journal of Human Genetics 74: 106-120.
Chu, T.-M., B. Weir, et al. (2002). A systematic statistical linear modeling approach to oligonucleotide array
experiments. Mathematical Biosciences 176: 35-51. Devlin, B. and Roeder, K. (1999). Genomic control for association studies. Biometrics 55: 997 – 1004 Dobbin, K. and R. Simon. (2002). Comparison of microarray designs for class comparison and class discovery.
Bioinformatics 8(11): 1438-1445. Dudoit, S., Y. H. Yang, et al. (2002). Statistical methods for identifying genes with differential expression in
replicate cDNA microarray experiments. Statistica Sinica 12: 111-140 Elston, R.C. and H.J. Cordell. (2001). Overview of model-free methods for linkage analysis. in Genetic
Dissection of Complex Traits, ed. D.C. Rao and M.A. Province, San Diego, CA: Academic Press, 135-150.
Haseman, J.K. and R.C. Elston. (1972). The investigation of linkage between a quantitative trait and a marker
locus. Behavior Genetics 2: 3-19. Hsieh, W. P., T.-M. Chu, et al. (2003). Who are those strangers in the Latin Square? in Methods of Microarray
Data Analysis III. K. E. Johnson and S. M. Lin. Boston/New York/Dordrecht/London, Kluwer Academic Publishers: 247 pp.
Jin, W., R. M. Riley, et al. (2001). The contributions of sex, genotype and age to transcriptional variance in
Drosophila melanogaster. Nature Genetics 29: 389-395. Kerr, M. K. and G. A. Churchill. (2001). Experimental design for gene expression microarrays. Biostatistics 2:
183-201.
References 308
Merchant, M. and S. R. Weinberger. (2000). Recent advancements in surface-enhanced laser
desorption/ionization-time of flight-mass spectrometry. Electrophoresis 21: 1164-1177. Monks, S.A. and N.L. Kaplan. (2000). Removing the sampling restrictions from family-based tests of
association for a quantitative-trait locus. American Journal of Human Genetics 66: 576-592. Price, A.L., N.J. Patterson, et al. (2006). Principal components analysis corrects for stratification in genome-
wide association studies. Nature Genetics 38: 904-909. Qu, Y., B.-L. Adam, et al. (2002). Boosted decision tree analysis of surface-enhanced laser
desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clinical Chemistry 48: 1835-1843.
Redon, R., et al. (2006) Global variation in copy number in the human genome. Nature 444: 444-454. Tuzun, E., A.J. Sharp, et al. (2005) Fine-scale structural variation of the human genome. Nature Genetics 37:
727–732. Wang, T. and R.C. Elston. (2004). A modified revisited Haseman-Elston method to further improve power.
Human Heredity 57: 109-116. Whittemore, A.S. and I-P. Tu. (1998). Simple, robust linkage tests for affected sibs. American Journal of
Human Genetics 62: 1228-1242. Wiggington, J.E., D.J. Cutler, and G.R. Abecasis, (2005) A note on exact tests of Hardy-Weinberg
equilibrium. Amer. J. of Hum. Gen. 76: 887-893.
Varambally, S., J. Yu, et al. (2005) Integrative genomic and proteomic analysis of prostate cancer reveals signatures of metastatic progression. Cancer Cell 8: 393-406.
Zaykin, D.V., P.H. Westfall, et al. (2002). Testing association of statistically inferred haplotypes with discrete
and continuous traits in samples of unrelated individuals. Human Heredity 53: 79-91.
Appendix
Table A.1: SAS Procedures Called by JMP Genomics Analytical Processes.
JMP Genomics AP | SAS PROCs Called Up
Experimental Design
Experimental Design Data Set Builder | TRANSPOSE, EXPORT
Experimental Design File Builder | none∗
Create Array Index | No SAS called; JSL only
Create ColumnName | No SAS called; JSL only
Create Row Index | No SAS called; JSL only
Check File Names | No SAS called; JSL only
Import Tutorials | No SAS called; JSL only
Import
Affymetrix
Expression CHP Wizard | The wizard generates a workflow of import, quality control, and ANOVA APs. See APs in the workflow for specific PROCs
Download NetAffx Files | No SAS called; JSL only
ARR File Parser | No SAS called; JSL only
Expression CEL | DATASETS, REGISTRY, IMPORT, SORT
Expression CHP | DATASETS, REGISTRY, IMPORT, SORT
SNP CEL | DATASETS, REGISTRY, IMPORT, SORT
SNP Chip | DATASETS, REGISTRY, IMPORT, SORT
CNAT | IMPORT, SORT
Export to CHP Format | none
Illumina
Expression | IMPORT, DATASETS, SORT
SNP | IMPORT, SORT, TRANSPOSE
Copy Number | SORT, IMPORT, TRANSPOSE, DATASETS, CONTENTS,
Other Expression
Agilent | DATASETS, REGISTRY, IMPORT, SORT
ArrayTrack | DATASETS, REGISTRY, IMPORT, SORT
Bioconductor Expresso for Affymetrix | none
GenePix | DATASETS, REGISTRY, IMPORT, SORT
QuantArray | DATASETS, REGISTRY, IMPORT, SORT
ScanAlyze | DATASETS, REGISTRY, IMPORT, SORT
Other Genetics
Arlequin | SORT
HapMap | IMPORT, TRANSPOSE
NEXUS | DATASETS, REGISTRY, IMPORT, SORT
Pedigree | IMPORT, SORT
Proteomics
ABI Analyst | DATASETS, REGISTRY, IMPORT, SORT
Text
Import Individual Text, CSV, or Excel Files | DATASETS, REGISTRY, IMPORT, SORT
Import a Designed Experiment from Text, CSV, or Excel Files | DATASETS, REGISTRY, IMPORT, SORT
JMP Genomics Import Tutorials | No SAS called; JSL only
∗ none – indicates that while the process calls SAS and uses SAS data step and macro code, no SAS PROCs are used.
Data Set Utilities
Column Contents | CONTENTS
Change Labels | none
Change Lengths | none
Rename | none
Reorder | none
Append | APPEND
Merge | SORT
Transpose Tall and Wide | MEANS, SORT, TRANSPOSE
Transpose Rectangular | SORT, TRANSPOSE
Unstack | none
Data Step | none
Merge and Transform | none
Rank Rows | RANK
Sort Rows | SORT
Statistics for Columns | SORT, SUMMARY
Statistics for Rows | none
Transform | none
Export | EXPORT
Genetics Data Set Utilities
Check Data Contents | CONTENTS, PRINT,
Subset/Reorder Genetics Data | none
Recode Genotypes | ALLELE, SORT, TRANSPOSE
Genetic Marker Statistics
Phenotype Summary | SORT, FREQ
Marker Properties | ALLELE, SORT, TRANSPOSE
Linkage Disequilibrium | ALLELE, SORT, SUMMARY, PRINT
LD tagSNP Selection | ALLELE, SORT, IML
Malecot LD Map | SORT, PRINT, DATASETS, NLMIXED, APPEND
Association Testing
Case-Control Association | CASECONTROL, PSMOOTH, SORT, PRINT
PCA for Population Stratification | STDIZE, DATASETS, SORT, IML, CORR, TRANSPOSE, APPEND, PRINCOMP
Marker-Trait Association | ALLELE, LOGISTIC, GLMMIX, PHREG, SORT, PRINT
SNP-Trait Association | MIXED, PHREG, LOGISTIC, TRANSPOSE, SORT, ALLELE, DATASETS
Quantitative TDT | ALLELE, FAMILY, PSMOOTH, MIXED, GLM, UNIVARIATE, MEANS, SORT, PRINT, IML
TDT | FAMILY, PSMOOTH, SORT, PRINT
SNP Interaction Selection (Experimental) | SORT, MEANS, TRANSPOSE, FREQ, CONTENTS, APPEND, STDIZE, FASTCLUS, GENESELECT, DATASETS, TTEST
Model-free Linkage
Affected Sib-Pair Tests | none
Haseman-Elston Regression | SORT, MIXED, PSMOOTH
Variance Components | SORT, MIXED, UNIVARIATE, IML, PRINT, PSMOOTH
Haplotype Analysis
Haplotype Estimation | HAPLOTYPE, PSMOOTH, SORT
Haplotype Trend Regression | HAPLOTYPE, LOGISTIC, REG, PHREG, SORT, PRINT, TRANSPOSE
htSNP Selection | HTSNP, PRINT, SORT
Copy Number
Distribution Analysis | KDE
Data Standardize | STDIZE
Correlation and Principal Components | CORR, FACTOR, PRINCOMP
Bin | MEANS
One-Way ANOVA | none
Bivariate One-Way ANOVA | SORT, CONTENTS,
Spectral Preprocessing
2D Bin | MEANS
2D Detrend | TRANSPOSE
2D Peakfind | IML, SORT, TRANSPOSE
2D Plot | TRANSPOSE
3D Align | KDE
3D Plot | none
Quality Control
Distribution Analysis | KDE
Correlation and Principal Components | CORR, FACTOR, PRINCOMP
Correlation and Grouped Scatterplots | none
Filter Intensities | UNIVARIATE, MEANS,
Feature Flagger | SQL
Effect Removal via PLS Normalization | No SAS called; JSL only
Missing Value Imputation | DATA STEP
Pseudo Image | MEANS, UNIVARIATE
Surface Summary | KDE, UNIVARIATE, MEANS, SORT, FORMAT, G3D
Normalization
ANOVA Normalization | MIXED
Data Standardize | STDIZE
Factor Analysis Normalization | FACTOR
Loess Normalization | LOESS, MEANS, SORT, DATASETS, APPEND
Mixed Model Normalization | MIXED, MEANS, SORT
Partial Least Squares Normalization | PLS, TRANSPOSE
Quantile Normalization | MEANS, SORT
Ratio Analysis | LOESS, MEANS, SORT, CONTENTS
Pattern Discovery
Hierarchical Clustering | TRANSPOSE
K-Means Clustering | FASTCLUS
Principal Components Analysis | PLS
Distance Matrix | DISTANCE, SORT
Multidimensional Scaling | MDS, SORT
Row-by-Row Modeling
One-Way ANOVA | none
ANOVA | MIXED
Mixed Model Analysis | MIXED, MULTTEST, MEANS, STDIZE, DATASETS, CONTENTS, SORT, TRANSPOSE, PRINT
Estimate Builder/Compare Means | MIXED, PRINT
Two-Way Plotter | DATASETS, SORT, TRANSPOSE, GPLOT, GCHART, GREPLAY
P-Value Adjustment | MULTTEST
P-Value Quantile Plotter | No SAS called; JSL only
Predictive Modeling
Recode Genotypes |
Transpose Tall and Wide | MEANS, SORT, TRANSPOSE
Discriminant Analysis | DISCRIM, TRANSPOSE
Distance Scoring | None
General Linear Model Selection | GLMSELECT
K Nearest Neighbors | DISCRIM
Logistic Regression | LOGISTIC
Partial Least Squares | PLS, GLMMOD, TRANSPOSE
Partition Trees | TRANSPOSE
Radial Basis Machine | GLIMMIX
Binary Response Effect Selection (Experimental) | SORT, MEANS, TRANSPOSE, FREQ, CONTENTS, APPEND, STDIZE, FASTCLUS, GENESELECT, DATASETS, TTEST
Cross Validation Model Selection | No SAS called; JSL only
Test Set Model Comparison | No SAS called; JSL only
Annotation Analysis
Create 0-1 Indicator for Select Rows | No SAS called; JSL only
Venn Diagram | No SAS called; JSL only
Create Web Link | SQL, EXPORT
IPA Upload | SQL, EXPORT
KEGG Pathway Search | SORT, MEANS, UNIVARIATE, TRANSPOSE
KEGG Pathway Color | SORT, EXPORT, TRANSPOSE
UCSC Genome Browser Link | MEANS, SORT
Affymetrix Integrated Genome Browser | MEANS, SORT
Download NetAffx Files | No SAS called; JSL only
Column Enrichment | GLMMOD, TRANSPOSE, SORT, MEANS, MULTTEST
List Enrichment | none
Configure Proxy Settings | none
Power and Sample Size
Mixed Model Power | MIXED
SNP Power | IML, SORT, PRINT
Workflow Builder
Clear Parameter Defaults | No SAS called; JSL only
Generate Dialogs from XML | none