JMP Genomics

Version 3.1

User Guide

“Creativity involves breaking out of established patterns in order to look at things in a different way.” Edward de Bono

JMP. A Business Unit of SAS

SAS Campus Drive

Cary, NC 27513

www.jmp.com

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. JMP® Genomics User Guide. Cary, NC: SAS Press.

JMP® Genomics User Guide

Copyright © 2007, SAS Institute Inc., Cary, NC, USA

All rights reserved. Produced in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice. Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227–19 Commercial Computer Software-Restricted Rights (June 1987). SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, August 2006

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hard-copy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.

JMP®, SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

Table of Contents

Chapter    Title
1          Introduction
2          Designing New Experiments
3          Creating Data Sets for Analysis in JMP Genomics
4          Data Set Utilities
5          Genetic Marker Case-Control Data
6          Genetic Marker Family or Pedigree Data
7          Microarray Case Study I: The Drosophila Aging Experiment
8          Microarray Case Study II: Affymetrix Latin Square Data
9          Proteomics Spectral Preprocessing: The Prostate Cancer Example
10         Predictive Modeling
11         Annotation Analysis
12         Troubleshooting
           References
           Appendix

Chapter 1

Introduction

Welcome to JMP Genomics, a powerful desktop software system for integrated statistical analysis of genetic marker, microarray, and spectral (proteomics and metabolomics, for example) data. The purpose of this manual is to provide you with informative examples of how to use JMP Genomics to extract the maximum amount of useful information from genomics data. You should be familiar with the terminology and technology associated with modern genomics analyses and standard JMP functionality. The JMP Introductory Guide provides information on getting started with JMP. This manual is organized as a set of tutorials. Follow along with the JMP Genomics software as you read the manual. The conventions illustrated in Table 1.1 are used throughout this manual.

Table 1.1: Text conventions

Symbol/Font/Style     Used to designate:
(symbol)              Instruction or task to be performed
A > B > C             Navigation path from A to B to C, used for paths through nested directories
A > B > C             Navigation path from A to B to C, used for paths through menus
Choose or Run         Buttons or commands
General or Options    Names of data tables, column headings, and other text generated by JMP are set in a different font
General or Options    Text to be typed by the user

This chapter provides an overview of the primary functional aspects of the JMP Genomics system, descriptions of some important differences between standard JMP functionality and JMP Genomics functionality, and descriptions of the included sample data sets.

Genomics Main Menu

JMP Genomics is a fully functional version of JMP plus a collection of analytical process dialogs in the Genomics main menu (Figure 1.1). It provides access to more than 100 analytical processes.

1 Introduction 2

Figure 1.1: The JMP Genomics main menu is organized into submenus

Some Important Differences Between JMP and JMP Genomics

JMP Genomics Dialogs

JMP Genomics dialogs function differently from standard JMP dialogs. Standard JMP dialogs invoke calculations in compiled code, whereas JMP Genomics dialogs generate a SAS program (with suffix .sas), execute it in the background, and then return results. The results typically consist of SAS data sets (also known as SAS data tables, with suffix .sas7bdat) along with a JMP scripting language file (with suffix .jsl) that automatically invokes standard JMP platforms. Small Java programs facilitate some of the calculations. The interaction between JMP, SAS, and Java can be depicted as follows:

Figure 1.2: Interaction between JMP Genomics, SAS and Java

Data Sets

An important distinction of most JMP Genomics dialogs is that they do not process open JMP data tables. Instead, they prompt you to specify one or more SAS data sets that have been created and saved in your file system. This characteristic enables you to work with very large data sets without having to open them as JMP data tables and to specify multiple SAS data sets in one process. The creation and use of JMP Genomics data sets is described more fully in Chapter 3.

Deciding Which Processes to Run

An initial challenge in using JMP Genomics is deciding which processes to run and in what order. The software does not provide detailed guidance on constructing a workflow, and there is a wide variety of possible workflow combinations depending upon your discovery objectives.

The Genomics menu organizes the JMP Genomics processes into groups. The groups are arranged in an order that is typically employed by bioinformaticians, statisticians, and data analysts. However, you are free to rearrange the menus to your liking. Refer to the JMP Genomics Programmers Guide for details on customizing menus.

JMP Genomics processes are modular, so they can be run in any order. Over time, you will develop expertise with the system and settle on favorite workflows. The sample case studies outlined in this manual illustrate some typical, frequently used workflows.

Running a Process

To run a JMP Genomics process, select the process from one of the JMP Genomics menus, specify the parameters on all tabbed panes in the process dialog, and then click Run. The following example, which invokes the ArrayTrackInput Engine, illustrates a typical JMP Genomics process.

Select Genomics > Import > Other Expression > ArrayTrack. The following dialog opens.

Figure 1.3: A typical JMP Genomics Dialog (callouts: description box, parameter panes, functional buttons)

Asterisks (*) are used to indicate required parameters.

Each dialog has three main sections: a description box, one or more tabbed parameter panes, and functional buttons (illustrated in Figure 1.3). The description box on the top of the dialog describes the purpose of the process. The tabbed panes are the main area to specify input parameters. The six functional buttons, common to all of the JMP Genomics dialogs, are described in Table 1.2.

Table 1.2: Functional buttons

Functional Button     Used to:
(button image)        Run the process using the specified parameters
(button image)        Save the specified parameters
(button image)        Load selected, saved parameters into the dialog
(button image)        Apply the specified parameters as default settings to all relevant JMP Genomics dialogs
(button image)        Clear all the parameter settings and return the dialog to its default state
(button image)        Cancel the process and close the dialog

Use these buttons to load, save, or clear specified parameters, run the process using the specified parameters, or apply those parameters to other JMP Genomics processes. There is a defined order to the specification of some parameters. Such parameters are disabled and grayed until their dependency requirements are fulfilled. Many processes contain multiple tabbed panes with numerous optional parameters. As you develop expertise with particular processes, make sure to investigate the often rich collection of parameters available.

Click to the right of any parameter entry field to obtain help about its specification.

The General tab for each dialog typically contains the most important parameters for the process. For example, most processes require specific types of input files or data sets and an output folder. For our example, we want to open the AT_exp2.txt file. This Experimental Design File, which contains information about the experiment, is needed to import raw data into JMP Genomics and is discussed more fully in Chapter 3.

Click Choose (circled in Figure 1.4).

Figure 1.4: Click Choose to select a file or folder

When you installed JMP Genomics, a folder named Sample Data was also installed. Navigate to this folder and then to a file named AT_exp2.txt by following the path Sample Data > Microarray > ArrayTrack.

Click on the AT_exp2.txt file.

Click Open to select the file (circled in Figure 1.5).

Figure 1.5: Click Open to select the file

The file is added to the dialog, as shown in Figure 1.6.

Figure 1.6: The Experimental Design File has been specified

Our next step is to select the folder containing the raw data files.

Click Choose (circled in Figure 1.7).

Figure 1.7: Click Choose to select a file or folder

Navigate to the Sample Data folder and then to a folder named ArrayTrack by following the path Sample Data > Microarray > ArrayTrack.

Click the Select button (circled in Figure 1.8) at the bottom of the Choose directory window. Note: to select a folder in JMP Genomics, you must first open the folder.

Figure 1.8: Selecting the ArrayTrack folder

The next step is to choose a folder in which to place and store output. You may choose any folder you like. For this example, select the ProcessResults folder that came with JMP Genomics.

Repeat the selection process to specify the Output Folder.

The completed dialog is shown in Figure 1.9.

Figure 1.9: The completed ArrayTrackImport Engine dialog

Once you have specified the parameters for a process, click Save to save the parameters for later recall, if needed.

Click Run to run the process. JMP Genomics dialogs generate and run a SAS program each time you click Run. Depending upon the size of your data sets and capacities of your computer, some analyses can take several minutes or, for very large and complex runs, several hours. While a program is running, the message SAS Connected is displayed in the JMP status bar located in the lower left corner of your JMP window (circled in Figure 1.10).

Figure 1.10: Display during Analysis

While a process is running, it is a good idea to monitor progress using an application that displays statistics such as CPU, memory, and disk usage, like the Windows Task Manager. This can be informative for troubleshooting a hung process. You can only run one process at a time. If you attempt to run a second process while another one is running, you are prompted to disconnect from SAS and stop the current process, to view the current SAS log, or to wait until it completes. The location of each SAS data set generated by your analysis is listed in a new window (shown in Figure 1.11). You can view each of the data sets by clicking Open.

Figure 1.11: The SAS Message generated by our analysis

Saving and Loading Settings

JMP Genomics dialogs allow you to save and load parameter settings. This enables you to save, recall, modify, and exchange analyses without having to re-enter specifications each time you run a process. You can save and load settings using the Save and Load buttons at the bottom of each dialog. Most of the processes in JMP Genomics come with one or more example settings that use the example data sets that come with the system. A good way to learn about a new process is to load one of the example settings, study its parameter values, run the process, and explore the results.

SAS Variable Names and Labels

Each variable/column in a SAS data set must have a unique name. SAS variable names must adhere to the following conventions:

1) The first character must be a letter (A, B, C, …) or underscore ( _ ).
2) Subsequent characters can be letters, numeric digits (0, 1, 2, …) or underscores ( _ ).
3) Blank spaces are not allowed.
4) Special characters, except for underscore, are not allowed.
5) Names must not exceed 32 characters.

SAS variable names are not case-sensitive. SAS variables can be either character or numeric. In either case, a fixed length is assigned to store each observation of that variable. Optionally, SAS variables can have a label. Labels have much less restrictive creation rules. For example, SAS labels can be up to 256 characters in length and can contain blanks and special characters. When JMP opens a SAS data set, it reads the labels (when they exist) and uses them as JMP data table column names. If you want information on the variable names and labels for a SAS data set, run the Column Contents process under the Data Set Utilities menu. There are other processes available for changing SAS variable names, labels, and lengths.
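The naming rules above can be checked before a data set is created. The following is a minimal Python sketch, not part of JMP Genomics; the function names `is_valid_sas_name` and `find_name_clashes` are hypothetical and introduced only for illustration.

```python
import re

# Pattern for the naming rules listed above: a letter or underscore first,
# then letters, digits, or underscores, for at most 32 characters in total.
SAS_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_sas_name(name):
    """Return True if `name` satisfies the five naming rules."""
    return SAS_NAME.fullmatch(name) is not None


def find_name_clashes(names):
    """Return pairs of names that collide when case is ignored.

    SAS variable names are not case-sensitive, so 'Array' and 'ARRAY'
    cannot both be used in one data set.
    """
    seen = {}
    clashes = []
    for name in names:
        key = name.upper()
        if key in seen:
            clashes.append((seen[key], name))
        else:
            seen[key] = name
    return clashes


print(is_valid_sas_name("_gene_01"))   # True
print(is_valid_sas_name("gene id"))    # False: contains a blank
print(find_name_clashes(["Array", "ARRAY", "File"]))  # [('Array', 'ARRAY')]
```

A check like this is most useful for column headings taken from spreadsheets, where blanks and special characters are common.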

Sample Case Studies

The data sets included with JMP Genomics, which are detailed below, allow you to work through many of the analytical processes in JMP Genomics. In addition to the data sets, each case study includes experimental design files and other needed files. These case studies are referred to throughout this manual.

Drosophila Aging Experimental Data

This data set represents a small subset of the Drosophila aging experiment data from Jin, Riley et al. (2001). The experiment consisted of 24 two-color cDNA microarrays, 6 for each experimental combination of 2 lines (Oregon and Samarkand), 2 sexes (Female and Male), and 2 ages (1 week and 6 weeks). The Cy3 and Cy5 dyes were flipped for two of the 6 replicates for each genotype and sex combination. The design is a split-plot, with Age and Dye as subplot factors, and Line and Sex as whole-plot factors. A total of 4256 clones were spotted on the arrays, but this example uses a subset containing 100 randomly selected genes from the original data set.

Affymetrix Latin Square Data

The spike-in data set used in this example was originally generated by Affymetrix Corporation to develop and validate their U95A GeneChip and Microarray Suite (MAS) 5.0 algorithm over a range of known concentrations (Affymetrix, 2001). The experiment consists of 59 arrays. There are 14 experimental groups, designated with the letters a, b, c, d, e, f, g, h, i, j, k, l, m, and q. (Group m and group q each have 4 within-chip replicates. Group m replicates were originally designated n, o, and p, and group q replicates were originally designated r, s, and t. The extra letters are not needed because they are replicates of m and q, respectively.) Each experiment was repeated in triplicate using Affymetrix chips cut from different wafers. The last four digits of the wafer numbers are 1521, 1532, and 2353. Wafer 2353, chip c was defective, so it is not included in the data set. For wafers 1521 and 1532, 20 .CEL files were generated, and for wafer 2353, 19 .CEL files were generated.

Each group contains a pool of nonspecific RNA as well as a set of 14 distinct human transcripts spiked in at known concentrations of 0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 pM. The spike-in concentrations for each pool are staggered in a Latin square design. For purposes of rapid demonstration, the data have been trimmed to only 100 genes (including the 14 spike-ins), and trimmed versions of .CEL files containing just these 100 genes are available in your JMP Genomics Sample Data folder.
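To picture how 14 concentrations can be staggered across 14 groups, the Python sketch below builds a cyclic Latin square and re-derives the array count. It is an illustration only: the cyclic shift is an assumption, and the actual assignment of concentrations to transcripts and groups should be taken from the Affymetrix documentation.

```python
# Concentrations (pM) listed in the text, one per spiked-in transcript per group.
CONCENTRATIONS = [0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
GROUPS = list("abcdefghijklm") + ["q"]  # the 14 experimental groups


def cyclic_latin_square(levels):
    """Row g, column t gives the level for transcript t in group g.

    ASSUMPTION: each group shifts the list of concentrations by one position.
    """
    n = len(levels)
    return [[levels[(g + t) % n] for t in range(n)] for g in range(n)]


square = cyclic_latin_square(CONCENTRATIONS)

# Latin square property: every level appears once in each row and each column.
assert all(sorted(row) == sorted(CONCENTRATIONS) for row in square)
assert all(sorted(col) == sorted(CONCENTRATIONS) for col in zip(*square))

# Array count: 12 groups with one chip each, plus groups m and q with 4 replicates each.
chips_per_wafer = 12 * 1 + 2 * 4          # 20 .CEL files for a complete wafer
total_arrays = 3 * chips_per_wafer - 1    # wafer 2353, chip c is excluded
print(total_arrays)  # 59
```

Because every transcript is observed at every concentration exactly once across the groups, estimated expression can be compared against a known truth over the full concentration range.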

Prostate Cancer Biomarkers

This data set was obtained by surface-enhanced laser desorption/ionization (SELDI). This method allows an investigator to detect and resolve multiple proteins bound to protein chip arrays (Merchant and Weinberger 2000). This approach was used by Qu et al. (2002) to discriminate prostate cancer from non-prostate cancer patients. The promise of this approach is that a panel of multiple biomarkers can be used to distinguish important phenotypes such as cancer status; however, great care must be taken to pre-process and analyze the data appropriately to ensure generalizability of results. The example data set consists of serum samples collected from 165 men, 84 of whom had prostate cancer. The remaining 81 men are considered to be controls. The primary goal is to determine differences in protein expression between these groups.

Sample Genetic Marker Data

These data are computer-simulated. The data are in wide form, with the 1000 rows corresponding to individuals and 130 columns corresponding to various data on these individuals. These data contain family, genotype, and phenotype information. The disease column contains the binary trait of primary interest, with 1 indicating individuals affected with the disease and 0 indicating unaffected individuals. There are also four quantitative traits and sixty markers, with two possible alleles (designated 1 and 2) per marker for each individual. The marker data occur in pairs, so that columns ma1 and ma2 make up the genotype at the first marker, columns ma3 and ma4 make up the genotype at the second marker, and so on. The analyses performed on this data set aim to locate the gene or genes that affect susceptibility to this disease.

Accompanying this data set is a map data set that provides information about the 60 markers, which are spread across two hypothetical candidate gene regions. The variable indicating the candidate gene on which each marker resides can be used to group analyses, and the Location variable is useful for accurately displaying distances in base pairs between markers along the x-axis of plots containing various association p-values.

Affected Sib-Pair (ASP) Data

Two hundred families, each containing an affected sib-pair and the siblings’ parents, were genotyped at 20 markers from a single chromosome in simulated data provided by Gonçalo Abecasis at the University of Michigan Center for Statistical Genetics. MERLIN was used to estimate identical-by-descent (IBD) allele-sharing probabilities at these markers for all pairs of related individuals. The 400 offspring are also measured for a quantitative trait of interest.
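For the sample genetic marker data described above, the paired allele columns can be combined into one genotype per marker. The Python sketch below shows the column arithmetic only; the column names ma1 to ma4 come from the text, while the function name `pair_alleles` and the example individual are made up for illustration.

```python
def pair_alleles(row, n_markers, prefix="ma"):
    """Combine allele columns (ma1, ma2), (ma3, ma4), ... into one genotype per marker.

    `row` maps column names to allele codes (1 or 2). Alleles are sorted so that
    1/2 and 2/1 are reported as the same unordered genotype.
    """
    genotypes = []
    for m in range(1, n_markers + 1):
        first = row[f"{prefix}{2 * m - 1}"]   # odd-numbered column of the pair
        second = row[f"{prefix}{2 * m}"]      # even-numbered column of the pair
        a, b = sorted((first, second))
        genotypes.append(f"{a}/{b}")
    return genotypes


# Made-up individual with two markers, for illustration only.
individual = {"disease": 1, "ma1": 2, "ma2": 1, "ma3": 2, "ma4": 2}
print(pair_alleles(individual, n_markers=2))  # ['1/2', '2/2']
```

With 60 markers, the same indexing runs from columns ma1 and ma2 for the first marker through ma119 and ma120 for the last.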

Chapter 2

Designing New Experiments

Designed experiments are at the very heart of scientific discovery. However, a lot of scientific experimentation is conducted in a haphazard fashion. Because of this, many experiments are less efficient and less informative merely because of a lack of planning. Many designs have a large degree of confounding, in which the effects of two or more factors or their interactions are indistinguishable. If an experiment has too much confounding, it can lead to inconclusive or potentially misleading results. Taking a little time to properly design each experiment avoids confounding and leads to maximal information gain for the research costs incurred. This chapter guides you through features in JMP Genomics that plan efficient experiments. Our starting point is the JMP DOE (Design of Experiments) menu found in the main menu bar.

Figure 2.1: DOE Main Menu

JMP offers a wide range of DOE functionality, including classical designs. For scientific discovery and genomics purposes, we focus only on the first item in this menu: Custom Design (Figure 2.1). For in-depth details and background on all items in this menu, refer to the JMP Design of Experiments guide.

Note: Some of the terminology in the JMP Design of Experiments guide is derived from the statistical and engineering literature, which chronicles a long, rich, and successful history of highly efficient experimental designs. Many of the best designs are not widely known or utilized in genomics research, but JMP enables you to rapidly find and customize them for your laboratory’s needs.

Example: A Two-Way Design for Single Channel Instrumentation

This example uses 12 biological samples to study the effects of a chemical agent versus a chemical control. The study examines the expression of a large set of genes, proteins, or metabolites at 1 hour, 6 hours, and 24 hours after dosing the samples with the chemical. Because of the destructive nature of the expression protocol, each sample can be treated with only one chemical and observed at only one time. Expression is measured with a single-channel instrument, which excludes two-channel microarrays, considered later in this chapter. A standard two-way design is appropriate in this case.


The JMP Custom Design platform allows you to interactively create designs of any complexity, but let’s begin with this simple case.

Select DOE > Custom Design. The dialog illustrated in Figure 2.2 appears.

Figure 2.2: JMP’s Custom Design Dialog

The two main fields of this Custom Design dialog allow entry of responses and factors. Responses are the numerical measurements taken during the experiment. In genomics research, thousands of responses are collected simultaneously, so JMP Genomics has special conventions for loading large response data files. These conventions are explained later.

For now, leave this field as it is, with response variable Y.

Factors are the variables that are controlled during the experiment. They are the effects of interest. For our two-way experiment, we have two factors: Treatment and Time. Treatment has two levels, Agent and Control. Time has three levels, 1h, 6h and 24h.

To add these factors to the design, select Add Factor > Categorical > 2 Level, as shown in Figure 2.3.

Figure 2.3: Adding a categorical factor


Double-click on X1 (Figure 2.4) and change its name to Treatment.

Under the Values column, click on L1 and change it to Agent.

Press the Tab key to L2 and change it to Control.

Figure 2.4: Specifying the first factor

To add the second factor, select Add Factor > Categorical > 3 Level.

Double-click on X2 and change it to Time.

Under the Values column, click on L1 and change it to 01h.

Tab to L2, change it to 06h.

Tab to L3 and change it to 24h.

Tip: Use zero-padding to code numerically-based values with varying lengths so that alphabetical sorting order matches numerical order during later analytical processing in SAS.
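The effect of zero-padding on sort order can be seen with a short Python sketch. The helper `zero_pad` is hypothetical and assumes labels that start with digits followed by a unit suffix, as in this example.

```python
times = ["1h", "6h", "24h"]

# Plain alphabetical sorting compares character by character,
# so "24h" lands before "6h".
print(sorted(times))  # ['1h', '24h', '6h']


def zero_pad(label, width=2):
    """Left-pad the leading numeric part of a label such as '6h' with zeros."""
    digits = "".join(ch for ch in label if ch.isdigit())
    suffix = label[len(digits):]
    return digits.zfill(width) + suffix


padded = [zero_pad(t) for t in times]

# With padding, alphabetical order matches numerical order.
print(sorted(padded))  # ['01h', '06h', '24h']
```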

Note: You could optionally define Time as a Continuous factor, if you plan to directly model linear or quadratic trends over time. For this example, we define Time as categorical in order to allow each time level to have an arbitrary mean response.

The Factors section is shown in Figure 2.5.

Figure 2.5: Specific factors


Click Continue to proceed to the next design step, shown in Figure 2.6.

Figure 2.6: JMP’s DOE Custom Design Dialog Window (part II)

There are no constraints, so skip the Define Factor Constraints section. The Model section lets you specify the terms to be estimated, including interactions between Treatment and Time. Note the Design Generation section at the bottom of Figure 2.6: with no interaction terms specified, the default number of runs is 6.

Click Interactions > 2nd, as shown in Figure 2.7.

Figure 2.7: Adding Two-Level Interactions.


A Treatment*Time row is added to the model, as seen in Figure 2.8.

Figure 2.8: The Complete Model

In the Design Generation section (Figure 2.9), note the default number of runs has changed to 12. A run is one specific combination of factors applied to obtain one set of responses.

Figure 2.9: Design Generation

Since there are 12 samples budgeted for the runs, leave this field as is and click Make Design to generate the design shown in Figure 2.10.


Figure 2.10: Custom Design Dialog Window (part III)

Note in the Design section that the 12 runs are listed sequentially. Whenever possible, it is always a good idea to randomize the order in which you collect experimental data. This helps avoid any unwanted trends that may creep into the data over time. If you are unable to randomize the order of one or more factors, you should consider more complex designs such as Randomized Block or Split-Plot designs, described later in this chapter. To do this randomization, complete the following steps.

In the Output Options box (Figure 2.11), leave Run Order as Randomize.

Figure 2.11: The Output Options Box

Do not change the Number of Replicates, since we have all 12 available samples.

Click Make Table to obtain the table shown in Figure 2.12.


Figure 2.12: Experimental Design Table

The treatment applied to each biological sample and the time at which each sample is observed are both listed in Figure 2.12. Note that your run order (1-12) may differ from the order in Figure 2.12 because the design is generated with a random number generator.

This is an example of a completely randomized design. The levels of both Treatment and Time are arranged in a random order. When collecting expression data on only one gene, protein, or metabolite, simply enter the data in the Y column and then analyze them directly in JMP using any number of different methods. But to work with thousands of expression measurements simultaneously, JMP Genomics requires you to construct a table like this one as a way to link the experimental design information to a collection of raw response data files, each of which contains thousands of measurements. Construction of this table, known as an Experimental Design File, requires adding two columns to this table, called File and Array, that are described more fully in Chapter 3. The File column lists the names of the raw data files containing the expression measurements corresponding to the factor levels for the run in its same row. The Array column contains a unique index for each array in the experiment.
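As a rough picture of what adding the File and Array columns involves, the Python sketch below extends a few design rows and writes them as delimited text. The file names (`sample_01.txt` and so on), the function name `add_file_and_array`, and the tab delimiter are all assumptions made for illustration; Chapter 3 describes the actual requirements.

```python
import csv
import io

# A few runs from a design table (Treatment and Time levels as in this example).
runs = [
    {"Treatment": "Agent", "Time": "06h"},
    {"Treatment": "Control", "Time": "24h"},
    {"Treatment": "Agent", "Time": "01h"},
]


def add_file_and_array(runs, file_pattern="sample_{:02d}.txt"):
    """Return copies of the rows with hypothetical File and Array columns added."""
    rows = []
    for index, run in enumerate(runs, start=1):
        row = dict(run)
        row["File"] = file_pattern.format(index)  # raw data file for this run
        row["Array"] = index                      # unique index for each array
        rows.append(row)
    return rows


rows = add_file_and_array(runs)

# Write the table as tab-delimited text (the delimiter is an assumption).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Treatment", "Time", "File", "Array"],
    delimiter="\t",
    lineterminator="\n",
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The point of the sketch is the one-to-one link: each row of the design carries the name of exactly one raw data file and one array index.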

For now, the table is ready to use in the lab to run the design in the random order specified.
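The structure of this completely randomized design can also be sketched in Python: replicate the six Treatment by Time combinations twice and shuffle the run order. This is not how the JMP Custom Design platform constructs designs; it is a simplified stand-in, and the seed value is arbitrary.

```python
import itertools
import random

TREATMENTS = ["Agent", "Control"]
TIMES = ["01h", "06h", "24h"]
REPLICATES = 2  # 6 factor combinations x 2 replicates = 12 runs


def randomized_two_way_design(seed=None):
    """Return the 12 runs as (run, treatment, time) tuples in random order."""
    rng = random.Random(seed)
    combos = list(itertools.product(TREATMENTS, TIMES)) * REPLICATES
    rng.shuffle(combos)  # randomize the order in which the runs are performed
    return [(run, trt, time) for run, (trt, time) in enumerate(combos, start=1)]


design = randomized_two_way_design(seed=2007)  # arbitrary seed for reproducibility
for run, trt, time in design:
    print(run, trt, time)
```

A different seed gives a different run order, which mirrors the note that your table may not match the one in the figure.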

Blocking Factors

Experimental designs are often difficult to conduct in a completely randomized fashion because of the presence of one or more additional factors that can induce correlation in the observed responses. In these situations, define one or more blocking factors to better control unwanted experimental variation. Examples of blocking factors include: batch, animal, day of processing, technology lot number, machine, location, laboratory, technician, or operator. Blocking factors are typically considered random because they can be viewed as arising from a population of effects having a probability distribution, usually a normal distribution.

To continue the two-way design example, suppose that the 12 samples are not totally independent, but that 3 samples each were taken from 4 distinct batches. The batches could consist of any number of things, including the day of initial sample collection or the mode of processing them. In this case, it is important to control for the effect of batches on the experimental outcomes. To do this, add Batch as a blocking factor to the design. Defining such a blocking factor lets you model a correlation between samples from the same batch and provides a more accurate assessment of true batch-to-batch variability. Ignoring the batch effect when it is significant can lead to biased conclusions about expression differences.


To add Batch to the previous design, complete the following steps.

Begin a new design. Select DOE > Custom Design.

Define Treatment and Time factors, as previously described.

Click Add Factor > Blocking > 3 runs per block. Double-click on X3 and change it to Batch. The Factors section should now appear as shown in Figure 2.13.

Figure 2.13: Design with a blocking factor

Click Continue to specify which terms need to be modeled.

Click Interactions > 2nd.

Click Continue in any message windows.

Click Make Design to make the design shown in Figure 2.14.


Figure 2.14: Custom Design Dialog Window with 1 blocking factor

Note the Batch factor has four levels, with three runs for each level.

Click Make Table to obtain the table shown in Figure 2.15.

Figure 2.15: Experimental Design Table (with 1 blocking factor)


This is an example of an Incomplete Block Design. The blocks corresponding to Batch are incomplete because not all combinations of treatment and time are observed within a block; however, there is a form of partial balance in the experiment, because each unique combination of treatment and time is observed exactly twice across the whole experiment.

Tip: Good designs often have some form of balance in terms of number of treatment combinations observed. Balancing the number of factor levels helps break confounding among factors and ensures approximately equal information gain on all relevant differences.
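The balance described here can be verified by counting. The Python sketch below uses a hypothetical assignment of 12 runs to 4 batches; it is not the table JMP produces and is not claimed to be an optimal design. It only shows the two counts worth checking in any blocked design table.

```python
from collections import Counter

# HYPOTHETICAL assignment of 12 runs to 4 batches of 3 (batch, treatment, time).
design = [
    (1, "Agent", "01h"), (1, "Control", "06h"), (1, "Agent", "24h"),
    (2, "Control", "01h"), (2, "Agent", "06h"), (2, "Control", "24h"),
    (3, "Agent", "01h"), (3, "Agent", "06h"), (3, "Control", "24h"),
    (4, "Control", "01h"), (4, "Control", "06h"), (4, "Agent", "24h"),
]

runs_per_batch = Counter(batch for batch, _, _ in design)
runs_per_combo = Counter((trt, time) for _, trt, time in design)

# Each batch holds only 3 of the 6 treatment-by-time combinations,
# which is what makes the blocks incomplete.
print(dict(runs_per_batch))  # {1: 3, 2: 3, 3: 3, 4: 3}

# Partial balance: every combination is observed exactly twice overall.
for combo, count in sorted(runs_per_combo.items()):
    print(combo, count)
```

The same two counters can be applied to the rows of your own design table to confirm block sizes and replication before running the experiment.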

Split-Plot Designs

Continuing our two-way experiment example with factors Treatment and Time on a single-channel instrument, suppose that instead of the need for the batch blocking factor, the actual constraint is that samples need to be processed immediately after collection at the 1 hour, 6 hour, and 24 hour time points. In other words, it is not feasible to conduct the experimental runs in a completely randomized fashion; rather, they must be processed in time order. This is a situation calling for a Split-Plot Design, in which certain factors are easy to change in the lab and others are hard to change. You can easily generate a split-plot design in JMP DOE by changing values in the Changes column in the Factors section.

Begin a new design. Select DOE > Custom Design.

Define Treatment and Time factors as described previously.

In the Changes column, click Easy in the Time row.

Select Hard from the menu that appears (Figure 2.16).

Figure 2.16: Changing Time from Easy to Hard

Click Continue to define the model.

Click Interactions > 2nd in the Model section.

Click Make Design to proceed to the next step (shown in Figure 2.17).


Figure 2.17: Custom Design Dialog Window (split-plot design)

Notice the automatic creation of the Whole Plots column in the Design section.

Note: The term whole plot derives from agricultural field research where split-plot designs were originally popularized. Imagine a two-way design in a field trial in which the effects of plant variety and different fertilizers are to be studied. The fertilizers can only be applied to large sections of the field via large machinery or airplanes, but varieties can be planted in smaller sections. The split-plot design consists of dividing the field into fertilizer-level sections called Whole Plots, and the varieties are planted in subplots within each whole plot.

There are six whole plots in this design. Note that levels of Time are constant within any particular whole plot. In contrast, the Agent and Control levels of Treatment change within whole plots. Treatment is known as a subplot factor, and Time as a whole-plot factor. By their nature, split-plot designs provide more precision in estimating effects of subplot factors than they do for effects of whole-plot factors. This is perhaps intuitive given the constraints placed on the whole-plot factors.

Many experimenters employ split-plot designs without realizing it when they process samples in a grouped order, but then analyze the data as if they were completely randomized. This practice can lead to badly biased conclusions, especially when the whole-plot effect is substantial. The appropriate way to analyze a split-plot design involves specifying whole plots as a random effect in the analysis, thereby modeling a correlation among measurements taken within the same whole plot.
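The layout of this split-plot design can be sketched in Python: six whole plots, each with one fixed Time level and both Treatment levels in random order. The sketch assumes two runs per whole plot and two whole plots per time point; it illustrates the structure only and is not the algorithm JMP uses.

```python
import random

TIMES = ["01h", "06h", "24h"]
TREATMENTS = ["Agent", "Control"]


def split_plot_layout(seed=None):
    """Return rows (whole_plot, time, treatment) for 6 whole plots of 2 runs each."""
    rng = random.Random(seed)
    rows = []
    whole_plot = 0
    for _ in range(2):          # each time level gets two whole plots
        for time in TIMES:      # Time is hard to change: fixed within a whole plot
            whole_plot += 1
            order = TREATMENTS[:]
            rng.shuffle(order)  # Treatment is easy to change: randomized within the plot
            rows.extend((whole_plot, time, trt) for trt in order)
    return rows


layout = split_plot_layout(seed=1)  # arbitrary seed for reproducibility
for whole_plot, time, trt in layout:
    print(whole_plot, time, trt)
```

In an analysis, the whole_plot column is what would be declared as the random effect, so that the two measurements sharing a whole plot are modeled as correlated.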

Two-Channel Microarrays

Two-channel microarrays are characterized by the fact that two measurements for each gene are obtained from each microarray. This is because two different samples are tagged with different dyes, competitively hybridized to one array, and then measured under two different laser frequencies. This technology therefore offers an additional layer of complexity for experimental design beyond the one-channel designs described previously.


Several papers have discussed different two-channel design options in detail, including Kerr and Churchill (Kerr and Churchill 2001) and Dobbin and Simon (Dobbin and Simon 2002). Arguably the most popular design is the Reference Sample Design, in which a common reference sample (typically a pool of samples that is not of direct experimental interest) is tagged with one dye and hybridized on every array, while the various treated samples are tagged with the other dye. This design is easy to set up and effectively reduces design considerations to the single channel that is changing. However, the reference sample design can be two to four times less efficient than designs that hybridize samples of interest directly together on microarrays. The keys to higher efficiency are to pair samples together on arrays in a way that optimizes experimental interests and then to make sure the analysis of the data is conducted appropriately.

The previous discussion of blocking factors and split-plot designs has direct bearing here. If we narrow our focus to all the data from a single gene, and assume there is only one spot for that gene on each array, then the data come in pairs corresponding to the two measurements from each array. Each array can therefore be considered as a block of size two. Alternatively, in a split-plot scenario where certain factors are hard to change, or where you desire more precise information on some factors than on others, you can consider arrays to be whole plots and assign certain factors to change within whole plots (subplot factors) and others to stay constant on the whole plots (whole-plot factors).

Example: Split-Plot Design for Two-Channel Microarrays

Here we use the Drosophila aging experiment described in Jin et al. (Jin, Riley et al. 2001) as an example for considering experimental design options for two-channel microarrays. A subset of these data is included with your JMP Genomics installation and is described in Chapter 1 of this manual.
This design has three experimental factors with two levels each: Age (1 week, 6 weeks), Sex (Female, Male), and Line (Oregon, Samarkand).
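As a quick illustration of what three two-level factors imply, the following Python sketch enumerates every treatment combination. It is not part of JMP Genomics; the function name treatment_combinations is hypothetical, and only the factor names and levels come from the text above.

```python
from itertools import product

# Factor levels as given in the text for the Drosophila aging experiment.
factors = {
    "Age": ["1 week", "6 weeks"],
    "Sex": ["Female", "Male"],
    "Line": ["Oregon", "Samarkand"],
}

def treatment_combinations(factor_levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factor_levels)
    return [dict(zip(names, combo)) for combo in product(*factor_levels.values())]

combos = treatment_combinations(factors)
for combo in combos:
    print(combo)

# Three factors with two levels each give 2 * 2 * 2 = 8 combinations.
print(len(combos))
```

Each of the eight dictionaries corresponds to one distinct sample type that the design must allocate to arrays and channels.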

Note: For higher-level factorial arrangements, experimental design experts often use exponential notation as a shorthand description. The Drosophila example would be called a 2^3 design, which designates 3 factors with 2 levels each.

The primary experimental factor of interest is Age, and for this experiment it was desirable to obtain more precise information on the effects of Age at the expense of the Sex and Line effects. The latter two are still included to provide a higher degree of generalization for conclusions. These considerations call for a split-plot design. To create a split-plot design for this example,

Click DOE > Custom Design.

Define the three categorical factors and a fourth factor indicating the Channel. Specify Sex and Line as Hard in the Changes column, and leave Age and Channel as Easy. The completed dialog should look like the one in Figure 2.18.

Figure 2.18: The Factors have been defined

Click Continue.


There are 24 assays available for experimentation, so in the Design Generation section (Figure 2.19), specify 24 in the Number of Whole Plots box.

Figure 2.19: The Design Generation box

Click Make Design and then Make Table to generate a table like the one partially shown in Figure 2.20.

Figure 2.20: A portion of the Experimental Design Table

Note how Age and Channel change within whole plots, whereas Sex and Line stay constant for each whole plot. To convert this table to a valid JMP Genomics Experimental Design File (EDF), change the name of the Whole Plot column to Array by double-clicking on the column header and typing in Array as the new column name. Also, delete the Y column, since it will be replaced by a column named File. See Chapter 3 for specific instructions on building EDFs. To compare this design with the original design in Jin et al. (Jin, Riley et al. 2001), open the file AgingExperimentTable.txt located in the Sample Data folder. Note the run order and randomization schemes are different, but the designs are similar in terms of their split-plot structure.

Example: Randomized Block Design for Two-Channel Microarrays

Suppose that instead of the split-plot design just considered, equal information about the Age, Sex, and Line factors is needed and they need to be randomly allocated to the arrays in a randomized block design. A somewhat different approach in JMP illustrates a few more of its features.

Click DOE > Custom Design.


Define Dye as a 2-level categorical factor and Array as a 2-runs-per-block blocking factor as shown in Figure 2.21.

Figure 2.21: The completed Factors panel

Click Continue.

In the Design Generation section (Figure 2.22), enter 48 runs.

Figure 2.22: The Design Generation box

Click Make Design and then Make Table to generate a table like the one shown in Figure 2.23.

Figure 2.23: The Experimental Design Table

This table establishes the static portion of the design and ensures that Cy3 and Cy5 always appear once in each array.
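The structure of this static table can be sketched outside JMP as follows. This is only an illustration under stated assumptions: the helper names static_design and dyes_balanced are hypothetical, and the 24 arrays follow from the 48 runs and 2 runs per block specified above.

```python
def static_design(n_arrays, dyes=("Cy3", "Cy5")):
    """Build one row per (array, dye) pair: the fixed part of the design."""
    return [{"Array": a, "Dye": d} for a in range(1, n_arrays + 1) for d in dyes]

def dyes_balanced(rows, dyes=("Cy3", "Cy5")):
    """Check that every array carries each dye exactly once."""
    per_array = {}
    for row in rows:
        per_array.setdefault(row["Array"], []).append(row["Dye"])
    return all(sorted(found) == sorted(dyes) for found in per_array.values())

# 48 runs in blocks of 2 correspond to 24 arrays.
rows = static_design(24)
print(len(rows))
print(dyes_balanced(rows))
```

The experimental factors are then layered on top of these fixed Array and Dye columns, which is what the covariate steps that follow accomplish in JMP.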

Make sure this table is the active JMP table, and then open a new Custom Design window with DOE > Custom Design.


In the Factors section, click Add Factors > Covariate, select Dye, and click OK, as shown in Figure 2.24.

Figure 2.24: Selecting the first covariate

Click Add Factors > Covariate again, select Array, and click OK to generate the Factors section shown in Figure 2.25.

Figure 2.25: Both covariates have been selected.

Note: JMP considers a Covariate to be a factor describing fixed characteristics of the samples that do not change. Also note the levels of the two covariates Dye and Array are automatically read from the active JMP table because it is a previously created JMP table. In addition to loading factors from an active JMP table, you can save and load factors by clicking on the small red triangle beside Custom Design.

Next, define the three experimental factors Age, Sex, and Line. Since all three of these factors have two levels, they can be added to the design at the same time.

Type a 3 into the Add N Factors box, and then click Add Factor > Categorical > 2 Level. This creates three new rows in the Factors section.

Change each row to match the Factors section shown in Figure 2.26.

Figure 2.26: Completed Factors dialog

Click Continue.


In the Factors section, highlight Age, Sex, and Line.

In the Model section, select Interactions > 3rd.

This produces a Model section like the one shown in Figure 2.27.

Figure 2.27: Model Section with all factors and interactions defined

Click Make Design and then Make Table to create the final design, shown in Figure 2.28.

Figure 2.28: The Randomized Block Design

This Randomized Block Design allocates 2 of the 8 possible treatment combinations to each array.

Note: The previous design is also known as a kind of loop design (Kerr and Churchill 2001), and is illustrated in Figure 2.29. The term loop derives from the fact that the design can be depicted as a loop of nodes, each node indicating a sample treated with one particular experimental factor combination. Aliquots of RNA from each sample are labeled with either the Cy3 (green) or Cy5 (red) fluorescent dye. Two labeling reactions are required for each sample. Pairs of alternately labeled samples are pooled and hybridized to identical arrays. Each spot is probed with each sample, labeled with either dye, allowing the experimenter to control for confounding biases resulting from either dye or array effects.


[Diagram: Sample 1, Sample 2, Sample 3, and Sample 4 drawn as nodes connected in a loop; each of the four connections is labeled Cy5 and Cy3.]

Figure 2.29: Loop design for 4 experimental conditions
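The pairing logic behind a loop design can be sketched in a few lines of Python. This is a minimal illustration, not the procedure JMP uses: the function name loop_design is hypothetical, and both the order of samples around the loop and the dye orientation within each pair are assumptions.

```python
def loop_design(samples):
    """Pair each sample with the next one around the loop, one pair per array.

    Assumption: the first sample of each pair is labeled Cy5 and the
    second Cy3; the orientation in the figure may differ.
    """
    n = len(samples)
    return [
        {"Array": i + 1, "Cy5": samples[i], "Cy3": samples[(i + 1) % n]}
        for i in range(n)
    ]

design = loop_design(["Sample 1", "Sample 2", "Sample 3", "Sample 4"])
for array in design:
    print(array)
```

In this sketch every sample appears on exactly two arrays, once with each dye, which is the property that lets dye and array effects be separated from treatment effects.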

Microarrays with Three or More Channels

With microarrays having three or more channels, the previous discussion for two-channel designs can be extended. For incomplete block designs, set the number of runs per block equal to the number of channels and set up the other factors as usual. For split-plot designs, set the number of whole plots equal to the number of budgeted arrays.

Microarrays with More than One Spot per Gene on Each Array

Some microarrays, often those manufactured in your local lab, have multiple spots per gene on the array. Such two-color arrays pose no additional concerns from a new experimental design perspective because the samples are applied to the entire array. However, the existence of multiple spots does make a difference during subsequent data analysis, when random effects caused by such things as the nesting of identical spots within an array or differences in dye effects among multiple arrays should be considered, in addition to the usual Array random effect.

Choosing the Overall Number of Runs in a Design

Selecting the number of runs in a design is always a tradeoff between cost of the experiment versus the desired information gain, precision, or power. The latter can be difficult to quantify, considering that tens of thousands of genes or proteins are measured simultaneously. One rule of thumb is to use three biological replicates for each distinct combination of factors. A biological replicate is a biologically unique sample from the population of samples considered for experimentation. This is to be distinguished from a technical replicate, which is a repetitive measurement from biological material already used in a previous run. Biological replicates tend to be much more variable than technical replicates, but they also provide the best means to make appropriate conclusions about the population of interest.

A more statistical concept for evaluating size of designs is degrees of freedom for error. This represents the fraction of the data that is used to estimate noise instead of signal. It is computed by subtracting the total number of factor combinations from the total number of runs. Another rule of thumb requires at least 10 degrees of freedom for error in the design in order to be able to obtain an accurate estimate of noise and accompanying standard errors for effect differences.

A rigorous statistical approach for determining the number of replicates in a design is to use sample size and power calculations. These require some prior knowledge about anticipated magnitudes of effect sizes as well as desired false positive rates. Some common methods are available under DOE > Sample Size and Power, and a few advanced ones are under Genomics > Power and Sample Size. Refer to the JMP Design of Experiments guide for additional information.
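The two rules of thumb above can be combined in a short worked calculation. The Python sketch below applies them to the 2^3 Drosophila layout; the function name is hypothetical, and the calculation deliberately uses the simple definition given above (runs minus factor combinations), ignoring any blocking or random effects.

```python
def error_degrees_of_freedom(n_runs, n_factor_combinations):
    """Degrees of freedom for error: total runs minus factor combinations."""
    return n_runs - n_factor_combinations

n_combinations = 2 ** 3            # three factors with two levels each
n_runs = 3 * n_combinations        # three biological replicates per combination
df_error = error_degrees_of_freedom(n_runs, n_combinations)
meets_rule = df_error >= 10        # rule of thumb: at least 10

print(n_runs, df_error, meets_rule)
```

With 24 runs and 8 combinations, 16 degrees of freedom remain for error, so this layout satisfies both rules of thumb under the simple definition.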

Creating Data Sets for Analysis in JMP Genomics

Chapter 3

Congratulations! A completed experiment has yielded many data files. Each file consists of hundreds or thousands of rows and columns filled with numbers. Now what? Fortunately, JMP Genomics is available to help you analyze your large and complex data sets, extracting the maximum amount of information from them. Before analyzing your data, however, you must convert the raw files into a readable format. This chapter demonstrates how to prepare data for analysis. Recall from Chapter 1 that, instead of using standard JMP data files, JMP Genomics uses SAS data sets. JMP Genomics provides several commands to create SAS data sets from raw genomics data files, such as text files, Excel spreadsheets, or data from various types of special instruments. These SAS data sets serve as inputs to other JMP Genomics processes. Nearly all JMP Genomics processes generate more SAS data sets as outputs, which then serve as inputs to more processes. This framework provides considerable flexibility for statistical workflows. Make sure to organize and name your SAS data sets in a clear way to avoid confusion. The examples in this chapter demonstrate the processes for creating SAS data sets using JMP Genomics. Before we get to those examples, we should review and clarify some aspects of SAS data sets, particularly as they relate to JMP Genomics.

A Few Words about SAS Data Sets and JMP Genomics

SAS data sets have the extension .sas7bdat. We recommend you associate the extension .sas7bdat with JMP (Control Panel > Folder Options > File Types) so that double-clicking on any .sas7bdat file opens it in JMP as a JMP table. JMP can then produce its native graphics and analyses, in addition to those created by JMP Genomics dialogs. To save a JMP table as a SAS data set, change the File Type in the Save As dialog. Alternatively, you may use the File > Save As SAS Data Set command.

JMP Genomics Requires Specific Types of Data Sets

Many of the processes in JMP Genomics (especially those used for microarray and proteomic analyses) require the specification of two separate SAS input data sets:

1. An input data set in tall format (the tall and wide data formats are defined later in this chapter), and

2. An appropriate Experimental Design Data Set (EDDS).

An EDDS is a SAS data set that provides information about the columns of the tall data set. It describes relevant experimental variables such as treatment conditions and covariates, as well as a variable named ColumnName. Entries in the ColumnName column must exactly match the column names in the input data set. Experimental design data sets have certain constraints that must be followed for the processes to run successfully.
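The matching requirement between ColumnName entries and the tall data set can be checked with a simple comparison. The following Python sketch is illustrative only and is not a JMP Genomics process: the function name, the sample rows, and the column names are all hypothetical, and it assumes an exact string comparison.

```python
def unmatched_column_names(edds_rows, tall_columns):
    """Return ColumnName entries with no exact match among the tall data set columns."""
    available = set(tall_columns)
    return [row["ColumnName"] for row in edds_rows if row["ColumnName"] not in available]

# Hypothetical EDDS rows and tall data set column names.
edds_rows = [
    {"ColumnName": "m_1", "Treatment": "control"},
    {"ColumnName": "m_2", "Treatment": "agent"},
    {"ColumnName": "m3", "Treatment": "agent"},   # typo: the tall data set has m_3
]
tall_columns = ["ProbeID", "m_1", "m_2", "m_3"]

print(unmatched_column_names(edds_rows, tall_columns))
```

An empty result would indicate that every row of the design data set points at an existing column.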


To create these data sets, first construct a third type of file, the Experimental Design File (EDF). An EDF imports various kinds of data into JMP Genomics. An EDF is a precursor to an EDDS. The EDF is normally saved as a comma separated values (.csv) file, tab delimited text (.txt) file, or Microsoft Excel (.xls) spreadsheet rather than as a SAS dataset. A typical JMP data file (.jmp) does not work as an EDF.

When designing a new experiment from scratch, refer to Chapter 2 on how to use JMP’s DOE (Design of Experiments) functionality to create an optimal design. After creating a design, one or more columns are usually added to the table to make a valid EDF. Then use the JMP File menu to save it as a text or Excel table.

Note: The advantage to using an EDF is having all of the experimental variables collected in one table that can be reused or modified as needed. An EDF is an excellent way to consolidate, store, and share the critical factors in an experiment, rather than trying to attach them to the raw data manually or adding them into the names of the raw data files. Since an EDF can be used to record corresponding experimental factors of a microarray experiment, it is good practice to construct it during the initial planning of your experiments.

Note: Many of the processes used for genetic analyses make use of wide data sets and do not require an EDDS.

Tall and Wide Data Sets

Most of the processes in JMP Genomics assume that the input SAS data set has a particular data structure. JMP Genomics distinguishes between tall and wide SAS data sets. A tall SAS data set has samples as columns and molecular entity (such as marker, gene, clone, protein, or metabolite) as rows. A wide SAS data set is the transpose of a tall data set, having the samples as rows and molecular entity as columns. When specifying the input SAS data set for a process, it is important to know the required form. Most of the processes associated with genetic analyses require a wide structure, whereas most of those for microarray and proteomics analyses use a tall structure. The Transpose Tall and Wide and Transpose Rectangular processes under the Data Set Utilities menu transform SAS data sets between tall and wide forms. The use of these commands is discussed in more detail in Chapter 4.

Terminology

The columns in a SAS data set are called variables, and the rows are called observations. This terminology is used frequently in JMP Genomics dialogs and this documentation.
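To make the tall and wide structures concrete, the following Python sketch transposes a tiny table in both directions. It only illustrates the idea and is not how the Transpose Tall and Wide process is implemented; the function name, the gene and sample names, and the values are all hypothetical.

```python
def transpose(rows, id_column, new_id_column):
    """Swap rows and columns of a table stored as a list of dicts."""
    value_columns = [name for name in rows[0] if name != id_column]
    return [
        {new_id_column: column, **{row[id_column]: row[column] for row in rows}}
        for column in value_columns
    ]

# Tall form: one row per molecular entity, one column per sample.
tall = [
    {"Gene": "geneA", "sample1": 7.1, "sample2": 6.8},
    {"Gene": "geneB", "sample1": 9.3, "sample2": 9.9},
]

# Wide form: one row per sample, one column per molecular entity.
wide = transpose(tall, "Gene", "Sample")
for row in wide:
    print(row)

# Transposing again recovers the tall form.
print(transpose(wide, "Sample", "Gene") == tall)
```

In SAS terminology, the tall table here has two observations and three variables, and the wide table has two observations whose variables are Sample, geneA, and geneB.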

Annotation Data Sets

In addition to an experimental design data set, many JMP Genomics processes also optionally accept an annotation data set. This is a SAS data set containing biological or chemical properties corresponding to the molecular entities in the experiment. Annotation data sets can correspond to either tall or wide data sets. For tall data sets, annotation data sets must share one or more merge key variables with the tall data set so that the two data sets can be joined at run time. For wide data sets, an assumption on the order of the variables is usually in effect.

Annotation data sets are typically created by opening an appropriate text or Excel table in JMP, removing any undesired columns, and then saving it as a SAS data set (with extension .sas7bdat) using the Save As menu. However, if the column names in the data set contain special characters (-, *, #, for example), the columns may be truncated. This problem can be avoided by using the File > Save As SAS Data Set command. Annotation data sets provided by Affymetrix or other suppliers, typically as .txt or .csv files, must first be imported into JMP using the Genomics > Data Set Creation > Text > Import Individual Text, CSV or Excel Files process to convert the .txt, .csv, or Excel file to a .sas7bdat file. See Chapter 11 for more information on Annotation Data Sets.


Creating the Input Data Sets

There are numerous ways to create input data sets and the EDDSs needed for analysis by JMP Genomics. How you decide which method to use depends on the form in which your raw data is stored, the availability of design files that describe the organization of your experiment and the data, the complexity of your experiment, and the number and types of analyses you plan to conduct. Table 3.1 lists possible scenarios for creating the needed data sets, depending upon the types of files you start with.

Table 3.1: Recommended Procedures for Creating the Needed Data Sets

What you have: Recommended Procedure:

Raw data files (device-specific format), and a Design file

1. Convert the design file to an EDF (Join multiple files if needed)

2. Generate the EDDS and SAS data set using the device-specific import engine under Import.

Raw data files only (device-specific format)

1. Create an EDF using the Experimental Design File Builder.

2. Generate the EDDS and SAS data set using the device-specific import engine under Import.

Raw data files only (.txt, .csv, .xls, .sas7bdat)

1. Create an EDF using the Experimental Design File Builder.

2. Generate the EDDS and SAS data set by running the Import a Designed Experiment from Text, CSV, or Excel Files process under Import > Text.

Raw data file only (one file, in tall form)

1. Read the raw data file into JMP and then save it as a .sas7bdat file using either the File > Save As or the File > Save As SAS Data Set command. The JMP User Guide provides instructions for importing data from .txt, .csv, and .xls files.

2. Run the Experimental Design Data Set Builder process, under Experimental Design, on the newly created .sas7bdat file to create the EDDS.

Raw data file only (one file, in wide form)

1. Read the raw data file into JMP and then save it as a .sas7bdat file using either the File > Save As or the File > Save As SAS Data Set command. For processes that require a tall data set, run the Transpose Tall and Wide process, under Data Set Utilities, to convert the data set from a wide to a tall form and to generate the EDDS.

2. Run the Experimental Design Data Set Builder process, under Experimental Design, on your newly created .sas7bdat file to create the EDDS.

Note: For processes that do not require an EDDS, you import the data using the Import Individual Text, CSV or Excel Files command.


Later, this chapter includes several examples of how to create these data sets using the procedures listed in Table 3.1.

The Experimental Design File

Recall the ArrayTrack example from Chapter 1. In this example, we created an input data set, an EDDS, and an Annotation Data Set using parameters specified by an EDF from the Sample Data folder installed with JMP Genomics. In most cases, an EDF must be created before you conduct further analyses. EDFs for JMP Genomics must adhere to the following conventions:

1. The first row of the file must contain column header names. The second and subsequent rows must contain data with no blank rows.

2. One column must have the header name Array, Chip, or Spectrum. An optional second column must be named Channel or Dye. The data entries in these two columns must uniquely identify the rows of the file. The Create Array Index process (under Genomics > Experimental Design) generates this column, if needed.

3. One column must have the header name File or FileName. The entries in this column must contain the names of the raw data files that are associated with each row. The Check File Names process (under Genomics > Experimental Design) helps you to check the accuracy of the file names.

4. One column must have the header name ColumnName. The entries in this column must correspond to valid SAS variable names in the tall data set that is associated with this experimental design. The Create ColumnName process (under Genomics > Experimental Design) can generate this column.

5. When raw data files have more than one raw data column, a column named Intensity is required. The names of the columns in the raw data files are listed in this column.

6. When raw data files have a column corresponding to a background signal to be subtracted from the specified Intensity column, include a column named Background. The entries in this column contain the names of the columns in the raw data file that correspond to the background columns.

7. To input other columns, which are shared for all the raw files, such as coordinates of molecular entities on arrays, you may include columns named _X_varname in your EDF, where varname is the name assigned to these columns in the tall data set you are creating. The entries in this column contain the names of the columns in the raw data file that correspond to the extra data.

8. You may include an arbitrary number of additional columns corresponding to such things as treatment, dose, time, or any other experimental variable or covariate of interest. Do not use any of the names described in conventions 2-7 above for these additional columns.

9. The file must be in one of the following formats: tab-delimited with .txt extension, comma-delimited with .csv extension, Microsoft Excel with .xls extension, or a SAS data set, with .sas7bdat extension.
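Several of the conventions above lend themselves to an automated pre-check before an EDF is handed to an import process. The Python sketch below covers conventions 1 through 4 for a delimited text EDF. It is not a JMP Genomics tool: the function name check_edf and the sample contents are hypothetical, header matching is assumed to be exact, and the variable-name pattern assumes the standard SAS rule of a letter or underscore followed by letters, digits, or underscores, up to 32 characters.

```python
import csv
import io
import re

# Assumed SAS naming rule: letter or underscore first, at most 32 characters.
SAS_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,31}$")

def check_edf(text, delimiter=","):
    """Return a list of problems found in EDF text; an empty list means none found."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    # Convention 1: no blank rows after the header.
    if any(not any(cell.strip() for cell in row) for row in data):
        problems.append("blank data row")
    records = [dict(zip(header, row)) for row in data if row]
    # Convention 2: Array, Chip, or Spectrum, plus optional Channel or Dye.
    array_cols = [h for h in header if h in ("Array", "Chip", "Spectrum")]
    if len(array_cols) != 1:
        problems.append("need one column named Array, Chip, or Spectrum")
    else:
        channel_cols = [h for h in header if h in ("Channel", "Dye")]
        key_cols = array_cols + channel_cols[:1]
        keys = [tuple(r.get(c) for c in key_cols) for r in records]
        if len(set(keys)) != len(keys):
            problems.append("array and channel entries do not uniquely identify rows")
    # Convention 3: File or FileName.
    if not any(h in ("File", "FileName") for h in header):
        problems.append("need a column named File or FileName")
    # Convention 4: ColumnName entries must be valid SAS variable names.
    if "ColumnName" not in header:
        problems.append("need a column named ColumnName")
    else:
        bad = [r.get("ColumnName", "") for r in records
               if not SAS_NAME.match(r.get("ColumnName", ""))]
        if bad:
            problems.append("invalid ColumnName entries: " + ", ".join(bad))
    return problems

good_edf = "Array,File,ColumnName,Treatment\n1,a.cel,m_1,control\n2,b.cel,m_2,agent\n"
bad_edf = "Array,File,ColumnName\n1,a.cel,m_1\n1,b.cel,2bad\n"
print(check_edf(good_edf))
print(check_edf(bad_edf))
```

The remaining conventions (Intensity, Background, and _X_varname columns) depend on the contents of the raw data files and are left out of this sketch.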

EDFs may be built in a variety of ways. The simplest method assumes you have a file that identifies individual raw data files along with the experimental conditions, such as treatment, dosage, time, cell line, animal, sex, and age, under which they were generated. Such a file may be created using JMP’s DOE capabilities, as discussed in Chapter 2. This file is read into JMP and modified such that it functions as an EDF.

Note: If the design information is spread across separate tables, use the Tables > Join command to merge the tables to create the design file. Consult the JMP User Guide for specific instructions on merging tables.

Alternatively, JMP Genomics includes a tool called the Experimental Design File Builder (under Genomics > Experimental Design) that you can use to create a new EDF. Let’s use the Affymetrix Latin Square data set, contained in the Sample Data Folder included with JMP Genomics and described in Chapter 1, as an example to demonstrate both methods.

Converting an Existing Design File into an EDF

This example uses an Excel file called DesignTable.xls that contains information identifying specific raw data files with the experiments from which the data in each file was generated.

Select File > Open.

Navigate to Sample Data > Microarray > Affymetrix Latin Square.

Specify the file type as an Excel file in the Open Data File box.

Open the DesignTable.xls file.

The file is imported to a JMP table, as shown in Figure 3.1.

Figure 3.1: The Experimental Design File

Note that this table contains the three required elements for an Experimental Design File:

1. An Array column listing the individual array used for each experiment,

2. A File column listing the names of the specific raw data files for each experiment, and

3. A ColumnName column listing the column names within those files that contain the relevant data.

As such, this file can serve as an EDF. There is an additional Experiment column specifying the individual experiment from which the data in each row was collected. Presumably, the experimenter is aware of the variables (treatment, dosage, time, etc.) for each experiment. However, additional columns could be added to the table to specify additional information. To add columns, complete the following steps.

Select Cols > New Column.

Specify the name and characteristics for the new column.

Fill in the contents of the new column, either by typing the information into each cell, or by using JMP’s Tables > Join command to merge this table with another containing the information.

Repeat for each new column that you add.

Select File > Save As to save the EDF. Be sure to specify one of the acceptable file types.

Note: You should use the File > Save As SAS Data Set command to save the file as a .sas7bdat file if the file’s column names contain special characters.

Building a New EDF

To construct an EDF from scratch, use the Experimental Design File Builder command.

Select Genomics > Experimental Design > Experimental Design File Builder. The dialog shown in Figure 3.2 appears.

Figure 3.2: The EDF Builder


Click Choose to specify the folder containing the raw data files. For this example,


Open the CEL folder and click Select.

The Affymetrix Latin Square folder, which contains the raw data files, is specified in the Experimental Design File Builder dialog as shown in Figure 3.3.

Figure 3.3: The folder containing the raw data files has been selected.

To view only the relevant .cel files, complete the following step.

Select .cel in the File Filter Expression box.

The File Filter Expression box appears as shown in Figure 3.4.

Figure 3.4: The File Filter Expression box

Because the probes were labeled with one dye,

Make sure that 1 channel is selected, as shown in Figure 3.5.

Figure 3.5


Finally, specify a name for the EDF and where to save the file.

Type AffyLatinSquare_Design in the Output File Name box.

To specify the output folder, complete the following steps.

Click on Choose.

Navigate to ProcessResults.

Open the ProcessResults folder and click Select to select this folder.

The Experimental Design File Builder dialog appears like the one shown in Figure 3.6.

Figure 3.6: The Completed Dialog

Click Run to generate the EDF.

The EDF is shown in Figure 3.7.


Figure 3.7: The Experimental Design File

Compare the EDF illustrated in Figure 3.7 with the EDF displayed in Figure 3.1. Aside from a difference in the column order and the presence of the optional Experiment column in Figure 3.1, the two files are the same.

Note: The ColumnName column is empty. The appropriate experimental data can be entered either by typing the data directly in the column or by defining specific SAS code in the Options tab of the Experimental Design File Builder dialog. Refer to the SAS 9.1.3 User’s Guide (http://support.sas.com/onlinedoc/913/docMainpage.jsp) for additional information.

To enter the data using SAS code, complete the following steps.

Click on the Experimental Design File Builder dialog to reactivate the dialog.

Click on the Options tab.

Type the following SAS code between the parentheses of the %str() SAS macro.

length Experiment $ 1;
Experiment = substr(file, 5, 1);
if Experiment in ("n", "o", "p") then Experiment = "m";
else if Experiment in ("r", "s", "t") then Experiment = "q";
ColumnName = Experiment || "_" || trim(left(Array));
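For readers who want to trace what these SAS statements do, the following Python sketch mirrors the same logic step by step. It is an illustration only and is not entered into the dialog; the function name and the file names in the example are hypothetical.

```python
def derive_column_name(file_name, array):
    """Mirror the SAS statements: derive Experiment, then build ColumnName."""
    # SAS substr(file, 5, 1) is 1-based, so it reads the fifth character.
    experiment = file_name[4:5]
    # Letters n, o, p are relabeled m; letters r, s, t are relabeled q.
    if experiment in ("n", "o", "p"):
        experiment = "m"
    elif experiment in ("r", "s", "t"):
        experiment = "q"
    # trim(left(Array)) removes leading and trailing blanks from the array value.
    column_name = experiment + "_" + str(array).strip()
    return experiment, column_name

# Hypothetical file names whose fifth character encodes the experiment.
print(derive_column_name("exp_n_02.CEL", 2))
print(derive_column_name("exp_a_01.CEL", " 1 "))
```

The resulting ColumnName values combine the experiment letter and the array identifier, which yields names that begin with a letter as SAS variable names require.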


The modified EDF is shown in Figure 3.8.



Figure 3.8: A portion of the modified EDF

Note: Except for the values of ColumnName, the modified EDF is identical to the EDF shown in Figure 3.1.

Select File > Save As to save the EDF. Be sure to specify one of the acceptable file types (.xls, .csv, .txt).

Note: You should use the File > Save As SAS Data Set command to save the file as a .sas7bdat file if the file’s column names contain special characters.

Additional Tools for Creating an EDF

Also available are processes called Create Array Index, Create ColumnName, and Check File Names (under Genomics > Experimental Design). Each works with an open JMP table to help you transform it into a valid EDF. Once you have a complete EDF created in a JMP table, save it as a .txt or .xls file for use as input for one of the Import processes.

Creating Both the Experimental Design Data Set (EDDS) and the SAS Data Set with an EDF

After creating an appropriate EDF, specify it as one of the input parameters in a data-specific process from the Import submenu.

Using a Device-Specific Data Import Engine

Recall the ArrayTrack example from Chapter 1. In this example, we created both an input data set and EDDS using the parameters specified by a sample EDF. The output of this process usually consists of two SAS data sets, one containing the raw data in tall form, and the other a corresponding EDDS. In the following example, we create corresponding data sets using the sample Affymetrix Latin Square data set and the newly created AffyLatinSquare_Design EDF.

Select Genomics > Import > Affymetrix > Affymetrix Expression CEL, as shown in Figure 3.9.


Figure 3.9: Opening the Affymetrix Expression CEL Import Engine

This opens the dialog shown in Figure 3.10.

Figure 3.10: The Affymetrix Input Engine dialog

To select the Experimental Design File included with JMP Genomics,

Click Choose.


Select the DesignTable.txt file and click Open.

To select the folder containing the raw data files,

Click Choose to specify the folder containing the raw data files.



Open the CEL folder and click Select.

A special file, known as the Chip Description File (CDF), must be specified. This file contains information to associate individual probes (extracted from the CEL file) with the corresponding probe set. CDFs are standard files, unique for each chip, and are provided for download by Affymetrix. To select the folder containing the CDF file for this data set,

Click Choose.

Navigate to Sample Data > Microarray.

Open the Affymetrix Latin Square folder and click Select.

Now specify where to save the SAS data set and EDDS.

Click Choose.


Open the ProcessResults folder and click Select.

The dialog should appear as shown in Figure 3.11.

Figure 3.11: The Affymetrix Expression CEL/CHP Import Engine (II)

Click Run to generate the data sets.

As discussed in Chapter 1, JMP Genomics dialogs generate and run a SAS program each time you click the Run button. Depending on the size of your data sets and the capacities of your computer, some processes can take several minutes or, for very large and complex runs, several hours. While a program is running, the message SAS Connected is displayed in the JMP status bar located in the lower left corner of your JMP window (see Figure 1.10). The Windows Task Manager shows a process named sas.exe, and tracks its CPU and I/O activity. Alternatively, monitor the SAS temporary working directory and the Output Folder for results as they are created.

The SAS data sets generated by this process are listed in a SAS Message dialog (Figure 3.12).


Figure 3.12: The SAS Message dialog

Click Open for each of the data sets to examine their contents and structures.

Using the Import a Designed Experiment from Text, CSV, or Excel Files Command

If the data is stored in a generic (.txt, .csv, .xls, or .sas7bdat) format, build the input data set and EDDS using the Import a Designed Experiment from Text, CSV, or Excel Files command. This example uses data from the Drosophila aging experiment described in Chapter 1.

Select Genomics > Import > Text > Import a Designed Experiment from Text, CSV, or Excel Files, as shown in Figure 3.13.

Figure 3.13: Selecting the Import a Designed Experiment from Text, CSV, or Excel Files Command

The Import a Designed Experiment from Text, CSV or Excel Files dialog opens, as shown in Figure 3.14.


Figure 3.14: The Import a Designed Experiment from Text, CSV or Excel Files dialog

To select the Experimental Design File, complete the following steps.

Click Choose.

Navigate to Sample Data > Microarray > Scanalyze Drosophila.

Select the AgingExperimentTable.txt file and click Open to select the file.

To select the folder containing the raw data files, complete the following steps.

Click Choose to specify the folder containing the raw data files.


Open the Scanalyze Drosophila folder and click Select.

If files do not end with the .csv, .sas7bdat, .txt, or .xls extension, specify their file type. The raw data files for the Drosophila Aging experiment, used for this example, end with .dat. These are tab delimited files.

Select Tab Delimited from the Data File Type drop-down menu, as shown in Figure 3.15.

Figure 3.15

The first row of a tall SAS data set always lists the name of the variable or column.

Enter 1 in the Row Number of Variable Names box, as shown in Figure 3.16.


Figure 3.16

The first seven rows in each of the raw data files contain information about the samples. Data entries begin in row 9.

Specify 9 in the Data Start Row box, as shown in Figure 3.17.

Figure 3.17

ID Variables are required. For this example, the variable being measured is the intensity of the spots on the microarray.

Type Spot in the ID Variables box, as shown in Figure 3.18.

Figure 3.18

Finally, select a location to save the SAS data set and the EDDS.


Open the ProcessResults folder and click Select.

The completed dialog appears as shown in Figure 3.19.

Figure 3.19: The completed Import a Designed Experiment from Text, CSV or Excel Files dialog

Click Run to generate the data sets.


The locations of the output data sets generated by this process are listed in a SAS Message dialog, as shown in Figure 3.20.


Click Open for each of the data sets to examine their contents and structures.

Creating the Input Data Set and EDDS from a Single, Tall Data File

Suppose all experimental data are assembled into one Excel spreadsheet like the one illustrated in Figure 3.21.

Figure 3.21: An Excel spreadsheet containing data from the Drosophila aging experiment

In this case, the data set is already in tall form, so a SAS input data set and a corresponding experimental design data set are all that are needed. Create the two data sets separately using the following steps.

For the input data set:

Select File > Open to open an Open Data File dialog.


Select Excel Files (*.xls) from the Files of type drop-down menu.


Select the drosophilaaging.xls file and click Open to select the file.

The file opens as a JMP table, as shown in Figure 3.22.

Figure 3.22: A portion of the JMP data table containing data from the Drosophila aging experiment

Select File > Save As SAS Data Set to open the Save As SAS Data Set dialog.

Type drosophilaaging_tall as the name of the output data set.

Choose the ProcessResults folder as the save destination.

Click Save to save the file.

For the EDDS:

Select Genomics > Experimental Design > Experimental Design Data Set Builder.

The dialog shown in Figure 3.23 opens.


Figure 3.23: The EDDS Builder dialog

Follow these steps to select your converted file as the Input Data Set.

Click Choose to select the Input Data Set.


Select the drosophilaaging_tall.sas7bdat file.

Click Open to select the file.

Examine Figure 3.22. Note that all of the columns except for Spot contain raw data. To select the columns containing raw data, complete the following steps.

Hold the Ctrl key down while clicking on all of the columns listed in the Available Variables box except for Spot, as shown in Figure 3.24. Do not select Spot.

Figure 3.24: Selecting the variables containing raw data

Specify the ProcessResults folder as the output folder.

The completed General tab of the dialog should look like the one illustrated in Figure 3.25.


Figure 3.25: The completed General tab of the EDDS Builder dialog

As described in Chapter 1, the data in this example came from an experiment comparing the effects of age, sex and line on Drosophila gene expression. The data in each of the columns in the raw data file describes channel-specific results from one combination of those experimental conditions. To create additional columns in the EDDS to further describe these conditions, complete the following steps.

Click on the Options tab.

Type the following SAS code within the SAS Code to Create New Design Variables field to create five additional columns: Line, Sex, Age, Channel, and Array.

Line = scan(columnname,1,"_");
Sex = scan(columnname,2,"_");
Age = scan(columnname,3,"_");
Channel = scan(columnname,4,"_");
Array = scan(columnname,5,"_");

The SAS Code to Create New Design Variables field should appear as shown in Figure 3.26.

Figure 3.26: The SAS Code to Create New Design Variables field


Note: Each new column identifies one of the conditions in the experiment. Each column is specified on its own line. The name of each new column is specified on the left side of the equals sign, while the position within the original column name that describes the condition is defined on the right side of the equals sign. Refer to the SAS 9.1.3 User’s Guide for additional information on writing SAS syntax.
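To check what each scan call would extract before running the dialog, the behavior of scan(columnname, n, "_") can be imitated in a few lines of Python. This is only an illustrative sketch, not code that JMP Genomics runs. The column name Ore_F_1wk_Cy3_A01 is hypothetical, and the helper assumes the default SAS SCAN behavior of treating repeated delimiters as one and returning a blank when the requested word does not exist.

```python
def scan(text, n, delimiter="_"):
    """Return the n-th (1-based) delimiter-separated word of text, or "" if absent."""
    # Empty pieces produced by repeated delimiters are skipped.
    words = [word for word in text.split(delimiter) if word]
    return words[n - 1] if 1 <= n <= len(words) else ""


def design_variables(column_name):
    """Split a column name into the five design variables used in this example."""
    fields = ["Line", "Sex", "Age", "Channel", "Array"]
    return {field: scan(column_name, i) for i, field in enumerate(fields, start=1)}


# Hypothetical column name, for illustration only.
example = design_variables("Ore_F_1wk_Cy3_A01")
print(example)
```

If a column name has fewer than five underscore-separated parts, the corresponding design variables come back blank, which is a quick way to spot column names that do not follow the expected pattern.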

Make no other changes to the tab.

Click the Run button to generate the EDDS.

The new EDDS opens, as shown in Figure 3.27.

Figure 3.27: A portion of the EDDS

Now That You Have Your Data Sets

Keep a tall data set and its corresponding EDDS together in subsequent processes that call for them. If needed, pair the same experimental design data set with updated versions of the input data, such as those created by processes in the Normalization submenu. You can also create subsets of the original data set by deleting rows from the EDDS and saving the result under a new name, to concentrate the focus of your analysis. Tall data sets and EDDSs can also be mixed and matched, depending on your analysis needs. These procedures are discussed in greater detail in later chapters.

Chapter 4

Data Set Utilities

The Data Set Utilities menu provides a collection of processes for managing and modifying SAS data sets. These utilities can be used at any point during your JMP Genomics session. The utilities are divided into four main sections:

• Column Utilities
• Joins and Transpositions
• Statistics and Transforms
• Export

These are shown in Figure 4.1.

Figure 4.1: The Data Set Utilities menu

The purpose of this chapter is to provide descriptions and examples for these commands. Note that a similar set of utilities for JMP tables is available under the Tables menu.


Column Utilities

The Column Utilities group offers analytical procedures and manipulations frequently used in genomic analyses that:

• display detailed contents about the columns and structures of a SAS data set,
• change the lengths, labels, names, or order of SAS data columns (also known as SAS variables).

Column Contents

The Column Contents command displays the contents of a SAS data set in .html format.

Select Genomics > Data Set Utilities > Column Contents.


Figure 4.2: The Data Contents dialog

Click Load.

Select the default settings for the AffymetrixLatinSquareExample.

Click OK to bring up the Column Contents dialog shown in Figure 4.3.


Figure 4.3: The completed Column Contents dialog

In the Print Options field, specify whether to print all the data or a subset of the data. In this example, only the first 100 observations are displayed.

Click Run.

JMP displays the results in a series of tables (Figure 4.4).

Figure 4.4: The output of the Column Contents process

Each output table, shown sequentially in the frame on the right, is identified in the Table of Contents column. You can specify certain print options to selectively print all or part of the data set. For further information, see the SAS documentation for the CONTENTS and PRINT procedures.


Change Labels

The Change Labels command modifies multiple column labels using simple SAS syntax. This command is particularly useful if you want to change the labels of multiple columns in multiple data sets. This example changes the labels of two columns, Unit No and Probe No, in the affylatin.sas7bdat data set included in the Sample Data folder. The original file is shown in Figure 4.5.

Figure 4.5: The original affylatin.sas7bdat data set

Select Genomics > Data Set Utilities > Change Labels.


Figure 4.6: The Change Labels dialog

Click Load.


Select the default settings for the AffymetrixLatinSquareExample.

Click OK to bring up the completed Change Labels dialog shown in Figure 4.7.

Figure 4.7: The Completed Change Labels dialog

To remove labels from any number of columns, click on the variable name and then click to add the name to the Remove Labels from these Variables box.

Do not select any column labels to remove.

Specify the SAS syntax for multiple new labels in the New Label Specifications box, as shown in Figure 4.8.

Figure 4.8: The Completed New Label Specifications box

Note: In this example, the new syntax has already been entered. The name for each column is on the left side of the equals sign and the new label for each column is on the right side of the equals sign and is contained in quotes. The changes for each column must be entered on a separate line. In your own analyses, specify both the name and the location of the relabeled output file. However, since this is an example, proceed without changing the default specifications.

Do not change either the name or location of the output folder.

Click Run to relabel the columns.


The location of the relabeled data set generated by this process is listed in a SAS Message dialog shown in Figure 4.9.


Click Open to examine the relabeled file.

The relabeled file appears as shown in Figure 4.10.

Figure 4.10: The relabeled data set

Compare the original and modified table labels. Note that the labels of the Unit No and Probe No variables in the original table were changed to Affy Internal Unit No. and Probe Sequential No., respectively, in the new table.

Note: Recall that variables in a SAS data set can have both labels and names. A variable must have a variable name that conforms to certain conventions. Data labels are less stringent and can be any ASCII text. JMP automatically uses SAS variable labels as column names.

Change Lengths

The Change Lengths command shortens the lengths of variables in a SAS data set to save space. This command is used only for character variables; it does not change the lengths of numeric variables. This example changes the length of the variables in the Probe Set ID column, in the affylatin.sas7bdat data set included in the Sample Data folder. A portion of the original file is shown in Figure 4.5.


Select Genomics > Data Set Utilities > Change Lengths. The dialog shown in Figure 4.11 opens.

Figure 4.11: The Data Length dialog

Click Load.

Select the default settings for the AffymetrixLatinSquareExample and click OK.

Uncheck the Minimize Lengths of Selected Variables box.

Change the default setting in the New Length for Variables Select above [0, 64] box from 16 to 2, as shown in Figure 4.12.

Figure 4.12

Click Run to change the length of the selected variable.

The location of the modified data set generated by this process is listed in a SAS Message dialog (shown in Figure 4.13).



Click Open to examine the modified file.

The modified file appears as shown in Figure 4.14. Compare the length of variables in the Probe Set ID column in the modified data set with those in the original data set (Figure 4.5).

Figure 4.14: The modified data set.

Rename

The names of the columns in the input data set were initially established by the ColumnName variable in the Experimental Design table. The Rename process systematically changes the names of the columns in the input data table and the corresponding values in the Experimental Design Data Set (EDDS). This example changes the column names in the input data set from the Drosophila aging experiment described in Chapter 1. Portions of the original input data set and EDDS are shown in Figures 4.15 and 4.16, respectively.


Figure 4.15: The Drosophila Aging Input Data Set

Figure 4.16: The Drosophila Aging EDDS

Select Genomics > Data Set Utilities > Rename. The dialog shown in Figure 4.17 opens.


Figure 4.17: The Data Rename dialog

Click Load.

Select the default settings for the DrosophilaAgingExample and click OK.

This dialog allows selection of the variable whose values are used for the column names from the list of available variables. To select this variable, click on the desired variable, then click to add the variable to the Variable Containing Current Column Names box, as shown in Figure 4.18.

Figure 4.18: Selecting the variable

Because CurrColumnName is already selected, complete the following step.

Do not change the default setting for the Variable Containing Current Column Names box.

The dialog also allows selection of the variable whose values are used for the new column names. To select this variable, click on the desired variable, then click to add the variable to the Variable Containing New Column Names box, as shown in Figure 4.18. Because ColumnName is already selected, complete the following step.

Do not change the default setting for the Variable Containing New Column Names box.

Click Run to rename the columns.


The location of the data sets generated by this process is listed in a SAS Message dialog, shown in Figure 4.19.


Note: By leaving the Output Data Set and Output Experimental Design Data Set boxes blank in the Data Rename dialog, the file names do not change, except that the abbreviation _drn is appended to each of the output file names.

Click Open next to each file to examine the output files. The first listed file (Figure 4.20) is the output data set with new column names.

Figure 4.20: The output data set

The second listed file (Figure 4.21) is the output EDDS that excludes the OldColumnName column.


Figure 4.21: The output EDDS
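Conceptually, the Rename process looks up each current column name in the EDDS and replaces it with the corresponding new name. The Python sketch below illustrates that lookup with plain dictionaries. It is not the SAS program that the dialog generates, and the column names and values in the sample data are hypothetical.

```python
def rename_columns(rows, edds, current_key="CurrColumnName", new_key="ColumnName"):
    """Rename the columns of each data row using the current -> new mapping in the EDDS."""
    mapping = {design[current_key]: design[new_key] for design in edds}
    # Columns that are not listed in the EDDS (such as an ID column) keep their names.
    return [{mapping.get(name, name): value for name, value in row.items()} for row in rows]


# Hypothetical EDDS rows and data rows, for illustration only.
edds = [
    {"CurrColumnName": "array1_cy3", "ColumnName": "A1_Cy3"},
    {"CurrColumnName": "array1_cy5", "ColumnName": "A1_Cy5"},
]
data = [{"Spot": 1, "array1_cy3": 7.2, "array1_cy5": 6.9}]
renamed = rename_columns(data, edds)
print(renamed)
```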

Compare the input (Figures 4.15 and 4.16) and output (Figures 4.20 and 4.21) data sets to see the results of the Rename command.

Reorder

The Reorder command sorts the columns according to the order of the values in the ColumnName variable in the EDDS. This example changes the column order in the input data set from the Drosophila aging experiment described in Chapter 1. A portion of the original input data set is shown in Figure 4.15.

Select Genomics > Data Set Utilities > Reorder. The dialog shown in Figure 4.22 opens.


Figure 4.22: The Data Reorder dialog

Click Load.


Click Run to reorder the columns in the input data set.

The location of the modified data set generated by this process is listed in a SAS Message dialog (shown in Figure 4.23).


Click Open to examine the reordered data set.

The reordered file appears as shown in Figure 4.24.


Figure 4.24: The reordered data set

Compare the original (Figure 4.15) and reordered (Figure 4.24) data sets to see the different column order from the Reorder command.
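The effect of reordering can be pictured as rebuilding each row with its data columns arranged in the order in which they are listed in the ColumnName variable of the EDDS. The following Python sketch shows the idea under two assumptions that are not taken from the dialog itself: columns that are not listed in the EDDS (such as Spot) are kept in front, and the column names A, B, and C are hypothetical.

```python
def reorder_columns(row, edds_column_names):
    """Return a copy of row with the EDDS-listed columns arranged in EDDS order."""
    listed = set(edds_column_names)
    # Assumption: columns not named in the EDDS stay in front, in their original order.
    others = [name for name in row if name not in listed]
    ordered = others + [name for name in edds_column_names if name in row]
    return {name: row[name] for name in ordered}


# Hypothetical row whose data columns are out of order.
row = {"Spot": 5, "C": 3.0, "A": 1.0, "B": 2.0}
reordered = reorder_columns(row, ["A", "B", "C"])
print(list(reordered))
```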

Joins and Transpositions

The Joins and Transpositions section contains utilities to append, merge, and transpose SAS data sets.

Append

The Append command appends two SAS tables together, end-to-end. This example appends two tables with identical column labels from the Drosophila aging experiment included in the Sample Data folder.

Select Genomics > Data Set Utilities > Append. The dialog shown in Figure 4.25 opens.

Figure 4.25: The Data Append dialog

Click Load.



The default Base Input Data Set is the Drosophila input data shown in Figure 4.15 and the Append Input Data Set is the normalized Drosophila input data set. Each table has 100 rows.

Click the Run button to append the data sets. The location of the appended data set generated by this process is listed in a SAS Message dialog.

Click the Open button to examine the appended data set.

The appended table (shown in Figure 4.26) has the same number of columns as either of the input data sets, but twice the number of rows (circled).

Figure 4.26: The appended data set

To append two tables with different column labels, check the Force Append checkbox in the Data Append dialog (see Figure 4.27). This forces two tables with different column labels to append together using the base input data set column labels.

Figure 4.27: The Force Append checkbox.

This example appends tables with different column labels.

Select Genomics > Data Set Utilities > Append.

Follow these steps to select the Affymetrix Latin Square Experimental Design File as the Base Input Data Set.

Click Choose to select the Base Input Data Set.

Navigate to Sample Data > MicroArray > Affymetrix Latin Square.

Select the affylatin_exp.sas7bdat file and click Open to select the file.

This file is shown in Figure 4.28. Note that there are 59 rows (circled) in this table.


Figure 4.28: The Base Input Data Set

Follow these steps to select the Drosophila Aging Experimental Design File as the Append Input Data Set.

Click Choose to select the Append Input Data Set.

Navigate to Sample Data > MicroArray > Scanalyze Drosophila.

Select the drosophilaaging_exp.sas7bdat file and click Open to select the file.

This file is partially shown in Figure 4.29. Note that there are 48 rows (circled) in this table.

Figure 4.29: The Append Input Data Set

Check the Force Append checkbox (as shown in Figure 4.27).

To select the Output Folder, complete the following steps.

Click Choose to select the Output folder.


Navigate to the Genomics folder.

Select the ProcessResults folder and click Open to select the folder.

Click Select to select the folder.

Click Run to append the data sets. The location of the appended data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the appended data sets (partially shown in Figure 4.30).

Figure 4.30: The appended data set

Compare the appended data set with both of the input data sets to see the results of the Append command. The 48 rows from the append input data set are added after the 59 rows from the base input data set, for a total of 107 rows (circled). The appended table has the same labels as the Affymetrix Latin Square experimental design file. The columns that are common to both tables (Array, file, and ColumnName) are filled in with their respective values. Note that the Variable Length parameter is retained from the base input data set. The columns that are present in the base input data set but absent from the append input data set (for example, Experiment) are retained in the concatenated table, but their values are missing for the appended rows. The columns that are absent from the base input data set but present in the append input data set (for example, Sex and Line) are not retained in the concatenated table.

Note: Inverting the roles of the two input data sets results in the table shown in Figure 4.31.


Figure 4.31: The appended data set (inverted)
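The column handling described for a forced append can be summarized in a short Python sketch: the base data set defines the columns, appended rows are filled with missing values where they lack a base column, and columns found only in the append data set are dropped. This is an illustration of the behavior described in this example, not the program that the dialog runs. The sample values are hypothetical, and None stands in for a SAS missing value.

```python
def force_append(base_rows, append_rows):
    """Append rows end-to-end, keeping only the columns of the base data set."""
    base_columns = list(base_rows[0])  # column set and order come from the base data set
    appended = [dict(row) for row in base_rows]
    for row in append_rows:
        # row.get returns None (missing) for base columns absent from the appended row;
        # columns present only in the appended row are not copied.
        appended.append({name: row.get(name) for name in base_columns})
    return appended


# Hypothetical design rows, for illustration only.
base = [{"Array": 1, "ColumnName": "a_01", "Experiment": "latin"}]
extra = [{"Array": 2, "ColumnName": "d_02", "Sex": "F"}]
combined = force_append(base, extra)
print(combined)
```

Swapping the two arguments reverses which columns survive, which mirrors the inverted result described in the note above.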

Merge

The Merge command joins two tables, side-by-side, with matching row variables. This example merges the annotation data set for the Drosophila aging experiment with the input data set for this experiment. Recall from Chapter 3 that annotation data sets contain specific biological or chemical information for each row of a tall data set.

Select Genomics > Data Set Utilities > Merge.

Follow these steps to select the Drosophila Aging Annotation Data Set as the Base Input Data Set.

Click Choose to select the Base Input Data Set.


Select the drosophila_annotation.sas7bdat file and click Open to select the file.

This file is shown in Figure 4.32. Note that there are five columns and 3933 rows.

Figure 4.32: The Drosophila Aging Annotation File


The variables available in this data set are listed in the Available Variables box (Figure 4.33).

Select Spot.

Click to add Spot to the Key Variables from Base Input Data Set box.

Figure 4.33: Selecting the key variable from the Base Input Data Set

Follow these steps to select the Drosophila Aging Input Data Set as the Merge Input Data Set.

Click Choose to select the Merge Input Data Set.


Select the drosophilaaging.sas7bdat file and click Open to select the file.

This file is shown in Figure 4.34. Note that there are 49 columns and 100 rows.

Figure 4.34: The Drosophila Aging Data Set

The variables available in this data set are listed in the Available Variables box (Figure 4.35).

Select Spot.

Click to add Spot to the Corresponding Key Variables from Merge Input Data Set box.

Figure 4.35: Selecting the Key variable from the Merge Input Data Set

Specify an output folder.


Click Run to merge the data sets. The location of the merged data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the merged data set (shown in Figure 4.36).

Figure 4.36: The merged data set
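The merged data set in this example keeps only the rows whose Spot value appears in both inputs, while carrying the columns from both. Database users would call this an inner join on the key variable. The Python sketch below illustrates that result with hypothetical values. It assumes that each key value occurs at most once in the merge data set, and it is not the SAS program generated by the dialog.

```python
def merge_on_key(base_rows, merge_rows, key="Spot"):
    """Join two tables side by side, keeping only rows whose key is in both."""
    merge_index = {row[key]: row for row in merge_rows}  # assumes unique key values
    merged = []
    for base in base_rows:
        match = merge_index.get(base[key])
        if match is not None:
            merged.append({**base, **match})  # columns from both inputs
    return merged


# Hypothetical annotation and intensity rows, for illustration only.
annotation = [{"Spot": 1, "Gene": "g1"}, {"Spot": 2, "Gene": "g2"}, {"Spot": 3, "Gene": "g3"}]
intensities = [{"Spot": 1, "A1": 7.0}, {"Spot": 3, "A1": 8.5}]
merged = merge_on_key(annotation, intensities)
print(merged)
```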

Compare the merged data set with both of the input data sets to see the results of the Merge command.

Note: Only the 100 rows common to both of the input data sets are found in the merged data set. However, all of the columns present in either of the input data sets are in the merged data set.

Transpose Tall and Wide

The Transpose Tall and Wide command converts a tall data set into a wide data set or vice versa (see Chapter 3 for more detail about the tall and wide formats). This example transforms the Affymetrix Latin Square Input Data Set and its accompanying EDDS from the tall format to the wide format. A portion of the tall input data set is shown in Figure 4.37.


Figure 4.37: The Affymetrix Latin Square Input Data Set (tall)

Note that there are 59 data columns and 1604 data rows.

Select Genomics > Data Set Utilities > Transpose Tall and Wide. The Transpose Tall and Wide dialog opens, as shown in Figure 4.38.

Figure 4.38: The Data Transpose Dialog

Note that there are two tabs in the dialog. Because you are transposing a tall data set into a wide data set,

Make sure that the Tall -> Wide tab is selected.


To select the Affymetrix Latin Square Input Data Set as the Base Input Data Set, complete the following steps.

Click Choose to select the input tall data set.


Select the affylatin.sas7bdat file and click Open to select the file.

In this example, there is no need to specify either the variables or prefixes for wide column names. To select the Affymetrix Latin Square EDDS as the EDDS, complete the following steps.

Click Choose to select the EDDS.


Select the affylatin_exp.sas7bdat file and click Open to select the file.


Click Run to transpose the data sets.

The location of the transposed data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the transposed data set.

The transposed data set with wide data format is shown in Figure 4.39.

Figure 4.39: The transposed data set

Compare the transposed (wide) data set (Figure 4.39) with the original tall data set (Figure 4.37). Note that the data has been transposed; there are now 1604 data columns and 59 data rows. In addition, the _wid abbreviation has been added to the transposed file name. A SAS data set in wide format can be transposed into a data set in tall format in a similar manner by selecting the Wide -> Tall tab.
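The tall-to-wide conversion amounts to turning each data column of the tall table into one row of the wide table, with one wide column per original row. The Python sketch below shows this on a tiny hypothetical table. The identifier column Probe, the use of ColumnName for the first wide column, and the values are all assumptions made for illustration, not the layout that JMP Genomics necessarily produces.

```python
def tall_to_wide(tall_rows, id_column, value_columns):
    """Turn each data column of a tall table into one row of a wide table."""
    wide = []
    for column in value_columns:
        wide_row = {"ColumnName": column}  # hypothetical name for the identifying column
        for row in tall_rows:
            # Each tall row becomes a wide column named after its identifier.
            wide_row[str(row[id_column])] = row[column]
        wide.append(wide_row)
    return wide


# Hypothetical tall table: 2 rows (probes) by 2 data columns (arrays).
tall = [
    {"Probe": "p1", "a_01": 1.0, "b_02": 2.0},
    {"Probe": "p2", "a_01": 3.0, "b_02": 4.0},
]
wide = tall_to_wide(tall, "Probe", ["a_01", "b_02"])
print(wide)
```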


Transpose Rectangular

The Transpose Rectangular command creates a new SAS data set by transposing a block, or subset, of variables in a SAS data set. The variables (columns) become observations (rows) and observations become variables.

Select Genomics > Data Set Utilities > Transpose Rectangular. The Data Transpose Rectangular dialog opens.

Click Load.

Select the default settings for the AffymetrixLatinSquareExample and click OK.


Figure 4.40: The Data Transpose Rectangular dialog

The input data set is from the Affymetrix Latin Square Example included with JMP Genomics. A portion of it is shown in Figure 4.41.


Figure 4.41: The input data set

Do not change any of the default settings in the dialog.

Click Run to transpose the data set.

The location of the transposed data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the transposed data set.

The transposed data set with the wide data format is partially shown in Figure 4.42.

Figure 4.42: A portion of the transposed data set

Compare the transposed (wide) data set (Figure 4.42) with the original tall data set (Figure 4.41) to see the transposed data. In addition, the identifiers listed in the Probe_Set_ID column in the input data set have been separated into two columns (Array and Treatment), as specified in the SAS Code parameter pane in the dialog.


Unstack

The Unstack command transposes a stacked data set into a tall data set and an EDDS. A stacked data set has the variables of interest stacked into a single column. This example converts a stacked data set (Figure 4.43) to a tall data set and an EDDS.

Figure 4.43: A portion of the stacked data set

Typically, stacked data sets contain a smaller number of columns when compared with the number of rows (circled in Figure 4.43). Note the repetitiveness in the ChipID, Experiment and Series columns.

Select Genomics > Data Set Utilities > Unstack. The Data Unstack dialog opens, as shown in Figure 4.44.


Figure 4.44: The Data Unstack dialog

To select the Affymetrix Latin Square Stacked Data Set as the Input Data Set, complete the following steps.

Click the Choose button to select the Input Data Set.


Select the affylatin_stack.sas7bdat file and click Open to select the file.

The variable names for the data set are listed in the Available Variables box (shown in Figure 4.45).


Figure 4.45: Variables

To unstack the data set, first specify the variables containing the numerical data, the variables to transpose by, and the variables to make up the columns in the new data set. The criteria and procedures for this process are outlined in the following sections.

The Response Variable is the variable that contains the actual numeric data to be transposed. In this example, the data is in the log2i column in the input data set.

Click log2i.

Click to add log2i to the Response Variable box.

JMP Genomics uses the unique levels in the Row Variables to form the rows in the output tall data set. To select the Row Variables,

Click Unit.

Click to add Unit to the Row Variables box.

Repeat for AffyID and Probe.

JMP Genomics uses the unique combinations of levels in the Column Variables to form the columns in the output tall data set. These levels must not overlap the Row Variables. To select the Column Variables, complete the following steps.

Click ChipID.

Click to add ChipID to the Column Variables box.

The Array Variable identifies the array, chip or spectrum in the input data set. This variable is typically also identified as a Column Variable. To select the Array Variable, complete the following steps.

Click ChipID.

Click to add ChipID to the Array Variable box.

The Channel Variable identifies the channel or dye column in the input data set. Since this is a one-channel experiment,


Leave the Channel Variable box blank.

Because the values in the Series column offer no valuable information, complete the following steps to drop this column from the output tall data set.

Click Series.

Click to add Series to the Drop Variables box.

To specify a prefix for the names of the Response columns in the output tall data set complete the following steps.

Type Chip_ in the Prefix for Column Names in Tall Data Set box.

Specify an Output Folder.

The completed dialog should appear like the one shown in Figure 4.46.

Figure 4.46: The completed dialog

Click Run to transpose the data set. The location of the transposed data set and EDDS generated by this process is listed in a SAS Message dialog (shown in Figure 4.9).

Click Open to examine the output data set (shown in Figure 4.47).


Figure 4.47: The output, tall data set

Compare the stacked, input data set (Figure 4.43) with the tall, output data set (Figure 4.47) to see the results of the unstack process. Note that the output data are grouped by probe, chip, and Affymetrix ID. In addition, the output data set has more columns but many fewer rows.
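The roles of the Response, Row, and Column Variables can be illustrated with a small Python sketch that pivots stacked records into tall rows and collects a minimal EDDS along the way. It uses the variable names from this example (log2i, Unit, AffyID, Probe, ChipID, Series) and the Chip_ prefix, but the data values are hypothetical, and the two-column EDDS layout is an assumption rather than the exact output of the Unstack process.

```python
def unstack(stacked_rows, response, row_vars, column_var, prefix=""):
    """Pivot stacked records into tall rows plus a minimal design table."""
    tall = {}
    edds = {}
    for record in stacked_rows:
        # One tall row per unique combination of the row variables.
        key = tuple(record[name] for name in row_vars)
        column_name = f"{prefix}{record[column_var]}"
        tall.setdefault(key, dict(zip(row_vars, key)))[column_name] = record[response]
        # One design row per new response column (layout assumed for illustration).
        edds.setdefault(column_name, {"ColumnName": column_name, column_var: record[column_var]})
    return list(tall.values()), list(edds.values())


# Hypothetical stacked records; Series is not carried over because it is
# neither a row variable nor the response.
stacked = [
    {"Unit": 1, "AffyID": "x_at", "Probe": 1, "ChipID": 1, "Series": "s", "log2i": 7.0},
    {"Unit": 1, "AffyID": "x_at", "Probe": 1, "ChipID": 2, "Series": "s", "log2i": 7.5},
    {"Unit": 1, "AffyID": "x_at", "Probe": 2, "ChipID": 1, "Series": "s", "log2i": 6.0},
    {"Unit": 1, "AffyID": "x_at", "Probe": 2, "ChipID": 2, "Series": "s", "log2i": 6.5},
]
tall, edds = unstack(stacked, "log2i", ["Unit", "AffyID", "Probe"], "ChipID", prefix="Chip_")
print(tall)
print(edds)
```

Four stacked records collapse into two tall rows with two response columns each, which matches the observation that the output has more columns but fewer rows.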

Statistics and Transforms

The Statistics and Transforms section includes Data Step, Merge and Transform, Rank Rows, Sort Rows, Statistics for Columns, Statistics for Rows, and Transform.

Data Step

The Data Step command modifies a SAS data set by executing SAS Data Step commands on the data set. You must be familiar with SAS programming to use this utility. The SAS language has an array of statements and functions to perform a vast number of manipulations of a SAS data set. Refer to the DATA STEP documentation for further details. This documentation is available at http://support.sas.com/documentation.

Merge and Transform

The Data Merge and Transform command merges two SAS data sets that share a common set of variables, uses SAS syntax to compute an arbitrary function of each pair of variables having the same name, and generates an output data set consisting of a transformed merge of the two input data sets. You must be familiar with SAS programming to use this utility. Refer to the Base SAS documentation for further details. This documentation is available at http://support.sas.com/documentation.

Rank Rows

The Rank Rows command creates a new table in which each observation within each of the variables in the data set is replaced by that observation’s numerical ranking. This example ranks the responses of the 100 genes observed in the Drosophila aging experiment to the different experimental conditions.

Select Genomics > Data Set Utilities > Rank Rows. The Data Rank dialog opens, as shown in Figure 4.48.


Figure 4.48: The Data Rank dialog

To select the Drosophila Aging Input Data Set as the Input Data Set, complete the following steps.



Select the drosophilaaging.sas7bdat file and click Open to select the file.

This file is shown in Figure 4.34. The variable names for the data set are listed in the Available Variables box (shown in Figure 4.49).

Select all of the available variables except for Spot.

Click to add the variables to the Rank Variable box.

Figure 4.49: Selecting the variables to rank

The Advanced tab allows specification of the rank order, rank method, and method for handling ties. You may also add new variable names.




The location of the ranked data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the ranked data set (shown in Figure 4.50).

Figure 4.50: The ranked data set
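Ranking replaces each value in a column with its position in the sorted order of that column. The Python sketch below ranks a single column in ascending order and gives tied values the average of the positions they occupy. Both choices are assumptions made for this sketch, since the rank order and the treatment of ties are set on the Advanced tab, and the input values are hypothetical.

```python
def rank_column(values):
    """Return ascending 1-based ranks for values, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks


# Hypothetical intensities with one tie.
ranks = rank_column([10, 30, 20, 30])
print(ranks)
```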

Compare the input data set (Figure 4.34) with the ranked output data set (Figure 4.50) to see that the observed values in each of the columns have been replaced with the ranks (from 1 to 100) for the observations within each column.

Sort Rows

The Sort Rows command sorts a data set’s rows by the values in one or more columns. This example sorts the data from the Drosophila aging experiment according to age, line, and sex.

Select Genomics > Data Set Utilities > Sort Rows. The Data Sort dialog opens, as shown in Figure 4.51.

Figure 4.51: The Data Sort dialog

To select the Drosophila Aging EDDS as the Input Data Set, complete the following steps.




Select the drosophilaaging_exp.sas7bdat file and click Open to select the file. A portion of this file is shown in Figure 4.52.

Figure 4.52: The Drosophila Aging Experiment EDDS

The variable names for the data set are listed in the Available Variables box, shown in Figure 4.53. The output table is sorted by the variables in the order in which they appear in the Sort Variables box. In the following example, the sort order is age, then line, and then sex.

Select Age, Line, and Sex.

Click to add the variables to the Sort Variable box.

Figure 4.53: Selecting the variables to sort.



The location of the sorted data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the sorted data set, shown in Figure 4.54.


Figure 4.54: The sorted data set
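A multi-variable sort of this kind can be sketched in Python by building a key from the sort variables in priority order. This is only an illustration of the ordering, not the code behind the Sort Rows process, and the row values below are hypothetical rather than taken from the sample data.

```python
def sort_rows(rows, sort_variables):
    """Sort a list of row dicts by the given variables, in priority order."""
    return sorted(rows, key=lambda row: tuple(row[name] for name in sort_variables))


# Hypothetical rows; the values are illustrative only.
rows = [
    {"age": 6, "line": "B", "sex": "M"},
    {"age": 1, "line": "B", "sex": "F"},
    {"age": 6, "line": "A", "sex": "F"},
    {"age": 1, "line": "A", "sex": "M"},
    {"age": 1, "line": "A", "sex": "F"},
]
sorted_rows = sort_rows(rows, ["age", "line", "sex"])
```

Changing the order of the names in the list changes which variable takes priority, which mirrors the role of the order in the Sort Variables box.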

Note that the rows are sorted first by age, then by line, and then by sex.

Statistics for Columns

The Statistics for Columns command calculates a variety of statistics for the columns in a SAS data set. This example calculates the mean, median, standard deviation, minimum, and maximum for each column and probe set in the Affymetrix Latin Square data set.

Select Genomics > Data Set Utilities > Statistics for Columns.

The Statistics for Columns dialog opens, as shown in Figure 4.55.


Figure 4.55: The Statistics for Columns dialog

To select the Affymetrix Latin Square Data Set as the Input Data Set, follow these steps.




The variable names for the data set are listed in the Available Variables box (shown in Figure 4.56). To select the variables to be summarized, complete the following steps.

Select all of the available variables from a_01 through q_59.

Click to add the variables to the Variables to be Summarized box.

Figure 4.56: Selecting the variables to be summarized

To calculate the statistics for all the rows in the columns,

Leave the Variables by Which to Summarize box blank.



Click on the Options tab to select the statistics to run.

Hold the Ctrl key down while clicking Max, Mean, Median, Min, and StdDev.

Click the Run button to summarize the data set.

The location of the output data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the output data set, partially shown in Figure 4.57, which lists the statistics for each column.

Figure 4.57: The summarized data set
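The five statistics selected in this example can be reproduced for a small table with the Python statistics module. The sketch below is an illustration only. It assumes the sample standard deviation (divisor n - 1), and the values in column a_01 are hypothetical.

```python
import statistics


def column_stats(table):
    """Compute Max, Mean, Median, Min, and StdDev for each column of a dict-of-lists table."""
    return {
        name: {
            "Max": max(col),
            "Mean": statistics.mean(col),
            "Median": statistics.median(col),
            "Min": min(col),
            "StdDev": statistics.stdev(col),  # sample standard deviation (n - 1)
        }
        for name, col in table.items()
    }


# Hypothetical values for one column.
summary = column_stats({"a_01": [1.0, 2.0, 3.0, 4.0]})
```

The result has one entry per column, in the same spirit as the output data set that lists the statistics for each column.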

Statistics for Rows

The Statistics for Rows command computes row-wise statistics for a data set. This example computes the standard deviation and standard error for each row in the Affymetrix Latin Square data set and displays the results based on a condition.

Select Genomics > Data Set Utilities > Statistics for Rows.

The Data Row Statistics dialog opens, as shown in Figure 4.58.


Figure 4.58: The Data Row Statistics dialog

To select the Affymetrix Latin Square Data Set as the Input Data Set, complete the following steps.




The variable names for the data set are listed in the Available Variables box, shown in Figure 4.59. To select the variables to be summarized,


Click to add the variables to the Variables to be Summarized box.

Figure 4.59: Selecting the variables to be summarized


Click the Statistics tab.

Select STD and STDERR as the statistics to compute, as shown in Figure 4.60.


Figure 4.60: Selecting the statistics

To filter rows based on these statistics, you specify the filtering condition in the Options tab. For example, to eliminate any rows for which the standard deviation value is greater than 2,

Click Options.

Specify the SAS syntax STD>2 as shown in Figure 4.61.

Figure 4.61

Type Affylatin_std2 as the name of the output data set.

Click Run to summarize the data set.

The location of the output data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the output data set, partially shown in Figure 4.62.

Figure 4.62: The summarized data set
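The two row statistics used in this example, and the evaluation of a condition such as STD>2, can be sketched in Python as follows. This is an illustration under stated assumptions: STD is taken to be the sample standard deviation, STDERR is taken to be STD divided by the square root of the number of values in the row, and the row values are hypothetical. The sketch only evaluates the condition for each row; how the process uses the result to select rows is determined by the dialog.

```python
import math
import statistics


def row_stats(row):
    """Return the standard deviation and standard error of one row of values."""
    std = statistics.stdev(row)  # sample standard deviation (n - 1)
    return {"STD": std, "STDERR": std / math.sqrt(len(row))}


def meets_condition(row, threshold=2.0):
    """Evaluate the condition STD > threshold for one row."""
    return row_stats(row)["STD"] > threshold


# Hypothetical rows of expression values.
variable_row = [1.0, 3.0, 5.0, 7.0]
stable_row = [1.0, 1.1, 0.9, 1.0]
flags = [meets_condition(variable_row), meets_condition(stable_row)]
```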

Compare the input data set (Figure 4.37) with the summarized output data set. Note the addition of two columns to the data set, containing the summary statistics for each of the rows. In addition, whereas the original data set had 1604 rows, the output data set has 64 rows. 1540 rows have been filtered out, based on the condition set in the Options tab.

Data Transform

The Transform command performs a mathematical transformation on specified variables. Transformations include exp2, exp, exp10, log2, log, log10 and sqrt (in the Type of Transformation list), or formulas specified in the Transform Expression box.

This example calculates the square root of each data point in the Affymetrix Latin Square Data Set.

Select Genomics > Data Set Utilities > Transform.

The Data Transform dialog opens, as shown in Figure 4.63.

Figure 4.63: The Data Transform dialog

To select the Affymetrix Latin Square Data Set as the Input Data Set, complete the following steps.

Navigate to Sample Data > MicroArray > Affymetrix Latin Square.

The variable names for the data set are listed in the Available Variables box, shown in Figure 4.64. To select the variables to be transformed, complete the following steps.

Click to add the variables to the Variables to be Transformed box.


Figure 4.64: Selecting the variables to be transformed

Select sqrt from the Type of Transformation drop-down menu.


Click Run to transform the data set.

The location of the output data set generated by this process is listed in a SAS Message dialog.

Click Open to examine the output data set, shown in Figure 4.65.

Figure 4.65: The transformed data set

Compare the input data set (Figure 4.37) with the transformed output data set to see how the values have changed.
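The named transformations can be mimicked in Python with a lookup table of functions. The sketch below is illustrative only. It assumes that exp2 and exp10 mean 2 and 10 raised to the value, and that log is the natural logarithm. The column names and values are hypothetical.

```python
import math

# Assumed meanings of the transformation names listed above.
TRANSFORMS = {
    "exp2": lambda x: 2.0 ** x,
    "exp": math.exp,
    "exp10": lambda x: 10.0 ** x,
    "log2": math.log2,
    "log": math.log,      # natural logarithm (assumption)
    "log10": math.log10,
    "sqrt": math.sqrt,
}


def transform_columns(table, columns, kind):
    """Apply one named transformation to the selected columns of a dict-of-lists table."""
    func = TRANSFORMS[kind]
    return {
        name: ([func(value) for value in col] if name in columns else list(col))
        for name, col in table.items()
    }


# Hypothetical table: only a_01 is transformed, the probe column is copied unchanged.
table = {"probe": ["p1", "p2"], "a_01": [4.0, 9.0]}
result = transform_columns(table, ["a_01"], "sqrt")
```

The logarithms and the square root raise a ValueError for values outside their domain, so real data may need to be checked before a transformation is applied.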


Export

The Export command exports data from a SAS data set to a file. Supported formats are tab-delimited text (.txt), comma-separated values (.csv), blank-delimited text (.txt), Excel (.xls), and JMP (.jmp).

This example exports the Affymetrix Latin Square Data Set as an Excel file.

Select Genomics > Data Set Utilities > Export.

The Data Export dialog opens, as shown in Figure 4.66.

Figure 4.66: The Data Export dialog

To select the Affymetrix Latin Square Data Set as the Input Data Set, complete the following steps.




To specify the format of the output file, complete the following steps.

Choose the Excel format.


Click Run to generate the Excel file, shown in Figure 4.67.


Figure 4.67: The exported Excel file
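The three delimited text formats differ only in the character that separates the fields. The Python sketch below illustrates this with the standard csv module. It does not cover the Excel or JMP formats, it is not the code used by the Export process, and the column names and values are hypothetical.

```python
import csv
import io

# Field separators for the three delimited text formats.
DELIMITERS = {"tab": "\t", "csv": ",", "blank": " "}


def export_text(header, rows, fmt="csv"):
    """Return the table as delimited text in the requested format."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=DELIMITERS[fmt], lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-column table.
text = export_text(["probe", "a_01"], [["p1", 1.5], ["p2", 2.25]], fmt="tab")
```

To write a file instead of a string, the same writer can be attached to a file opened with newline="" in place of the StringIO buffer.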

CHAPTER 5

Genetic Marker Case-Control Data

In addition to a set of general statistical and data processing routines, JMP Genomics offers a collection of processes for analysis of genetic marker data. Access these processes from five submenus of the JMP Genomics menu, as shown in Figure 5.1.

Figure 5.1: The Genetics submenus

This chapter focuses on processes appropriate for case-control data. In case-control data, individuals are assumed to be:

• unrelated in recent generations, and
• classifiable according to some phenotype.

The phenotype is typically binary with two generic levels, “case” and “control”, although several of the methods handle multi-category or continuous / quantitative phenotypes. Analysis of data for which family or pedigree information is available, in addition to markers and phenotypes, is discussed in Chapter 6.

Note: Nearly all of the processes discussed in this and the next chapter call procedures from SAS/Genetics™. Detailed descriptions of these procedures and the computations performed are available in the SAS/Genetics™ 9.1.3 User’s Guide. Refer to this guide for details concerning the usage and computational methods of these SAS procedures.


The Genetic Marker Example

The example used in the analyses described in this chapter is the Genetic Marker data set described in Chapter 1. The data set and associated files can be found in the Sample Data folder that comes with JMP Genomics. To familiarize yourself with the data set for this example, complete the following steps.


Navigate to Sample Data > Genetics.

Select the samplegmdata.sas7bdat file and click Open to select the file.

The file opens the JMP table, shown in Figure 5.2.

Figure 5.2: Partial view of the samplegmdata.sas7bdat file

Examine the data contained in Figure 5.2. The data are in wide form, with 1000 rows corresponding to individuals and 130 columns corresponding to various data on these individuals. These data do, in fact, contain family and pedigree information, but this chapter considers only the unrelated individuals for which both father=0 and mother=0 (the founders). The disease column contains the binary trait of primary interest. There are also four quantitative traits and sixty markers for each individual. The marker data occur in pairs, so that the ma1 and ma2 column entries contain the alleles in the first genotype, ma3 and ma4 the second genotype, and so on. The data are computer-simulated.
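The restriction to founders amounts to keeping the rows in which both parent columns are 0. A minimal Python sketch of that condition follows. The father and mother column names come from the description above; the Ind_id column is used only as a label here and the rows are hypothetical.

```python
def founders(rows):
    """Keep only individuals whose father and mother are both coded as 0."""
    return [row for row in rows if row["father"] == 0 and row["mother"] == 0]


# Hypothetical individuals: two founders and one child.
individuals = [
    {"Ind_id": 1, "father": 0, "mother": 0},
    {"Ind_id": 2, "father": 0, "mother": 0},
    {"Ind_id": 3, "father": 1, "mother": 2},
]
unrelated = founders(individuals)
```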

Genetic Marker Data Format

The genetics processes in JMP Genomics analyze data consisting of individuals that have been genotyped at a set of genetic markers of interest. The required data structure for most of the genetics processes is the wide form, in which rows correspond to individuals and columns correspond to pedigree information, phenotypes, and genotypes. Refer to Chapter 3 for a more thorough discussion of tall and wide data sets. Genotypes can be represented in two different ways, and the two data sets partially illustrated in Figure 5.3 illustrate these different representations of the marker genotypes. JMP Genomics can process either representation.


Figure 5.3: Two different ways of representing marker genotypes

These data sets list the genotypes for the same group of individuals. Each individual is represented in a row. In the data set on the left, the alleles that comprise the genotype at each marker are listed in sequential pairs of columns. Each column in the pair contains one of the two alleles that make up the genotype. For example, the genotype of the first marker is listed in columns ma1 and ma2; the alleles that make up the genotype of the second marker are listed in columns ma3 and ma4, and so on. Alternatively, the alleles that make up the genotype at each locus can be listed in a single column with a delimiter (such as the “/” character used in the data set on the right in Figure 5.3 in columns g1−g3) separating the two alleles observed at the marker for the individual. Each of the genetics processes that contain a Marker Variables field for specifying the marker genotype variables offers a Format of Marker Variables option that indicates whether the variables in the data set correspond to individual alleles, two per marker, or genotypes with the delimiter of your choice.
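The two representations carry the same information, and one can be converted to the other. The Python sketch below illustrates the conversion for a single individual. The prefixes ma and g and the “/” delimiter follow Figure 5.3 as described above; the allele values are hypothetical, and the functions are illustrative rather than part of JMP Genomics.

```python
def alleles_to_genotypes(row, n_markers, allele_prefix="ma",
                         genotype_prefix="g", delimiter="/"):
    """Combine consecutive allele columns (ma1, ma2), (ma3, ma4), ... into g1, g2, ..."""
    genotypes = {}
    for marker in range(1, n_markers + 1):
        first = row[f"{allele_prefix}{2 * marker - 1}"]
        second = row[f"{allele_prefix}{2 * marker}"]
        genotypes[f"{genotype_prefix}{marker}"] = f"{first}{delimiter}{second}"
    return genotypes


def genotypes_to_alleles(row, n_markers, allele_prefix="ma",
                         genotype_prefix="g", delimiter="/"):
    """Split delimited genotype columns g1, g2, ... back into pairs of allele columns."""
    alleles = {}
    for marker in range(1, n_markers + 1):
        first, second = row[f"{genotype_prefix}{marker}"].split(delimiter)
        alleles[f"{allele_prefix}{2 * marker - 1}"] = first
        alleles[f"{allele_prefix}{2 * marker}"] = second
    return alleles


# Hypothetical individual genotyped at two markers.
wide = {"ma1": "A", "ma2": "G", "ma3": "C", "ma4": "C"}
delimited = alleles_to_genotypes(wide, n_markers=2)
```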

In addition to the main data set containing pedigree, phenotype, and genotype information, there might also be information about the genetic markers in an annotation data set. For the annotation data set, the rows represent markers and they must match the order of the markers in the main data set. Label, chromosome, physical position, GenBank accession number, and dbSNP identifier are examples of the variables that the annotation data set could include. Most of JMP’s Genetics processes provide an Annotation tab that allows you to specify this data set and cast variables into particular roles to be used in the analysis and output.

Importing Genetic Marker Data

There are a number of different ways to prepare genetic marker data for processing with JMP Genomics. Your choice of import methods depends on the format of the raw data files and the types of analyses you want to perform. The goal is to create a wide SAS data set, and optionally a corresponding SAS annotation data set. With SAS programming experience, these data sets can be created directly in SAS before working with them in JMP Genomics. Alternatively, if the data are already in wide form, but are in text or Excel formats, open them directly in JMP, alter them as needed, and then save them as SAS data files (see Chapter 3 for an example of generating a SAS data set from an Excel file). JMP Genomics also offers customized import routines for seven different specialized genetics formats (Affymetrix SNP CHP, Affymetrix SNP CEL, Illumina SNP, Arlequin, HapMap, NEXUS, and Pedigree) divided among the Affymetrix, Illumina and Other Genetics submenus.


Finally, the generic Import Individual Text, CSV, or Excel Files process directly creates a SAS data set from one file. The Import a Designed Experiment from Text, CSV, or Excel Files process does the same if the data are spread across multiple files. The latter requires an accompanying Experimental Design File. See Chapter 3 for examples illustrating the generation of SAS data sets.

Genetic Marker Statistics

The Genetic Marker Statistics submenu offers five analytical processes, as shown in Figure 5.4.

Figure 5.4: The Genetic Marker Statistics submenu.

These processes calculate a variety of measurements and statistics for both phenotypic and genotypic markers and often serve as the starting point for further experiments and analyses.

Marker Properties

A convenient way to explore several properties of all the markers is to use the Marker Properties analytical process. Use the following steps to run this process on the samplegmdata.sas7bdat data set described in Chapter 1.

Select Genomics > Genetic Marker Statistics > Marker Properties.


Figure 5.5: The Marker Properties dialog


Click Load.

Select the settings for the GeneticMarkerExample.

Click OK to complete the Marker Properties dialog, as shown in Figure 5.6.

Figure 5.6: The completed General tab of the Marker Properties dialog

Recall that this input data set contains variables ma1 – ma120, and that each specifies a single allele. These markers were selected from the list in the Available Variables box and added to the Marker Variables box.

Note: This data set also contains family data. In order to run the analysis on the subset of unrelated individuals, the Filter to Include Observations field should contain a filter that restricts the analysis to the founders.

Do not make any changes to the General tab.

Click on the Annotation tab to bring up the tab shown in Figure 5.7.


Figure 5.7: The Annotation tab.

Examine the Annotation tab. This tab specifies a separate annotation data set that contains information about the markers being analyzed. The annotations used for the markers in this example are listed in the annotation data set samplemap.sas7bdat, found in the Sample Data folder.

Click Open to examine the annotation data set, as shown in Figure 5.8.

Figure 5.8: The samplemap.sas7bdat file.

Sixty different markers, corresponding to the 60 pairs (ma1 – ma120), are described in this data set. Note that the rows in this data set must be in exactly the same order as the marker columns in the input data set.

There are three columns in the annotation data set, and each of these variables serves a different role in the analysis. The values in the Marker column label the markers in the output data set and any plots. The values in the CandGene column designate the candidate gene in which each marker resides. This variable groups analyses with identical CandGene levels and produces separate plots of the HWE p-values for each group. The values in the Location column list the chromosomal location of each of the markers and are used for the x-axis of these plots. Each of these variables is specified by default in this example.

The Filter to Include Markers field located at the bottom of this tab allows you to enter text that subsets the annotation data set, restricting the markers from the input data set that are analyzed. This can be especially useful when selecting marker variables with the List-Style Specification of Marker Variables field on the General tab. When the marker genotypes are in columns that all begin with the same prefix, the list-style specification is a convenient way to select all markers, and the Filter to Include Markers field can then filter out particular marker variables based on the values of variables in the annotation data set.


Do not make any changes to the Annotation tab.

Click on the Options tab to bring up the tab shown in Figure 5.9.

Figure 5.9: The Options tab

Examine the Options tab. Because consecutive pairs of these columns make up the genotype at each of the 60 markers, the Alleles radio button is selected for the Format of Marker Variables parameter.

Do not make any changes to the Options tab.

Click the Output tab to bring up the tab shown in Figure 5.10.

Figure 5.10: The Output tab

Note that the Create Frequency Charts box, Create HTML box, and Create Cell Plot box are all checked. With all three boxes checked, the output from this process includes JMP frequency charts for the alleles and genotypes, HTML files containing SAS PROC ALLELE tables summarizing marker information and allele and genotype frequencies, and a cell plot representing marker genotypes. Note that the Output File prefix box is blank. When this box is left blank, the name of the input data set is used as the prefix when naming the output files. This allows all analyses performed on the same genetic marker data to be named similarly and thus easily identified. Alternatively, you can specify a different prefix to use; for example, a project identifier for the analyses you are running. Note: If the same prefix is used for multiple runs of the same process and the same output folder is specified, results from the previous run will be overwritten.

Click Run.

Figure 5.11 shows some of the output.


Figure 5.11: Output from the Marker Properties process

Explore the results in the different windows. The cell plot provides a global view of the genotypes and lets you see patterns of homozygosity / heterozygosity using three colors. The histograms of allele and genotype frequencies provide locus-by-locus details. Note that each set of graphs is dynamically associated with a JMP table containing corresponding numerical results.
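The allele and genotype frequencies shown in the histograms are simple proportions, which the following Python sketch computes for one marker. It is an illustration, not the PROC ALLELE computation: it assumes no missing alleles, treats genotypes as unordered pairs, and uses hypothetical data.

```python
from collections import Counter


def marker_frequencies(genotypes):
    """Return allele frequencies, genotype frequencies, and heterozygosity for one marker.

    genotypes is a list of (allele1, allele2) pairs, one per individual.
    """
    n = len(genotypes)
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    # Sort each pair so that ("A", "B") and ("B", "A") count as the same genotype.
    genotype_counts = Counter(tuple(sorted(pair)) for pair in genotypes)
    allele_freq = {a: count / (2 * n) for a, count in allele_counts.items()}
    genotype_freq = {g: count / n for g, count in genotype_counts.items()}
    heterozygosity = sum(count for g, count in genotype_counts.items() if g[0] != g[1]) / n
    return allele_freq, genotype_freq, heterozygosity


# Hypothetical genotypes for four individuals at one marker.
observed = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
allele_freq, genotype_freq, heterozygosity = marker_frequencies(observed)
```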

Linkage Disequilibrium

The Linkage Disequilibrium (LD) process offers various displays representing measures of linkage disequilibrium between pairs of markers.

Note: LD measures statistical association between groups of alleles at different loci. This is different from linkage analysis, which refers to techniques quantifying genetic distances. Due to the modern availability of fine-scale marker data, JMP Genomics currently focuses more on LD than on linkage analysis, although certain methods available in JMP Genomics provide information on linkage.

Select Genomics > Genetic Marker Statistics > Linkage Disequilibrium.



Figure 5.12: The Linkage Disequilibrium dialog

Click Load.

Select the default settings for the GeneticMarkerExample.

Click OK to complete the Linkage Disequilibrium dialog, as shown in Figure 5.13.

Figure 5.13: The General (left) and Annotation (right) tabs of the completed Linkage Disequilibrium dialog

Examine the General and Annotation tabs. As discussed for Marker Properties, the marker variables have been selected, a filter to limit the analysis to the founders has been specified, the annotation data set has been chosen, and the annotation markers have been defined.

Do not make any changes to either the General or the Annotation tabs.

Click the Options tab to bring up the tab shown in Figure 5.14.



Examine the Options tab. As discussed for Marker Properties, Alleles is selected for the Format of Marker Variables parameter.

Do not make any changes to the Options tab.

Click on the Output tab to bring up the tab shown in Figure 5.15.

Figure 5.15: The Output tab

Examine the Output tab. This tab specifies parameters for the LD contour plot as well as other output.

Click Run. Figure 5.16 shows some of the results.


Figure 5.16: The output of the Linkage Disequilibrium process
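For readers who want to see what a pairwise LD measure looks like numerically, the Python sketch below computes three common ones (D, D′, and r²) for two biallelic markers from haplotype and allele frequencies. Which measures the Linkage Disequilibrium process reports is not stated here, so the choice of these three is an assumption, and the frequencies are hypothetical. The sketch also assumes that haplotype frequencies are already known.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A at marker 1 and B at marker 2
    p_a, p_b: frequencies of allele A and allele B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both markers must be polymorphic")
    d = p_ab - p_a * p_b
    # The largest magnitude D can take depends on its sign.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = d / d_max
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared


# Hypothetical frequencies: A and B always occur together (complete LD).
complete = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)
# Hypothetical frequencies: alleles combine at random (no LD).
independent = ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5)
```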

Explore the results in the various windows. Note that each set of graphs is dynamically associated with a JMP table containing numerical results.

Other Processes

Three other processes are available under Genomics > Genetic Marker Statistics. Phenotype Summary provides a means to explore non-genetic variables that you have collected about the sample individuals. LD tagSNP Selection uses an LD measure to define bins of SNPs. Each bin is represented by a single SNP that is used in association studies. This grouping effectively reduces the number of SNPs to a small subset of tagSNPs that need to be considered. Malecot LD Map fits the Malecot model to pairwise marker statistics and constructs an associated one-dimensional map in terms of LD units. Default example settings are available for both the Phenotype Summary and LD tagSNP Selection processes and you are encouraged to run them in order to see what functionality these two processes offer. Refer to the JMP Genomics User Guide – Supplement for more details on this process.

Association Testing

There are seven processes available in JMP Genomics for the association mapping of a trait or disease using genetic marker data. These include Case-Control Association, PCA for Population Stratification, Marker-Trait Association, SNP-Trait Association, transmission disequilibrium tests for either quantitative or binary traits (Quantitative TDT and TDT, respectively), and SNP Interaction Testing (experimental), as shown in Figure 5.17.


Figure 5.17: The Association Testing submenu

For a sample of unrelated individuals, Case-Control Association and Marker-Trait Association are appropriate, while the Quantitative TDT and TDT processes are designed for family data. The latter three processes include example data consisting of samples of genotyped parent-offspring trios or sibships, discussed further in Chapter 6. The processes can be further distinguished by the type of trait on which they perform association testing. Case-Control Association and TDT offer chi-square tests for binary traits such as disease status. The other three processes provide methods for analyzing quantitative traits and can accommodate covariates. Marker-Trait Association can additionally handle binary or count trait variables and can adjust for strata variables or random effects. Survival traits can also be tested in the Marker-Trait Association process. Table 5.1 provides a summary of the appropriate process for each type of analysis.

Table 5.1: Selection of Appropriate JMP Genomics Process for Different Types of Analyses

Columns: Family Relationship; JMP Genomics Process; Type of Trait (Binary, Quantitative, Count, Survival, Nominal, Ordinal)

Unrelated individuals: Case-Control Association, PCA for Population Stratification, Marker-Trait Association, SNP-Trait Association, SNP Interaction Testing

Individuals grouped in families: Quantitative TDT, TDT

The following example uses the Case-Control Association process to analyze the binary variable indicating disease status for the samplegmdata.sas7bdat data set. Default example settings are available for the Marker-Trait Association, SNP-Trait Association, SNP Interaction Testing, Quantitative TDT, and TDT processes. Refer to the JMP Genomics User Guide – Supplement for more details on the SNP-Trait Association process. The TDT process is discussed in detail in Chapter 6. You are encouraged to run the remaining two processes to see what functionality they offer.

Case-Control Association

Select Genomics > Association Testing > Case-Control Association.


The Case-Control Association dialog shown in Figure 5.18 opens.

Figure 5.18: The Case-Control Association dialog

Click Load.


Click OK to complete the dialog as shown in Figure 5.19.


Figure 5.19: The completed General (left) and Annotation (right) tabs of the Case-Control Association dialog

Examine the General and Annotation tabs of the completed dialog shown in Figure 5.19. As discussed for Marker Properties, the marker variables have been selected, a filter to limit the analysis to the founders has been specified, the annotation data set has been chosen, and the annotation markers have been defined. When the number of marker variables is large, it is often more convenient to type the list of marker variables into the List-Style Specification of Marker Variables box, rather than entering each variable into the Marker Variables box. For this example, first remove all the variables in the Marker Variables box and type ma1-ma120 in the List-Style Specification of Marker Variables box. Remember, SAS variable names are not case-sensitive. The disease variable is listed in the Trait Variables box.
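A list-style specification such as ma1-ma120 stands for every numbered variable in the range. The Python sketch below shows one way to expand such a specification and to group the resulting allele columns into markers. It is illustrative only: the function names are hypothetical, and the sketch handles just the simple prefix-and-number form used in this example.

```python
import re


def expand_list_spec(spec):
    """Expand a specification such as 'ma1-ma120' into ['ma1', 'ma2', ..., 'ma120']."""
    match = re.fullmatch(r"([A-Za-z_]\w*?)(\d+)-([A-Za-z_]\w*?)(\d+)", spec.strip())
    if match is None:
        raise ValueError(f"not a numbered range: {spec!r}")
    prefix, start, end_prefix, end = match.groups()
    # SAS variable names are not case-sensitive, so compare the prefixes without case.
    if prefix.lower() != end_prefix.lower() or int(start) > int(end):
        raise ValueError(f"inconsistent range: {spec!r}")
    return [f"{prefix}{number}" for number in range(int(start), int(end) + 1)]


def pair_alleles(names):
    """Group consecutive allele columns into (first, second) pairs, one pair per marker."""
    return [(names[i], names[i + 1]) for i in range(0, len(names) - 1, 2)]


variables = expand_list_spec("ma1-ma120")
markers = pair_alleles(variables)
```

With the 120 allele columns of this example, the pairing yields the 60 markers described in the annotation data set.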

Do not make any changes to either the General or the Annotation tabs.




Examine the Options tab. As discussed for Marker Properties, Alleles is selected for the Format of Marker Variables parameter. All three association tests, the Pearson Chi-squared tests for alleles and genotypes, and the linear trend test, are selected in the Association Tests box.

Do not make any changes to the Options tab.

Click the P-Value Plots tab to bring up the tab shown in Figure 5.21.

Figure 5.21: The P-Value Plots tab

Examine the P-Value Plots tab. This tab specifies parameters for conversion, corrections, and adjustments to the analyses. Refer to the PSMOOTH procedure in the SAS/Genetics User’s Guide for more information on these parameters.

Click Run.

The output window illustrated in Figure 5.22 opens.

Figure 5.22: The output of the Case-Control Association process

Examine the overlay plots in the output window. The y-axis in the two plots displays the negative log p-value for three different tests of association. Peaks indicate locations of significant association, and you can mouse-over or click on them to highlight the rows in the corresponding JMP tables. The two different graphs appear because of the specification of CandGene as the Annotation Group Variable in the Annotation tab.
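To make the plotted quantity concrete, the Python sketch below computes a Pearson chi-square allele test for one biallelic marker and converts the p-value to a negative log. It is a simplified illustration and not the procedure called by the process. It assumes a 2 × 2 table of allele counts with one degree of freedom, and it assumes base 10 for the logarithm. The counts are hypothetical.

```python
import math


def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square test on a 2 x 2 table of allele counts (1 degree of freedom).

    case_counts and control_counts are (count of allele 1, count of allele 2).
    Returns the statistic, its p-value, and -log10 of the p-value.
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value, -math.log10(p_value)


# Hypothetical allele counts: 100 alleles in cases and 100 in controls.
chi2, p_value, neg_log_p = allele_chi_square((30, 70), (50, 50))
```

A larger statistic gives a smaller p-value and therefore a taller peak on the negative log scale.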

Haplotype Analysis

Instead of examining markers individually, it can often be more informative to look at a set of alleles and markers from the same chromosome as a single entity; that is, as a haplotype. Estimates of haplotype frequencies can be used in a variety of ways: to test for multilocus LD, to test for association between a trait and several markers at once, and to infer the parental haplotypes that an individual receives. There are three processes available in JMP Genomics for analyzing haplotypes using genetic marker data. These include Haplotype Estimation, Haplotype Trend Regression, and htSNP Selection, as shown in Figure 5.23.

Figure 5.23: The Haplotype Analysis submenu

When genotype data are collected, the two haplotypes that compose a multilocus genotype are not typically observed. Thus, the alleles passed together from one parent at each of the markers in the set remain unknown. The expectation-maximization (EM) algorithm can be used to estimate these unobserved haplotype frequencies and can be invoked with the Haplotype Estimation process, generally as the first step in your haplotype analysis. You can estimate haplotype frequencies for one particular set of markers, or many sets. To perform estimation for multiple marker sets, define a group variable from your annotation data set, a sliding window spanning a specified number of markers, or both. For each set of markers, you can perform tests for LD and association with a binary trait.

In order to further determine the particular haplotype from a set of markers that may be influencing a trait (binary, quantitative, or survival), use output data sets from the Haplotype Estimation process as input for the Haplotype Trend Regression process. Output data sets can also feed the htSNP Selection process to determine the subset(s) of markers that explain much of the haplotype diversity within a block of strongly associated markers.

The following example uses the Haplotype Trend Regression process to analyze the quantitative trait Qtrt1 in the samplegmdata.sas7bdat data set. Default example settings are available for the Haplotype Estimation and htSNP Selection processes. Run these two processes to see what functionality they offer.
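The idea behind EM estimation of haplotype frequencies can be shown for the smallest possible case: two biallelic markers, where only individuals heterozygous at both markers have an ambiguous phase. The Python sketch below is a teaching illustration under that assumption, not the algorithm as implemented in the Haplotype Estimation process. The haplotype labels, counts, and fixed number of iterations are hypothetical choices.

```python
def em_two_locus(known_counts, n_double_het, iterations=50):
    """Estimate the frequencies of haplotypes AB, Ab, aB, ab by EM.

    known_counts: haplotype counts from individuals whose phase is unambiguous
    n_double_het: number of individuals heterozygous at both markers, whose
                  haplotype pair is either AB + ab or Ab + aB
    """
    total = sum(known_counts.values()) + 2 * n_double_het
    freq = {hap: 0.25 for hap in ("AB", "Ab", "aB", "ab")}
    for _ in range(iterations):
        # E step: probability that a double heterozygote carries AB + ab.
        cis = freq["AB"] * freq["ab"]
        trans = freq["Ab"] * freq["aB"]
        share = cis / (cis + trans) if cis + trans > 0 else 0.5
        expected = {
            "AB": known_counts["AB"] + n_double_het * share,
            "ab": known_counts["ab"] + n_double_het * share,
            "Ab": known_counts["Ab"] + n_double_het * (1.0 - share),
            "aB": known_counts["aB"] + n_double_het * (1.0 - share),
        }
        # M step: update the frequencies from the expected haplotype counts.
        freq = {hap: count / total for hap, count in expected.items()}
    return freq


# Hypothetical data: unambiguous haplotypes are all AB or ab, plus 5 double heterozygotes.
estimate = em_two_locus({"AB": 10, "Ab": 0, "aB": 0, "ab": 10}, n_double_het=5)
```

In this hypothetical data set the unambiguous individuals carry only AB and ab, so the iterations assign the double heterozygotes almost entirely to the AB + ab pair.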

Haplotype Trend Regression

Select Genomics > Haplotype Analysis > Haplotype Trend Regression. The Haplotype Trend Regression dialog shown in Figure 5.24 opens.


Figure 5.24: The Haplotype Trend Regression dialog

Click Load.


Click OK to complete the dialog as shown in Figure 5.25.

Figure 5.25: The completed Haplotype Trend Regression dialog


Click Open to open the samplegmdata_phase.sas7bdat input data set (Figure 5.26).

Figure 5.26: The samplegmdata_phase.sas7bdat file

Note that the data set in this example contains columns from samplegmdata.sas7bdat, shown in Figure 5.2. This is the Phase Assignment data set created by the Haplotype Estimation process. The columns selected as ID variables from the original data set are included in this data set, namely Individual ID, disease, Qtrt1, and Qtrt2. Columns _A_1 through _A_10 contain the alleles at the five markers in the sliding window.

Examine the General tab in the completed dialog (Figure 5.25). All of the columns from the input data set are listed in the Available Variables box. Qtrt1 is selected as the Trait Variable and Qtrt2 is selected as the Covariate. The SAS expression windows=7 is entered in the Where Clause box to perform the haplotype trend regression using the five markers from sliding window 7, which correspond to the first five single nucleotide polymorphisms (SNPs) from candidate gene 2.

When the Sliding Window option is specified for the Haplotype Estimation run that creates the input data set for Haplotype Trend Regression, either a single sliding window can be analyzed using the Where Clause as shown here, or Window must be selected as a By Variable.



Figure 5.27: The Options tab

Examine the Options tab. The Type of Trait is specified as Continuous to allow for a linear regression of the trait variable (Qtrt1, specified in the General tab) on the haplotypes. The Frequency Cutoff for Combining Haplotypes is set to 0.005. Any haplotypes with a frequency below this value are combined into a single group for analysis. The frequencies are provided by the data set specified as the Haplotype Frequency Data Set, also created as an output data set by the Haplotype Estimation process.
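The effect of the frequency cutoff can be sketched in a few lines of Python. The example below pools every haplotype whose estimated frequency is below the cutoff into one group. It is an illustration only: the haplotype labels, the frequencies, and the name of the pooled group are hypothetical.

```python
def combine_rare_haplotypes(frequencies, cutoff=0.005, label="rare"):
    """Pool all haplotypes with frequency below the cutoff into a single group."""
    combined = {}
    rare_total = 0.0
    for haplotype, frequency in frequencies.items():
        if frequency < cutoff:
            rare_total += frequency
        else:
            combined[haplotype] = frequency
    if rare_total > 0:
        combined[label] = rare_total
    return combined


# Hypothetical estimated haplotype frequencies.
estimated = {"A-C": 0.600, "A-T": 0.396, "G-C": 0.003, "G-T": 0.001}
grouped = combine_rare_haplotypes(estimated, cutoff=0.005)
```

Pooling in this way keeps very rare haplotypes from each contributing their own term to the regression.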

Click Run.

The output window shown in Figure 5.28 opens.

Figure 5.28: The output of the Haplotype Trend Regression process

Be sure to scroll down and examine the entire second table. This table lists the F-statistics and associated probabilities for each of the 14 estimated haplotypes. Haplotypes 14 and 1 are revealed as the most significant.

CHAPTER 6

Genetic Marker Family or Pedigree Data

While Chapter 5 considered genetic marker data from unrelated individuals, this chapter describes methods in JMP Genomics appropriate when family or pedigree information is available for the individuals. These methods include the Transmission Disequilibrium test for both binary traits (TDT) and quantitative traits (Quantitative TDT) in the Association Testing submenu, as shown in Figure 6.1, and the three processes grouped in the Model-free Linkage submenu, as shown in Figure 6.15.

Figure 6.1: The Association Testing submenu

Note: Nearly all of the processes discussed in this and the previous chapter call procedures from SAS/Genetics™. Detailed descriptions of these procedures and the computations performed are available in the SAS/Genetics™ 9.1.3 User’s Guide. This reference can be accessed from http://support.sas.com/documentation/index.html or viewed in PDF format from http://support.sas.com/documentation/onlinedoc/91pdf/. You should refer to this guide for details concerning the usage and computational methods of these SAS procedures.

The Sample Data Sets

The analyses described in this chapter use two sample data sets. The first data set is the genetic marker data set, samplegmdata.sas7bdat, considered in the previous chapter and described in Chapter 1. To familiarize yourself with the genetic marker data set, complete the following steps.



Select the samplegmdata.sas7bdat file and click Open.

The file opens as a JMP table, as partially shown in Figure 6.2.



Figure 6.2: The samplegmdata.sas7bdat file

The first four columns describe the family data structure for the 1000 individuals in the samplegmdata.sas7bdat data set. Ped_id is a variable whose values correspond to distinct family units. Ind_id is the individual identifier and is unique within each level of Ped_id. The father and mother columns contain the Ind_id values corresponding to that individual’s father and mother within their specific family. If the individual is a founder in the population (that is, data on that individual’s father and mother are not available), a value of 0 is coded for their father and mother.

See Chapter 5 for further details about the other variables in this data set. Chapter 5 also describes the Marker Properties and Linkage Disequilibrium processes for investigating basic statistics on the markers, and provides an overview of the association testing methods available in JMP Genomics.

The second data set, used for the Model-free Linkage processes, is the affected sib-pair (ASP) data kindly provided by Gonçalo Abecasis (University of Michigan Center for Statistical Genetics). This data set is discussed later in this chapter. Both data sets and associated files are found in the Sample Data folder that came with JMP Genomics.
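If you prepare your own family data, it can help to verify the family structure coding (Ped_id, Ind_id, father, mother, with 0 for founders) before importing it. The following Python sketch shows one way to do this outside JMP Genomics. It is an illustration under stated assumptions: the function name and the small example table are hypothetical, and only the four column names come from the sample data set.

```python
def check_pedigree(rows):
    """Return (founders, problems) for a list of pedigree rows.

    Each row is a dict with the keys Ped_id, Ind_id, father, and mother.
    A founder has 0 coded for both parents. A problem is recorded when a
    nonzero parent ID does not appear as an Ind_id in the same family.
    """
    members = {}
    for row in rows:
        members.setdefault(row["Ped_id"], set()).add(row["Ind_id"])

    founders, problems = [], []
    for row in rows:
        key = (row["Ped_id"], row["Ind_id"])
        if row["father"] == 0 and row["mother"] == 0:
            founders.append(key)
            continue
        for role in ("father", "mother"):
            parent = row[role]
            if parent != 0 and parent not in members[row["Ped_id"]]:
                problems.append((key, role, parent))
    return founders, problems


# Hypothetical data: family 1 is complete, family 2 refers to an unknown mother.
rows = [
    {"Ped_id": 1, "Ind_id": 1, "father": 0, "mother": 0},
    {"Ped_id": 1, "Ind_id": 2, "father": 0, "mother": 0},
    {"Ped_id": 1, "Ind_id": 3, "father": 1, "mother": 2},
    {"Ped_id": 2, "Ind_id": 1, "father": 0, "mother": 0},
    {"Ped_id": 2, "Ind_id": 2, "father": 1, "mother": 9},
]
founders, problems = check_pedigree(rows)
```

Because Ind_id is unique only within a family, the check looks up each parent within the same Ped_id rather than across the whole table.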

Importing Family Data

There are a number of different ways to prepare family genetic marker data for processing with JMP Genomics, depending upon the format of the raw data files. The goal is to create a wide SAS data set as described previously and, optionally, a corresponding annotation SAS data set. See Chapter 3 for more details on generating SAS data sets.

As discussed in Chapter 5, JMP Genomics also offers customized import routines for six different specialized formats (Affymetrix SNP CHP, Arlequin, HapMap, Illumina, NEXUS, and Pedigree). These are found in the Data Set Creation submenu, as shown in Figure 5.4. This example uses the Pedigree process to import family-specific data.

The following steps describe how to use the customized Family import process.

Select Genomics > Data Set Creation > Other Genetics > Pedigree.



Figure 6.3: The Pedigree Input Engine dialog

The main input file is specified in the Input Pedigree File box. This example uses the ped_all_columns.txt file included with JMP Genomics. To view this file, complete the following steps.


Select the ped_all_columns.txt file.

Click Open to open the file shown in Figure 6.4.

Figure 6.4: The ped_all_columns.txt file

Note: This file is formatted as a blank-delimited text file. Each column is separated by a space. The columns, in order, indicate pedigree, individual ID, father’s ID, mother’s ID, sex, disease status, genotypes for five markers, and data for five quantitative traits. In addition to .txt files, the Pedigree process accommodates standard input file formats such as LINKAGE, QTDT, Genehunter, and FBAT.

To choose the ped_all_columns.txt file as the input file, complete the following steps.

Click Load.


Select the settings for PedigreeExample1.

Click OK to complete the Pedigree Input Engine dialog, as shown in Figure 6.5.

Figure 6.5: The completed General tab of the Pedigree Input Engine dialog

Examine the General tab. Note that the ped_all_columns.txt file has been selected as the input file for this example. The destination folder for the output from this process has also been specified.




Examine the Options tab. This tab is where you specify the format of the input data file, the labels and identities of the different variables, and an optional name for the output data set. Note that SPACE is selected in the Column Delimiter field, thus matching the format of the input data set. Also note that specific names are listed, in order, for each of the columns, and that the quantitative variables are identified.

Note: The order of the values listed in the List of Variable Names and in the Quantitative Variables fields must exactly match the order of the columns in the Input Pedigree File.


Click Run.

The location of the output data set generated by this process is listed in a SAS Message dialog, as shown in Figure 6.7.


Click Open to examine the contents and structure of the output data set (partially shown in Figure 6.8).

Figure 6.8: The output data set

The data have been imported into a JMP data table, organized into columns labeled as specified in the dialog. Note that the columns, except those containing quantitative traits, have missing values in place of any 0s that were present in the original text file. This recoding is done automatically for any column not listed in the Quantitative Variables field.
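The recoding rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior, not the import engine itself: the function name, the column names, and the two example records are hypothetical, and None stands in for a missing value.

```python
def import_pedigree_text(text, names, quantitative):
    """Parse blank-delimited pedigree text into a list of records.

    `names` lists the column names in file order. Columns named in
    `quantitative` are read as numbers, so a 0 stays 0. In every other
    column a "0" is recoded to None (missing).
    """
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(names):
            raise ValueError("column count does not match the list of names")
        record = {}
        for name, value in zip(names, fields):
            if name in quantitative:
                record[name] = float(value)
            else:
                record[name] = None if value == "0" else value
        records.append(record)
    return records


# Hypothetical two-line file with one marker (two allele columns) and one trait.
names = ["ped", "id", "father", "mother", "sex", "disease", "m1_a", "m1_b", "qt1"]
text = "1 1 0 0 1 2 1 2 0.0\n1 2 0 0 2 1 0 0 3.5\n"
records = import_pedigree_text(text, names, quantitative={"qt1"})
```

Note that the quantitative value 0.0 in the first record is kept, while the 0 codes for the parents and for the untyped marker in the second record become missing.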

The Transmission Disequilibrium Test (TDT)

The Transmission Disequilibrium Test (TDT) process offers various chi-square tests for binary traits such as disease status for genotyped parent-offspring trios or sibships. Use the following steps to compute TDT statistics for the disease variable in the samplegmdata.sas7bdat data set.


Select Genomics > Association Testing > TDT. The TDT dialog shown in Figure 6.9 opens.

Figure 6.9: The TDT dialog

To load the settings for this example, complete the following steps.

Click Load.

Select the settings for the GeneticMarkerExample.

Click OK to complete the TDT dialog, as shown in Figure 6.10.


Figure 6.10: The General tab of the TDT dialog

Examine the General tab of the completed TDT dialog. Note that the marker (ma1 – ma120) and disease variables, as well as the four family variables (Ped_id, Ind_id, father, and mother), are specified in their required fields. The Filter to Include Observations field is left blank because this example uses the entire data set of 1000 individuals.


Click the Annotation tab to bring up the tab shown in Figure 6.11.


Figure 6.11: The Annotation Tab

As discussed for the examples in Chapter 5, an annotation data set has been selected and required variables have been specified.




As discussed for the examples in Chapter 5, Alleles is selected for the Format of Marker Variables parameter. The TDT, along with the continuity correction option, is selected for the Family Association test. Information about these parameters can be found in the PROC TDT chapter of the SAS/Genetics User’s Guide.
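For orientation, the usual McNemar-type form of the TDT statistic for a biallelic marker can be computed by hand. The Python sketch below assumes that b and c are the numbers of heterozygous parents who transmit allele 1 and allele 2, respectively, to an affected child, and that the continuity correction subtracts 1 from the absolute difference. Whether PROC TDT uses exactly this form is an assumption here; consult the SAS/Genetics User’s Guide for the definitive computations. The counts are made up.

```python
import math


def tdt_statistic(b, c, continuity=False):
    """Return (chi-square, p-value) for transmission counts b and c.

    The statistic is (b - c)^2 / (b + c), compared with a chi-square
    distribution with 1 degree of freedom. With `continuity=True`,
    1 is subtracted from |b - c| before squaring.
    """
    total = b + c
    if total == 0:
        raise ValueError("no informative transmissions")
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)
    chi_square = diff * diff / total
    # Upper tail of a chi-square(1) variable: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Made-up counts: allele 1 transmitted 60 times, allele 2 transmitted 40 times.
chi_square, p_value = tdt_statistic(60, 40)
corrected, corrected_p = tdt_statistic(60, 40, continuity=True)
```

The corrected statistic is always somewhat smaller than the uncorrected one, so the corresponding p-value is larger.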


Click the P-Value Plots tab to bring up the tab shown in Figure 6.13.



Examine the P-Value Plots tab. This tab specifies parameters for conversion, corrections and adjustments to the analyses. Refer to the PSMOOTH procedure in the SAS/Genetics User’s Guide for more information on these parameters.

Do not make any changes to the P-Value Plots tab.

Click Run.

The output window illustrated in Figure 6.14 opens.

Figure 6.14: The output of the TDT process

Examine the overlay plots in the output window. The y-axis in the two plots displays the negative log p-value for the TDTs. Peaks indicate locations of significant association and you can mouse-over or click on them to highlight the rows in the corresponding JMP tables. The two different graphs appear because of the specification of CandGene as the Annotation Group Variable in the Annotation tab. The SAS Output window (not shown) contains detailed tabulated statistics from the tests.
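The negative log scale used on the y-axis makes small p-values appear as tall peaks. The following Python sketch shows the transformation, assuming a base-10 logarithm (the base is an assumption, as are the function names and the example p-values).

```python
import math


def neg_log10(p_values, floor=1e-300):
    """Convert p-values to -log10(p); `floor` guards against log(0)."""
    return [-math.log10(max(p, floor)) for p in p_values]


def peak_index(values):
    """Return the position of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up p-values for four markers.
p_values = [0.5, 0.01, 0.0001, 0.2]
scores = neg_log10(p_values)
peak = peak_index(scores)
```

On this scale, a p-value of 0.01 maps to 2 and a p-value of 0.0001 maps to 4, so the third marker forms the peak.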

Model-Free Linkage Tests on IBD Data

Three methods for performing model-free linkage tests are available in the Model-free Linkage submenu, as shown in Figure 6.15. These methods include the Affected Sib-Pair Tests, Haseman-Elston Regression, and Variance Components processes.


Figure 6.15: The Model-free Linkage submenu

When the data contain sibling pairs where both siblings are affected with the disease (or, more generally, possess the trait of interest), the Affected Sib-Pair Tests process can be used for performing simple chi-square tests for linkage between the trait and the available genetic markers. Both the Haseman-Elston Regression and Variance Components processes are designed for quantitative traits and can accommodate covariates. However, the Haseman-Elston Regression utilizes sib-pairs from the pedigrees sampled, whereas the Variance Components process uses any related pairs when testing for linkage of the trait with a marker.

The three Model-free Linkage processes are not applied to genetic marker data as are the other genetic processes; instead, they analyze data containing information about the probabilities of pairs of individuals sharing alleles that are identical-by-descent (IBD) at the markers of interest. The required input IBD data set must contain one row for each pair of related individuals being analyzed at each marker, with variables z0, z1, and z2 representing the probability of the two individuals in the pair sharing 0, 1, or 2 alleles IBD, respectively. All possible pair-wise comparisons within each family should be made. Variables for the pedigree or family, the two individual IDs, and the marker are also required in this data set. Pairs of individuals should be grouped by marker, then by pedigree or family prior to carrying out these processes.

The Identical-by-Descent (IBD) Data Sets

The example illustrated for the Model-free Linkage processes uses the affected sib-pair (ASP) data provided by Gonçalo Abecasis (University of Michigan Center for Statistical Genetics) and described in Chapter 1. This example comprises three associated data sets:

1) the IBD data set that contains the IBD probabilities for 20 markers in 200 families, with 4 individuals in each family

2) a pedigree data set that lists the family relationships, affected status, and marker genotypes for each of the 800 individuals (4 per family) in the data set

3) a map data set that lists the physical location of each of the markers on human chromosome 24.
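The requirements on the IBD input described earlier in this section (one row per related pair per marker, probabilities z0, z1, and z2, and all pair-wise comparisons within a family) can be checked with a short script before running the linkage processes. The Python sketch below is an illustration only. The function names and the tiny example are hypothetical, and the quantity 0.5 × z1 + z2 is the standard estimate of the proportion of alleles shared IBD, which is not necessarily the quantity the SAS procedures report.

```python
from itertools import combinations


def proportion_ibd(z0, z1, z2, tol=1e-6):
    """Estimated proportion of alleles shared IBD by a pair: z1 / 2 + z2."""
    if abs(z0 + z1 + z2 - 1.0) > tol:
        raise ValueError("z0 + z1 + z2 should equal 1")
    return 0.5 * z1 + z2


def missing_pairs(ibd_rows, family_members):
    """List the (marker, family, id1, id2) pairs absent from the IBD rows.

    `ibd_rows` holds (marker, family, id1, id2) tuples and
    `family_members` maps each family to the list of its individual IDs.
    """
    rows = list(ibd_rows)
    seen = {(marker, fam, frozenset((a, b))) for marker, fam, a, b in rows}
    markers = sorted({marker for marker, _, _, _ in rows})
    missing = []
    for marker in markers:
        for fam, ids in family_members.items():
            for a, b in combinations(ids, 2):
                if (marker, fam, frozenset((a, b))) not in seen:
                    missing.append((marker, fam, a, b))
    return missing


# Hypothetical family of four individuals: six pairs are expected per marker,
# but the pair (3, 4) has been left out on purpose.
family_members = {"F1": [1, 2, 3, 4]}
ibd_rows = [("M1", "F1", a, b)
            for a, b in combinations([1, 2, 3, 4], 2) if (a, b) != (3, 4)]
gaps = missing_pairs(ibd_rows, family_members)
sib_pi = proportion_ibd(0.25, 0.50, 0.25)
```

With four individuals per family, as in the ASP example, six pairs per family are expected at every marker.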

Note: If you are curious about chromosome 24, recall that these are fictitious data.

To examine the IBD data set, complete the following steps.

Select File > Open to open the Open Data File dialog.


Select the asp_ibd.sas7bdat file.

Click Open to open the file partially shown in Figure 6.16.


Figure 6.16: The IBD data file

Note: All pair-wise comparisons within each family are listed. MERLIN was used to estimate identical-by-descent (IBD) allele-sharing probabilities at these markers for all pairs of related individuals.

To examine the IBD pedigree data set, complete the following steps.



Select the asp_ped.sas7bdat file.

Click Open to open the file partially shown in Figure 6.17.

Figure 6.17: The pedigree data file

Note: The alleles for each of the 20 markers are listed in successive pairs of marker columns, such that the alleles for the first marker are listed in columns a1 and a2, the alleles for the second marker are listed in columns a3 and a4, and so on. The 400 offspring are also measured for a quantitative trait of interest.

To examine the IBD map data set, complete the following steps.




Select the asp_map.sas7bdat file.

Click Open to open the file shown in Figure 6.18.

Figure 6.18: The map data file

Note: The location of each marker is listed.

The following example uses the Variance Components process to test for linkage between the 20 markers and the quantitative trait for the families in the ASP IBD, pedigree, and map data sets.

Variance Components

Select Genomics > Model-free Linkage > Variance Components.

The Variance Components dialog shown in Figure 6.19 opens.


Figure 6.19: The Variance Components dialog

Click Load.

Select the default settings for the Merlin_asp example.

Click OK to complete the dialog, as shown in Figure 6.20.


Figure 6.20: The completed General tab of the Variance Components dialog

Examine the General tab of the completed dialog. The data set asp_ibd.sas7bdat is specified in the IBD Data Set field, and asp_ped.sas7bdat is specified in the Pedigree Data Set field. The column headings from the pedigree data set are listed as variables in the Available Variables box. The QTrait variable from the pedigree data set is the quantitative trait of interest, and the Family, ID, Parent1, and Parent2 variables specify the family structure. The Filter to Include Observations field is left blank because we are using the entire data set of 800 individuals.


Click the Annotation tab to bring up the tab shown in Figure 6.21.

Figure 6.21: The Annotation Tab

Examine the Annotation tab. The asp_map.sas7bdat is specified as the annotation data set. The column headings from the file (illustrated in Figure 6.18) are listed in the Available Variables box. The variables marker and location are specified in the Annotation Label Variable and Annotation Location Variable boxes, respectively.




Figure 6.22: The Options Tab

Examine the Options tab. Likelihood Ratio is selected as the test statistic.


Click on the P-Value Plots tab to bring up the tab shown in Figure 6.23.


This tab specifies parameters for conversion, corrections and adjustments to the analyses. Refer to the PSMOOTH procedure in the SAS/Genetics User’s Guide for more information on these parameters.

Click Run.

The output window shown in Figure 6.24 opens.

Figure 6.24: The output of the Variance Components process

Examine the output window. The y-axis of the plot, labeled ProbChi, displays the negative log p-value for the likelihood ratio test at each locus. The peak occurs at the fourth marker. You can mouse-over or click on points to highlight the rows in the corresponding JMP tables. The SAS Output window (not shown) contains details for the SAS PROC MIXED runs used to generate the tests.


Chapter 7

Microarray Case Study I: The Drosophila Aging Experiment

In this chapter we use a small subset of the Drosophila aging experiment data from Jin et al. (2001) to work through several analytical processes as a case study. The experiment consisted of 24 two-color cDNA microarrays, six for each experimental combination of two lines (Oregon and Samarkand), two sexes (Female and Male), and two ages (1 week and 6 weeks). The Cy3 and Cy5 dyes were flipped for two of the six replicates for each genotype and sex combination. The design is a split-plot design, with Age and Dye as subplot factors, and Line and Sex as whole-plot factors. A total of 4256 clones were spotted on the arrays, but for this example, we use a subset containing 100 randomly selected genes.

Sample Workflow for Analysis of Microarray Data

The workflow∗ for this example is as follows:

1. Generation of the Data Sets
   i. Experimental Design File Builder
   ii. Data Set Creation
2. Evaluation of the Data Quality
   i. Raw Data Distribution Analysis
   ii. Ratio Analysis (Raw Data)
   iii. Ratio Analysis (Loess Normalization)
3. Comparison of Different Methods for Data Normalization
   i. Data Standardization (Median) & Standardized Distribution Analysis
   ii. Loess Normalization Across Arrays & Distribution Analysis (Loess Normalized Data)
4. Evaluation of Normalized Data Quality
   i. Correlation and Principal Components
   ii. Correlation and Grouped Scatter Plots
5. Primary Data Analysis for Determining Significant Differences in Gene Expression
   i. Analysis of Variance
   ii. Mixed Model Analysis
6. Further Analysis
   i. Transpose Tall and Wide
   ii. K-Means Clustering
   iii. Distance Matrix
7. Predictive Modeling

While this is a fairly standard sequence of processes to run, the order of the processes can change to suit any experimental objectives.

∗ Outline topics correspond to subsections of this chapter.


Generation of the Data Sets

As described in Chapter 3, JMP Genomics requires the generation of specific data sets. The first step in generating these data sets is building the Experimental Design File.

Experimental Design File Builder

Many of the processes in JMP Genomics require an Experimental Design Data Set (EDDS), which contains the corresponding experimental factors for each channel in a multi-channel platform or for each array in a single-channel platform. In order to bulk-load a set of raw data files, you need to prepare a corresponding Experimental Design File (EDF) that contains the file names and all experimental factors. Refer to Chapter 3 for detailed instructions on how to create an EDF.

Here, we use the Experimental Design File Builder process to generate an EDF for the trimmed Drosophila Aging Data. The raw data consist of 24 .DAT files located in the Sample Data folder. To build an EDF using these files, complete the following steps.

Select Genomics > Experimental Design > Experimental Design File Builder.

The Experimental Design File Builder dialog appears, as shown in Figure 7.1.

Figure 7.1: The Experimental Design File Builder dialog

Click Choose to select the folder containing the raw data files.


Open the Scanalyze Drosophila folder.


Click Select (circled in Figure 7.2) to select the folder.

Figure 7.2: Selecting the folder that contains the raw data files

When selecting folders in JMP Genomics, navigate into the folder containing the raw data files and select it.

Because the raw data files are in the .DAT file format, filter out all file types but .DAT files.

Select .dat from the File Filter Expression drop-down menu.

The File Filter Expression box appears as shown in Figure 7.3.

Figure 7.3: Selecting the file filter

Recall that this is a two color array using Cy3 and Cy5. Because the probes were labeled with two dyes,

Enter 2 in the Number of Channels in Each File box, as shown in Figure 7.4.

Figure 7.4: Specifying two channels

Type Line, Sex, and Age in the New Variable Names for Experimental Design box, as shown in Figure 7.5.

Figure 7.5: Entering new variable names

Note: These variable names may be entered on the same line, but must be separated by a space.


Specifying a name for the output file is optional and you may specify any name you like here. For this example, DrosophilaAging_Exp.txt is the name used for the output file.

Type DrosophilaAging_Exp.txt in the Output File Name box, as shown in Figure 7.6.

Figure 7.6: Specifying the output file

To specify the output folder, complete the following steps.

Click on Choose.

Navigate to the ProcessResults folder.


The Experimental Design File Builder dialog should appear like the one shown in Figure 7.7.

Figure 7.7: The completed General tab of the EDF Builder dialog


The EDF is shown in Figure 7.8.


Figure 7.8: The EDF

The EDF contains several empty columns. You can type the corresponding information into them and use the Create Array Index, Create ColumnName, and Check File Names commands located under the Data Set Creation submenu to add or modify certain columns. Alternatively, since the raw file names contain sufficient information to fill the empty columns, you can write SAS code to create the values of Line, Sex, Age, and Intensity.

Click the Experimental Design File Builder dialog to make it the active window.

Click the Options tab.

Type the following SAS commands into the SAS Code to Create Columns box.

/* Pick the part of the file name that describes this channel */
Name = scan(File,Channel);
/* Decode line, sex, and age from the first three characters */
if substr(Name,1,1) = "O" then Line = "ORE"; else Line = "SAM";
if substr(Name,2,1) = "M" then Sex = "MAL"; else Sex = "FEM";
if substr(Name,3,1) = "1" then Age = "WK1"; else Age = "WK6";
/* Channel 1 is Cy3 and channel 2 is Cy5 */
if Channel = 1 then do; Dye = "Cy3"; Intensity = "Ch1i"; end;
else do; Dye = "Cy5"; Intensity = "Ch2i"; end;
/* Pad single-digit array numbers with a leading zero */
if Array < 10 then ArrayString = "0" || trim(left(Array));
else ArrayString = trim(left(Array));
ColumnName = trim(Line) || "_" || trim(Sex) || "_" || trim(Age) || "_" || trim(Dye) || "_" || ArrayString;
drop Name Channel ArrayString;
rename Dye = Channel;


Note: The first part of the File variable (before the first “.”) and the second part (between the first and the second “.”) of the raw file name contain the experimental information associated with the Cy3 and Cy5 channels, respectively. These commands may be modified to fit most experimental conditions. Refer to the SAS 9.1.3 User’s Guide (http://support.sas.com/onlinedoc/913/docMainpage.jsp) for additional information.
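To see what these SAS commands produce for a single row of the EDF, the same decoding logic can be mirrored in Python. The sketch below is for illustration only: the file name OM1.SF6.dat is a hypothetical example of the naming pattern described in the note, and the function splits the name on periods only, which is a simplification of the SAS SCAN function.

```python
def describe_channel(file_name, channel, array):
    """Mirror the SAS commands for one file, channel (1 or 2), and array number."""
    # Part 1 of the name describes the Cy3 channel, part 2 the Cy5 channel.
    name = file_name.split(".")[channel - 1]
    line = "ORE" if name[0] == "O" else "SAM"
    sex = "MAL" if name[1] == "M" else "FEM"
    age = "WK1" if name[2] == "1" else "WK6"
    if channel == 1:
        dye, intensity = "Cy3", "Ch1i"
    else:
        dye, intensity = "Cy5", "Ch2i"
    # Single-digit array numbers get a leading zero, as in the SAS code.
    column_name = f"{line}_{sex}_{age}_{dye}_{array:02d}"
    return {"Line": line, "Sex": sex, "Age": age, "Channel": dye,
            "Intensity": intensity, "ColumnName": column_name}


# Hypothetical file name following the pattern described in the note.
cy3 = describe_channel("OM1.SF6.dat", channel=1, array=3)
cy5 = describe_channel("OM1.SF6.dat", channel=2, array=3)
```

For this hypothetical file, the Cy3 channel is decoded as an Oregon male at 1 week and the Cy5 channel as a Samarkand female at 6 weeks.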

Click Run to generate the modified EDF. The modified EDF is partially shown in Figure 7.9.

Figure 7.9: The modified EDF

The EDF is automatically saved as a text file in the output folder you specified in the Experimental Design File Builder dialog.

Data Set Creation

To use a device-specific import engine to generate a SAS data set and an EDDS from the raw data files for further analysis in JMP Genomics, complete the following steps.

Select Genomics > Import > Other Expression > ScanAlyze. The ScanAlyze Import Engine dialog opens, as shown in Figure 7.10.



Figure 7.10: The ScanAlyze Import Engine dialog

Make sure the General tab is selected.

To choose the Experimental Design File you created in the previous section, complete the following steps.

Click Choose.

Navigate into the ProcessResults folder.

Select Text Import Files (*.TXT; *.CSV; *.DAT) from the File of type drop-down menu.

Select the DrosophilaAging_Exp.txt file and click Open.

To choose the folder containing the raw data files, complete the following steps.

Click Choose.


Open the Scanalyze Drosophila folder.

Click OK to select the folder.

The first row of one of the raw .DAT files lists the column names and the primary numerical data does not start until the 9th row of the file, as shown in the partial view of one of the raw data files, illustrated in Figure 7.11.


Figure 7.11: A portion of one of the raw data files from the Drosophila Aging experiment


The value of 9 is specified as the default setting for the Data Start Row, as shown in Figure 7.12, because of the structure of the raw data files generated by ScanAlyze. For other ScanAlyze experiments, the setting in the Data Start Row box may need to be changed.

Figure 7.12: Specifying the data start row

Do not change the Data Start Row default setting.


Click Choose.



The General tab of the dialog should appear like the one shown in Figure 7.13.

Figure 7.13: The completed General tab

Select the Options tab.

The Options tab appears as shown in Figure 7.14.



There are three output data sets generated by the ScanAlyze Input Engine:

1. Output Experimental Design Data Set
2. Output Data Set
3. Spot Coordinates Output Data Set

Specify drosophilaaging_exp as the name of the output experimental design data set.

Specify drosophilaaging as the name of the output data set.

The Perform Log2 Transform checkbox provides an option to apply a logarithm base 2 transformation to the intensities in the output data.
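As a minimal illustration of what a base 2 logarithm does to intensity values, consider the following Python sketch. The function name and the example values are made up, and treating non-positive or missing intensities as missing is an assumption of this sketch, not a documented behavior of the import engine.

```python
import math


def log2_intensities(values):
    """Apply a log base 2 transform; non-positive or missing values become None."""
    return [math.log2(v) if v is not None and v > 0 else None for v in values]


# Made-up raw intensities.
raw = [1024, 1, 0, None]
transformed = log2_intensities(raw)
```

On the log2 scale, each unit corresponds to a doubling of the raw intensity, which is why differences of log2 intensities are read as log ratios.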

Make sure that the Perform Log2 Transform box is checked.

The third data set, Spot Coordinates Output Data Set, specifies location data for the individual spots on the microarray. This data set is not required for the analyses described in this chapter.

Do not specify a spot coordinates output data set.

Number of Rows to Scan specifies the number of rows to be scanned in order to determine the attributes of the variables in the output SAS data set. The default value is set to 100.

Make sure the default value is specified.

The Options tab of the dialog should appear like the one shown in Figure 7.15.

Figure 7.15: The completed Options tab

Click Run to generate the data sets.

As discussed in Chapter 1, JMP Genomics dialogs generate and run a SAS program each time you click Run. Depending upon the size of your data sets and the capacity of your computer, some processes can take several minutes or, for very large and complex runs, several hours. While a program is running, the message SAS Connected is displayed in the JMP status bar located in the lower left corner of your JMP window (see Figure 1.10). The Windows Task Manager shows a process named sas.exe running, and you can track its CPU and I/O activity. You can also monitor the SAS temporary working directory and the Output Folder for results as they are created.

The SAS data sets generated by this process are listed in a SAS Message dialog that is displayed in a new window (shown in Figure 7.16).


The dialog lists the EDDS and the primary data set.

Click Open for each of the data sets to examine their contents and structures.

The output EDDS is partially illustrated in Figure 7.17.

Figure 7.17: The drosophilaaging_exp EDDS

Figure 7.18 shows a partial listing of the output data.


Figure 7.18: The drosophilaaging data set

Note: The output data set is formatted in the tall SAS data set form, required for subsequent analyses.

Evaluation of Data Quality

Numerous factors can affect the quality of the data generated in any microarray experiment. These factors may include experimental errors in labeling, gene-specific differences, minor slide defects, differences in hybridization conditions, variability in printing quality, and so forth. Because these factors can interfere with interpretations, the first step in any analysis of microarray data should be to assess the quality of the raw data. Performing quality control (QC) at the beginning of an analysis can save a great deal of time downstream and leads to more reliable results.

Distribution Analysis for Raw Data

For the Drosophila example, we start with a simple distribution analysis to get a feel for overall intensity characteristics for the spots on the arrays.

Select Genomics > Quality Control > Distribution Analysis, as shown in Figure 7.19.


Figure 7.19: Selecting Distribution Analysis

The Data Distribution dialog opens, as shown in Figure 7.20.

Figure 7.20: The Data Distribution dialog

To choose the drosophilaaging.sas7bdat input data set created previously, complete the following steps.

Click Choose.



Select the drosophilaaging.sas7bdat file and click Open.

Note that the file path and all the column labels from the input data set are listed in the Input SAS Data Set field and the Available Variables field, respectively, as shown in Figure 7.21.

Figure 7.21

Select from the available variables those for which you wish to view the distributions. Leaving the Variables for which to Display Distributions field blank displays distributions for all the available variables.

Leave the Variables for which to Display Distributions field blank.

Leave the ID Variables field blank.

Leave the List-Style Specification field blank.

To specify the Output Folder, complete the following steps.

Click Choose.





Figure 7.22: The General tab of the completed Data Distribution dialog

The Experimental Design tab allows you to specify the experimental design data set (EDDS) and specific variables used to modify the analysis.

Click Experimental Design.

To choose the Experimental Design Data Set, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

Note that the file path and all the column labels from the experimental design data set are listed in the Input SAS Data Set field and the Available Variables field, respectively, as shown in Figure 7.23.

Figure 7.23

Leave the Variables Defining Groups, Color Variables and Label Variable fields blank.

The Options tab allows you to specify how the results of this process are displayed.

Click Options to view the default settings.



Click Run to generate the distributions.

Several windows open.

1. The drosophilaaging.sas7bdat data table for creating distribution details

2. A drosophilaaging_stack data set for creating box plots

3. A drosophilaaging_densities data set for creating the Overlay Kernel Density Estimates

4. A Box Plots summary window (Figure 7.24) that shows the distributions and outliers for all the variables in the input data set

Figure 7.24: The Box Plots summary window

5. A Distribution Details window (partially shown in Figure 7.25) that shows histograms, box plots, quantiles, and statistical moments for each row of the experimental design. Note that each row refers to an individual Cy3 or Cy5 channel in this case.


Figure 7.25: The Distribution Details window

6. A Parallel Plot window (Figure 7.26) that shows overlaid kernel density estimate curves

Figure 7.26: The Parallel Plot window

The overlay plot shows the raw univariate distributions of all 48 channels from the 24 arrays. Visually, the estimated distributions vary considerably among the 48 channels. This inherent variability among arrays and dyes indicates that normalization across arrays and channels is essential for effective analysis of these data.

Ratio Analysis and Checking for Dye Effects

Dye effects are often significant for multi-channel microarray data. To investigate the dye effects, you should inspect plots of log ratios versus average (or sum) log intensities of the two channels for each array. Such plots are known as MA plots.
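The two coordinates of an MA plot are simple to compute from the intensities of the two channels of one array. The Python sketch below uses the average log intensity for A. The function name and the example intensities are made up, and the choice of Cy5 over Cy3 as the direction of the ratio is an assumption of this sketch.

```python
import math


def ma_values(cy5, cy3):
    """Return lists (M, A) for paired raw intensities from one array.

    M is the log2 ratio, log2(Cy5) - log2(Cy3).
    A is the average log2 intensity, (log2(Cy5) + log2(Cy3)) / 2.
    """
    m_values, a_values = [], []
    for r, g in zip(cy5, cy3):
        log_r, log_g = math.log2(r), math.log2(g)
        m_values.append(log_r - log_g)
        a_values.append((log_r + log_g) / 2)
    return m_values, a_values


# Made-up intensities for three spots on one array.
m_values, a_values = ma_values([8.0, 4.0, 16.0], [2.0, 4.0, 64.0])
```

A spot with equal intensity in both channels has M equal to 0, so in the absence of a dye effect the points should scatter around the zero horizontal line.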


Select Genomics > Normalization > Ratio Analysis, as shown in Figure 7.27.

Figure 7.27: Selecting the Ratio Analysis process

The Ratio Analysis dialog opens, as shown in Figure 7.28.

Figure 7.28: The Ratio Analysis dialog

Examine the General tab. This example uses the same EDDS, input data set, and output path as used in the Distribution Analysis done previously.

To choose the drosophilaaging.sas7bdat input data set, complete the following steps.


Click Choose.


Select the drosophilaaging.sas7bdat file and click Open.

Note that the file path and all of the column labels from the input data set are listed in the Input SAS Data Set field and the Available Variables field, respectively, as shown in Figure 7.29.

Figure 7.29

Select from the available variables those for which you wish to view the distributions.

Select Spot.

Click to add Spot to the Feature Variable box, as shown in Figure 7.30.

Figure 7.30: Selecting the Feature Variable from the Input Data Set

The data is presented as hybridization intensity.

Make sure Intensity is selected as the input data type.

To choose the Experimental Design Data Set, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

To specify the output folder, complete the following steps.

Click Choose.





Figure 7.31: The completed General tab of the Ratio Analysis dialog

Click Analysis to open the tab shown in Figure 7.32.

Figure 7.32: The Analysis tab of the Ratio Analysis dialog

Two parameters, the Variable to Define Ratio and the Value of Variable above to Be Used as Denominator, provide options to construct ratios for two channels within a single array.

Leave both parameters unselected.

De-select Perform Loess Normalization.

By leaving both parameters blank and by not performing the Loess Normalization, the Ratio Analysis process creates MA plots with the original raw data.

Click the Options tab to open the tab shown in Figure 7.33.


Figure 7.33: The Options tab of the Ratio Analysis dialog

Four parameters on the Options tab let you specify the output data set and file names. Specify the names or leave them blank to use default names.

Do not specify names for any of the output files.

Click Run to carry out the ratio analysis.

A window with MA plots appears (shown in Figure 7.34).

Figure 7.34: MA plots of raw data

Figure 7.34 shows MA plots for arrays 1 and 2. The red curve in each plot is a smoothing spline applied on the data of each array. A large discrepancy between the spline and the zero horizontal line indicates a significant dye effect within that array.


Scroll up and down on the MA Plots window to view MA plots for other arrays.

All of the MA plots show significant deviation from the zero horizontal line, indicating significant dye effects and necessitating data normalization before further analysis.

Loess Normalization within Arrays

Click the Ratio Analysis dialog to reactivate this window.

Click Analysis.

Click the checkbox to select the Perform Loess Normalization option.

Click Options.

Type drosophilaaging_loess1 in the Output Data Set field.

Click Run to rerun the Ratio Analysis process.

The MA plots that are generated by this process, illustrated in Figure 7.35, are now constructed from data that have been loess normalized within each array (Dudoit, Yang et al. 2002).

Figure 7.35: MA plots of Loess normalized data.

Compare the plots illustrated in Figure 7.35 with those in Figure 7.34. After within-array Loess normalization the smoothing spline becomes much closer to the zero horizontal line in each MA plot.
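The idea behind within-array loess normalization can be sketched as follows: fit a smooth trend of M against A, then subtract it. The code below is a minimal, simplified local linear smoother with tricube weights, written for illustration only. The function names, the `span` parameter, and its default are assumptions; this is not the algorithm or the settings used by the Ratio Analysis process.

```python
def loess_fit(x, y, span=0.5):
    """Locally weighted linear fit of y on x, evaluated at each x value."""
    n = len(x)
    k = max(2, int(round(span * n)))  # number of neighbours in each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        # Tricube weights; points at or beyond the bandwidth get weight zero.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(m_vals, a_vals, span=0.5):
    """Subtract the fitted intensity-dependent trend from the M values."""
    trend = loess_fit(a_vals, m_vals, span)
    return [m - t for m, t in zip(m_vals, trend)]

# Hypothetical array with a purely intensity-dependent dye effect.
a_vals = [6.0 + 0.5 * i for i in range(20)]
m_vals = [0.3 * a - 2.0 for a in a_vals]
normalized = loess_normalize(m_vals, a_vals)
```

In this artificial case the dye effect is removed completely, so the normalized M values lie on the zero line; real data keep their gene-level scatter around it.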


Note: The within-array Loess normalization performed by the Ratio Analysis process is different from the across-array normalization performed by the Loess Normalization process, which is described later in this chapter.

Comparison of Different Methods for Data Normalization

JMP Genomics provides several methods for normalizing your data set. Deciding which process to use is best done on a case-by-case basis. This example considers two of these methods: median standardization and Loess normalization. Both procedures use the within-array loess normalized data set, drosophilaaging_loess1.sas7bdat, created in the preceding section, as the input data set.

Median Standardization of Data across Arrays

The first example method for normalizing the data across arrays is median standardization.

Select Genomics > Normalization > Data Standardize, as shown in Figure 7.36.

Figure 7.36: Selecting the Data Standardize process

The Data Standardize dialog, shown in Figure 7.37, opens.


Figure 7.37: The Data Standardize dialog

Make sure that the General tab is selected.

To choose the Loess normalized data set generated previously as the input data set, complete the following steps.

Click Choose.


Select the drosophilaaging_loess1.sas7bdat file and click Open.

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

To specify the method for standardization, complete the following step.

Click the downward arrow in the Standardization Method box and select MEDIAN.

This type of standardization centers the median of each channel to zero.

To specify the Output Folder, complete the following steps.


Click Choose.


Open the ProcessResults folder and click Select to select this folder.

The General tab of the dialog should appear like the one shown in Figure 7.38.

Figure 7.38: The completed General tab of the Data Standardize dialog

Click Run to standardize the data.

The standardized SAS data set, drosophilaaging_loess1_med.sas7bdat, generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 7.39).
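The effect of the MEDIAN method, centering the median of each channel to zero, can be sketched as below. This is an illustrative example only; the function name `median_center` and the channel names are hypothetical, and the Data Standardize process may handle details such as missing values differently.

```python
from statistics import median

def median_center(channels):
    """Subtract each channel's median so that every channel has median zero."""
    centered = {}
    for name, values in channels.items():
        med = median(values)
        centered[name] = [v - med for v in values]
    return centered

# Hypothetical log-scale measurements for two channels of one array.
channels = {
    "array1_ch1": [7.0, 8.0, 9.5],
    "array1_ch2": [6.0, 6.5, 9.0],
}
centered = median_center(channels)
```

Only the location of each channel changes; the spread and shape of each distribution are left as they were.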


Rerun the Distribution Analysis process, using the median adjusted data set we just generated as the Input Data Set.

Select Genomics > Quality Control > Distribution Analysis.

Specify drosophilaaging_loess1_med.sas7bdat and drosophilaaging_exp.sas7bdat as the input data sets and EDDS, respectively.



The Data Distribution dialog should appear as shown in Figure 7.40.

Figure 7.40: The completed General (top) and Experimental Design (bottom) tabs of the Data Distribution dialog


Click Run to generate the distributions.

As before, several results windows open. Compare the overlayed kernel density plot for the normalized and standardized data set, shown in Figure 7.41, with the overlayed kernel density plot for the raw data set, shown in Figure 7.26. Note that the marginal univariate distributions of each channel are now much more consistent than before.

Figure 7.41: The overlayed kernel density plot for the normalized, standardized data set


Loess Normalization Across Arrays

The alternative method for standardizing the data across arrays is Loess normalization.

Select Genomics > Normalization > Loess Model Normalization, as shown in Figure 7.42.

Figure 7.42: Selecting the Loess Normalization process

The Loess Model Normalization dialog opens.


To choose the Loess normalized data set, generated previously, as the input data set, complete the following steps.

Click Choose.


Select the drosophilaaging_loess1.sas7bdat file and click Open.

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

To specify the Output Folder, complete the following steps.

Click Choose.





Figure 7.43: The completed General tab of the Loess Model Normalization dialog

Click Options.

Type drosophilaaging_loess1_loess2 in the Output Data Set field.

Make no other changes to the Options tab.

Note that without specifying a Baseline variable, the process automatically uses the mean across all channels and all arrays as the common baseline.

Click Run to generate the Loess normalized data set.

A Loess Normalization results window appears with scatter plots before and after normalization on the left and right panels, respectively, as illustrated in Figure 7.44.


Figure 7.44: Scatterplots of individual array data before (left) and after (right) normalization

All the scatter plots have a common baseline as the x-coordinate. The y-coordinates in the left graphs are computed as the within-array normalized data minus the corresponding baseline, whereas the y-coordinates on the right are computed as the across-array normalized data minus the corresponding baseline. The red horizontal line, seen in all four plots, is a smoothing spline curve fit to the data nonparametrically.

Rerun the Distribution Analysis process, using the Loess-normalized data set we just generated as the Input Data Set. This data set and the EDDS are found in the ProcessResults folder.

Select Genomics > Quality Control > Distribution Analysis.

Specify drosophilaaging_loess1_loess2.sas7bdat and drosophilaaging_exp.sas7bdat as the input data sets and EDDS, respectively.



Change the number of grid points from 100 to 40.

Reducing the number of grid points smooths out the resulting curves, but does not otherwise change the distributions.
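The role of the grid in a kernel density plot can be illustrated with a minimal sketch. It assumes a Gaussian kernel and a normal-reference bandwidth rule; both are assumptions, because this guide does not state which kernel or bandwidth the Distribution Analysis process uses. The function name `kde_on_grid` and its parameters are hypothetical.

```python
import math

def kde_on_grid(values, n_grid=40, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at n_grid evenly spaced points."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if bandwidth is None:
        # Normal-reference rule of thumb (an assumption, not the product default).
        bandwidth = 1.06 * sd * n ** (-1 / 5)
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    step = (hi - lo) / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]
    return grid, density

# Hypothetical log-intensity values for one channel.
values = [5.1, 5.8, 6.0, 6.3, 6.9, 7.4, 7.7, 8.2, 8.8, 9.5]
grid, density = kde_on_grid(values, n_grid=40)
```

In this sketch the grid only determines where the estimated density is evaluated and drawn; the underlying estimate is the same for 40 or 100 grid points.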


As before, several results windows open. The overlayed kernel density plot for the within-array and across-array Loess-normalized data set is shown in Figure 7.45.


Figure 7.45: The overlayed kernel density plot for the within-array, across-array Loess-normalized data set

Compare the overlayed kernel density plot for the within-array and across-array Loess-normalized data set, shown in Figure 7.45, with the previous overlayed kernel density plots, shown in Figures 7.26 and 7.41. Note that the curves show an even greater consistency than seen previously. The drosophilaaging_loess1_loess2.sas7bdat data set is used in subsequent analyses.

Evaluation of Normalized Data Quality

Correlation and Principal Components

Select Genomics > Quality Control > Correlation and Principal Components, as shown in Figure 7.46.

Figure 7.46: Selecting the Correlation and Principal Components process


The Correlation and Principal Components dialog appears, as shown in Figure 7.47.

Figure 7.47: The Correlation and Principal Components dialog



Click Choose.


Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

To specify the Output Folder, complete the following steps.

Click Choose.




The General tab of the dialog should appear like the one shown in Figure 7.48.

Figure 7.48: The completed Correlation and Principal Components dialog

Make no changes to the Analysis, Variance Components or Options tabs.

Click Run.

Running this process produces several windows. These windows are linked together, so selecting a point on one graph or table highlights the corresponding point on all of the graphs or tables. Refer to the JMP User’s Guide for more details on this feature. The 3-D principal components scatterplot matrix is shown in Figure 7.49.


Figure 7.49: The 3-D principal components scatterplot matrix

Examine the scatterplots shown in Figure 7.49. The points aggregate into two groups, the identity of which is not yet known. To investigate which experimental factor is driving the segregation of the data into the two groups, change the colors of the points using the Rows > Color or Mark by Column command, under the main JMP menu, by completing the following steps.

Make sure the principal components scatterplot matrix window is active.

Select Rows > Color or Mark by Column, as illustrated in Figure 7.50.

Figure 7.50: Selecting the Rows > Color or Mark by Column process

The JMP: Color or Mark by Column dialog opens, as shown in Figure 7.51.


Figure 7.51: The JMP: Color or Mark by Column dialog

Select one of the columns by which to set the color.

Because we might expect the points to cluster by arrays or one of the treatments, select one of the columns describing the experimental factors (Line, Sex, Age, and Channel). Coloring the points by sex yields the principal components scatterplot matrix shown in Figure 7.52, indicating a near-perfect correlation between sex and the clustered points.

Figure 7.52: The principal components scatterplot matrix, colored by sex

A second plot produced by this procedure is the Correlation Heat Map, shown in Figure 7.53.


Figure 7.53: The correlation heat map

There are two large blocks apparent in the correlation heat map, corresponding to the same two groups in the principal components display. By studying the labels on the left hand side, it is apparent that the females are all clustered at the top, and the males are all clustered at the bottom. This clustering phenomenon has been made even more obvious because the variables have been differentially colored by sex. Even though the primary focus of this experiment was aging, the initial results show that sex-to-sex differences are much larger overall.
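The block pattern in a correlation heat map comes from pairwise correlations between channels. The sketch below computes a Pearson correlation matrix for a small, made-up data set in which two "female" and two "male" channels behave differently. The column names and values are hypothetical, and the guide does not state which correlation statistic the process uses, so Pearson correlation is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical expression values for five genes on four channels.
columns = {
    "F_1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "F_2": [1.1, 2.1, 2.9, 4.2, 4.8],
    "M_1": [5.0, 3.0, 4.0, 1.0, 2.0],
    "M_2": [4.8, 3.2, 4.1, 1.2, 1.9],
}
corr = correlation_matrix(columns)
```

Correlations within each group are high and correlations between groups are low, which is the pattern that appears as two large blocks once the channels are ordered by group.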

Correlation and Grouped Scatterplots

The Correlation and Grouped ScatterPlots process computes correlations and scatterplot matrices for expression measurements across groups of arrays. This process also merges annotations for each gene with the measurements to quickly provide information on genes of interest.

Select Genomics > Quality Control > Correlation and Grouped ScatterPlots, as shown in Figure 7.54.


Figure 7.54: Selecting the Correlation and Grouped ScatterPlots process

The Correlation and Group Scatterplots dialog opens, as shown in Figure 7.55.

Figure 7.55: The Correlation and Group Scatterplots dialog




Click Choose.


Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.

A list of the ColumnLabels from the input data set appears in the Available Variables field. This list is used to choose the variable(s) by which to merge the annotation data. Since we are interested in coordinating expression data with information about the gene represented in each spot on the arrays,

Select Spot from the list of available variables.

Click to add Spot to the Variable By Which to Merge Annotation Data box, as shown in Figure 7.56.

Figure 7.56: Selecting the Variable By Which to Merge Annotation Data from the Input Data Set

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

A list of the ColumnLabels from the EDDS appears in the Available Variables field. This list is used to choose the variable(s) to plot against each other. Specifying Line, Sex, and Age lets you check for repeatability within each of the treatment groups. Since we are interested in investigating the effects of these experimental conditions on the expression data,

Select Line, Sex, and Age from the list of available variables.

Click to add Line, Sex, and Age to the Variables Defining Groups box, as shown in Figure 7.57.

Figure 7.57: Selecting the Variables Defining Groups from the EDDS

To specify the Output Folder, complete the following steps.

Click Choose.





Figure 7.58: The completed General tab of the Correlation and Group Scatterplots dialog

Click on the Annotation 1 tab to open the tab shown in Figure 7.59.

Figure 7.59: The Annotation tab of the Correlation and Group Scatterplots dialog

To choose the annotation data set for this experiment complete the following steps.

Click Choose.


Select the drosophilaaging_annotation.sas7bdat file and click Open.


Click Open to examine the annotation data set.

The data set appears as shown in Figure 7.60.

Figure 7.60: The Drosophila Aging experiment annotation data set

This data set lists the gene identity and GenBank accession number for each spot on the arrays and provides a short description of the function (where known) of the gene and its product. A list of the ColumnLabels from the annotation data set appears in the Available Variables field.


Click to add Spot to the Annotation Merge Variables box, as shown in Figure 7.61.

Select Accession from the list of available variables.

Click to add Accession to the GenBank Accession Variable box, as shown in Figure 7.61.

Select ShortDescription from the list of available variables.

Click to add ShortDescription to the Annotation Label Variable box, as shown in Figure 7.61.

Figure 7.61: The selected annotation variables

Click Annotation 2.




Figure 7.62: The selected annotation variables

Click Run to produce a Scatterplot Matrix for each of these groups, as partially shown in Figure 7.63.

Figure 7.63: Scatterplot matrix

Most of the arrays fall into an elliptical space along the diagonal axis. In general, a tighter ellipse means a higher correlation, and more circular ellipses indicate increased noise. The spots outside the ellipses may be outlier genes of interest in terms of quality control or an inconsistent signal across replicates. Clicking on one of the genes highlights it across all arrays. In one example, CG10992 falls outside the ellipse across many of the arrays. To obtain more detailed annotation about this gene, select the spot and click on GenBank-Nucleotide, on the top of the Correlation ScatterPlots window, to bring up the GenBank web page for CG10992, as shown in Figure 7.64.


Figure 7.64: A portion of the GenBank web page for spot CG10992

You can also mouse over other spots in the scatterplot matrix to see their labels, and you can drag a rectangle around spots to select them in the associated JMP table. When there is a considerable number of outliers, users often check the raw image files for abnormalities at those spots. You can also apply the Pseudo Image and Surface Summary processes under the Quality Control submenu to check the raw image. This can help you decide whether to keep the outliers or filter them out of the analysis.

To filter a set of spots, select them in the corresponding JMP table in any of the following ways:

• Click and drag a rectangle in a scatterplot matrix window.
• Use the lasso tool.
• Hold the Shift key and click spots one by one.
• Click Rows > Row Selection > Select Where to define a filtering rule.

To delete the selected rows, complete the following steps.

Select Rows > Row Selection > Invert Row Selection. This command inverts the selection to the desired rows.

Select Tables > Subset to create a subset table with the desired rows.

Select File > Save As SAS Data Set to save the subset table as a .sas7bdat file.

The new data set can now be used as input for further processes. Refer to the JMP User’s Guide for additional information and directions on row selection and creating subset data sets.


Primary Data Analysis for Determining Significant Differences in Gene Expression

ANOVA

After you have performed quality control and normalization on your microarray data set, Analysis of Variance (ANOVA) is a popular method for inferring differentially expressed genes. The ANOVA process in JMP Genomics is quite flexible and enables you to specify multi-factor models that might also include random effects. Input data should be normalized prior to carrying out the ANOVA process, either by the Data Standardize and Loess Normalization processes described previously or by other methods available in the Normalization menu. This example uses the drosophilaaging_loess1_loess2.sas7bdat input data set, described previously.

Select Genomics > Row-by-Row Modeling > ANOVA, as shown in Figure 7.65.

Figure 7.65: Selecting the ANOVA process

The ANOVA dialog opens, as shown in Figure 7.66.


Figure 7.66: The ANOVA dialog

The ANOVA process contains eight tabs: General, Annotation 1, Annotation 2, Model, LSMeans, Multiple Testing, Residuals, and Options.

Make sure that the General tab is selected.

To choose the Loess-normalized data set as the input data set, complete the following steps.

Click Choose.


Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.

A list of the variables from the input data set appears in the Available Variables field. In this example, Spot serves as the gene identifier and is selected as the By Variable to generate a separate model fit for each gene. Typically, the By Variable should be a specific identifier. If an annotation file is included on the Annotation tab, a gene identifier must be listed in the Variable by which to Merge Annotation Data field to link the input file and the annotation file. In this example, choose Spot as this Link Key. The data set contains no chromosome or position information.

To specify the output folder, complete the following steps.

Click Choose.



Open the ProcessResults folder and click Select to select this folder.

The General tab of the dialog should appear like the one shown in Figure 7.67.

Figure 7.67: The completed General tab of the ANOVA dialog

Click Annotation 1.

To choose the annotation data set for this experiment, complete the following steps.

Click Choose.



A list of the ColumnLabels from the annotation data set appears in the Available Variables field.





Click Annotation 2.



Select Drosophila melanogaster from the Organism drop-down menu.

The completed Annotation tabs of the dialog should appear like those shown in Figure 7.68.


Figure 7.68: The completed Annotation 1 (top) and Annotation 2 (bottom) tabs of the ANOVA dialog

Click the Model tab.

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

A list of the ColumnLabels from the EDDS appears in the Available Variables field. This list is used to choose the Class Variables. Any variable that is selected in Class Variables is used as a non-numerical value or nominal (group) type of numerical value in the model. The order of the variables in the Class Variables determines the way LSMeans and/or interaction effects are sorted.

Select Array, Channel, Line, Sex, and Age from the list of available variables

Click to add Array, Channel, Line, Sex, and Age to the Class Variables box.

Fixed Effects allow users to set the one-, two-, or multiple-way ANOVA to model the mean of the response variable. Variables entered into this field are delimited by a space. Variables can be composed of either main effects, such as Line or Sex as in this example, or as interactions between effects, such as Line*Sex.

Type Line Sex Age Channel Line*Sex Line*Age Sex*Age Line*Sex*Age in the Fixed Effects field.


LSMeans Effects are used to construct differences and least-squares means profiles. Any effects listed here must also be listed in the Fixed Effects field and all the variables comprising them must be listed as Class Variables.

Type Line*Sex*Age in the LSMeans Effects field.

Estimate statements are arbitrarily complex hypothesis tests of the relative importance of different combinations of different fixed effects on gene expression. They can be constructed using the Estimate Builder AP. Refer to the JMP Genomics User Guide – Supplement for more details on this process.

Leave the File Containing Estimate Statements field blank.

Random Effects are used to construct the covariance structure of the response variable. They are typically composed of class variables and their interactions, but cannot include effects already specified in the Fixed Effects field. In this example, as with most two-color microarray data, Array should be treated as a random effect, because the arrays used in the experiment are randomly selected and are not reusable.

Type Array in the Random Effects field.

The completed Model tab of the dialog should appear like the one shown in Figure 7.69.

Figure 7.69: The completed Model tab of the ANOVA dialog

Click the LSMeans tab.

On the LSMeans tab, you can choose different preferences for the LSMeans Difference Set for Volcano Plots.

All Pairwise Differences lists all the possible combinations for all the experimental conditions in the volcano plots and significant gene list.

Differences with a Control compares the conditions to the control only by defining the control in the LSMeans Control Values.

None results in having no LSMean differences listed.

Make sure that the All Pairwise Differences option is selected.

The LSMeans Standardization Method may be selected from among the 17 choices in the drop-down menu.

Select STD as the LSMeans Standardization Method.


The completed LSMeans tab of the dialog should appear like the one shown in Figure 7.70.

Figure 7.70: The completed LSMeans tab of the ANOVA dialog

Click the Multiple Testing tab.

On the Multiple Testing tab, you can define the –log10(p-value) cutoff value and choose one of nine multiple-testing correction methods.

Make sure that Bonferroni is selected from the drop-down menu as the Multiple Testing Method.

Make sure that the Alpha value is set to the default value of 0.05.
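The relationship between a Bonferroni correction and a -log10(p-value) cutoff can be sketched as follows. This is the textbook form of the correction (alpha divided by the number of tests); how the ANOVA process counts tests and computes its cutoff is not described here, so treat the function names and the example p-values as hypothetical.

```python
import math

def bonferroni_cutoff(alpha, n_tests):
    """Return the -log10(p-value) cutoff for a Bonferroni correction."""
    return -math.log10(alpha / n_tests)

def significant(p_values, alpha=0.05):
    """Flag each p-value whose -log10 value exceeds the Bonferroni cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [-math.log10(p) > cutoff for p in p_values]

# Hypothetical p-values for four tests: the per-test threshold is 0.05 / 4.
p_values = [1e-6, 0.004, 0.03, 0.2]
cutoff = bonferroni_cutoff(0.05, len(p_values))
flags = significant(p_values)
```

Because the cutoff grows with the number of tests, experiments with many genes require much smaller p-values before a gene is declared significant.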

The completed Multiple Testing tab of the dialog should appear like the one shown in Figure 7.71.

Figure 7.71: The completed Multiple Testing tab of the ANOVA dialog

Click on the Residuals tab.

On the Residuals tab, you can define several parameters describing how to handle the residuals from the ANOVA model fits. Residuals are statistics useful for quality control and assessment of goodness-of-fit. Selecting a Filtration Method for Data with Large Residuals allows you to set up rules to filter outliers which are statistically far from fitting the model (Chu, Weir et al. 2002).

Make sure the Plot Standardized Residuals checkbox is checked.

The completed Residuals tab of the dialog should appear like the one shown in Figure 7.72.


Figure 7.72: The completed Residuals tab of the ANOVA dialog


On the Options tab, you can specify additional output options including Uniformly Scale Y-Axis in Volcano Plots or even Activate Spoken Description.

Make sure the options are all deselected.

Output model file names may be specified. Otherwise, the program assigns output file names for you based on the name of the Input Data Set.

Do not specify names for the output files.

The completed Options tab of the dialog should appear like the one shown in Figure 7.73.

Figure 7.73: The completed Options tab of the ANOVA dialog

Click Run to run the ANOVA.

Running the ANOVA process produces several windows including Volcano Plots, Parallel and PCA Plots, Clustering, Variability Estimates, Action Buttons, and Significant Differences table. The Variability Estimates window, shown in Figure 7.74, displays estimates of sources of variability for each gene and provides a final QC check to ascertain how well the models fit the data.


Figure 7.74: The estimates of variability

Here, the PropVar_Array distribution quantifies the proportion of array-to-array (equivalent to spot-to-spot) variability, and the PropVar_Residual distribution quantifies the proportion of within-spot and unexplained variability. The RSquared distribution displays the proportion of variability explained by the model for each gene.

The Volcano Plots window, shown in Figure 7.75, indicates the genes that show significant differential expression.


Figure 7.75: The Volcano plots

The genes above the red dotted line exceed a multiple testing cutoff for significant differential expression. The red dotted line is computed in this case according to the Bonferroni criterion.

The Hierarchical Clustering of LSmeans display, shown in Figure 7.76, clusters those genes that are significantly differentially expressed. Zooming in on clusters and comparing them with known biological groups in the annotation table can help interpret the results.


Figure 7.76: Hierarchical clustering of LSmeans

The Action Buttons window, shown in Figure 7.77, offers easy access to multiple biological interpretation websites to search for the selected genes. It also provides an opportunity to change the cutoff value without resetting the model.

Figure 7.77: The Action Buttons window

Mixed Model Analysis

While the aforementioned ANOVA process is fairly flexible, even more complex mixed models are available by using the Mixed Model Analysis process. For this process you must be familiar with SAS Proc Mixed syntax. Refer to the SAS 9.1.3 User’s Guide for additional information.


Select Genomics > Row-by-Row Modeling > Mixed Model Analysis, as shown in Figure 7.78.

Figure 7.78: Selecting the Mixed Model Analysis process

The Mixed Model Analysis dialog opens, as shown in Figure 7.79.


Figure 7.79: The Mixed Model Analysis dialog



Click Choose.


Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.

A list of the variables from the input data set appears in the Available Variables field. In this example, Spot serves as the gene identifier and is selected as the By Variable to generate a separate model fit for each gene. Typically, the By Variable should be a specific identifier. If an annotation file is included on the Annotation tab, a gene identifier must be listed in the Variable to Keep in Output or By which to Merge Annotation Data field to link the input file and the annotation file. In this example, choose Spot as this Link Key. The data set contains no chromosome or position information.


Click to add Spot to the By Variables box.


Click to add Spot to the Variable to Keep in Output or By which to Merge Annotation Data box.



Click Choose.



The completed General tab of the dialog should appear like the one shown in Figure 7.80.

Figure 7.80: The completed General tab of the Mixed Model Analysis dialog

Click Annotation 1.

To choose the annotation data set for this experiment, complete the following steps.

Click Choose.



A list of the ColumnLabels from the annotation data set appears in the Available Variables field.





Click Annotation 2.




Select Drosophila melanogaster from the Organism drop-down menu.

The completed Annotation tabs of the dialog should appear like those shown in Figure 7.81.

Figure 7.81: The completed Annotation 1 (top) and Annotation 2 (bottom) tabs of the Mixed Model dialog

Click the Model tab.

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.

A list of the ColumnLabels from the EDDS appears in the Available Variables field. To ensure that all of the data from any one row is used in the model,

Leave the Design-Level by Variables field blank.


The Proc Mixed Statements box contains the primary SAS code. The SAS syntax needed to run the model is illustrated in Figure 7.82.

Figure 7.82: The Proc Mixed Statements box and associated SAS syntax

The SAS code can be divided into a number of distinct statements.

o The CLASS statement specifies all variables whose levels form distinct categories in the model.

o The MODEL statement specifies the dependent variable (always set this to RESPONSE) and the fixed effects.

o The RANDOM statement specifies Array as a random effect, which models spot-to-spot (whole plot) variability for this example.

o The LSMEANS statement requests means for the full three-way interaction. Although not shown here, you can specify the DIFF option in the LSMEANS statement to automatically obtain a set of pairwise differences.

o The ESTIMATE statements specify custom hypothesis tests. Each ESTIMATE statement generates one volcano plot.

Refer to the SAS/Stat Proc Mixed documentation in the SAS 9.1.3 User’s Guide for further information on these and other statements you can use.

Specify the same parameters in the Multiple Testing, Residuals, and Options tabs as previously done for the ANOVA process, and illustrated in Figures 7.71, 7.72, and 7.73, respectively.

Click Run to run the Mixed Model process.

The output from the Mixed Model Analysis process (not shown) contains the same displays as the ANOVA process.

Further Analysis

JMP Genomics provides additional procedures for analyzing microarray data. However, while the preceding analyses have all used a tall form of the Drosophila data along with an accompanying Experimental Design Data Set, these additional processes require that the data set be in the wide form. We begin by showing you how to combine these two data sets into one wide data set that can be used for other processes, like those found in the Pattern Discovery and Predictive Modeling folders. Refer to Chapter 4 for additional information.

Note: SAS data sets are tuned to work best with a large number of rows rather than a large number of columns. For wide data sets with tens of thousands of columns, execution times may be long. One way to address this issue is to work with the tall data set whenever possible. When use of tall data sets is not possible, reducing the number of genes under consideration can help reduce the execution times. For example, use Data Set Utilities > Statistics for Rows to filter genes that have low overall variance; that is, genes that have a flat profile across the whole experiment. For a more rigorous statistical criterion, you can use either the ANOVA or the Mixed Model Analysis process to select only those genes that have a significant difference somewhere in the experiment. However, keep in mind that such a pre-filtering will bias cross-validation rates computed from any of the Predictive Modeling processes. Alternatively, you can use K-Means Clustering on the tall data set to select a representative set of genes that does not depend directly on experimental design variables.
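The low-variance filter mentioned in the note above can be sketched in Python as follows. This only illustrates the idea of dropping genes with a flat profile; the function name, the gene names, and the threshold are hypothetical, and the Statistics for Rows process offers its own options for this step.

```python
from statistics import variance

def filter_low_variance(profiles, min_variance):
    """Keep only the genes whose sample variance is at least min_variance."""
    return {
        gene: values
        for gene, values in profiles.items()
        if variance(values) >= min_variance
    }

# Hypothetical expression profiles across four channels.
profiles = {
    "Spot_4": [7.0, 7.1, 6.9, 7.0],   # flat profile
    "Spot_5": [5.0, 8.0, 6.0, 9.5],   # varies across the experiment
}
kept = filter_low_variance(profiles, min_variance=0.5)
```

A filter like this uses no experimental design information, so it does not carry the cross-validation bias described above for significance-based pre-filtering.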

Transpose Tall and Wide

Recall from Chapter 4 that a tall data set and its accompanying EDDS can be transformed into a wide data set using the Transpose Tall and Wide command.

Select Genomics > Data Set Utilities > Transpose Tall and Wide, as shown in Figure 7.83.

Figure 7.83: Selecting the Transpose Tall and Wide process

The Transpose Tall and Wide dialog opens, as shown in Figure 7.84.


Figure 7.84: The Transpose Tall and Wide dialog

Make sure the Tall -> Wide tab is selected.

To choose the Loess normalized data set as the input data set, complete the following steps.

Click Choose.


Select the drosophilaaging_loess1_loess2.sas7bdat file and click Open.

A list of the ColumnLabels from the input data set appears in the Available Variables field. In this example, the values contained in the Spot column serve as the column names in the wide data set.


Click to add Spot to the Variables Defining Wide Column Names box.

To enter a prefix for the wide column names,

Type Spot_ into the Prefix for Wide Column Names box.

To choose the EDDS, complete the following steps.

Click Choose.


Select the drosophilaaging_exp.sas7bdat file and click Open.



Click Choose.


Open the ProcessResults folder and click on Select to select this folder.

The completed Transpose Tall and Wide dialog should appear like the one shown in Figure 7.85.

Figure 7.85: The completed Transpose Tall and Wide dialog


The transposed SAS data set, drosophilaaging_loess1_loess2_wide.sas7bdat, generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 7.86).


Open the data set and note how the experimental design data has been combined with the expression data, with individual genes now forming columns in the wide data set. This data set is now ready to serve as input for further analyses.
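The tall-to-wide transformation can be sketched as below. The layout is an assumption made for illustration: the tall data set has one row per spot and one column per channel, and the design data set has one row per channel that names the matching tall column. The key names `Spot` and `ColumnName`, the column names, and the function `tall_to_wide` are hypothetical and do not describe the internals of the Transpose Tall and Wide process.

```python
def tall_to_wide(tall_rows, design_rows, id_key="Spot",
                 column_key="ColumnName", prefix="Spot_"):
    """Build one wide row per design row, with one prefixed column per spot."""
    wide_rows = []
    for design in design_rows:
        column = design[column_key]   # name of the matching tall column
        wide = dict(design)           # start from the design variables
        for row in tall_rows:
            wide[f"{prefix}{row[id_key]}"] = row[column]
        wide_rows.append(wide)
    return wide_rows

# Hypothetical tall data: two spots measured on two channels.
tall = [
    {"Spot": 4, "a1_ch1": 7.1, "a1_ch2": 6.8},
    {"Spot": 5, "a1_ch1": 9.0, "a1_ch2": 9.4},
]
# Hypothetical design data: one row per channel.
design = [
    {"ColumnName": "a1_ch1", "Array": 1, "Sex": "F"},
    {"ColumnName": "a1_ch2", "Array": 1, "Sex": "M"},
]
wide = tall_to_wide(tall, design)
```

Each wide row now carries its design variables together with one `Spot_` column per gene, which is the shape described above for the transposed data set.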


K-Means Clustering

K-Means clustering is a standard technique for partitioning data into a set number of similar groups. The K-Means Clustering process clusters the rows of the input data set, so depending on whether you want to cluster samples or genes, you might need to transpose your data as shown previously. This example clusters the rows of the normalized, wide data set just transposed, grouping together those that have similar expression profiles.

Select Genomics > Pattern Discovery > K-Means Clustering, as shown in Figure 7.87.

Figure 7.87: Selecting the K-Means Clustering process

The K-Means Clustering dialog opens, as shown in Figure 7.88.


Figure 7.88: The K-Means Clustering dialog


To choose the wide Loess normalized data set, generated previously, as the input data set, complete the following steps.

Click Choose.


Select the drosophilaaging_loess1_2_wide.sas7bdat file and click Open. A list of the ColumnLabels from the input data set appears in the Available Variables field.

To choose the variable by which to label points in the plots, complete the following steps.

Select Array from the list of available variables.

Click to add Array to the Label Variable box.

To choose the variables whose observations are to be clustered (in this example, expression data), complete the following steps.

Select all of the numeric variables (Spot_4 through Spot_297) from the list of available variables.

Click to add these variables to the Variables Whose Rows are to be Clustered box.

Alternatively, you could specify variables Spot_4 through Spot_297 by typing Spot_ in the List-Style Specification of Variables Whose Rows are to be Clustered field.


Type 5 in the Number of Clusters box to cluster the rows into 5 groups.


Click Choose.



The completed dialog should appear like the one shown in Figure 7.89.

Figure 7.89: The completed K-Means Clustering dialog

Click Run to generate two JMP tables and two graphics.

The drosophilaaging_loess1_2_wide_kmc table (partially shown in Figure 7.90) lists various statistics about the clusters that were generated.


Figure 7.90: The drosophilaaging_loess1_2_wide_kmc table

The drosophilaaging_loess_1_2_wide_kmd table (shown in Figure 7.91) shows various statistics about the arrays. Note the Cluster and Distance to Cluster Seed columns on the far right-hand side of the table, which give the cluster to which each array was assigned and its distance to the cluster seed. The cluster seed is the mean of the cluster.

Figure 7.91: The drosophilaaging_loess_1_2_wide_kmd table
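The Cluster and Distance to Cluster Seed columns described above can be illustrated with a plain K-Means sketch. The Python below uses the textbook Lloyd iteration with the first k rows as starting seeds; this is an assumption made for illustration, and the K-Means Clustering process may choose seeds and iterate differently. The function names and data are hypothetical.

```python
import math

def nearest(row, seeds):
    """Index of the seed closest to row (Euclidean distance)."""
    return min(range(len(seeds)), key=lambda c: math.dist(row, seeds[c]))

def kmeans(rows, k, iterations=20):
    """Textbook K-Means: assign rows to the nearest seed, then move each
    seed to the mean of its members. Starts from the first k rows."""
    seeds = [list(row) for row in rows[:k]]
    for _ in range(iterations):
        assignment = [nearest(row, seeds) for row in rows]
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:  # leave an empty cluster's seed where it is
                seeds[c] = [sum(col) / len(members) for col in zip(*members)]
    assignment = [nearest(row, seeds) for row in rows]
    distances = [math.dist(row, seeds[a]) for row, a in zip(rows, assignment)]
    return assignment, distances, seeds

# Hypothetical rows (for example, arrays) measured on two variables.
rows = [[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.8]]
assignment, distances, seeds = kmeans(rows, k=2)
print(assignment)
print([round(d, 4) for d in distances])
```

The assignment list plays the role of the Cluster column, and the distances list plays the role of the Distance to Cluster Seed column.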

The parallel plots show the cluster profiles across all 100 genes.

Right-click on a parallel plot to change the color scheme or make a legend for a particular variable.

Right-click the first parallel plot and select Row Legend to bring up the list of variables.

Choose the Array variable to produce the plots and legends shown in Figure 7.92.


Figure 7.92: Parallel plots for the clusters

Hovering the cursor over a peak gives the array that generated it. (Additional label variables can be specified in the JMP table to add more data to the mouse-over pop-up boxes.) Note that the sharp dip in cluster 4 belongs to array 6, which showed up as an outlier in the principal components analysis described previously (see Figure 7.49). Use Rows > Color or Mark by Column to color the profiles according to known variables. Try coloring the profiles by Sex to see if the five clusters segregate according to this variable.

In another window, the graphic that displays the cluster frequencies (shown in Figure 7.93) shows four histogram bars instead of five. That is because clusters 2 and 4 have similar frequencies.


Figure 7.93: Cluster frequencies

Because the graphic is linked to the underlying cluster table, highlighting the tallest bar also highlights the two clusters it represents, making this relationship relatively clear.

Distance Matrix

The Distance Matrix process computes various measures of distance or dissimilarity between the observations/rows of a data set.

Select Genomics > Pattern Discovery > Distance Matrix. The Distance Matrix dialog opens, as shown in Figure 7.94.


Figure 7.94: The Distance Matrix dialog


To choose the wide Loess normalized data set, generated previously, as the input data set, complete the following steps.

Click Choose.


Select the drosophilaaging_loess1_2_wide.sas7bdat file and click Open. A list of the ColumnLabels from the input data set appears in the Available Variables field. There are different categories of variables.

o Variables within which to compute differences
o ID variables serve to identify rows in the data set and are not a formal part of the clustering process.
o Copy variables are simply copied to the output data set.
o By variables instruct the process to perform clustering separately for each distinct combination of the By variable levels.

Refer to the DISTANCE Procedure documentation in the SAS 9.1.3 User’s Guide for additional information.

To choose the Variables within which to compute differences (in this example, expression data), complete the following steps.

Select all of the numeric variables (Spot_4 through Spot_297) from the list of available variables.

Click to add these variables to the Variables Within Which to Compute Differences field.

Alternatively, you could specify variables Spot_4 through Spot_297 by typing Spot_ in the List-Style Specification of Variables Within Which to Compute Differences field.

To choose names for the distance variables, complete the following steps.

Select array, sex, age, and line from the list of available variables.

Click to add these variables to the ID Variable box.

To specify the level of measurement used to compute the distance, select Interval from the drop-down menu.

To specify the method to be used to compute the distance, select DSQCORR from the drop-down menu.

To specify the Output Folder, complete the following steps.

Click Choose.



The completed General tab of the Distance Matrix dialog should appear like the one shown in Figure 7.95.


Figure 7.95: The completed General tab of the Distance Matrix dialog.


To specify the method for standardization, select STD from the drop-down menu.

The completed Options tab of the Distance Matrix dialog should appear like the one shown in Figure 7.96.

Figure 7.96: The completed Options tab

Click Run to generate the distance matrix.

A new data set (drosophilaaging_loess1_2_dm.sas7bdat) and a heat map are generated. The heat map is shown in Figure 7.97.


Figure 7.97: Heat map

This heat map shows that channels from the same array are closest in terms of the DSQCORR metric, and form the 2 × 2 blocks in the heat map.
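To make the DSQCORR metric concrete: in the SAS DISTANCE procedure, DSQCORR is one minus the squared correlation between two rows, so rows whose profiles are strongly correlated (positively or negatively) are close together. The following Python sketch computes such a matrix for a few hypothetical rows. The function names and data are illustrative assumptions, and the sketch omits the standardization step selected on the Options tab.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def dsqcorr_matrix(rows):
    """Pairwise distance 1 - r**2 between every pair of rows."""
    return [[1.0 - pearson(a, b) ** 2 for b in rows] for a in rows]

# Hypothetical expression profiles: rows 0 and 1 mimic two channels of one array.
rows = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 1.0, 3.0, 2.0],
]
distances = dsqcorr_matrix(rows)
for line in distances:
    print([round(d, 3) for d in line])
```

In this toy example the first two rows are nearly proportional, so their distance is close to zero, which is the kind of pattern that produces the 2 × 2 blocks in the heat map.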

Predictive Modeling

In conjunction with, or as an alternative to, the Row-by-Row Modeling and Pattern Discovery processes described previously, you might want to perform exploratory predictive modeling and data mining. See Chapter 10 for a description of relevant processes. Note that Predictive Modeling processes require the data to be in wide format, as created previously with the Transpose Tall and Wide process.

Chapter 8

Microarray Case Study II: Affymetrix Latin Square Data

In Chapter 7, we considered an experiment conducted with cDNA microarrays. Here we consider oligonucleotide arrays. Oligonucleotide arrays differ from cDNA arrays in that the sequence for each oligonucleotide is shorter and is usually determined a priori using bioinformatics techniques. Oligonucleotides are typically mass-produced and are often used to study model systems.

This chapter uses the Affymetrix Latin Square data set described in Chapter 1. Recall that this data set was originally generated by Affymetrix Inc. to develop and validate their U95A GeneChip and Microarray Suite (MAS) 5.0 algorithm over a range of known concentrations. The experiment consists of 14 experimental groups. Each group contains a pool of non-specific RNA as well as a set of 14 distinct human transcripts spiked in at known concentrations. The concentrations are staggered in a Latin Square arrangement. The data have been trimmed to only 100 genes, and trimmed versions of the .CEL files containing just these 100 genes are available in the JMP Genomics Sample Data folder.

Generation of the Required SAS Data Set and EDDS

The SAS data set and EDDS required for the analyses presented here were generated from the raw .CEL files and an Experimental Design File, as discussed in Chapter 3. If you have not already generated these files, review the instructions for this example and generate the SAS data set and EDDS now. Make sure the output files are saved in the ProcessResults folder. The output consists of three SAS data sets:

o the affyinputengine.sas7bdat input data set,
o the affyinputengine_exp.sas7bdat experimental design data set (EDDS), and
o the probemap_hg_u95a_trim.sas7bdat data set listing the physical x and y array coordinates of each spot.

Note: Importing standard .CEL files generates a fourth output data set, containing the quality control (QC) probe sets. These QC probe sets, normally contained in Affymetrix data sets, are not included in the custom trimmed .CEL files in this example.

Assessing the Quality of the Data

With a new data set, it is advisable to perform quality control analyses before proceeding to other analyses.

Data Standardization and Distribution Analysis

In this example, as in the Drosophila example, we use a univariate distribution analysis to initially assess the quality of the data.

Click Genomics > Quality Control > Distribution Analysis, as shown in Figure 8.1.


Figure 8.1: Selecting the Distribution Analysis process

The Data Distribution dialog opens, as shown in Figure 8.2.

Figure 8.2: The Data Distribution dialog

Click Choose to select the input data set.



Select the affyinputengine.sas7bdat file.


Select all of the available variables from a_01 to q_59.

Click to add the selected variables to the Variables for which to Display Distributions box, as shown in Figure 8.3.

Figure 8.3: Selecting the variables for distribution view


Click Choose.



The General tab of the Data Distribution dialog appears like the one shown in Figure 8.4.

Figure 8.4: The completed General tab of the Data Distribution dialog

Click Experimental Design.



Click the affyinputengine_exp.sas7bdat file.


Select Experiment and ColumnName as the color variable and label variable, respectively, as shown in Figure 8.5.


Figure 8.5: Selecting the Color and Label variables

The Experimental Design tab of the Data Distribution dialog appears like the one shown in Figure 8.6.

Figure 8.6: Selecting the Label Variable

The Options tab shows display options for the results.

Click the Options tab to view the default settings.



Running this process produces the overlay plot of kernel density estimates shown in Figure 8.7.

Figure 8.7: The overlaid kernel density plot for the affyinputengine.sas7bdat data set

The distributions for these arrays are very similar, indicating that this is a high quality data set.

Correlation and Principal Components

Now, examine the quality of the data using several variables. Run the Correlation and Principal Components process.

Select Genomics > Quality Control > Correlation and Principal Components, as shown in Figure 8.8.

Figure 8.8: Selecting the Correlation and Principal Components process

The Correlation and Principal Components dialog opens, as shown in Figure 8.9.


Figure 8.9: The Correlation and Principal Components dialog



Click the affyinputengine.sas7bdat file.


Select all of the available variables from a_01 to q_59 to compute their correlations.

Click to add the selected variables to the Variables box.





Select Experiment as the color variable.


Click Choose.




The completed General tab of the Correlation and Principal Components dialog appears like the one shown in Figure 8.10.

Figure 8.10: The completed General tab of the Correlation and Principal Components dialog

The Analysis tab allows you to transform the data prior to analysis and to specify the type of correlation and number of principal components.

Click Analysis to view the default settings.

Do not make any changes to the Analysis tab.

The Variance Components tab allows you to compute a variance components decomposition of the principal components, partitioning variability in terms of known effects.

Click Variance Components to view the default settings.

Do not make any changes to the Variance Components tab.

The Options tab contains display preferences for the results.



Click Run.

Running this process produces the correlation heat map shown in Figure 8.11.


Figure 8.11: The correlation heat map

The clustered heat map shown in Figure 8.11 displays the correlation matrix of the 59 samples. The samples cluster tightly according to their spike-in profiles and generate a very distinct pattern of correlation. This plot and its dendrogram are linked to a principal components plot (not shown).


Correlation and Grouped Scatterplots

The Correlation and Grouped Scatterplots process is a related multivariate quality control process that computes and annotates correlations and scatterplot matrices for expression measurements across groups of arrays.

Select Genomics > Quality Control > Correlation and Grouped Scatterplots. The Correlation and Grouped Scatterplots dialog opens, as shown in Figure 8.12.

Figure 8.12: The Correlation and Grouped Scatterplots dialog



Click the affyinputengine.sas7bdat file.


Select Unit from the list of available variables.

Click to add Unit to the Variables By Which to Merge Annotation Data box.



Select the affyinputengine_exp.sas7bdat file.



Select Experiment from the list of available variables.

Click to add Experiment to the Variables Defining Groups box.


Click Choose.



The completed General tab of the Correlation and Grouped Scatterplots dialog appears like the one shown in Figure 8.13.

Figure 8.13: The completed General tab of the Correlation and Grouped Scatterplots dialog

The Annotation tabs allow you to merge information regarding individual genes and experimental groups into your output.

Click Annotation 1.

Click Choose to select the annotation data set.

Navigate to the Sample Data\Microarray\Affymetrix Latin Square folder.

Select the u95a.sas7bdat file.




Click to add Unit to the Annotation Merge Variables box.

Select Description from the list of available variables.

Click to add Description to the Annotation Label Variable box.

Click Annotation 2.


Click to add Accession to the GenBank Accession Variable box.

Select Gene Symbol from the list of available variables.

Click to add Gene Symbol to the Gene Symbol Variable box.

Select Description from the list of available variables.

Click to add Description to the Gene Description Variable box.

Select LocusLink from the list of available variables.

Click to add LocusLink to the Gene or LocusLink ID Variable box.

Select Homo sapiens from the Organism drop-down menu.

The completed Annotation tabs of the Correlation and Grouped Scatterplots dialog appear as shown in Figure 8.14.


Figure 8.14: The completed Annotation 1 (top) and Annotation 2 (bottom) tabs of the Correlation and Grouped Scatterplots dialog

The Options tab contains display options for the results.



Click Run.

Running this process produces the correlation scatterplots shown in Figure 8.15.


Figure 8.15: The correlations and scatterplot matrices for the AffymetrixLatinSquare input data example

There is a separate scatterplot matrix for each of the 14 experimental groups. Note the cigar-shaped distribution along the 45-degree diagonal and the very high correlations (shown above the scatterplot matrices). These results indicate very high repeatability within sample groups. There are a few outlying probes that appear far from the main diagonal. These represent probes whose measurements were inconsistent across the arrays, and should be handled carefully. Mouse over them to see their labels, and then drag a rectangle around them to select them in the associated JMP table. If there are a considerable number of outliers, go back and check the raw image files for abnormalities at those spots. This can help you decide whether to keep them or delete them from the analysis.

To filter a set of spots, select them in the corresponding JMP table using one of the tools available in JMP and delete them. To select rows, choose one of these options:

• click and drag a rectangle around the spots in one of the scatterplot matrix windows
• use the lasso tool
• hold the Shift key and click spots one by one
• click Rows > Row Selection > Select Where to define a filtering rule

Refer to the JMP User Guide for more details on selecting rows. With the rows selected, complete the following steps.

Click Rows > Row Selection > Invert Row Selection to invert the selection to include only the rows that should be kept.

Click Tables > Subset to create a subset table with the desired rows.


Click File > Save As SAS Data Set to save the subset table as a .sas7bdat file.

This new subset data set can now be used as input for further analyses.

Note: The ANOVA and Mixed Model Analysis processes described later also provide a means to automatically filter outliers based on the magnitude of discrepancy from a fitted statistical model.

Feature Flagger

This quality control process flags specific probe-level observations that have unusually low signals, as compared to a specified group median.

Select Genomics > Quality Control > Feature Flagger. The Feature Flagger dialog opens, as shown in Figure 8.16.

Figure 8.16: The Feature Flagger dialog





Select Probe_Set_ID from the list of available variables.

Click to add Probe_Set_ID to the Feature Variable box.

Select Probe from the list of available variables.


Click to add Probe to the Sub-Feature Variable box.





Select Array from the list of available variables.

Click to add Experiment to the Design-Level Grouping Variables box.

The Threshold is specified as 5 by default. Observations whose intensities differ from the median intensity by more than this value are flagged in the output.

Do not change the threshold value.

To specify the output folder, complete the following steps.

Click Choose.



The completed Input tab of the Feature Flagger dialog appears like the one shown in Figure 8.17.


Figure 8.17: The completed Input tab of the Feature Flagger dialog

The Options tab allows you to specify the various types of output from this process.

Make no changes to the Options tab.

Click Run to generate the table shown in Figure 8.18.

Figure 8.18: The Flagged Features table

The probes highlighted in red have unusually low signals.
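The flagging rule can be sketched roughly as follows. This Python example assumes one particular reading of the rule: an observation is flagged when it falls below the median of its group by more than the threshold. The exact comparison and scale that the Feature Flagger process uses are not stated here, so treat the comparison, the field names, and the data as assumptions.

```python
from collections import defaultdict
from statistics import median

def flag_low_features(observations, threshold=5.0):
    """Return observations whose intensity is below their group's median
    by more than threshold (an assumed, one-sided reading of the rule)."""
    by_group = defaultdict(list)
    for obs in observations:
        by_group[obs["group"]].append(obs["intensity"])
    medians = {group: median(values) for group, values in by_group.items()}
    return [
        obs for obs in observations
        if medians[obs["group"]] - obs["intensity"] > threshold
    ]

# Hypothetical probe-level observations in two groups.
observations = [
    {"probe": "p1", "group": "A", "intensity": 12.0},
    {"probe": "p2", "group": "A", "intensity": 13.0},
    {"probe": "p3", "group": "A", "intensity": 14.0},
    {"probe": "p4", "group": "A", "intensity": 2.0},   # unusually low
    {"probe": "p5", "group": "B", "intensity": 9.0},
    {"probe": "p6", "group": "B", "intensity": 10.0},
    {"probe": "p7", "group": "B", "intensity": 11.0},
]

flagged = flag_low_features(observations, threshold=5.0)
print([obs["probe"] for obs in flagged])
```

Replacing the one-sided comparison with an absolute difference would flag unusually high observations as well.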


Array PseudoImage

In the event that original images of the arrays are not available, JMP Genomics can generate a pseudo-color representation of the data on a given array. Note that because the trimmed data set used previously to illustrate the processes discussed in this chapter does not have complete information for all of the probes, the image generated with this data set may not accurately reflect the real image of the array. In this example, therefore, we use the default example to generate a pseudo-image of array f_45 from the Affymetrix Latin Square Example data set.

Select Genomics > Quality Control > Pseudo Image.

The Array Pseudo Image dialog opens, as shown in Figure 8.19.

Figure 8.19: The Array PseudoImage dialog

To load the default AffymetrixLatinSquareExample, complete the following steps.

Click Load.

Select the settings for the AffymetrixLatinSquareExample.

Click OK to complete the Array Pseudo Image dialog, as shown in Figure 8.20.


Figure 8.20: The completed Input tab of the Array PseudoImage dialog

Click Run.

Select 45 in the Array Data Library dialog to generate the pseudoimage shown in Figure 8.21.


Figure 8.21: The pseudoimage of array f_45

A data set listing probes, x- and y-coordinates, and response for each spot is generated in addition to the pseudo image. Highlighting the appropriate gene in the JMP table also highlights it in the pseudo image, and vice versa, thus providing another potential way to filter data.

Surface Summary

Another technique that can be helpful for quality control and normalization constructs a spatially smoothed surface plot of the background intensity of a chip. This process plots the surface data in three dimensions. Anomalies in the surface might indicate areas of poor quality due to technical issues. Note that because the trimmed data set used previously to illustrate the processes discussed in this chapter does not have complete information for all the probes, the image generated with this data set might not accurately reflect the real surface of the array. In this example, therefore, we use the default example to generate surface summaries of arrays f_45 and m_55 from the Affymetrix Latin Square Example data set used previously.

To generate surface plots of the Affymetrix sample data, complete the following steps.

Select Genomics > Quality Control > Surface Summary. The Surface Summary dialog opens, as shown in Figure 8.22.


Figure 8.22: The Surface Summary dialog

To load the default Affymetrix Latin Square Example, complete the following steps.

Click Load.

Select the settings for the AffymetrixLatinSquareExample.

Click OK to complete the Surface Summary dialog, as shown in Figure 8.23.


Figure 8.23: The completed General tab of the Surface Summary dialog

The Analysis tab allows specification of the following parameters:

o the number of blocks in the surface plot,
o the range of acceptable z-values,
o the summary statistic calculated for z in each x-y block,
o the origin, and
o any subsetting of the data.

Click the Analysis tab.

Examine the default settings. Default specifications include a 32 by 32 grid, no minimal/maximal z-values, the Min summary statistic, a bandwidth multiplier of 1 for moderate smoothing, and the top left corner designated as the origin.

Make no changes to the Analysis tab.

The Options tab allows you to specify the various types of output from this process.

Make no changes to the Options tab.

Click Run to generate the surface plots for arrays f_45 and m_55 of the Affymetrix Latin Square Example, as shown in Figure 8.24.


Figure 8.24: The surface plot for array f_45 (left) and m_55 (right)

Note that the background surface of chip f_45 appears fairly smooth whereas that for chip m_55 has a region of unusually high background signal.
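The block summary behind these surface plots (a grid of x-y blocks with a summary statistic such as Min computed for z in each block) can be sketched as follows. This Python example is a simplified illustration under stated assumptions: it uses integer probe coordinates with the origin at the top left, a tiny 2 by 2 grid instead of the default 32 by 32, and no spatial smoothing. The function name and data are hypothetical.

```python
def block_summary(points, width, height, blocks=32, stat=min):
    """Summarize z within each cell of a blocks-by-blocks grid.

    points: iterable of (x, y, z) with 0 <= x < width and 0 <= y < height.
    Returns a list of rows (row 0 at the top); empty cells are None.
    """
    cells = [[[] for _ in range(blocks)] for _ in range(blocks)]
    for x, y, z in points:
        col = min(x * blocks // width, blocks - 1)
        row = min(y * blocks // height, blocks - 1)
        cells[row][col].append(z)
    return [[stat(cell) if cell else None for cell in row] for row in cells]

# Hypothetical probe intensities on a 4 x 4 array, summarized on a 2 x 2 grid.
points = [(0, 0, 5.0), (1, 1, 3.0), (3, 0, 9.0), (2, 3, 7.0), (3, 3, 4.0)]
surface = block_summary(points, width=4, height=4, blocks=2, stat=min)
for row in surface:
    print(row)
```

Taking the minimum in each block is one way to approximate local background, since the lowest intensities in a region are the least likely to reflect specific signal. A region of blocks with unusually high minima would correspond to the kind of anomaly noted for chip m_55.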

Data Normalization

Since the data set contains high quality data, prepare it for analysis. To do this, use the Data Standardization process to normalize the affyinputengine.sas7bdat data set used previously.

Select Genomics > Normalization > Data Standardize.

The Data Standardize dialog opens, as shown in Figure 8.25.


Figure 8.25: The Data Standardize dialog







Select the affyinputengine_exp.sas7bdat file.



Click Choose.



The completed Data Standardize dialog appears as shown in Figure 8.26.


Figure 8.26: The completed Data Standardize dialog

Click Run to standardize the data.

The standardized SAS data set, affyinputengine_std.sas7bdat, generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 8.27).


The Data Distribution process was run for the normalized affyinputengine_std.sas7bdat data set. Examination of the resulting overlaid kernel density plot (not shown) indicates that the sample distributions of the normalized data set are even more consistent than those seen in Figure 8.7. The normalized data set is therefore used for subsequent analysis.
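To illustrate why standardization makes the array distributions more consistent, the sketch below centers each column to mean 0 and scales it to standard deviation 1. This is one common form of standardization, assumed here for illustration; the method and options actually applied by the Data Standardize process depend on the dialog settings. The column values are hypothetical.

```python
from statistics import mean, stdev

def standardize_columns(columns):
    """Center each column to mean 0 and scale to standard deviation 1."""
    result = {}
    for name, values in columns.items():
        center, scale = mean(values), stdev(values)
        result[name] = [(value - center) / scale for value in values]
    return result

# Hypothetical intensities for two arrays with different location and spread.
columns = {
    "a_01": [2.0, 4.0, 6.0],
    "q_59": [10.0, 20.0, 30.0],
}

standardized = standardize_columns(columns)
for name, values in standardized.items():
    print(name, values)
```

After this transformation the two hypothetical arrays have the same location and spread, so their density curves would overlay closely.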

Pattern Discovery

Once you have performed quality control and normalization on your data, you might want to run pattern discovery processes on the data to understand them better. Chapter 7 provides examples of these processes. For this case study, we move directly to statistical modeling of the probe-level data.

Analysis of Variance (ANOVA)


The ANOVA process fits a linear model to each probe set in a normalized data set.

Select Genomics > Row-by-Row Modeling > ANOVA. The ANOVA dialog opens, as shown in Figure 8.28.

Figure 8.28: The ANOVA dialog



Select the normalized affyinputengine_std.sas7bdat file.



Click to add Probe_Set_ID to the By Variables box.


Click to add Probe_Set_ID to the Variables to Keep in Output or By Which to Merge Annotation Data box.

There is no chromosome or position data in the data set.


Leave both the Chromosome Variable and Position Variables fields blank.

Select Probe from the list of available variables.

Click to add Probe to the Class Variables box.


Click Choose.



The completed General tab of the ANOVA dialog appears like the one shown in Figure 8.29.

Figure 8.29: The completed General tab of the ANOVA dialog

Click Annotation 1.

Click Choose to select the annotation data set.


Select the u95a_trim.sas7bdat file.



Click to add Probe_Set_ID to the Annotation Merge Variables box.


Click to add Probe_Set_ID to the Annotation Label Variable box.

Click Annotation 2.


Select Sequence_Derived_From from the list of available variables.

Click to add Sequence_Derived_From to the GenBank Accession Variable box.

Select Gene_Symbol from the list of available variables.

Click to add Gene_Symbol to the Gene Symbol Variable box.

Select Title from the list of available variables.

Click to add Title to the Gene Description Variable box.


Click to add LocusLink to the Gene or LocusLink ID Variable box.

Select Homo sapiens from the Organism drop-down menu.

The completed Annotation tabs of the ANOVA dialog appear as shown in Figure 8.30.

Figure 8.30: The completed Annotation 1 tab (top) and Annotation 2 tab (bottom) of the ANOVA dialog


The Model tab allows you to specify the variables and effects from your experimental design that may affect your model. It is important to appropriately specify class variables, fixed effects, and random effects.

Click the Model tab.

To specify the EDDS, complete the following steps.

Click Choose.



Click Open to select the file.

Class variables are those whose levels form distinct categories in the model. (They are distinguished from continuous variables whose numeric values are used directly in the model.) Here both Experiment and Array are class variables.

Select both Array and Experiment from the list of available variables.

Click to add Array and Experiment to the Class Variables box.

Fixed effects contain a specific set of levels that are of sole interest for comparison. They are typically the primary variables of interest in the design.

Type Experiment and Probe in the Fixed Effects box.

LSMeans effects are used to construct differences and least-squares means profiles.

Type Experiment in the LSMeans Effects box.

Random effects model correlation patterns in the data and are assumed to arise randomly from a population of observable effects. Those observations in the data that share the same level of a random effect are assumed to be correlated. Here, Array is specified as a random effect to model the correlation between probe-level data from the same array (and also from the same probe set, since Probe_Set_ID is specified as the By Variable on the first tab).

Type Array in the Random Effects box.

The completed Model tab of the ANOVA dialog appears as shown in Figure 8.31.


Figure 8.31: The completed Model tab of the ANOVA dialog

The LS Means tab allows you to specify which LS Means difference set to use for volcano plots and how those means are to be standardized.

Click the LS Means tab.

By default, all pairwise LSMeans differences are selected. In addition, STD is chosen as the default LSMeans test.

Do not make any changes to the LSMeans tab.

The Multiple Testing tab allows you to run multiple hypothesis tests across all LSMeans differences to identify a cutoff for determining significant expression differences.

Click the Multiple Testing tab.

The default test is the Bonferroni test. In this example, instead of running multiple hypothesis tests, we define this cutoff value directly using the –log10(p-value) cutoff parameter. To change the default setting, complete the following steps.

Select the blank space in the middle of the Multiple Testing Method drop-down menu.

Type 15 in the –log10(p-value) Cutoff text box.

The Residuals tab allows you to define several parameters describing how to handle the residuals from the ANOVA model fits. Residuals are statistics useful for quality control and assessment of goodness-of-fit. Selecting a Filtration Method for Data with Large Residuals allows you to set up rules to filter outliers that are statistically far from fitting the model (Chu, Weir et al. 2002).

Do not make any changes to the Residuals tab.

On the Options tab, you can choose different preferences for the output of this procedure. The only change you should make to the Options tab is to select a name for the Mixed Model Expression Index Output Data Set.


Type affylatin_mmei in the Mixed Model Expression Index Output Data Set Name field.

The completed Options tab of the ANOVA dialog appears like the one shown in Figure 8.32.

Figure 8.32: The completed Options tab of the ANOVA dialog
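As an aside on the cutoff entered on the Multiple Testing tab: a –log10(p-value) cutoff of 15 corresponds to requiring p-values below 1e-15, which is far stricter than a typical Bonferroni threshold. The Python sketch below shows the arithmetic. The gene names, p-values, alpha level, and number of tests are hypothetical, and the code illustrates the scale only, not how the ANOVA process applies the cutoff internally.

```python
import math

def neg_log10(p):
    """-log10 of a p-value; a p-value of 0 maps to infinity."""
    return math.inf if p <= 0 else -math.log10(p)

def significant(pvalues, cutoff=15.0):
    """Keep entries whose -log10(p-value) exceeds the cutoff."""
    return {name: p for name, p in pvalues.items() if neg_log10(p) > cutoff}

def bonferroni_cutoff(alpha, n_tests):
    """-log10 cutoff implied by a Bonferroni correction of alpha over n_tests."""
    return -math.log10(alpha / n_tests)

# Hypothetical p-values for three genes.
pvalues = {"gene_a": 1e-20, "gene_b": 1e-10, "gene_c": 0.03}

hits = significant(pvalues, cutoff=15.0)
print(sorted(hits))
print(round(bonferroni_cutoff(0.05, 100), 5))
```

For comparison, a Bonferroni correction at alpha = 0.05 over 100 tests implies a cutoff of roughly 3.3 on this scale, so a cutoff of 15 retains only the most extreme differences.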

Running the ANOVA process produces various graphical displays of statistical results. These graphics are all driven by JMP tables, one of which lists the significant genes and is illustrated in Figure 8.33.

Figure 8.33: A portion of the table listing differentially-expressed genes

Note that there are 17 differentially expressed genes. As expected, fourteen of these genes correspond to the transcripts that were experimentally spiked in. Two pairs of probe sets, #36202_at and #546_at, and #407 and #37777, each correspond to the same spiked-in gene. Two genes, probe #33818_at and probe #1032_at, are unexpected and warrant further investigation.

To highlight these genes, complete the following step.

Hold down the Ctrl key and click each of the selected genes, as shown in Figure 8.34.


Figure 8.34: Highlighting selected genes

Highlighting these genes in the data table allows us to visualize them in other windows as well, such as the Hierarchical Clustering window, shown in Figure 8.35.

Figure 8.35: Clustering window

Examination of the Hierarchical Clustering window reveals that the 33818_at gene, which encodes a valosin-containing protein, clusters with the interleukin receptor-like 40322_at gene. In addition, the 1032_at gene, which encodes the beta-subunit of the interleukin 8 receptor, clusters with the angiotensinogen proteinase inhibitor 684_at gene, as shown in Figure 8.36.


Figure 8.36: Clustering of differentially-expressed genes

With these genes highlighted, we can select the Action Buttons window (shown in Figure 8.37) and use the search options to further explore their relationships.

Figure 8.37: The Action Buttons window

Click Annotation Summary to open a Gene Summary HTML page (shown in Figure 8.38) providing specific links to information on each of the highlighted genes contained in various online databases.

Figure 8.38: The Gene Annotation Summary

From here, connect to public web pages for further analysis.

It turns out that the spike-in concentration of the interleukin 1 receptor-like gene (#40322_at) was 0.25 pM. A mistake in the experimental setup caused the valosin-containing protein gene (probe set #33818), which was supposed to go into group 12, to be omitted. This gave it a concentration of 0 pM, which would intuitively cluster together with a concentration of 0.25 pM.

The probe set #1032_at contains the motif 5'-GCAGCCGTTT-3'. In addition to having specificity for the interleukin 8 receptor (beta) gene, this motif also hybridizes to a similar sequence contained in the K02215 gene (target of the 684_at probe set) (Hsieh, Chu et al. 2003). Therefore, it is not surprising that the genes specified by these two probe sets cluster together.

Predictive Modeling

In conjunction with, or as an alternative to, row-by-row modeling as described previously, you might want to perform exploratory predictive modeling and/or data mining. See Chapter 10 for a description of relevant processes available through JMP Genomics.

Proteomics Spectral Preprocessing: The Prostate Cancer Example

Chapter 9

JMP Genomics offers analyses for spectrometry data, including those from mass spectrometers and nuclear magnetic resonance instrumentation. In this example, the data set was obtained by Surface-Enhanced Laser Desorption/Ionization (SELDI). This method allows an investigator to detect and resolve multiple proteins bound to protein chip arrays (Merchant and Weinberger 2000). This approach was used by Qu, et al. (2002) to discriminate prostate cancer patients from non-prostate cancer patients. The promise of this approach is that a panel of multiple biomarkers can be used to distinguish important phenotypes such as cancer status. However, great care must be taken to pre-process and analyze the data appropriately to ensure generalizability of results.

The Prostate Cancer Example

The example data set consists of serum samples collected from 165 men. 84 of the men had prostate cancer. The remaining 81 men are considered to be controls. The primary goal is to determine differences in protein expression between these groups. To examine the primary data set, complete the following steps.

Select File > Open, as shown in Figure 9.1.

Figure 9.1: Opening the data set

The Open Data File window opens.

Navigate to Sample Data > Proteomics, as shown in Figure 9.2.


Figure 9.2: Selecting the data set

Select the wright_tall_2k_10k.sas7bdat file.

Click Open to open the data set.

The wright_tall_2k_10k.sas7bdat data set, partially shown in Figure 9.3, opens.

Figure 9.3: The wright_tall_2k_10k.sas7bdat data set

The primary data set is in tall form, with mass-to-charge (m/z) values as rows and individuals as columns. As with the microarray data, there is an accompanying experimental design file that provides characteristics of the columns. To examine the experimental design for this example, open the wright_design.sas7bdat file in JMP by completing the following steps.

Select File > Open.


Navigate to Sample Data > Proteomics.

Select the wright_design.sas7bdat file.

Click Open to open the design file, as shown in Figure 9.4.

Figure 9.4: The EDF

Note that the format of this file conforms to the EDF specifications described in Chapter 3. The primary variable of interest is status, with values CCD (cancer) and NOR (normal). The Array variable provides a unique numerical indicator for each row, and ColumnName lists the names of the columns in the primary data set.

Preprocessing the Data

JMP Genomics contains a few processes to assist in basic preprocessing of spectral datasets. Running these processes before rigorous statistical analyses typically increases the reliability of these analyses.

2-Dimensional Analysis

A first step in analyzing this kind of dataset is to get a good view of the entire dataset. For two-dimensional spectral data like these SELDI data, this can be done using the 2D Plot process located under the Spectral Preprocessing menu.

Select Genomics > Spectral Preprocessing > 2D Plot, as shown in Figure 9.5.


Figure 9.5: Selecting the 2D Plot process

The Spectral 2D Plot dialog opens, as shown in Figure 9.6.

Figure 9.6: The Spectral 2D Plot dialog


To load the prostate cancer example, complete the following steps.

Click Load, as shown in Figure 9.7.

Figure 9.7: Loading the default example

Select the ProstateCancerExample and click OK, as shown in Figure 9.8.

Figure 9.8: Selecting the default example

The completed dialog appears.

Figure 9.9: The completed Spectral 2D Plot dialog

This process plots the spectra and enables comparisons between two groups, designated A and B. This example compares all of the cancer patients versus all of the non-cancer patients. The variables with CCD in their name are assigned to the A group, and those with NOR in their name are assigned to the B group. The index variable is plotted on the x-axis of the overlay plots.


Click Run to generate the overlay plots.

The overlay plots appear as shown in Figure 9.10. Note: Several additional results windows also open.

Figure 9.10: The overlay plots

This plot shows the mean values of the two groups of spectra plotted against each other (CCD is group A, in red, and NOR is group B, in green). The black spectrum along the bottom, which is indexed on the right axis, displays negative log10 p-values from t-tests between the two groups, conducted separately for each m/z value and without any adjustment for multiple testing. Peaks in the black spectrum mark m/z values at which the red and green groups differ significantly. Use the magnifying tool to select a rectangular region of interest. This shows the results in more detail and allows you to explore how and why the peaks were differentiated. It can also be useful for resolving doublet peaks. This analysis produces a rather large set of plots, so it can be informative to consider smaller sets of variables by removing variables from the Plot Variables Group boxes.

Note: To shift or scale the axes, click either the left or right vertical axis until a hand icon appears, and then drag to shift or scale the axis. Double-click an axis to change its properties. These adjustments can enhance your ability to discern differences between the spectral profiles, as can be seen in Figure 9.11.
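The per-m/z comparison behind the black spectrum can be illustrated outside JMP Genomics. The following is only a minimal sketch: it uses a normal approximation to the two-sample test rather than the exact t distribution, and the function name and the intensity values are hypothetical, not taken from the example data set.

```python
import math
from statistics import mean, variance


def neg_log10_p(group_a, group_b):
    """Approximate -log10 p-value for one m/z row (normal approximation)."""
    # Standard error of the difference in means, unequal variances allowed.
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    if se == 0.0:
        return 0.0 if mean(group_a) == mean(group_b) else float("inf")
    z = (mean(group_a) - mean(group_b)) / se
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    # Guard against underflow to zero for extremely large statistics.
    return -math.log10(max(p, 1e-300))


# Hypothetical intensities for one m/z value in each group.
ccd = [10.0, 11.0, 12.0, 13.0, 14.0]
nor = [1.0, 2.0, 3.0, 4.0, 5.0]
score = neg_log10_p(ccd, nor)
```

With 84 and 81 samples per group, the normal approximation is usually close to the t-test, but the values reported by the 2D Plot process should be treated as the reference.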


Figure 9.11: Portion of the overlay plot between m/z values of 3750 and 4050

The Overlay Plot by MZ graph (not shown) displays a similar graph of all the individual spectra. This can be useful if something of interest in the Mean values plot warrants further exploration. Since all the data are plotted on this graph, manipulating it is memory-intensive and performance may be sluggish. The Cell Plot graph (not shown) displays all the spectra in a gray scale heat map. All of the plots are driven by a single underlying table, wright_tall_2k_10k_s2g. Scrolling to the extreme right side of this table (shown in Figure 9.12) shows various computed statistics. The last column in the table is the NegLog10 PValue column.

Figure 9.12: The wright_tall_2k_10k_s2g table

Click on the column label to select the NegLog10 PValue column.

Select Tables > Subset in the JMP menu, as shown in Figure 9.13.


Figure 9.13: Selecting the Subset process

The Subset dialog opens, as shown in Figure 9.14.

Figure 9.14: The Subset dialog

Click OK to generate a subset table of this data (shown in Figure 9.15).

Figure 9.15: A subset of the wright_tall_2k_10k_s2g table

Select Analyze > Distribution from the JMP menu (Figure 9.16).


Figure 9.16: Selecting the Distribution process

The Report: Distribution dialog opens.

Select the NegLog10PValue column and click Y,Column to select this column for distribution analysis.

Figure 9.17: The completed Report: Distribution dialog

Click OK to generate the histogram of the p-values.


Figure 9.18: Histogram of the p-values of the data reported in the wright_tall_2k_10k_s2g table

This is a highly skewed distribution. The negative log10 p-values in the top quartile, those above 3.903, correspond to the interesting peaks. To select these values directly from the distribution display, click and drag a rectangle in either the histogram or the box plot windows. Then click Tables > Subset to obtain a table of the most significant peaks. Refer to the JMP User Guide for more details on generating subset tables.

Note: Results from this and all JMP Genomics processes, including a re-executable JMP script, are saved in the output folder as specified at the bottom of the General tab in the Spectral 2D Plot dialog. The default settings specify this output folder as the ProcessResults folder.
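To relate the 3.903 cutoff to an ordinary p-value, and to see what selecting the top quartile amounts to, consider the following minimal sketch. The helper names are hypothetical, and the quartile rule used by Python's statistics module may differ slightly from the one JMP uses.

```python
from statistics import quantiles


def p_from_neg_log10(value):
    """Convert a negative log10 p-value back to a p-value."""
    return 10.0 ** (-value)


def top_quartile(values):
    """Return the values that lie above the third quartile."""
    _, _, q3 = quantiles(values, n=4)
    return [v for v in values if v > q3]


# A cutoff of 3.903 corresponds to a p-value of roughly 1.25e-4.
cutoff_p = p_from_neg_log10(3.903)
```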

2D Detrend

Spectral data often contain an unwanted baseline trend that varies from spectrum to spectrum. Removing these trends is recommended to ensure comparability of the spectra. The 2D Detrend process creates a new SAS data set in the same form as the original input data set, except that baseline trends in the dataset are subtracted out for each spectrum.

Select Genomics > Spectral Preprocessing > 2D Detrend, as shown in Figure 9.19.


Figure 9.19: Selecting the 2D Detrend process

The Spectral 2D Detrend dialog opens, as shown in Figure 9.20.

Figure 9.20: The Spectral 2D Detrend dialog

Click Load to load the default example.

Select the ProstateCancerExample settings and click OK.



Figure 9.21: The completed Spectral 2D Detrend dialog

Examine the General tab of the dialog. The input data set is the same data file used previously. Each column in the input file shows as an available variable. The spectral variables are columns containing the numerical data from the spectra. The index variable is mz. This example automatically specifies the output folder and assigns a name for the output data set.

Click the Analysis tab. The Analysis tab appears, as shown.

Figure 9.22: The Analysis tab

Examine the Analysis tab. The bandwidth represents the moving m/z value width used to calculate the average baseline for subtraction from the points on the spectra. Peaks are determined using the standard cutoff of 3 above baseline.
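The role of the bandwidth can be illustrated with a simple moving-average baseline. This is a sketch under the assumption that the baseline at each point is the plain average of all intensities within half a bandwidth on either side; the algorithm actually used by the 2D Detrend process may differ, and the function name is hypothetical.

```python
def detrend(mz, intensity, bandwidth):
    """Subtract a moving-average baseline from one spectrum.

    mz and intensity are parallel lists; bandwidth is the full width,
    in m/z units, of the window centered on each point.
    """
    half = bandwidth / 2.0
    detrended = []
    for center, y in zip(mz, intensity):
        # Collect intensities whose m/z value falls inside the window.
        window = [v for x, v in zip(mz, intensity) if abs(x - center) <= half]
        baseline = sum(window) / len(window)
        detrended.append(y - baseline)
    return detrended
```

A flat spectrum is reduced to zeros, while a narrow peak that is much taller than its neighbors remains clearly positive after subtraction. A wider bandwidth gives a smoother baseline that follows peaks less closely.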

Click Run to subtract the baseline. The modified SAS data set generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 9.23).



JMP automatically adds the _dt suffix in the name of the new data set. This data set is available for subsequent analyses.

2D Bin

Spectral data sets can be quite large, and it is often useful for rapid initial exploration of the major features of the data to bin them across groups of m/z values. The 2D Bin process (not shown) performs simple binning in this fashion and reduces the total number of rows in the main data set.
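As an illustration of the idea, the following sketch groups a single spectrum into fixed-width m/z bins and averages the intensities in each bin. Fixed-width bins and averaging are assumptions made for this example; the options offered by the 2D Bin process are not described here.

```python
def bin_spectrum(mz, intensity, bin_width):
    """Average intensities within consecutive m/z bins of a fixed width.

    Returns a list of (bin_start, mean_intensity) pairs in m/z order.
    """
    bins = {}
    for x, y in zip(mz, intensity):
        key = int(x // bin_width)  # index of the bin containing x
        bins.setdefault(key, []).append(y)
    return [(key * bin_width, sum(v) / len(v)) for key, v in sorted(bins.items())]
```

Four rows at m/z 2000 through 2003 with a bin width of 2 collapse to two rows, which shows how the row count of a tall data set shrinks.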

2D Peak Find

Another way to reduce the size of spectral data is to compute peak locations and their heights or areas. The 2D Peak Find process executes a basic peak-finding algorithm based on a specified number of peaks to be found.

Select Genomics > Spectral Preprocessing > 2D Peak Find. The Spectral 2D Peak Find dialog opens, as shown in Figure 9.24.

Figure 9.24: The Spectral 2D Peak Find dialog

Click Load to load the default example.

Select the ProstateCancerExample settings and click OK.



Figure 9.25: The completed Spectral 2D Peak Find dialog

Examine the General tab of the dialog. Note the automatic specification of the wright_tall_2k_10k.sas7bdat input data file. The x-axis and spectral variables have been specified, as discussed previously. The output folder has been specified by default.

Click on the Noise Estimation tab. The Noise Estimation tab appears, as shown in Figure 9.26.

Figure 9.26: The completed Noise Estimation tab of the Spectral 2D Peak Find dialog

The x-axis value intervals for noise are 2000-2500 and 19500-20000. These regions of the 2-D spectra (Figure 9.10) appear to result from pure noise.

Click the Options tab. The Options tab appears, as shown in Figure 9.27.


Figure 9.27: The completed Options tab of the Spectral 2D Peak Find dialog

Examine the Options tab. Note that the maximum number of peaks is set to 100 by default. Change this number depending upon the resolution of the data.
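The interplay between the noise intervals and the maximum number of peaks can be sketched as follows. This is not the SAS/IML algorithm used by the process. It assumes that noise is summarized by the mean and standard deviation of the intensities inside the noise intervals, that a peak is a local maximum exceeding the noise mean by n_sd standard deviations, and that the tallest peaks are kept. The function name and the n_sd parameter are hypothetical.

```python
from statistics import mean, stdev


def find_peaks(mz, intensity, noise_intervals, max_peaks=100, n_sd=3.0):
    """Return up to max_peaks (mz, height) pairs, sorted by m/z."""
    # Estimate the noise level from the user-specified x-axis intervals.
    noise = [y for x, y in zip(mz, intensity)
             if any(lo <= x <= hi for lo, hi in noise_intervals)]
    threshold = mean(noise) + n_sd * stdev(noise)
    # A candidate peak is an interior local maximum above the threshold.
    peaks = [(mz[i], intensity[i]) for i in range(1, len(mz) - 1)
             if intensity[i] > threshold
             and intensity[i] > intensity[i - 1]
             and intensity[i] >= intensity[i + 1]]
    # Keep the tallest peaks, then report them in m/z order.
    peaks.sort(key=lambda peak: peak[1], reverse=True)
    return sorted(peaks[:max_peaks])
```

Lowering max_peaks keeps only the most prominent features, which is why the setting should reflect the resolution of the data.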

Click Run to find the peaks. This process invokes SAS/IML and may take several minutes to run for large data sets. Upon completion, several graphs are produced showing various statistics about the peaks, as shown in Figure 9.28.

Figure 9.28: The Peak Finding Statistics plots

The peak-finding process also generates two different output data sets that are listed in a SAS Message dialog as shown in Figure 9.29.



The wright_tall_2k_10K_s2p_det.sas7bdat data set contains peak details that are useful for further exploration. The wright_tall_2k_10k_s2p.sas7bdat data set is useful for subsequent analyses.

Proteomics Data Quality Control and Normalization

After pre-processing, run statistical quality control and normalization processes on the spectral data. These capabilities are available in the Quality Control and Normalization submenus of the main Genomics menu. Refer to Chapters 7 and 8 for demonstrations of different Quality Control processes of microarray data.

Note: After appropriate pre-processing, from a statistical perspective, protein or metabolite expression data is similar to gene expression data. Many of the processes used for analysis of microarray data are, therefore, applicable to proteomic analyses.

Proteomics Pattern Discovery and Row-by-Row Modeling

As with Quality Control and Normalization, the processes available under the Pattern Discovery and Row-by-Row Modeling submenus are useful for protein or metabolite expression. These processes are illustrated with microarray data in Chapters 7 and 8 and are not shown here.

Preparing Data for Predictive Modeling

Often the goal of a proteomics study is to find a model for prediction of a categorical or continuous characteristic of the samples. Several processes are available for this in the Predictive Modeling submenu. These processes are described in Chapter 10. Before running these processes, the data must be transposed into wide form.

Transpose Tall and Wide

Select Genomics > Data Set Utilities > Transpose Tall and Wide. The Data Transpose dialog opens. To select the wright_tall_2k_10k_s2p.sas7bdat file that was generated previously as the input data set, complete the following steps.

Click Choose.



Select the wright_tall_2k_10k_s2p.sas7bdat file and click Open. To select the Experimental Design Data Set, complete the following steps.

Click Choose.

Navigate to Sample Data > Proteomics.

Select the wright_design.sas7bdat file and click Open. To select the output folder, complete the following steps.

Click Choose.


Click Select to specify the output folder. The completed Transpose Tall and Wide dialog appears, as shown in Figure 9.30.

Figure 9.30: The completed Transpose Tall and Wide dialog

Click Run to transpose the data.

The transposed SAS data set generated by this process is listed in a SAS Message dialog that is displayed in a new window (shown in Figure 9.31).



Click Open to examine the transposed data set (Figure 9.32).

Figure 9.32: The transposed data set

The transposed data set has individuals as rows and both experimental design variables and peaks as columns. Note the “_wid” suffix on the end of the name of the data set. The wright_tall_2k_10k_s2p_wid.sas7bdat data set can be used as the input data set for the Predictive Modeling processes described in Chapter 10.
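The transposition can be pictured with a small example. The sketch below assumes a tall table stored as a list of dictionaries (one per m/z value) and a design table whose ColumnName field names the corresponding column of the tall table, as in the EDF described earlier. The function name, the sample column names, and the naming of the output columns are hypothetical.

```python
def tall_to_wide(tall_rows, design_rows, index_name="mz"):
    """Build one wide row per sample: design variables plus one column per index value."""
    wide = []
    for design in design_rows:
        row = dict(design)               # start with the design variables
        column = design["ColumnName"]    # matching column in the tall table
        for tall in tall_rows:
            row[f"{index_name}{tall[index_name]}"] = tall[column]
        wide.append(row)
    return wide


tall = [
    {"mz": 2000, "CCD_1": 1.0, "NOR_1": 2.0},
    {"mz": 2001, "CCD_1": 3.0, "NOR_1": 4.0},
]
design = [
    {"Array": 1, "ColumnName": "CCD_1", "status": "CCD"},
    {"Array": 2, "ColumnName": "NOR_1", "status": "NOR"},
]
wide = tall_to_wide(tall, design)
```

Each wide row now carries the design variables alongside one column per peak, which is the layout the Predictive Modeling processes expect.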

Predictive Modeling

Chapter 10

The primary focus of JMP Genomics is scientific discovery and understanding through statistics and graphics. However, the software does offer some basic capabilities for creating predictive models. You can construct predictors of either continuous or categorical outcomes using data from genetic markers, microarrays, or proteomics as predictor variables. These processes, which include Discriminant Analysis, Distance Scoring, General Linear Model Selection, K Nearest Neighbors, Logistic Regression, Partial Least Squares, Partition Trees, Radial Basis Machine, and Binary Response Effect Selection, are grouped under the Predictive Modeling submenu, as shown in Figure 10.1. Additional processes (Binary Response Effect Selection, Cross Validation Model Comparison, and Test Set Model Comparison) help you to select the most appropriate model for your data.

Figure 10.1: The Predictive Modeling submenu

Predictive modeling is also known as exploratory modeling or data mining. This chapter discusses the JMP Genomics functions that target exploratory and basic data mining for genomics data. For advanced, enterprise-scale data mining, SAS Enterprise Miner software offers a full spectrum of methods and a convenient, workflow-style interface. After the genomics data has been appropriately preprocessed and stored as a wide SAS data set, one or more of the processes, described in this chapter, can be run to perform exploratory data mining. The same data set can also be used with Enterprise Miner to obtain more rigorous results and scoring rules.


Data Sets

All of the processes described in this chapter require the data to be in wide format, with individual samples as rows and experimental design variables, phenotypes, genetic markers, transcripts, and/or peptides as columns. Genetic marker data is likely already in this form, but any microarray or proteomics data that are in tall form must be converted to the wide format. Use the Transpose Tall and Wide command to convert the tall data set and its accompanying experimental design data set to wide form. See Chapter 4 for detailed instructions on transposing the data set.

With multiple tables containing different forms of data on a set of samples (for example, both genetic marker and microarray data), merge them into one single wide data set using the Data Set Utilities > Merge command, as described in Chapter 4. These data can then be used together to build jointly predictive models. We recommend that you preprocess and analyze the different data types separately and then combine them just prior to predictive modeling.

For large data sets with tens or hundreds of thousands of predictors, computing time for some of the JMP Genomics predictive modeling processes can become prohibitively long. In this situation, perform a preliminary reduction of the predictor set by using the Pattern Discovery > K-Means Clustering process to select a thousand or so representative predictors. (The data must be in tall form to execute this process. Use the Transpose Tall and Wide AP to go back and forth between tall and wide forms.)

When performing variable selection/reduction with an entire data set, it is important to realize that an optimistic bias can be introduced in subsequent analyses. To compensate for this, hold out a fraction of the data from the beginning and use it for subsequent prediction. Many of the processes have built-in cross-validation capabilities to help prevent selection bias. Alternatively, cross validation can be done manually by creating one or more new columns that are copies of the variables being predicted and then setting subsets of them to missing values. While the ultimate test of generalizability of any predictive model is with new data from an independent laboratory, computer-based cross-validation is invaluable in assessing initial performance of the models.
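The manual approach of copying the dependent variable and blanking out a subset can be sketched as follows. The function name, the hold-out fraction, and the use of None to stand for a missing value are assumptions made for illustration.

```python
import random


def holdout_copy(values, fraction, seed=0):
    """Copy a dependent variable, setting a random fraction of entries to missing.

    Returns the masked copy and the sorted row indices that were held out.
    """
    rng = random.Random(seed)            # fixed seed for a reproducible split
    n_hold = round(len(values) * fraction)
    held = set(rng.sample(range(len(values)), n_hold))
    masked = [None if i in held else v for i, v in enumerate(values)]
    return masked, sorted(held)


status = ["CCD"] * 5 + ["NOR"] * 5
masked_status, held_rows = holdout_copy(status, 0.2)
```

A model fit with the masked column as the dependent variable never sees the held-out rows, so its predictions for those rows can be compared with the original values.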

Predictive Modeling Processes

The proteomics prostate cancer example, described in Chapters 1 and 9, is used to illustrate several of the predictive modeling processes available from JMP Genomics.

Discriminant Analysis

Discriminant Analysis is a traditional method for classifying a categorical variable from a set of continuous responses. To run this process, complete the following steps.

Select Genomics > Predictive Modeling > Discriminant Analysis. The Discriminant Analysis dialog opens, as shown in Figure 10.2.


Figure 10.2: The Discriminant Analysis dialog

Click Load.

Select the ProstateCancerExample settings.

The completed General tab of the Discriminant Analysis dialog appears, as shown in Figure 10.3.


Figure 10.3: The completed General tab of the Discriminant Analysis dialog

Click Open to examine the wright_wide_2k_10k_dt_sig6.sas7bdat input data set.

The input data set contains data from 165 men, 84 men with prostate cancer and 81 cancer-free men considered as controls. Samples are listed in rows, while the responses from a set of mass spectrometry peaks are listed in columns beginning with mz. Note this data set is in the wide format.

Examine the completed General tab. The Dependent Class Variable is the variable to be predicted; in this case, status indicates whether or not the individual is likely to develop cancer. In this example, individuals with cancer are identified as CCD, while members of the control group are identified as NOR. A discriminant prediction model can be built from two types of predictor variables, Continuous and Class.

o Predictor continuous variables must be numeric and their numeric values are used directly as predictors as in linear regression.

o Predictor class variables can be numeric or categorical. Their unique values are used to form a set of columns with 0s and 1s indicating class level.
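The 0/1 coding of a class variable can be illustrated with a short sketch. The function name is hypothetical, and the ordering of the generated columns is an assumption.

```python
def indicator_columns(values):
    """Map each unique level of a class variable to a column of 0s and 1s."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


columns = indicator_columns(["CCD", "NOR", "CCD"])
```

Each level receives its own column, with a 1 in every row that belongs to that level and a 0 elsewhere.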

With a large number of predictor variables, it is often more convenient and advisable to use the List-Style specifications rather than selecting and moving variable names to the boxes on the right. For the List-Style specifications, you can use SAS syntax to indicate a range of variables, for example, x1-x12345 specifies the variables x1, x2, x3, …, x12345. For this example, you could clear the Predictor Continuous Variables field and instead specify mz: in the List Style Specification of Predictor Continuous Variables field. This specification is a shorthand syntax that indicates all variables beginning with mz. The variable sample is specified as the Label Variable. The values listed in this variable are used to create labels in the output JMP table and plots.
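The two list-style forms mentioned above, a numbered range such as x1-x12345 and a prefix such as mz:, can be pictured as simple expansions. The sketch below is only an illustration of what these shorthand forms denote. It is not how SAS resolves variable lists, it ignores the fact that SAS names are not case sensitive, and the function name is hypothetical.

```python
import re


def expand_spec(spec, available):
    """Expand a list-style specification into explicit variable names."""
    if spec.endswith(":"):
        # Prefix form: every available variable that begins with the prefix.
        prefix = spec[:-1]
        return [name for name in available if name.startswith(prefix)]
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)-([A-Za-z_]+)(\d+)", spec)
    if match and match.group(1) == match.group(3):
        # Numbered range form: x1-x4 means x1, x2, x3, x4.
        first, last = int(match.group(2)), int(match.group(4))
        return [f"{match.group(1)}{i}" for i in range(first, last + 1)]
    return [spec]  # a single variable name


names = ["sample", "status", "mz2000", "mz2001"]
predictors = expand_spec("mz:", names)
```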


The Predictor Reduction tabs allow you to trim down the number of predictor variables used before modeling, eliminating redundant variables and, potentially, increasing the speed of execution.

Click Predictor Reduction 1.

Examine the Predictor Reduction 1 tab.

Do not make any changes to the Predictor Reduction tab.

Click Predictor Reduction 2.

Examine the Predictor Reduction 2 tab.

Do not make any changes to the Predictor Reduction tab. The Analysis tab allows you to select and adjust specific analysis parameters.

Click Analysis.

Do not make any changes to the Analysis tab.

The Genetic Algorithm tab allows you to input the algorithm used to complete the analysis.

Click Genetic Algorithm.

Do not make any changes to the Genetic Algorithm tab. The Options tab allows you to specify the output of the discriminant analysis.



Click Run to launch the JMP Discriminant platform, as shown in Figure 10.4.


Figure 10.4: The JMP Discriminant platform

The JMP Discriminant platform can be used to interactively select a set of predictors for the discriminant model. The Step Forward and Step Backward commands force JMP to select, in a stepwise manner, the predictors according to statistical significance.

Click Step Forward to add the most significant of the non-selected variables to the list of predictors.

Click Step Backward to remove the least significant variable from the selected predictors.

Alternatively, specific predictors can be selected manually by checking the corresponding boxes in the Entered column.

Click Step Forward five times to select the five most significant variables.

Click Apply This Model to obtain the display of the results shown in Figure 10.5.

Figure 10.5: Model derived from the first five variables

Note that 11 of the 165 samples are misclassified with this model. To further refine this model, complete the following steps.

Select Stepwise Variable Selection from the drop-down menu in the Discriminant Analysis box, as shown in Figure 10.6, to return to the JMP Discriminant platform shown in Figure 10.4.

Figure 10.6: Selecting the Stepwise Variable Selection

Select additional predictors or deselect inappropriate predictors, as warranted by your scientific objectives.

Click Apply This Model to obtain the display of the new results (not shown).


Additionally, refer to the JMP Statistics and Graphics Guide for details on the output displays and further analyses.

General Linear Model Selection

The General Linear Model Selection process performs predictor variable selections in the framework of general linear models for a continuous dependent variable. A variety of model selection methods are available, including forward, backward, stepwise, lasso, and least-angle regression. This process offers a wide variety of selection and stopping methods, from traditional and computationally efficient significance-level-based criteria to more computationally intensive validation-based rules. It also provides graphical summaries of the selection search. It calls the experimental PROC GLMSELECT from SAS/STAT.

Select Genomics > Predictive Modeling > General Linear Model Selection. The GLM Select dialog opens, as shown in Figure 10.7.

Figure 10.7: The GLM Select dialog

Click Load.


The completed General tab of the GLM Select dialog appears as shown in Figure 10.8.


Figure 10.8: The completed General tab of the GLM Select dialog

The input data set was described previously in the example for the Discriminant Analysis process. The Dependent Variable is the variable to be predicted; in this case, status indicates whether or not the individual is likely to develop cancer. In this example, individuals with cancer are identified with a 1, while members of the control group are labeled with a 0. All of the candidate predictor variables have names that begin with mz, so they are specified using the List-Style specification. In this case, mz: has been entered. The colon indicates that all variables with the common prefix mz are to be considered. Because they are all continuous, no Predictor Class Variables are specified. No label variables are used, and because each observation represents a single individual, no Weight variables are specified. The Predictor Reduction 1, Predictor Reduction 2, Analysis, Genetic Algorithm, and Options tabs on the GLM Select dialog are similar to those described for the Discriminant Analysis AP. You should examine the default settings for each tab.

Click Run to run the GLM Select process.


The resulting output (available in either plain text or HTML) describes the details and results of the general linear model selection process. This example uses a stepwise model selection with entry and stay significance levels of 0.01. In addition to an overall mean value (Intercept), the twelve mass-over-charge values listed in the output Parameter Estimates table (shown in Figure 10.9) are selected as predictive.

Figure 10.9: Output of the GLM Select process

These can be considered to be initial candidate prostate cancer biomarkers, and provide starting points for more extensive computational and experimental cross-validation. For full documentation and details on the underlying options and methods available with this process, complete the following step.

Select Help > JMP Genomics Web Links > The GLMSELECT Procedure Documentation.

K Nearest Neighbors

The K Nearest Neighbors process is very similar to the Discriminant Analysis process, but it employs a nonparametric method based on neighboring averages to perform predictions. Example output is not shown.

Logistic Regression

Logistic regression is another classic method used to predict the probability of a response being in a particular categorical class. It models this probability using a link function that transforms a linear function of the predictor variables to a probability scale.
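For the commonly used logit link, the transformation from a linear function of the predictors to a probability can be sketched as follows. The function names and the coefficient values are hypothetical; in practice the coefficients are estimated by the Logistic Regression process.

```python
import math


def inverse_logit(eta):
    """Map a linear predictor to a probability, avoiding overflow for large |eta|."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def predicted_probability(intercept, coefficients, predictors):
    """Probability of the modeled class for one sample under a logit link."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return inverse_logit(eta)


# Hypothetical fitted values for a model with two mz predictors.
probability = predicted_probability(-1.0, [0.5, 0.25], [4.0, 0.0])
```

A sample is typically assigned to the modeled class when this probability exceeds 0.5, which corresponds to a positive linear predictor.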

Select Genomics > Predictive Modeling > Logistic Regression. The Logistic Regression dialog opens, as shown in Figure 10.10.


Figure 10.10: The Logistic Regression dialog

Note that the structure of this dialog is very similar to the Discriminant Analysis dialog shown in Figure 10.3. Variables are specified in the same way as described for Discriminant Analysis.

Click Load.

Select the ProstateCancerExample settings. The completed General tab of the Logistic Regression dialog appears as shown in Figure 10.11.


Figure 10.11: The completed General tab of the Logistic Regression dialog

Unlike the previous Discriminant Analysis example, predictors are selected in an automated stepwise fashion. Running the example with the default settings invokes SAS PROC LOGISTIC and produces a SAS Output window, a JMP table, and the Logistic Regression Results window shown in Figure 10.12.


Figure 10.12: Results of the Logistic Regression

The Distributions panel shows the distributions of the original samples and the number of correctly classified observations. The Contingency Analysis panel provides a further breakdown of the results. For this run, a total of 9 of the 165 samples are misclassified. To select rows in the corresponding JMP table, click on bars in the histograms or cells in the mosaic plot.

Partial Least Squares

Partial least squares (PLS) is a technique popular in chemometrics. It is different from discriminant and logistic regression methods in that it uses all of the predictor variables at one time. It can be viewed as a supervised principal components analysis, in that it constructs linear combinations of the predictor variables that maximize covariability with the dependent response variables.

Select Genomics > Predictive Modeling > Partial Least Squares. The Partial Least Squares dialog opens, as shown in Figure 10.13.


Figure 10.13: The Partial Least Squares dialog

Click Load.


The completed General tab of the Partial Least Squares dialog appears as shown in Figure 10.14.


Figure 10.14: The completed General tab of the Partial Least Squares dialog

Variables are specified as previously described for Discriminant Analysis, with the addition of a Color Variable that is used to color the JMP plots. On the Analysis tab, note that three partial least squares components are specified. As with principal components, this number can be changed.

Click Run to perform the partial least squares analysis. Several results windows open. Figure 10.15 shows the tabular SAS output.

Figure 10.15: The tabular SAS Output window


The Model Effects columns show that 88.4% of the variability of the mz predictor variables is explained by the three PLS components, whereas the Dependent Variables columns show that 72.4% of the variability of cancer status is explained. The output also contains both 2D and 3D plots of the multivariate scores (Figures 10.16 and 10.17) from the partial least squares analysis.

Figure 10.16: 2D Plots of row multivariate scores from the partial least squares analysis


Figure 10.17: 3D Plots of row scores (left) and column scores (right) from the partial least squares analysis

In both plots, the cancer samples are colored red and the normal samples are colored blue. The cancer samples are fairly well separated from the control samples and are more heterogeneous. PLS provides a good initial indication of the difficulty in discriminating groups. PLS can be more difficult to interpret than results from other processes because the prediction is a linear combination of all the variables.

Partition Trees

Partition trees provide an intuitive way to hierarchically split data in a way that best predicts a response.
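The idea of choosing the split that best predicts a response can be sketched for a single predictor. This example scores candidate cut points with the Gini impurity, which is an assumption made for illustration; the JMP Partition platform uses its own splitting criterion, and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return (cut, weighted_impurity) for the best split 'value < cut'.

    Returns None when the predictor has only one distinct value.
    """
    best = None
    for cut in sorted(set(values))[1:]:
        left = [l for v, l in zip(values, labels) if v < cut]
        right = [l for v, l in zip(values, labels) if v >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (cut, score)
    return best


split = best_split([1, 2, 3, 4], ["NOR", "NOR", "CCD", "CCD"])
```

A tree is grown by repeating this search over all predictors and then again within each resulting branch.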

Select Genomics > Predictive Modeling > Partition Trees. The Partition Trees dialog opens, as shown in Figure 10.18.


Figure 10.18: The Partition Trees dialog

Click Load.


The completed General tab of the Partition Trees dialog appears as shown in Figure 10.19.


Figure 10.19: The completed Partition Trees dialog

The Predictor Reduction 1, Predictor Reduction 2, Analysis, and Options tabs on the Partition Trees dialog are similar to those described for the Discriminant Analysis AP. You should examine the default settings for each tab. There is an additional Pruning tab.

Do not make any changes to the Predictor Reduction or Options tabs.

Click Analysis.

Examine the Mode.

Partition tree analysis can be carried out either in automated mode, in which partitions are generated using SAS programming, or in interactive mode, in which you create the partition tree yourself. By default, the mode is set to Automated. You can run the example using this mode or change the mode to Interactive.

Click Interactive to change the mode.

Click Run to launch the JMP Partition platform shown in Figure 10.20.


Figure 10.20: The JMP Partition platform

You can use this platform to interactively create a partition tree.

Click Split to generate a new branch on the tree.

Click Prune to remove the last branch added.

Clicking Split three times produces the tree shown in Figure 10.21.


Figure 10.21: The resulting partition tree

Refer to the JMP Statistics and Graphics Guide for details on how to use and interpret results.
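To make the idea of a split concrete, the following Python sketch shows one common way to choose a single split: try each candidate threshold on one predictor and keep the threshold with the lowest weighted Gini impurity. This is an illustration only. It is not the algorithm used by the JMP Partition platform, and the function names and toy values are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split of one predictor.

    Rows with value < threshold go left; all other rows go right.
    """
    n = len(values)
    best = (None, gini(labels))  # start from the impurity of the unsplit data
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v < threshold]
        right = [lab for v, lab in zip(values, labels) if v >= threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Hypothetical values for one predictor and the status of each sample.
intensity = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
status = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]
threshold, impurity = best_split(intensity, status)
```

Repeating the same search within each resulting branch corresponds loosely to clicking Split again.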

Chapter 11

Annotation Analysis

The Annotation Analysis submenu provides a set of bioinformatic tools that can help scientists incorporate biological meaning with their statistical results. Users can access these tools through the JMP Genomics main menu as shown in Figure 11.1.

Figure 11.1: The Annotation Analysis submenu

Processes available under the Annotation Analysis submenu include:

• Create 0-1 Indicator for Selected Rows creates a new column in the active JMP table whose value equals 1 for all rows that are selected and 0 for those that are not selected. Such columns are needed for the subsequent Column Enrichment process.

• Venn Diagram is a JMP Scripting Language (JSL) script that allows you to examine and compare up to five variables in a data set using Venn diagrams to explore their similarities and differences and to identify observations of special interest. Refer to the JMP Genomics User Guide – Supplement for more details on this process.

• Create Web Link enables easy access to gene information, protein information, pathway information, and so on, that is stored in various biomedical databases such as GenBank, Gene, PubMed, KEGG pathway, and Genome Map by creating a web link report based on your input annotation table.

• IPA Upload uploads statistical results directly from JMP Genomics to Ingenuity Pathway Analysis software. It creates an HTML form with a button that launches Ingenuity’s multiple observations analysis platform.

• KEGG Pathway Search searches the KEGG pathway database, enabling identification of the molecular interactions, reaction networks, and functions that are relevant to genes of interest.

• KEGG Pathway Color colors KEGG pathways with statistical results, enabling visualization and interpretation of these results in the context of pathways and biological systems.

• UCSC Genome Browser Link creates an HTML table with links to the UCSC Genome Browser based on locations, gene names, or other parameters. This AP allows users to create a custom track for upload to the UCSC Genome Browser by specifying a quantitative variable of interest from an analysis performed in JMP Genomics. Refer to the JMP Genomics User Guide – Supplement for more details on this process.

• Affymetrix > Integrated Genome Browser creates a table with embedded hyperlinks to chromosomal features or locations within the Affymetrix Integrated Genome Browser. Refer to the JMP Genomics User Guide – Supplement for more details on this process.

• Affymetrix > Download NetAffx Files allows you to search for and retrieve annotation, library, map, or other accessory files used with Affymetrix arrays. These files, which are associated with different microarrays produced by Affymetrix, are often required for data analysis. Refer to the JMP Genomics User Guide – Supplement for more details on this process.

• Column Enrichment performs an enrichment analysis by comparing a binary significance column with a set of annotation categories to construct a set of unique categories, based on the annotation, and assigns individual genes to those categories.

• List Enrichment compares a set of curated lists (such as genes, proteins, or metabolites) against a table of significance values and then tests for significant enrichment using Fisher's exact test for association.

• Configure Proxy Settings resets the proxy server name and port number in the genomics.config file. If your computer accesses the Internet through a proxy server, you must specify the proxy server name and port number before JMP Genomics can access the Internet. If your computer does not access the Internet through a proxy server, do not change the default settings.
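The 0-1 indicator column described in the first item of this list is simple to picture outside JMP. The Python sketch below mimics the idea on a list of rows: selected rows receive 1 and all other rows receive 0. The function name, the column name Indicator, and the probe names are hypothetical; the actual process works on the active JMP table.

```python
def add_indicator(rows, selected_indices, column="Indicator"):
    """Return copies of the rows with a 0-1 column marking the selected rows."""
    selected = set(selected_indices)
    return [
        {**row, column: 1 if index in selected else 0}
        for index, row in enumerate(rows)
    ]

# Hypothetical table of three probe sets; rows 0 and 2 are "selected".
table = [
    {"Probe_Set_ID": "probe_a"},
    {"Probe_Set_ID": "probe_b"},
    {"Probe_Set_ID": "probe_c"},
]
flagged = add_indicator(table, selected_indices=[0, 2])
indicator_values = [row["Indicator"] for row in flagged]
```

A column of this kind is the sort of binary significance column that the Column Enrichment process expects.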

Annotation Data Sets

An Annotation Data Set contains biological or chemical information and properties about genes, SNPs, probes, probe sets, or peptides. This annotation information comes from various online bioinformatic resources, including government agencies, academic organizations and commercial entities. It is used to create a custom Annotation Data Set for your analysis.

The structure of the Annotation Data Set for JMP Genomics’ genetics processes differs from that of the microarray and proteomics processes. For genetics, each row in the Annotation Data Set represents a marker or SNP used in the analysis, with variables typically containing the following information: a name or identifier for each marker, the chromosome or candidate gene on which it is located, its location (in terms of kilobases or centiMorgans, for example), and an accession number that can be used to retrieve more information about the locus from a publicly available on-line database. Use this data set in the Create Web Link process to combine web links to the appropriate databases for all the markers into a single report as shown in an example later in this chapter. This data set can also be specified on the Annotation tab found on most of the process dialogs, where the columns can be assigned to various roles:


• Annotation Label Variable: the name or ID variable that is used to label markers in the output

• Annotation Group Variable: the variable, such as chromosome, that can be used to group the analyses and output

• Annotation Location Variable: the variable containing marker locations to be used to accurately represent distances between markers in p-value plots

• Accession Number Variable: the variable containing, for example, the GenBank accession number or dbSNP reference sequence ID, to be used to create buttons on p-value plots that provide direct access to the website for the selected marker from the appropriate on-line database

This tab also allows conditional inclusion of markers in your analysis based on particular values of variables from the Annotation Data Set. The criteria can be entered in the Annotation Where Clause in accordance with SAS syntax for WHERE statements.

For the microarray and proteomics processes, the Annotation Data Set must contain a merge key variable whose values exactly match those of some variable in a tall data set.

The structure of an Annotation Data Set can vary depending on the application and source(s) of the data. Table 11.1 lists information commonly contained in an Annotation Data Set. Keep in mind that different providers might name annotation information differently.
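Because the merge key values must match exactly, it can help to check for mismatches before running a process. The Python sketch below is one way to do that outside JMP Genomics, assuming both data sets have been exported as lists of rows. The function name and the key values are hypothetical.

```python
def unmatched_keys(annotation_rows, tall_rows, key):
    """Return merge key values in the tall data with no exact match in the
    annotation data. The comparison is case- and whitespace-sensitive."""
    annotation_keys = {row[key] for row in annotation_rows}
    return sorted({row[key] for row in tall_rows} - annotation_keys)

# Hypothetical key values: one differs in case, one has a trailing space.
annotation = [{"Probe_Set_ID": "1000_at"}, {"Probe_Set_ID": "1001_at"}]
tall = [
    {"Probe_Set_ID": "1000_at"},
    {"Probe_Set_ID": "1001_AT"},
    {"Probe_Set_ID": "1002_at "},
]
missing = unmatched_keys(annotation, tall, "Probe_Set_ID")
```

An empty result means that every key in the tall data has an exact counterpart in the annotation data.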

Table 11.1: Types of information commonly found in an Annotation Data Set

Items or Properties | Description

Probe or Probe Set ID | A unique identifier given to a probe or probe set in a probe array or microarray

GenBank Accession Number | An Accession Number is a unique identifier given to a biological polymer sequence (such as DNA or a protein) when it is submitted to a sequence database (GenBank, EMBL, DDBJ).

UniGene Cluster ID | A unique identifier given to a cluster of sequences in UniGene

Gene ID | A unique identifier assigned to a gene record in Entrez Gene. It is an integer and is species specific. For genomes that had been represented in LocusLink, the Gene ID is the same as the Locus ID.

Gene Symbol | A short-form abbreviation or symbol assigned to a gene by species-specific nomenclature committees. Each symbol is unique and each gene is only given one approved gene symbol.

Description | Description about a gene, probe, or probe set

Chromosomal Location | The physical location of a gene or a sequence on a chromosome

Ensembl ID | A unique identifier assigned to a sequence in Ensembl

Swiss-Prot Id | A unique identifier assigned to a protein sequence in Swiss-Prot, a curated protein sequence database that provides a high level of annotation (such as the description of protein function, domain structures, post-translational modifications, variants, etc.), a minimal level of redundancy, and significant integration with other databases

EC number | A number assigned to an enzyme according to a scheme of standardized enzyme nomenclature developed by the Enzyme Commission of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). The EC number is a unique identifier in ENZYME, the Enzyme nomenclature database, maintained at the ExPASy molecular biology server.

OMIM ID | A unique identifier assigned to a genetic disorder in the Online Mendelian Inheritance in Man. OMIM is a directory of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases.

dbSNP ID | A unique identifier assigned to a single nucleotide polymorphism when it is submitted to the SNP database. Also known as a 'rs' ID.

RefSeq Accession | A unique identifier given to a sequence in the NCBI RefSeq database. The RefSeq database is a curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, and entire chromosomes.

Gene Ontology ID | A unique alphanumerical identifier given to a GO term.

Genomic Location/Coordinate | A location assigned to a gene or a sequence at both the chromosome and sequence-levels
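Several of the identifiers in Table 11.1 follow recognizable text patterns, which makes it possible to spot values that have landed in the wrong column. The Python sketch below uses deliberately simplified patterns (an rs prefix followed by digits for dbSNP, GO: followed by seven digits for Gene Ontology, four dot-separated numbers for EC, and a plain integer for Gene ID). These patterns are assumptions for illustration, not provider specifications, and the names used are hypothetical.

```python
import re

# Simplified patterns; assumptions for illustration, not provider specifications.
ID_PATTERNS = {
    "dbSNP ID": re.compile(r"rs\d+"),
    "Gene Ontology ID": re.compile(r"GO:\d{7}"),
    "EC number": re.compile(r"\d+\.\d+\.\d+\.\d+"),
    "Gene ID": re.compile(r"\d+"),
}

def matches_type(value, id_type):
    """Return True if the whole value fits the simplified pattern for id_type."""
    return ID_PATTERNS[id_type].fullmatch(value.strip()) is not None

# Example checks on illustrative values.
looks_like_go = matches_type("GO:0006468", "Gene Ontology ID")
looks_like_gene_id = matches_type("5787", "Gene ID")
```

Real annotation files can contain partially specified or legacy identifiers, so a failed match is a prompt to inspect the value rather than proof of an error.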

Raw annotation data can come in a variety of formats, including tab-delimited (.txt), comma-separated (.csv), and Excel (.xls) files. You can open any of these file formats in JMP; however, before an Annotation Data Set can be used in JMP Genomics processes, you must first save it as a SAS data set, with the suffix .sas7bdat. The Genomics > Data Set Creation > Import Individual Text, CSV, or Excel File process can also transform an annotation file into a SAS Annotation Data Set (.sas7bdat). When combining data from multiple sources, you can use the Tables > Join process in JMP to join two JMP tables into one, or the Genomics > Data Set Utilities > Data Merge process to join two SAS data sets. The following example demonstrates how to generate an input annotation data set in the required format.

Annotation Data Set Creation

This example generates an annotation data set for the Affymetrix Latin Square example data described in Chapter 1. The GeneChip® expression array used in the Latin Square experiment is the Human Genome U95 array, described in Chapter 8. The workflow for this process is as follows:

1. Create a separate directory for storing data and files.
2. Download the annotation file from the Affymetrix website, unzip it, and save the file in the directory you created.
3. Use the JMP Genomics data import function to generate the SAS data set.

Create a Separate Directory


Create a new folder.

Name the folder AnnotationData.

This folder is used for storing data and files.

Download the Annotation File

Go to the Affymetrix web site and browse to the technical support documentation for the Human Genome U95 Set.

At the time of printing, the URL for this web page is http://www.affymetrix.com/support/technical/byproduct.affx?product=hgu95.



Select the HG_U95Av2 Annotations, CSV format link (circled in Figure 11.2) from the list of annotation files.

Figure 11.2: Annotation files available from Affymetrix

Click the link to begin the download process. The File Download window, shown in Figure 11.3, opens.

Figure 11.3: The File Download window

Click Save (circled in Figure 11.3) to bring up the Save As window shown in Figure 11.4.


Figure 11.4: The Save As window

Navigate into the AnnotationData folder you just created and click Save (circled in Figure 11.4) to download the annotation file.

Unzip the downloaded HG_U95Av2.na23.annot.csv.zip file. The file opens in Excel, as shown in Figure 11.5.

Figure 11.5: A portion of the HG_U95Av2.na21.annot.csv file

Copy the following columns into a new Excel workbook:

Probe Set ID, Representative Public ID, UniGene ID, Gene Title, Chromosomal Location, Ensembl, Entrez Gene, SwissProt, EC, OMIM, RefSeq Protein ID, RefSeq Transcript ID, Gene Ontology Biological Process, Gene Ontology Cellular Component, Gene Ontology Molecular Function

Name the workbook as my_HG_U95Av2_annot and save it in the AnnotationData folder.

Figure 11.6: A portion of the subset my_HG_U95Av2_annot.xls file
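The column subset can also be produced programmatically rather than by copying columns in Excel. The Python sketch below reads CSV text and keeps only the named columns. It is a minimal illustration that assumes the first row holds the column names; the function name, the shortened column list, and the sample rows are hypothetical stand-ins for the downloaded file.

```python
import csv
import io

# A shortened, illustrative list; extend it with the other columns to keep.
KEEP = ["Probe Set ID", "Representative Public ID", "Gene Title"]

def subset_columns(source, columns):
    """Read CSV text from a file-like object and keep only the named columns."""
    reader = csv.DictReader(source)
    return [{name: row[name] for name in columns} for row in reader]

# Hypothetical two-row extract standing in for the downloaded annotation file.
sample = io.StringIO(
    "Probe Set ID,Representative Public ID,Gene Title,Unused Column\n"
    "probe_a,ACC0001,example gene one,x\n"
    "probe_b,ACC0002,example gene two,y\n"
)
subset = subset_columns(sample, KEEP)
```

To read a real file, pass an open file object (for example, one returned by open with newline="") in place of the in-memory sample.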

The size of the subset my_HG_U95Av2_annot.xls file is about one third the size of the original file. The column names provided by Affymetrix can be renamed to make them more descriptive. For example, the Representative Public ID column lists the GenBank Accession numbers in the Human Genome U95 Set’s annotation file, but it lists the FlyBase Accession number in the corresponding Drosophila Genome Array’s annotation file.

Rename the Representative Public ID column as Accession.

Some column values contain multiple entries that are separated by an entry delimiter. For example, values in the SwissProt column contain three forward slashes (///) as the entry delimiter in its annotation file. Some column values contain entries that consist of both identifier and description. In these cases, the identifier and description are separated by an ID delimiter. For example, values in the Gene Ontology Biological Process column contain two forward slashes (//) as the ID delimiter. These delimiters are commonly used in Affymetrix’s annotation files. Be aware that different annotation providers might use different entry and ID delimiters.
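The following Python sketch shows how a single annotation value could be broken apart using the entry delimiter (///) and the ID delimiter (//) described above. It is an illustration of the delimiter convention, not the parser used by JMP Genomics, and the function name and sample cell values are hypothetical.

```python
def parse_annotation_value(value, entry_delimiter="///", id_delimiter="//"):
    """Split one annotation cell into a list of (identifier, description) pairs.

    Entries without an ID delimiter get an empty description.
    """
    pairs = []
    # Split on the longer entry delimiter first so "///" is not read as "//".
    for entry in value.split(entry_delimiter):
        entry = entry.strip()
        if not entry:
            continue
        identifier, _, description = entry.partition(id_delimiter)
        pairs.append((identifier.strip(), description.strip()))
    return pairs

# Illustrative cell values in the style of the SwissProt and Gene Ontology columns.
swissprot_pairs = parse_annotation_value("P23467 /// Q9UQS8")
go_pairs = parse_annotation_value(
    "0006470 // protein dephosphorylation /// 0016311 // dephosphorylation"
)
```

Passing different delimiter arguments adapts the same function to providers that use other conventions.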

Generating the SAS Data Set

Select Genomics > Import > Text > Import Individual Text, CSV, or Excel Files, as shown in Figure 11.7.


Figure 11.7: Selecting the Import Individual Text, CSV, or Excel Files process

The Import Individual Text, CSV, or Excel Files dialog opens, as shown in Figure 11.8.

Figure 11.8: The Import Individual Text, CSV, or Excel Files dialog


To select the annotation file you just created, complete the following steps.

Click Choose to select the folder containing the input file.

Navigate to ProcessResults > AnnotationData.

Click OK to choose the folder.

All of the files contained in the AnnotationData folder are listed in the Available Files box in the dialog.

Select the my_HG_U95Av2_annot.csv.xls file.

Click to add the file to the Files to Import box.

You must indicate both the row in which the column names are listed and the first row containing data.

Examine the my_HG_U95Av2_annot.csv.xls file.

The column names are listed in row 1 and the data starts in row 2.

Type 1 in the Row Number of Variable Names [0, 10000] box.

Type 2 in the Data Start Row [0, 10000] box.

To select the output folder, complete the following steps.

Click Choose to select the output folder.





Figure 11.9: The completed Import Individual Text, CSV, or Excel Files dialog

Click Run to import the annotation file.

The SAS data set generated by this process is listed in a SAS Message dialog (Figure 11.10).


Click Open to examine the contents and structure of the my_hg_u95av2_annot.sas7bdat annotation data set shown in Figure 11.11.


Figure 11.11: A portion of the my_hg_u95av2_annot.sas7bdat annotation data set
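The Row Number of Variable Names and Data Start Row settings used in this import tell the process where the column names and the first data row sit in the file. The Python sketch below mimics that behavior for delimited text with 1-based row numbers. It is an illustration only: the function name and sample text are hypothetical, and a value of 0 for either setting is not handled.

```python
import csv
import io

def import_rows(source, names_row=1, data_start_row=2, delimiter=","):
    """Read delimited text using 1-based row numbers for names and data."""
    rows = list(csv.reader(source, delimiter=delimiter))
    names = rows[names_row - 1]
    return [dict(zip(names, row)) for row in rows[data_start_row - 1:]]

# Hypothetical file with a title line above the column names.
sample = io.StringIO(
    "Annotation export\n"
    "Probe Set ID,Gene Title\n"
    "probe_a,example gene one\n"
    "probe_b,example gene two\n"
)
imported = import_rows(sample, names_row=2, data_start_row=3)
```

For the file in this example, the equivalent call would use names_row=1 and data_start_row=2, matching the values typed into the dialog.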

Annotation Analysis Processes

Create Web Link

This example uses the annotation data set my_hg_u95av2_annot.sas7bdat, generated in the Annotation Data Set Creation example, to create a web link report.

Select Genomics > Annotation Analysis > Create Web Link. The Create Web Link dialog opens, as shown in Figure 11.12.


Figure 11.12: The General (left) and Options (right) tabs of the Create Web Link dialog


To select the input data set, complete the following steps.

Click Choose to select the input file.


Select the my_HG_U95Av2_annot.sas7bdat file.



The column names from the input data set are listed in the Available Variables box. To specify the individual parameter variables for the analysis, complete the following steps.


Click to add Probe_Set_ID to the Probe Id box, as shown in Figure 11.13.

Figure 11.13: Specifying the Probe Id variable


Click to add Accession to the GenBank Accession box, as shown in Figure 11.14.

Figure 11.14: Specifying the GenBank Accession variable

Select Unigene_ID from the list of available variables.

Click to add Unigene_ID to the Unigene_Id box.

Select Gene_Title from the list of available variables.

Click to add Gene_Title to the Description box.

Select Entrez_Gene from the list of available variables.

Click to add Entrez_Gene to the Gene_Id box.

Select Chromosome_Location from the list of available variables.

Click to add Chromosome_Location to the Chromosome Location box.

Select Ensembl from the list of available variables.

Click to add Ensembl to the Ensembl Id box.

Select SwissProt from the list of available variables.

Click to add SwissProt to the Swiss-Prot Id box.

Select EC from the list of available variables.

Click to add EC to the Enzyme Id (EC number) box.

Select OMIM from the list of available variables.

Click to add OMIM to the OMIM Id box.


Select both RefSeq_Protein_ID and RefSeq_Transcript_ID from the list of available variables.

Click to add both RefSeq_Protein_ID and RefSeq_Transcript_ID to the RefSeq Id box.

Select Gene_Ontology_Biological_Process, Gene_Ontology_Cellular_Component and Gene_Ontology_Molecular_Function from the list of available variables.

Click to add Gene_Ontology_Biological_Process, Gene_Ontology_Cellular_Component and Gene_Ontology_Molecular_Function to the GO Id box.

Leave the Gene Symbol and dbSNP Id boxes blank.

Because the U95Av2 array contains human genome sequences, select Homo sapiens from the Organism pull-down menu.

To specify the U95Av2 array, select HG_U95Av2(Human_Genome_U95Av2_Array) in the Affymetrix GeneChip Array box.

To select the output folder, complete the following steps.




The completed General tab of the Create Web Link dialog appears as shown in Figure 11.15.


Figure 11.15: The completed General tab of the Create Web Link dialog


The Options tab allows you to specify delimiters used in the annotation data set. The Entry and Entry ID Delimiters used by Affymetrix are entered by default. The name of the output file is optional. If left blank, JMP Genomics assigns a default name to the output file.


By default, the options for generating links to the various databases are initially disabled (grayed out). Each option is enabled when the variable it depends on is specified on the General tab. Checkboxes for enabled options are selected by default.

Note: Specifying a single variable on the General tab might enable more than one link option. For example, specifying the Gene Id variable enables both the Entrez Gene Link and the KEGG Gene Database Link.

The completed Options tab of the Create Web Link dialog appears as shown in Figure 11.16.

Figure 11.16: The completed Options tab of the Create Web Link dialog


Make no changes to the Options tab.

Click Run to generate a .html file containing the web links.

Figure 11.17: A portion of the .html file listing the web links for the data contained in the annotation file

Click on the links to explore the information available for each of the genes.
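Conceptually, a web link report pairs each identifier with a database URL built from that identifier. The Python sketch below generates a small HTML table of links from a URL template. The function name is hypothetical, and the template shown (the NCBI Gene address pattern) is an assumption used for illustration; the links that JMP Genomics actually writes may use different addresses.

```python
from html import escape
from urllib.parse import quote

def link_report(rows, id_column, url_template):
    """Return an HTML table with one link per row, built from a URL template."""
    rows_html = []
    for row in rows:
        label = row[id_column]
        # Percent-encode the identifier before placing it in the URL.
        url = url_template.format(id=quote(label, safe=""))
        rows_html.append(
            f'<tr><td><a href="{escape(url)}">{escape(label)}</a></td></tr>'
        )
    return "<table>\n" + "\n".join(rows_html) + "\n</table>"

# Assumed URL pattern, for illustration only.
GENE_TEMPLATE = "https://www.ncbi.nlm.nih.gov/gene/{id}"
report = link_report(
    [{"Entrez_Gene": "5787"}, {"Entrez_Gene": "5602"}],
    id_column="Entrez_Gene",
    url_template=GENE_TEMPLATE,
)
```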

IPA Upload

This process creates either an .xls or .txt export file that can be uploaded to the Ingenuity Pathway Analysis system for contextual analysis of expression and/or functional data for a suite of genes under specific experimental conditions. Up to ten different experimental comparisons can be made for each analysis.

This example uses a normalized expression data set for the 100 genes in the Affymetrix Latin Square Example discussed in Chapter 1, under two experimental conditions. Two different expression statistics are used for the comparison: the simple difference in expression for each of the genes between the two conditions and the p-values for those differences.

Select Genomics > Annotation Analysis > IPA Upload. The IPA Upload dialog opens, as shown in Figure 11.18.


Figure 11.18: The IPA Upload dialog

To load the parameters for the Affymetrix Latin Square example, complete the following steps.

Click Load.

Select AffymetrixLatinSquareExample.

Click OK.

The completed IPA Upload dialog appears as shown in Figure 11.19.


Figure 11.19: The completed General tab

Examine the dialog. The affylatin_norm_amr.sas7bdat file, included in the Sample Data folder, has been selected as the input data set.

Click Open to examine this file.

The column labels in the input data set are listed in the Available Variables box of the dialog. The AffyID column, which contains the probe set IDs, is selected as the gene identifier.

Scroll down the list of available variables.

Variables beginning with the letter d represent differences in expression, between experiments, for individual genes. For example, the values in column da_b_ represent differences in gene expression between experiments a and b. Variables beginning with the letter p represent the –log10 p-values of those differences. For example, the values in column pa_b_ represent the –log10 p-values of the differences in gene expression between experiments a and b.

When selected, variables containing –log10 p-values must always be listed in the Negative Log10P-Value Variables box, because the Ingenuity Pathway Analysis system requires p-values rather than –log10 p-values. The variables pa_b_ and da_b_ have been selected as the first and second expression values, respectively (Figure 11.20). Note that the type selected for each expression value matches that described above.
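The relationship between a –log10 p-value and the p-value itself is p = 10^(–x), where x is the –log10 p-value. The short Python sketch below shows the conversion in both directions. It illustrates the arithmetic only and is not code taken from the IPA Upload process; the function names are hypothetical.

```python
import math

def neg_log10_to_p(neg_log10_p):
    """Convert a -log10 p-value back to a p-value."""
    return 10.0 ** (-neg_log10_p)

def p_to_neg_log10(p_value):
    """Convert a p-value (0 < p <= 1) to its -log10 form."""
    return -math.log10(p_value)

# A -log10 p-value of 2 corresponds to p = 0.01.
p_from_two = neg_log10_to_p(2)
```

Larger –log10 values therefore correspond to smaller, more significant p-values.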


Figure 11.20: Selecting the first (left) and second (right) expression values

Click Run to generate an .html output file (Figure 11.21) that can be uploaded to Ingenuity.

Figure 11.21: The output .html file

Click Upload to IPA to upload the file to Ingenuity.

Note: You must have either an Ingenuity Pathway Analysis System license or trial package to run the analysis.


KEGG Pathway Search

The KEGG Pathway Search function allows users to identify the molecular interactions, reaction networks, and functions that are relevant to genes of interest. It searches the KEGG Pathway database by Entrez Gene Id, GenBank Accession, NCBI Protein GI Number, UniGene Cluster Id, UniProt Id, and OMIM Id, and then generates a report listing the search results and links.

Note: This process might take a long time to run, depending on internet traffic, the number of genes specified, and the number of pathways found.

This example illustrates the search process using two human genes from the Affymetrix Latin Square example that show significant expression differences. The first of these genes, LocusLink ID #5787, encodes protein tyrosine phosphatase, receptor type B. The second, LocusLink ID #5602, encodes mitogen-activated protein kinase 10. The data for these and other significant genes are listed in the u95a_significant_differences.sas7bdat file included in the Sample Data folder.

Select Genomics > Annotation Analysis > KEGG Pathway Search.

The KEGG Pathway Search dialog opens, as shown in Figure 11.22.

Figure 11.22: The KEGG Pathway Search dialog

Type the LocusLink gene numbers 5787 and 5602 into the Gene/Protein Ids box, as shown in Figure 11.23.

Figure 11.23: The Gene/Protein Ids box

The gene identifiers can be entered on one or more lines. If more than one gene is entered on a line, the identifiers must be separated by a space.
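The entry rule for this box (one or more lines, with spaces between identifiers on the same line) amounts to splitting the text on whitespace. The Python sketch below shows that interpretation. The function names are hypothetical, and the removal of repeated identifiers and the all-digits check for Entrez Gene Ids are assumptions of this sketch rather than documented behavior.

```python
def parse_identifiers(box_text):
    """Split ID box text on spaces and line breaks, dropping repeated IDs."""
    seen = []
    for identifier in box_text.split():
        if identifier not in seen:
            seen.append(identifier)
    return seen

def all_numeric(identifiers):
    """Check that every identifier looks like an integer Entrez Gene Id."""
    return all(identifier.isdigit() for identifier in identifiers)

# The two genes from this example, entered across two lines with a repeat.
ids = parse_identifiers("5787 5602\n5787")
ids_are_numeric = all_numeric(ids)
```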


Note: The same gene can have different identifiers, depending on the species. For example, the Gene Id for the human gene A1BG, which encodes the alpha-1-B glycoprotein, is 1, whereas the Gene Ids for the mouse and rat homologs are 117586 and 140656, respectively. JMP Genomics supports the use of identifiers from different species.

Supported gene/protein identifiers include Entrez Gene ID, GenBank Accession, NCBI Protein GI Number, UniGene Cluster ID, UniProt ID, and OMIM ID. All of the gene/protein identifiers in an analysis must be of the same type (GenBank Accession numbers, for example). To indicate the identifier type, complete the following steps:

Click the downward arrow in the Type of Gene/Protein Ids box, as shown in Figure 11.24.

Figure 11.24: Selecting the identifier type

Select Entrez Gene Id from the drop-down menu.

The selected type appears as shown in Figure 11.23. If your computer accesses the internet through a proxy server, you must indicate that in the dialog.

Click Yes if you use a proxy server to access the internet. Specify the name of your proxy server before running either the KEGG Pathway Search process or the KEGG Pathway Color process on your computer for the first time. See Chapter 12 for further instructions on specifying the proxy server.

To select the output folder, complete the following steps.




The completed KEGG Pathway Search dialog appears as shown in Figure 11.25.


Figure 11.25: The completed KEGG Pathway Search dialog

Click Run to generate an .html report and two SAS data sets.

The report, shown in Figure 11.26, lists and provides links to information on all the metabolic/regulatory pathways involving each of the subject genes and their products and to information on other genes in those pathways.

Figure 11.26: A portion of the KEGG Pathway Search report

The output SAS data sets are listed in a SAS Message window, shown in Figure 11.27.


Figure 11.27: The SAS Message window

Click Open to examine each of the files.

The first column in the keggpathwaysearch.sas7bdat file (Figure 11.28) lists all the genes involved in the relevant pathways. Each subsequent column corresponds to one of the pathways relevant to the input genes. In these columns, a “1” indicates that the gene listed in the first column participates in the pathway, and a “0” indicates that it does not.

Figure 11.28: A portion of the keggpathwaysearch.sas7bdat file

The keggpathwaysearch_bypathwayid.sas7bdat file (Figure 11.29) lists the different pathways for each gene and other genes involved in those pathways.

Figure 11.29: The keggpathwaysearch_bypathwayid.sas7bdat file
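The two output data sets hold the same gene-to-pathway relationships in different layouts: one row per gene and pathway pairing, and one row per gene with a 0/1 column for each pathway. The Python sketch below shows how the first layout can be turned into the second. The function name is hypothetical and the gene and pathway pairings are illustrative values, not results copied from the report.

```python
def membership_matrix(pairs):
    """Turn (gene, pathway) pairs into one row per gene with 0/1 pathway columns."""
    genes = sorted({gene for gene, _ in pairs})
    pathways = sorted({pathway for _, pathway in pairs})
    members = set(pairs)
    return [
        {"gene": gene, **{p: int((gene, p) in members) for p in pathways}}
        for gene in genes
    ]

# Illustrative pairings only.
pairs = [("5787", "hsa04520"), ("5602", "hsa04010"), ("5602", "hsa04520")]
matrix = membership_matrix(pairs)
```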


KEGG Pathway Color

The KEGG Pathway Color function allows users to visualize and interpret their statistical results in the context of pathways and biological systems. This process adds color to the gene nodes in the pathway diagrams, if the genes are found in the input data set. The colors are determined according to the values of one or more numeric variables that you specify. A report is generated to display the results and links.

Note: This process can take a long time to run, depending on internet traffic, the number of genes specified and the number of pathways found.

This example illustrates the KEGG Pathway Color process with the Adherens junction pathway (hsa04520) identified in the example used to illustrate the KEGG Pathway Search process. This example uses the affylatin_norm_amr.sas7bdat file, included in the Sample Data folder, as the input data set.

Select Genomics > Annotation Analysis > KEGG Pathway Color. The KEGG Pathway Color dialog opens, as shown in Figure 11.30.

Figure 11.30: The KEGG Pathway Color dialog

KEGG Pathway IDs can be found using the KEGG Pathway Search process. You can specify a single pathway or multiple pathways. Multiple pathways should either be entered on separate lines or, if entered on one line, be separated by a space. Pathway identifiers should be species specific.

To enter the KEGG Pathway ID for this example, type hsa04520 in the IDs of KEGG Pathways to be Colored box.

To select the input data set, complete the following steps.


Click Choose to select the input file.


Select the affylatin_norm_amr.sas7bdat file.


The column names of the input data set are listed as available variables. Select specific analysis variables from this list.


Click to add the variable to the Variables Containing Gene IDs box (Figure 11.31).

Figure 11.31: Selecting the variable containing the gene IDs

The pathways can be colored with one or more variables. If this box is left blank, all genes found in the input data set are colored red. Any numeric variables are valid for this selection, although they should all be of the same type (-log p-values or lsmeans, for example), because a common color scale is used for all variables.

Select lsmExperiment_a, lsmExperiment_g, lsmExperiment_j, and lsmExperiment_q from the list of available variables.

Click to add these variables to the Variables by Which to Color Pathways box (Figure 11.32).

Figure 11.32: Selecting the variables by which to color the pathways

If your computer accesses the internet through a proxy server, you must indicate that in the dialog.

Click Yes if you use a proxy server to access the internet. Specify the proxy server name before running either the KEGG Pathway Search process or the KEGG Pathway Color process on your computer for the first time. See Chapter 12 for further instructions on specifying the proxy server.

To select the output folder, complete the following steps.





The completed General tab of the KEGG Pathway Color dialog appears as shown in Figure 11.33.

Figure 11.33: The completed KEGG Pathway Color dialog

The Options tab allows you to specify how the output is presented.


Type the RGB number #86CDFF in the Low Color RGB box to specify the low end of the spectrum.

Type the RGB number #E3E4DA in the Middle Color RGB box to specify the midpoint of the spectrum.

Type the RGB number #FFB19F in the High Color RGB box to specify the high end of the spectrum.

Specify 0 and 100 as the percentiles to use as the lowest and highest color values, respectively.

Do not specify either a title for the pathway results file or a name for the output file.

Click Run to run the KEGG Pathway Color process.

When the process is completed, the generated report (.html file) opens as shown in Figure 11.34.


Figure 11.34: The KEGG Pathway Color Results report
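The low, middle, and high colors entered on the Options tab define a color scale onto which variable values are mapped. The Python sketch below shows one plausible mapping: linear interpolation from the low color to the middle color over the lower half of the value range, and from the middle color to the high color over the upper half. The exact mapping used by JMP Genomics is not described here, so the interpolation rule and the function names are assumptions.

```python
def hex_to_rgb(code):
    """Convert a color such as '#86CDFF' to an (r, g, b) tuple of integers."""
    code = code.lstrip("#")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def color_for(value, low_value, high_value,
              low="#86CDFF", middle="#E3E4DA", high="#FFB19F"):
    """Linearly interpolate low -> middle -> high and return a hex color."""
    if high_value == low_value:
        return middle
    t = (value - low_value) / (high_value - low_value)
    t = min(1.0, max(0.0, t))  # clamp values that fall outside the range
    if t <= 0.5:
        start, end, local = hex_to_rgb(low), hex_to_rgb(middle), t / 0.5
    else:
        start, end, local = hex_to_rgb(middle), hex_to_rgb(high), (t - 0.5) / 0.5
    mixed = (round(s + (e - s) * local) for s, e in zip(start, end))
    return "#" + "".join(f"{channel:02X}" for channel in mixed)

# With percentiles 0 and 100, the range runs from the minimum to the maximum value.
lowest = color_for(0, low_value=0, high_value=10)
midpoint = color_for(5, low_value=0, high_value=10)
highest = color_for(10, low_value=0, high_value=10)
```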

The information in the report includes the name and ID number of the colored pathway and the definition and ID number of the colored gene. The links include the web links to Entrez Gene and to the colored pathway map output files in the ProcessResults folder.

List Enrichment

The List Enrichment process compares a set of curated lists (of genes, proteins, or metabolites, for example) against a table of significance values and then tests for significant enrichment using Fisher's exact test for association. It generates a report on the results in .rtf, .pdf, or .html format. This example illustrates the List Enrichment process using the following data set and files:

• u95a_anov_amr.sas7bdat: This file contains a subset of the results data set from the Affymetrix Latin Square ANOVA analyses example and functions as the significance input data set.

• Example_List_Description_File.TXT: This list description file contains the names of the files containing ID lists to be compared with the Significance Input Data Set. Note: This table must have two columns with first-row headers Name and File. Name provides the names that are to appear in the output file, and File contains the file names, with extensions, of the files containing the list data. Each row of this table references a different list, and Fisher exact tests are computed for each. This file must be comma-separated, tab-delimited, or Excel, with the corresponding extension being one of the following: .csv, .txt, or .xls.

• Interleukin_Receptors.TXT and Protein_Kinases.TXT: These files function as the list data files.
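The required layout of the list description file (two columns with first-row headers Name and File) can be checked before running the process. The Python sketch below reads tab-delimited text and rejects any other header. The function name is hypothetical, and the display names in the Name column of the sample are invented; only the two list file names come from this example.

```python
import csv
import io

def read_list_description(source, delimiter="\t"):
    """Return (name, file) pairs from a list description table."""
    reader = csv.DictReader(source, delimiter=delimiter)
    if reader.fieldnames != ["Name", "File"]:
        raise ValueError("expected exactly two columns named Name and File")
    return [(row["Name"], row["File"]) for row in reader]

# Hypothetical contents; the Name values are invented for illustration.
sample = io.StringIO(
    "Name\tFile\n"
    "Interleukin Receptors\tInterleukin_Receptors.TXT\n"
    "Protein Kinases\tProtein_Kinases.TXT\n"
)
lists = read_list_description(sample)
```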



Select Genomics > Annotation Analysis > List Enrichment. The List Enrichment dialog opens as shown in Figure 11.35.

Figure 11.35: The List Enrichment dialog

To select the significance input data set, complete the following steps.

Click Choose.


Select the u95a_anov_amr.sas7bdat file.


The column names of the input data set are listed as available variables. Select specific analysis variables from this list. The ID variable must identify the entities (genes or proteins, for example) to be compared with the curated lists. The values of this variable must match values in the lists. Only one variable should be selected.


Click to add Probe_Set_ID to the ID Variable box, as shown in Figure 11.36.


Figure 11.36: Selecting the ID variable

The significance variable must contain the significance values of the ID Variable values. Values in the significance variable are typically −log10 p-values derived from prior analyses.

Select _Log10_p_value_for_Diff_of_Exp4 from the list of available variables.

Click to add _Log10_p_value_for_Diff_of_Exp4 to the Significance Variable box, as shown in Figure 11.37.

Figure 11.37: Selecting the significance variable

A value between 0 and 100, used to determine the significant difference cutoff, must be specified. For this example, a significance cutoff of 10 is specified. Observations with a Significance Variable greater than 10 are considered significant.

Type 10 in the Significance Cutoff [0,100] box.

To select the list description file, complete the following steps.

Click Choose.


Select All Files(*.*) from the Files of type drop-down menu.

Select the Example_List_Description_File.TXT file.


To select the folder of list files, complete the following steps.

Click Choose.



The default format for the output file is .rtf.

Do not change the output file type.

Leave the Output File Name field blank.

To select the output folder, complete the following steps.





The completed List Enrichment dialog appears as shown in Figure 11.38.

Figure 11.38: The List Enrichment dialog

Click Run to run the List Enrichment process.

When the process is completed, the generated report (.rtf file) opens as shown in Figure 11.39.

Figure 11.39: The List Enrichment report


All files created by the process are contained in this folder. Output files include the SAS program ListEnrichment_u95a_anov_amr.sas, the result file list_enrichment.rtf, several SAS data sets, and a SAS log.
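The enrichment test that this process applies to each curated list can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code that JMP Genomics runs: it uses a one-sided (over-representation) Fisher exact test, whereas this guide does not state which form of the test is used, and the IDs, scores, and function names are hypothetical. An ID counts as significant when its −log10 p-value exceeds the cutoff (10 in this example, which corresponds to p < 1e-10).

```python
from math import comb


def is_significant(neg_log10_p, cutoff=10.0):
    """An observation is significant when its value exceeds the cutoff."""
    return neg_log10_p > cutoff


def fisher_enrichment_p(sig_in_list, list_size, sig_total, total):
    """One-sided Fisher exact p-value: P(X >= sig_in_list), where X is the
    number of significant IDs in a random draw of list_size IDs from total IDs,
    sig_total of which are significant (hypergeometric upper tail)."""
    max_k = min(list_size, sig_total)
    tail = sum(
        comb(sig_total, k) * comb(total - sig_total, list_size - k)
        for k in range(sig_in_list, max_k + 1)
    )
    return tail / comb(total, list_size)


# Hypothetical -log10 p-values keyed by ID, and one hypothetical curated list.
scores = {"p1": 15.2, "p2": 11.0, "p3": 2.0, "p4": 0.5,
          "p5": 12.3, "p6": 1.1, "p7": 0.2, "p8": 4.4}
curated = {"p1", "p2", "p6"}

significant = {i for i, v in scores.items() if is_significant(v)}
in_list = curated & scores.keys()
p_value = fisher_enrichment_p(
    sig_in_list=len(in_list & significant),
    list_size=len(in_list),
    sig_total=len(significant),
    total=len(scores),
)
```

A small p_value suggests that the curated list contains more significant IDs than would be expected by chance.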

Chapter 12

Troubleshooting

This troubleshooting guide may help you diagnose and resolve problems that arise when running JMP Genomics. If a problem persists after you have checked for solutions here, contact JMP Technical Support at [email protected].

Process Problem Suggested Cause/Resolution

Installation of JMP Genomics

The message: “Existing Client Found” is displayed in the Install Shield Wizard window, indicating that a preexisting copy of SAS has been found on a network server.

A pre-existing copy of SAS has been found that is configured to run as a thin Client from a network server. JMP Genomics will only work with a personal copy of SAS loaded on the same Client machine and configured to work locally. Contact JMP Technical Support ([email protected]) for instructions and assistance in resolving this problem.

A SAS log is displayed in your JMP Genomics session along with a message preceded by ERROR.

The generated SAS code might not complete successfully because of mis-specified parameters. Most of the error messages should be self-explanatory and provide some idea about what to do next. If not, examine the broader context provided by the SAS log to determine the problem. If this fails, consult and search the SAS documentation for the SAS code generating the error by clicking Help > SAS Documentation – Local or Help > SAS Documentation – Web. There is also the possibility of a bug in the SAS macro code. If you have found what appears to be a bug, please send the SAS log and explanation to [email protected]. Please describe your procedure in sufficient detail for us to reproduce the problem. If you are a SAS programmer, you might wish to view and even edit the original SAS code in the ProcessLibrary and/or MacroLib folders. Please also feel free to send suggested changes to the code to [email protected].

Any JMP Genomics process that utilizes one or more SAS programs

A WARNING dialog appears, telling you that SAS is connected and a process is already running.

JMP Genomics can only run one process at a time and does not queue jobs. Click OK in the dialog to wait, disregard the Run you just clicked, and let the current process continue running. Click View Log to view the current SAS log to get information on the current process. Click Disconnect SAS to stop the current process. If the SAS process does not stop in a short period of time, it is okay to kill the sas.exe process directly from Windows Task Manager, and then click Disconnect SAS again.







A process runs longer than expected or produces no output.

In this situation, perform the following steps:

1. Click Run again. A WARNING: SAS is Connected window should appear.

2. Click View Log. If any SAS ERROR messages appear, click Disconnect SAS and follow the steps in the first box of this guide. If not, proceed to the step below.

3. View the SAS log that is displayed in the JMP Log window to see the most recently executed code. You can continue to click View Log as many times as you like to check the status of the SAS program. Alternatively, you can monitor generated file activity in the SAS working folder. The location of this folder is specified in your SASV9.CFG file, which is located in <SAS Installation folder>\nls\en\. The row beginning with -WORK indicates the folder. Open this folder, sort the files by Date Modified, and navigate into the most recent one. You should see various files being generated as the process runs. On Windows, press F5 to refresh the folder while you are monitoring it.

If these steps do not help, try running the process in the SAS 9.1 Display Manager as described below.
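To locate the SAS working folder without reading the configuration file by eye, a small Python sketch such as the following could extract the value of the row beginning with -WORK. The function name find_work_folder and the sample configuration text are hypothetical; the actual contents of your SASV9.CFG file may differ.

```python
def find_work_folder(config_text):
    """Return the value of the -WORK option in SAS configuration text,
    or None if no such row is present."""
    for line in config_text.splitlines():
        parts = line.strip().split(None, 1)
        # Match the option name exactly so that other options that merely
        # start with the same letters are not mistaken for it.
        if len(parts) == 2 and parts[0].upper() == "-WORK":
            return parts[1].strip().strip('"')
    return None


# Hypothetical configuration text for illustration only.
sample_config = "\n".join([
    r'-SET EXAMPLE "C:\Example"',
    r'-WORK "!TEMP\SAS Temporary Files"',
])
work_folder = find_work_folder(sample_config)
```

In practice, you would read the text from the SASV9.CFG file and then open the returned folder to watch for newly generated files.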

Any JMP Genomics process

Output of the process does not automatically open.

The output file name might contain one of the following characters anywhere in the name: (, ), @, ^, or &. Alternatively, the name might begin with square brackets ([name], for example). If these characters are present, you can open the output by completing the following steps:

1. Navigate to the specified output folder.

2. Double-click the sasclean.jsl script in the folder.

All of the output should open.
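A quick way to screen an output name before running a process is sketched below in Python. The check is an approximation of the rule described above: it treats any name that starts with an opening square bracket as a bracketed name. The function name blocks_auto_open is hypothetical.

```python
PROBLEM_CHARS = set("()@^&")


def blocks_auto_open(output_name):
    """Return True if the name contains (, ), @, ^, or & anywhere,
    or begins with an opening square bracket."""
    if any(ch in PROBLEM_CHARS for ch in output_name):
        return True
    return output_name.startswith("[")


# Hypothetical output names for illustration.
checks = {name: blocks_auto_open(name)
          for name in ["results(v2)", "[name]_out", "plain_output"]}
```

Choosing an output name for which this check returns False avoids the need to run sasclean.jsl afterward.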

Processes that perform repetitive computations

The SAS log gets truncated.

Processes that specify a lot of variables into one macro

The line length can become too long for SAS batch mode.

In either of these cases, an alternative way to debug the process is to open the .sas file in the SAS 9.1 Display Manager (right-click and select Open with SAS 9.1) and run it from there by pressing F3. The SAS Display Manager provides options for saving or deleting sections of long logs. On Windows operating systems, you can alternatively right-click on a .sas file and select Submit to SAS 9.1. SAS will then run in batch mode and produce .log and .lst files.



Processes using wide data sets composed of long lists of variables

Numerous ERROR messages are generated in the SAS log.

The SAS Macro text expression limit of 65534 bytes might have been exceeded. Workarounds for this situation include the following:

1. Recreate the data set or rename the variables to have the shortest possible names.

2. Modify the process specification to have list-style input for long lists of variables, such as Col1-Col20000.

3. Reduce the number of variables using K-Means Clustering, as follows. Transpose the data to tall form using Transpose Rectangular, run K-Means Clustering to generate a few thousand or fewer clusters, retain representatives from each cluster to use as the data, and then transpose back to wide form using Transpose Rectangular.

Opening a data file using the File > Open command in any JMP Genomics process

The column names listed in the Available Variables box of a dialog appear different than the original column names in the data set.

SAS employs two ways to name a column: the variable label and the variable name. When a file is opened using the File > Open command from the JMP menu, SAS variable labels will be displayed. These might differ from those displayed in the Available Variables list in the JMP Genomics process dialogs, which display SAS variable names for the available variables. To solve the problem, open the data file using the Open button on the process dialogs. This displays the table with names the same as those in the Available Variables lists. Alternatively, use the File > Open command from the JMP menu and, in the Open Data File dialog, change File of type to SAS Data Sets and click the Use SAS Variable Names for Column Names checkbox.

Changing the name of a column in a SAS data set in JMP

The new column name is not saved when you save the file as a SAS data set (.sas7bdat) using JMP’s File > Save as command.

In the Save JMP File As window, a Preserve SAS Formats and Variable Names check box becomes available when you select SAS V7 Dataset(*.SAS7BDAT) from the pull down menu. You must uncheck this box to save the new column name.

Agilent Import Engine

Running the process generates a long ERROR message along with a SAS Log and a SAS Message dialog indicating the successful generation of the SAS Data Set, EDDS and an Annotation data set.

The process has run successfully despite the appearance of the ERROR message. The likely cause of the ERROR message is the presence of non-numeric character strings in numeric columns. For example, Agilent places the string #IND in empty numeric cells to indicate missing values. When SAS imports the data from these files, it reports an error and replaces the character string with a period (.). Open the resulting data sets to verify that they are as you intended. If so, you may safely ignore the ERROR message and proceed with the data analysis.



Bioconductor Expresso for Affymetrix Import Engine

An ERROR message is generated when you try to choose an input data set using the Universal/ Uniform Naming Convention (UNC).

The Bioconductor Expresso wrapper does not accept the Universal/Uniform Naming Convention (UNC) for describing the location of a volume, directory or file. The UNC format is (\\directory\subdirectory\file). To avoid using a UNC formatted path, do not begin navigating to the desired files/folders by clicking on the directories shown in the box on the left side of the Open Data File window, as this will format the resulting path in the UNC. Instead, begin navigating by clicking within the Look in: box at the top of the window. The format of the resulting path (C:\Directory\Subdirectory\file) is acceptable to the Bioconductor Expresso process.
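A path can be screened for the UNC form before it is passed to the process. The Python sketch below simply tests for the two leading backslashes that distinguish a UNC path from a drive-letter path; the function name is_unc_path is hypothetical.

```python
def is_unc_path(path):
    """Return True if a Windows path begins with two backslashes (UNC form)."""
    return path.startswith("\\\\")


# The two path forms described above.
unc_example = r"\\directory\subdirectory\file"
drive_example = r"C:\Directory\Subdirectory\file"
results = (is_unc_path(unc_example), is_unc_path(drive_example))
```

If the check returns True, navigate to the file again starting from the Look in: box so that the path begins with a drive letter.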

Any input engine

An ERROR message is generated when you try to use an EDF generated by the EDF Builder and saved as a text file.

JMP's default Text Data File import setting for End of Field is Tab and Comma, and the default export setting for End of Field is Comma. If the EDF is saved as a .txt file and the fields end with commas instead of tabs, the format of the EDF is not recognized by the input engines. For JMP Genomics, the Import and Export defaults should both be set to Tab. To change the preference, select either File > Set Genomics Preference or File > Preferences. Select Text Data Files from the list on the left side of the JMP: Preferences Settings dialog. Change the End of Field default from Comma to Tab in the Data Export box. (Note: You should recheck the preferences after making this change.) Rebuild the EDF. The JMP Genomics installation instructions describe additional preferences that should be changed.

A SAS log is displayed in your JMP Genomics session along with a message preceded by ERROR or there are notes in the SAS log indicating Invalid data for particular variables.

When importing a file to a SAS data set, SAS determines the type of variable (character or numeric) based on the first N observations, where N is the value provided in the Number of Rows to Scan parameter on the Options tab of most of the Import processes. Sometimes, when a character value is present after the first N observations and the previous observations have all been numeric (so that the variable has already been defined as numeric), an error occurs when SAS attempts to read this character value. Try increasing the value for N in the Options tab until you no longer see these notes in the log.

Any Import process

The values in one or more columns are truncated.

When importing a file to a SAS data set, SAS determines the length of variable (character or numeric) based on the first N observations, where N is the value provided in the Number of Rows to Scan parameter on the Options tab of most of the Import processes. Sometimes, when subsequent values are longer than those in the first N observations, SAS will truncate those values to the length determined for the N observations. Try increasing the value for N in the Options tab.
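The effect of the Number of Rows to Scan parameter on both problems can be illustrated with a toy model in Python. This is a deliberately simplified sketch of the behavior described above, not SAS's actual import logic; the function names are hypothetical, and None stands in for a SAS missing value.

```python
def is_number(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_column(values, rows_to_scan):
    """Guess the type and width from the first rows_to_scan values only."""
    sample = values[:rows_to_scan]
    kind = "numeric" if all(is_number(v) for v in sample) else "character"
    width = max((len(v) for v in sample), default=0)
    return kind, width


def read_column(values, rows_to_scan):
    """Read every value using the type and width inferred from the sample."""
    kind, width = infer_column(values, rows_to_scan)
    if kind == "numeric":
        # A later character value cannot be read as a number ("invalid data").
        return [float(v) if is_number(v) else None for v in values]
    # Later, longer values are cut to the width seen in the sample.
    return [v[:width] for v in values]


too_few = read_column(["1", "2", "N/A"], rows_to_scan=2)
enough = read_column(["1", "2", "N/A"], rows_to_scan=3)
truncated = read_column(["ab", "cd", "efghij"], rows_to_scan=2)
```

Scanning enough rows to include the unusual value (here, all three) yields the correct type and length, which is why increasing N resolves both symptoms.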



Hierarchical Clustering

Heat map/dendrogram containing sample information is not correctly displayed when saved to a journal.

You have saved the heat map to a journal and closed JMP Genomics. When you open the journal, the sample information heat map displayed to the right of the main heat map does not display normal colors. The sample information has been saved to the output table. To see this information displayed correctly, make sure the data table is open before opening the journal.

An ERROR message is generated stating: You selected to use proxy server to access web, but did not specify proxy server name or port number. Please run Configure Proxy Settings to set the value.

The Proxy Server or Proxy Port number could have been incorrectly specified. Select File > Configure Proxy Settings or Genomics > Annotation Analysis > Configure Proxy Settings. Click and follow the instructions to identify your Proxy Server and Proxy Port Number. Make sure the correct name and number are entered in the dialog and click Run to configure your settings.

KEGG Pathway Search and KEGG Pathway Color

ERROR: KEGG throws RemoteException when searching pathways for hsa04520. Please refer to the Java log for further details.

The KEGG API server is either down, very busy, or the connection to the KEGG API server is denied. Retry the process at another time.



Create Web Link, KEGG Pathway Search, and KEGG Pathway Color

An ERROR message is generated stating: ERROR: Could not find class com/sas/genomics/annotation/ErrorMsgGetter at line 10557 column 222. Please ensure that the CLASSPATH is correct.

Check to see if any of the following files are missing from the <sasroot>\core\sasmisc directory (the default <sasroot> is C:\Program Files\SAS\SAS 9.1\):

• axis.jar
• axis-ant.jar
• axis-schema.jar
• commons-discovery.jar
• commons-logging.jar
• jaxrpc.jar
• keggapi.jar
• log4j-1.2.8.jar
• log4j.properties
• saaj.jar
• wsdl4j.jar
• sas.genomics.annotation.jar

If these files are missing, reinstall JMP Genomics. The install copies these files to: C:\Program Files\SAS\SAS 9.1\core\sasmisc\.
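The check for missing files can be automated with a short Python sketch such as the one below. The file names are those listed above; the function names are hypothetical, and the comparison ignores case on the assumption that the files reside on a Windows file system.

```python
import os

REQUIRED_FILES = [
    "axis.jar", "axis-ant.jar", "axis-schema.jar", "commons-discovery.jar",
    "commons-logging.jar", "jaxrpc.jar", "keggapi.jar", "log4j-1.2.8.jar",
    "log4j.properties", "saaj.jar", "wsdl4j.jar", "sas.genomics.annotation.jar",
]


def missing_files(present_names):
    """Return the required files that are absent from present_names."""
    present = {name.lower() for name in present_names}
    return [name for name in REQUIRED_FILES if name.lower() not in present]


def check_directory(directory):
    """List the required files missing from a directory on disk."""
    return missing_files(os.listdir(directory))


# Illustration with a hypothetical directory listing that lacks one file.
listing = [name for name in REQUIRED_FILES if name != "keggapi.jar"]
absent = missing_files(listing)
```

To check a real installation, call check_directory with the path to the <sasroot>\core\sasmisc directory; an empty result means that all of the listed files are present.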

Create Web Link, KEGG Pathway Search, and KEGG Pathway Color

An ERROR message is generated stating: ERROR: Failed to find genomics.config file.

Check to see if the genomics.config file is missing from the <sasroot>\sds\sasmisc directory (the default <sasroot> is C:\Program Files\SAS\SAS 9.1\). If the config file is missing, reinstall JMP Genomics. The install copies this configuration file to: C:\Program Files\SAS\SAS 9.1\sds\sasmisc\.

KEGG Pathway Color

A black KEGG pathway map results when you click and open a pathway map-link in your KEGG Color Process result.

Upgrade the SAS private JRE 1.4.1 to SAS private JRE 1.4.2_09 (or JRE 1.4.2 and up) as follows.

1. Install the recommended Java JRE.

2. After installing the JRE, verify that it has been installed at the default destination (in C:\Program Files\Java\j2re1.4.2_09, for example).

3. Update the SASV9.CFG file. The typical location for this file is: C:\Program Files\SAS\SAS 9.1\nls\en\SASV9.cfg

4. Use a text editor to change the line

-Dsas.jre.home=C:\PROGRA~1\SAS\SHARED~1\JRE\14267D~1.1

to

-Dsas.jre.home=C:\PROGRA~1\Java\j2re1.4.2_09

5. Save the file.
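The edit to the configuration file described in these steps can also be expressed as a small text substitution. The Python sketch below operates on configuration text held in a string and assumes that the JRE path contains no spaces, as in the short-name paths shown above. The function name set_jre_home and the sample text are hypothetical; back up SASV9.CFG before applying any change to the real file.

```python
import re


def set_jre_home(config_text, new_home):
    """Return config_text with the value of -Dsas.jre.home replaced."""
    pattern = r"(-Dsas\.jre\.home=)\S+"
    # A function is used for the replacement so that backslashes in the
    # Windows path are inserted literally.
    return re.sub(pattern, lambda m: m.group(1) + new_home, config_text)


# Hypothetical configuration text containing the line to be changed.
sample_config = "\n".join([
    "-SET EXAMPLE value",
    r"-Dsas.jre.home=C:\PROGRA~1\SAS\SHARED~1\JRE\14267D~1.1",
])
updated = set_jre_home(sample_config, r"C:\PROGRA~1\Java\j2re1.4.2_09")
```

All other lines of the configuration text are left unchanged.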



Partial Least Squares

An ERROR message is generated stating: Error: The model contains more than 32767 effects.

The message is generated whenever the data set contains more than 32,767 columns due to inherent limitations in SAS PROC PLS. Use Predictor Reduction or some other means to get the number of predictors below the upper bound.

Partial Least Squares Normalization

An ERROR message is generated stating: ERROR: PLS Normalization can be performed on a maximum of 32767 rows, and your data set has XXX.∗ You may wish to summarize, cluster, or subset your data.

The message is generated whenever the data set contains more than 32,767 columns due to inherent limitations in SAS PROC PLS. Use Predictor Reduction or some other means to get the number of predictors below the upper bound.

Workflow

Attempts to run a second AP or new Workflow fail. The JMP Log shows the following message: A second script is attempting to execute, possibly during a nested click event. It may be necessary to press Escape to terminate the previous script.

Press ESC to exit the JMP script.

∗ XXX represents some number greater than 32767.


Appendix

Table A.1: SAS Procedures Called by JMP Genomics Analytical Processes.

JMP Genomics AP: SAS PROCs Called Up

Experimental Design
Experimental Design Data Set Builder: TRANSPOSE, EXPORT
Experimental Design File Builder: none∗
Create Array Index: No SAS called; JSL only
Create ColumnName: No SAS called; JSL only
Create Row Index: No SAS called; JSL only
Check File Names: No SAS called; JSL only
Import Tutorials: No SAS called; JSL only

Import Affymetrix
Expression CHP Wizard: The wizard generates a workflow of import, quality control, and ANOVA APs. See the APs in the workflow for specific PROCs.
Download NetAffx Files: No SAS called; JSL only
ARR File Parser: No SAS called; JSL only
Expression CEL: DATASETS, REGISTRY, IMPORT, SORT
Expression CHP: DATASETS, REGISTRY, IMPORT, SORT
SNP CEL: DATASETS, REGISTRY, IMPORT, SORT
SNP Chip: DATASETS, REGISTRY, IMPORT, SORT
CNAT: IMPORT, SORT
Export to CHP Format: none

Illumina
Expression: IMPORT, DATASETS, SORT
SNP: IMPORT, SORT, TRANSPOSE
Copy Number: SORT, IMPORT, TRANSPOSE, DATASETS, CONTENTS

Other Expression
Agilent: DATASETS, REGISTRY, IMPORT, SORT
ArrayTrack: DATASETS, REGISTRY, IMPORT, SORT
Bioconductor Expresso for Affymetrix: none
GenePix: DATASETS, REGISTRY, IMPORT, SORT
QuantArray: DATASETS, REGISTRY, IMPORT, SORT
ScanAlyze: DATASETS, REGISTRY, IMPORT, SORT

Other Genetics
Arlequin: SORT
HapMap: IMPORT, TRANSPOSE
NEXUS: DATASETS, REGISTRY, IMPORT, SORT
Pedigree: IMPORT, SORT

Proteomics
ABI Analyst: DATASETS, REGISTRY, IMPORT, SORT

Text
Import Individual Text, CSV, or Excel Files: DATASETS, REGISTRY, IMPORT, SORT
Import a Designed Experiment from Text, CSV, or Excel Files: DATASETS, REGISTRY, IMPORT, SORT

JMP Genomics Import Tutorials: No SAS called; JSL only

∗ none indicates that, while the process calls SAS and uses SAS DATA step and macro code, no SAS PROCs are used.


Data Set Utilities
Column Contents: CONTENTS
Change Labels: none
Change Lengths: none
Rename: none
Reorder: none
Append: APPEND
Merge: SORT
Transpose Tall and Wide: MEANS, SORT, TRANSPOSE
Transpose Rectangular: SORT, TRANSPOSE
Unstack: none
Data Step: none
Merge and Transform: none
Rank Rows: RANK
Sort Rows: SORT
Statistics for Columns: SORT, SUMMARY
Statistics for Rows: none
Transform: none
Export: EXPORT

Genetics Data Set Utilities
Check Data Contents: CONTENTS, PRINT
Subset/Reorder Genetics Data: none
Recode Genotypes: ALLELE, SORT, TRANSPOSE

Genetic Marker Statistics
Phenotype Summary: SORT, FREQ
Marker Properties: ALLELE, SORT, TRANSPOSE
Linkage Disequilibrium: ALLELE, SORT, SUMMARY, PRINT
LD tagSNP Selection: ALLELE, SORT, IML
Malecot LD Map: SORT, PRINT, DATASETS, NLMIXED, APPEND

Association Testing
Case-Control Association: CASECONTROL, PSMOOTH, SORT, PRINT
PCA for Population Stratification: STDIZE, DATASETS, SORT, IML, CORR, TRANSPOSE, APPEND, PRINCOMP
Marker-Trait Association: ALLELE, LOGISTIC, GLIMMIX, PHREG, SORT, PRINT
SNP-Trait Association: MIXED, PHREG, LOGISTIC, TRANSPOSE, SORT, ALLELE, DATASETS
Quantitative TDT: ALLELE, FAMILY, PSMOOTH, MIXED, GLM, UNIVARIATE, MEANS, SORT, PRINT, IML
TDT: FAMILY, PSMOOTH, SORT, PRINT
SNP Interaction Selection (Experimental): SORT, MEANS, TRANSPOSE, FREQ, CONTENTS, APPEND, STDIZE, FASTCLUS, GENESELECT, DATASETS, TTEST

Model-free Linkage
Affected Sib-Pair Tests: none
Haseman-Elston Regression: SORT, MIXED, PSMOOTH
Variance Components: SORT, MIXED, UNIVARIATE, IML, PRINT, PSMOOTH

Haplotype Analysis
Haplotype Estimation: HAPLOTYPE, PSMOOTH, SORT
Haplotype Trend Regression: HAPLOTYPE, LOGISTIC, REG, PHREG, SORT, PRINT, TRANSPOSE
htSNP Selection: HTSNP, PRINT, SORT

Copy Number
Distribution Analysis: KDE
Data Standardize: STDIZE
Correlation and Principal Components: CORR, FACTOR, PRINCOMP
Bin: MEANS
One-Way ANOVA: none
Bivariate One-Way ANOVA: SORT, CONTENTS


Spectral Preprocessing
2D Bin: MEANS
2D Detrend: TRANSPOSE
2D Peakfind: IML, SORT, TRANSPOSE
2D Plot: TRANSPOSE
3D Align: KDE
3D Plot: none

Quality Control
Distribution Analysis: KDE
Correlation and Principal Components: CORR, FACTOR, PRINCOMP
Correlation and Grouped Scatterplots: none
Filter Intensities: UNIVARIATE, MEANS
Feature Flagger: SQL
Effect Removal via PLS Normalization: No SAS called; JSL only
Missing Value Imputation: DATA STEP
Pseudo Image: MEANS, UNIVARIATE
Surface Summary: KDE, UNIVARIATE, MEANS, SORT, FORMAT, G3D

Normalization
ANOVA Normalization: MIXED
Data Standardize: STDIZE
Factor Analysis Normalization: FACTOR
Loess Normalization: LOESS, MEANS, SORT, DATASETS, APPEND
Mixed Model Normalization: MIXED, MEANS, SORT
Partial Least Squares Normalization: PLS, TRANSPOSE
Quantile Normalization: MEANS, SORT
Ratio Analysis: LOESS, MEANS, SORT, CONTENTS

Pattern Discovery
Hierarchical Clustering: TRANSPOSE
K-Means Clustering: FASTCLUS
Principal Components Analysis: PLS
Distance Matrix: DISTANCE, SORT
Multidimensional Scaling: MDS, SORT

Row-by-Row Modeling
One-Way ANOVA: none
ANOVA: MIXED
Mixed Model Analysis: MIXED, MULTTEST, MEANS, STDIZE, DATASETS, CONTENTS, SORT, TRANSPOSE, PRINT
Estimate Builder/Compare Means: MIXED, PRINT
Two-Way Plotter: DATASETS, SORT, TRANSPOSE, GPLOT, GCHART, GREPLAY
P-Value Adjustment: MULTTEST
P-Value Quantile Plotter: No SAS called; JSL only

Predictive Modeling
Recode Genotypes:
Transpose Tall and Wide: MEANS, SORT, TRANSPOSE
Discriminant Analysis: DISCRIM, TRANSPOSE
Distance Scoring: none
General Linear Model Selection: GLMSELECT
K Nearest Neighbors: DISCRIM
Logistic Regression: LOGISTIC
Partial Least Squares: PLS, GLMMOD, TRANSPOSE
Partition Trees: TRANSPOSE
Radial Basis Machine: GLIMMIX
Binary Response Effect Selection (Experimental): SORT, MEANS, TRANSPOSE, FREQ, CONTENTS, APPEND, STDIZE, FASTCLUS, GENESELECT, DATASETS, TTEST
Cross Validation Model Selection: No SAS called; JSL only
Test Set Model Comparison: No SAS called; JSL only


Annotation Analysis
Create 0-1 Indicator for Select Rows: No SAS called; JSL only
Venn Diagram: No SAS called; JSL only
Create Web Link: SQL, EXPORT
IPA Upload: SQL, EXPORT
KEGG Pathway Search: SORT, MEANS, UNIVARIATE, TRANSPOSE
KEGG Pathway Color: SORT, EXPORT, TRANSPOSE
UCSC Genome Browser Link: MEANS, SORT
Affymetrix Integrated Genome Browser: MEANS, SORT
Download NetAffx Files: No SAS called; JSL only
Column Enrichment: GLMMOD, TRANSPOSE, SORT, MEANS, MULTTEST
List Enrichment: none
Configure Proxy Settings: none

Power and Sample Size
Mixed Model Power: MIXED
SNP Power: IML, SORT, PRINT

Workflow Builder
Clear Parameter Defaults: No SAS called; JSL only
Generate Dialogs from XML: none
