
Chapter 21

Data Processing and Analysis for Protein Microarrays

David S. DeLuca, Ovidiu Marina, Surajit Ray, Guang Lan Zhang, Catherine J. Wu, and Vladimir Brusic

Abstract

Protein microarrays are a high-throughput technology capable of generating large quantities of proteomics data. They can be used for general research or for clinical diagnostics. Bioinformatics and statistical analysis techniques are required for interpretation and reaching biologically relevant conclusions from raw data. We describe essential algorithms for processing protein microarray data, including spot-finding on slide images, Z score, and significance analysis of microarrays (SAM) calculations, as well as the concentration dependent analysis (CDA). We also describe available tools for protein microarray analysis, and provide a template for a step-by-step approach to performing an analysis centered on the CDA method. We conclude with a discussion of fundamental and practical issues and considerations.

Key words: Protein microarray, Concentration dependent analysis, Z score, Differential expression analysis, Bioinformatics

Catherine J. Wu (ed.), Protein Microarray for Disease Analysis: Methods and Protocols, Methods in Molecular Biology, vol. 723, DOI 10.1007/978-1-61779-043-0_21, © Springer Science+Business Media, LLC 2011

1. Introduction

Protein microarray technology offers direct detection and quantification of protein expression, the endpoint of both molecular and cellular function in health and disease. The potential of protein microarrays is great in research applications and future clinical diagnostics due to their high throughput and minimal sample requirements (1). Protein microarrays have been used for elucidation of protein function and signaling (2), protein–protein interactions (3), detection of bacteria and toxins (4), drug discovery (5), and identification of protein biomarkers (6). This technology can be used to measure a range of protein properties including protein–protein interactions, protein–phospholipid interactions, protein kinase substrates, small molecule targeting, and antibody–antigen interactions (7–9). Thus, the potential for protein microarrays covers a wide range of applications in both research and diagnostics.

The large quantity of data collected through protein microarray analysis necessitates a series of computational processing steps to properly arrive at biological conclusions (10). Protein microarrays, however, are still in their infancy in comparison to nucleotide microarrays. Thus, a comprehensive analysis software solution has yet to be produced. Software packages for protein microarray data analysis are available (11), but have limitations compared to their more mature DNA microarray analysis counterparts; while some tools address protein microarray-specific issues such as protein concentration correction (12, 13), the comprehensive analysis and statistics packages available for DNA chips (14, 15) are still lacking. Some authors have used DNA microarray analysis methods for protein microarrays (16). However, the lessons of DNA microarray analysis are only a starting point for developing robust protein microarray approaches. Future protein microarray analysis pipelines are likely to combine components specific to protein microarrays with generic microarray analytical methods.

The purpose of this chapter is to provide the reader with the background and framework necessary to develop a successful data analysis strategy for protein microarrays. In line with a computational approach, Subheading 2 contains neither reagents nor devices, but rather a series of key algorithms followed by a list of software tools which implement them. This catalog represents the computational toolbox from which the investigator may draw the necessary components for data analysis. It should therefore be noted that it is not necessary to combine all of these tools into one analysis and that it is up to the reader to pick and choose depending on the specific application. Some of these techniques have an origin in DNA microarray analysis, such as spot-finding on slide images, Z-score calculations, and significance analysis of microarrays (SAM). Other techniques have been developed solely for protein microarray analysis, such as the concentration dependent analysis (CDA). To aid the reader in assembling these components into a complete strategy, Subheading 3 contains a simple yet inclusive step-by-step analysis of protein microarray data. In this example, we walk through a CDA-based methodology for measuring differential antibody expression in leukemia patients before and after immunotherapy. Here, we utilize ProtoArray from Invitrogen, a commercially available high-density protein microarray platform. However, the method can be generalized to other platforms and applications. To address technical issues that arise in following this method, including cross-platform applicability, the chapter concludes with Subheading 4 in which the pitfalls and practical considerations are discussed.

2. Materials

2.1. Analytical Tools

2.1.1. Determination of Spot Intensity

Computational processing of microarray data begins with the acquisition of a digital image representing the signal intensity of the protein spots on the microarray. Software is necessary to determine the boundaries of spots in the scanned image, and to convert the pixel values into a file correlating protein identifiers with their numeric signal intensities. The GenePix Pro software (Molecular Devices, Union City, CA) is typically used for this task. The user must first align a grid of circles, representing spot boundaries, to the scanned image. The position and size of the circles must then be adjusted to ensure proper alignment. GenePix can calculate signal intensity by determining the mean or median background in the vicinity of the circle as well as the mean or median signal within the spot. The calculation using the median is robust and is not typically affected by small boundary misalignments or small artifacts. The program then creates a results file with the file extension GPR.

2.1.2. Z-Score Analysis

In a typical microarray analysis, the investigator wants to determine which signals are significantly different from the expected values. Calculating the Z score, also called the normal score, is a convenient method for this task. The Z-score equation is:

$$Z_s = \frac{S_s - \mu}{\sigma},$$

where $Z_s$ is the Z score for the $s$th spot, $S_s$ is the signal for that spot, $\mu$ is the mean signal across all spots, and $\sigma$ is the standard deviation across all spots. Thus, the Z score represents the distance of a given spot’s signal from the mean signal in units of standard deviations. When the population of signals has a normal distribution, about 99.7% of signals fall within 3 standard deviations of the mean, so spots with a Z score of 3 or greater in magnitude lie in the extreme 0.3% of the distribution.

2.1.3. Concentration-Dependent Analysis

A problem that is unique to protein microarrays is the variation in the quantity of spotted material on the chip (13). Higher concentrations of spotted proteins produce higher absolute signals. Simply dividing the signal by concentration does not correct for this properly, because this correction favors the signal of spots with low protein concentration. Instead, a Z score using the CDA technique can be calculated (13). First, signals are sorted by their spotted protein concentration. Subsequent calculations are then performed within sliding windows in which the concentrations of spotted proteins are similar. The Z score for the given signal is then calculated using the mean and standard deviation of the signal values within that window:

$$Z_s = \frac{S_s - \mu_w}{\sigma_w},$$

where $Z_s$ is the Z score for the $s$th spot, $S_s$ is the signal for that spot, $\mu_w$ is the mean signal for the spots within the window, and $\sigma_w$ is the standard deviation for spots within the window. If the concentrations within a given window vary too greatly (as defined by a reasonable threshold), the window is contracted by exclusion of the values at the edges. Additionally, outliers are defined as spots having signals more than a defined number of standard deviations away from the mean (the usual default value is 3). Outliers are removed iteratively, with the mean and standard deviation recalculated after outliers are removed, and then any new outliers identified. The iteration stops when the recalculation identifies no further outliers within the window.
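The global and windowed Z-score calculations described above can be sketched in a few lines of Python. This is a minimal illustration, not the ProtMAT implementation: the function names, the fixed window size (`half_width`), the use of the sample standard deviation, and the example data are all assumptions made here, and the contraction of windows whose concentrations vary too greatly is omitted for brevity.

```python
from statistics import mean, stdev


def global_z_scores(signals):
    """Z score of each signal relative to the mean and SD of all spots."""
    mu, sigma = mean(signals), stdev(signals)
    return [(s - mu) / sigma for s in signals]


def window_stats(values, threshold=3.0):
    """Mean and SD of a window after iterative outlier removal."""
    kept = list(values)
    while len(kept) > 2:
        mu, sigma = mean(kept), stdev(kept)
        if sigma == 0:
            break
        inliers = [v for v in kept if abs(v - mu) <= threshold * sigma]
        # Stop when no further outliers are found (or too few values remain).
        if len(inliers) == len(kept) or len(inliers) < 2:
            break
        kept = inliers
    return mean(kept), stdev(kept)


def cda_z_scores(spots, half_width=10, threshold=3.0):
    """Windowed Z scores for a list of (concentration, signal) pairs.

    Spots are sorted by concentration; each spot is scored against the
    spots within `half_width` ranks on either side (assumed window rule).
    Results are returned in the original input order.
    """
    order = sorted(range(len(spots)), key=lambda i: spots[i][0])
    signals = [spots[i][1] for i in order]
    z = [float("nan")] * len(spots)
    for rank, idx in enumerate(order):
        lo = max(0, rank - half_width)
        hi = min(len(order), rank + half_width + 1)
        mu_w, sigma_w = window_stats(signals[lo:hi], threshold)
        if sigma_w > 0:
            z[idx] = (signals[rank] - mu_w) / sigma_w
    return z


# Hypothetical data: 60 spots whose signal rises with spotted concentration,
# plus one strongly reactive spot at index 30.
spots = [(100 + i, 10 * (100 + i) + ((i * 7) % 5 - 2) * 5) for i in range(60)]
spots[30] = (spots[30][0], spots[30][1] + 2000)
z = cda_z_scores(spots, half_width=10)
print("CDA Z score of reactive spot:", round(z[30], 1))
```

Note that a window must be large enough for outlier removal to have any effect: with the sample standard deviation, no value in a window of n spots can lie more than (n - 1)/sqrt(n) standard deviations from the window mean, so a threshold of 3 can only be exceeded when the window holds at least 11 spots.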

2.1.4. Differential Expression Analysis

A typical protein microarray analysis scenario involves the comparison of protein expression in samples under two different conditions, A and B, for example before versus after treatment or healthy versus infected. Working with Z scores, the difference ($Z_{diff}$) can simply be calculated as the difference between the Z scores of each protein spot under the two conditions, $Z_A - Z_B$, for all given spots. Alternatively, differences can be expressed as a percentage or fold increase. That is, $Z_{mult}$ is calculated as $Z_A / Z_B$. In practice, $Z_{diff}$ tends to call false positives when the values are very large (unpublished exploratory work). Conversely, $Z_{mult}$ tends to over-represent spots where signals are very small. Therefore, it has proven useful to combine these two tests and select values where both $Z_{diff}$ and $Z_{mult}$ are above a threshold (13). Figure 1 provides an example of a differential antibody expression analysis comparing the immunological profiles of leukemia patients before and after immunotherapy.

Fig. 1. Differential expression analysis comparing antibody expression in CML patients before and after immunotherapy (17). Differential expression analysis was applied to determine antibody reactivity that was significantly increased after therapy (triangles). Such hits are expected to be found high above the diagonal.

2.1.5. Differential Expression Analysis with Repeat Experiments

SAM, originally developed for nucleic acid microarrays (18), has also been applied to protein microarrays (16). This method addresses the problem of a large number of chance hits that are generated when large datasets are analyzed using standard t test P values. SAM addresses this problem by applying the t test in a gene-specific manner (18).

A difference score, $d_p$, for a protein, $p$, is calculated by:

$$d_p = \frac{\bar{x}_A(p) - \bar{x}_B(p)}{\sigma_p},$$

where $\bar{x}_A(p)$ is the average signal for protein $p$ under condition A, and $\bar{x}_B(p)$ is the average signal for that protein under condition B. In the denominator, $\sigma_p$ is the standard deviation specific to the protein, $p$, across all repeated measurements. In essence, this score reflects the differences in expression of each protein relative to the standard deviations of the repeated measurements. This score is then compared to a threshold to determine significance. In the original study by the authors of SAM, a threshold of $d_p > 1.2$ resulted in a false discovery rate of 18% (18). Increasing the threshold would lower the false discovery rate at the expense of missing some true hits.
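The two differential expression approaches of Subheadings 2.1.4 and 2.1.5 can be sketched as follows. This is an illustrative sketch only: the function names and the default thresholds are hypothetical, the handling of non-positive $Z_B$ values is an arbitrary choice made here, and $\sigma_p$ is assumed to be the pooled standard error used in the original SAM publication (18), with the optional constant `s0` set to zero so that the score matches the equation above.

```python
from math import sqrt
from statistics import mean


def combined_hits(z_a, z_b, diff_threshold=3.0, mult_threshold=2.0):
    """Indices of spots where both Z_diff and Z_mult exceed their thresholds."""
    hits = []
    for i, (a, b) in enumerate(zip(z_a, z_b)):
        passes_diff = (a - b) > diff_threshold
        # Assumption: the ratio Z_A / Z_B is only evaluated for positive Z_B;
        # otherwise this sketch relies on the difference test alone.
        passes_mult = (a / b) > mult_threshold if b > 0 else True
        if passes_diff and passes_mult:
            hits.append(i)
    return hits


def difference_score(a_values, b_values, s0=0.0):
    """SAM-style difference score d_p for one protein with repeat measurements.

    sigma_p is computed as a pooled standard error over both conditions
    (an assumption about how the protein-specific SD is defined).
    Raises ZeroDivisionError if all repeats are identical and s0 is zero.
    """
    n_a, n_b = len(a_values), len(b_values)
    mean_a, mean_b = mean(a_values), mean(b_values)
    sum_sq = (sum((x - mean_a) ** 2 for x in a_values)
              + sum((x - mean_b) ** 2 for x in b_values))
    scale = (1 / n_a + 1 / n_b) / (n_a + n_b - 2)
    sigma_p = sqrt(scale * sum_sq)
    return (mean_a - mean_b) / (sigma_p + s0)


# Hypothetical Z scores for four spots under conditions A and B.
z_after = [6.0, 20.0, 0.5, 4.0]
z_before = [1.0, 16.0, 0.1, -1.0]
print(combined_hits(z_after, z_before))

# Hypothetical repeat measurements of one protein under each condition.
print(difference_score([10, 12, 14], [4, 6, 8]))
```

In the example, the second spot has a large difference but a small ratio, and the third has a large ratio but a small difference, so neither passes the combined test.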

2.2. Software Tools

The following software tools implement the analytical methods described in Subheading 2.1. Both proprietary and free-for-use software are included. Only software that was available at the time of publication has been included.

2.2.1. GenePix Pro

GenePix Pro from Molecular Devices, Union City, CA. This software package processes a scanned microarray slide and interprets the spots to produce a file of signal intensities, the starting data for microarray analysis. URL: http://www.moleculardevices.com/pages/software/gn_genepix_pro.html.

2.2.2. Prospector

Prospector from Invitrogen, Carlsbad, CA. This software calculates Z scores for data from the ProtoArray platform. The results format from GenePix Pro (GPR) can be read directly into this program. URL: http://tools.invitrogen.com/content.cfm?pageid=10400.

2.2.3. ProtMAT

ProtMAT from Dana-Farber Cancer Institute, Boston, MA. This is an online web tool that implements the CDA algorithm. It is capable of reading data files in the Prospector output format. CDA Z-score calculations as well as differential expression analysis can be performed. URL: http://cvc.dfci.harvard.edu/protmat/.

2.2.4. Significance Analysis of Microarrays

SAM from Stanford University, Stanford, CA. The authors of the SAM algorithm provide an implementation for download which requires R, Microsoft Excel, and Windows 2000 or XP. The implementation is capable of producing graphs and analyzing time series as well as gene set enrichment. URL: http://www-stat.stanford.edu/~tibs/SAM/.

3. Methods

The following steps represent a simple yet inclusive pipeline for analyzing protein microarray data. This example focuses on differential antibody expression analysis using the Invitrogen ProtoArray platform but can be modified to accommodate other platforms. This analysis centers on the CDA method, but could be supplemented with additional tools or algorithms mentioned in Subheading 2.

3.1. Signal Acquisition

1. Scan the microarray slide to produce an image file (see Notes 1–3).

2. Open the image file in GenePix software.

3. To localize spots within the scanned image, an array file (*.GAL) specific to the microarray version must be loaded. This file contains a grid of circles in blocks, which define the borders of the spots (see Notes 1–3).

4. Loading the GAL file provides a grid of circles, which define the spot boundaries. Align these spots by shifting the blocks and rows of circles into position.

5. Once the blocks are aligned, the resulting signals can be exported to a results file (*.GPR).

3.2. Format Preparation

1. Open the Prospector application from Invitrogen (see also Notes 4 and 5).

2. Select “Immune Response Profiling” as the type of application for this analysis.

3. Load the GPR file(s) produced in the last step of the previous section (Subheading 3.1, step 5).

4. Clicking the “Show” button will produce the correct data format for the next step and will automatically open the data in Excel.

5. In Excel, save the data as a text file (tab-delimited).
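For investigators who wish to inspect the results file exported in Subheading 3.1 outside of GenePix or Prospector, the following sketch reads median spot and background intensities from a GPR-style file. It assumes the tab-delimited Axon Text File layout (a signature line, a line giving the number of header records, the header records, a column-title row, then one row per spot) and the column names "ID", "F635 Median", and "B635 Median"; column names depend on the scanner wavelength and software version, so all of these should be treated as assumptions and checked against the actual file. The function name and the sample content are hypothetical.

```python
import csv
import io


def read_gpr(handle, signal_col="F635 Median", background_col="B635 Median"):
    """Return a list of (spot ID, median signal minus median background)."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)                      # signature line, e.g. "ATF 1.0"
    n_header = int(next(reader)[0])   # number of header records that follow
    for _ in range(n_header):
        next(reader)                  # skip header records
    columns = next(reader)            # column titles
    id_idx = columns.index("ID")
    sig_idx = columns.index(signal_col)
    bkg_idx = columns.index(background_col)
    spots = []
    for row in reader:
        if not row:
            continue                  # ignore blank lines
        net = float(row[sig_idx]) - float(row[bkg_idx])
        spots.append((row[id_idx], net))
    return spots


# Hypothetical, heavily shortened file content for illustration.
SAMPLE = (
    'ATF\t1.0\n'
    '2\t5\n'
    '"Type=GenePix Results 3"\n'
    '"Wavelengths=635"\n'
    '"Block"\t"Name"\t"ID"\t"F635 Median"\t"B635 Median"\n'
    '1\t"Protein A"\t"P001"\t1520\t120\n'
    '1\t"Protein B"\t"P002"\t310\t115\n'
)

for spot_id, net_signal in read_gpr(io.StringIO(SAMPLE)):
    print(spot_id, net_signal)
```

Median columns are used here because, as noted in Subheading 2.1.1, the median is less affected by small boundary misalignments and artifacts than the mean.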



3.3. Hit Calling

1. To perform hit calling by the CDA method, open the ProtMAT website in a web browser (see Subheading 2.2.3).

2. Load the text file generated in the last step of Subheading 3.2.

3. On the Settings page, set the threshold to 3.

4. Turn off the log data option.

5. Leave the additional settings at their default values.

6. Review the list of hits. If the list is unmanageably long or much too short, click the Calculate tab and adjust the threshold accordingly (see Notes 1, 2, 6–8).

7. Use the Export tab to save the results. You may choose what kind of data to export. Select settings, hits, and statistics (see Notes 3, 6–8).

3.4. Comparing Multiple Microarrays

1. Open the ProtMAT website in a web browser (see Subheading 2.2.3).

2. Load two of the data files in the format produced by the instructions in Subheading 3.2.

3. Continue the analysis by performing steps 3–7 of Subheading 3.3.

4. Notes

When following the steps outlined in Subheading 3, the investigator is likely to encounter a series of technical issues. Many of these issues are addressed in the following section. However, in addition to the technical issues, there are important fundamental considerations that have to be taken into account in order to ensure the successful application of protein microarray technology. Therefore, Notes 1–8, which relate directly to the steps in Subheading 3, are followed by Notes 9–13, which address the broader challenges facing analysts.

1. High-resolution scanning is required to provide enough pixels per spot for accurate measurement. A pixel size between 10 and 25 µm will produce quality results. The higher resolution 10 µm scan may be desirable for publication figures.

2. Saturation should be assessed by viewing a histogram of pixel intensities. If a large number of pixels have the maximum value, the PMT gain may have to be adjusted.

3. When aligning circle boundaries to the image file in GenePix Pro 5.0, there is an automated block alignment feature which can expand or contract circle size (Fig. 2). Experience has shown that it is best not to apply this feature and to adjust spots on a block-by-block or row-by-row basis. If automated mode is preferred, it is recommended to adjust the parameters, such as minimum and maximum spot size. Whether a manual or automated method is applied, all alignments should be visually verified to reduce erroneous signal identification during downstream processing.

Spotted protein concentration can be acquired from the protein microarray manufacturer. Alternatively, it can be determined by the investigator specifically for the lot of arrays used in the investigation.

4. To acquire spot concentrations from the manufacturer, follow their instructions. For the ProtoArray protein microarrays, load the GPR file into the Prospector software provided by the manufacturer. Prospector is capable of automatically downloading the “Protein Information File” containing spotted protein concentrations. Alternatively, it can be acquired from the Invitrogen website by providing the product bar code.

5. To determine protein concentrations when these are not available, several microarrays from a given printing lot must be selected. Ideally, 1–2 microarrays are selected from the beginning, middle, and end of the printing lot. These are probed with an antibody against the tag that is constructed into the synthesized proteins, for example glutathione-S-transferase (GST). Medians of these measured fluorescence values are then determined using the steps in Subheading 3.1. These values can then be formatted as needed for further analysis.

Fig. 2. Aligning spot boundary grid to scanned slide using GenePix Pro software using the automated alignment mode without proper manual verification or parameter settings (left) compared to a manual alignment (right) that preserves spot boundary size and spacing. (a) Some spots are off-center; the spots on the microarrays are actually quite nicely arrayed so they should be evenly spaced. (b) Some spots are exceedingly small and (c) some spots are exceedingly large; although spot size can vary slightly, the boundaries represented here do not reflect reasonable size variation. (d) Spots with vertical lines were identified by the alignment program as unreliable and are flagged as such when the data are exported.

6. When performing the CDA analysis, the most important parameter is the threshold. It may be reasonable and necessary to increase the threshold value to limit the number of hits if the quantity is unmanageable for further downstream experimentation. If few or no hits are found, however, the threshold should not be reduced below 3 because of the loss in statistical significance.

7. On the ProtMAT website, there is a summary statistics page. Inspect the skewness and kurtosis to assess whether the data points are distributed normally. The skewness and kurtosis (with the normal distribution as the baseline) must be of small magnitude for the data to follow an approximately normal distribution. One could use D’Agostino’s K-squared test for skewness (19) and the Anderson–Darling test for kurtosis to test for departures from zero (20). One could also directly perform a test of normality, using a separate Anderson–Darling test or the Shapiro–Wilk test (21, 22). If the assumption of normality is rejected by any of these tests, downstream analyses based on the normality assumption are not appropriate. In such situations, consultation with a statistician to propose an appropriate parametric or nonparametric methodology is recommended.

8. When exporting the results, it is also possible to output all the data points, as well as the control values. This will produce large files, and should be done separately from the hits and statistics data.

9. The mathematical and computational tools, algorithms, and methods presented above represent the essential components of a protein microarray analysis process. However, successful application of protein microarray technology requires proper experimental design. Factors such as cell purity, time scale, and tissue type can have a larger impact on the experiment than the choice of analysis tools.

10. Artifacts can be caused by inefficient lysis buffer, inconsistent sample processing, varying optimal conditions for different protein interactions, nonspecific associations, and requirements for appropriate conformation (23).

11. Beyond the scope of this chapter, more complex modeling is certainly possible. Normalizing has been shown to improve hit-calling in certain cases, although such transformations can also lead to information loss (10). Inclusion of SAM analysis would address signal fluctuations at the protein level (18). Alternatively, a Wilcoxon rank-sum test could provide a useful nonparametric approach (24). Furthermore, more complex designs require a formal statistical approach to analysis development. This is a recommended way of ensuring that such analysis tools are applied appropriately.

12. There are currently no standardized file formats for representing protein microarray data. Consequently, file conversion and data processing are necessary in many cases. For instance, running the CDA analysis tool referenced above requires data in the format produced by the software described in Subheading 3. When other software is used, such projects require programming ability or the involvement of a bioinformatician.

13. The reproducibility of protein microarray experiments has been determined to be quite high by performing repeat experiments (unpublished observations). While multiple measurements are desirable to overcome variability in sample handling or biological factors, little correction for variability in the array as a platform is necessary.
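As a complement to Note 7, the skewness and kurtosis inspected on the summary statistics page can be computed directly from a list of signals or Z scores. The sketch below uses the simple moment-based definitions, which are an assumption here since the exact estimators used by ProtMAT are not described above; the function name is hypothetical. Values near zero for both quantities are consistent with an approximately normal distribution, but formal tests such as those named in Note 7 (available, for example, in SciPy as `scipy.stats.normaltest`, `scipy.stats.shapiro`, and `scipy.stats.anderson`) should be preferred for decisions.

```python
from statistics import mean


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (both 0 for a normal distribution)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mu) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mu) ** 4 for v in values) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical signals with one very large value, giving a right-skewed sample.
signals = [100, 110, 95, 105, 98, 102, 900]
skew, kurt = skewness_and_excess_kurtosis(signals)
print(round(skew, 2), round(kurt, 2))
```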

Acknowledgments

C.J.W. acknowledges support from the Department of Defense (W81XWH-07-1-0080), the Miles and Eleanor Shore Award, NCI (5R21CA115043-2), the Early Career Physician-Scientist Award of the Howard Hughes Medical Institute, and is a Damon-Runyon Clinical Investigator supported (in part) by the Damon-Runyon Cancer Research Foundation (CI-38-07). O.M. acknowledges support from a Medical Student Fellowship of the Howard Hughes Medical Institute.

References

1. Hartmann M, Roeraade J, Stoll D, Templin MF, Joos TO (2009) Protein microarrays for diagnostic assays. Anal Bioanal Chem 393:1407–1416

2. Wolf-Yadlin A, Sevecka M, MacBeath G (2009) Dissecting protein function and signaling using protein microarrays. Curr Opin Chem Biol 13:398–405

3. Coleman MA, Beernink PT, Camarero JA, Albala JS (2007) Applications of functional protein microarrays: identifying protein-protein interactions in an array format. Methods Mol Biol 385:121–130

4. Ehricht R, Adelhelm K, Monecke S, Huelseweh B (2009) Application of protein arraytubes to bacteria, toxin, and biological warfare agent detection. Methods Mol Biol 509:85–105

5. Michaud GA, Salcius M, Zhou F, Papov VV, Merkel J, Murtha M, Predki P, Schweitzer B (2006) Applications of protein arrays for small molecule drug discovery and characterization. Biotechnol Genet Eng Rev 22:197–211

6. Kerschgens J, Egener-Kuhn T, Mermod N (2009) Protein-binding microarrays: probing disease markers at the interface of proteomics and genomics. Trends Mol Med 15:352–358


7. Hall DA, Ptacek J, Snyder M (2007) Protein microarray technology. Mech Ageing Dev 128:161–167

8. Lubomirski M, D’Andrea MR, Belkowski SM, Cabrera J, Dixon JM, Amaratunga D (2007) A consolidated approach to analyzing data from high-throughput protein microarrays with an application to immune response profiling in humans. J Comput Biol 14:350–359

9. MacBeath G, Schreiber SL (2000) Printing proteins as microarrays for high-throughput function determination. Science 289:1760–1763

10. Brusic V, Marina O, Wu CJ, Reinherz EL (2007) Proteome informatics for cancer research: from molecules to clinic. Proteomics 7:976–991

11. Zhu X, Gerstein M, Snyder M (2006) ProCAT: a data analysis approach for protein microarrays. Genome Biol 7:R110

12. White AM, Daly DS, Varnum SM, Anderson KK, Bollinger N, Zangar RC (2006) ProMAT: protein microarray analysis tool. Bioinformatics 22:1278–1279

13. Marina O, Biernacki MA, Brusic V, Wu CJ (2008) A concentration-dependent analysis method for high density protein microarrays. J Proteome Res 7:2059–2068

14. Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA (2003) DAVID: database for annotation, visualization, and integrated discovery. Genome Biol 4:P3

15. Reich M, Liefeld T, Gould J, Lerner J, Tamayo P, Mesirov JP (2006) GenePattern 2.0. Nat Genet 38:500–501

16. Hueber W, Kidd BA, Tomooka BH, Lee BJ, Bruce B, Fries JF, Sonderstrup G, Monach P, Drijfhout JW, van Venrooij WJ, Utz PJ, Genovese MC, Robinson WH (2005) Antigen microarray profiling of autoantibodies in rheumatoid arthritis. Arthritis Rheum 52:2645–2655

17. Biernacki MA, Marina O, Zhang W, Liu F, Bruns I, Cai A, Neuberg D, Canning CM, Alyea EP, Soiffer RJ, Brusic V, Ritz J, Wu CJ (2010) Antigen targets of remission-inducing immune therapy are expressed on CML progenitor cells. Cancer Res 70(3):906–915

18. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98:5116–5121

19. D’Agostino DB, Belanger A, D’Agostino RB (1990) A suggestion for using powerful and informative tests of normality. Am Stat 44:316–321

20. Anderson TW, Linfeng Y (1996) Adequacy of asymptotic theory for goodness-of-fit criteria for spectral distributions. J Time Ser Anal 17:533–552

21. Shapiro SS (1990) How to test normality and other distributional assumptions, Revth edn. ASQC, Milwaukee, WI

22. Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples). Biometrika 52:591–611

23. Schena M (2005) Protein microarrays. Jones and Bartlett, Sudbury, MA

24. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Machine Learn 46:389–422
