First Al-Khawarezmi Conference: Qatar, December 6-8, 2010 Ali Hadi
The Effects of Centering and Scaling the Rows of Multidimensional Data on
Their Graphical and Correlation Structures
Ali S. Hadi and Rida Moustafa
[email protected]@cornell.edu
www.aucegypt.edu/faculty/hadi
Outline of the Talk
1. Introduction
2. Types of Centering and/or Scaling
3. Effects of Centering and/or Scaling
4. The Main Theoretical Results
5. Illustrative Examples
6. Conclusions
1. Introduction
Before performing certain statistical
analysis methods (e.g., principal
components and factor analyses), it may
be necessary to preprocess the data to
make them suitable for the analysis.
Examples:
• Data Editing
• Imputation of missing values
• Transformation
• Identification of outliers
• Centering and/or Scaling (e.g., Rao,
2005)
• etc.
Given an $n \times p$ data matrix X, which
represents n multivariate observations on
p variables, the columns and/or the rows
of X may be centered and/or scaled before
applying a statistical method to the data
matrix X.
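2. Types of Centering and/or Scaling
1. Column (Variable) Centering:
$$X^{c} = (I_n - n^{-1} \mathbf{1}_n \mathbf{1}_n^T) X$$
The ij-th element of $X^{c}$ can be written as
$x_{ij} - \bar{x}_{.j}$,
where $\bar{x}_{.j}$ is the mean of the j-th column.
Hence we have: $\mathbf{1}_n^T X^{c} = \mathbf{0}_p^T$.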
2. Row (Observation) Centering:
$$X^{r} = X (I_p - p^{-1} \mathbf{1}_p \mathbf{1}_p^T)$$
The ij-th element of $X^{r}$ can be written as
$x_{ij} - \bar{x}_{i.}$,
where $\bar{x}_{i.}$ is the mean of the i-th row.
Hence we have: $X^{r} \mathbf{1}_p = \mathbf{0}_n$.
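3. Column and Row Centering:
$$X^{rc} = (I_n - n^{-1} \mathbf{1}_n \mathbf{1}_n^T) X (I_p - p^{-1} \mathbf{1}_p \mathbf{1}_p^T)$$
The ij-th element of $X^{rc}$ can be written as
$x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}$,
where $\bar{x}_{..}$ is the mean of all elements of X.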
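The three centering operations above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small made-up data matrix; the helper name `centering_matrix` and the variable names are illustrative and not taken from the talk.

```python
import numpy as np

def centering_matrix(k):
    """Return the k x k centering matrix I_k - (1/k) 1_k 1_k^T."""
    return np.eye(k) - np.ones((k, k)) / k

# Hypothetical 4 x 3 data matrix (n = 4 observations, p = 3 variables).
X = np.array([[1.0, 2.0, 6.0],
              [4.0, 0.0, 2.0],
              [3.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])
n, p = X.shape

X_c = centering_matrix(n) @ X                          # column centering
X_r = X @ centering_matrix(p)                          # row centering
X_rc = centering_matrix(n) @ X @ centering_matrix(p)   # column and row centering

# Element-wise equivalents: subtract column means, row means, or both.
col_means = X.mean(axis=0, keepdims=True)
row_means = X.mean(axis=1, keepdims=True)
print(np.allclose(X_c, X - col_means))
print(np.allclose(X_r, X - row_means))
print(np.allclose(X_rc, X - row_means - col_means + X.mean()))

# Column sums of X_c and row sums of X_r vanish.
print(np.allclose(np.ones(n) @ X_c, 0.0), np.allclose(X_r @ np.ones(p), 0.0))
```

Pre-multiplying by the centering matrix acts on columns, while post-multiplying acts on rows, which is why the same helper serves both cases.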
4. Row Scaling (each row of X or $X^{r}$):
This can be obtained by:
a. Scaling by the L1-norm
b. Scaling by the L2-norm
c. Scaling by the standard deviation (SD)
4a. Scaling the rows of the matrix X or $X^{r}$ by
the L1-norm of its rows as follows:
$$X_{L_1} = S_{L_1} X \quad \text{or} \quad X^{r}_{L_1} = S_{L_1} X^{r},$$
where $S_{L_1}$ is an $n \times n$ diagonal matrix
with its i-th diagonal element equal to the
reciprocal of the L1-norm of the i-th row of X
or $X^{r}$.
4b. Scaling the rows of the matrix X or $X^{r}$ by
the L2-norm of its rows as follows:
$$X_{L_2} = S_{L_2} X \quad \text{or} \quad X^{r}_{L_2} = S_{L_2} X^{r},$$
where $S_{L_2}$ is an $n \times n$ diagonal matrix
with its i-th diagonal element equal to the
reciprocal of the L2-norm of the i-th row of X
or $X^{r}$.
4c. Scaling the rows of the matrix X or $X^{r}$ by
the standard deviation (SD) of its rows:
$$X_{SD} = S_{SD} X \quad \text{or} \quad X^{r}_{SD} = S_{SD} X^{r},$$
where $S_{SD}$ is an $n \times n$ diagonal matrix
with its i-th diagonal element equal to the
reciprocal of the standard deviation of the i-th
row of X.
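The three row-scaling variants (4a to 4c) differ only in the diagonal scaling matrix. A minimal sketch follows, assuming NumPy, a made-up data matrix, and the sample standard deviation (divisor p - 1), which the slides do not specify. The function name `row_scaling_matrix` is hypothetical.

```python
import numpy as np

def row_scaling_matrix(X, kind):
    """Return the n x n diagonal matrix of reciprocal row norms or row SDs."""
    if kind == "L1":
        d = np.abs(X).sum(axis=1)
    elif kind == "L2":
        d = np.sqrt((X ** 2).sum(axis=1))
    elif kind == "SD":
        d = X.std(axis=1, ddof=1)   # assumption: sample SD with divisor p - 1
    else:
        raise ValueError(f"unknown scaling kind: {kind}")
    return np.diag(1.0 / d)

# Hypothetical 3 x 3 data matrix with no constant or all-zero rows.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, -4.0],
              [2.0, 2.0, 5.0]])

X_L1 = row_scaling_matrix(X, "L1") @ X
X_L2 = row_scaling_matrix(X, "L2") @ X
X_SD = row_scaling_matrix(X, "SD") @ X

print(np.abs(X_L1).sum(axis=1))         # every row has L1-norm 1
print(np.linalg.norm(X_L2, axis=1))     # every row has L2-norm 1
print(X_SD.std(axis=1, ddof=1))         # every row has SD 1
```

The same function can be applied to a row-centered matrix instead of X to obtain the centered-and-scaled variants.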
5. Standardizing the variables (each column
of X):
$$X^{cs} = X^{c} S_p,$$
where $S_p$ is a $p \times p$ diagonal matrix with
its j-th diagonal element equal to the
reciprocal of the standard deviation of the j-th
column of X.
6. Centering and standardizing both rows
and columns of X:
This is obtained by an iterative
standardization process of rows and
columns until the rows and columns are
approximately standardized.
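The slide does not spell out the iteration, so the following is only a sketch under stated assumptions: rows and columns are standardized alternately (mean 0, sample SD 1), and the loop stops when the matrix changes by less than a tolerance. The function name, the defaults, and the random test data are hypothetical.

```python
import numpy as np

def standardize_rows_and_columns(X, max_iter=500, tol=1e-10):
    """Alternate row and column standardization until the matrix stabilizes."""
    Y = np.asarray(X, dtype=float).copy()
    for _ in range(max_iter):
        Y_prev = Y.copy()
        # Standardize each row (observation).
        Y = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, ddof=1, keepdims=True)
        # Standardize each column (variable).
        Y = (Y - Y.mean(axis=0, keepdims=True)) / Y.std(axis=0, ddof=1, keepdims=True)
        if np.max(np.abs(Y - Y_prev)) < tol:
            break
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))           # hypothetical data
Y = standardize_rows_and_columns(X)

# Columns are exactly standardized (last step); rows only approximately.
print(np.abs(Y.mean(axis=0)).max(), np.abs(Y.std(axis=0, ddof=1) - 1).max())
print(np.abs(Y.mean(axis=1)).max(), np.abs(Y.std(axis=1, ddof=1) - 1).max())
```

Because the column step is applied last, the columns are standardized exactly, while the rows are standardized only as closely as the iteration has converged.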
The above row and column transformations
have been used by several authors in
practical applications, for example:
Holter et al. (2000), Wen et al. (2007),
Pielou (1984), Jackson (1991), Pyle (1999),
van der Werf, Jellema, and Hankemeier
(2005), van den Berg et al. (2006).
We argue that of the 11 methods of
centering/scaling the data matrix mentioned
above, only two (those which deal with the
columns of the data) are meaningful. The
other 9, which deal with the rows of the data,
are not generally recommended.
Why? Next slide.
3. Effects of Centering and/or Scaling
There are several reasons why centering
and/or scaling the rows is not a good thing to do:
1. Centering and scaling the observations is
not always possible. For example, when
an observation consists of the same
numerical value on all dimensions,
centering will replace the observation by
zeros and scaling would not be possible.
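As a small illustration of this first point, the sketch below uses a made-up observation with the same value on four dimensions (NumPy assumed, sample SD assumed). After centering, every candidate scaling factor is zero, so its reciprocal is undefined.

```python
import numpy as np

x = np.array([3.0, 3.0, 3.0, 3.0])    # hypothetical observation, same value everywhere
x_centered = x - x.mean()             # row centering replaces it by zeros

l1_norm = np.abs(x_centered).sum()    # 0.0
l2_norm = np.sqrt((x_centered ** 2).sum())   # 0.0
sd = x.std(ddof=1)                    # 0.0, with or without centering

print(x_centered, l1_norm, l2_norm, sd)
# Dividing by any of these would be a division by zero,
# so the centered row cannot be scaled.
```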
2. Even when centering rows is possible, it
creates a perfect collinearity among the
variables even if the original variables are
orthogonal. This is because
$X^{r} \mathbf{1}_p = \mathbf{0}_n$
for any data matrix X.
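The collinearity can be seen on a small made-up matrix whose columns are mutually orthogonal. This is a sketch assuming NumPy; the matrix is illustrative and not taken from the talk.

```python
import numpy as np

# Hypothetical 4 x 3 matrix with mutually orthogonal columns.
X = np.array([[1.0,  1.0,  1.0],
              [1.0, -1.0,  1.0],
              [1.0,  1.0, -1.0],
              [1.0, -1.0, -1.0]])
p = X.shape[1]

gram = X.T @ X                                   # diagonal: columns are orthogonal
X_r = X - X.mean(axis=1, keepdims=True)          # row centering

rank_before = np.linalg.matrix_rank(X)
rank_after = np.linalg.matrix_rank(X_r)
row_sums = X_r @ np.ones(p)                      # all zeros after row centering

print(rank_before, rank_after, row_sums)
```

Since every row of the centered matrix sums to zero, its columns satisfy an exact linear relation and the rank can be at most p - 1.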
3. In addition, it alters the correlation
structure among the variables. For
example, two positively correlated
variables will turn into two perfectly
negatively correlated variables after
row-centering. This is because the two
variables in the row-centered data satisfy
$X^{r}_1 + X^{r}_2 = \mathbf{0}_n$, hence $X^{r}_1 = -X^{r}_2$.
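The sign flip for p = 2 can be reproduced with simulated data. This is a sketch assuming NumPy; the sample size, seed, and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = x1 + 0.3 * rng.normal(size=50)      # strongly positively correlated with x1
X = np.column_stack([x1, x2])

r_before = np.corrcoef(X, rowvar=False)[0, 1]

X_r = X - X.mean(axis=1, keepdims=True)  # row centering with p = 2
r_after = np.corrcoef(X_r, rowvar=False)[0, 1]

print(r_before, r_after)                 # positive before, essentially -1 after
```

With two variables, each centered row is (d/2, -d/2) where d is the difference of the two original values, so the second column is the negative of the first regardless of the original correlation.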
Two positively correlated variables turn into
two negatively correlated variables
4. Centering and/or scaling observations is
not statistically meaningful because we
cannot attach a unit of measurement to
the mean or the standard deviation of the
observations, since we may be adding
variables that have very different units of
measurement.
5. After row scaling, the observations on the
same variable would have different units
of measurement. Thus, the observations
on the same variable will have a different
origin and/or a different scale.
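Example:
$$X = \begin{pmatrix} 75 & 5 \\ 45 & 5 \\ 20 & 10 \end{pmatrix}, \quad
X^{r} = \begin{pmatrix} 35 & -35 \\ 20 & -20 \\ 5 & -5 \end{pmatrix}, \quad
X^{rs} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & -1/2 \\ 1/2 & -1/2 \end{pmatrix}$$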
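The example can be reproduced as follows. This is a sketch under two assumptions that the extracted slide does not confirm: the data matrix has rows (75, 5), (45, 5), (20, 10), and the rows are scaled by their L1-norm after centering, which is the scaling that yields entries of magnitude 1/2.

```python
import numpy as np

# Assumed reading of the example matrix on the slide.
X = np.array([[75.0, 5.0],
              [45.0, 5.0],
              [20.0, 10.0]])

X_r = X - X.mean(axis=1, keepdims=True)                  # row centering
X_rs = X_r / np.abs(X_r).sum(axis=1, keepdims=True)      # assumed L1-norm row scaling

print(X_r)    # rows (35, -35), (20, -20), (5, -5)
print(X_rs)   # every row becomes (0.5, -0.5)
```

Under these assumptions, three clearly different observations collapse onto the same point after row centering and scaling.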
6. Finally, perhaps the most damaging effect
of centering and/or scaling the
observations is that it distorts the
graphical structure of the observations in
the multidimensional space and
substantially alters the correlation
structure among the variables.
4. The Main Theoretical Results
Theorem 1. Centering the rows of X:
For simplicity of notation, let Y be the matrix
obtained by centering the rows of X or $X^{c}$.
Then the columns of Y are linearly
dependent even if X and $X^{c}$ are of
full column rank.
Theorem 2. Scaling the rows of X by the L1-norm:
For simplicity of notation, let Y be the matrix
obtained by scaling the rows of X or $X^{r}$ by
the L1-norm. Then the rows of Y lie on the
surface of a parallelogram with $2^p$ sides.
Theorem 3. Scaling the rows of X by the L2-norm:
For simplicity of notation, let Y be the matrix
obtained by scaling the rows of X or $X^{r}$ by
the L2-norm. Then the rows of Y lie on the
surface of a sphere in p-dimensional space.
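The geometric statements for L1-norm and L2-norm row scaling can be checked numerically. This is a sketch assuming NumPy and simulated data with p = 3; it verifies that each scaled row satisfies the defining equation of the corresponding surface and is not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))          # hypothetical data, no all-zero rows

Y1 = X / np.abs(X).sum(axis=1, keepdims=True)        # rows scaled by the L1-norm
Y2 = X / np.linalg.norm(X, axis=1, keepdims=True)    # rows scaled by the L2-norm

# Each L1-scaled row y satisfies sign(y) . y = 1, so it lies on the flat face
# of {y : |y_1| + ... + |y_p| = 1} selected by its sign pattern.
signs = np.sign(Y1)
on_l1_surface = np.allclose((signs * Y1).sum(axis=1), 1.0)
faces_hit = {tuple(s) for s in signs.astype(int)}    # at most 2**p sign patterns

# Each L2-scaled row has unit length, so it lies on the unit sphere.
on_unit_sphere = np.allclose((Y2 ** 2).sum(axis=1), 1.0)

print(on_l1_surface, len(faces_hit), 2 ** p, on_unit_sphere)
```

Whatever shape the original point cloud had, both scalings force every observation onto a fixed surface, which is the distortion the theorems describe.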
Theorem 4. Scaling the rows of X using the
standard deviation:
Let $X^{r}_{SD} = S_{SD} X^{r}$ be the matrix obtained by
centering and standardizing the rows of X.
Then the rows of $X^{r}_{SD}$ lie on the surface of a
$(p-1)$-dimensional ellipsoid, centered at the
origin.
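A numerical check of the constraint behind Theorem 4, as a sketch assuming NumPy, simulated data, and the sample standard deviation (divisor p - 1), which the slides do not specify. Under that assumption each centered and standardized row sums to zero and has squared length p - 1, so it lies on a sphere (a special case of an ellipsoid) inside the (p - 1)-dimensional subspace orthogonal to the vector of ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p))                      # hypothetical data

X_r = X - X.mean(axis=1, keepdims=True)          # center each row
sd = X.std(axis=1, ddof=1, keepdims=True)        # assumed sample SD of each row
Y = X_r / sd                                     # centered and standardized rows

sums_to_zero = np.allclose(Y.sum(axis=1), 0.0)           # rows lie in a hyperplane
constant_length = np.allclose((Y ** 2).sum(axis=1), p - 1)  # and at a fixed distance

print(sums_to_zero, constant_length)
```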
5. Illustrative Examples
Example 1. Two Dimensional Data
Example 2. DNA Microarrays Data
The DNA microarrays dataset consists of
genome-wide expression measurements,
where for each of n = 2467 genes, p = 79
measurements have been taken, resulting in
a data matrix X of 2467 rows and 79
columns (The National Academy of
Sciences website, www.pnas.org).
The dataset has been analyzed by many
authors. For example,
Schena et al. (1996), Shalon et al. (1996),
Cho et al. (1998), Eisen et al. (1998),
Spellman et al. (1998). The authors scale the
rows of X by the L2-norm.
Example 2. DNA Microarrays Data p = 3
Example 2. DNA Microarrays Data p = 79
6. Conclusions
1. Centering and/or scaling the rows of X
distorts the graphical structure of the
observations in the multi-dimensional
space and substantially alters the
correlation structure among the
variables.
2. Accordingly, analysts who use such row
centering and/or scaling should first
demonstrate that the process results in
a new, more appropriate structure for
their questions.
Thank You!!