

First Al-Khawarezmi Conference: Qatar, December 6-8, 2010 Ali Hadi


The Effects of Centering and Scaling the Rows of Multidimensional Data on Their Graphical and Correlation Structures

Ali S. Hadi and Rida Moustafa

[email protected]@cornell.edu

www.aucegypt.edu/faculty/hadi


Outline of the Talk

1. Introduction

2. Types of Centering and/or Scaling

3. Effects of Centering and/or Scaling

4. The Main Theoretical Results

5. Illustrative Examples

6. Conclusions


1. Introduction

Before performing certain statistical analysis methods (e.g., principal components and factor analyses), it may be necessary to preprocess the data to make them suitable for the analysis.


Examples:

• Data editing
• Imputation of missing values
• Transformation
• Identification of outliers
• Centering and/or scaling (e.g., Rao, 2005)
• etc.


Given an $n \times p$ data matrix $\mathbf{X}$, which represents $n$ multivariate observations on $p$ variables, the columns and/or the rows of $\mathbf{X}$ may be centered and/or scaled before applying a statistical method to the data matrix $\mathbf{X}$.


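2. Types of Centering and/or Scaling

1. Column (Variable) Centering:

$\mathbf{X}_c = (\mathbf{I}_n - n^{-1}\mathbf{1}_n\mathbf{1}_n^T)\,\mathbf{X}$

The ij-th element of $\mathbf{X}_c$ can be written as $x_{ij} - \bar{x}_{.j}$, where $\bar{x}_{.j}$ is the mean of the j-th column.

Hence we have: $\mathbf{1}_n^T\mathbf{X}_c = \mathbf{0}_p^T$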


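2. Row (Observation) Centering:

$\mathbf{X}_r = \mathbf{X}\,(\mathbf{I}_p - p^{-1}\mathbf{1}_p\mathbf{1}_p^T)$

The ij-th element of $\mathbf{X}_r$ can be written as $x_{ij} - \bar{x}_{i.}$, where $\bar{x}_{i.}$ is the mean of the i-th row.

Hence we have: $\mathbf{X}_r\mathbf{1}_p = \mathbf{0}_n$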


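3. Column and Row Centering:

$\mathbf{X}_{rc} = (\mathbf{I}_n - n^{-1}\mathbf{1}_n\mathbf{1}_n^T)\,\mathbf{X}\,(\mathbf{I}_p - p^{-1}\mathbf{1}_p\mathbf{1}_p^T)$

The ij-th element of $\mathbf{X}_{rc}$ can be written as $x_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x}_{..}$, where $\bar{x}_{..}$ is the mean of all elements of $\mathbf{X}$.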
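The three centering operations above can be sketched with explicit centering matrices. This is a minimal NumPy illustration, not code from the talk; the function names (`centering_matrix`, `center_columns`, `center_rows`, `center_both`) and the random test matrix are assumptions made for the example.

```python
import numpy as np

def centering_matrix(k):
    """Return the k x k centering matrix I_k - (1/k) 1 1^T."""
    return np.eye(k) - np.ones((k, k)) / k

def center_columns(X):
    """Column centering: premultiply by the n x n centering matrix."""
    return centering_matrix(X.shape[0]) @ X

def center_rows(X):
    """Row centering: postmultiply by the p x p centering matrix."""
    return X @ centering_matrix(X.shape[1])

def center_both(X):
    """Column and row centering applied together."""
    n, p = X.shape
    return centering_matrix(n) @ X @ centering_matrix(p)

# Hypothetical 5 x 3 data matrix used only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Xc, Xr, Xrc = center_columns(X), center_rows(X), center_both(X)

print(Xc.sum(axis=0))  # column sums of the column-centered matrix
print(Xr.sum(axis=1))  # row sums of the row-centered matrix
```

Both printed vectors should be zero up to rounding, which is the numerical counterpart of the two "Hence we have" identities. In practice one would subtract means directly rather than build the centering matrices, since those matrices cost memory quadratic in n or p.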


4. Row Scaling (each row of $\mathbf{X}$ or $\mathbf{X}_r$):

This can be obtained by:

a. Scaling by the L1-norm
b. Scaling by the L2-norm
c. Scaling by the standard deviation (SD)


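4a. Scaling the rows of the matrix $\mathbf{X}$ or $\mathbf{X}_r$ by the L1-norm of its rows as follows:

$\mathbf{X}_{L1} = \mathbf{S}_{L1}\mathbf{X}$ or $\mathbf{X}^r_{L1} = \mathbf{S}^r_{L1}\mathbf{X}_r$

where $\mathbf{S}_{L1}$ (or $\mathbf{S}^r_{L1}$) is an $n \times n$ diagonal matrix with its i-th diagonal element equal to the reciprocal of the L1-norm of the i-th row of $\mathbf{X}$ (or $\mathbf{X}_r$).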


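4b. Scaling the rows of the matrix $\mathbf{X}$ or $\mathbf{X}_r$ by the L2-norm of its rows as follows:

$\mathbf{X}_{L2} = \mathbf{S}_{L2}\mathbf{X}$ or $\mathbf{X}^r_{L2} = \mathbf{S}^r_{L2}\mathbf{X}_r$

where $\mathbf{S}_{L2}$ (or $\mathbf{S}^r_{L2}$) is an $n \times n$ diagonal matrix with its i-th diagonal element equal to the reciprocal of the L2-norm of the i-th row of $\mathbf{X}$ (or $\mathbf{X}_r$).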


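4c. Scaling the rows of the matrix $\mathbf{X}$ or $\mathbf{X}_r$ by the standard deviation (SD) of its rows:

$\mathbf{X}_{SD} = \mathbf{S}_{SD}\mathbf{X}$ or $\mathbf{X}^r_{SD} = \mathbf{S}_{SD}\mathbf{X}_r$

where $\mathbf{S}_{SD}$ is an $n \times n$ diagonal matrix with its i-th diagonal element equal to the reciprocal of the standard deviation of the i-th row of $\mathbf{X}$.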
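The three row-scaling variants (4a to 4c) differ only in the diagonal matrix that premultiplies the data. A minimal NumPy sketch follows; the helper name `row_scaler` and the small example matrix are hypothetical, and the SD is taken with divisor p - 1, which is an assumption because the slides do not state the divisor.

```python
import numpy as np

def row_scaler(X, kind):
    """Return the n x n diagonal matrix whose i-th diagonal entry is the
    reciprocal of the L1-norm, L2-norm, or SD of the i-th row of X."""
    if kind == "L1":
        d = np.abs(X).sum(axis=1)
    elif kind == "L2":
        d = np.sqrt((X ** 2).sum(axis=1))
    elif kind == "SD":
        d = X.std(axis=1, ddof=1)  # assumption: sample SD, divisor p - 1
    else:
        raise ValueError("kind must be 'L1', 'L2', or 'SD'")
    if np.any(d == 0):
        raise ValueError("a row has zero norm or zero SD and cannot be scaled")
    return np.diag(1.0 / d)

# Hypothetical 2 x 3 data matrix used only for illustration.
X = np.array([[3.0, -4.0, 0.0],
              [1.0, 2.0, 2.0]])

X_L1 = row_scaler(X, "L1") @ X   # every row has L1-norm 1
X_L2 = row_scaler(X, "L2") @ X   # every row has L2-norm 1
X_SD = row_scaler(X, "SD") @ X   # every row has SD 1
```

The same function applies to a row-centered matrix by passing it in place of `X`. Row centering shifts each row by a constant, so it leaves the row SDs unchanged.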


5. Standardizing the variables (each column of $\mathbf{X}$):

$\mathbf{X}_{cs} = \mathbf{X}_c\mathbf{S}_p$

where $\mathbf{S}_p$ is a $p \times p$ diagonal matrix with its j-th diagonal element equal to the reciprocal of the standard deviation of the j-th column of $\mathbf{X}$.


6. Centering and standardizing both rows and columns of $\mathbf{X}$:

This is obtained by an iterative standardization process of rows and columns until the rows and columns are approximately standardized.
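The slides do not spell out the iterative process, so the following is only one plausible reading: alternate row standardization and column standardization until the matrix stops changing. The function names, the stopping rule, the use of the population SD (divisor equal to the row or column length), and the random test matrix are all assumptions of this sketch.

```python
import numpy as np

def standardize(A, axis):
    """Center and scale A along `axis` (axis=1: each row, axis=0: each column)."""
    A = A - A.mean(axis=axis, keepdims=True)
    return A / A.std(axis=axis, keepdims=True)  # population SD (ddof=0)

def iterate_standardization(X, max_iter=500, tol=1e-10):
    """Alternate row and column standardization until successive iterates agree."""
    Y = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        Y_new = standardize(standardize(Y, axis=1), axis=0)  # rows, then columns
        if np.max(np.abs(Y_new - Y)) < tol:
            return Y_new
        Y = Y_new
    return Y  # not converged within max_iter; returned as is

# Hypothetical 8 x 6 data matrix used only for illustration.
rng = np.random.default_rng(3)
Y = iterate_standardization(rng.normal(size=(8, 6)))

print(np.abs(Y.mean(axis=1)).max(), np.abs(Y.std(axis=1) - 1).max())
```

Because the last step standardizes the columns, the columns are standardized exactly while the rows are only approximately standardized, which matches the word "approximately" on the slide. The sketch will divide by zero if any row or column becomes constant.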


The above row and column transformations have been used by several authors in practical applications, for example: Holter et al. (2000), Wen et al. (2007), Pielou (1984), Jackson (1991), Pyle (1999), van der Werf, Jellema, and Hankemeier (2005), and van den Berg et al. (2006).


We argue that, of the 11 methods of centering/scaling the data matrix mentioned above, only two (those which deal with the columns of the data) are meaningful. The other 9, which deal with the rows of the data, are not generally recommended.

Why? Next slide.


3. Effects of Centering and/or Scaling

There are several reasons why centering and/or scaling the rows is not a good thing to do:

1. Centering and scaling the observations is not always possible. For example, when an observation consists of the same numerical value on all dimensions, centering will replace the observation by zeros and scaling would not be possible.
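A small numerical illustration of this failure case, with a hypothetical 2 x 3 matrix whose first row takes the same value on every dimension (the data and variable names are assumptions of the sketch):

```python
import numpy as np

# Hypothetical data: the first observation is constant across all variables.
X = np.array([[2.0, 2.0, 2.0],
              [1.0, 4.0, 7.0]])

Xr = X - X.mean(axis=1, keepdims=True)   # row centering
row_sd = Xr.std(axis=1, ddof=1)          # row standard deviations
row_l2 = np.linalg.norm(Xr, axis=1)      # row L2-norms after centering

# A row can be scaled only if its scaling factor is nonzero.
scalable = row_sd > 0

print(Xr[0])      # the constant row becomes all zeros
print(scalable)   # the constant row cannot be scaled
```

The first centered row is the zero vector, so its norm and SD are zero and the reciprocal needed for scaling does not exist.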


2. Even when centering rows is possible, it creates a perfect collinearity among the variables even if the original variables are orthogonal. This is because $\mathbf{X}_r\mathbf{1}_p = \mathbf{0}_n$ for any data matrix $\mathbf{X}$.
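The loss of rank can be checked numerically. The sketch below builds a matrix with orthonormal columns from a QR factorization of random data (the data, seed, and variable names are assumptions), row-centers it, and compares the ranks before and after.

```python
import numpy as np

# Hypothetical 10 x 4 matrix with orthonormal (hence orthogonal) columns.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(10, 4)))

Xr = Q - Q.mean(axis=1, keepdims=True)   # row centering

rank_before = np.linalg.matrix_rank(Q)   # full column rank, p = 4
rank_after = np.linalg.matrix_rank(Xr)   # drops to p - 1 = 3

print(rank_before, rank_after)
print(np.abs(Xr @ np.ones(4)).max())     # the columns of Xr sum to the zero vector
```

The columns of the row-centered matrix always add up to the zero vector, so one variable is an exact linear combination of the others regardless of how the original variables were related.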


3. In addition, it alters the correlation structure among the variables. For example, when $p = 2$, two positively correlated variables will turn into two perfectly negatively correlated variables after row-centering. This is because the two variables in the row-centered data satisfy $\mathbf{X}^r_1 + \mathbf{X}^r_2 = \mathbf{0}_n$, hence $\mathbf{X}^r_1 = -\mathbf{X}^r_2$.
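The sign flip is easy to reproduce for two variables. In this sketch the simulated data, the seed, and the variable names are assumptions chosen only to give a clearly positive correlation before centering.

```python
import numpy as np

# Hypothetical data: two positively correlated variables, n = 200.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.5 * rng.normal(size=200)
X = np.column_stack([x1, x2])

Xr = X - X.mean(axis=1, keepdims=True)   # row centering

corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Xr, rowvar=False)[0, 1]

print(corr_before)  # positive
print(corr_after)   # -1 up to rounding
```

After row centering the first column equals $(x_1 - x_2)/2$ and the second equals its negative, so the correlation is exactly $-1$ whatever the original correlation was.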


Two positively correlated variables turn into two negatively correlated variables.


4. Centering and/or scaling observations is not statistically meaningful. We cannot attach a unit of measurement to the mean or the standard deviation of an observation, because computing them may add together variables that have very different units of measurement.


5. After row scaling, the observations on the same variable would have different units of measurement. Thus, the observations on the same variable will have a different origin and/or a different scale.


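Example:

$\mathbf{X} = \begin{pmatrix} 20 & 10 \\ 45 & 5 \\ 75 & 5 \end{pmatrix}, \quad \mathbf{X}_r = \begin{pmatrix} 5 & -5 \\ 20 & -20 \\ 35 & -35 \end{pmatrix}, \quad \mathbf{X}_{rs} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$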
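The example can be reproduced numerically. The matrix entries on this slide were damaged in extraction, so the values below, rows (20, 10), (45, 5), (75, 5), are an assumption recovered from the surviving digits. The use of the sample SD (divisor p - 1) for the scaling step is also an assumption.

```python
import numpy as np

# Assumed example matrix (entries recovered from a damaged slide).
X = np.array([[20.0, 10.0],
              [45.0, 5.0],
              [75.0, 5.0]])

Xr = X - X.mean(axis=1, keepdims=True)            # row centering
Xrs = Xr / Xr.std(axis=1, ddof=1, keepdims=True)  # row scaling by the sample SD

print(Xr)    # rows (5, -5), (20, -20), (35, -35)
print(Xrs)   # every row is the same vector
```

Three observations that differ widely in the original data become indistinguishable after row centering and scaling, which illustrates how the transformation discards the origin and scale of each observation.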


6. Finally, perhaps the most damaging effect of centering and/or scaling the observations is that it distorts the graphical structure of the observations in the multidimensional space and substantially alters the correlation structure among the variables.


4. The Main Theoretical Results

Theorem 1. Centering the rows of $\mathbf{X}$:

For simplicity of notation, let $\mathbf{Y}$ be the matrix obtained by centering the rows of $\mathbf{X}$ or $\mathbf{X}_c$. Then the columns of $\mathbf{Y}$ are linearly dependent even if $\mathbf{X}$ and $\mathbf{X}_c$ are of full column rank.
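A short argument for Theorem 1 follows directly from the definition of row centering. It is a sketch written for this transcript, not the proof given in the talk, and it uses $\mathbf{1}_p$ for the $p$-vector of ones and $\mathbf{0}_n$ for the $n$-vector of zeros.

```latex
\[
\mathbf{Y}\mathbf{1}_p
  = \mathbf{X}\left(\mathbf{I}_p - p^{-1}\mathbf{1}_p\mathbf{1}_p^{T}\right)\mathbf{1}_p
  = \mathbf{X}\left(\mathbf{1}_p - p^{-1}\mathbf{1}_p\,(\mathbf{1}_p^{T}\mathbf{1}_p)\right)
  = \mathbf{X}\left(\mathbf{1}_p - \mathbf{1}_p\right)
  = \mathbf{0}_n .
\]
```

The same steps hold with $\mathbf{X}_c$ in place of $\mathbf{X}$. Since $\mathbf{1}_p \neq \mathbf{0}$, the columns of $\mathbf{Y}$ admit a nontrivial linear combination equal to zero, so $\operatorname{rank}(\mathbf{Y}) \le p - 1$.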


Theorem 2. Scaling the rows of $\mathbf{X}$ by the L1-norm:

For simplicity of notation, let $\mathbf{Y}$ be the matrix obtained by scaling the rows of $\mathbf{X}$ or $\mathbf{X}_r$ by the L1-norm. Then the rows of $\mathbf{Y}$ lie on the surface of a parallelogram with $2^p$ sides.


Theorem 3. Scaling the rows of $\mathbf{X}$ by the L2-norm:

For simplicity of notation, let $\mathbf{Y}$ be the matrix obtained by scaling the rows of $\mathbf{X}$ or $\mathbf{X}_r$ by the L2-norm. Then the rows of $\mathbf{Y}$ lie on the surface of a sphere in p-dimensional space.


Theorem 4. Scaling the rows of $\mathbf{X}$ using the standard deviation:

Let $\mathbf{X}^r_{SD} = \mathbf{S}_{SD}\mathbf{X}_r$ be the matrix obtained by centering and standardizing the rows of $\mathbf{X}$. Then the rows of $\mathbf{X}^r_{SD}$ lie on the surface of a $(p-1)$-dimensional ellipsoid, centered at the origin.
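The constraints behind Theorems 2 to 4 can be written out for a single scaled row $(y_{i1}, \dots, y_{ip})$. This is a sketch rather than the proofs from the talk. The Theorem 4 line assumes the row standard deviation $s_i$ is computed with divisor $p - 1$, which the slides do not state.

```latex
\begin{align*}
\text{Theorem 2 (L1-norm):}\quad & \sum_{j=1}^{p} |y_{ij}| = 1, \\
\text{Theorem 3 (L2-norm):}\quad & \sum_{j=1}^{p} y_{ij}^{2} = 1, \\
\text{Theorem 4 (SD):}\quad & y_{ij} = \frac{x_{ij} - \bar{x}_{i.}}{s_i},
  \qquad s_i^{2} = \frac{1}{p-1}\sum_{j=1}^{p} (x_{ij} - \bar{x}_{i.})^{2}, \\
 & \Longrightarrow\quad \sum_{j=1}^{p} y_{ij} = 0
  \quad\text{and}\quad \sum_{j=1}^{p} y_{ij}^{2} = p - 1 .
\end{align*}
```

Under that assumption every row in Theorem 4 lies in the $(p-1)$-dimensional hyperplane orthogonal to $\mathbf{1}_p$ and at distance $\sqrt{p-1}$ from the origin. In each case the scaled observations are confined to a fixed surface that does not depend on the data.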


5. Illustrative Examples

Example 1. Two Dimensional Data


Example 2. DNA Microarrays Data

The DNA microarrays dataset consists of genome-wide expression measurements, where for each of n = 2467 genes, p = 79 measurements have been taken, resulting in a data matrix $\mathbf{X}$ of 2467 rows and 79 columns (The National Academy of Sciences Website www.pnas.org).


The dataset has been analyzed by many authors, for example Schena et al. (1996), Shalon et al. (1996), Cho et al. (1998), Eisen et al. (1998), and Spellman et al. (1998). The authors scale the rows of $\mathbf{X}$ by the L2-norm.


Example 2. DNA Microarrays Data p = 3


Example 2. DNA Microarrays Data p = 79


6. Conclusions

1. Centering and/or scaling the rows of $\mathbf{X}$ distorts the graphical structure of the observations in the multidimensional space and substantially alters the correlation structure among the variables.


2. Accordingly, analysts who use such row centering and/or scaling should first demonstrate that the process results in a new, more appropriate structure for their questions.


Thank You!!