

The feasibility of correlation specifications for binary variables

John Gates

School of Mathematics, Statistics, and Scientific Computing, Greenwich University, London, UK

In attempting to represent a multivariable system by a simpler one with only binary variables, it is not always possible to achieve the required correlations. This paper explores the extra constraints on the pairwise correlations for two and three variables. For correlation specifications meeting the required inequalities, expressions are given for the construction of a suitable probability model.

Keywords: pairwise correlations, binary variables

1. Introduction

There has been a sequence of papers in this journal and elsewhere (Refs. 1-3) attempting to represent the statistical properties of a multivariable system by a set of simpler variables chosen to have the same first and second (and possibly higher order) statistics. The chosen $n$ variables are usually two-valued and their joint behavior is represented by a probability model on the “cube” $2^n$. The purpose of this simplification is that it becomes possible to track these variables explicitly through a nonlinear system and determine the statistical properties of the output. A number of examples of this technique (PEM, the Point Estimation Method) are worked through in Ref. 4.

The purpose of this paper is to draw attention to the extra limitations imposed by using binary variables. There are additional constraints on the pairwise correlations, which mean that the statistics of nonbinary variables cannot always be mimicked by binary variables. For example, with three random variables $X_1, X_2, X_3$, the correlation matrix $(\rho_{ij})$ must always be non-negative definite; multiplying such a matrix on the left and right by $(1\ 1\ 1)$ gives $3 + 2(\rho_{12} + \rho_{13} + \rho_{23}) \ge 0$, so we must have

$$\rho_{12} + \rho_{13} + \rho_{23} \ge -1.5 \tag{1}$$

For Gaussian (multivariate normal) variables, any non-negative definite matrix can be a covariance matrix. This paper shows that this simple result is not true for binary variables; for example, when each variable is symmetric the right-hand side of equation (1) is increased to $-1$. In published examples of the PEM, the problem has led to negative probabilities being fitted (e.g. Ref. 1, p. 152). Of course it is possible to achieve the correlations of $n$ variables by a simpler set of variables taking only $2^n$ $n$-tuples of values.
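As a quick numerical illustration of inequality (1) and of the tighter limit quoted for symmetric binary variables, the following is a minimal sketch in plain Python. The function names and the equicorrelated test triple are illustrative assumptions, not taken from the paper.

```python
def quadratic_form_ones(r12, r13, r23):
    """Value of (1 1 1) R (1 1 1)' for a 3x3 correlation matrix R."""
    return 3.0 + 2.0 * (r12 + r13 + r23)


def passes_general_bound(r12, r13, r23):
    # Inequality (1): needed for any non-negative definite correlation matrix.
    return r12 + r13 + r23 >= -1.5


def passes_symmetric_binary_bound(r12, r13, r23):
    # Tighter limit quoted in the text for symmetric binary variables.
    return r12 + r13 + r23 >= -1.0


# Hypothetical specification: all three correlations equal to -0.4.
rho = (-0.4, -0.4, -0.4)
print(quadratic_form_ones(*rho))
print(passes_general_bound(*rho), passes_symmetric_binary_bound(*rho))
```

The equicorrelated matrix with $\rho = -0.4$ has eigenvalues $1 + 2\rho = 0.2$ and $1 - \rho = 1.4$, so it is a valid correlation matrix, yet its correlation sum of $-1.2$ falls below the symmetric binary limit.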

Address reprint requests to Dr. John Gates at the School of Mathematics, Statistics, and Scientific Computing, Greenwich University, Wellington Street, London SE18 6PF, U.K.

Received 24 June 1994; revised 21 March 1995; accepted 5 May 1995

Appl. Math. Modelling 1995, Vol. 19, September © 1995 by Elsevier Science Inc. 655 Avenue of the Americas, New York, NY 10010

The $n \times n$ covariance matrix can be square-rooted and written (in many ways) as $SS'$. If the transformation $S$ is applied to the $2^n$ vectors $(\pm 1, \pm 1, \ldots, \pm 1)$ we produce a data set with the required covariance matrix. However, each resulting variable will (usually) take $2^n$ values and so not be binary. Lind (Ref. 1, p. 148) shows a similar mapped square for $n = 2$ with each of the two variables taking four values. A limitation of the square-root map method is that we are assigning probability $2^{-n}$ to each of the original cube vertices, so that the resulting variables are symmetric and do not allow us to model skewed variables.
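The square-root map can be sketched for $n = 2$ as follows. This is only an illustration under stated assumptions: a lower-triangular (Cholesky) factor is used as one of the many possible square roots, unit variances are assumed, and the function names are hypothetical.

```python
import itertools
import math


def sqrt_map_points(rho):
    """Map the four vertices (+-1, +-1) through a Cholesky factor S of [[1, rho], [rho, 1]]."""
    s = [[1.0, 0.0], [rho, math.sqrt(1.0 - rho * rho)]]
    points = []
    for v in itertools.product((-1.0, 1.0), repeat=2):
        x = s[0][0] * v[0] + s[0][1] * v[1]
        y = s[1][0] * v[0] + s[1][1] * v[1]
        points.append((x, y))
    return points


def covariance(points):
    """Covariance of equally weighted points (each vertex has probability 1/4)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    return sum((p[0] - mx) * (p[1] - my) for p in points) / n


pts = sqrt_map_points(0.5)
print(covariance(pts))                        # covariance of the mapped data set
print(len({round(p[1], 12) for p in pts}))    # distinct values taken by the second variable
```

With the triangular factor chosen here the first variable stays two-valued while the second takes four values; a symmetric square root would spread both variables over four values, as in the mapped square described above.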

The majority of this paper explores the situation for two and three variables but indicates the nature of the problem generally. A number of formulas appear later; their primary purpose is not to advocate the PEM method but as tools to identify its limitation.

2. Multivariable binary probability models

This section is primarily concerned with conventions and notation. We will be addressing problems where $n$ random binary variables are required to have given means, standard deviations, pairwise correlations, and perhaps skewness values too. By rescaling, let us suppose the binary variables take values of 0 or 1 only and use the usual notation $\Pr(X_i = 1) = p_i = 1 - q_i$. Published attention has often considered the symmetric case ($p_i = \tfrac12$), but we may wish to have skewed variables ($p_i \ne \tfrac12$). We can always assume (if necessary by switching to $1 - X_i$) that $q_i \le \tfrac12$. As I shall mostly be dealing with two or three variables, let us call these $A, B$ (for $n = 2$) or $A, B, C$ (for $n = 3$). For combinations of values of the variables I shall use the notation familiar from experimental design, where the convention is to indicate levels by showing only those variables at the higher level (1). Thus for three variables $A, B, C$:

$ac$ denotes $A = 1, B = 0, C = 1$

$I$ denotes $A = 0, B = 0, C = 0$

0307-904X/95/$10.00 SSDI 0307-904X(95)00081-T


A parenthesis around such symbols will mean “probability of”; thus I write $\Pr(A = 1, B = 1, C = 0)$ as $(ab)$.

For $n = 2$ variables we will have $2^2$ probabilities adding to 1; for $n = 3$, $2^3$ probabilities adding to 1. We will assume the marginal probabilities $p_i, q_i$ are specified, thus adding more constraints.

3. Two binary variables

For binary variables $A$, $B$ with marginal probabilities $(q_1, p_1)$ and $(q_2, p_2)$, respectively, we have

$$(I) + (b) = q_1, \qquad (I) + (a) = q_2$$

together with

$$(I) + (a) + (b) + (ab) = 1$$

and the non-negativity of all the probabilities. Thus there is one degree of freedom, and the possible joint probabilities lie on a segment. To be definite let us assume $q_1 \le q_2$. The covariance is

$$\operatorname{cov}(A, B) = (p_1 q_1 p_2 q_2)^{1/2} \rho(A, B) = (ab) - p_1 p_2$$

and this lies between $-q_1 q_2$ and $p_2 q_1$. Thus the correlation is in the range

$$-\left(\frac{q_1 q_2}{p_1 p_2}\right)^{1/2} \le \rho(A, B) \le \left(\frac{p_2 q_1}{p_1 q_2}\right)^{1/2} \tag{2}$$

For example, if $p_1 = 0.9$, $p_2 = 0.5$, $\rho(A, B)$ would be between $-1/3$ and $1/3$; but if $p_1 = p_2 = 0.5$, we can have the whole range from $-1$ to $1$.
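The attainable range can be computed directly from the bounds on the joint probability $(ab)$. The sketch below is an illustration with a hypothetical function name; it uses the general Fréchet bounds on $\Pr(A = 1, B = 1)$, which reduce to the limits $-q_1 q_2$ and $p_2 q_1$ on the covariance under the ordering assumed in the text.

```python
import math


def correlation_range(p1, p2):
    """Attainable correlation interval for two 0/1 variables with Pr(A=1)=p1, Pr(B=1)=p2."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    sigma = math.sqrt(p1 * q1 * p2 * q2)
    # Frechet bounds on the joint probability (ab) = Pr(A=1, B=1).
    ab_min = max(0.0, p1 + p2 - 1.0)
    ab_max = min(p1, p2)
    return (ab_min - p1 * p2) / sigma, (ab_max - p1 * p2) / sigma


print(correlation_range(0.9, 0.5))   # the skewed example from the text
print(correlation_range(0.5, 0.5))   # the symmetric example from the text
```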

4. Three binary variables-equal marginals

For three variables $A, B, C$ we have three pairwise covariances (and correlations): $\gamma_{12}$ ($\rho_{12}$) between $A$ and $B$, $\gamma_{13}$ ($\rho_{13}$) between $A$ and $C$, and $\gamma_{23}$ ($\rho_{23}$) between $B$ and $C$. In this section we shall assume equal marginals, $q_i = q$ (or, equivalently, assume $A, B, C$ have the same skewness). To achieve the specified marginals we have the constraints

$$
\begin{aligned}
(ab) &= q - (I) - (a) - (b) \\
(ac) &= q - (I) - (a) - (c) \\
(bc) &= q - (I) - (b) - (c) \\
(abc) &= 1 - 3q + (a) + (b) + (c) + 2(I)
\end{aligned}
\tag{3}
$$

Together with the non-negativity conditions this means that the space of probability measures is represented as a polyhedron $\Sigma$ in 4-dimensional space, any particular measure being specified by $(I), (a), (b), (c)$ only. To each probability measure $\mu$ on $2^3$ we have a covariance vector $\gamma = (\gamma_{12}, \gamma_{13}, \gamma_{23})$ and a correlation vector $\rho = (\rho_{12}, \rho_{13}, \rho_{23})$ where

$$
\begin{aligned}
\gamma_{12} &= (ab) + (abc) - p^2 \\
\gamma_{13} &= (ac) + (abc) - p^2 \\
\gamma_{23} &= (bc) + (abc) - p^2
\end{aligned}
\tag{4}
$$

and $\rho_{ij} = \gamma_{ij}/(pq)$.
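Constraints (3) and the covariance formulas (4) translate directly into code. The following is a minimal sketch with hypothetical function and key names; it completes a measure from its four free cells and evaluates the covariance and correlation vectors.

```python
def full_measure(q, I, a, b, c):
    """Complete the eight cell probabilities from (I), (a), (b), (c) using constraints (3)."""
    return {
        "I": I, "a": a, "b": b, "c": c,
        "ab": q - I - a - b,
        "ac": q - I - a - c,
        "bc": q - I - b - c,
        "abc": 1.0 - 3.0 * q + a + b + c + 2.0 * I,
    }


def covariances(q, m):
    """Covariance vector (gamma12, gamma13, gamma23) from equation (4), with p = 1 - q."""
    p = 1.0 - q
    return (m["ab"] + m["abc"] - p * p,
            m["ac"] + m["abc"] - p * p,
            m["bc"] + m["abc"] - p * p)


def correlations(q, m):
    """Correlation vector: each covariance divided by p*q."""
    p = 1.0 - q
    return tuple(g / (p * q) for g in covariances(q, m))


# Example: the measure with (a) = q and (I) = (b) = (c) = 0, at q = 0.25.
q = 0.25
m = full_measure(q, I=0.0, a=q, b=0.0, c=0.0)
print(m)
print(correlations(q, m))
```

The functions do not check non-negativity of the completed cells; a caller would need to do so before treating the result as a probability measure.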


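Table 1. Six basic probability measures $\mu_1$ to $\mu_6$.

| Measure | $(I)$ | $(a)$ | $(b)$ | $(c)$ | $(ab)$ | $(ac)$ | $(bc)$ | $(abc)$ |
|---|---|---|---|---|---|---|---|---|
| $\mu_1$ | 0 | $q$ | 0 | 0 | 0 | 0 | $q$ | $1 - 2q$ |
| $\mu_2$ | 0 | 0 | $q$ | 0 | 0 | $q$ | 0 | $1 - 2q$ |
| $\mu_3$ | 0 | 0 | 0 | $q$ | $q$ | 0 | 0 | $1 - 2q$ |
| $\mu_4$ | $q$ | 0 | 0 | 0 | 0 | 0 | 0 | $1 - q$ |
| $\mu_5$ | 0 | 0 | 0 | 0 | $q$ | $q$ | $q$ | $1 - 3q$ |
| $\mu_6$ | 0 | $\tfrac12 q$ | $\tfrac12 q$ | $\tfrac12 q$ | 0 | 0 | 0 | $1 - \tfrac32 q$ |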

Using $\Gamma$ and $\Pi$ for the sets of permissible covariance and correlation vectors, we have an affine map from $\Sigma$ to $\Gamma$ (or $\Pi$). The program here is to identify the extreme vertices of $\Sigma$ and then of $\Gamma$ and $\Pi$; this will enable us to see which correlation specifications are feasible.

Case 1: $0 \le q \le 1/3$

The last four columns of Table 1 are deducible from the previous four by constraints (3), and $\mu_5$ only exists within the stated $q$ range. For any probability measure $\mu$ on the $2^3$ cube we can write:

If $(bc) \ge (a)$

$$\mu = \frac{1}{q}\{(a)\mu_1 + (b)\mu_2 + (c)\mu_3 + (I)\mu_4 + [(bc) - (a)]\mu_5\}$$

whereas if $(bc) \le (a)$

$$\mu = \frac{1}{q}\{(bc)\mu_1 + (ac)\mu_2 + (ab)\mu_3 + (I)\mu_4 + 2[(a) - (bc)]\mu_6\}$$

This shows that $\Sigma$ is spanned by $\mu_1$ to $\mu_6$; thus $\Gamma$ will be spanned by $\gamma_1$ to $\gamma_6$. Table 2 gives the covariances for each basic measure. We see that

$$\gamma_6 = \tfrac14(\gamma_1 + \gamma_2 + \gamma_3 + \gamma_4)$$

and so $\mu_6$ is mapped into the interior of $\Gamma$. The space $\Gamma$ of permissible covariances is a bipyramid, as shown in Figure 1. It has a central triangle spanned by $\gamma_1$, $\gamma_2$, and $\gamma_3$, with extreme vertices $\gamma_4$ and $\gamma_5$ on opposite sides of this triangle.
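The two-case decomposition of an arbitrary measure into the basic measures of Table 1 can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the rows of `basic_measures` are the Table 1 entries implied by constraints (3), and the factor 2 on $\mu_6$ is the value that the cell-by-cell check requires.

```python
CELLS = ("I", "a", "b", "c", "ab", "ac", "bc", "abc")


def basic_measures(q):
    """Rows of Table 1 as tuples ordered like CELLS (mu5 is a measure only for q <= 1/3)."""
    h = q / 2.0
    return {
        1: (0, q, 0, 0, 0, 0, q, 1 - 2 * q),
        2: (0, 0, q, 0, 0, q, 0, 1 - 2 * q),
        3: (0, 0, 0, q, q, 0, 0, 1 - 2 * q),
        4: (q, 0, 0, 0, 0, 0, 0, 1 - q),
        5: (0, 0, 0, 0, q, q, q, 1 - 3 * q),
        6: (0, h, h, h, 0, 0, 0, 1 - 3 * h),
    }


def decompose(q, m):
    """Weights on mu1..mu6 for a measure m (dict keyed by CELLS) with equal marginals q."""
    if m["bc"] >= m["a"]:
        w = {1: m["a"], 2: m["b"], 3: m["c"], 4: m["I"],
             5: m["bc"] - m["a"], 6: 0.0}
    else:
        w = {1: m["bc"], 2: m["ac"], 3: m["ab"], 4: m["I"],
             5: 0.0, 6: 2.0 * (m["a"] - m["bc"])}
    return {k: v / q for k, v in w.items()}


def recombine(q, weights):
    """Mix the basic measures with the given weights; returns cells ordered like CELLS."""
    basis = basic_measures(q)
    return tuple(sum(weights[k] * basis[k][i] for k in basis)
                 for i in range(len(CELLS)))


# Example: three independent variables with q = 0.25 (cells in multiples of 1/64).
q = 0.25
indep = dict(zip(CELLS, (1 / 64, 3 / 64, 3 / 64, 3 / 64, 9 / 64, 9 / 64, 9 / 64, 27 / 64)))
w = decompose(q, indep)
print(w)
print(recombine(q, w))
```

All weights returned are non-negative and sum to one whenever the input is a genuine measure with equal marginals $q \le 1/3$, which is what the spanning claim asserts.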

Figure 1. The bipyramid $\Gamma$ of permissible covariances (axes $\gamma_{12}$, $\gamma_{13}$, $\gamma_{23}$).


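Table 2.

| Measure | $\gamma_{12}$ | $\gamma_{13}$ | $\gamma_{23}$ |
|---|---|---|---|
| $\mu_1$ | $-q^2$ | $-q^2$ | $q - q^2$ |
| $\mu_2$ | $-q^2$ | $q - q^2$ | $-q^2$ |
| $\mu_3$ | $q - q^2$ | $-q^2$ | $-q^2$ |
| $\mu_4$ | $q - q^2$ | $q - q^2$ | $q - q^2$ |
| $\mu_5$ | $-q^2$ | $-q^2$ | $-q^2$ |
| $\mu_6$ | $q(\tfrac12 - q)$ | $q(\tfrac12 - q)$ | $q(\tfrac12 - q)$ |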

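Thus $\Gamma$ is defined by six constraints corresponding to the faces of the bipyramid; in terms of correlations these are:

$$
\begin{aligned}
-\rho_{12} + \rho_{13} + \rho_{23} &\le 1 \\
\rho_{12} - \rho_{13} + \rho_{23} &\le 1 \\
\rho_{12} + \rho_{13} - \rho_{23} &\le 1 \\
\rho_{12} &\ge -qp^{-1} \\
\rho_{13} &\ge -qp^{-1} \\
\rho_{23} &\ge -qp^{-1}
\end{aligned}
\tag{5}
$$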

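If a given triple $\rho = (\rho_{12}, \rho_{13}, \rho_{23})$ satisfies equation (5), then we can find a probability measure achieving this specification. Since $\Pi$ is made up of two simplexes, our rule for a suitable measure has two cases:

If $-3q \le p(\rho_{12} + \rho_{13} + \rho_{23}) \le 1 - 3q$, use

$$\mu = (p\rho_{23} + q)\mu_1 + (p\rho_{13} + q)\mu_2 + (p\rho_{12} + q)\mu_3 + \{1 - 3q - p(\rho_{12} + \rho_{13} + \rho_{23})\}\mu_5 \tag{6a}$$

whereas if $1 - 3q \le p(\rho_{12} + \rho_{13} + \rho_{23}) \le 3p$, use

$$
\begin{aligned}
\mu = {} & \tfrac12 p(\rho_{23} - \rho_{12} - \rho_{13} + 1)\mu_1 \\
& + \tfrac12 p(\rho_{13} - \rho_{12} - \rho_{23} + 1)\mu_2 \\
& + \tfrac12 p(\rho_{12} - \rho_{13} - \rho_{23} + 1)\mu_3 \\
& + \tfrac12 [p(\rho_{12} + \rho_{13} + \rho_{23}) + 3q - 1]\mu_4
\end{aligned}
\tag{6b}
$$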

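For example, with $p = 0.75$ and $q = 0.25$, and requiring $\rho_{12} = 0.5$, $\rho_{13} = 0.1$, and $\rho_{23} = -0.2$, we have $p(\rho_{12} + \rho_{13} + \rho_{23}) = 0.3 \ge 1 - 3q = 0.25$, so equation (6b) applies and we would use the model

$$\mu = \tfrac38(0.2\mu_1 + 0.8\mu_2 + 1.6\mu_3) + 0.025\mu_4$$

that is, with

$(I) = 0.00625$, $(a) = 0.01875$, $(b) = 0.075$, $(c) = 0.15$,

$(ab) = 0.15$, $(ac) = 0.075$, $(bc) = 0.01875$, $(abc) = 0.50625$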
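The two-case rule (6a)/(6b) and the worked example can be reproduced with a short script. This is a sketch under stated assumptions: the function names are hypothetical, the cell formulas come from mixing the rows of Table 1, and feasibility of the requested correlations is not checked here.

```python
def model_weights(q, r12, r13, r23):
    """Weights on (mu1, mu2, mu3, mu4, mu5) following the two-case rule (6a)/(6b)."""
    p = 1.0 - q
    s = p * (r12 + r13 + r23)
    if s <= 1.0 - 3.0 * q:
        # Rule (6a): mu4 is not used.
        return (p * r23 + q, p * r13 + q, p * r12 + q, 0.0, 1.0 - 3.0 * q - s)
    # Rule (6b): mu5 is not used.
    half_p = 0.5 * p
    return (half_p * (r23 - r12 - r13 + 1.0),
            half_p * (r13 - r12 - r23 + 1.0),
            half_p * (r12 - r13 - r23 + 1.0),
            0.5 * (s + 3.0 * q - 1.0),
            0.0)


def cell_probabilities(q, w):
    """Mix the basic measures; cells ordered (I), (a), (b), (c), (ab), (ac), (bc), (abc)."""
    w1, w2, w3, w4, w5 = w
    return (w4 * q, w1 * q, w2 * q, w3 * q,
            (w3 + w5) * q, (w2 + w5) * q, (w1 + w5) * q,
            (w1 + w2 + w3) * (1 - 2 * q) + w4 * (1 - q) + w5 * (1 - 3 * q))


weights = model_weights(0.25, 0.5, 0.1, -0.2)
print(weights)
print(cell_probabilities(0.25, weights))
```

For a specification satisfying inequalities (5), the weights are non-negative and sum to one, so the mixture is a genuine probability measure.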

Case 2: $1/3 \le q \le 1/2$

There are now nine basic probability measures, defined in Table 3.

The polyhedron $\Sigma$ is not a simple simplex; it has three sections. In the appendix we show how the measures in Table 3 span $\Sigma$. The covariances for each of the basic measures are given in Table 4.

Since $\gamma_6 = \tfrac14(\gamma_1 + \gamma_2 + \gamma_3 + \gamma_4)$ and $\gamma_7 = \tfrac12(3q - 1)\gamma_4 + \tfrac12(1 - q)(\gamma_8 + \gamma_9 + \gamma_{10})$, $\gamma_6$ and $\gamma_7$ are not extremal vertices of $\Gamma$. The polyhedron $\Gamma$ is spanned by $\gamma_1$ to $\gamma_4$ and $\gamma_8$ to $\gamma_{10}$. For $q$ in the current range, the $\mu_5$ defined in Table 1 does not really exist but can be conveniently thought of as a “virtual” vertex of $\Sigma$, and $\gamma_5$ as a virtual vertex of $\Gamma$. The polyhedron $\Gamma$ now has the form of a truncated bipyramid, sketched in Figure 2.

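$\Gamma$ (and so $\Pi$) now has seven boundary faces, and thus $\Pi$ can be defined by seven inequalities:

$$
\begin{aligned}
\rho_{12} + \rho_{13} + \rho_{23} &\ge (3pq - 1)p^{-1}q^{-1} \\
-\rho_{12} + \rho_{13} + \rho_{23} &\le 1 \\
\rho_{12} - \rho_{13} + \rho_{23} &\le 1 \\
\rho_{12} + \rho_{13} - \rho_{23} &\le 1 \\
\rho_{12} &\ge -qp^{-1} \\
\rho_{13} &\ge -qp^{-1} \\
\rho_{23} &\ge -qp^{-1}
\end{aligned}
\tag{7}
$$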
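Inequalities (5) and (7) give a direct feasibility test for a correlation triple under equal marginals. The following sketch uses a hypothetical function name and a small numerical tolerance (an assumption, to absorb rounding in the inputs).

```python
def is_feasible_equal_marginals(q, r12, r13, r23, tol=1e-12):
    """Check inequalities (5) (q <= 1/3) or (7) (1/3 <= q <= 1/2) for a correlation triple."""
    if not 0.0 < q <= 0.5:
        raise ValueError("expects 0 < q <= 1/2")
    p = 1.0 - q
    checks = [
        -r12 + r13 + r23 <= 1.0 + tol,
        r12 - r13 + r23 <= 1.0 + tol,
        r12 + r13 - r23 <= 1.0 + tol,
        r12 >= -q / p - tol,
        r13 >= -q / p - tol,
        r23 >= -q / p - tol,
    ]
    if q > 1.0 / 3.0:
        # Extra face of the truncated bipyramid: first line of (7).
        checks.append(r12 + r13 + r23 >= (3.0 * p * q - 1.0) / (p * q) - tol)
    return all(checks)


print(is_feasible_equal_marginals(0.25, 0.5, 0.1, -0.2))
print(is_feasible_equal_marginals(0.5, -0.4, -0.4, -0.4))
```

At $q = 1/3$ the extra face coincides with the vertex $\gamma_5$, so the two branches agree there.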

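Table 4. Covariance $\gamma_i$ of each basic measure $\mu_i$.

| Measure | $\gamma_{12}$ | $\gamma_{13}$ | $\gamma_{23}$ |
|---|---|---|---|
| $\mu_1$ | $-q^2$ | $-q^2$ | $q - q^2$ |
| $\mu_2$ | $-q^2$ | $q - q^2$ | $-q^2$ |
| $\mu_3$ | $q - q^2$ | $-q^2$ | $-q^2$ |
| $\mu_4$ | $q - q^2$ | $q - q^2$ | $q - q^2$ |
| $\mu_6$ | $q(\tfrac12 - q)$ | $q(\tfrac12 - q)$ | $q(\tfrac12 - q)$ |
| $\mu_7$ | $(1 - q)(q - \tfrac12)$ | $(1 - q)(q - \tfrac12)$ | $(1 - q)(q - \tfrac12)$ |
| $\mu_8$ | $-q^2$ | $-q^2$ | $3q - 1 - q^2$ |
| $\mu_9$ | $-q^2$ | $3q - 1 - q^2$ | $-q^2$ |
| $\mu_{10}$ | $3q - 1 - q^2$ | $-q^2$ | $-q^2$ |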

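Table 3. Nine basic probability measures for $1/3 \le q \le 1/2$.

| Measure | $(I)$ | $(a)$ | $(b)$ | $(c)$ | $(ab)$ | $(ac)$ | $(bc)$ | $(abc)$ |
|---|---|---|---|---|---|---|---|---|
| $\mu_1$ | 0 | $q$ | 0 | 0 | 0 | 0 | $q$ | $1 - 2q$ |
| $\mu_2$ | 0 | 0 | $q$ | 0 | 0 | $q$ | 0 | $1 - 2q$ |
| $\mu_3$ | 0 | 0 | 0 | $q$ | $q$ | 0 | 0 | $1 - 2q$ |
| $\mu_4$ | $q$ | 0 | 0 | 0 | 0 | 0 | 0 | $1 - q$ |
| $\mu_6$ | 0 | $\tfrac12 q$ | $\tfrac12 q$ | $\tfrac12 q$ | 0 | 0 | 0 | $1 - \tfrac32 q$ |
| $\mu_7$ | $\tfrac12(3q - 1)$ | 0 | 0 | 0 | $\tfrac12 p$ | $\tfrac12 p$ | $\tfrac12 p$ | 0 |
| $\mu_8$ | 0 | $3q - 1$ | 0 | 0 | $1 - 2q$ | $1 - 2q$ | $q$ | 0 |
| $\mu_9$ | 0 | 0 | $3q - 1$ | 0 | $1 - 2q$ | $q$ | $1 - 2q$ | 0 |
| $\mu_{10}$ | 0 | 0 | 0 | $3q - 1$ | $q$ | $1 - 2q$ | $1 - 2q$ | 0 |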


Figure 2. The truncated bipyramid $\Gamma$ of permissible covariances (axes $\gamma_{12}$, $\gamma_{13}$, $\gamma_{23}$).

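In the very special case of $q = \tfrac12$ (in which each variable is symmetric), $\gamma_8 = \gamma_1$, $\gamma_9 = \gamma_2$, and $\gamma_{10} = \gamma_3$, so $\Gamma$ and $\Pi$ become a simple simplex and the defining conditions in equations (7) reduce to

$$
\begin{aligned}
\rho_{12} + \rho_{13} + \rho_{23} &\ge -1 \\
-\rho_{12} + \rho_{13} + \rho_{23} &\le 1 \\
\rho_{12} - \rho_{13} + \rho_{23} &\le 1 \\
\rho_{12} + \rho_{13} - \rho_{23} &\le 1
\end{aligned}
\tag{8}
$$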

5. Three binary variables-unequal marginals

The situation now becomes more complicated, depending partly on the relative values of $q_1, q_2, q_3$. The extreme probability measures will occur when four of the eight variables $(I), (a), (b), \ldots, (abc)$ are zero, together with the relationships

$$
\begin{aligned}
(ab) &= q_3 - (I) - (a) - (b) \\
(ac) &= q_2 - (I) - (a) - (c) \\
(bc) &= q_1 - (I) - (b) - (c) \\
(abc) &= 1 - q_1 - q_2 - q_3 + (a) + (b) + (c) + 2(I)
\end{aligned}
\tag{9}
$$

Let us assume $0 < q_1 < q_2 < q_3$ and also that $q_1 + q_2 + q_3 < 1$; then $(abc) > 0$ and there are only four zero variables out of seven to consider. Most of the solutions are infeasible. If we further consider the case $q_1 + q_2 < q_3$, then there are eight basic solutions $\nu_1$ to $\nu_8$, given in Table 5.


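For each of these basic cases the correlations can be calculated by

$$
\begin{aligned}
\rho_{12} &= [(ab) + (abc) - p_1 p_2]/\sigma_{12} \\
\rho_{13} &= [(ac) + (abc) - p_1 p_3]/\sigma_{13} \\
\rho_{23} &= [(bc) + (abc) - p_2 p_3]/\sigma_{23}
\end{aligned}
\tag{10}
$$

where $\sigma_{ij} = (p_i q_i p_j q_j)^{1/2}$. For the relative $q_i$ values stated before Table 5, all the measures produce extreme correlation vertices of $\Pi$; the feasible region appears as a cuboid with two corners sheared off. Feasible correlations are defined by the conditions

$$
\begin{aligned}
\text{C1:}\quad & q_1 q_2 + \rho_{12}\sigma_{12} \ge 0 \\
\text{C2:}\quad & p_2 q_1 - \rho_{12}\sigma_{12} \ge 0 \\
\text{C3:}\quad & q_1 q_3 + \rho_{13}\sigma_{13} \ge 0 \\
\text{C4:}\quad & p_3 q_1 - \rho_{13}\sigma_{13} \ge 0 \\
\text{C5:}\quad & q_2 q_3 + \rho_{23}\sigma_{23} \ge 0 \\
\text{C6:}\quad & p_3 q_2 - \rho_{23}\sigma_{23} \ge 0 \\
\text{C7:}\quad & p_1 - p_1 p_2 - p_1 p_3 + p_2 p_3 - \sigma_{12}\rho_{12} - \sigma_{13}\rho_{13} + \sigma_{23}\rho_{23} \ge 0 \\
\text{C8:}\quad & p_2 - p_1 p_2 + p_1 p_3 - p_2 p_3 - \sigma_{12}\rho_{12} + \sigma_{13}\rho_{13} - \sigma_{23}\rho_{23} \ge 0
\end{aligned}
\tag{11}
$$

As a numerical example, suppose we wish to have $q_1 = 0.12$, $q_2 = 0.23$, and $q_3 = 0.4$; then Table 6 gives the correlations for the basic measures and the values of the left-hand sides of conditions C1 to C8, so we can identify tight and slack conditions.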
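The correlation rows of Table 6 can be regenerated from the basic measures of Table 5 and equation (10). The sketch below is illustrative: the function names are hypothetical, and the rows of `table5_measures` are the Table 5 entries as implied by the marginal constraints (9).

```python
import math


def table5_measures(q1, q2, q3):
    """Basic measures nu1..nu8; cells ordered (I), (a), (b), (c), (ab), (ac), (bc), (abc)."""
    return [
        (0, 0, 0, 0, q3, q2, q1, 1 - q1 - q2 - q3),
        (0, 0, 0, q1, q3, q2 - q1, 0, 1 - q2 - q3),
        (0, 0, q1, 0, q3 - q1, q2, 0, 1 - q2 - q3),
        (0, q2, 0, 0, q3 - q2, 0, q1, 1 - q1 - q3),
        (0, q2 - q1, 0, q1, q3 - q2 + q1, 0, 0, 1 - q1 - q3),
        (q1, 0, 0, 0, q3 - q1, q2 - q1, 0, 1 + q1 - q2 - q3),
        (q1, q2 - q1, 0, 0, q3 - q2, 0, 0, 1 - q3),
        (0, q2, q1, 0, q3 - q1 - q2, 0, 0, 1 - q3),
    ]


def correlations(cells, q1, q2, q3):
    """Equation (10): pairwise correlations of the 0/1 variables A, B, C."""
    _, _, _, _, ab, ac, bc, abc = cells
    p1, p2, p3 = 1 - q1, 1 - q2, 1 - q3
    s12 = math.sqrt(p1 * q1 * p2 * q2)
    s13 = math.sqrt(p1 * q1 * p3 * q3)
    s23 = math.sqrt(p2 * q2 * p3 * q3)
    return ((ab + abc - p1 * p2) / s12,
            (ac + abc - p1 * p3) / s13,
            (bc + abc - p2 * p3) / s23)


for k, nu in enumerate(table5_measures(0.12, 0.23, 0.4), start=1):
    print(k, [round(r, 4) for r in correlations(nu, 0.12, 0.23, 0.4)])
```

Each printed triple corresponds to one column of the correlation rows in Table 6; the slacks C1 to C8 follow by substituting these correlations into the conditions.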

6. Realizing a specification

If a correlation specification is feasible it will usually be attainable in an infinite number of ways (there are $2^n - 1 - \tfrac12 n(n + 1)$ degrees of freedom). If extra arbitrary constraints are imposed, one is taking a slice of $\Sigma$ which may or may not map onto the full correlation space $\Pi$. That is, under additional arbitrary constraints, a specification may appear to be infeasible whereas in fact it is feasible.

If the required correlation $\rho$ is expressed as a convex combination of basic $\rho_i$, then the same combination of $\mu_i$ will be a probability measure producing the required result. We saw this in an earlier section with $q_i = 0.25$. In the case of equal marginal probabilities in the range $1/3 \le q \le 1/2$ the representation is a little more awkward due to the shape of $\Pi$. As a convenience we can use the virtual measure $\mu_5$

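Table 5. Eight basic solutions $\nu_1$ to $\nu_8$.

| Measure | $(I)$ | $(a)$ | $(b)$ | $(c)$ | $(ab)$ | $(ac)$ | $(bc)$ | $(abc)$ |
|---|---|---|---|---|---|---|---|---|
| $\nu_1$ | 0 | 0 | 0 | 0 | $q_3$ | $q_2$ | $q_1$ | $1 - q_1 - q_2 - q_3$ |
| $\nu_2$ | 0 | 0 | 0 | $q_1$ | $q_3$ | $q_2 - q_1$ | 0 | $1 - q_2 - q_3$ |
| $\nu_3$ | 0 | 0 | $q_1$ | 0 | $q_3 - q_1$ | $q_2$ | 0 | $1 - q_2 - q_3$ |
| $\nu_4$ | 0 | $q_2$ | 0 | 0 | $q_3 - q_2$ | 0 | $q_1$ | $1 - q_1 - q_3$ |
| $\nu_5$ | 0 | $q_2 - q_1$ | 0 | $q_1$ | $q_3 - q_2 + q_1$ | 0 | 0 | $1 - q_1 - q_3$ |
| $\nu_6$ | $q_1$ | 0 | 0 | 0 | $q_3 - q_1$ | $q_2 - q_1$ | 0 | $1 + q_1 - q_2 - q_3$ |
| $\nu_7$ | $q_1$ | $q_2 - q_1$ | 0 | 0 | $q_3 - q_2$ | 0 | 0 | $1 - q_3$ |
| $\nu_8$ | 0 | $q_2$ | $q_1$ | 0 | $q_3 - q_1 - q_2$ | 0 | 0 | $1 - q_3$ |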


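Table 6. Correlations and values of conditions C1 to C8 for the basic measures ($q_1 = 0.12$, $q_2 = 0.23$, $q_3 = 0.4$).

| | $\nu_1$ | $\nu_2$ | $\nu_3$ | $\nu_4$ | $\nu_5$ | $\nu_6$ | $\nu_7$ | $\nu_8$ |
|---|---|---|---|---|---|---|---|---|
| $\rho_{12}$ | -0.2018 | 0.6757 | -0.2018 | -0.2018 | 0.6757 | 0.6757 | 0.6757 | -0.2018 |
| $\rho_{13}$ | -0.3015 | -0.3015 | 0.4523 | -0.3015 | -0.3015 | 0.4523 | 0.4523 | 0.4523 |
| $\rho_{23}$ | -0.4462 | -0.4462 | -0.4462 | 0.6694 | 0.0873 | 0.1358 | 0.6694 | 0.6694 |
| C1 | 0 | 0.12 | 0 | 0 | 0.12 | 0.12 | 0.12 | 0 |
| C2 | 0.12 | 0 | 0.12 | 0.12 | 0 | 0 | 0 | 0.12 |
| C3 | 0 | 0 | 0.12 | 0 | 0 | 0.12 | 0.12 | 0.12 |
| C4 | 0.12 | 0.12 | 0 | 0.12 | 0.12 | 0 | 0 | 0 |
| C5 | 0 | 0 | 0 | 0.23 | 0.11 | 0.12 | 0.23 | 0.23 |
| C6 | 0.23 | 0.23 | 0.23 | 0 | 0.12 | 0.11 | 0 | 0 |
| C7 | 0.12 | 0 | 0 | 0.35 | 0.11 | 0 | 0.11 | 0.23 |
| C8 | 0.23 | 0.11 | 0.35 | 0 | 0 | 0.11 | 0 | 0.12 |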

defined in Table 1 as a component that is unacceptable in itself but acceptable when combined with other sufficiently positive terms; that is, we can use equations (6a) and (6b) anyway.

As an example, suppose we have $q = 0.45$ and we wish to produce a probability model with

$$\gamma = (-0.0675, 0.0275, -0.1675)$$

or

$$\rho = (-0.2727, 0.1111, -0.6768)$$

Then $p\sum\rho_{ij} = -0.4611$, which is less than $1 - 3q\ (= -0.35)$, and thus we should use equation (6a). We can then write down $\rho$ as a combination of $\rho_1$, $\rho_2$, $\rho_3$, $\rho_5$

$$\rho = \frac{1}{0.45}\{0.035\rho_1 + 0.23\rho_2 + 0.135\rho_3 + 0.05\rho_5\}$$

and formally use the same combination of $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_5$ even though $\mu_5$ now has a negative entry.
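That the resulting combination is a genuine probability model with the requested covariances can be confirmed directly from its cell probabilities. The sketch below uses hypothetical function names; the cell values are those of the fitted model $\mu$ in the last row of Table 7, with $(I) = 0$.

```python
def marginals(c):
    """Pr(A=1), Pr(B=1), Pr(C=1) from the eight cell probabilities."""
    return (c["a"] + c["ab"] + c["ac"] + c["abc"],
            c["b"] + c["ab"] + c["bc"] + c["abc"],
            c["c"] + c["ac"] + c["bc"] + c["abc"])


def pairwise_covariances(c):
    """Covariances (gamma12, gamma13, gamma23) computed without assuming equal marginals."""
    pa, pb, pc = marginals(c)
    return (c["ab"] + c["abc"] - pa * pb,
            c["ac"] + c["abc"] - pa * pc,
            c["bc"] + c["abc"] - pb * pc)


mu = {"I": 0.0, "a": 0.035, "b": 0.23, "c": 0.135,
      "ab": 0.185, "ac": 0.28, "bc": 0.085, "abc": 0.05}
print(min(mu.values()) >= 0.0, sum(mu.values()))
print(marginals(mu))
print(pairwise_covariances(mu))
```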

We see that we do obtain a genuine probability model with the required correlations. In the very special case of $q = \tfrac12$, $\rho_1$ to $\rho_4$ span $\Pi$ in a simple way and equation (6b) becomes

$$
\begin{aligned}
\mu = {} & \tfrac14(\rho_{23} - \rho_{12} - \rho_{13} + 1)\mu_1 \\
& + \tfrac14(\rho_{13} - \rho_{12} - \rho_{23} + 1)\mu_2 \\
& + \tfrac14(\rho_{12} - \rho_{13} - \rho_{23} + 1)\mu_3 \\
& + \tfrac14(\rho_{12} + \rho_{13} + \rho_{23} + 1)\mu_4
\end{aligned}
\tag{12}
$$

which is actually the same as given in Harr (Ref. 4, p. 218). If the required correlations come from nonbinary variables, they may not be achievable by binary variables, in which case we could take a “close” feasible point in $\Pi$. Taking the closest feasible point would be a quadratic programming problem; probably the simplest thing is to scale down the required $\rho$ until it is inside $\Pi$. For example,

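Table 7.

| Measure | $(I)$ | $(a)$ | $(b)$ | $(c)$ | $(ab)$ | $(ac)$ | $(bc)$ | $(abc)$ |
|---|---|---|---|---|---|---|---|---|
| $\mu_1$ | 0 | 0.45 | 0 | 0 | 0 | 0 | 0.45 | 0.1 |
| $\mu_2$ | 0 | 0 | 0.45 | 0 | 0 | 0.45 | 0 | 0.1 |
| $\mu_3$ | 0 | 0 | 0 | 0.45 | 0.45 | 0 | 0 | 0.1 |
| $\mu_5$ | 0 | 0 | 0 | 0 | 0.45 | 0.45 | 0.45 | -0.35 |
| $\mu$ | 0 | 0.035 | 0.23 | 0.135 | 0.185 | 0.28 | 0.085 | 0.05 |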

Harr required, with $q = \tfrac12$, that $\rho_{12} = 0.89$, $\rho_{23} = 0.89$, and $\rho_{13} = 0.75$. Although the correlation matrix is positive definite, we have $\rho_{13} - \rho_{12} - \rho_{23} = -1.03 < -1$, giving an improper solution from equation (12). If we relax the correlations toward the origin by a factor of 1.03 and require $\rho_{12} = 0.8641 = \rho_{23}$ and $\rho_{13} = 0.7282$, we get a feasible specification achieved by

$(I) = (abc) = 0.43203$,

$(a) = (bc) = 0.03398 = (c) = (ab)$,

$(b) = (ac) = 0$
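For symmetric variables the scaling factor can be read off from inequalities (8): each condition bounds a linear form in the correlations by 1, so dividing by the largest such form is the smallest relaxation that works. The following is a minimal sketch with a hypothetical function name.

```python
def min_scale_factor(r12, r13, r23):
    """Smallest k >= 1 such that (r12, r13, r23)/k satisfies the four inequalities (8)."""
    lhs = (
        -(r12 + r13 + r23),      # from r12 + r13 + r23 >= -1
        -r12 + r13 + r23,
        r12 - r13 + r23,
        r12 + r13 - r23,
    )
    # Each condition reads lhs/k <= 1, so k must be at least every lhs (and at least 1).
    return max(1.0, *lhs)


k = min_scale_factor(0.89, 0.75, 0.89)
print(k)
print([round(r / k, 4) for r in (0.89, 0.75, 0.89)])
```

Scaling by exactly this factor places the specification on the boundary of $\Pi$, which is why one of the basic measures receives zero weight in the fitted model above.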

7. Summary and conclusion

A range of inequality constraints for correlations has been developed for two or three binary variables. These enable the feasibility of a correlation specification to be determined; in these cases a method of achieving the specification is given. The cases considered in this paper indicate the complexity of the situation and the care required to produce a proper model.

References

1. Lind, N. C. Modelling of uncertainty in discrete dynamical systems. Appl. Math. Modelling 1983, 7

2. Rosenblueth, E. Two-point estimates in probabilities. Appl. Math. Modelling 1981, 5

3. Harr, M. E. Probabilistic estimates for multivariate analysis. Appl. Math. Modelling 1989, 13

4. Harr, M. E. Reliability-Based Design in Civil Engineering, McGraw-Hill, New York, 1987

Appendix: Spanning $\Sigma$ for $1/3 \le q \le 1/2$

$\Sigma$ consists of three parts; these can be defined by the “level” variable

$$L = (I) + (a) + (b) + (c) = (a) - (bc) + q$$

We refer to the measures defined in Table 3. If $L > q$ we use

$$\mu = \frac{1}{q}\{(bc)\mu_1 + (ac)\mu_2 + (ab)\mu_3 + (I)\mu_4 + 2(L - q)\mu_6\}$$


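If $3q - 1 \le L \le q$, we use

$$\mu = \frac{1}{L}\{\theta[(a)\mu_1 + (b)\mu_2 + (c)\mu_3] + (1 - \theta)[(a)\mu_8 + (b)\mu_9 + (c)\mu_{10}] + (I)[\phi\mu_4 + (1 - \phi)\mu_7]\}$$

where

$$\theta = \frac{L + 1 - 3q}{1 - 2q}, \qquad \phi = \frac{2L + 1 - 3q}{1 - q}$$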
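The middle-section representation can be checked cell by cell. The sketch below is an illustration under stated assumptions: the two mixing fractions are named `theta` and `phi` here, the overall factor is taken as $1/L$ (the value that makes the weights sum to one), the basic measures are the Table 3 rows implied by constraints (3), and $q < 1/2$ is assumed so that the fractions are defined.

```python
def table3_measures(q):
    """Basic measures used in the middle section (cells: I, a, b, c, ab, ac, bc, abc)."""
    p, t, u = 1 - q, 3 * q - 1, 1 - 2 * q
    return {
        1: (0, q, 0, 0, 0, 0, q, u),
        2: (0, 0, q, 0, 0, q, 0, u),
        3: (0, 0, 0, q, q, 0, 0, u),
        4: (q, 0, 0, 0, 0, 0, 0, p),
        7: (t / 2, 0, 0, 0, p / 2, p / 2, p / 2, 0),
        8: (0, t, 0, 0, u, u, q, 0),
        9: (0, 0, t, 0, u, q, u, 0),
        10: (0, 0, 0, t, q, u, u, 0),
    }


def middle_section_mix(q, I, a, b, c):
    """Rebuild a measure with 3q-1 <= L <= q from its free cells (I), (a), (b), (c)."""
    L = I + a + b + c
    theta = (L + 1 - 3 * q) / (1 - 2 * q)
    phi = (2 * L + 1 - 3 * q) / (1 - q)
    m = table3_measures(q)
    w = {1: theta * a, 2: theta * b, 3: theta * c,
         8: (1 - theta) * a, 9: (1 - theta) * b, 10: (1 - theta) * c,
         4: phi * I, 7: (1 - phi) * I}
    return tuple(sum(w[k] * m[k][i] for k in w) / L for i in range(8))


# Hypothetical measure with q = 0.4 and L = 0.3, inside the middle section.
print(middle_section_mix(0.4, I=0.1, a=0.1, b=0.05, c=0.05))
```

The returned tuple should match the measure obtained by completing the four free cells with constraints (3), which for this example is $(0.1, 0.1, 0.05, 0.05, 0.15, 0.15, 0.2, 0.2)$.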


If $L \le 3q - 1$, we use

Other representations are possible as $\Sigma$ is not a simple simplex.
