estimating tree-structured covariance matrices via mixed...

DEPARTMENT OF STATISTICS

University of Wisconsin

1300 University Ave.

Madison, WI 53706

TECHNICAL REPORT NO. 1142

July 1, 2008

Estimating Tree-Structured Covariance Matrices via

Mixed-Integer Programming with an Application to

Phylogenetic Analysis of Gene Expression

Hector Corrada Bravo1

Department of Computer Sciences, University of Wisconsin, Madison WI

Kevin H. Eng2

Department of Statistics and Department of Biostatistics and Medical Informatics,

University of Wisconsin, Madison WI

Sunduz Keles3

Department of Statistics and Department of Biostatistics and Medical Informatics,

University of Wisconsin, Madison WI

Grace Wahba1

Department of Statistics, Department of Computer Science and Department of Biostatistics and

Medical Informatics, University of Wisconsin, Madison WI

Stephen Wright4

Department of Computer Sciences, University of Wisconsin, Madison WI

1Research supported in part by NIH Grant EY09946, NSF Grant DMS-0604572 and ONR Grant N0014-

06-00952Research supported in part by NSF Grant DMS-06045723Research supported in part by PhRMA Foundation Research Starter Grant in Informatics and NIH RO1

grant HG0037474Research supported in part by NSF Grants DMS-0427689, CCF-0430504, CTS-0456694, CNS-0540147,

and DOE Grant DE-FG02-04ER25627

Estimating Tree-Structured Covariance Matrices via Mixed-Integer

Programming with an Application to Phylogenetic Analysis of

Gene Expression

Hector Corrada Bravo∗, Stephen WrightDepartment of Computer SciencesUniversity of Wisconsin-MadisonMadison, WI 53706-1685, USA

Kevin H. Eng, Sunduz Keles, Grace WahbaDepartment of Statistics and Biostatistics and Medical Informatics

University of Wisconsin-MadisonMadison, WI 53706-1685, USA

Abstract

We present a novel method for estimating tree-structured covariance matrices directly fromobserved continuous data. A representation of these classes of matrices as linear combinationsof rank-one matrices indicating object partitions is used to formulate estimation as instances ofwell-studied numerical optimization problems.

In particular, we present estimation based on projection where the covariance estimate isthe nearest tree-structured covariance matrix to an observed sample covariance matrix. Theproblem is posed as a linear or quadratic mixed-integer program (MIP) where a setting of theinteger variables in the MIP specifies a set of tree topologies of the structured covariance matrix.We solve these problems to optimality using efficient and robust existing MIP solvers. We alsoshow that the least squares distance method of Fitch and Margoliash (1967) can be formulated asa quadratic MIP and thus solved exactly using existing, robust branch-and-bound MIP solvers.

Our motivation for this method is the discovery of phylogenetic structure directly fromgene expression data. Recent studies have adapted traditional phylogenetic comparative anal-ysis methods to expression data. Typically, these methods first estimate a phylogenetic treefrom genomic sequence data and subsequently analyze expression data. A covariance matrixconstructed from the sequence-derived tree is used to correct for the lack of independence in phy-logenetically related taxa. However, recent results have shown that the hierarchical structure ofsequence-derived tree estimates are highly sensitive to the genomic region chosen to build them.To circumvent this difficulty, we propose a stable method for deriving tree-structured covariancematrices directly from gene expression as an exploratory step that can guide investigators intheir modelling choices for these types of comparative analysis.

We present a case study in phylogenetic analysis of expression in yeast gene families. Ourmethod is able to corroborate the presence of phylogenetic structure in the response of expressionin a subset of the gene families under particular experimental conditions. Additionally, whenused in conjunction with transcription factor occupancy data, our methods show that alternativemodelling choices should be considered when creating sequence-derived trees for this comparativeanalysis.

∗Corresponding Author, [email protected]

1

1 Introduction

Recent studies have adapted existing techniques in population genetics to perform evolutionaryanalysis of gene expression (Fay and Wittkopp, 2007; Gu, 2004; Oakley et al., 2005; Rifkin et al.,2003; Whitehead and Crawford, 2006). In particular, corrections for evolutionary dependencebetween taxa, e.g. species or strains, are used in regression (generalized least squares) or otherlikelihood models. These phylogenetic corrections are well accepted methodologies in phenotypicmodeling (Felsenstein et al., 2004), since, without them, statistical analysis is subject to increasedfalse positive rates and decreased power for hypothesis tests. These corrections take the form of acovariance matrix corresponding to a random diffusion process along a phylogenetic tree.

Evolutionary studies of gene expression so far assume that the single phylogenetic tree structureunderlying the data is known and typically derived from DNA or amino acid sequence data. Whilethis assumption might be valid for the analysis of coarse traits–beak size in birds, for example–as in traditional comparative phylogenetic studies, it might prove too restrictive when carryingout similar analysis at the genomic level. Especially, taking into account recent findings of highvariability in tree topology and branch length estimates contingent on the genomic region usedto estimate the phylogeny (Frazer et al., 2004; Habib et al., 2007; Yalcin et al., 2004). If weare interested in a particular group of genes, given that they are spread throughout the genome,it makes more sense to develop a covariance estimate appropriate to those genes. We present aprincipled way of estimating tree-structured covariance matrices directly from sample covariancesof observed gene expression data. As an exploratory step, this can help investigators circumventissues that arise from estimating a global phylogeny from sequence data in an independent previousstep.

In this paper, we formulate the problem of estimating a tree-structured covariance matrix asmixed-integer programs (MIP) (Bertsimas and Weismantel, 2005; Wolsey and Nemhauser, 1999).In particular, we look at projection problems that estimate the nearest matrix in the structuredclass to the observed sample covariance. These problems lead to linear or quadratic mixed integerprograms for which algorithms for global solutions are well known and reliable production codesexist. The formulation of these problems hinges on a representation of a tree-structured covariancematrix as a linear expansion of outer products of indicator vectors that specify nested partitions ofobjects.

The paper is organized as follows. In Section 2.1 we formulate the representation of tree-structured covariance matrices and give some results regarding the space of such matrices. Sec-tion 2.5 shows how to define the constraints that ensure matrices are tree-structured as constraintsin mixed-integer programs (MIPs). Projection problems are specifically addressed in Section 2.5.3.We present our results on a case study on phylogenetic analysis of expression in yeast gene familiesin Section 3. A discussion, including related work, follows in Section 4. Appendix A presents sim-ulation results on estimating the tree topology from observed data that show how our MIP-basedmethod compares favorably to the the well-known Neighbor-Joining method (Saitou, 1987) usingdistances computed from the observed covariances. Finally, Appendices B and C contain runningtimes and implementation details respectively of the MIP solver used in the experimental resultsof Section 3.

2

2 Materials and Methods

2.1 Tree-Structured Covariance Matrices

Our objects of study are covariance matrices of diffusion processes defined over trees (Cavalli-Sforza and Edwards, 1967; Felsenstein et al., 2004). Usually, a Brownian motion assumption ismade on the diffusion process where steps are independent and normally distributed with meanzero. However, covariance matrices of diffusion processes with independent steps, mean zero andfinite variance will also have the structure we are studying here. We do not make any normalityassumptions on the diffusion process and, accordingly, fit covariance matrices by minimizing aprojection objective instead of maximizing a likelihood function. Thus, for a tree T defined overp objects, our assumption is that the observed data are realizations of a random variable Y ∈ Rp

with Cov(Y ) = B, where B is a tree-structured covariance matrix defined by T .Figure 1 shows a tree with four leaves, corresponding to a diffusion process for four objects. A

rooted tree defines a set of nested partitions of objects such that each node in the tree (both interiorand leaves) corresponds to a subset of these objects. In our example, the lower branch exiting theroot corresponds to subset {1, 2}. The root of the tree corresponds to the set of all objects whileeach leaf node corresponds to a singleton set. The subset corresponding to an interior node is theunion of the non-overlapping subsets of that node’s children. Edges are labeled with nonnegativereal numbers indicating tree branch lengths.

1

2

3

4

a12

a1

a2

a34

a3

a4

(a)

B =

a12 + a1 a12 0 0a12 a12 + a2 0 00 0 a34 + a3 a34

0 0 a34 a34 + a4

(b)

Figure 1: A schematic example of a phylogenetic tree and corresponding covariance matrix. Theroot is the leftmost node, while leaves are the rightmost nodes. Branch lengths are arbitrarynonnegative real numbers.

Denoting B = Cov(Y ), entry Bij is the sum of branch lengths for the path starting at the rootand ending at the last common ancestor of leaves i and j. In our example, B12 = a12 is the lengthof the branch from the root to the node above leaves 1 and 2. For leaf i, Bii is the sum of thebranch lengths of the path from root to leaf. The covariance matrix B for our example tree is givenin Figure 1(b). If we swap the positions of labels 3 and 4 in our example tree such that label 3 isthe topmost label and construct a covariance matrix accordingly we recover the same covariance

3

matrix B. In fact, any tree that specifies this particular set of nested partitions and branch lengthsgenerates the same covariance matrix. All trees that define the same set of nested partitions aresaid to be of the same topology, and we say that covariance matrices that are generated from treeswith the same topology belong to the same class. However, a tree topology that specifies a differentset of nested partitions generates a different class of covariance matrices. For example, Figure 2shows a tree that defines a different set of nested partitions and the matrix it generates.

1

2

3

4

a1

a234

a2

a34

a3

a4

(a)

A =

a1 0 0 00 a234 + a2 a234 a234

0 a234 a234 + a34 + a3 a234 + a34

0 a234 a234 + a34 a234 + a34 + a4

(b)

Figure 2: An example phylogenetic tree with different topology and corresponding covariancematrix.

2.2 Representing Tree-Structured Covariance Matrices

Let d =[a12 a34 a1 a2 a3 a4

]T be a column vector containing the branch lengths of the treein Figure 1. We can write B =

∑6k=1 dkM

k where Mk is a matrix such that Mki,j = 1 if objects i

and j co-occur in the subset corresponding to the node where branch k ends. For the branch withlength a12, we have

M1 =

1 1 0 01 1 0 00 0 0 00 0 0 0

.

Furthermore, we can use indicator vectors vk to specify the Mk matrices in the linear expansionof B as outer products of vk with itself. For example, letting v1 =

[1 1 0 0

]T , we get

M1 = v1vT1 =

1100

[1 1 0 0].

4

Thus, using vectors vk we can write B =∑6

k=1 dkvkvTk and defining matrices V =

[v1 v2 . . . v6

]and D = diag(d), we can equivalently write

B = V DV T . (1)

For Figure 1, the complete expansion is given by

V =

1 1 0 1 0 0 01 1 0 0 1 0 01 0 1 0 0 1 01 0 1 0 0 0 1

, D = diag([0 a12 a34 a1 a2 a3 a4

]T ). (2)

Since the basis matrix V in Equation (1) is determined by the nested partitions defined bythe corresponding tree topology, all covariance matrices of the same class are generated by linearexpansions of a corresponding matrix V with branch lengths specified in the diagonal matrix D.On the other hand, a distinct basis matrix V corresponds to each distinct tree topology. Matricesspanned by the set of matrices V that correspond to valid partitions are tree-structured covariancematrices. We now characterize this set of valid V matrices by defining a partition property, andgive a representation theorem for tree-structured covariance matrices based on this property.

Definition 1 (Partition Property) A basis matrix V of size p-by-(2p− 1) with entries in {0, 1}and unique columns has the partition property for trees of size p if it satisfies the following condi-tions:

• V contains the vector of all ones e = (1, 1, . . . , 1)T ∈ Rp as a column, and

• for every column w in V with more than one non-zero entry, it contains columns u and vsuch that u+ v = w.

A matrix V with the partition property can be constructed by starting with the column e ∈ Rp andsplitting it into two nonzero columns u and v with u+ v = e. These form the next two columns ofV . The remaining columns of V are generated by splitting previously unsplit columns recursivelyinto the sum of two nonzero columns, until we finally obtain columns with a single nonzero. It iseasy to see that the total number of splits is p − 1, with two columns generated at each split. Itfollows that V does not contain the the zero column, and contains all p vectors that contain p− 1zero terms and a single entry of 1. For example, the V matrix in Equation (2) can be constructedby starting with column 1, splitting into columns 2 and 3, and then splitting each recursively toobtain the remaining four columns.

Theorem 2 (Tree Covariance Representation) A matrix B is a tree-structured covariancematrix if and only if B = V DV T where D is a diagonal matrix with nonnegative entries andthe basis matrix V has the partition property.

Proof Assume B is a tree-structured covariance matrix, then construct matrix V using the methodabove starting from the root, splitting each vector according to the nested partitions at each node.By construction, V will satisfy the partition property and by placing branch lengths in diagonalmatrix D we will have B = V DV T . On the other hand, let B = V DV T with D diagonal and Vhaving the partition property. Then construct a tree by the reverse construction: starting at theroot and vector e ∈ Rp, create a nested partition from the vectors u and v such that u+v = e whichmust exist since V has the partition property. Define branch lengths from D correspondingly, andcontinue this construction recursively. B will then be the covariance matrix defined by the resultingtree and therefore be tree-structured.

5

2.3 Characteristics of the Set of Tree-Structured Covariance Matrices

We now state some facts about the set of tree-structured covariance matrices which we make useof in our estimation procedures.

Proposition 3 The set of tree-structured covariance matrices B = V DV T generated by a singlebasis matrix V is convex.

Proof Let d1 and d2 be the branch length vectors of tree-structured covariance matrices B1 =V diag(d1)V T and B2 = V diag(d2)V T . Let θ ∈ [0, 1], then B = θB1 + (1−θ)B2 = V diag(θd1 + (1−θ)d2)V T . So, B is a tree of the same structure with branch lengths given by θd1 + (1− θ)d2.

We will use this fact to express estimation problems for trees of fixed topology as convexoptimization problems. However, estimation of general tree-structured covariance matrices is notso simple, as the set of all tree-structured covariance matrices is not convex in general. We can seethat this is true in the case p = 3 by considering the following example. Defining

V1 =

0 0 1 1 10 1 0 1 11 0 0 0 1

, V2 =

0 0 1 0 10 1 0 1 11 0 0 1 1

,

we see that V1 and V2 both have the partition property. Therefore by Theorem 2, the matricesB1 = V1diag(d1)V T

1 and B2 = V2diag(d2)V T2 are both tree-structured covariance matrices when

d1 and d2 contain nonnegative entries. If B is a convex combination of B1 and B2, we will haveB12 6= 0 and B23 6= 0 but B13 = 0. It is not possible to identify a matrix V with the partitionproperty such that B = V DV T , since any such V may contain only a single column apart fromthe three “unit” columns (1, 0, 0)T , (0, 1, 0)T , and (0, 0, 1)T , and none of the possible candidatesfor this additional column (namely, (1, 1, 0)T , (1, 0, 1)T , and (0, 1, 1)T ) can produce the requirednonzero pattern for B. This example can be extended trivially to successively higher dimensions pby expanding V1 and V2 appropriately.

2.4 Fixed Topology Projection Problems

In this section, we address the problem of estimating a tree-structured covariance matrix from aknown tree topology by minimizing the distance to an observed sample covariance matrix. Thatis, given a sample covariance matrix S and a basis matrix V , we find the nearest tree-structuredcovariance matrix in norm ‖·‖. We will look at problems using Frobenius norm, ‖B‖F =

√∑ij B

2ij ,

and sum-absolute-value (sav) norm, ‖B‖sav =∑

ij |Bij |.As stated above, the set of covariance matrices corresponding to trees of a particular topology

is convex. Since projection problems have convex objective functions, they are convex optimizationproblems for any norm ‖·‖. While our emphasis in this paper is optimization over the non-convex setof all tree-structured covariance matrices, it is illustrative to show the convex optimization problemformulations for projection in Frobenius and sum-absolute-value norm with fixed-topologies.

For Frobenius norm, given a covariance matrix S, the nearest tree-structured covariance matrixB in the class determined by basis matrix V is given by the branch length vector that solves theproblem

mind∈R2p−1

‖S − V diag(d)V T ‖2Fs.t. d ≥ 0.

6

We can simplify this to the following equivalent quadratic problem:

mind∈R2p−1

dTQd− 2cTd

s.t. d ≥ 0,

where Q = (V TV )◦ (V TV ) and c = diag(V TSV ) with ◦ denoting element-wise (Hadamard) matrixmultiplication. For sav norm, the branch lengths d corresponding to the nearest tree-structuredmatrix in the proper class are given by the solution to the following problem:

mind∈R2p−1

‖S − V diag(d)V T ‖sav

s.t. d ≥ 0.

Letting s ∈ Rp(p+1)/2 be the vectorization of symmetric matrix S, we can we can rewrite this asthe following linear problem:

mind∈R2p−1

p,q∈Rp(p+1)/2

eT (p+ q)

s.t.[H I −I

] dpq

= s

d ≥ 0, p ≥ 0, q ≥ 0,

where the row of H corresponding to Sij is V·i ◦ V·j and e is the vector of all ones of appropriatelength.

2.5 Solving Estimation by Projection for Unknown Tree Topologies using Mixed-Integer Programming

The non-convexity of the set of tree-structured covariance matrices requires estimation proceduresthat handle the combinatorial nature of optimization over this set. We model these problems asmixed-integer programs (MIPs). In particular, we make use of the fact that algorithms for mixed-integer linear and quadratic programs are well understood and that robust production codes forsolving them are available.

2.5.1 Mixed-Integer Programming

Mixed-integer programs (MIPs) place integrality constraints on some of the problem variables. Thegeneral statement of a MIP is:

minx∈Rn

f0(x) (3a)

s.t. gi(x) ≤ 0, i = 1, 2, . . . ,m (3b)xj ∈ Z, j = 1, 2, . . . , t, (3c)

for some t ≤ n. The functions gi are (smooth) constraint functions and f0 is the objective function(also assumed to be smooth), and Z is the set of integers. When f0 and gi, i = 1, . . . ,m, are linear,we have a mixed-integer linear program (MILP), and when f0 is quadratic and gi, i = 1, . . . ,m, arelinear, we have a mixed-integer quadratic program (MIQP). We will see that projection problems

7

for tree-structured covariance matrices are MILPs for the sav norm and MIQPs for the Frobeniusnorm.

Although the problem (3) is intractable in general, many practical instances can be solved, andalgorithms for finding solutions have been the subject of intense research for 50 years (see for exam-ple Wolsey and Nemhauser (1999)). Current state-of-the-art software combines two methodologies:branch-and-bound and branch-and-cut. Branch-and-bound is based on construction of a tree1 ofrelaxations of the problem (3), where each node of the tree contains a subproblem in which some ofthe integer variables xj are allowed to take non-integer values (but may be confined to some range).A node is a child of another node in the tree if there is exactly one component xj that is fixed at aninteger value in the current node but that is a continuous variable in the parent node. In the rootnode of the tree, all integer variables are relaxed and allowed to take non-integer values, while at theleaf nodes, all integer variables xj , j = 1, 2, . . . , t are fixed at certain values. Each node of the treeis therefore a continuous linear program (with real variables), so it can be “evaluated” using thesimplex method, usually by modifying the solution of its parent node. The optimal objective at anode gives a lower bound on the optimal objectives of any of its descendants, since the descendantshave fewer degrees of freedom (that is, a more restricted feasible set). Hence, if this lower bound isworse than the best integer feasible solution found to date, this node and all its descendants can be“pruned” from the tree; it is not necessary to evaluate them as they cannot contain the solution of(3). The branch-and-bound algorithm traverses this tree judiciously, avoiding evaluation of largeparts of the tree that are determined not to contain the optimal solution.

Cutting planes are used to enhance the speed of this process. These are additional constraintsthat exclude from the feasible set those values of x that are determined not to be optimal. Cutscan be valid for the whole tree, or just at a certain node and its descendants.

The branching strategy which determines the order in which the search tree is traversed, andthe method of construction of cutting planes, can have significant effects on the efficiency of theMIP solver for particular problems. In Appendix C, we provide details regarding the parameterschosen in our MIP solver for the projection problems we address here.

2.5.2 Mixed-Integer Constraints for Tree Topology

Every tree-structured covariance matrix satisfies the following properties derived from the linearexpansion in Equation (1):

1. Bij ≥ 0 for all i and j, since all entries in V and d are nonnegative.

2. Bii ≥ Bij for all i and j, since V has the partition property, every component of d thatis added to an off-diagonal entry is added to the corresponding diagonal entries along withthe component of d corresponding to the column in V with a single non-zero entry for thecorresponding leaves.

3. Bij ≥ min(Bik, Bjk) for all i, j, and k, with i 6= j 6= k. Since V has the partition property, thenfor every three off-diagonal entries there is one entry that has at least one fewer componentof d added in than the other two components.

Since every tree-structured covariance matrix can be expressed as B = V DV T according to The-orem 2, it is also positive semidefinite, since V DV T =

∑i diviv

Ti is the sum of positive semidefinite

matrices. Also, the three properties above follow from the expansion B = V DV T , therefore any1The tree referred to in this paragraph is a tree of related relaxations of the MIP, not a phylogenetic tree.

8

matrix that satisfies these properties is also positive semidefinite, so we need not add semidefinite-ness constraints in the optimization problems below. Therefore, we can solve estimation problemsfor unknown tree topologies by constraining covariance matrices to satisfy the above properties.However, the third constraint is not convex, so we use integrality constraints to model it.

We begin by rewriting this third constraint for each distinct triplet i > j > k as a disjunctionof three constraints:

Bij ≥ Bik = Bjk (4a)Bik ≥ Bij = Bjk (4b)Bjk ≥ Bij = Bik.. (4c)

This can be derived by noting that the third property above holds for all orderings of the given i,j, and k thus preventing any one of the values Bij , Bik, Bjk from being strictly smaller than theother two values, leading to a tie for the smallest value.

A standard way of modeling disjunctions is to use {0, 1} variables in the optimization prob-lem (Bertsimas and Weismantel, 2005). In our case we can use two integer variables ρijk1 and ρijk2,under the constraint that ρijk1 + ρijk2 ≤ 1, that is, they can both be 0, or, strictly one of the twois allowed to take the value 1. With these binary variables we can write the constraints (4) in away that the constraint corresponding to the nonzero-valued binary variable must be satisfied. Forexample, constraint (4a) is transformed to:

Bij ≥ Bik − (1− ρijk1)MBik ≥ Bjk − (1− ρijk1)MBjk ≥ Bik − (1− ρijk1)M,

where M is a very large positive constant. Constraints (4b) and (4c) are transformed similarly,yielding the full set of mixed-integer constraints in Table 1. When ρijk1 = 1, these constraints implythat constraint 4a is satisfied. However, since ρijk1 = 1 we must have ρijk2 = 0 which implies thatconstraints 4b and 4c need not be satisfied for a solution to be feasible. When ρijk1 = ρijk2 = 0,then constraint 4c must be satisfied.

2.5.3 Projection Problems

Let S be a sample covariance matrix, the nearest tree structured covariance matrix in norm ‖ · ‖to S is given by the solution of the mixed-integer problem:

minB∈Sp

‖S −B‖

s.t. constraints 5a-5m hold for B.

For Frobenius norm ‖ · ‖F , the problem reduces to a mixed-integer quadratic program. Lets2 be the vectorization of symmetric matrix S such that ‖S‖F = ‖s2‖2, then the nearest tree-structured covariance matrix in Frobenius norm to matrix S is given by the corresponding matrixrepresentation of solution b of the following mixed integer quadratic program:

minb∈Rp(p+1)/2,ρ∈Rp

12bT b− sT2 b

s.t. constraints 5a-5m hold for B,

9

Table 1: Mixed integer constraints defining tree-structured covariance matrices

Bij ≥ 0 ∀i, j (5a)Bii ≥ Bij ∀i 6= j (5b)Bij ≥ Bik − (1− ρijk1)M (5c)Bik ≥ Bjk − (1− ρijk1)M (5d)Bjk ≥ Bik − (1− ρijk1)M (5e)Bik ≥ Bij − (1− ρijk2)M (5f)Bij ≥ Bjk − (1− ρijk2)M (5g)Bjk ≥ Bij − (1− ρijk2)M (5h)Bjk ≥ Bij − (ρijk11 + ρijk2)M (5i)Bij ≥ Bik − (ρijk11 + ρijk2)M (5j)Bik ≥ Bij − (ρijk11 + ρijk2)M (5k)

ρijk1 + ρijk2 ≤ 1 (5l)ρijk1, ρijk2 ∈ {0, 1} ∀ i > j > k. (5m)

where p = 2p!(p−3)! .

We can similarly find the nearest tree structured covariance matrix in sum-absolute-value (sav)norm. Let s1 be the vectorization of symmetric matrix S such that ‖S‖sav = ‖s1‖1, then the nearesttree-structured covariance matrix in sum-absolute-value norm is given by the corresponding matrixrepresentation of solution b of the following mixed integer linear program:

minb∈Rp(p+1)/2,ρ∈Rp

‖s1 − b‖1

s.t. constraints 5a-5m hold for B

3 Results: A Case Study in Gene Family Analysis of Yeast GeneExpression

We applied our methods to the analysis of gene expression in Saccharomyces cerevisiae gene familiesas presented in Oakley et al. (2005) 2. Following the methodology of Gu et al. (2002), the yeastgenome is partitioned into gene families using an amino acid sequence similarity heuristic. Thelargest 10 of the resulting families are used in this analysis with family sizes ranging from p = 7 top = 18 genes. Names and sizes for the gene families used in the analysis are given in Table 3 ofAppendix B. We refer to Oakley et al. (2005) for further details.

The gene expression data is from 19 cDNA microarray time course experiments. Each time pointin the series is the log2 ratio of expression at the given time point to expression at the base lineunder varying experimental conditions. To make our results comparable to the analysis in Oakleyet al. (2005), we do not model correlation between measurements at different time points and refer

2All data for this analysis was retrieved from "http://www.lifesci.ucsb.edu/eemb/labs/oakley/pubs/

MBE2005data/".

10

to Oakley et al. (2005) and Gu (2004) for a discussion regarding this violation of the independenceassumption among measurements.

The analysis in Oakley et al. (2005) proceeded as follows:

1. Phylogenetic trees were derived for each family from DNA sequence using Maximum Like-lihood methods. In particular, an alignment of amino acid sequences from the entire genecoding region was used to derive a DNA sequence alignment which was then used to estimatea phylogenetic tree. As stated by the authors (Oakley et al., 2005), this is one of many pos-sible choices, including for example, flanking upstream non-coding regions that could have asignificant role in expression regulation.

2. Based on the resulting trees, gene expression data was analyzed using Maximum Likelihoodmethods under a Brownian diffusion process under two families of models: a phylogeneticclass, where the covariance of the diffusion process has a tree structure, and a non-phylogeneticclass where the covariance of the diffusion process is diagonal. The AIC score of the result-ing ML estimate is used to classify each gene family-experiment pair as evolving under aphylogenetic or non-phylogenetic model.

For each gene family and experiment we have a matrix Ygi of size ni-by-p where ni is the numberof time points in the ith experiment and p is the gene family size. We partition the experiments ofeach gene family into two disjoints sets P = {1, . . . , l} and NP = {l+1, . . . , 19} where l is the numberof experiments classified as phylogenetic in Oakley et al. (2005). This partition yields two matricesof measurements for each gene family YgP =

[Y Tg1 · · · Y T

gl

]Tand similarly for YgNP , obtained

by concatenating the measurement matrices of experiments in the corresponding set. The idea ofconcatenating gene expression measurement matrices directly to estimate covariance was sparked bythe success of Stuart et al. (2003) where gene expression measurements were concatenated directlyto measure correlation between genes. Since we will treat the rows of these two matrices as samplesfrom distributions with EY = 0, we center each row independently to have mean 0.

One of the constraints in Section 2.5.2 that characterize tree-structured covariance matricesis the nonnegativity of their entries. Therefore, to initialize our projection solvers, we first esti-mate Maximum-Likelihood covariance matrices B+

gP and B+gNP constrained to have nonnegative

entries from sample matrices YgP and YgNP . Treating the rows of n-by-p matrix Y as independentsamples from a multivariate normal distribution N(0, B+) the goal is to find matrix B+ that max-imizes likelihood, where B+ is constrained to have nonnegative entries. Following the constrainedmaximum-likelihood formulation in Vandenberghe et al. (1998), we define the following convexdeterminant maximization problem

maxR∈Sp

n log detR− tr(RS) (6a)

s.t. Rij ≤ 0, ∀i 6= j (6b)R � 0, (6c)

where Sp is the space of p-by-p symmetric matrices, n is the number of samples in matrix Y ,and S = Y Y T its sample covariance matrix. The expression R � 0 denotes that R is positivedefinite and we take variable R to be the inverse of the estimate B+ = R−1. By the nonpositivityelement-wise constraints (6b), along with the positive definiteness constraint (6c), feasible solutionsto Problem (6) will be members of the class of M-matrices (Horn and Johnson, 1991) which havethe property that their inverse are matrices with nonnegative entries (Theorem 2.5.3 in Horn and

11

Johnson (1991)). Therefore, the constraints in Problem (6) imply that estimate B+ will be themaximum likelihood estimate with nonnegative entries.

From estimates B+gP and B+

gNP we estimate tree-structured covariance matrices BgP and BgNP

using our MIP projection methods. To describe the strength of the hierarchical structure of theseestimated covariances we define the structural strength metric as follows:

SS(B) =1p

p∑i=1

maxi 6=j BijBii

. (7)

The term maxi 6=j Bij is the largest covariance between gene i and a different gene j. This is thelength of the path from the root to the immediate ancestor of leaf i in the corresponding tree.Therefore, the ratio in SS(B) compares the length of the path from the root to leaf i to the lengthof the subpath from the root to i’s immediate ancestor. A value of SS(B) near zero means that onaverage objects have zero covariance, values near one means that the tree is strongly hierarchicalwhere objects spend very little time taking independent steps in the diffusion process.

Under the classification of experiments as undergoing phylogenetic versus non-phylogeneticevolution we expect that the structural strength metric should be quite different for estimatedtree-structured covariance matrices BgP and BgNP . That is, we expect that SS(BgP ) ≥ SS(BgNP )for most gene families g. We show our results in Figure 3 which validate this hypothesis. We plotSS(BgP ) versus SS(BgNP ) for each gene family g. The diagonal is the area where SS(BgP ) =SS(BgNP ). We see that in fact SS(BgP ) > SS(BgNP ) for all gene families g except the HexoseTransport Family.

●

●

0.0 0.1 0.2 0.3 0.4 0.5 0.6

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Structural strength comparison, sav norm

non−phylogentic experiments structural strength

phyl

ogen

etic

exp

erim

ents

str

uctu

ral s

tren

gth

●

●

ABC_TransportersADP_RibosylationAlpha_GlucosidasesDUPGTP_BindingHSP_DnaKHexose_TransportKinases

(a)

●

●

0.0 0.1 0.2 0.3 0.4 0.5 0.6

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Structural strength comparison, Frobenius norm

non−phylogentic experiments structural strength

phyl

ogen

etic

exp

erim

ents

str

uctu

ral s

tren

gth

●

●

ABC_TransportersADP_RibosylationAlpha_GlucosidasesDUPGTP_BindingHSP_DnaKHexose_TransportKinases

(b)

Figure 3: Comparison of structural strengths for tree-structured covariance estimates BgP and BgNP

for projection under sav (a) and Frobenius (b) norms. Each point represents a gene family. Thex-axis is SS(BgNP ). We can see that for all, except the Hexose Transport gene family, SS(BgP ) >SS(BgNP ). Only eight families are shown since the Putative Helicases and Permeases families didnot have any experiments classified as phylogenetic.

We next look at the resulting tree for the ABC (ATP-binding cassette) Transporters gene family

12

(see Jungwirth and Kuchler (2006) for a short literature review) in more detail. In particular,the eight genes included in this group are members of the subfamily conferring pleiotropic drugresistance (PDR) and are all located in the plasma membrane. A number of transcription factorshave been found for the PDR subfamily, including the PDR3 factor considered one of the masterregulators of the PDR network (Delaveau et al., 1994). Figure 4 shows the tree estimated by theMIP projection method for this family along with the sequence-derived tree reported by Oakleyet al. (2005). We can notice topological differences between the two trees, in particular, the subtreein Figure 4(a) containing genes YOR328W, YDR406W, YOR153W and YDR011W.

Estimated Tree for ABC Transporters Gene Family

YDR011W

YNR070W

YPL058C

YOR328W

YDR406W

YOR153W

YIL013C

YOR011W

(a)

Sequence−derived Tree for ABC Transporters Gene Family

YDR011W

YNR070W

YPL058C

YOR328W

YDR406W

YOR153W

YIL013C

YOR011W

(b)

Figure 4: (a) Tree estimated by the MIP projection method using Frobenius norm for the ABCTransporters gene family. (b) Sequence-derived tree reported by Oakley et al. (2005) for the ABCTransporters gene family. The red tips correspond to genes YOR328W, YDR406W, YOR153Wand YDR011W which form a subtree in (a) but not in (b).

In order to elucidate this topological difference, we turn to the characteristics of the promoter(regulatory) regions of the genes and ask whether transcription factor (TF) binding site contents ofthe upstream regions could account for this difference. We compiled a list of known yeast transcrip-tion factor binding site consensus sequences using Gasch et al. (2004) and the Promoter Databaseof Saccharomyces cerevisiae (SCPD) (http://rulai.cshl.edu/SCPD/). Then, we generated atranscription factor binding site occurrence vector for each gene by simply counting the numberof occurrences of each consensus sequence in the 1000 base pairs upstream of the coding region.Putting these profiles together we obtained a 8-by-128 matrix where rows represent the 8 genes inthe ABC Transporters gene family and columns represent 128 transcription factors. Inspection ofthis matrix once the rows are permuted to follow the hierarchy in the tree estimated by the MIPprojection method (Figure 4(a)) immediately revealed that the presence or absence of the PDR3transcription factor binding site in the flanking upstream region may account for the topologicaldifference apparent in the two estimated trees. Table 2 shows the number of times the motif forthe PDR3 factor was detected in the upstream region of each gene.

It is known (Delaveau et al., 1994) that the four genes in Table 2 with multiple PDR3 bindingsites are, as opposed to the other four genes, targets of this transcription factor which controls thepleiotropic drug resistance phenomenon. The structure of the subtree in Figure 4(a) correspondingto the PDR3 target genes essentially follows the frequency of PDR3 occurrences. On the otherhand, the structure of the subtree for the non-PDR3 target genes follows that of the sequence-

13

Table 2: Number of occurrences of the PDR3 transcription factor motif in the 1000 bp upstreamregion for each gene in the ABC Transporters family. Colors match those of Figure 4.

gene Occurrences of PDR31 YOR011W 02 YIL013C 03 YPL058C 04 YNR070W 05 YDR406W 36 YOR328W 47 YDR011W 68 YOR153W 9

derived tree of Figure 4(b). Namely, pairs (YOR011W,YIL013C) and (YPL058C,YNR070W) arenear each other in both the sequence-derived and the MIP-derived trees. Therefore, after takinginto account the initial split characterized by the presence of the PDR3 transcription factor, theMIP estimated tree (Figure 4(a)) is similar to the sequence-derived tree (Figure 4(b)).

We reiterate the observation of Oakley et al. (2005) that the choice of sequence region to createthe reference phylogenetic trees used in their analysis plays a crucial role and results could varyaccordingly. From our methods, we have found evidence that using upstream sequence flankingthe coding region might yield a tree that is better suited to explore the influence of evolution ingene expression for this particular gene family. We believe that finding a good estimate for tree-structured covariance matrices directly from expression measurements can help investigators guidetheir choices for downstream comparative analysis like that of Oakley et al. (2005).

Appendices C and B detail implementation choices and running times of our mixed-integerestimation procedure.

4 Discussion

The issues we hope to address by estimating tree-structured covariance matrices directly fromobserved sample covariances from gene expression data can be illustrated using the work of White-head and Crawford (2006) who characterize evolution patterns of the expression of 329 genes infive strains of the Fundulus heteroclitus fish. One of their analyses uses generalized least squaresregression of gene expression on habitat temperature using a tree-structured covariance matrixfor correction. This structured covariance matrix is derived from a phylogeny constructed fromfive microsatellite markers (short repeating strings) which are random characters expected notto be influenced by selection and to evolve at the same base rate as the whole genome. Thetree is constructed with the greedy neighbor-joining algorithm (Saitou, 1987) from Cavalli-Sforzaand Edward’s (CSE) chord distances between the five microsatellite markers. We reproduce thismicrosatellite-derived tree in Figure 5(a). The neighbor-joining algorithm is a greedy algorithmsusceptible to generating different solutions depending on how the algorithm is implemented. Forexample, the implementation of this algorithm in the ape R package 3 yields a different tree (Fig-ure 5(b)) given the CSE distances. For the purpose of generalized least squares, and therefore the

3Version 1.10-2. We thank Dr. Andrew Whitehead for providing the distance data through personal communica-tion.

14

evolutionary statements asserted as a result, this difference in topology can be significant. Con-sidering this instability of the resulting neighbor-joining tree and the importance it plays in theauthors’ analyses, we posit that deriving tree-structured covariance matrices directly from the ex-pression data can guide investigators in comparing sequence-derived phylogenetic trees for use insubsequent comparative analysis.

Microsatellite−derived tree

CT

GA

ME

NC

NJU

0.05 0.04 0.03 0.02 0.01 0

(a)

Microsatellite−derived tree fromsecond neighbor−joining implementation

CT

GA

ME

NC

NJU

0.08 0.06 0.04 0.02 0

(b)

Figure 5: Microsatellite-derived trees built by two implementations of the neighbor-joining algo-rithm from Cavalli-Sforza and Edward’s chord distances. Figure 5(a) is the tree reported in White-head and Crawford (2006), and Figure 5(b) was obtained by the ape R package.

To address these shortcomings and motivated by what we think is a problem of genomic reso-lution as described in the Introduction, we have described a method for estimating tree-structuredcovariance matrices directly from observed sample covariance matrices by projection methods. Weshowed that projection problems for known topologies are linear or quadratic programs dependingon the approximation norm used. For unknown topology problems, we proposed and evaluated amixed-integer formulation which can be solved to optimality by existing branch-and-bound solvers.

The work of McCullagh (McCullagh, 2006) on tree-structured covariance matrices is the closestto our work. He proposes the minimax projection to estimate the structure of a given sample covari-ance matrix. Given this structure, likelihood is maximized as in Anderson (1973). The minimaxprojection is independent of the estimation problem being solved as opposed to our MIP methodwhich minimizes the estimation objective while finding tree structure simultaneously. Furthermore,the MIP solver guarantees optimality upon completion, at the cost of longer execution in difficultcases where the optimal trees in many tree topologies have similar objective values.

Rifkin et al. (2003) use expression directly to estimate phylogenetic structure, but use a distance-based method utilizing the number of pairwise differentially expressed genes as the source of dis-tances. They observe that for the resulting distance matrix the neighbor joining tree-buildingalgorithm (Saitou, 1987) produces a tree estimate that matches the sequence derived tree for asubgroup of Drosophila species.

Using the MIP formulation to model tree-structured matrix constraints, we can also addressthe need to solve existing tree estimation problems exactly. In particular, the least squares method

15

of Fitch and Margoliash (1967) estimates a tree that minimizes the least-squares deviation of thedistance between objects in the tree and a given distance matrix D. However, given a covariancematrix B, we can compute squared distances between objects using the linear expressionD2

ij = Bii+Bjj − 2Bij . This implies that the least squares distance-deviance objective is a quadratic functionof the entries of covariance matrix B. Therefore, using the MIP formulation of Section 2.5 andthe quadratic least squares distance-deviance objective, we can express the least-squares methodof Fitch and Margoliash (1967) as a MIQP. Thus, generic branch-and-bound solvers of quadraticMIPs fill the gap observed in Felsenstein et al. (2004) which states that no branch-and-boundmethod to solve the least-squares problem exactly has been proposed.

Along the same line, MIPs have been used to solve phylogeny estimation problems for haplotypedata (Brown and Harrower, 2006; Huang et al., 2005; Sridhar et al., 2008; Wang and Xu, 2003).The observed data from the tree leaves in this case is haplotype variation represented as sequencesof ones and zeros. Although our MIP formulation is related, the data in our case is assumed to beobservations from a diffusion process along a tree, suitable for continuous traits like gene expression.

We can place the problem of estimating tree-structured covariance matrices in the broadercontext of structured covariance matrix estimation (Anderson, 1973; Li et al., 1999; Schulz, 1997).The work of Anderson (1973) is especially relevant since an iterative procedure is used to fitmatrices, or matrix inverses, which can be expressed as linear combinations of known symmetricmatrices. For known topologies, this method solves likelihood maximization problems where anormality assumption is made on the diffusion process underlying the data. However, for unknowntopologies, maximum likelihood problems require that we extend our computational methods to,for example, determinant maximization problems. Solving these and similar types of nonlinearMIPs is an active area of research in the optimization community (Lee, 2007). In recent years, theproblem of structured covariance matrix estimation has been mainly addressed in its application tosparse Gaussian Graphical Models (Banerjee and Natsoulis, 2006; Chaudhuri et al., 2007; Drton andRichardson, 2003, 2004; Yuan and Lin, 2007). In this instance, sparsity in the inverse covariancematrix induces a set of conditional independence properties that can be encoded as a sparse graph(not necessarily a tree).

Although we presented a descriptive metric of structural strength in our estimates in Section 3,future work will concentrate on leveraging these methods in principled hypothesis testing frame-works that better assess the presence of hierarchical structure in observed data. We expect thatthe resulting methods are likely to impact how evolutionary analysis of gene expression traits isconducted.

References

T.W. Anderson. Asymptotically Efficient Estimation of Covariance Matrices with Linear Structure.The Annals of Statistics, 1(1):135–141, 1973.

O. Banerjee and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphicalmodels. Proceedings of the 23rd international conference on Machine learning, pages 89–96, 2006.

D. Bertsimas and R. Weismantel. Optimization over integers. Dynamic Ideas, 2005.

D.G. Brown and I.M. Harrower. Integer programming approaches to haplotype inference by pureparsimony. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(2):141–154, 2006.

16

L.L. Cavalli-Sforza and AWF Edwards. Phylogenetic Analysis: Models and Estimation Procedures.Evolution, 21(3):550–570, 1967.

S. Chaudhuri, M. Drton, and T.S. Richardson. Estimation of a covariance matrix with zeros.Biometrika, 94(1):199, 2007.

T. Delaveau, A. Delahodde, E. Carvajal, J. Subik, and C. Jacq. PDR3, a new yeast regulatory gene,is homologous toPDR1 and controls the multidrug resistance phenomenon. Molecular Geneticsand Genomics, 244(5):501–511, 1994.

M. Drton and T.S. Richardson. A New Algorithm for Maximum Likelihood Estimation in GaussianGraphical Models for Marginal Independence. UAI (Uffe Kjærulff and Christopher Meek, eds.),San Francisco: Morgan Kaufmann, pages 184–191, 2003.

M. Drton and T.S. Richardson. Iterative conditional fitting for Gaussian ancestral graph models.Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 130–137, 2004.

J.C. Fay and P.J. Wittkopp. Evaluating the role of natural selection in the evolution of generegulation. Heredity, 1:9, 2007.

J. Felsenstein et al. Inferring phylogenies. Sinauer Associates Sunderland, Mass., USA, 2004.

W.M. Fitch and E. Margoliash. Construction of Phylogenetic Trees. Science, 155(3760):279–284,1967.

K.A. Frazer, C.M. Wade, D.A. Hinds, N. Patil, D.R. Cox, and M.J. Daly. Segmental PhylogeneticRelationships of Inbred Mouse Strains Revealed by Fine-Scale Analysis of Sequence VariationAcross 4.6 Mb of Mouse Genome. Genome Research, 14:1493–1500, 2004.

A.P. Gasch, A.M. Moses, D.Y. Chiang, H.B. Fraser, M. Berardini, and M.B. Eisen. Conservationand evolution of cis-regulatory systems in ascomycete fungi. PLoS Biol, 2(12):e398, 2004.

X. Gu. Statistical Framework for Phylogenomic Analysis of Gene Family Expression Profiles.Genetics, 167(1):531–542, 2004.

Z. Gu, A. Cavalcanti, F.C. Chen, P. Bouman, and W.H. Li. Extent of Gene Duplication in theGenomes of Drosophila, Nematode, and Yeast. Molecular Biology and Evolution, 19(3):256–262,2002.

F. Habib, A.D. Johnson, R. Bundschuh, and D. Janies. Large scale genotype-phenotype correlationanalysis based on phylogenetic trees. Bioinformatics, 23(7):785, 2007.

R.A. Horn and C.R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.

Y.T. Huang, K.M. Chao, and T. Chen. An approximation algorithm for haplotype inference bymaximum parsimony. Journal of Computational Biology, 12(10):1261–1274, 2005.

SA Ilog. Ilog Cplex 9.0 Users Manual, 2003.

H. Jungwirth and K. Kuchler. Yeast ABC transporters–A tale of sex, stress, drugs and aging.FEBS Letters, 580(4):1131–1138, 2006.

J. Lee. Mixed-integer nonlinear programming: Some modeling and solution issues. IBM JOURNALOF RESEARCH AND DEVELOPMENT, 51(3/4):489, 2007.

17

H. Li, P. Stoica, and J. Li. Computationally efficient maximum likelihood estimation of structuredcovariance matrices. Signal Processing, IEEE Transactions on [see also Acoustics, Speech, andSignal Processing, IEEE Transactions on], 47(5):1314–1323, 1999.

P. McCullagh. Structured covariance matrices in multivariate regression models. Technical report,Department of Statistics, University of Chicago, 2006.

T.H. Oakley, Z. Gu, E. Abouheif, N.H. Patel, and W.H. Li. Comparative Methods for the Analysisof Gene-Expression Evolution: An Example Using Yeast Functional Genomic Data. MolecularBiology and Evolution, 22(1):40–50, 2005.

E. Paradis, J. Claude, and K. Strimmer. Ape: analyses of phylogenetics and evolution in Rlanguage. Bioinformatics, 20:289–290, 2004.

D. Penny and M.D. Hendy. The use of tree comparison metrics. Syst. Zool, 34(1):75–82, 1985.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Founda-tion for Statistical Computing, Vienna, Austria, 2007. URL http://www.R-project.org. ISBN3-900051-07-0.

S.A. Rifkin, J. Kim, and K.P. White. Evolution of gene expression in the Drosophila melanogastersubgroup. Nature Genetics, 33(2):138–144, 2003.

N. Saitou. The neighbor-joining method: a new method for reconstructing phylogenetic trees, 1987.

T.J. Schulz. Penalized maximum-likelihood estimation of covariance matrices with linear structure.Signal Processing, IEEE Transactions on [see also Acoustics, Speech, and Signal Processing,IEEE Transactions on], 45(12):3027–3038, 1997.

S. Sridhar, F. Lam, G. Blelloch, R. Ravi, and R. Schwartz. Mixed Integer Linear Programming forMaximum Parsimony Phylogeny Inference. IEEE/ACM Transactions on Computational Biologyand Bioinformatics, 2008.

J.M. Stuart, E. Segal, D. Koller, and S.K. Kim. A Gene-Coexpression Network for Global Discoveryof Conserved Genetic Modules. Science, 302(5643):249–255, 2003.

R.H. Tutuncu, K.C. Toh, and M.J. Todd. Solving semidefinite-quadratic-linear programs usingSDPT3. Mathematical Programming, 95(2):189–217, 2003.

L. Vandenberghe, S. Boyd, and S.P. Wu. Determinant Maximization with Linear Matrix InequalityConstraints. SIAM Journal on Matrix Analysis and Applications, 19(2):499–533, 1998.

L. Wang and Y. Xu. Haplotype inference by maximum parsimony, 2003.

A. Whitehead and D.L. Crawford. Neutral and adaptive variation in gene expression. Proceedingsof the National Academy of Sciences, 103(14):5425–5430, 2006.

L.A. Wolsey and G.L. Nemhauser. Integer and Combinatorial Optimization. Wiley-Interscience,1999.

B. Yalcin, J. Fullerton, S. Miller, DA Keays, S. Brady, A. Bhomra, A. Jefferson, E. Volpi, RR Copley,J. Flint, et al. Unexpected complexity in the haplotypes of commonly used inbred strains oflaboratory mice. Proc Natl Acad Sci US A, 101(26):9734–9739, 2004.

18

M. Yuan and Y. Lin. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika,94(1):19–35, 2007.

A Simulation Study: Comparing MIP Projection Methods andNeighbor-Joining

An alternative method to estimate a tree-structured covariance matrix from an observed samplecovariance is to use a distance-matrix method such as the Neighbor-Joining (NJ) algorithm (Saitou,1987) as follows: given sample covariance B, create a distance matrix D such that Dij = Bii+Bjj−2Bij , and use the NJ algorithm to estimate a tree and its corresponding tree-structured covariancematrix. In this simulation, we compare the closeness of the correct tree structure to the estimatedtree-structured covariance matrix when using this NJ-based method against using our MIP-basedprojection methods. Specifically, we measure how close the structure of estimated tree-structuredmatrices are to the true structure of true matrices by using the tree topological distance definedby Penny and Hendy (1985) which essentially counts the number of mismatched nested partitionsdefined by the trees.

The simulation setting was the following: 1) we first generated 10 {T1, . . . , T10} trees with 10leaves each at random using the rtree function of the R ape library (Paradis et al., 2004), whichgives 10 tree-structured covariance matrices {B1, . . . , B10} of size 10-by-10; 2) from each tree-structured covariance matrix Bi, we draw 10 sample covariances randomly {B1

i , . . . , B10i } using a

Wishart distribution with mean Bi and the desired degrees of freedom df . This corresponds tothe sample covariance matrix of a sample with df observations from a multivariate normal randomvariable distributed as N(0, Bi). Note that the resulting sample covariances are not necessarily tree-structured. Then, we estimate a tree-structured covariance matrix Bj

i from each sample covariancematrix Bj

i and record its topological distance to the true matrix Bi. In Figure 6 we report themean topological distance of the resulting 100 estimates as a function of the degrees of freedom df ,or number of observations. The values of the x-axis are defined to satisfy df = 10×2x, so for x = 0there are 10 observations in each sample and so on.

We can see that the method based on NJ is unable to recover the correct structure even forlarge numbers of observations. On the other hand the MIP-based method is able to converge tothe correct structure for both loss functions when the sample size is 16 times the number of taxa.Although the topological distances even for smaller sample sizes are not too large, this simulationalso illustrates that, as expected, having a large number of replicates is better for this method. Thisobservation is partly the reason for concatenating different experiments in the yeast gene-familyanalysis of Section 3.

B Running Times in Gene Family Analysis

family p norm class n time gapABC Transporters 8 sav phy 13 0.49ABC Transporters 8 sav nonphy 148 0.66ABC Transporters 8 sav all 161 0.26ABC Transporters 8 fro phy 13 2.01ABC Transporters 8 fro nonphy 148 0.70

19

ABC Transporters 8 fro all 161 0.72ADP Ribosylation 7 sav phy 44 0.17ADP Ribosylation 7 sav nonphy 100 0.02ADP Ribosylation 7 sav all 144 0.07ADP Ribosylation 7 fro phy 44 0.05ADP Ribosylation 7 fro nonphy 100 0.09ADP Ribosylation 7 fro all 144 0.33Alpha Glucosidases 6 sav phy 20 0.02Alpha Glucosidases 6 sav nonphy 148 0.02Alpha Glucosidases 6 sav all 168 0.00Alpha Glucosidases 6 fro phy 20 0.11Alpha Glucosidases 6 fro nonphy 148 0.01Alpha Glucosidases 6 fro all 168 0.01DUP 10 sav phy 15 112.21DUP 10 sav nonphy 106 27.81DUP 10 sav all 121 19.91DUP 10 fro phy 15 34.86DUP 10 fro nonphy 106 294.61DUP 10 fro all 121 600.02 0.29%GTP Binding 11 sav phy 9 22.92GTP Binding 11 sav nonphy 152 55.05GTP Binding 11 sav all 161 63.36GTP Binding 11 fro phy 9 20.93GTP Binding 11 fro nonphy 152 600.02 0.55%GTP Binding 11 fro all 161 106.19HSP DnaK 10 sav phy 61 31.71HSP DnaK 10 sav nonphy 75 81.72HSP DnaK 10 sav all 136 26.49HSP DnaK 10 fro phy 61 21.60HSP DnaK 10 fro nonphy 75 412.33HSP DnaK 10 fro all 136 34.45Hexose Transport 18 sav phy 96 600.05 75.89%Hexose Transport 18 sav nonphy 12 600.02 68.78%Hexose Transport 18 sav all 108 600.02 76.78%Hexose Transport 18 fro phy 96 600.04 2.64%Hexose Transport 18 fro nonphy 12 600.08 7.39%Hexose Transport 18 fro all 108 600.11 4.93%Kinases 7 sav phy 31 0.65Kinases 7 sav nonphy 100 0.08Kinases 7 sav all 131 0.09Kinases 7 fro phy 31 1.04Kinases 7 fro nonphy 100 0.81Kinases 7 fro all 131 0.81Permeases 17 sav nonphy 97 600.04 76.92%Permeases 17 sav all 97 600.06 76.92%Permeases 17 fro nonphy 97 600.01 4.49%Permeases 17 fro all 97 600.03 4.49%Putative Helicases 11 sav nonphy 96 481.55

20

Putative Helicases 11 sav all 96 481.50Putative Helicases 11 fro nonphy 96 600.01 0.42%Putative Helicases 11 fro all 96 600.02 0.42%

Table 3: Run times for gene family analysis tree fitting. Eachrow corresponds to the MIP approximation problem for thegiven family and approximation norm. p is the size of thegene family, n is the number of replicates in the data matrix,and class indicates which class of experiments are included inthe data matrix. Time reported is CPU user time in seconds.For those MIPs reaching the 10 minute time limit, we reportthe relative optimality gap of the returned solution.

C Implementation Details

In this paper we use CPLEX 9.0 (Ilog, 2003) to solve the mixed-integer programs described above.This solver allows the user to specify a number of options to control the behavior of the branch-and-cut algorithm. Some of the options that we found to be very useful to solve these projectionproblems are the following:

1. MIP EMPHASIS: The default behavior in CPLEX is to balance the traversal of the search treeto both tighten the lower bound of the optimum and find integer-feasible solutions. Sincethe set of tree-structured covariance matrices is non-empty, we know there exists an integer-feasible solution. Therefore, we specify that the emphasis should be solely in tightening thelower bound.

2. VARSEL and NODESEL: These parameters determine the order in which the search tree istraversed. VARSEL determines which variables are branched on while NODESEL determines theorder in which nodes in the search tree are explored. We set VARSEL to strong branchingso that a small number of branches are explored quickly before deciding which one to take.We set NODESEL to best estimate where an estimate of the optimum value for integer-feasiblesolutions under this node is used to determine order.

3. DISJCUTS and FLOWCOVERS: These parameters controls how often disjunctive and flowcovercutting planes are generated. We set both to generate aggressively.

4. PROBE Probing is a preprocessing step where the logical implications of setting binary variablesto 1 or 0 are explored. We set this parameter to the maximum level of probing.

The determinant maximization Problem (6) is solved using the SDPT3 Tutuncu et al. (2003)semidefinite programming solver. Except for this problem, all experiments and analyses werecarried out in R (R Development Core Team, 2007), and many utilities of the ape package (Par-adis et al., 2004) were used. CPLEX was used through an interface to R written by the au-thors available at http://cran.r-project.org/web/packages/Rcplex/. An R package includ-ing the MIP projection solvers will be made available by the authors. Since CPLEX is propri-etary software, our published code will also allow the use of the Rsymphony interface (http://cran.r-project.org/web/packages/Rsymphony/index.html) to the SYMPHONY MILP solver(http://www.coin-or.org/SYMPHONY/).

21

Mean topological distance, NJ vs. MIP

log2(degrees of freedom/10)

Mea

n to

polo

gica

l dis

tanc

e

0

2

4

6

8

10

0 2 4 6 8

●

●

●

●● ● ● ● ● ●

Neighbor−JoiningMIP (sav norm)MIP (Frobenius norm)

● ● ●

Figure 6: Mean topological distance between estimated and true tree-structured covariance matri-ces.

22

estimating tree-structured covariance matrices via mixed...

Documents