
CONSISTENCY GUARANTEES FOR GREEDY PERMUTATION-BASED CAUSAL INFERENCE ALGORITHMS

LIAM SOLUS, YUHAO WANG, AND CAROLINE UHLER

Abstract. Directed acyclic graph (DAG) models are widely used to represent complex causal systems. Since the basic task of learning a DAG model from data is NP-hard, a standard approach is greedy search over the space of DAGs or Markov equivalence classes of DAGs. Since the space of DAGs on p nodes and the associated space of Markov equivalence classes are both much larger than the space of permutations, it is desirable to consider permutation-based greedy searches. We here provide the first consistency guarantees, both uniform and high-dimensional, of a greedy permutation-based search. This search corresponds to a simplex-type algorithm over the edge-graph of a sub-polytope of the permutohedron, called the DAG associahedron. Every vertex in this polytope is associated with a DAG, and hence with a collection of permutations that are consistent with the DAG ordering. A walk is performed on the edges of the polytope maximizing the sparsity of the associated DAGs. We also show based on simulations that this permutation search is competitive with current approaches.

    1. Introduction

Bayesian networks, or directed acyclic graph (DAG) models, are widely used to model complex causal systems arising, for example, in computational biology, epidemiology, or sociology [7, 20, 24, 26]. Given a DAG G := ([p], A) with node set [p] := {1, 2, . . . , p} and arrow set A, a DAG model associates to each node i ∈ [p] of G a random variable Xi. By the Markov property, G encodes a set of conditional independence (CI) relations Xi ⊥⊥ XNd(i)\Pa(i) | XPa(i), where Nd(i) and Pa(i) respectively denote the nondescendants and parents of the node i in G. A joint probability distribution P on the nodes [p] is said to satisfy the Markov assumption (a.k.a. be Markov) with respect to G if it entails these CI relations. A fundamental problem of causal inference is the following: Suppose we sample data from a distribution P that is Markov with respect to a DAG G∗. From this data we infer a collection of CI relations C. Our goal is to recover the unknown DAG G∗ using C.

Unfortunately, this problem is not well-defined since multiple DAGs can encode the same set of CI relations. Any two such DAGs are termed Markov equivalent, and they are said to belong to the same Markov equivalence class (MEC). Thus, our goal becomes to identify the Markov equivalence class M(G∗) of G∗. The DAG model for G∗ is said to be identifiable if the MEC M(G∗) can be uniquely recovered from the set of CI relations C. The Markov assumption alone is not sufficient to

Date: January 14, 2020.
Key words and phrases. causal inference; Bayesian network; DAG; Gaussian graphical model; greedy search; pointwise and high-dimensional consistency; faithfulness; generalized permutohedron; DAG associahedron.

arXiv:1702.03530v3 [math.ST] 11 Jan 2020


guarantee identifiability, and so additional identifiability assumptions have been studied, the most prominent being the faithfulness assumption [26]. Unfortunately, this assumption has been shown to be restrictive in practice [35]; consequently, it is desirable to develop DAG model learning algorithms that are consistent under strictly weaker assumptions than faithfulness.

Since the space of all MECs of DAGs on p nodes grows super-exponentially in p [10], one way to perform causal inference is to use greedy approaches. For example, one popular algorithm is the Greedy Equivalence Search (GES) [5, 16], which greedily maximizes a score, such as the Bayesian Information Criterion (BIC), over the space of all MECs on p nodes. A second approach to further increase efficiency is to consider algorithms with a reduced search space, such as the space of all p! topological orderings of DAGs (i.e. permutations of {1, . . . , p}). Greedy permutation-based algorithms combine both of these heuristic approaches for DAG model learning.

Bouckaert [2] presents a greedy permutation-based search that uses arrow reversals and deletions to produce a sparse DAG entailing a collection of observed CI relations. In later studies, Singh and Valtorta [25] and Larrañaga et al. [13] present DAG model learning algorithms that first greedily learn an ordering of the nodes and then recover a DAG that has this ordering as a topological order. In [25], the authors use CI-based tests to identify an ordering, and then use a greedy heuristic known as the K2 algorithm [6] to identify the DAG structure. In [13], the authors use genetic algorithm techniques to learn an ordering and then use K2 to learn an optimally sparse DAG. More recently, a greedy permutation-based search was studied by Teyssier and Koller [29] using a decomposable score such as the BIC. They showed via simulations that such an approach compares favorably to greedy search in the space of DAGs in terms of computation time while being comparable in performance. However, unlike GES, all greedy permutation-based algorithms developed so far [2, 6, 13, 25, 29] rely on heuristics for sparse DAG recovery and are therefore not provably consistent, even under the faithfulness assumption.

The main purpose of this paper is to provide the first consistency guarantees of a greedy permutation-based algorithm for causal structure discovery in the causally sufficient setting. This algorithm is a greedy version of the SP algorithm, but unlike the (non-greedy) SP algorithm, the proposed algorithm scales to causal structure discovery with hundreds of variables. In addition, we show that our greedy SP algorithm is consistent under weaker assumptions than standardly assumed, which translates into competitive performance as compared to current causal structure discovery algorithms.

The remainder of the paper is structured as follows: In Section 2, we recall some necessary preliminaries. The permutations π of [p] form the vertices of a polytope, known as the permutohedron. Mohammadi et al. [17] constructed from a set of CI relations a sub-polytope of the permutohedron, called the DAG associahedron, where each vertex π is associated to a DAG Gπ. A natural approach then is to perform a greedy SP algorithm; i.e. a simplex-type algorithm on the edge-graph of this polytope with graph sparsity as a score function. In Section 3, we formally introduce the greedy SP algorithm for which we will have two variants: ESP and TSP. In Section 4, we prove that both ESP and TSP are pointwise consistent under strictly weaker assumptions than faithfulness. We also show pointwise consistency under faithfulness for a variant of the algorithm in [29]. Collectively, these results


are the first consistency guarantees for greedy permutation-based searches. In Section 5, we study uniform consistency of TSP in the Gaussian setting. We propose a strategy for efficiently finding a “good” starting permutation based on the minimum degree algorithm, a heuristic for finding sparse Cholesky decompositions. We then prove uniform consistency of TSP in the high-dimensional setting under a more restrictive faithfulness condition. In Section 6, we present simulations comparing TSP with GES and other popular algorithms, and we observe that TSP is competitive in its performance.

    2. Background

We refer the reader to Appendix A for graph theory definitions and notation. A fundamental result about DAG models is that the complete set of CI relations implied by the Markov assumption for G is given by the d-separation relations in G [14, Section 3.2.2]; i.e., a distribution P is Markov with respect to G if and only if XA ⊥⊥ XB | XC in P whenever A and B are d-separated in G given C. The faithfulness assumption asserts that all CI relations entailed by P are given by d-separations in G [26].

Assumption 1 (Faithfulness Assumption). A distribution P satisfies the faithfulness assumption with respect to a DAG G = ([p], A) if for any pair of nodes i, j ∈ [p] and any subset S ⊂ [p]\{i, j} we have that i ⊥⊥ j | S if and only if i is d-separated from j given S in G.

All DAG model learning algorithms assume the Markov assumption, i.e. the forward direction of the faithfulness assumption, and many of the classical algorithms also assume the converse. Unfortunately, the faithfulness assumption is very sensitive to hypothesis testing errors for inferring CI statements from data, and almost-violations of faithfulness have been shown to be frequent [35]. A number of relaxations of the faithfulness assumption have been suggested [22]. For example, restricted faithfulness is the weakest known sufficient condition for consistency of the popular PC-algorithm [26].

Assumption 2 (Restricted Faithfulness Assumption). A distribution P satisfies the restricted faithfulness assumption with respect to a DAG G = ([p], A) if it satisfies the two conditions:

(1) (Adjacency Faithfulness) For all arrows i → j ∈ A we have that Xi ⊥̸⊥ Xj | XS for all subsets S ⊂ [p]\{i, j}.

(2) (Orientation Faithfulness) For all unshielded triples (i, j, k) and all subsets S ⊂ [p]\{i, k} such that i is d-connected to k given S, we have that Xi ⊥̸⊥ Xk | XS.

By sacrificing computation time, some algorithms remain consistent under assumptions that further relax restricted faithfulness, an example being the sparsest permutation (SP) algorithm [23]. Let Sp denote the space of all permutations of length p. Given a set of CI relations C on [p], every permutation π ∈ Sp is associated to a DAG Gπ as follows:

πi → πj ∈ A(Gπ) ⇔ i < j and πi ⊥̸⊥ πj | {π1, . . . , πj}\{πi, πj}.

A DAG Gπ is known as a minimal I-MAP (independence map) with respect to C, since any DAG Gπ satisfies the minimality assumption with respect to C, i.e., any


CI relation encoded by a d-separation in Gπ is in C and any proper subDAG of Gπ encodes a CI relation that is not in C [19]. We will also refer to Gπ as a permutation DAG. The SP algorithm searches over all DAGs Gπ for π ∈ Sp and returns the DAG that maximizes the score

score(C;G) := −|G| if G is Markov with respect to C, and −∞ otherwise,

where |G| denotes the number of arrows in G. Raskutti and Uhler [23] showed that the SP algorithm is consistent under the sparsest Markov representation (SMR) assumption, which is strictly weaker than restricted faithfulness.
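To make the construction of Gπ and the SP search concrete, the following Python sketch (our illustration, not part of the paper) builds the minimal I-MAP of a permutation from a CI oracle and runs the brute-force SP search over all p! orders. The oracle ci(a, b, S), the function names, and the representation of a DAG as a set of arrow pairs are illustrative assumptions.

```python
from itertools import permutations

def minimal_imap(pi, ci):
    """Minimal I-MAP G_pi of the order pi: add the arrow pi_i -> pi_j (i < j)
    unless pi_i _||_ pi_j | {pi_1, ..., pi_j} \ {pi_i, pi_j} holds in C.
    ci(a, b, S) is an assumed oracle returning True iff a _||_ b | S is in C."""
    arrows = set()
    for j in range(len(pi)):
        for i in range(j):
            cond = frozenset(pi[:j + 1]) - {pi[i], pi[j]}
            if not ci(pi[i], pi[j], cond):
                arrows.add((pi[i], pi[j]))
    return arrows

def sparsest_permutation(nodes, ci):
    """Brute-force SP algorithm: return a sparsest minimal I-MAP over all p! orders."""
    best = None
    for pi in permutations(nodes):
        g = minimal_imap(pi, ci)
        if best is None or len(g) < len(best):
            best = g
    return best
```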

Assumption 3 (SMR Assumption). A probability distribution P satisfies the SMR assumption with respect to a DAG G if it is Markov with respect to G and |G| < |H| for every DAG H to which P is Markov and which satisfies H ∉ M(G).

The downside to the SP algorithm is that it requires a search over all p! permutations of the node set [p]. A typical approach to accommodate a large search space is to pass to a greedy variant of the algorithm. To this end, we analyze greedy variants of the SP algorithm in the coming section.

    3. Greedy SP algorithm

The SP algorithm has a natural interpretation in the setting of discrete geometry. The permutohedron on p elements, denoted Ap, is the convex hull in Rp of all vectors obtained by permuting the coordinates of (1, 2, 3, . . . , p)T. The SP algorithm can be thought of as searching over the vertices of Ap, since it considers the DAGs Gπ for each π ∈ Sp. Hence, a natural first step to reduce the size of the search space is to contract together all vertices of Ap that correspond to the same DAG Gπ. This can be done via the following construction first presented by Mohammadi et al. [17].

Two vertices of the permutohedron Ap are connected by an edge if and only if the permutations indexing the vertices differ by an adjacent transposition. We associate a CI relation to adjacent transpositions, and hence to each edge of Ap; namely, we associate πi ⊥⊥ πi+1 | {π1, . . . , πi−1} to the edge between

(π1, . . . , πi, πi+1, . . . , πp)T and (π1, . . . , πi+1, πi, . . . , πp)T.

In [17, Section 4], it is shown that given a set of CI relations C from a joint distribution P on [p], then contracting all edges in Ap corresponding to CI relations in C results in a convex polytope, which we denote by Ap(C). Note that Ap(∅) = Ap. Furthermore, if the CI relations in C form a graphoid, i.e., they satisfy the semigraphoid properties

(SG1) if i ⊥⊥ j | S then j ⊥⊥ i | S,
(SG2) if i ⊥⊥ j | S and i ⊥⊥ k | {j} ∪ S, then i ⊥⊥ k | S and i ⊥⊥ j | {k} ∪ S,

and the intersection property

(INT) if i ⊥⊥ j | {k} ∪ S and i ⊥⊥ k | {j} ∪ S, then i ⊥⊥ j | S and i ⊥⊥ k | S,

then it was shown in [17, Theorem 7.1] that contracting edges in Ap that correspond to CI relations in C is the same as identifying vertices of Ap that correspond to the same DAG. The semigraphoid properties hold for any distribution. On the other hand, the intersection property holds, for example, for strictly positive distributions.


Algorithm 1: Edge SP (ESP)

Input: A set of CI relations C on node set [p] and a starting permutation π ∈ Sp.
Output: A minimal I-MAP G.
1 Compute the polytope Ap(C) and set G := Gπ.
2 Using a depth-first search approach with root G along the edges of Ap(C), search for a minimal I-MAP Gτ with |G| > |Gτ|. If no such Gτ exists, return G; else set G := Gτ and repeat this step.

Another example of a graphoid is the set of CI relations C corresponding to all d-separations in a DAG. In that case Ap(C) is also called a DAG associahedron [17]. The edge graph of the polytope Ap(C), where each vertex corresponds to a different DAG, represents a natural search space for a greedy version of the SP algorithm.

Through a closer examination of the polytope Ap(C), we arrive at two greedy versions of the SP algorithm: one based on the geometry of Ap(C) by walking along edges of the polytope and another based on the combinatorial description of the vertices by walking from DAG to DAG. These two greedy versions of the SP algorithm are given in Algorithms 1 and 2.

Both algorithms take as input a set of CI relations C and an initial permutation π ∈ Sp. Beginning at the vertex Gπ of Ap(C), Algorithm 1 walks along an edge of Ap(C) to any vertex whose corresponding DAG has at most as many arrows as Gπ. Once it can no longer discover a sparser DAG, the algorithm returns the last DAG it visited (from which we deduce the corresponding MEC). Since this algorithm is based on walking along edges of Ap(C), we call this greedy version the edge SP algorithm, or ESP for short. The corresponding identifiability assumption can be stated as follows:

Assumption 4 (ESP Assumption). A distribution P satisfies the ESP assumption with respect to a DAG G if it is Markov with respect to G and if Algorithm 1 returns only DAGs in M(G).

Algorithm 1 requires computing the polytope Ap(C). This is inefficient, since an edge walk in a polytope only requires knowing the neighbors of a vertex. Next we provide a graphical characterization of neighboring DAGs.

In the following, we say that an arrow i → j in a DAG G is covered if Pa(i) = Pa(j) \ {i}, and it is trivially covered if Pa(i) = Pa(j) \ {i} = ∅. In addition, we call a sequence of DAGs (G1, G2, . . . , GN) a weakly decreasing sequence if |Gi| ≥ |Gi+1| for all i ∈ [N − 1]. If Gi+1 is produced from Gi by reversing a covered arrow in Gi, then we will refer to this sequence as a weakly decreasing sequence of covered arrow reversals. Let Gπ and Gτ denote two adjacent vertices in a DAG associahedron Ap(C). Let Ḡ denote the skeleton of G; i.e., the undirected graph obtained by undirecting all arrows in G. Then, as noted in [17, Theorem 8.3], Gπ and Gτ differ by a covered arrow reversal if and only if Ḡπ ⊆ Ḡτ or Ḡτ ⊆ Ḡπ. In some instances, this fact gives a combinatorial interpretation of all edges of Ap(C). However, this need not always be true as demonstrated in Example 17 in Appendix A.
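As an illustration of the covered-arrow notion used throughout, here is a small Python sketch (helper names and the arrow-set representation of a DAG are our own assumptions) that tests whether an arrow is covered and performs the corresponding reversal.

```python
def parents(arrows, v):
    """Parent set Pa(v) in the DAG given by its arrow set."""
    return {a for (a, b) in arrows if b == v}

def is_covered(arrows, i, j):
    """The arrow i -> j is covered iff Pa(i) = Pa(j) \ {i};
    it is trivially covered when this common parent set is empty."""
    return (i, j) in arrows and parents(arrows, i) == parents(arrows, j) - {i}

def reverse_covered(arrows, i, j):
    """Reverse a covered arrow i -> j; by [4, Theorem 2] the result is again
    a DAG in the same Markov equivalence class."""
    assert is_covered(arrows, i, j)
    return (arrows - {(i, j)}) | {(j, i)}
```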

The combinatorial description of some edges of Ap(C) via covered arrow reversals motivates Algorithm 2, a combinatorial greedy SP algorithm. Since this algorithm is based on flipping covered arrows, we call this the triangle SP algorithm, or TSP for


Algorithm 2: Triangle SP (TSP)

Input: A set of CI relations C on node set [p] and a starting permutation π ∈ Sp.
Output: A minimal I-MAP G.
1 Set G := Gπ.
2 Using a depth-first search approach with root G, search for a minimal I-MAP Gτ with |G| > |Gτ| that is connected to G by a weakly decreasing sequence of covered arrow reversals. If no such Gτ exists, return G; else set G := Gτ and repeat this step.

short. Unlike Algorithm 1, this algorithm does not require computing the polytope Ap(C) and is the version run in practice. Similar to Algorithm 1, we specify an identifiability assumption in relation to Algorithm 2.

Assumption 5 (TSP Assumption). A distribution P satisfies the TSP assumption with respect to a DAG G if it satisfies the Markov assumption with respect to G and if Algorithm 2 returns only DAGs in M(G).

It is straightforward to verify that every covered arrow reversal in some minimal I-MAP Gπ with respect to C corresponds to some edge of the DAG associahedron Ap(C). Consequently, if a distribution satisfies the TSP assumption then it also satisfies the ESP assumption.
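Combining the two previous sketches, the following Python sketch (again only an illustration, with an assumed CI oracle and an assumed depth/restart policy in the spirit of Algorithms 2 and 4) walks through weakly decreasing sequences of covered arrow reversals and restarts from any strictly sparser minimal I-MAP that is found; it reuses minimal_imap, parents, is_covered, and reverse_covered from above.

```python
def topological_order(arrows, nodes):
    """One linear extension of the DAG with the given arrow set."""
    remaining, order = set(nodes), []
    while remaining:
        v = next(u for u in remaining if not (parents(arrows, u) & remaining))
        order.append(v)
        remaining.remove(v)
    return tuple(order)

def greedy_tsp(nodes, ci, start_pi, depth=4):
    """Greedy TSP-style search: from the current minimal I-MAP, explore covered
    arrow reversals up to the given depth, restarting whenever a strictly
    sparser minimal I-MAP is reached by a weakly decreasing sequence."""
    g = minimal_imap(tuple(start_pi), ci)
    improved = True
    while improved:
        improved = False
        stack, visited = [(g, 0)], {frozenset(g)}
        while stack and not improved:
            cur, d = stack.pop()
            for (i, j) in list(cur):
                if not is_covered(cur, i, j):
                    continue
                tau = topological_order(reverse_covered(cur, i, j), nodes)
                nxt = minimal_imap(tau, ci)
                if len(nxt) > len(cur) or frozenset(nxt) in visited:
                    continue              # keep the sequence weakly decreasing
                if len(nxt) < len(g):     # strictly sparser: restart from it
                    g, improved = nxt, True
                    break
                if d + 1 < depth:         # bounded search depth, cf. Algorithm 4
                    visited.add(frozenset(nxt))
                    stack.append((nxt, d + 1))
    return g
```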

    4. Consistency Guarantees and Identifiability Implications

In this section, we prove that both versions of the greedy SP algorithm (ESP and TSP) are (pointwise) consistent (i.e., in the oracle-version as n → ∞ the algorithm outputs the true Markov equivalence class) under the faithfulness assumption. We also show that a version of TSP using the BIC score instead of graph sparsity is consistent under the faithfulness assumption. These results constitute the first consistency guarantees for permutation-based DAG model learning algorithms. In addition, we study the relationships between the different identifiability assumptions encountered so far, namely faithfulness, restricted faithfulness, SMR, ESP, and TSP.

4.1. Consistency of Greedy SP under faithfulness. First note that since the TSP assumption implies the ESP assumption, it is sufficient to prove pointwise consistency of the TSP version of Greedy SP. To prove this, we need to show that for a given set of CI relations C corresponding to d-separations in a DAG G∗, every weakly decreasing sequence of covered arrow reversals ultimately leads to a DAG in M(G∗). Given two DAGs G and H, H is an independence map of G, denoted by G ≤ H, if every CI relation encoded by H holds in G (i.e. CI(G) ⊇ CI(H)). The following simple result, whose proof is given in the Appendix, reveals the main idea of the proof.

Lemma 6. A probability distribution P on the node set [p] is faithful with respect to a DAG G if and only if G ≤ Gπ for all π ∈ Sp.

The goal is to prove that for any pair of DAGs such that Gπ ≤ Gτ, there is a weakly decreasing sequence of covered arrow reversals

(Gτ = Gπ0, Gπ1, Gπ2, . . . , GπM = Gπ).


Our proof relies heavily on Chickering’s consistency proof of GES and, in particular, on his proof of a conjecture known as Meek’s conjecture.

Theorem 7. [5, Theorem 4] Let G and H be any pair of DAGs such that G ≤ H. Let r be the number of arrows in H that have opposite orientation in G, and let m be the number of arrows in H that do not exist in either orientation in G. There exists a sequence of at most r + 2m arrow reversals and additions in G with the following properties:

(1) Each arrow reversal is a covered arrow reversal.
(2) After each reversal and addition, the resulting graph G′ is a DAG and G′ ≤ H.
(3) After all reversals and additions G = H.

Chickering [5] gave a constructive proof of this result via the APPLY-EDGE-OPERATION algorithm. For convenience, we will henceforth refer to the algorithm as the “Chickering algorithm.” The Chickering algorithm takes in an independence map G ≤ H and adds an arrow to G or reverses a covered arrow in G to produce a new DAG G1 for which G ≤ G1 ≤ H. By Theorem 7, repeated applications of this algorithm produce a sequence of DAGs

G = G0 ≤ G1 ≤ G2 ≤ · · · ≤ GN = H.

We will call any sequence of DAGs produced in this fashion a Chickering sequence (from G to H). A quick examination of the Chickering algorithm reveals that there can be multiple Chickering sequences from G to H. We are interested in identifying a specific type of Chickering sequence in which the covered arrow reversals and edge additions correspond to steps between permutation DAGs in a weakly decreasing sequence.

Given two DAGs Gπ ≤ Gτ, TSP proposes that there is a path along the edges of Ap(C) corresponding to covered arrow reversals taking us from Gτ to Gπ, say (Gτ = Gπ0, Gπ1, Gπ2, . . . , GπM = Gπ), for which |Gπj−1| ≥ |Gπj| for all j = 1, . . . , M. Recall that we call such a sequence of minimal I-MAPs satisfying the latter property a weakly decreasing sequence. If such a weakly decreasing sequence exists from any Gτ to Gπ, then TSP must find it. By definition, such a path is composed of covered arrow reversals and arrow deletions. Since these are precisely the types of moves used in the Chickering algorithm, we must understand the subtleties of the relationship between independence maps relating the DAGs Gπ for a collection of CI relations C and the skeletal structure of the Gπ. To this end, we will use the following two definitions: A minimal I-MAP Gπ with respect to a graphoid C is called MEC-minimal if for all G ≈ Gπ and linear extensions τ of G we have that Gπ ≤ Gτ. Notice that, by [17, Theorem 8.1], it suffices to check only one linear extension τ for each G. The minimal I-MAP Gπ is further called MEC-s-minimal if it is MEC-minimal and Ḡπ ⊆ Ḡτ for all G ≈ Gπ and linear extensions τ of G. We are now ready to state the main proposition that allows us to verify consistency of TSP under the faithfulness assumption.

Proposition 8. Suppose that C is a graphoid and Gπ and Gτ are minimal I-MAPs with respect to C. Then

(a) if Gπ ≈ Gτ and Gπ is MEC-s-minimal then there exists a weakly decreasing edgewalk from Gπ to Gτ along Ap(C). In particular, any Chickering sequence connecting Gτ and Gπ is a sequence of Markov equivalent minimal I-MAPs;


Algorithm 3: TSP using the BIC score

Input: Observations X̂, initial permutation π.
Output: Permutation π̂ with DAG Gπ̂.
1 Set Ĝπ := argmax of BIC(G; X̂) over all DAGs G consistent with the permutation π.
2 Using a depth-first search approach with root π, search for a permutation τ with BIC(Ĝτ; X̂) > BIC(Ĝπ; X̂) that is connected to π through a sequence of permutations (π1, · · · , πk), where each permutation πi is produced from πi−1 by first reversing a covered arrow in Ĝπi−1 and then selecting a linear extension πi of the resulting DAG. If no such Ĝτ exists, return Ĝπ; else set π := τ and repeat.

(b) if Gπ ≤ Gτ but Gπ ≉ Gτ then there exists a minimal I-MAP Gτ′ with respect to C satisfying Gτ′ ≤ Gτ that is strictly sparser than Gτ and is connected to Gτ by a weakly decreasing edgewalk along Ap(C).

Theorem 9. The TSP and ESP algorithms are both pointwise consistent under the faithfulness assumption.

Proof. Since the TSP assumption implies the ESP assumption, it suffices to prove consistency of the TSP algorithm. Suppose that C is a graphoid that is faithful to the sparsest minimal I-MAP Gπ∗ with respect to C. By Lemma 6, we know that Gπ∗ ≤ Gπ for all π ∈ Sp. By (b) of Proposition 8, if Algorithm 2 is at a minimal I-MAP Gτ that is not in the same Markov equivalence class as Gπ∗, then we can take a weakly decreasing edgewalk along Ap(C) to reach a sparser minimal I-MAP Gτ′ satisfying Gπ∗ ≤ Gτ′ ≤ Gτ. Following repeated applications of Proposition 8 (b), the algorithm eventually returns a minimal I-MAP in the Markov equivalence class of Gπ∗. In order for the algorithm to verify that it is in the correct Markov equivalence class, it needs to compute a minimal I-MAP Gτ for a linear extension τ of each DAG G ≈ Gπ∗ and check that it is not sparser than Gπ∗. Proposition 8 (a) states that any such minimal I-MAP Gτ is connected to Gπ∗ by a weakly decreasing edgewalk along Ap(C). In other words, the Markov equivalence class of Gπ∗ is a connected subgraph of the edge graph of Ap(C), and hence the algorithm can verify that it has reached the sparsest Markov equivalence class and terminate. Thus, TSP is pointwise consistent under the faithfulness assumption. □

4.2. Consistency of the TSP algorithm using BIC under faithfulness. We now note that a version of TSP that uses the BIC score instead of graph sparsity is also consistent under the faithfulness assumption. This version of TSP is presented in Algorithm 3. Algorithm 3 is constructed in analogy to the ordering-based search methods studied by Teyssier and Koller [29].

Theorem 10. Algorithm 3 is pointwise consistent under the faithfulness assumption.

Theorem 10 is proven in Appendix B. It is based on the fact that the BIC is locally consistent [5, Lemma 7]. We note that Algorithm 3 differs from the ordering-based search method proposed by Teyssier and Koller [29] in two main ways:

(i) Algorithm 3 selects each new permutation by a covered arrow reversal in the associated I-MAPs, and


    (ii) it uses a depth-first search approach instead of greedy hill climbing.

In particular, our search guarantees that any two minimal I-MAPs related by an independence map, Gπ ≤ Gτ, are connected by a Chickering sequence. The proof of Theorem 10 then follows since |Gτ| < |Gπ| if and only if BIC(Gτ; X̂) > BIC(Gπ; X̂) for any minimal I-MAPs Gπ and Gτ. However, since this fact does not hold for arbitrary DAGs satisfying the Markov assumption with respect to a given distribution, the algorithm of Teyssier and Koller [29] lacks known consistency guarantees.
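As an illustration of the decomposable score used in Algorithm 3, the following Python sketch computes a Gaussian BIC for a DAG by regressing each node on its parents. It is only a sketch: the parameter count (one weight per parent plus a noise variance) and the absence of an intercept are simplifying assumptions, while the paper itself relies on the standard BIC of [5].

```python
import numpy as np

def gaussian_bic(X, arrows):
    """Gaussian BIC of a DAG (arrow set) given an n x p data matrix X:
    per-node regression log-likelihood minus (log n)/2 per free parameter."""
    n, p = X.shape
    score = 0.0
    for i in range(p):
        pa = sorted(a for (a, b) in arrows if b == i)
        resid = X[:, i]
        if pa:
            beta, *_ = np.linalg.lstsq(X[:, pa], X[:, i], rcond=None)
            resid = X[:, i] - X[:, pa] @ beta
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        score += loglik - 0.5 * np.log(n) * (len(pa) + 1)  # weights + noise variance
    return score
```

In the oracle limit, the likelihood terms of two minimal I-MAPs of the same distribution coincide, so comparing BIC values reduces to comparing the number of arrows, which is the equivalence quoted above.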

4.3. Beyond faithfulness. We now examine the relationships between the ESP, TSP, SMR, faithfulness, and restricted faithfulness assumptions. All proofs for the results in this section can be found in Appendix B.

Theorem 11. The following hierarchy holds (all implications are strict):

faithfulness =⇒ TSP =⇒ ESP =⇒ SMR.

It is clear from the definition that restricted faithfulness is a significantly weaker assumption than faithfulness. In [23, Theorem 2.5] it was shown that the SMR assumption is strictly weaker than restricted faithfulness. In the following, we compare restricted faithfulness to TSP and show that the restricted faithfulness assumption is not weaker than the TSP assumption.

Theorem 12. Let P be a semigraphoid w.r.t. a DAG G. If P satisfies the TSP assumption, then P satisfies adjacency faithfulness with respect to G. However, there exist distributions P such that P satisfies the TSP assumption with respect to a DAG G and P does not satisfy orientation faithfulness with respect to G.

4.4. The problem of Markov equivalence. In Sections 4.1 and 4.3 we examined when ESP and TSP return the true MEC M(G∗). While these results supply identifiability assumptions ensuring consistency of the algorithms, it is important to note that the algorithms may still be quite inefficient since they may search over DAGs that belong to the same MEC. This is due to the fact that two DAGs in the same MEC differ only by a sequence of covered arrow reversals [4, Theorem 2]. Thus, the greedy nature of TSP can leave us searching through large portions of MECs until we identify a sparser permutation DAG. In particular, in order to know that TSP has terminated, it must visit all members of the sparsest MEC M(G∗).

To address this problem, Algorithm 4 provides a more cost-effective alternative that approximates TSP. We call this algorithm Greedy SP, since this is the version that we run in practice (see Section 6). Greedy SP operates exactly like TSP, with the exception that it bounds the search depth and the number of runs allowed before the algorithm terminates. Recall that TSP searches for a weakly decreasing edgewalk from a minimal I-MAP Gπ to a minimal I-MAP Gτ with |Gπ| > |Gτ| via a depth-first search approach. In Greedy SP (Algorithm 4), if this search step does not produce a sparser minimal I-MAP after searching up to and including depth d, the algorithm terminates and returns Gπ. In [10], enumerative computations of all possible MECs on p ≤ 10 nodes suggest that the average MEC contains four DAGs. By the characterization of Markov equivalence via covered arrow flips given in [4, Theorem 2], this suggests that a search depth of 4 is, on average, sufficient for escaping an MEC of minimal I-MAPs. This intuition is verified via simulations in Section 6. Greedy SP (Algorithm 4) also incorporates the possibility of restarting the algorithm in order to try and identify a sparser DAG. Here, the parameter r


Algorithm 4: Greedy SP (TSP with bounds on search-depth and number of runs)

Input: A set of CI relations C on node set [p], and two positive integers d and r.
Output: A minimal I-MAP Gπ.
1 Set R := 0 and Y := ∅.
2 while R < r do
3   Select a permutation π ∈ Sp and set G := Gπ.
4   Using a depth-first search approach with root G, search for a permutation DAG Gτ with |G| > |Gτ| that is connected to G by a weakly decreasing sequence of covered arrow reversals of length at most d.
5   if no such Gτ exists then
6     set Y := Y ∪ {G}, R := R + 1, and go to step 2.
7   else
8     set G := Gτ and go to step 4.
9   end
10 end
11 Return the sparsest DAG Gπ in the collection Y.

denotes the number of runs before the algorithm is required to output the sparsest DAG.

    5. Uniform Consistency

In analogy to the adaptation of the SGS algorithm into the PC algorithm [26], in this section we show that minor adjustments can turn TSP into an algorithm that is uniformly consistent in the high-dimensional Gaussian setting: Algorithm 5 only tests conditioning sets made up of parent nodes of covered arrows; this feature turns out to be critical for high-dimensional consistency.

Our proof parallels that of Kalisch and Bühlmann [11] for high-dimensional consistency of the PC algorithm. To give more context, Van de Geer and Bühlmann [33] analyzed ℓ0-penalized maximum likelihood estimation for causal inference. While they proved that the global optimum converges to a sparse minimal I-MAP, their approach in general does not converge to the data-generating DAG. More recently, it was shown that a variant of GES is consistent in the high-dimensional setting [18]. Similarly to the latter approach, by assuming sparsity of the initial DAG, we obtain uniform consistency of greedy SP in the high-dimensional setting, i.e., it converges to the data-generating DAG when the number of nodes p scales with n.

Here, we let the dimension grow as a function of sample size; i.e. p = pn. Similarly, for the true underlying DAG and the data-generating distribution we let G∗ = G∗n and P = Pn, respectively. The assumptions under which we will guarantee high-dimensional consistency of Algorithm 5 are as follows:

(A1) The distribution Pn is multivariate Gaussian and faithful to the DAG G∗n for all n.
(A2) The number of nodes pn scales as pn = O(n^a) for some 0 ≤ a < 1.
(A3) Given an initial permutation π0, the maximal degree dπ0 of the corresponding minimal I-MAP Gπ0 satisfies dπ0 = O(n^(1−m)) for some 0 < m ≤ 1.


Algorithm 5: High-dimensional Greedy SP

Input: Observations X̂, threshold τ, and initial permutation π0.
Output: Permutation π̂ together with the DAG Ĝπ̂.
1 Construct the minimal I-MAP Ĝπ0 from the initial permutation π0 and X̂;
2 Perform Algorithm 2 with constrained conditioning sets, i.e., let i → j be a covered arrow and let S = pa(i) = pa(j) \ {i}; perform the edge flip, i.e. i ← j, and update the DAG by removing edges (k, i) for k ∈ S such that |ρ̂i,k|(S∪{j})\{k}| ≤ τ and edges (k, j) for k ∈ S such that |ρ̂j,k|(S\{k})| ≤ τ.

(A4) There exist M < 1 and cn > 0 such that all non-zero partial correlations ρi,j|S satisfy |ρi,j|S| ≤ M and |ρi,j|S| ≥ cn, where 1/cn = O(n^ℓ) for some 0 < ℓ < m/2.

Analogous to the conditions needed in [11], assumptions (A1), (A2), (A3), and (A4) relate to faithfulness, the scaling of the number of nodes with the number of observations, the maximum degree of the initial DAG, and bounds on the minimal non-zero and maximal partial correlations, respectively. In the Gaussian setting the CI relation Xj ⊥⊥ Xk | XS is equivalent to the partial correlation ρj,k|S = corr(Xj, Xk | XS) equaling zero, and a hypothesis test based on Fisher’s z-transform can be used to test whether Xj ⊥⊥ Xk | XS. Combining these facts, we arrive at the following theorem.
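For concreteness, the following Python sketch (our own illustration; scipy is assumed available and the function names are hypothetical) computes a sample partial correlation from a correlation matrix and applies the Fisher z-transform test described above.

```python
import numpy as np
from scipy.stats import norm

def partial_corr(corr, j, k, S):
    """Sample partial correlation rho_{j,k|S} from a correlation matrix."""
    idx = [j, k] + sorted(S)
    theta = np.linalg.inv(corr[np.ix_(idx, idx)])
    return -theta[0, 1] / np.sqrt(theta[0, 0] * theta[1, 1])

def gauss_ci(corr, n, j, k, S, alpha=0.01):
    """Test X_j _||_ X_k | X_S via Fisher's z-transform; True means 'independent'."""
    r = float(np.clip(partial_corr(corr, j, k, S), -1 + 1e-12, 1 - 1e-12))
    z = 0.5 * np.log((1 + r) / (1 - r))
    return np.sqrt(n - len(S) - 3) * abs(z) <= norm.ppf(1 - alpha / 2)
```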

Theorem 13. Suppose that assumptions (A1) – (A4) hold and let the threshold τ in Algorithm 5 be defined as τ := cn/2. Then there exists a constant c > 0 such that Algorithm 5 is consistent, i.e., it returns a DAG Ĝπ̂ that is in the same MEC as G∗n, with probability at least 1 − O(exp(−c n^(1−2ℓ))), where ℓ is defined to satisfy assumption (A4).

Remark 14. As seen in the proof of Theorem 13, consistent estimation in the high-dimensional setting requires that we initialize the algorithm at a permutation satisfying assumption (A3). This assumption corresponds to a sparsity constraint. In the Gaussian oracle setting the problem of finding a sparsest DAG is equivalent to finding the sparsest Cholesky decomposition of the inverse covariance matrix [23]. Various heuristics have been developed for finding sparse Cholesky decompositions, the most prominent being the minimum degree algorithm [9, 31]. In Algorithm 6 we provide a heuristic for finding a sparse minimal I-MAP Gπ that reduces to the minimum degree algorithm in the oracle setting, as shown in Theorem 16.

In the following we let G := (V, E) be an undirected graph. For a subset of nodes S ⊂ V we let GS denote the vertex-induced subgraph of G with node set S. For i ∈ V we let adj(G, i) denote the nodes k ∈ V \ {i} such that {i, k} ∈ E. We first show that Algorithm 6 is equivalent to the minimum degree algorithm [31] in the oracle setting.

Theorem 15. Let the data-generating distribution P be multivariate Gaussian with precision matrix Θ. Then in the oracle setting the set of possible output permutations from Algorithm 6 is equal to the set of possible output permutations of the minimum degree algorithm applied to Θ.

The following result shows that Algorithm 6 in the non-oracle setting is also equivalent to the minimum degree algorithm in the oracle setting.


Algorithm 6: Neighbor-based minimum degree algorithm

Input: Observations X̂, threshold τ.
Output: Permutation π̂ together with the DAG Ĝπ̂.
Set S := [p]; construct the undirected graph ĜS with (i, j) ∈ ĜS if and only if |ρ̂i,j|(S\{i,j})| ≥ τ;
while S ≠ ∅ do
  Uniformly draw a node k from all nodes with the lowest degree in the graph ĜS;
  Construct ĜS\{k} by first removing node k and its adjacent edges; then update the graph ĜS\{k} as follows:
    for all i, j ∈ adj(ĜS, k): if (i, j) is not an edge in ĜS, add (i, j); else: (i, j) is an edge in ĜS\{k} iff |ρ̂i,j|S\{i,j,k}| ≥ τ;
    for all i, j ∉ adj(ĜS, k): (i, j) is an edge in ĜS\{k} iff (i, j) is an edge in ĜS.
  Set π̂(k) := |S| and S := S \ {k}.
end
Output the minimal I-MAP Ĝπ̂ constructed from π̂ and X̂.

Theorem 16. Suppose that assumptions (A1), (A2), and (A4) hold, and let the threshold τ in Algorithm 6 be defined as τ := cn/2. Then with probability at least 1 − O(exp(−c n^(1−2ℓ))) the output permutation from Algorithm 6 is contained in the possible output permutations of the minimum degree algorithm applied to Θ.
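The following Python sketch mimics Algorithm 6, reusing partial_corr from the sketch above. It is an illustration only: tie-breaking among lowest-degree nodes is done deterministically rather than uniformly at random, and all names are hypothetical.

```python
def min_degree_permutation(corr, tau):
    """Neighbor-based minimum degree heuristic: repeatedly eliminate a
    lowest-degree node of the estimated undirected graph, re-testing only
    edges between the eliminated node's neighbours; pi[k] is node k's position."""
    p = corr.shape[0]
    S = list(range(p))
    edges = {(i, j) for i in range(p) for j in range(i + 1, p)
             if abs(partial_corr(corr, i, j,
                                 [v for v in range(p) if v not in (i, j)])) >= tau}
    pi = [0] * p
    while S:
        k = min(S, key=lambda v: sum(v in e for e in edges))  # a lowest-degree node
        pi[k] = len(S)                      # k is placed last among remaining nodes
        nbrs = sorted(v for v in S if tuple(sorted((v, k))) in edges)
        S.remove(k)
        edges = {e for e in edges if k not in e}
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                i, j = nbrs[a], nbrs[b]
                if (i, j) not in edges:
                    edges.add((i, j))       # fill-in, as in the minimum degree algorithm
                elif abs(partial_corr(corr, i, j,
                                      [v for v in S if v not in (i, j)])) < tau:
                    edges.discard((i, j))   # re-test previously present edges
    return pi
```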

    6. Simulations

In this section, we describe our simulation results, for which we used the R library pcalg [12]. Our simulation study used linear structural equation models

Figure 1. Expected neighborhood size versus proportion of consistently recovered Markov equivalence classes based on 100 simulations for each expected neighborhood size on DAGs with p = 10 nodes, edge weights sampled uniformly in [−1, −0.25] ∪ [0.25, 1], and λ-values 0.1, 0.01 and 0.001. Panels: (a) p = 10, λ = 0.001; (b) p = 10, λ = 0.01; (c) p = 10, λ = 0.1.


with Gaussian noise:

(X1, . . . , Xp)T = ((X1, . . . , Xp)A)T + ε,

where ε ∼ N(0, Ip) with Ip being the p × p identity matrix and A = [aij], i, j = 1, . . . , p, is, without loss of generality, an upper-triangular matrix of edge weights with aij ≠ 0 if and only if i → j is an arrow in the underlying DAG G∗. For each simulation study, we generated 100 realizations of a p-node random Gaussian DAG model on an Erdős–Rényi graph for different values of p and expected neighborhood sizes (i.e. edge probabilities). The edge weights aij were sampled uniformly in [−1, −0.25] ∪ [0.25, 1], ensuring that the edge weights are bounded away from 0. In the first set of simulations, we analyzed the oracle setting, where we have access to the true underlying covariance matrix Σ. In the remaining simulations, n samples were drawn from the distribution induced by the Gaussian DAG model for different values of n and p. In the oracle setting, the CI relations were computed by thresholding the partial correlations using different thresholds λ. For the simulations with n samples, we estimated the CI relations by applying Fisher’s z-transform and comparing the p-values derived from the z-transform with a significance level α. In the oracle and low-dimensional settings, GES was simulated using the standard BIC score function [5]. In the high-dimensional setting, we used the ℓ0-penalized maximum likelihood estimation scoring function [18, 33].
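This simulation setup can be reproduced with a short numpy sketch (our own illustration, not the pcalg code actually used in the paper); the DAG is drawn as an Erdős–Rényi graph via an upper-triangular weight matrix with edge probability taken as the expected neighborhood size divided by p − 1, and the data are obtained by solving the linear system.

```python
import numpy as np

def sample_linear_sem(p, expected_nbrs, n, seed=0):
    """Draw a random Gaussian linear SEM and n samples from it:
    Erdős–Rényi DAG on p nodes, edge weights uniform in [-1,-0.25] U [0.25,1],
    standard normal noise, and X = X A + eps (row-wise)."""
    rng = np.random.default_rng(seed)
    prob = expected_nbrs / (p - 1)           # edge probability
    A = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):            # upper triangular: i -> j
            if rng.random() < prob:
                A[i, j] = rng.uniform(0.25, 1.0) * rng.choice([-1.0, 1.0])
    eps = rng.standard_normal((n, p))
    X = eps @ np.linalg.inv(np.eye(p) - A)   # solves X = X A + eps for each row
    return X, A
```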

Figure 1 compares the proportion of consistently estimated DAGs in the oracle setting for greedy SP with number of runs r ∈ {1, 5, 10} and depth d ∈ {1, 4, 5, ∞}, the PC algorithm, and the GES algorithm. The number of nodes in these simulations is p = 10, and we consider the different λ-values 0.1, 0.01 and 0.001 for the PC algorithm and Greedy SP. Note that we run GES with n = 100,000 samples since there is no oracle version for GES.

As expected, we see that increasing the number of runs for Greedy SP results in a consistently higher rate of model recovery. In addition, for each fixed number of runs, Greedy SP with search depth d = 4 performs similarly to search depth d = ∞, in line with the observation that the average MEC has 4 elements, as discussed in Section 4.4.

For each run of the algorithm, we also recorded the structural Hamming distance (SHD) between the true and the recovered MECs. Figure 2 shows the average SHD versus the expected neighborhood size of the true DAG. While Figure 1 demonstrates that Greedy SP with search depth d = 4 and multiple runs learns the true MEC at a notably higher rate than the PC and GES algorithms when λ is chosen small, Figure 2 shows that, for small values of d and r, when Greedy SP learns the wrong DAG it is further off from the true DAG than the PC algorithm. On the other hand, it appears that this trend only holds for Greedy SP with a relatively small search depth and few runs. That is, increasing the value of these parameters ensures that the wrong DAG learned by Greedy SP will consistently be closer to the true DAG than that learned by the PC algorithm.

We then compared the recovery performance of Greedy SP to the SP, GES, PC, SGS, and MMHC algorithms. MMHC [32] is a hybrid method that first estimates a skeleton through CI testing and then performs a hill-climbing search to orient the edges. We fixed the number of nodes to be p = 8 (due to the computational limitations of the SP algorithm) and considered sample sizes n ∈ {1,000, 10,000}. We analyzed the performance of GES using the standard BIC score along with Algorithm 4 and the PC algorithm for α ∈ {0.01, 0.001, 0.0001}. To compensate for


Figure 2. Expected neighborhood size versus structural Hamming distance between the true and recovered Markov equivalence classes based on 100 simulations for each expected neighborhood size on DAGs with p = 10 nodes, edge weights sampled uniformly in [−1, −0.25] ∪ [0.25, 1], and λ-values 0.1, 0.01 and 0.001. Panels: (a) p = 10, λ = 0.001; (b) p = 10, λ = 0.01; (c) p = 10, λ = 0.1.

the trade-off between computational efficiency and estimation performance, Greedy SP was run using r = 10 with search depth d = 4. Figure 3 shows that the SP and greedy SP algorithms achieve the best performance compared to all other algorithms. Since for computational reasons the SP algorithm cannot be applied to graphs with over 10 nodes, we conclude that greedy SP is the preferable approach for most applications.

In the remainder of this section, we analyze the performance of Algorithm 5 in the sparse high-dimensional setting. For comparison, we only compare the performance of Algorithm 5 with methods that have high-dimensional consistency guarantees, namely the PC algorithm [11] and GES [18, 33]. The initial permutation of Algorithm 6 and its associated minimal I-MAP are used as a starting point in Algorithm 5 (“high-dim greedy SP”). To better understand the influence of accurately selecting an initial minimal I-MAP on the performance of Algorithm 5, we also considered the case when the moral graph of the data-generating DAG is given as prior knowledge (“high-dim greedy SP on moral graph”). In analogy to the passage from Algorithm 2 to Algorithm 4 for the sake of computational feasibility, we similarly conducted our high-dimensional simulations using Algorithm 5 with a search depth of d = 1 and r = 50.

Figure 4 compares the skeleton recovery of high-dimensional greedy SP (Algorithm 5) with the PC algorithm and GES, both without (“high-dim greedy SP”, “high-dim PC” and “high-dim GES”) and with prior knowledge of the moral graph (“high-dim greedy SP on moral graph”, “high-dim PC on moral graph” and “high-dim GES on moral graph”). We used the ARGES-CIG algorithm of [18] to run GES with knowledge of the moral graph.

The number of nodes in our simulations is p = 100, the number of samples considered is n = 300, and the neighborhood sizes used are s = 0.2, 1 and 2. We varied the tuning parameters of each algorithm; namely, the significance level α for the PC and the greedy SP algorithms and the penalization parameter λn for GES. We then reported the average number of true positives and false positives for each tuning parameter in the ROC plots shown in Figure 4. The result shows


Figure 3. Expected neighborhood size versus proportion of consistently recovered skeletons based on 100 simulations for each expected neighborhood size on DAGs with p = 8 nodes, sample size n = 1,000 and 10,000, edge weights sampled uniformly in [−1, −0.25] ∪ [0.25, 1], and α-values 0.01, 0.001 and 0.0001. Panels: (a) n = 1,000, α = 0.0001; (b) n = 1,000, α = 0.001; (c) n = 1,000, α = 0.01; (d) n = 10,000, α = 0.0001; (e) n = 10,000, α = 0.001; (f) n = 10,000, α = 0.01. Each panel plots the expected neighborhood size against the proportion of consistent simulations for SP, greedy SP, GES, MMHC, PC, and SGS.

that, in contrast to the low-dimensional setting, greedy SP remains comparable to the PC algorithm and GES in the high-dimensional setting, but GES tends to achieve a slightly better performance in some of the settings.

    7. Discussion

In this paper we proposed and examined greedy permutation-based DAG model learning algorithms that arise as greedy variants of the SP algorithm. The two main algorithms analyzed were the ESP (Algorithm 1) and TSP (Algorithm 2) algorithms. We additionally studied a version of TSP using the BIC score function (Algorithm 3) and a high-dimensional version of TSP (Algorithm 5). Unlike ESP, TSP does not require computing the polytope on which the greedy search is performed and hence can scale to larger graphs. Computational costs can further be reduced by limiting the search depth (Algorithm 4, the Greedy SP algorithm).

The main goal of this paper was to establish the first known consistency guarantees for greedy permutation-based DAG model learning algorithms. To do this, in Section 4 we examined consistency guarantees for Algorithms 1 and 2. We showed they are pointwise consistent under the faithfulness assumption, thereby making them the first permutation-based DAG model learning algorithms with consistency guarantees. We further established that Algorithm 3, which is closely related to the


Figure 4. ROC plot for skeleton recovery both with and without the moral graph based on 100 simulations on DAGs with p = 100 nodes, expected neighborhood size s = 0.2, 1 and 2, sample size n = 300, and edge weights sampled uniformly in [−1, −0.25] ∪ [0.25, 1]. Panels: (a) s = 0.2, n = 300; (b) s = 1, n = 300; (c) s = 2, n = 300; (d) s = 0.2, n = 300; (e) s = 1, n = 300; (f) s = 2, n = 300.

greedy permutation search of Teyssier and Koller [29], is also pointwise consistent under the faithfulness assumption. We then proved that Algorithm 5 is uniformly consistent under faithfulness. Additionally, we described the relationship between the identifiability assumptions ESP, TSP, SMR, faithfulness, and restricted faithfulness via Theorems 11 and 12.

We then compared the greedy permutation-based algorithms studied here with the state-of-the-art algorithms via simulations. In the oracle and low-dimensional settings, we found that Greedy SP with d = 4 and r = 5 tends to outperform both GES and the PC algorithm. Furthermore, in the high-dimensional setting Greedy SP performs comparably to the PC algorithm and GES. Similar to the PC algorithm and in contrast to GES, Greedy SP is nonparametric and consequently does not require the Gaussian assumption. In future work it would be interesting to combine the greedy SP algorithm with kernel-based CI tests [8, 30], which are better able to deal with non-linear relationships and non-Gaussian noise, and analyze its performance on non-Gaussian data. Furthermore, we believe that greedy permutation-based causal inference approaches could provide new avenues for causal inference in a variety of settings. Extensions of Greedy SP to the setting where a mix of observational and interventional data is available have recently


been presented in [36, 37, 28]. Extensions of Greedy SP to the setting with latent variables are also starting to be studied [1]. In addition, it would be interesting to extend the greedy SP algorithm to the setting with cyclic graphs.

Finally, recall that a typical reason for passing to a greedy DAG model learning algorithm from a non-greedy algorithm is to more efficiently work through a large search space. This reasoning applies when we consider passing from the SP algorithm, which has a search space of size p!, to the greedy SP algorithm, which greedily searches over equivalence classes of the p! permutation DAGs that correspond to the nodes of the DAG associahedron Ap(C) described in Section 3. The greedy SP algorithm discussed in this paper therefore operates in a reduced search space as compared to the original SP algorithm. Moreover, the moves in the greedy SP algorithm are a strict subset of the moves used by the algorithm of Teyssier and Koller [29], and this subset explicitly excludes moves that are guaranteed not to improve the value of the score function. Therefore, it seems likely that greedy SP performs with computational efficiency comparable or favorable to the algorithm of Teyssier and Koller [29], which was shown to be more efficient than GES. While the main focus of the present paper was to provide the first consistency guarantees for greedy permutation-based algorithms, these observations suggest, as a future research topic, to explicitly compare the computational efficiency of the algorithms discussed in Section 6.

    Acknowledgements

Liam Solus was partially supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship (DMS-1606407). Caroline Uhler was partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), IBM, a Sloan Fellowship and a Simons Investigator Award. The authors are grateful to Robert Castelo for helpful discussions.

    APPENDIX

    Appendix A. Background Material and an Example

Here, we provide some definitions from graph theory and causal inference that we will use in the coming proofs. Given a DAG G := ([p], A) with node set [p] := {1, 2, . . . , p} and arrow set A, we associate to the nodes of G a random vector (X1, . . . , Xp) with a probability distribution P. An arrow in A is an ordered pair of nodes (i, j) which we will often denote by i → j. A directed path in G from node i to node j is a sequence of directed edges in G of the form i → i1 → i2 → · · · → j. A path from i to j is a sequence of arrows between i and j that connect the two nodes without regard to direction. The parents of a node i in G are the collection PaG(i) := {k ∈ [p] : k → i ∈ A}, and the ancestors of i, denoted AnG(i), are the collection of all nodes k ∈ [p] for which there exists a directed path from k to i in G. We do not include i in AnG(i). The descendants of i, denoted DeG(i), are the set of all nodes k ∈ [p] for which there is a directed path from i to k in G, and the nondescendants of i are the collection of nodes NdG(i) := [p]\(DeG(i) ∪ {i}). When the DAG G is understood from context we write Pa(i), An(i), De(i), and Nd(i) for the parents, ancestors, descendants, and nondescendants of i in G, respectively. The analogous definitions and notation will also be used for any set S ⊂ [p]. If two nodes are connected by an arrow in G then we say they are adjacent. A triple of


nodes (i, j, k) is called unshielded if i and j are adjacent, k and j are adjacent, but i and k are not adjacent. An unshielded triple (i, j, k) forms an immorality if it is of the form i → j ← k. In any triple (shielded or not) with arrows i → j ← k, the node j is called a collider. Given disjoint subsets A, B, C ⊂ [p] with A ∩ B = ∅, we say that A is d-connected to B given C if there exist nodes i ∈ A and j ∈ B for which there is a path between i and j such that every collider on the path is in An(C) ∪ C and no non-collider on the path is in C. If no such path exists, we say A and B are d-separated given C.
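Since d-separation is used throughout the paper, here is a compact Python sketch of the standard ancestral moral graph criterion for checking it; this is an illustration only, and the representation of the DAG as a set of arrow pairs is an assumption.

```python
def d_separated(arrows, A, B, C):
    """True iff A and B are d-separated given C in the DAG with the given arrows:
    restrict to ancestors of A ∪ B ∪ C, moralize, delete C, test connectivity."""
    A, B, C = set(A), set(B), set(C)
    pa = {}
    for (a, b) in arrows:
        pa.setdefault(b, set()).add(a)
    # ancestral set of A ∪ B ∪ C
    anc, stack = set(A | B | C), list(A | B | C)
    while stack:
        for u in pa.get(stack.pop(), ()):
            if u not in anc:
                anc.add(u)
                stack.append(u)
    # moralize: undirect arrows within the ancestral set and marry co-parents
    und = {frozenset(e) for e in arrows if e[0] in anc and e[1] in anc}
    for v in anc:
        ps = sorted(pa.get(v, set()) & anc)
        und |= {frozenset((ps[i], ps[j]))
                for i in range(len(ps)) for j in range(i + 1, len(ps))}
    # A and B are d-separated given C iff C separates them in the moral graph
    seen, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for e in und:
            if v in e:
                u = next(iter(e - {v}))
                if u not in C and u not in seen:
                    seen.add(u)
                    stack.append(u)
    return True
```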

A.1. Example. We now provide an example demonstrating that not all edges of a DAG associahedron need to correspond to a covered arrow reversal.

Example 17. An example of a DAG associahedron containing an edge that does not correspond to a covered arrow reversal in either DAG labeling its endpoints can be constructed as follows: Let Gπ∗ denote the left-most DAG depicted in Figure 5, and let C denote those CI relations implied by the d-separation statements for Gπ∗. Then for the permutations π = 15432 and τ = 15342, the DAGs Gπ and Gτ label a pair of adjacent vertices of Ap(C) since π and τ differ by the transposition of 3 and 4. This adjacent transposition corresponds to a reversal of the arrow between nodes 3 and 4 in Gπ and Gτ. However, this arrow is not covered in either minimal I-MAP. We further note that this example shows that not all edges of Ap(C) can be described by covered arrow reversals even when C is faithful to the sparsest minimal I-MAP, Gπ∗.

Appendix B. Proofs for Results on the Pointwise Consistency of the Greedy SP Algorithms

B.1. Proof of Lemma 6. Suppose first that C is not faithful to G and take any conditional independence statement i ⊥⊥ j | K that is not encoded by the d-separation statements in G. Take π to be any permutation in which K ≺π i ≺π


Figure 5. An edge of a DAG associahedron that does not correspond to a covered edge flip. The DAG associahedron Ap(C) is constructed for the CI relations implied by the d-separation statements for Gπ∗ with π∗ = 15234. The DAGs Gπ and Gτ with π = 15432 and τ = 15342 correspond to adjacent vertices in Ap(C), connected by the edge labeled by the transposition of 3 and 4. The arrow between nodes 3 and 4 is not covered in either DAG Gπ or Gτ.


Algorithm 7: APPLY-EDGE-OPERATION

Input: DAGs G and H where G ≤ H and G ≠ H.
Output: A DAG G′ satisfying G′ ≤ H that is given by reversing an edge in G or adding an edge to G.

1. Set G′ := G.
2. While G and H contain a node Y that is a sink in both DAGs and for which PaG(Y) = PaH(Y), remove Y and all incident edges from both DAGs.
3. Let Y be any sink node in H.
4. If Y has no children in G, then let X be any parent of Y in H that is not a parent of Y in G. Add the edge X → Y to G′ and return G′.
5. Let D ∈ DeG(Y) denote the (unique) maximal element from DeG(Y) within H. Let Z be any maximal child of Y in G such that D is a descendant of Z in G.
6. If Y → Z is covered in G, reverse Y → Z in G′ and return G′.
7. If there exists a node X that is a parent of Y but not a parent of Z in G, then add X → Z to G′ and return G′.
8. Let X be any parent of Z that is not a parent of Y. Add X → Y to G′ and return G′.

j ≺π [p] \ (K ∪ {i, j}). Then G ≰ Gπ since i ⊥⊥ j | K is encoded by the d-separations of Gπ but not by the d-separations of G.

Conversely, suppose P is faithful to G. By [19, Theorem 9, Page 119], we know that P satisfies the Markov assumption with respect to Gπ for any π ∈ Sp. So any conditional independence relation encoded by Gπ holds for P, which means it also holds for G. Thus, G ≤ Gπ. □

To prove that TSP is consistent under faithfulness we require a number of lemmas pertaining to the steps of the Chickering algorithm. For the convenience of the reader, we recall the Chickering algorithm in Algorithm 7.
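The covered arrow reversal performed in step 6 of Algorithm 7, and tracked throughout the lemmas below, is a purely local operation on parent sets. A minimal sketch is given here (our own helper; by the transformational characterization of Markov equivalence in [4, Theorem 2], reversing a covered arrow yields a Markov equivalent DAG):

def reverse_covered(parents, i, j):
    # reverse the covered arrow i -> j (Pa(j) = Pa(i) ∪ {i}) and return new parent sets
    assert parents[j] == parents[i] | {i}, "the arrow i -> j must be covered"
    new = {v: set(ps) for v, ps in parents.items()}
    new[j].discard(i)   # delete i -> j
    new[i].add(j)       # add    j -> i
    return new

# In the DAG with arrows 1 -> 2, 1 -> 3, 2 -> 3, the arrow 2 -> 3 is covered;
# reversing it gives the Markov equivalent DAG with arrows 1 -> 2, 1 -> 3, 3 -> 2.
pa = {1: set(), 2: {1}, 3: {1, 2}}
print(reverse_covered(pa, 2, 3))   # {1: set(), 2: {1, 3}, 3: {1}}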

Lemma 18. Suppose G ≤ H such that the Chickering algorithm has reached step 5 and selected the arrow Y → Z in G to reverse. If Y → Z is not covered in G, then there exists a Chickering sequence
(G = G0, G1, G2, . . . , GN ≤ H)
in which GN is produced by the reversal of Y → Z, and for all i = 1, 2, . . . , N − 1, the DAG Gi is produced by an arrow addition via step 7 or 8 with respect to the arrow Y → Z.

Proof. Until the arrow Y → Z is reversed, the set DeG(Y) and the node choice D ∈ DeG(Y) remain the same. This is because steps 7 and 8 only add parents to Y or Z that are already parents of Y or Z, respectively. Thus, we can always choose the same Y and Z until Y → Z is covered. □

For an independence map G ≤ H, the Chickering algorithm first deletes all sinks in G that have precisely the same parents in H, and repeats this process for the resulting graphs until there is no sink of this type anymore. This is the purpose of step 2 of the algorithm. If the adjusted graph is G̃, the algorithm then selects a


sink node in G̃, which, by construction, must have fewer parents than the same node in H and/or some children. The algorithm then adds parents and reverses arrows until this node has exactly the same parents as the corresponding node in H. The following lemma shows that this can be accomplished one sink node at a time. The proof is clear from the statement of the algorithm.

Lemma 19. Let G ≤ H. If Y is a sink node selectable in step 3 of the Chickering algorithm, then we may always select Y each time until it is deleted by step 2.

We would like to see how the sequence of graphs produced in Chickering's algorithm relates to the DAGs Gπ for a set of CI relations C. In particular, we would like to see that if Gπ ≤ Gτ for permutations π, τ ∈ Sp, then there is a sequence of moves given by Chickering's algorithm that passes through a sequence of minimal I-MAPs taking us from Gπ to Gτ. To do so, we require an additional lemma relating independence maps and minimal I-MAPs. To state this lemma we need to consider the two steps within Algorithm 7 in which arrow additions occur. We now recall these two steps:

(i) Suppose Y is the sink node of H selected in step 3 for G ≤ H. If Y is also a sink node in G, then choose a parent X of Y in H that is not a parent of Y in G, and add the arrow X → Y to G.
(ii) If Y is not a sink node in G, then there exists an arrow Y → Z in G that is oriented in the opposite direction in H. If Y → Z is covered, the algorithm reverses it. If Y → Z is not covered, there exists (in G) either
(a) a parent X of Y that is not a parent of Z, in which case the algorithm adds the arrow X → Z, or
(b) a parent X of Z that is not a parent of Y, in which case the algorithm adds the arrow X → Y.

Lemma 20. Let C be a graphoid and Gπ ≤ Gτ with respect to C. Then the common sink nodes of Gπ and Gτ all have the same incoming arrows. In particular, the Chickering algorithm needs no arrow addition of type (i) to move from Gπ to Gτ.

Proof. Suppose on the contrary that there exists some common sink node Y of Gπ and Gτ and a parent node X of Y in Gτ that is not a parent node of Y in Gπ. Since Y is a sink in both DAGs, there exist linear extensions π̂ and τ̂ of the partial orders corresponding to Gπ and Gτ for which Y = π̂p and Y = τ̂p. By [17, Theorem 7.4], we know that Gπ = Gπ̂ and Gτ = Gτ̂. In particular, X and Y are not conditionally independent given [p] \ {X, Y} in Gτ̂, while X ⊥⊥ Y | [p] \ {X, Y} in Gπ̂. However, this is a contradiction, since both of these relations cannot simultaneously hold. □

B.2. Proof of Proposition 8. To prove Proposition 8 we must first prove a few lemmas. Throughout the remainder of this section, we use the following notation: Suppose that G ≤ H for two DAGs G and H and that
C = (G0 := G, G1, G2, . . . , GN := H)
is a Chickering sequence from G to H. We let πi ∈ Sp denote a linear extension of Gi for all i = 0, 1, . . . , N. For any DAG G let CI(G) denote the collection of CI relations encoded by the d-separation statements in G.


Lemma 21. Suppose that Gτ is a minimal I-MAP of a graphoid C. Suppose also that G ≈ Gτ and that G differs from Gτ only by a covered arrow reversal. If π is a linear extension of G then Gπ is a subDAG of G.

Proof. Suppose that G is obtained from Gτ by the reversal of the covered arrow x → y in Gτ. Without loss of generality, we assume that τ = SxyT and π = SyxT for some disjoint words S and T whose letters are collectively in bijection with the elements in [p] \ {x, y}. So in Gπ, the arrows going from S to T, x to T, and y to T are all the same as in Gτ. However, the arrows going from S to x and S to y may be different. So, to prove that Gπ is a subDAG of G we must show that for each letter s in the word S
(1) if s → x ∉ Gτ then s → x ∉ Gπ, and
(2) if s → y ∉ Gτ then s → y ∉ Gπ.
To see this, notice that if s → x ∉ Gτ, then s → y ∉ Gτ since x → y is covered in Gτ. Similarly, if s → y ∉ Gτ then s → x ∉ Gτ. Thus, we know that s ⊥⊥ x | S\s and s ⊥⊥ y | (S\s)x are both in the collection C. It then follows from (SG2) that s ⊥⊥ x | (S\s)y and s ⊥⊥ y | S\s are in C as well. Therefore, Gπ is a subDAG of G. □

Lemma 22. Let C be a graphoid and let
C = (G0 := Gπ, G1, G2, . . . , GN := Gτ)
be a Chickering sequence from a minimal I-MAP Gπ of C to another Gτ. If, for some index 0 ≤ i < N, Gi is obtained from Gi+1 by deletion of an arrow x → y in Gi+1, then x → y is not in Gπi+1.

Proof. Let πi+1 = SxTyR be a linear extension of Gi+1 for some disjoint words S, T, and R whose letters are collectively in bijection with the elements in [p] \ {x, y}. Since Gπ ≤ Gi ≤ Gi+1, then
C ⊇ CI(Gπ) ⊇ CI(Gi) ⊇ CI(Gi+1).
We claim that x ⊥⊥ y | ST ∈ CI(Gi) ⊆ C; therefore, x → y cannot be an arrow in Gπi+1.

First, since Gi is obtained from Gi+1 by deleting the arrow x → y, then πi+1 is also a linear extension of Gi. Notice that there is no directed path from y to x in Gi, and so it follows that x and y are d-separated in Gi by PaGi(y). Therefore, x ⊥⊥ y | PaGi(y) ∈ CI(Gi). Notice also that PaGi(y) ⊂ ST and any path in Gi between x and y lacking colliders uses only arrows in the subDAG of Gi induced by the vertices S ∪ T ∪ {x, y} = [p] \ R. Therefore, x ⊥⊥ y | ST ∈ CI(Gi) as well. It follows that x ⊥⊥ y | ST ∈ C, and so, by definition, x → y is not an arrow of Gπi+1. □

Lemma 23. Suppose that C is a graphoid and Gπ is a minimal I-MAP with respect to C. Let
C = (G0 := Gπ, G1, G2, . . . , GN := Gτ)
be a Chickering sequence from Gπ to another minimal I-MAP Gτ with respect to C. Let i be the largest index such that Gi is produced from Gi+1 by deletion of an arrow, and suppose that for all i + 1 < k ≤ N we have Gπk = Gk. Then Gπi+1 is a proper subDAG of Gi+1.


Proof. By Lemma 21, we know that Gπi+1 is a subDAG of Gi+1. This is because πi+1 is a linear extension of Gi+1, Gi+1 ≈ Gi+2 = Gπi+2, and Gi+1 differs from Gi+2 only by a covered arrow reversal. By Lemma 22, we know that the arrow deleted in Gi+1 to obtain Gi is not in Gπi+1. Therefore, Gπi+1 is a proper subDAG of Gi+1. □

    Using these lemmas, we can now give a proof of Proposition 8.

B.3. Proof of Proposition 8. To see that (i) holds, notice that since Gπ ≈ Gτ, then by the transformational characterization of Markov equivalence given in [4, Theorem 2], we know there exists a Chickering sequence
C := (G0 := Gπ, G1, G2, . . . , GN := Gτ)
for which G0 ≈ G1 ≈ · · · ≈ GN and Gi is obtained from Gi+1 by the reversal of a covered arrow in Gi+1 for all 0 ≤ i < N. Furthermore, since Gπ is MEC-s-minimal, and by Lemma 21, we know that for all 0 ≤ i ≤ N
Gi ⊇ Gπi ⊇ Gπ.
However, since Gi ≈ Gπ and Gπi is a subDAG of Gi, then Gi = Gπi for all i. Thus, the desired weakly decreasing edge walk along Ap(C) is
(Gπ = Gπ0, Gπ1, . . . , GπN−1, GπN = Gτ).

To see that (ii) holds, suppose that Gπ ≤ Gτ but Gπ ≉ Gτ. Since Gπ ≤ Gτ, there exists a Chickering sequence from Gπ to Gτ that uses at least one arrow addition. By Lemmas 18 and 19 we can choose this Chickering sequence such that it resolves one sink at a time and reverses one covered arrow at a time. We denote this Chickering sequence by
C := (G0 := Gπ, G1, G2, . . . , GN := Gτ).
Let i denote the largest index for which Gi is obtained from Gi+1 by deletion of an arrow. Then by our choice of Chickering sequence we know that Gk is obtained from Gk+1 by a covered arrow reversal for all i < k < N. Moreover, πi = πi+1, and so Gπi = Gπi+1. Furthermore, by Lemma 21 we know that Gπk is a subDAG of Gk for all i < k ≤ N.

Suppose now that there exists some index i + 1 < k < N such that Gπk is a proper subDAG of Gk. Without loss of generality, we pick the largest such index. It follows that for all indices k < ℓ ≤ N, Gπℓ = Gℓ and that
Gk+1 ≈ Gk+2 ≈ · · · ≈ GN = Gτ.
Thus, by [4, Theorem 2], there exists a weakly decreasing edge walk from Gτ to Gk+1 on Ap(C). Since we chose the index k maximally, then Gk is obtained from Gk+1 by a covered arrow reversal. Therefore, Gπk and Gπk+1 are connected by an edge of Ap(C) indexed by a covered arrow reversal. Since |Gk| = |Gk+1| = |Gπk+1| and Gπk is a proper subDAG of Gk, the result follows.

On the other hand, suppose that for all indices i + 1 < k ≤ N, we have Gπk = Gk. Then we are precisely in the setting of Lemma 23, and so it follows that Gπi+1 is a proper subDAG of Gi+1. Since Gi+1 is obtained from Gi+2 by a covered arrow reversal, the result follows. □


B.4. Proof of Theorem 10. The proof is composed of two parts. We first prove that for any permutation π, in the limit of large n, Ĝπ equals the minimal I-MAP Gπ. We prove this by contradiction. Suppose Ĝπ ≠ Gπ. Since the BIC is consistent, in the limit of large n, Ĝπ is an I-MAP of the distribution. Since Ĝπ and Gπ share the same permutation and Gπ is a minimal I-MAP, then Gπ ⊂ Ĝπ. Suppose now that there exists (i, j) ∈ Ĝπ such that (i, j) ∉ Gπ. Since Gπ is a minimal I-MAP, we obtain that i ⊥⊥ j | PaGπ(j). Since the BIC is locally consistent, it follows that BIC(Gπ; X̂) > BIC(Ĝπ; X̂), a contradiction.

Now we prove that for any two permutations τ and π where Gτ is connected to Gπ by precisely one covered arrow reversal, in the limit of large n,
BIC(Gτ; X̂) > BIC(Gπ; X̂) ⇔ |Gτ| < |Gπ|, and
BIC(Gτ; X̂) = BIC(Gπ; X̂) ⇔ |Gτ| = |Gπ|.
It suffices to prove
|Gτ| = |Gπ| ⇒ BIC(Gτ; X̂) = BIC(Gπ; X̂)    (B.1)
and
|Gτ| < |Gπ| ⇒ BIC(Gτ; X̂) > BIC(Gπ; X̂).    (B.2)
Eq. (B.1) is easily seen to be true using [4, Theorem 2], as Gπ and Gτ are equivalent. For Eq. (B.2), by Theorem 7, since Gτ ≤ Gπ there exists a Chickering sequence from Gτ to Gπ with at least one edge addition and several covered arrow reversals. For the covered arrow reversals, the BIC remains the same since the involved DAGs are equivalent. For the edge additions, the score necessarily decreases in the limit of large n due to the increase in the number of parameters. This follows from the consistency of the BIC and the fact that the DAGs before and after edge additions are both I-MAPs of P. In this case, the path taken in the TSP algorithm using the BIC score is the same as in the original TSP algorithm. Since the TSP algorithm is consistent, it follows that the TSP algorithm with the BIC score is also consistent. □
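The comparisons above rely only on standard properties of the BIC for Gaussian DAG models, namely that the score decomposes over nodes, is consistent, and is locally consistent. For concreteness, the following is a generic sketch of such a decomposable Gaussian BIC (higher is better); the parameter-counting convention and the toy data are our own choices, and this is not claimed to be the scoring code used for the simulations in the paper.

import numpy as np

def gaussian_bic(X, parents):
    # Decomposable BIC of a linear Gaussian DAG model (higher is better).
    # X: (n, p) data matrix; parents: dict mapping column j to its parent columns.
    n, p = X.shape
    score = 0.0
    for j in range(p):
        pa = list(parents.get(j, []))
        A = np.column_stack([np.ones(n)] + [X[:, k] for k in pa])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        sigma2 = max(resid @ resid / n, 1e-12)
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
        k = len(pa) + 2                      # coefficients + intercept + noise variance
        score += loglik - 0.5 * k * np.log(n)
    return score

# Data from the chain 0 -> 1 -> 2.  Both DAGs below are I-MAPs of the distribution,
# but the sparser one typically scores higher once n is moderately large,
# illustrating the direction of Eq. (B.2).
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = -0.5 * x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
print(gaussian_bic(X, {0: [], 1: [0], 2: [1]}))      # sparse minimal I-MAP
print(gaussian_bic(X, {0: [], 1: [0], 2: [0, 1]}))   # denser I-MAP, extra penalty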

B.5. Proof of Theorem 11. It is quick to see that
faithfulness =⇒ TSP =⇒ ESP =⇒ SMR.
The first implication is given by Corollary 9, and the latter two are immediate consequences of the definitions of the TSP, ESP, and SMR assumptions. Namely, the TSP, ESP, and SMR assumptions are each defined to be precisely the condition in which Algorithm 2, Algorithm 1, and the SP Algorithm are, respectively, consistent. The implications then follow since each of the algorithms is a refined version of the preceding one in this order. Hence, we only need to show that the implications are strict. For each implication we identify a collection of CI relations satisfying the latter identifiability assumption but not the former. For the first implication consider the collection of CI relations

C = {1 ⊥⊥ 5 | {2, 3}, 2 ⊥⊥ 4 | {1, 3}, 3 ⊥⊥ 5 | {1, 2, 4}, 1 ⊥⊥ 4 | {2, 3, 5}, 1 ⊥⊥ 4 | {2, 3}}.

The sparsest DAG Gπ∗ with respect to C is shown in Figure 6. To see that C satisfies the TSP assumption with respect to Gπ∗, we can use computer evaluation.



Figure 6. A sparsest DAG w.r.t. the CI relations C given in the proof of Theorem 11.

To see that it is not faithful with respect to Gπ∗, notice that 1 ⊥⊥ 5 | {2, 3} and 1 ⊥⊥ 4 | {2, 3, 5} are both in C, but they are not implied by Gπ∗. We also remark that C is not a semigraphoid since the semigraphoid property (SG2) applied to the CI relations 1 ⊥⊥ 5 | {2, 3} and 1 ⊥⊥ 4 | {2, 3, 5} implies that 1 ⊥⊥ 5 | {2, 3, 4} should be in C.

    For the second implication consider the collection of CI relations

    D = {1 ⊥⊥ 2 | {4}, 1 ⊥⊥ 3 | {2}, 2 ⊥⊥ 4 | {1, 3}}

and initialize Algorithm 2 at the permutation π := 1423. A sparsest DAG Gπ∗ with respect to D is given in Figure 7(a), and the initial permutation DAG Gπ is depicted in Figure 7(b). Notice that the only covered arrow in Gπ is 1 → 4, and reversing this covered arrow produces the permutation τ = 4123; the corresponding DAG Gτ is shown in Figure 7(c). The only covered arrows in Gτ are 4 → 1 and 4 → 2. Reversing 4 → 1 returns us to Gπ, which we already visited, and reversing 4 → 2 produces the permutation σ = 2143; the associated DAG Gσ is depicted in Figure 7(d). Since the only DAGs connected to Gπ and Gτ via covered arrow flips have at least as many edges as Gπ (and Gτ), Algorithm 2 is inconsistent, and so the TSP assumption does not hold for D. On the other hand, we can verify computationally that Algorithm 1 is consistent with respect to D, meaning that the ESP assumption holds.
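Computational checks of this kind only require constructing the minimal I-MAP Gπ of a CI collection for a given permutation and inspecting its covered arrows. The sketch below does this under our reading of the construction (for i ≺π j, the arrow i → j is present unless the relation i ⊥⊥ j given the remaining predecessors of j lies in the collection); on the collection D and the permutation 1423 it reproduces the DAG Gπ and its unique covered arrow 1 → 4 described above.

def minimal_imap(order, ci):
    # Minimal I-MAP G_pi: for i before j in `order`, add i -> j unless the
    # oracle reports i independent of j given the remaining predecessors of j.
    parents = {v: set() for v in order}
    for pos, j in enumerate(order):
        preds = order[:pos]
        for i in preds:
            if not ci(i, j, frozenset(preds) - {i}):
                parents[j].add(i)
    return parents

# The collection D from the proof, used as a CI oracle (symmetric in i and j).
D = {(1, 2, frozenset({4})), (1, 3, frozenset({2})), (2, 4, frozenset({1, 3}))}
ci = lambda i, j, K: (i, j, K) in D or (j, i, K) in D

G = minimal_imap([1, 4, 2, 3], ci)   # the permutation 1423
print(G)                             # {1: set(), 4: {1}, 2: {4}, 3: {1, 2, 4}}
covered = [(i, j) for j in G for i in G[j] if G[j] == G[i] | {i}]
print(covered)                       # [(1, 4)], the unique covered arrow of G_pi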

    Finally, for the last implication consider the collection of CI relations

    E = {1 ⊥⊥ 3 | {2}, 2 ⊥⊥ 4 | {1, 3}, 4 ⊥⊥ 5},

and the initial permutation π = 54321. The initial DAG Gπ and a sparsest DAG Gπ∗ are depicted in Figures 8(a) and (b), respectively. It is not hard to check that any DAG Gτ that is edge adjacent to Gπ is a complete graph. Thus, the SMR assumption holds for E but not the ESP assumption. □


Figure 7. The four permutation DAGs with respect to the CI relations D described in the proof of Theorem 11.



Figure 8. The initial permutation DAG and the sparsest permutation DAG with respect to the CI relations E described in the proof of Theorem 11.

B.6. Proof of Theorem 12. Let P be a semigraphoid, and let C denote the CI relations entailed by P. Suppose for the sake of contradiction that Algorithm 2 is consistent with respect to C, but P fails to satisfy adjacency faithfulness with respect to a sparsest DAG Gπ∗. Then there exists some CI relation i ⊥⊥ j | S in C such that i → j is an arrow of Gπ∗. Now let π be any permutation respecting the concatenated ordering iSjT where T = [p] \ ({i, j} ∪ S). Then our goal is to show that any covered arrow reversal in Gπ that results in a permutation DAG Gτ with strictly fewer edges than Gπ must satisfy the condition that i → j is not an arrow in Gτ.

First, we consider the possible types of covered arrows that may exist in Gπ. To list these, it will be helpful to look at the diagram depicted in Figure 9. Notice first that we need not consider any trivially covered arrows, since such edge reversals do not decrease the number of arrows in the permutation DAGs. Any edge i → S or i → T is trivially covered, so the possible cases of non-trivially covered arrows are exactly the covered arrows given in Figure 10. In this figure, each covered arrow to be considered is labeled with the symbol ?. Notice that the claim is trivially true for cases (1) – (4); i.e., any covered arrow reversal resulting in edge deletions produces a permutation DAG Gτ for which i → j is not an arrow of Gτ.

Case (5) is also easy to see. Recall that π = is1 · · · skjt1 · · · tm where S := {s1, . . . , sk} and T := {t1, . . . , tm}, and that reversing the covered arrow in case (5) results in an edge deletion. Since s → t is covered, then there exists a linear extension τ of Gπ such that s and t are adjacent in τ. Thus, either j precedes both s and t or j follows both s and t in τ. Recall also that by [17, Theorem 7.4] we


Figure 9. This diagram depicts the possible arrows between the node sets {i}, {j}, S, and T for the minimal I-MAP Gπ considered in the proof of Theorem 12.



Figure 10. The possible non-trivially covered arrows between the node sets {i}, {j}, S, and T for the permutation DAG Gπ considered in the proof of Theorem 12 are labeled with the symbol ?. Here, we take s, s′, s′′ ∈ S and t, t′ ∈ T.

know that Gτ = Gπ. Thus, reversing the covered arrow s → t in Gτ = Gπ does not introduce the arrow i → j.

To see that the claim also holds for cases (6) and (7), we utilize the semigraphoid property (SG2). It suffices to prove the claim for case (6). So suppose that reversing the ?-labeled edge j → t from case (6) results in a permutation DAG with fewer arrows. We simply want to see that i → j is still a non-arrow in this new DAG. Assuming once more that π = is1 · · · skjt1 · · · tm, by [17, Theorem 7.4] we can, without loss of generality, pick t := t1. Thus, since i ⊥⊥ j | S and j → t is covered, then i ⊥⊥ t | S ∪ {j}. By the semigraphoid property (SG2), we then know that i ⊥⊥ j | S ∪ {t}. Thus, the covered arrow reversal j ← t produces a permutation τ = is1 · · · skt1jt2 · · · tm, and so i → j is not an arrow in Gτ. This completes all cases of the proof.

To complete the proof, we provide an example of a distribution P that satisfies TSP but not orientation faithfulness. Consider any probability distribution entailing the CI relations
C = {1 ⊥⊥ 3, 1 ⊥⊥ 5 | {2, 3, 4}, 4 ⊥⊥ 6 | {1, 2, 3, 5}, 1 ⊥⊥ 3 | {2, 4, 5, 6}}
(for example, C can be faithfully realized by a regular Gaussian). From left to right, we label these CI relations as c1, c2, c3, c4. For the collection C, a sparsest DAG Gπ∗ is depicted in Figure 11. Note that since there is no equally sparse or sparser DAG that is Markov with respect to P, then P satisfies the SMR assumption with respect to Gπ∗. Notice also that the CI relation c4 does not satisfy the orientation faithfulness assumption with respect to Gπ∗. Moreover, if Gπ entails c4, then the subDAG on the nodes π1, . . . , π5 forms a complete graph. Thus, by [4, Theorem 2], we can find a sequence of covered arrow reversals preserving edge count such that after all covered arrow reversals, π5 = 6. Then transposing the entries π5π6 produces a permutation τ in which c3 holds. Therefore, the number of arrows in Gπ is at least the number of arrows in Gτ. Even more, Gτ is an independence map of Gπ∗, i.e., Gπ∗ ≤ Gτ. So by Proposition 8, there exists a weakly decreasing edge walk (of covered arrow reversals) along Ap(C) taking us from Gτ to Gπ∗. Thus, we conclude that P satisfies TSP, but not orientation faithfulness. □



Figure 11. A sparsest DAG for the CI relations C considered in the proof of Theorem 12.

Appendix C. Proofs for Results on the Uniform Consistency of the Greedy SP Algorithm

To prove Theorem 13, we require a pair of lemmas, the first of which shows that the conditioning sets in the TSP algorithm can be restricted to parent sets of covered arrows.

Lemma 24. Suppose that the data-generating distribution P is faithful to G∗. Then for any permutation π and any covered arrow i → j in Gπ it holds that
(a) i ⊥⊥ k | (S′ ∪ {j}) \ {k} if and only if i ⊥⊥ k | (S ∪ {j}) \ {k},
(b) j ⊥⊥ k | S′ \ {k} if and only if j ⊥⊥ k | S \ {k},

for all k ∈ S, where S is the set of common parent nodes of i and j, and S′ = {a : a [...]

[...] P[ |ẑi,j|S − zi,j|S| > γ ] ≤ O(n − |S|) · Φ, where
Φ = exp((n − 4 − |S|) log((4 − (γ/L)²)/(4 + (γ/L)²))) + exp(−C2(n − |S|)),


where C2 is some constant such that 0 < C2 < ∞. [...] The error event Ei,j|S for the test of the hypothesis ρi,j|S = 0 occurs when √(n − |S| − 3) |ẑi,j|S| > Φ⁻¹(1 − α/2) while zi,j|S = 0, or √(n − |S| − 3) |ẑi,j|S| ≤ Φ⁻¹(1 − α/2) while zi,j|S ≠ 0.

Choosing αn = 2(1 − Φ(n^(1/2) cn/2)), it follows under assumption (A4) that
P[Ei,j|S] ≤ P[ |ẑi,j|S − zi,j|S| > (n/(n − |S| − 3))^(1/2) cn/2 ].
Now, by (A2) we have that |S| ≤ p = O(n^a). Hence it follows that
P[Ei,j|S] ≤ P[ |ẑi,j|S − zi,j|S| > cn/2 ].
Then, Lemma 25, together with the fact that log((4 − δ²)/(4 + δ²)) ∼ −δ²/2 as δ → 0, implies that
P[Ei,j|S] ≤ O(n − |S|) exp(−c′(n − |S|)cn²) ≤ O(exp(log n − c n^(1−2ℓ)))     (C.1)


for some constants c, c′ > 0. Since the DAG estimated using Algorithm 5 is not consistent when at least one of the partial correlation tests is not consistent, the probability of inconsistency can be estimated as follows:
P[an error occurs in Algorithm 5] ≤ P[ ⋃_{i,j,S ∈ Kπ̂ ∪ Lπ̂} Ei,j|S ] ≤ |Kπ̂ ∪ Lπ̂| · sup_{i,j,S ∈ Kπ̂ ∪ Lπ̂} P(Ei,j|S).     (C.2)

Next note that assumption (A3) implies that the size of the set adj(Gπ0, i) ∪ adj(Gπ0, j) is at most dπ0. Therefore, |Kπ0| ≤ p² · dπ0 · 2^{dπ0} and |Lπ0| ≤ p². Thus, we see that
|Kπ̂ ∪ Lπ̂| ≤ |Kπ̂| + |Lπ̂| ≤ (2^{dπ0} · dπ0 + 1) p².
Therefore, the left-hand side of inequality (C.2) is upper-bounded by
(2^{dπ0} · dπ0 + 1) p² · sup_{i,j,S ∈ Kπ̂ ∪ Lπ̂} P(Ei,j|S).
Combining this observation with the upper bound computed in (C.1), we obtain that the left-hand side of (C.2) is upper-bounded by
(2^{dπ0} · dπ0 + 1) p² · O(exp(log n − c n^(1−2ℓ))) ≤ O(exp(dπ0 log 2 + 2 log p + log dπ0 + log n − c n^(1−2ℓ))).
By assumptions (A3) and (A4) it follows that n^(1−2ℓ) dominates all terms in this bound. Thus, we conclude that
P[estimated DAG is consistent] ≥ 1 − O(exp(−c n^(1−2ℓ))). □
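The error events Ei,j|S above refer to the usual test of a vanishing partial correlation based on Fisher's z-transform, as in [11]. A minimal sketch of this test is given below; the helper is our own, and details such as the use of a pseudo-inverse are implementation choices rather than part of the proof.

import numpy as np
from scipy.stats import norm

def partial_corr_test(X, i, j, S, alpha=0.01):
    # Test H0: rho_{i,j|S} = 0 via Fisher's z-transform: reject when
    # sqrt(n - |S| - 3) * |zhat| exceeds the (1 - alpha/2) normal quantile.
    n = X.shape[0]
    idx = [i, j] + list(S)
    prec = np.linalg.pinv(np.cov(X[:, idx], rowvar=False))
    rho = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    rho = float(np.clip(rho, -1 + 1e-12, 1 - 1e-12))
    zhat = 0.5 * np.log((1 + rho) / (1 - rho))
    return np.sqrt(n - len(S) - 3) * abs(zhat) > norm.ppf(1 - alpha / 2)

# Data from the chain X0 -> X1 -> X2: rho_{0,2} != 0 but rho_{0,2|{1}} = 0.
rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
print(partial_corr_test(X, 0, 2, []))    # True: the marginal dependence is detected
print(partial_corr_test(X, 0, 2, [1]))   # False with high probability (type I errors occur at rate alpha)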

    The proof of Theorem 15 is based on the following lemma.

Lemma 26. Let P be a distribution on [p] that is faithful to a DAG G, and let PS denote the marginal distribution on S ⊂ [p]. Let GS be the undirected graphical model corresponding to PS, i.e., the edge {i, j} is in GS if and only if ρi,j|S\{i,j} ≠ 0. Then GS\{k} can be obtained from GS as follows:
(1) for all i, j ∈ adj(GS, k), if {i, j} is not an edge in GS, then add {i, j}; otherwise, {i, j} is an edge of GS\{k} if and only if ρi,j|S\{i,j,k} ≠ 0;
(2) for all i, j ∉ adj(GS, k), {i, j} is an edge of GS\{k} if and only if {i, j} is an edge in GS.

Proof. First, we prove:
For i, j ∉ adj(GS, k): {i, j} is an edge in GS\{k} iff {i, j} is an edge in GS.
Suppose at least one of i or j is not adjacent to node k in GS. Without loss of generality, we assume i is not adjacent to k in GS; this implies that ρi,k|S\{i,k} = 0. To prove the desired result we must show that
ρi,j|S\{i,j} = 0 ⇔ ρi,j|S\{i,j,k} = 0.
To show this equivalence, first suppose that ρi,j|S\{i,j} = 0 but ρi,j|S\{i,j,k} ≠ 0. This implies that there is a path P between i and j through k such that nodes i and j are d-connected given S \ {i, j, k} and d-separated given S \ {i, j}. This


implies that k is a non-collider along P. Define Pi as the path connecting i and k in the path P and Pj as the path connecting j and k in P. Then the nodes i and j are d-connected to k given S \ {i, k} and S \ {j, k}, respectively, by using Pi and Pj. Since j is not on Pi, clearly i and k are also d-connected given S \ {i, j, k} through Pi, and the same holds for j.

Conversely, suppose that ρi,j|S\{i,j,k} = 0 but ρi,j|S\{i,j} ≠ 0. Then there exists a path P that d-connects nodes i and j given S \ {i, j}, while i and j are d-separated given S \ {i, j, k}. Thus, one of the following must occur:
(1) k is a collider on the path P, or
(2) some node ℓ ∈ an(S \ {i, j}) \ an(S \ {i, j, k}) is a collider on P.
For case (2), there must exist a path ℓ → · · · → k that d-connects ℓ and k given S \ {i, j, k} and ℓ ∉ S. Such a path exists since ℓ is an ancestor of k and not an ancestor of any other node in S \ {i, j, k}. So in both cases i and k are also d-connected given S \ {i, j, k} using a path that does not contain the node j. Hence, i and k are also d-connected given S \ {i, k}, a contradiction.

Next, we prove that for i, j ∈ adj(GS, k), if {i, j} is not an edge in GS, then {i, j} is an edge in GS\{k}. Since i ∈ adj(GS, k), there exists a path Pi that d-connects i and k given S \ {i, k}, and similarly for j. Using the same argument as above, i and j are also d-connected to k using Pi and Pj, respectively, given S \ {i, j, k}. Defining P as the path that combines Pi and Pj, then k must be a non-collider along P, as otherwise i and j would be d-connected given S \ {i, j}, in which case i and j would also be d-connected given S \ {i, j, k}, and {i, j} would be an edge in GS\{k}. □

C.2. Proof of Theorem 15. In the oracle setting, there are two main differences between Algorithm 6 and the minimum degree algorithm. First, Algorithm 6 uses partial correlation testing to construct a graph, while the minimum degree algorithm uses the precision matrix Θ. The second difference is that Algorithm 6 only updates based on the partial correlations of neighbors of the tested nodes.

Let ΘS denote the precision matrix of the marginal distribution over the variables {Xi : i ∈ S}. Since the marginal distribution is Gaussian, the (i, j)-th entry of ΘS is nonzero if and only if ρi,j|S\{i,j} ≠ 0. Thus, to prove that Algorithm 6 and the minimum degree algorithm are equivalent, it suffices to show the following: Let GS be an undirected graph with edges corresponding to the nonzero entries of ΘS. Then for any node k, the graph GS\{k} constructed as defined in Algorithm 6 has edges corresponding to the nonzero entries of ΘS\{k}. To prove that this is indeed the case, note that by Lemma 26, if GS is already estimated then nodes i and j are connected in GS\{k} if and only if ρi,j|S\{i,j,k} ≠ 0. Finally, since the marginal distribution over S is multivariate Gaussian, the (i, j)-th entry of ΘS\{k} is nonzero if and only if ρi,j|S\{i,j,k} ≠ 0. □
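The identification used here, namely that the (i, j) entry of the marginal precision matrix ΘS is nonzero exactly when ρi,j|S\{i,j} ≠ 0, is easy to check numerically. The following small sketch (our own, on an arbitrary Gaussian whose precision matrix is a path) also illustrates the effect of marginalizing out a node, as described by Lemma 26:

import numpy as np

def ugraph(Sigma, S, tol=1e-9):
    # Undirected graph of the Gaussian marginal on S: {i, j} is an edge iff the
    # (i, j) entry of the marginal precision matrix Theta_S is nonzero.
    S = list(S)
    Theta = np.linalg.inv(Sigma[np.ix_(S, S)])
    return {frozenset((S[a], S[b]))
            for a in range(len(S)) for b in range(a + 1, len(S))
            if abs(Theta[a, b]) > tol}

# Covariance of a Gaussian Markov to the path 0 - 1 - 2 - 3 (tridiagonal precision).
K = np.diag([2.0, 2.0, 2.0, 2.0])
for a in range(3):
    K[a, a + 1] = K[a + 1, a] = 0.8
Sigma = np.linalg.inv(K)

print(ugraph(Sigma, [0, 1, 2, 3]))   # the path edges {0,1}, {1,2}, {2,3}
print(ugraph(Sigma, [0, 2, 3]))      # marginalizing out node 1 creates the edge {0,2}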

C.3. Proof of Theorem 16. Let Poracle(π̂) denote the probability that π̂ is output by Algorithm 6 in the oracle setting, and let Nπ̂ denote the set of partial correlation tests that had to be performed. Then |Nπ̂| ≤ O(p dπ̂²), where dπ̂ is the maximum degree of the corresponding minimal I-MAP Gπ̂. Therefore, using the


same arguments as in the proof of Theorem 13, we obtain:
P[π̂ is generated by Algorithm 6]
≥ Poracle(π̂) · P[all hypothesis tests for generating π̂ are consistent]
≥ Poracle(π̂) (1 − O(p dπ̂²) sup_{(i,j,S)∈Nπ̂} P(Ei,j|S))
≥ Poracle(π̂) (1 − O(exp(2 log dπ̂ + log p + log n − c′ n^(1−2ℓ))))
≥ Poracle(π̂) (1 − O(exp(−c n^(1−2ℓ)))).

Let Π denote the set of all possible output permutations of the minimum degree algorithm applied to Θ. Then
P[Algorithm 6 outputs a permutation in Π]
≥ ∑_{π̂∈Π} P[π̂ is output by Algorithm 6]
≥ 1 − O(exp(−c n^(1−2ℓ))),
which completes the proof. □
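For reference, the minimum degree algorithm invoked throughout this proof is the classical elimination heuristic from sparse matrix computations [9]. A minimal sketch on the support graph of Θ is given below; it is our own illustration, ties are broken arbitrarily, and the conversion of the elimination order into the output permutation follows whatever convention is fixed for Algorithm 6.

import numpy as np

def min_degree_order(Theta, tol=1e-9):
    # Repeatedly eliminate a vertex of minimum current degree in the support
    # graph of Theta, connecting its remaining neighbors (fill-in) before removal.
    p = Theta.shape[0]
    nbrs = {i: {j for j in range(p) if j != i and abs(Theta[i, j]) > tol}
            for i in range(p)}
    order, remaining = [], set(range(p))
    while remaining:
        v = min(remaining, key=lambda u: len(nbrs[u]))
        order.append(v)
        for a in nbrs[v]:
            for b in nbrs[v]:
                if a != b:
                    nbrs[a].add(b)          # fill-in among the neighbors of v
        for u in remaining:
            nbrs[u].discard(v)
        remaining.remove(v)
    return order

# Precision matrix of a star graph centered at node 0: the leaves have degree 1,
# so they are eliminated first and the hub last, with no fill-in.
Theta = 2.0 * np.eye(5)
for leaf in range(1, 5):
    Theta[0, leaf] = Theta[leaf, 0] = 0.4
print(min_degree_order(Theta))   # e.g. [1, 2, 3, 4, 0]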

    References

[1] Bernstein, D. I., Saeed, B., Squires, C., & Uhler, C. (2019). Ordering-based causal structure learning in the presence of latent variables. Preprint available at arXiv:1910.09014.
[2] Bouckaert, R. R. (1992). Optimizing causal orderings for generating DAGs from data. Proceedings of the Eighth International Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[3] Castelo, R. & Kocka, T. (2003). On inclusion-driven learning of Bayesian networks. Journal of Machine Learning Research 4: 527-574.
[4] Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[5] Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research 3: 507-554.
[6] Cooper, G. F. & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9.4: 309-347.
[7] Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology 7: 601-620.
[8] Fukumizu, K., Gretton, A., Sun, X., & Schölkopf, B. (2008). Kernel measures of conditional dependence. Advances in Neural Information Processing Systems.
[9] George, A. (1973). Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis 10.2: 345-363.
[10] Gillispie, S. B. & Perlman, M. D. (2001). Enumerating Markov equivalence classes of acyclic digraph models. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc.
[11] Kalisch, M. & Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research 8: 613-636.
[12] Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., & Bühlmann, P. (2011). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47: 1-26.
[13] Larrañaga, P., Kuijpers, C. M. H., Murga, R. H., & Yurramendi, Y. (1996). Learning Bayesian network structures by searching for the best ordering with genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 26.4: 487-493.
[14] Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
[15] Lemeire, J., Meganck, S., Cartella, F., & Liu, T. (2012). Conservative independence-based causal structure learning in absence of adjacency faithfulness. International Journal of Approximate Reasoning 53: 1305-1325.
[16] Meek, C. (1997). Graphical Models: Selecting Causal and Statistical Models. PhD thesis, Carnegie Mellon University.

[17] Mohammadi, F., Uhler, C., Wang, C., & Yu, J. (2016). Generalized permutohedra fr