
Synthesizing Transformations on Hierarchically Structured Data

Navid Yaghmazadeh, UT Austin
[email protected]

Christian Klinger, UT Austin
[email protected]

Isil Dillig, UT Austin
[email protected]

Swarat Chaudhuri, Rice University
[email protected]

Abstract

This paper presents a new approach for automatically synthesizing transformations on hierarchically structured data, such as Unix directories and XML documents. We consider a general abstraction for such data, called hierarchical data trees (HDTs), and present a novel algorithm for synthesizing HDT transformations from input-output examples. The key insight underlying our algorithm is to reduce the task of synthesizing tree transformations to that of inferring path transformations. We also present a novel algorithm based on SMT solving and decision tree learning for synthesizing path transformers. We have implemented the proposed ideas in a system called HADES and used it for synthesizing transformations on Unix directories and XML documents. Our evaluation shows that HADES is able to automatically synthesize a variety of interesting transformations that we have collected from online forums.

1. Introduction

Much of the data that users deal with today is inherently hierarchical in nature. Some common examples of such hierarchically structured data include, but are not limited to, the following:

• File systems: Modern operating systems represent the data stored on disk as a tree, where directories represent internal nodes and files correspond to leaves.

• XML documents: In XML documents, data is organized as a tree structure, where each subtree is identified by a pair of start and end tags.

• HDF files: Many kinds of scientific documents are stored as HDF files that follow a tree structure. In this format, so-called groups correspond to internal nodes of the tree, and datasets represent leaves.

• Hierarchical spreadsheets: Several varieties of data organization software (e.g., TreeSheets [3], Smartsheet [2]) allow users to construct so-called hierarchical spreadsheets where data is arranged and visualized in a tree-like format.

Given the ubiquity of such hierarchical data, a common problem that end-users face is to perform various kinds of tree transformation tasks on this data. For instance, consider the following motivating scenarios:

• Given a directory tree called Music with subdirectories for different musical genres, a user wants to re-organize her music files so that Music has subdirectories for different classes of files (e.g., mp3, wma), and each such subdirectory has further subdirectories that correspond to genres (Rock, Jazz, etc.).

• Given an XML file, a user wants to convert the name attribute associated with a person tag (e.g., <person name=...> ... </person>) to a nested element within the person tag (e.g., <person><name>... </name></person>).

Since manually performing these kinds of transformations on large datasets is tedious and error-prone, an attractive alternative is to implement a program, such as a bash script or XSLT program, to automate the desired task. Motivated by these scenarios, this paper proposes a new algorithm, and its implementation in a system called HADES¹, for automatically synthesizing hierarchical data transformations from input-output examples. Our algorithm operates on a general abstraction of hierarchical data, called hierarchical data trees (HDTs). The algorithm does not place any restrictions on the depth or fanout of the hierarchy and is able to synthesize a rich class of tree transformations that commonly arise in real-world data manipulation tasks, such as restructuring of the data hierarchy and modifying data and metadata.

From a technical perspective, a key idea underlying our approach is to reduce the problem of synthesizing tree transformations to that of learning path transformations. Specifically, rather than directly synthesizing a tree transformation, our algorithm represents hierarchically structured data as a set of paths and learns a transformation that allows us to convert each path in the input tree to a corresponding path in the output tree. After learning such a path transformation, our approach synthesizes a program that generates all paths in the input tree, applies the transformation to each path, and then reconstructs the corresponding output tree. As a result, the scripts generated by our technique can be more complex than those that operate directly on trees. However, a key advantage of this approach is that it simplifies the synthesis task and allows us to handle a broad class of hierarchical data transformations in a practical manner.

Another important aspect of our approach is a new algorithm for learning path transformations. Given a set of input-output paths, our approach synthesizes a simplest path transformer, minimizing the number of branches in the synthesized program. To learn such a path transformer, our approach alternates between two phases called unification and classification. Given a set S of pairs of input and output paths, the goal of unification is to determine whether there exists a conditional-free program that achieves the desired transformation on S. By contrast, the goal of classification is to learn concise predicates that differentiate one unifiable subset of examples from another. Our new synthesis algorithm reduces unification to an SMT solving problem and uses decision tree learning for performing classification.

We have implemented our technique in a system called HADES, which provides a language-agnostic backend for synthesizing HDT transformations. In principle, HADES can be used to generate code in any domain-specific language (DSL), provided it has been plugged into our infrastructure. Our current implementation provides two DSL front-ends, one that generates bash scripts for Unix directories, and another that outputs XSLT [4] code for performing XML transformations.

To summarize, this paper makes the following contributions:

¹ HADES stands for HierArchical Data transformation Engine and Synthesizer.


Original directory:
  Music/
    Jazz/
      screenplay.flac
    Rock/
      help.ogg
      Muse/
        uprising.flac
    Pop/
      Adele/
        skyfall.mp3

Desired directory:
  Music/
    flac/
      Jazz/
        screenplay.flac
      Rock/
        Muse/
          uprising.flac
    ogg/
      Rock/
        help.ogg
    mp3/
      Jazz/
        screenplay.mp3
      Rock/
        Muse/
          uprising.mp3
      Pop/
        Adele/
          skyfall.mp3

Figure 1. Input-output directories for motivating example

• We present a method, embodied in a system called HADES, for synthesizing transformations on hierarchically structured data. The programs that can be synthesized by our approach are useful to end-users in a variety of different scenarios, such as those involving file systems, XML files, or hierarchical spreadsheets.

• We show how to reduce the problem of synthesizing tree transformations to the simpler task of synthesizing path transformations. We also prove the soundness and relative completeness of this approach under certain lightweight restrictions on the input-output trees.

• We present a new synthesis algorithm based on SMT solving and decision tree learning for synthesizing a broad class of path transformations.

• We empirically demonstrate the applicability of our approach by using HADES to synthesize bash scripts and XSLT programs for various tasks obtained from online forums. Our evaluation shows that HADES is practical, both in terms of running time and the input required from end-users.

2. Overview

In this section, we illustrate our approach using a motivating example from the file system domain. Consider a user, Bob, who has a large collection of music files organized by genres. Specifically, Bob has a Music folder containing subdirectories such as Rock, Classical, etc., and each of these folders in turn contains a collection of music files and subfolders (e.g., one subfolder for each band). Furthermore, suppose that Bob's music files come in three different formats: mp3, ogg, and flac. However, since not every music player supports all of these different formats, Bob wants to categorize his music based on file type while also maintaining the original organization based on genres. In addition, since few applications can play music files in flac format, Bob also wants to convert all his flac files to mp3 and keep both the original as well as the converted files.

We now describe how Bob can use the HADES system for synthesizing a bash script that performs the exact transformation he wants. To use HADES, Bob starts by constructing the small input-output example shown in Figure 1. Specifically, the left hand side of Figure 1 corresponds to the input directory, and the right hand side shows the desired output directory. As an example, consider the file called screenplay.flac under the Jazz subfolder in the original directory. In the output directory, there are two files called screenplay.flac and screenplay.mp3. The latter one appears under the Jazz subfolder of the mp3 directory, while screenplay.flac appears under the Jazz subfolder of the flac directory.

Given this input, HADES first converts each of the input and output directories to an intermediate representation called a hierarchical data tree (HDT) and then constructs a new set of input-output examples E where each example e ∈ E represents a path transformation. Specifically, for the directories shown in Figure 1, HADES generates the path transformation examples E shown in Figure 2. Each path transformation example e ∈ E consists of a pair of paths (p1, p2) where p1 is a root-to-leaf path in the input HDT and p2 is a corresponding path in the output HDT. We represent paths as a sequence of pairs (l, d) where l is the label associated with a node in the HDT and d is the corresponding data. In this particular application domain, labels correspond to directory/file names, and data includes information about permissions, owner, file type, etc. Observe that each path in the input HDT may appear in multiple examples of E due to duplication; for instance, input path p1 in Figure 2 appears in both examples E1 and E2.

Once HADES constructs the path transformation examples E, it then proceeds to synthesize a path transformation function f such that p′ ∈ f(p) for every example (p, p′) ∈ E. Specifically, a path transformer is a function that takes an input path and computes a set of output paths. Note that path transformers return a set rather than a single path because we allow duplication as well as deletion.

In the HADES system, the synthesis of path transformers consists of two phases: In the first phase, we partition the input examples into sets of unifiable groups, and in the second phase, we perform classification by finding a predicate that differentiates each input path in one partition from other input paths in the remaining partitions. As mentioned in Section 1, we perform unification using SMT solving and classification using decision tree learning.

Going back to our example, HADES partitions examples E into two groups P1, P2, where P1 contains {E1, E3, E4, E6} and P2 contains {E2, E5}. For partition P1, we infer the unifier χ1:

concat( map Id subpath(x, 1, 1),
        map ExtOf subpath(x, size(x), size(x)),
        map Id subpath(x, 2, size(x)) )

where subpath(x, t, t′) yields a subpath of x between indices t and t′, Id is the identity function, and ExtOf yields the extension (i.e., file type) for a given node. Similarly, for partition P2, we infer the following unifier χ2, where FlacToMp3 is a function for converting flac files to mp3:


E1 : p1 = [(Music, dm), (Jazz, dj), (screenplay, ds)]            ↦  p′1 = [(Music, dm), (flac, df), (Jazz, dj), (screenplay, ds)]
E2 : p1 = [(Music, dm), (Jazz, dj), (screenplay, ds)]            ↦  p′′1 = [(Music, dm), (mp3, dmp), (Jazz, dj), (screenplay, d′s)]
E3 : p2 = [(Music, dm), (Rock, dr), (help, dh)]                  ↦  p′2 = [(Music, dm), (ogg, do), (Rock, dr), (help, dh)]
E4 : p3 = [(Music, dm), (Rock, dr), (Muse, dms), (uprising, du)] ↦  p′3 = [(Music, dm), (flac, df), (Rock, dr), (Muse, dms), (uprising, du)]
E5 : p3 = [(Music, dm), (Rock, dr), (Muse, dms), (uprising, du)] ↦  p′′3 = [(Music, dm), (mp3, dmp), (Rock, dr), (Muse, dms), (uprising, d′u)]
E6 : p4 = [(Music, dm), (Pop, dp), (Adele, da), (skyfall, dsf)]  ↦  p′4 = [(Music, dm), (mp3, dmp), (Pop, dp), (Adele, da), (skyfall, dsf)]

Figure 2. Path transformation examples constructed by HADES

srcDir=$1
for inputFile in $srcDir/*; do
  elems=$(split $inputFile)
  size=$(SizeOf $elems)
  output=concat($elems[0], $(Ext $elems[$size-1]), subList(1, $size-1, $elems))
  outputPaths+=$output
  if [[ $(Ext $inputFile) == flac ]]; then
    output=concat($elems[0], "mp3", subList(1, $size-2, $elems), $(convertFormat $elems[$size-1]))
    outputPaths+=$output
  fi
done
makeDirectories $outputPaths

Figure 3. (Pseudo-)bash script synthesized by HADES

concat( map Id subpath(x, 1, 1),
        "mp3",
        map Id subpath(x, 2, size(x) − 1),
        map FlacToMp3 subpath(x, size(x), size(x)) )

Next, we find a predicate for each partition by performing classification. Specifically, the input paths for partition P1 are {p1, p2, p3, p4}, and the ones for P2 are {p1, p3}. Since all input paths of E are also in P1, we infer the classifier φ1 : true for P1. On the other hand, for partition P2, a simplest predicate that differentiates {p1, p3} from {p2, p4} is φ2 : ext = “flac”. Hence, the overall path transformer inferred by our algorithm is:

λx. (χ1; if(ext = “flac”) then χ2)

The final step of our algorithm is to synthesize a tree transformation using this inferred path transformer. For this purpose, HADES generates the (pseudo-)bash script shown in Figure 3. Effectively, the synthesized script iterates over every file f in the input directory, applies the synthesized path transformer to f, and obtains a new set of output paths. Then, it creates the new output directory by calling a pre-defined function called makeDirectories which creates a new directory structure from the output paths.

Finally, going back to our motivating end-user scenario, Bob can now apply the script synthesized by HADES to his very large music collection and obtain the desired transformation.
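To make the synthesized behavior concrete, here is a small Python rendering of this inferred path transformer; it is our own illustrative helper (operating on '/'-separated path strings, with flac-to-mp3 conversion only mimicked by renaming), not the code HADES emits.

from typing import List

def transform_path(path: str) -> List[str]:
    # lambda x. (chi_1; if ext = "flac" then chi_2), written over '/'-separated paths.
    elems = path.split("/")                                  # e.g. ['Music', 'Jazz', 'screenplay.flac']
    ext = elems[-1].rsplit(".", 1)[-1]
    outputs = ["/".join([elems[0], ext] + elems[1:])]        # chi_1: Music/<ext>/<rest>
    if ext == "flac":                                        # classifier phi_2
        converted = elems[-1][:-len("flac")] + "mp3"         # stands in for FlacToMp3
        outputs.append("/".join([elems[0], "mp3"] + elems[1:-1] + [converted]))  # chi_2
    return outputs

print(transform_path("Music/Jazz/screenplay.flac"))
# ['Music/flac/Jazz/screenplay.flac', 'Music/mp3/Jazz/screenplay.mp3']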

3. Hierarchical Data Trees

In this section, we introduce hierarchical data trees (HDTs), which our system uses as the canonical representation for various kinds of semi-structured data such as file system directories and XML documents. We will also prove some important properties of HDTs that will be useful later.

Definition 1. (Hierarchical data tree) Assume a universe Id of labels for tree nodes and a universe Dat of data. A hierarchical data tree T is a rooted tree represented as a quadruple (V, E, L, D) where V is a set of nodes, and E is a set of directed edges. The labeling function L : V → Id assigns a label to each node v ∈ V. The data store D : V → Dat maps each node v ∈ V to the data associated with v.

An important point about hierarchical data trees is that the labeling function L does not need to be one-to-one. That is, for nodes v and v′, it is possible that L(v) = L(v′). With slight abuse of notation, we write L(V) to mean the multi-set {ℓ | v ∈ V ∧ L(v) = ℓ}. Similarly, we write L(E) to denote the multi-set:

{(ℓ, ℓ′) | (v, v′) ∈ E ∧ L(v) = ℓ ∧ L(v′) = ℓ′}

Example 1. Each directory in a file system can be viewed as a hierarchical data tree where vertices correspond to files and directories, and an edge from v to v′ indicates that v is the parent directory of v′. The labeling function assigns, to each node v, the name of the file or directory associated with v. The data store D maps each internal node (i.e., directory) to the metadata associated with that directory (e.g., permissions, creation date, etc.). Similarly, D maps each leaf node (i.e., file) to its content and metadata.

Example 2. An XML file can also be viewed as a hierarchical data tree where nodes correspond to XML elements, and an edge from v to v′ indicates that the element represented by v′ is nested directly inside element v. The labeling function L assigns a label (s, i) to each element v, where s is the name of the tag associated with v and i indicates that v is the i'th element with tag s under v's parent. The data store D maps each element v to its attributes.
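To make Definition 1 concrete, the following sketch models an HDT as a small Python class and rebuilds a fragment of the Music directory from Figure 1; the class and field names are ours, not part of HADES.

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class HDTNode:
    # One vertex of a hierarchical data tree: a label (from Id) and a data value (from Dat).
    label: str
    data: Any = None
    children: Dict[str, "HDTNode"] = field(default_factory=dict)

    def add(self, child: "HDTNode") -> "HDTNode":
        # Storing a child under its label records the edge (v, v') of E.
        self.children[child.label] = child
        return child

# A fragment of the original Music directory from Figure 1; the data values
# stand in for file metadata such as permissions and owner.
music = HDTNode("Music", {"type": "dir"})
jazz = music.add(HDTNode("Jazz", {"type": "dir"}))
jazz.add(HDTNode("screenplay.flac", {"type": "flac", "perm": "rw"}))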

Notation. Given a hierarchical data tree T = (V, E, L, D), we write root(T) to indicate the unique root of the tree. For a given node v, we write parent(v) to denote the unique node v′ such that (v′, v) ∈ E. We define children(v) as {v′ | (v, v′) ∈ E}. Finally, we write subtree(T, v) to denote the subtree of T rooted at node v, and we write isLeaf(T, v) to indicate that v is a leaf node in T.

Definition 2. (Well-formedness) We say that a hierarchical data tree T = (V, E, L, D) is well-formed iff no two sibling vertices have the same label, i.e.:

∀v, v′. ((parent(v) = parent(v′) ∧ v ≠ v′) ⇒ L(v) ≠ L(v′))

In later sections of this paper, we will assume that all hierarchical data trees are well-formed, and we will use the term "tree" to mean a well-formed hierarchical data tree. We note that our well-formedness assumption is a lightweight restriction that applies to many real-world domains. For example, file system directories satisfy the well-formedness assumption because there cannot be two files or directories with the same name under the same directory. XML documents also satisfy this assumption because the order in which tags appear in a document is significant; hence, we can assign two different labels to sibling elements with the same tag name (recall the labeling function from Example 2).
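A minimal check of Definition 2, assuming a throwaway (label, data, children) tuple encoding of trees rather than HADES' internal representation; the two test trees have the shapes used later in Figure 4.

from typing import Any, List, Tuple

Node = Tuple[str, Any, List["Node"]]     # (label, data, children); siblings may share labels here

def is_well_formed(node: Node) -> bool:
    # Definition 2: no two siblings anywhere in the tree have the same label.
    _label, _data, children = node
    labels = [c[0] for c in children]
    if len(labels) != len(set(labels)):
        return False
    return all(is_well_formed(c) for c in children)

t1 = ("A", None, [("B", None, [("C", None, [])]),
                  ("B", None, [("D", None, [])])])                    # ill-formed: two siblings labeled B
t2 = ("A", None, [("B", None, [("C", None, []), ("D", None, [])])])   # well-formed
assert not is_well_formed(t1) and is_well_formed(t2)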

Definition 3. (Path) A path p in a hierarchical data tree T = (V, E, L, D) is a sequence [(ℓ1, d1), . . . , (ℓk, dk)] such that:

• ℓ1 = L(r) and d1 = D(r), where r is the root of T
• ℓk = L(v) and dk = D(v), where v is a leaf of T
• For every i ∈ [1, k), there exists an edge (v, v′) ∈ E such that L(v) = ℓi, L(v′) = ℓi+1, D(v) = di, and D(v′) = di+1.


T1:          T2:
    A            A
   / \           |
  B   B          B
  |   |         / \
  C   D        C   D

Figure 4. Example to motivate well-formedness

Given a path p = [(ℓ1, d1), . . . , (ℓk, dk)], we write p[i].ℓ and p[i].d to indicate ℓi and di respectively. The set of paths in a tree T is denoted by paths(T), and we will write pathTo(T, v) to denote a path starting at the root of T and ending in v. Note that pathTo(T, v) ends in (ℓ, d) such that L(v) = ℓ and D(v) = d.
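Enumerating paths(T) is a straightforward recursion over root-to-leaf paths; the sketch below reuses the illustrative tuple encoding from the previous sketch, not HADES' actual data structures.

from typing import Any, List, Tuple

Node = Tuple[str, Any, List["Node"]]      # (label, data, children)
Path = List[Tuple[str, Any]]              # [(l1, d1), ..., (lk, dk)]

def paths(node: Node, prefix: Path = None) -> List[Path]:
    # Collect every root-to-leaf path of the tree, as in Definition 3.
    label, data, children = node
    prefix = (prefix or []) + [(label, data)]
    if not children:                      # a leaf completes one path
        return [prefix]
    result: List[Path] = []
    for child in children:
        result.extend(paths(child, prefix))
    return result

t2 = ("A", "d", [("B", "d", [("C", "d", []), ("D", "d", [])])])
print(paths(t2))
# [[('A', 'd'), ('B', 'd'), ('C', 'd')], [('A', 'd'), ('B', 'd'), ('D', 'd')]]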

Proposition 1. Let p be a path in a well-formed hierarchical data tree T = (V, E, L, D). There exists a unique v ∈ V such that pathTo(T, v) = p.

Observe that this proposition follows immediately from the definition of well-formedness. We now define what it means for two HDTs to be equivalent:

Definition 4. (Equivalence of HDTs) Let T1 = (V1, E1, L1, D1) with root v1 and T2 = (V2, E2, L2, D2) with root node v2. We say that T1 is equivalent to T2, written T1 ≡ T2, iff the following conditions hold:

1. L1(v1) = L2(v2) and D1(v1) = D2(v2)
2. L1(children(v1)) = L2(children(v2))
3. For every v′1 ∈ children(v1) and v′2 ∈ children(v2) such that L1(v′1) = L2(v′2), we have subtree(T1, v′1) ≡ subtree(T2, v′2).

Intuitively, two hierarchical data trees T1 and T2 are equivalent if they are indistinguishable from each other with respect to the labeling functions (L1, L2) and data stores (D1, D2). A very important property of well-formed hierarchical data trees is that their equivalence can also be stated in terms of paths:

Theorem 1.² Let T = (V, E, L, D) and T′ = (V′, E′, L′, D′) be two well-formed hierarchical data trees. Then, T ≡ T′ if and only if paths(T) = paths(T′).

This theorem is important for our approach because the programs we synthesize construct the output tree from a set of paths. In particular, this theorem states that a given set of paths uniquely defines a well-formed hierarchical data tree. In contrast, if we lift the well-formedness assumption, then a set of paths does not define a unique hierarchical data tree. This point is illustrated by the following example.

Example 3. Consider the two trees shown in Figure 4, where the letters on each node indicate their label, and assume that all nodes store the same data d. For both trees, paths(T) is the set:

{[(A, d), (B, d), (C, d)], [(A, d), (B, d), (D, d)]}

However, observe that the left tree T1 violates the well-formedness assumption because A has two children with label B. As this example illustrates, a set of paths does not uniquely define an HDT if we remove the well-formedness assumption. Fortunately, well-formedness guarantees that we can always reconstruct an HDT from its constituent paths.

² Proofs of all non-trivial statements are given in the appendix.

1: procedure SYNTHESIZE(set⟨Tree, Tree⟩ E)
2:   Input: Set of examples E, consisting of pairs of HDTs
3:   Output: Synthesized program P
4:   if (!CHECKEXAMPLES(E)) then
5:     return ⊥;
6:   E′ := FURCATE(E);
7:   f := INFERPATHTRANS(E′);
8:   P := CODEGEN(f);
9:   return P;

Figure 5. High-level structure of our synthesis algorithm

4. Synthesizing Trees from Paths

In this section, we describe our algorithm for synthesizing HDT transformations given an appropriate path transformer.

Overview. As summarized in Figure 5, the high-level structure of our synthesis algorithm consists of four steps. First, given a set of input-output HDTs E, we start by verifying that every example in E conforms to a certain unambiguity restriction required by our algorithm; this is done by calling a function called CHECKEXAMPLES at line 4. Next, we furcate the input-output trees E into a set E′ of path transformation examples. Specifically, each example e ∈ E′ maps a path p in some input tree T to a "corresponding" path p′ in T′ for some (T, T′) ∈ E. Now, given the furcated examples E′, we invoke a function called INFERPATHTRANS to learn an appropriate path transformer f such that p′ ∈ f(p) for every (p, p′) ∈ E′. Finally, CODEGEN generates a program that performs the desired tree transformation by applying f to each path in the input tree and then constructing the output tree from the new set of paths.

In what follows, we explain each of these steps in more detail, leaving the INFERPATHTRANS procedure to Section 5.

Requirements on examples. Our approach is parametric on a notion of correspondence between paths in the input and output trees. Let Π be the universe of paths in all possible HDTs. A correspondence relation is a binary relation ∼ ⊆ Π × Π.

Given a set of input-output examples E, let us define E.in to be the input trees in E (i.e., E.in = {T | (T, T′) ∈ E}). Similarly, let E.out be the set of output trees in E. Our synthesis algorithm expects the user-provided examples E to obey a certain semantic unambiguity criterion, defined as follows:

Unambiguity: For every p′ ∈ paths(E.out), there exists a unique p such that p ∼ p′ where p ∈ paths(E.in) and (p, p′) ∈ paths(T) × paths(T′) for some (T, T′) ∈ E.

In other words, unambiguity requires that, for every output path p′, we can find exactly one input path p such that p ∼ p′ and p, p′ belong to the same input-output example. This unambiguity criterion is enforced by the function CHECKEXAMPLES used at line 4 of the SYNTHESIZE algorithm.

The correspondence relation ∼ can be defined in many natural ways. Specifically, the HADES system allows the user to mark paths in the input-output examples as corresponding. However, HADES also comes with a default definition of ∼ that is adequate in many practical settings. Let p = [(ℓ1, d1), . . . , (ℓk, dk)] and p′ = [(ℓ′1, d′1), . . . , (ℓ′m, d′m)] be two different paths. By default, we define p ∼ p′ if ℓk = ℓ′m (i.e., p and p′ end in leaves with the same label). When the default definition of ∼ is used, HADES uses a refinement of the unambiguity criterion that can be checked syntactically and efficiently:

R1. Let (T, T′) ∈ E. For every leaf v′ of T′, there exists a leaf v of T such that L(v′) = L(v).

R2. If v1, v2 are leaves in E.in where v1 ≠ v2, then L(v1) ≠ L(v2).


Figure 6. Schematic illustration of our approach (the example trees are furcated into paths, a path transformer is learned, and its outputs are spliced back into a tree).

It is easy to see that syntactic requirements R1 and R2 are sufficient to guarantee unambiguity. In particular, requirement R1 ensures existence, while R2 guarantees uniqueness.
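Under the default correspondence relation, R1 and R2 amount to a few set operations over leaf labels. The sketch below assumes each example has already been converted to its path sets; the function name is ours.

from typing import Any, List, Tuple

Path = List[Tuple[str, Any]]              # [(label, data), ...]

def check_examples(examples: List[Tuple[List[Path], List[Path]]]) -> bool:
    # Each example is (paths of the input tree, paths of the output tree).
    all_input_leaves: List[str] = []
    for in_paths, out_paths in examples:
        in_leaves = {p[-1][0] for p in in_paths}
        # R1: every leaf label of the output tree occurs as a leaf label of the input tree.
        if any(p[-1][0] not in in_leaves for p in out_paths):
            return False
        all_input_leaves.extend(p[-1][0] for p in in_paths)
    # R2: no two distinct leaves among the input trees share a label.
    return len(all_input_leaves) == len(set(all_input_leaves))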

Furcation. Given an unambiguous set of examples E, our algorithm furcates them into a set of path transformation examples E′. Specifically, a pair of paths (p, p′) ∈ E′ iff:

p ∈ paths(T), p′ ∈ paths(T′), and p ∼ p′

for some (T, T′) ∈ E. If some input path p ∈ paths(T) does not have a corresponding output path p′ ∈ paths(T′), then E′ also contains (p, ⊥).

Note that E′ may contain multiple examples that have path p as an input; i.e., it is possible that (p, p′) and (p, p′′) are both in E′. For instance, this can happen if some leaf in the input tree has been duplicated in the corresponding output tree. However, due to the unambiguity requirement, it is not possible that there are multiple examples in E′ that have the same output path. That is, if (p, p′) ∈ E′, then there does not exist another example (p′′, p′) ∈ E′ where p ≠ p′′.

Given a set of path transformation examples E′, we will write inputs(E′) (resp. outputs(E′)) to denote the input paths (resp. output paths) in E′. That is, p ∈ inputs(E′) iff there exists an example (p, ·) ∈ E′ and p′ ∈ outputs(E′) iff there exists some (·, p′) ∈ E′.

Example 4. Figure 2 shows the result of furcating the input-output example from Figure 1.
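A sketch of furcation under the default leaf-label correspondence; it assumes the unambiguity check above has already passed, and the helper names are ours.

from typing import Any, List, Optional, Tuple

Path = List[Tuple[str, Any]]

def furcate(in_paths: List[Path], out_paths: List[Path]) -> List[Tuple[Path, Optional[Path]]]:
    # Pair every output path with its unique corresponding input path (same leaf label);
    # input paths without any corresponding output path are mapped to None (i.e., bottom).
    examples: List[Tuple[Path, Optional[Path]]] = []
    used = set()
    for p_out in out_paths:
        (p_in,) = [p for p in in_paths if p[-1][0] == p_out[-1][0]]   # unambiguity: exactly one match
        examples.append((p_in, p_out))
        used.add(id(p_in))
    for p_in in in_paths:
        if id(p_in) not in used:
            examples.append((p_in, None))                             # deleted path
    return examples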

Path transformers. The next step in our synthesis algorithm is to learn a path transformation function f from the furcated examples E′. A path transformer is a function that takes as input a path and returns a set of output paths. Specifically, any path transformer f returned by INFERPATHTRANS at line 7 of the SYNTHESIZE procedure must satisfy the following requirement:

∀p ∈ inputs(E′). (p′ ∈ f(p) ⇔ (p, p′) ∈ E′)

Note that when an input path p does not have a corresponding output path p′, this requirement implies that f(p) = {⊥}. In the remainder of this section, we will assume the existence of an oracle that can synthesize such path transformers, and we will revisit this topic in Section 5.

Code generation using splicing. Once we learn a path transformer f, the last step of our algorithm is to generate code for the synthesized program. For this purpose, we first define a splicing operation: Given a set S of paths, SPLICE(S) yields a well-formed tree T such that paths(T) = S. Recall from Theorem 1 that the result of splicing must be unique.
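Because well-formedness makes sibling labels unique, splicing can be implemented by inserting each path into a nested map keyed by label. A minimal sketch, with nested dicts standing in for HDTs:

from typing import Any, Dict, List, Tuple

Path = List[Tuple[str, Any]]

def splice(path_set: List[Path]) -> Dict[str, Tuple[Any, dict]]:
    # Rebuild the unique well-formed tree whose path set is `path_set` (Theorem 1).
    # Representation: label -> (data, children map).
    root: Dict[str, Tuple[Any, dict]] = {}
    for path in path_set:
        level = root
        for label, data in path:
            if label not in level:
                level[label] = (data, {})
            level = level[label][1]       # descend into the children map
    return root

out_paths = [
    [("Music", None), ("flac", None), ("Jazz", None), ("screenplay.flac", None)],
    [("Music", None), ("mp3", None), ("Jazz", None), ("screenplay.mp3", None)],
]
print(list(splice(out_paths)["Music"][1].keys()))   # ['flac', 'mp3']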

The program generated by CODEGEN is shown in Figure 7. The procedure TRANSFORM synthesized by our algorithm takes as input any well-formed tree T and returns the result of applying the desired transformation to T.

1: procedure TRANSFORM(Tree T)
2:   Input: Well-formed tree T
3:   Output: Result of desired transformation
4:   S := paths(T);
5:   S′ := ∅;
6:   for all p ∈ S do
7:     Π := f(p);
8:     if Π ≠ {⊥} then
9:       INSERT(S′, Π);
10:  T′ := SPLICE(S′);
11:  return T′;

Figure 7. Synthesized program

In particular, given input tree T, TRANSFORM constructs the output tree T′ as follows:

SPLICE({p′ | p′ ∈ f(p) ∧ p ∈ paths(T) ∧ p′ ≠ ⊥})

Summary. Figure 6 gives a schematic summary of our approach: Our synthesis algorithm furcates the input-output examples and learns a path transformer f. The synthesized program, in turn, furcates the given input tree, applies the path transformer f, and splices the resulting paths back to obtain the output tree.

Soundness and completeness. We now state soundness and completeness results about our synthesis algorithm.

Theorem 2. (Soundness) Let E be a set of examples satisfying the unambiguity requirement, and let E′ be the output of FURCATE(E). Then, ∀(T, T′) ∈ E. TRANSFORM(T) ≡ T′.

In other words, soundness states that, if we apply the synthesized program P to any tree T such that (T, T′) is one of the inputs to SYNTHESIZE, then P(T) will be equivalent to T′.

Theorem 3. (Relative completeness) Let E be an unambiguous set of examples, and let E′ be the output of FURCATE(E). Then, there exists a path transformer f such that

∀p ∈ inputs(E′). (p′ ∈ f(p) ⇔ (p, p′) ∈ E′)

In other words, relative completeness states that we do not lose expressive power by synthesizing programs from path transformations rather than directly synthesizing transformations on well-formed HDTs.

5. Synthesizing Path Transformations

We now describe the INFERPATHTRANS algorithm for learning path transformers from a set of input-output paths E. Each example (p, p′) ∈ E consists of an input path p and an output path p′, and our goal is to learn a path transformer f satisfying the property:

∀p ∈ inputs(E). (p′ ∈ f(p) ⇔ (p, p′) ∈ E)

Language. We first introduce a small but sufficiently expressive language over which we describe path transformers. As shown in Figure 8, a path transformer π takes as input a path x and returns a set of paths Π = {χ1, . . . , χn}. Specifically, a path transformer π has the syntax

λx. {φ1 → χ1 ⊕ . . . ⊕ φn → χn}

where each φi is a path condition that evaluates to true or false, and each χi is a path term describing an output path. The semantics of this construct is that χi ∈ Π iff φi evaluates to true. We refer to the number of (φi, χi) pairs in π as the arity of the path transformer.

Path terms χ used in the path transformer are formed by concatenating different subpaths τi(x), where each τi is a so-called segment transformer (see Figure 8).


Path transformer      π := λx. {φ1 → χ1 ⊕ . . . ⊕ φn → χn}
Path term             χ := concat(τ1(x), . . . , τn(x))
Segment transformer   τ := λx. map F subpath(x, t1, t2)
Index term            t := b · size(x) + c
Path cond             φ := Pi(x) | φ1 ∧ φ2 | φ1 ∨ φ2 | ¬φ
Mapper                F := λx. int | λx. fi(x) | λx. if(ϕi(x)) then fi(x) else fj(x)

Figure 8. Language for expressing path transformers.

A segment transformer τ is of the form λx. map F subpath(x, t1, t2) and applies function F to a subpath of x starting at index t1 and ending at index t2 (inclusive). For brevity, we will sometimes abbreviate the segment transformer λx. map F subpath(x, t1, t2) using the notation ⟨t1, t2, F⟩. Note that a segment transformer is a kind of looping construct that iterates over a consecutive range of elements in x. Indices in segment transformers are specified using index terms t of the form b · size(x) + c, where b is either 0 or 1, c is an integer, and size(x) denotes the number of elements in path x.³ Mapper functions F either return constants or apply pre-defined functions fi to their input. For instance, in the file system domain, such predefined functions include procedures for changing file permissions or converting one file type to another (e.g., jpg to png). Mapper functions F can also contain if statements if(ϕi(x)) then fi(x) else fj(x), where each ϕi is drawn from a family of pre-defined predicate templates (e.g., for checking file type).
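Under stated assumptions (paths as lists of (label, data) pairs, with the file extension stored in the leaf's data), the following sketch interprets path terms and segment transformers and replays the unifier χ1 from Section 2; all helper names are ours.

from typing import Any, Callable, List, Tuple

Path = List[Tuple[str, Any]]
Element = Tuple[str, Any]

def subpath(x: Path, t1: int, t2: int) -> Path:
    # subpath(x, t1, t2): elements of x between 1-based indices t1 and t2, inclusive.
    return x[t1 - 1:t2]

def segment(t1: Callable[[Path], int], t2: Callable[[Path], int],
            f: Callable[[Element], Element]) -> Callable[[Path], Path]:
    # <t1, t2, F>: apply the mapper F to every element of subpath(x, t1, t2).
    return lambda x: [f(e) for e in subpath(x, t1(x), t2(x))]

identity = lambda e: e
ext_of = lambda e: (e[1]["ext"], {})          # assumes the extension is stored in the data

# chi_1 from Section 2:
#   concat(map Id subpath(x,1,1), map ExtOf subpath(x,size(x),size(x)), map Id subpath(x,2,size(x)))
chi1 = [segment(lambda x: 1, lambda x: 1, identity),
        segment(lambda x: len(x), lambda x: len(x), ext_of),
        segment(lambda x: 2, lambda x: len(x), identity)]

def apply_path_term(chi, x: Path) -> Path:
    return [e for seg in chi for e in seg(x)]

p1 = [("Music", {}), ("Jazz", {}), ("screenplay", {"ext": "flac"})]
print(apply_path_term(chi1, p1))
# [('Music', {}), ('flac', {}), ('Jazz', {}), ('screenplay', {'ext': 'flac'})]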

Overview. We now give a high-level overview of our algorithm for learning path transformers. As illustrated in Figure 9, our algorithm consists of three key components, namely partitioning, unification, and classification. The goal of partitioning is to determine the smallest arity path transformer π that can be used to describe the desired transformation and to divide the examples E into groups of unifiable subsets. We say that a set of examples E∗ is unifiable if outputs(E∗) can be represented using the same path term χ∗, and we refer to χ∗ as the unifier for E∗. Our algorithm represents each partition Pi as a triple ⟨Ei, χi, φi⟩ where Ei is a unifiable set of examples, χi is their unifier, and φi is a predicate distinguishing Ei from the other examples.

The partitioning component of our algorithm is based on enumerative search that tries different hypotheses. In our algorithm, a hypothesis corresponds to a fixed-size partitioning of examples E into k disjoint groups E1, . . . , Ek. For each hypothesis, we invoke the unification module to check whether each of E1, . . . , Ek is unifiable, and, if so, what their unifier is. Hence, the unification component can be viewed as an oracle for confirming a given hypothesis. If the unification procedure confirms the feasibility of a specific hypothesis, we then invoke the classification module to find a predicate that differentiates each partition from all other partitions.⁴

Since our algorithm repeatedly invokes the unification component to confirm or refute a specific hypothesis, we need an efficient mechanism for finding unifiers. Towards this goal, our algorithm represents each input-output example using a compact numeric representation and invokes an SMT solver to determine the existence of a unifier. Furthermore, we can obtain the unifier χi associated with a set of examples Ei by getting a satisfying assignment to an SMT formula. This approach allows our algorithm to find unifiers with a single SMT query rather than explicitly exploring search spaces of exponential size.

³ Our implementation actually allows a richer class of index terms containing expressions of the form indexOf(x, e) where x is a path and e is an element in this path. However, we ignore indexOf expressions here to simplify the technical presentation.
⁴ In our approach, such a predicate is always guaranteed to exist; hence, there is no need for backtracking.

1: procedure INFERPATHTRANS(set E)
2:   Input: A set of path transformation examples E
3:   Output: Synthesized path transformer
4:   ▷ Phase I: Partition into unifiable subsets
5:   for i=1; i ≤ |E|; i++ do
6:     Φ := PARTITION(∅, E, i);
7:     if Φ ≠ ∅ then break;
8:   ▷ Phase II: Learn classifier for each partition
9:   for all Pi in Φ do
10:    Ei := Pi.E;
11:    φi := CLASSIFY(Ei, E);
12:    Pi.φ := φi;
13:  π := PTCODEGEN(Φ);
14:  return π;

Figure 10. Algorithm for learning path transformations

The last key ingredient of our algorithm for synthesizing path transformers is classification. Given a set of examples E1, . . . , Ek, the goal of classification is to infer a predicate φi for each Ei such that φi evaluates to true for each p ∈ inputs(Ei) and evaluates to false for each p′ ∈ inputs(E) − inputs(Ei). For this purpose, we use the ID3 algorithm for learning a small decision tree and then extract a formula describing all positive examples in this tree.

InferPathTrans algorithm. Figure 10 presents pseudo-code for the INFERPATHTRANS procedure based on this discussion. The algorithm consists of two phases: In the first phase, we partition examples E into a set Φ of unifiable groups, and, in the second phase, we infer classifiers for each partition. Specifically, lines 5-7 try to partition E into i disjoint groups by invoking the PARTITION procedure. Hence, after this loop, set Φ is a partitioning of E into the minimum number of unifiable groups. In the second phase (lines 9-12), we then infer a classifier φi for each partition Pi in Φ. Finally, we use a procedure called PTCODEGEN to generate a path transformer π from the partitions. In what follows, we describe partitioning, unification, and classification in more detail.

5.1 Partitioning

Figure 11 shows the partitioning algorithm used in INFERPATHTRANS. The recursive PARTITION procedure takes as input a set of examples E1 that are part of the same partition, the remaining examples E2, and the number of partitions k. The base case of the algorithm is when k = 1: In this case, we try to unify all examples in E1 ∪ E2, and, if this is not possible, we return failure (i.e., ∅).

In the recursive case (lines 9–17), we try to grow the current partition E1 by adding one or more of the remaining examples from E2. The algorithm always maintains the invariant that elements in E1 are unifiable. Hence, we try to add an element e ∈ E2 to E1 (line 10), and if the resulting set is not unifiable, we give up and try a different element (lines 11–12). Since we know that E1 ∪ {e} is unifiable, we now see if it is possible to partition the remaining examples E2 − {e} into k − 1 unifiable sets; this corresponds to the recursive call at line 13. If this is indeed possible (i.e., the recursive call does not return ∅), we have found a way to partition E1 ∪ E2 into k different partitions and return success (line 15).

Now, if the remaining examples E2 cannot be partitioned into k − 1 unifiable sets, we try to shrink E2 by growing E1. Hence, the recursive call at line 16 looks for a partitioning of examples where one of the partitions contains at least E1 ∪ {e}. If this recursive call also does not succeed, then we move on and consider the scenario where partition E1 does not contain the current element e.


Figure 9. Schematic overview of the algorithm for learning path transformers: partitioning (enumerative search), unification (SMT solver), and classification (decision tree learning) turn the examples into path transformer code.

1: procedure PARTITION(set E1, set E2, int k)
2:   Input: Current partition E1, remaining examples E2, and
3:          number of partitions k
4:   Output: Set of partitions Φ
5:   if k = 1 then                              ▷ Base case
6:     χ := UNIFY(E1 ∪ E2);
7:     if χ = null then return ∅;
8:     return {P(E1 ∪ E2, χ)};
9:   for all e ∈ E2 do                          ▷ Recursive case
10:    χ := UNIFY(E1 ∪ {e});
11:    if χ = null then
12:      continue;
13:    Φ := PARTITION(∅, E2 − {e}, k − 1);
14:    if Φ ≠ ∅ then
15:      return Φ ∪ {P(E1 ∪ {e}, χ)};
16:    Φ := PARTITION(E1 ∪ {e}, E2 − {e}, k);
17:    if Φ ≠ ∅ then return Φ;
18:  return ∅;

Figure 11. Partitioning Algorithm. We use the notation P(E, χ) to denote a partition with examples E and their unifier χ.

Observe that PARTITION(∅, E, k) effectively explores all possible ways to partition examples E into k unifiable subsets. However, since many subsets of E are typically not unifiable, the algorithm does not exhibit its worst-case O(k^n) behavior in practice.

5.2 Unification

We now describe the UNIFY procedure for determining if a set of examples E has a unifier. Since the unification algorithm is invoked many times during partitioning, we want to ensure that UNIFY is efficient in practice. To achieve this goal, we formulate unification as a symbolic constraint solving problem rather than performing explicit search. However, in order to reduce unification to SMT solving, we first need to represent each input-output example in a so-called summarized form that uses a numerical representation to describe each path transformation.

Intuitively, a summarized example represents a path transformation as a permutation of the elements in the input path. For example, if some element e in the output path has the same label as the k'th element in the input path, then we represent e using numerical value k. On the other hand, if element e does not have a corresponding element with the same label in the input path, summarization uses a so-called "dictionary" D to map e to a numerical value that is outside of the index range of the input paths. More formally, we define example summarization as follows:

Definition 5. (Example summarization) Let E be a set of examples where L denotes the set of labels used in E, and let F be the set of pre-defined functions allowed in the path transformer. Let D : (L ∪ F) → {i | i ∈ Z ∧ i > m} be an injective function where m is the maximum path length in inputs(E).⁵ Given an example (p1, p2) ∈ E, the summarized form of (p1, p2) is a pair (n, σ) where n is the length of path p1 and σ is a sequence such that:

σi = (j, p1[j].d ↦ p2[i].d)              if ∃j. p2[i].ℓ = p1[j].ℓ
σi = (D(f∗) + j, ⊥ ↦ p2[i].d)            else if ∃j. p2[i].ℓ = f∗(p1[j])
σi = (D(p2[i].ℓ), ⊥ ↦ p2[i].d)           otherwise

We illustrate summarization using a few examples:

Example 5. Consider the example (p1, p2) such that p1 is the path [(A, r), (B, r), (C, r)], and p2 = [(C, w), (A, r), (B, r)] where A, B, C are labels corresponding to directory names, and r and w indicate read/write permission. The summarized example is (3, σ) where σ is [(3, r ↦ w), (1, r ↦ r), (2, r ↦ r)]. The first element in σ is (3, r ↦ w) because C, which is the first element in the output path, is at index 3 in the input path, and its corresponding data is mapped from r to w. Note that the sequence [3, 1, 2] describes how elements in the input have been permuted to obtain the output.

Example 6. Now, consider the same input path p1 from Example 5, but a different output path p′2 = [(A, r), (B, r), (New, r)]. Also, suppose that D(New) = 1000 (i.e., our "dictionary" assigns value 1000 to the foreign element New). In this case, the summarized example is (3, σ′) where σ′ = [(1, r ↦ r), (2, r ↦ r), (1000, ⊥ ↦ r)]. Note that New is mapped to 1000 using case 3 of Definition 5.

Example 7. Consider the input path [(A, ⊥), (B, pdf)] and the output path [(A, ⊥), (pdf, ⊥), (B, pdf)]. Also, suppose that we have a built-in function called ExtOf that allows us to retrieve the type (extension) of a given file. In this case, the summarized example is (2, σ) where σ is given by:

[(1, ⊥ ↦ ⊥), (D(ExtOf) + 2, ⊥ ↦ ⊥), (2, pdf ↦ pdf)]

Note that label pdf in the output list is mapped to D(ExtOf) + 2 because it corresponds to the extension for the element at index 2 in the input list (case 2 of Definition 5). In general, adding index j to D(f∗) in case 2 of Definition 5 has two advantages: First, it allows us to encode the path element to which f∗ was applied. Second, if the same function is applied to contiguous elements in the input path, then we will obtain a contiguous range of numbers, which we can then coalesce into a loop construct.⁶

⁵ For reasons that will become clear later, we also require that D satisfies ∀ℓ, ℓ′. ℓ ≠ ℓ′ ⇒ |D(ℓ) − D(ℓ′)| ≥ m. While this is not necessary for correctness, it makes our unification algorithm more efficient.
⁶ In the general case, note that the j index used in Definition 5 may not be unique, in which case our algorithm considers all possible summarizations. However, we have never observed this to be the case in practice, and we make the uniqueness assumption to simplify our presentation.
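A toy rendering of Definition 5 (the function name, dictionary values, and the tuple encoding of data mappings are ours), reproducing the summarization of Example 7:

from typing import Any, Callable, Dict, List, Tuple

Path = List[Tuple[str, Any]]

def summarize(p1: Path, p2: Path, D: Dict[str, int],
              funcs: Dict[str, Callable]) -> Tuple[int, list]:
    # Map each output element to: the index of the input element with the same label,
    # or D(f)+j when a known function f produces its label from input element j,
    # or D(label) for a foreign constant (cases 1-3 of Definition 5).
    in_labels = [l for l, _ in p1]
    sigma = []
    for label, data in p2:
        if label in in_labels:
            j = in_labels.index(label) + 1                    # 1-based, assumed unique
            sigma.append((j, (p1[j - 1][1], data)))
        else:
            for fname, f in funcs.items():
                hits = [j + 1 for j, e in enumerate(p1) if f(e) == label]
                if hits:
                    sigma.append((D[fname] + hits[0], (None, data)))
                    break
            else:
                sigma.append((D[label], (None, data)))
    return (len(p1), sigma)

# Example 7: ExtOf produces the label "pdf" from the element at index 2.
D = {"ExtOf": 1000}
ext_of = lambda e: e[1]                                       # the data field stores the extension
print(summarize([("A", None), ("B", "pdf")],
                [("A", None), ("pdf", None), ("B", "pdf")],
                D, {"ExtOf": ext_of}))
# (2, [(1, (None, None)), (1002, (None, None)), (2, ('pdf', 'pdf'))])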


Given a summarized example e = (n, [(i1, m1), . . . , (in, mn)]), we write indices(e) to denote [i1, . . . , in]. For instance, in Example 5, we have indices(e) = [3, 1, 2].

The next step in our unification algorithm is to consolidate consecutive indices in the summarized example. Towards this goal, we define the consolidated form of an example as follows:

Definition 6. (Consolidated form) Given a summarized example e = (n, σ), we say that e∗ is a consolidated form of e iff it is of the form (n, σ∗) where σ∗ = [⟨b1, e1, M1⟩, . . . , ⟨bk, ek, Mk⟩] and:

• indices(e) = [b1, . . . , e1, . . . , bk, . . . , ek]
• For all i, [bi, bi+1, . . . , ei] is a contiguous sublist of indices(e)
• Mk = ⋃j {mj | bk ≤ ij ≤ ek ∧ σj = (ij, mj)}

Intuitively, a consolidated form can coalesce consecutive indices in a summarized example. Note that the consolidation of a summarized example is not unique because we are allowed but not required to coalesce consecutive indices.

Example 8. Consider the summarized example from Example 5, which has the following two consolidated forms:

e∗1 = (3, [⟨3, 3, {r ↦ w}⟩, ⟨1, 1, {r ↦ r}⟩, ⟨2, 2, {r ↦ r}⟩])
e∗2 = (3, [⟨3, 3, {r ↦ w}⟩, ⟨1, 2, {r ↦ r}⟩])

Note that, in e∗2, we have coalesced the two contiguous elements from the input path into the same element ⟨1, 2, {r ↦ r}⟩.
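A brute-force enumeration of consolidated forms, assuming the summarized representation from the previous sketch; run on Example 5 it produces exactly the two forms of Example 8.

from typing import Any, List, Tuple

def consolidate(summarized: Tuple[int, List[Tuple[int, Any]]]) -> List[Tuple[int, list]]:
    # Enumerate every way of splitting sigma into runs of consecutive indices,
    # each run becoming one segment <begin, end, data mappings> (Definition 6).
    n, sigma = summarized
    forms: List[Tuple[int, list]] = []

    def extend(pos: int, segments: list) -> None:
        if pos == len(sigma):
            forms.append((n, segments))
            return
        for end in range(pos + 1, len(sigma) + 1):
            idxs = [i for i, _ in sigma[pos:end]]
            if idxs != list(range(idxs[0], idxs[0] + len(idxs))):
                break                                  # indices are no longer consecutive
            mappings = {m for _, m in sigma[pos:end]}
            extend(end, segments + [(idxs[0], idxs[-1], mappings)])

    extend(0, [])
    return forms

# Summarized example from Example 5: (3, [(3, r->w), (1, r->r), (2, r->r)])
for form in consolidate((3, [(3, ("r", "w")), (1, ("r", "r")), (2, ("r", "r"))])):
    print(form)
# (3, [(3, 3, {('r', 'w')}), (1, 1, {('r', 'r')}), (2, 2, {('r', 'r')})])
# (3, [(3, 3, {('r', 'w')}), (1, 2, {('r', 'r')})])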

Given a consolidated example e∗ = (n, σ∗) where σ∗ is [⟨b1, e1, M1⟩, . . . , ⟨bk, ek, Mk⟩], we define len(e∗) to be n and segments(e∗) to be k. We also write begin(e∗, j) to indicate bj, end(e∗, j) for ej, and data(e∗, j) for Mj. Note that σ∗ can be viewed as a concatenation of concrete path segments of the form ⟨c, c′, M⟩ where c and c′ are the start and end indices for the corresponding path segment respectively.

Before we continue, let us notice the similarity between segment transformers ⟨t, t′, F⟩⁷ in the language from Figure 8, and each concrete path segment ⟨c, c′, M⟩ in a consolidated example. Specifically, observe that a concrete path segment can be viewed as a concrete instantiation of a segment transformer ⟨b · size(x) + c, b′ · size(x) + c′, F⟩ where each of the terms b, c, b′, c′, and F is substituted by concrete values. In fact, this is no coincidence: The key insight underlying our unification algorithm is to use the concrete path segments in the consolidated examples to solve for the unknown terms in segment transformers using an SMT solver.

We are now ready to explain the details of our unification algorithm, which is presented in Figure 12. Given a set of examples E, the UNIFY algorithm starts by computing the summarized examples E′ at line 5 and generates all possible consolidated forms for each e′i ∈ E′ (line 6). Now, since we don't know which consolidated form is the "right" one, we need to consider all possible combinations of consolidated forms of the examples. Hence, the set Λ obtained at line 6 corresponds to the Cartesian product of the consolidated forms of the examples E′.

Next, the algorithm enumerates all possible candidate unifiers χ of increasing arity, where arity refers to the number of path segments being concatenated. Based on the grammar of our language (recall Figure 8), we know that a path term χ of arity k has the form concat(τ1(x), . . . , τk(x)) and each τi is a segment transformer described by the following template:

⟨bi · size(x) + ci , b′i · size(x) + c′i , Fi⟩

Hence, the hypothesis χ at line 10 of the algorithm is a templatized unifier whose unknown coefficients will later be inferred if the examples can indeed be unified using an arity-k path term.

⁷ Recall that ⟨t, t′, F⟩ is an abbreviation for λx. map F subpath(x, t, t′).

1: procedure UNIFY(set E)
2:   Input: A set of path transformation examples E
3:   Output: A unifier χ if it exists, null otherwise
4:   ▷ Convert examples to consolidated form
5:   E′ := {e′ | e′ = SUMMARIZE(e) ∧ e ∈ E};
6:   Λ := {(e∗1, . . . , e∗n) | e∗i ∈ CONSOLIDATE(e′i) ∧ e′i ∈ E′};
7:   ▷ Generate candidate unifiers χ of increasing size
8:   for k=1 to maxSize(outputs(E)) do
9:     τi := ⟨bi · size(x) + ci , b′i · size(x) + c′i , Fi⟩;
10:    χ := concat(τ1(x), . . . , τk(x));
11:    ▷ Check if χ unifies some E∗ ∈ Λ
12:    for all E∗ ∈ Λ do
13:      if (∃e∗i ∈ E∗. segments(e∗i) ≠ k) then
14:        continue;
15:      ▷ Use SMT solver to check if χ unifies E∗
16:      ϕ(i, e∗) := (bi · len(e∗) + ci = begin(e∗, i));
17:      ψ(i, e∗) := (b′i · len(e∗) + c′i = end(e∗, i));
18:      φ := ⋀_{1≤i≤k} ⋀_{e∗∈E∗} (ϕ(i, e∗) ∧ ψ(i, e∗));
19:      if UNSAT(φ) then continue;
20:      σ := SATASSIGN(φ);
21:      σ′ := UNIFYMAPPERS(E∗);
22:      if σ′ = null then continue;
23:      return SUBSTITUTE(χ, σ ∪ σ′);
24:  return null;

Figure 12. Unification algorithm

After generating a hypothesis χ, we proceed to confirm or refute this hypothesis using the set of consolidated examples in Λ. Specifically, if there exists some set E∗ of consolidated examples in Λ for which χ is a unifier, this means that χ is also a unifier for the original examples E. Now, consider some set of consolidated examples E∗ in Λ. For χ to be a unifier for E∗, every example e∗i ∈ E∗ must contain exactly k segments because χ has arity k. If this condition is not met (line 13), we know that χ is not a unifier for E∗, and we move on to a different combination of consolidated examples in the next iteration of the inner loop (lines 12-23).

On the other hand, if all examples in E∗ are comprised of k segments, we proceed to check whether it is possible to instantiate the unknown coefficients b1, b′1, c1, c′1, . . . , bk, b′k, ck, c′k in χ in a way that is consistent with the concrete path segments in E∗. Now, consider the i'th concrete path segment in consolidated example e∗ and the i'th abstract path segment ⟨bi · size(x) + ci, b′i · size(x) + c′i, Fi⟩ in hypothesis χ. Clearly, if our hypothesis is correct, it should be possible to instantiate the unknown coefficients in a way that satisfies:

bi · len(e∗) + ci = begin(e∗, i)   ∧   b′i · len(e∗) + c′i = end(e∗, i)

since the size of this example is len(e∗) and the begin and end indices for the path segment are given by begin(e∗, i) and end(e∗, i) respectively. Hence, the formula φ generated at line 18 asserts that it is possible to find instantiations for the unknown coefficients in hypothesis χ for examples E∗. Now, to check whether our hypothesis is correct, we simply query the satisfiability of φ. If it is unsatisfiable (line 19), we reject the hypothesis for the current set of examples E∗ and look for a different combination of consolidated examples satisfying χ.

If, however, φ is satisfiable, we have found an instantiation of the unknown coefficients, which is given by the satisfying assignment σ obtained at line 20. Now, the only question that remains to be answered is whether we can also find an instantiation of the unknown functions F1, . . . , Fk used in χ. For this purpose, we use a function called UNIFYMAPPERS which tries to find mapper functions Fi unifying all the different Mi's from the examples. Since the UNIFYMAPPERS procedure is based on straightforward enumerative search, we do not describe it in detail. In particular, since the language of Figure 8 only allows a finite set of pre-defined data transformers fi and predicates φi, UNIFYMAPPERS enumerates, in increasing order of complexity, all possible functions belonging to the grammar of mapper functions in Figure 8.

Example 9. For the motivating example from Section 2, our unifi-cation algorithm takes the following steps to determine the unifierχ1 for partition P1: First, it generates the summarized examplesshown in Figure 13. Next, it generates all possible consolidatedforms for each example and constructs set Λ as their cartesian prod-uct. The algorithm then considers hypotheses of increasing size, butrejects those with arity 1 and 2 since all examples contain at least3 segments. Now, consider a hypothesis χ of arity 3 and the set ofconsolidated examples E∗ shown in Figure 14. We generate the for-mula φ shown in Figure 15 and get a satisfying assignment, whichresults in the following instantiation of χ:

[〈1, 1, F1〉, 〈D(ExtOf)+size(x),D(ExtOf)+size(x), F2〉, 〈2, size(x), F3〉]

Finally, UNIFYMAPPERS searches for instantiations of F1, F2, and F3 satisfying all data mappers in the examples. For F1 and F3, it returns the Identity function, and for F2, it extracts the function extOf from the segment transformer coefficients. As a result, we obtain the unifier χ1 presented in Section 2.

Discussion. We now briefly discuss why we use constraint solving instead of enumerative search for finding the coefficients used in each hypothesis. Recall that each index term in our language is of the form b · size(x) + c, where b is either 0 or 1 and c is an integer whose absolute value must be in the range [0, size(x)]. Hence, for a hypothesis of arity k, the size of the search space that we would need to explore is (2 · 2 · maxSize(inputs(E)))^{2k}. For instance, with maxSize(inputs(E)) = 10 and k = 3, this already amounts to roughly 40^6 ≈ 4 × 10^9 candidate instantiations. Since solving linear equalities using an SMT solver is, in practice, much more efficient than searching through spaces of this size, performing unification using constraint solving makes our approach much more practical.

5.3 Classification

We now turn to the last missing piece of our algorithm for learning path transformations. Recall that the second phase of our INFERPATHTRANS procedure needs to find a predicate differentiating a given partition from all other partitions. Specifically, given examples E and a partition Pi with examples Ei ⊆ E and a unifier χi, the goal of classification is to find a predicate φi such that:

(1) ∀p ∈ inputs(Ei). (φi[p/x] ≡ true)
(2) ∀p ∈ (inputs(E) − inputs(Ei)). (φi[p/x] ≡ false)

In other words, we want to find a predicate φi that evaluates to true for all inputs in Ei and to false for all other inputs in E. As a practical matter, we also want each φi to be as simple as possible, because simpler predicates tend to generalize better and lead to more general programs (i.e., the principle of Occam's razor).

A key insight underlying our algorithm is that finding a predicate φi with these properties is precisely the familiar classification problem in machine learning. Hence, to find such a predicate φi, we first extract the relevant features for each path and then use the ID3 decision tree learning algorithm.

Feature extraction. To use decision tree learning for classification, we need to represent each input path using a finite set of discrete features. In our HADES system, the set of features used to represent a given path is domain-specific and is defined separately for each application domain (e.g., UNIX directories or XML documents). As an example, some of the features we use for the file system domain include the following:

• The file type (e.g., mp3, pdf) associated with the leaf
• A boolean value indicating whether the path contains a certain file or directory
• The permissions associated with the leaf
• A boolean value indicating whether the path contains a directory owned by “root”

In addition to these domain-specific attributes, one additional feature we always include, regardless of the application domain, is the label associated with the leaf. In particular, since we require that each leaf in the input tree has a unique label, the inclusion of leaf labels as a feature guarantees that a classifier always exists. Given a path p, we write α(p) to denote the feature vector for p, and we write αf(p) to denote the value of feature f for path p.
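As an illustration, the following sketch computes such a feature vector for the file system domain. The path representation and metadata keys are hypothetical; the features themselves mirror the ones listed above (leaf label, file type, permissions, and whether the path contains a directory owned by “root”).

```python
# A minimal sketch (Python) of domain-specific feature extraction for file
# system paths; the path encoding and dictionary keys are hypothetical.
import os

def extract_features(path):
    """path: list of (label, metadata) pairs from the root down to the leaf."""
    leaf_label, leaf_meta = path[-1]
    return {
        "leaf_label": leaf_label,                                  # always included
        "extension": os.path.splitext(leaf_label)[1].lstrip("."),  # file type of the leaf
        "permissions": leaf_meta.get("permissions"),               # leaf permissions
        "has_root_owned_dir": any(meta.get("owner") == "root"
                                  for _, meta in path[:-1]),
    }
```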

Decision tree learning. We now explain how we use decision tree learning to infer a predicate distinguishing a set of paths Π1 from a different set of paths Π2. Here, Π1 corresponds to inputs(Ei), where Ei is the set of examples for Pi, and Π2 corresponds to inputs(E) − inputs(Ei). Given these sets Π1 and Π2 and a set of features F, we use the ID3 algorithm [29] to construct a decision tree TD with the following properties:

• Each leaf in TD is labeled as Π1 or Π2.
• Each internal node of TD is labeled with a feature f ∈ F.
• Each directed edge (f, f′, `) from node f to f′ is annotated with a label ` that corresponds to a possible value of feature f.
• Let (f1, f2, `1), . . . , (fn, Πi, `n) be a root-to-leaf path in TD. Then, for every p ∈ Π1 ∪ Π2, the following condition holds:

  (⋀_{i=1}^{n} αfi(p) = `i) ⇔ p ∈ Πi

We note that the ID3 algorithm also attempts to minimize the size of the constructed decision tree. However, since it uses a greedy approach based on information gain, it only guarantees local, but not global, optimality.
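For concreteness, the sketch below shows the standard greedy ID3 recursion over discrete features. It is an illustrative Python rendering under a hypothetical dictionary-based encoding of examples and trees, not the C++ implementation used in HADES.

```python
# A minimal sketch (Python) of ID3: greedily split on the feature with the
# highest information gain until each leaf is pure.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def id3(examples, features):
    """examples: list of (feature_dict, label); features: list of feature names."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        return labels[0]                              # pure leaf (e.g., "Pi1" or "Pi2")
    if not features:
        return Counter(labels).most_common(1)[0][0]   # fall back to the majority label

    def gain(f):
        groups = {}
        for fv, label in examples:
            groups.setdefault(fv[f], []).append(label)
        remainder = sum(len(g) / len(examples) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    best = max(features, key=gain)                    # greedy choice: locally optimal only
    children = {}
    for value in {fv[best] for fv, _ in examples}:
        subset = [(fv, label) for fv, label in examples if fv[best] == value]
        children[value] = id3(subset, [f for f in features if f != best])
    return {"feature": best, "children": children}
```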

Once we construct such a decision tree TD with the aforementioned properties, identifying a predicate φ differentiating Π1 from Π2 is very simple. Specifically, given a root-to-leaf path π = 〈(f1, f2, `1), . . . , (fn, Π1, `n)〉 ending in a leaf with label Π1, let ϕ(π) denote the formula:

ϕ(π) : ⋀_{i=1}^{n} fi = `i

Assuming Π1 corresponds to inputs(Ei) for some Pi with examples Ei and Π2 is inputs(E) − inputs(Ei), the following DNF formula φi gives us a classifier for partition Pi:

φi : ⋁_{π ∈ pathTo(TD, Π1)} ϕ(π)

In particular, observe that φi evaluates to true for any path p ∈ Π1 because at least one of the disjuncts of φi will evaluate to true. In contrast, φi evaluates to false for any p′ ∈ Π2 because all disjuncts evaluate to false when φi is obtained from a valid decision tree TD.
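Under the same hypothetical dictionary-based tree encoding as the ID3 sketch above, extracting this DNF classifier amounts to collecting every root-to-leaf path that ends in a Π1 leaf, as the following sketch shows on a small tree like the one in Example 10 below.

```python
# A minimal sketch (Python): each Pi1-labelled root-to-leaf path becomes one
# conjunction of (feature = value) tests; their disjunction is phi_i.
def dnf_clauses(tree, target="Pi1", prefix=()):
    if not isinstance(tree, dict):                    # leaf labelled "Pi1" or "Pi2"
        return [list(prefix)] if tree == target else []
    clauses = []
    for value, child in tree["children"].items():
        clauses += dnf_clauses(child, target, prefix + ((tree["feature"], value),))
    return clauses

# A tree with a single split on extOf, where the flac edge leads to Pi1.
tree = {"feature": "extOf",
        "children": {"flac": "Pi1", "ogg": "Pi2", "mp3": "Pi2"}}
print(dnf_clauses(tree))   # [[('extOf', 'flac')]]  i.e. phi : ExtOf(x) = flac
```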


E′1 : (3, [(1, dm ↦ dm), (D(ExtOf) + 3, ⊥ ↦ df), (2, dj ↦ dj), (3, ds ↦ ds)])
E′3 : (3, [(1, dm ↦ dm), (D(ExtOf) + 3, ⊥ ↦ do), (2, dr ↦ dr), (3, dh ↦ dh)])
E′4 : (4, [(1, dm ↦ dm), (D(ExtOf) + 4, ⊥ ↦ df), (2, dr ↦ dr), (3, dms ↦ dms), (4, du ↦ du)])
E′6 : (4, [(1, dm ↦ dm), (D(ExtOf) + 4, ⊥ ↦ dmp), (2, dp ↦ dp), (3, da ↦ da), (4, dsf ↦ dsf)])

Figure 13. Summarization of partition P1 of the motivating example.

E∗1 : (3, [〈1, 1, {dm ↦ dm}〉, 〈D(ExtOf) + 3, D(ExtOf) + 3, {⊥ ↦ df}〉, 〈2, 3, {dj ↦ dj, ds ↦ ds}〉])
E∗3 : (3, [〈1, 1, {dm ↦ dm}〉, 〈D(ExtOf) + 3, D(ExtOf) + 3, {⊥ ↦ do}〉, 〈2, 3, {dr ↦ dr, dh ↦ dh}〉])
E∗4 : (4, [〈1, 1, {dm ↦ dm}〉, 〈D(ExtOf) + 4, D(ExtOf) + 4, {⊥ ↦ df}〉, 〈2, 4, {dr ↦ dr, dms ↦ dms, du ↦ du}〉])
E∗6 : (4, [〈1, 1, {dm ↦ dm}〉, 〈D(ExtOf) + 4, D(ExtOf) + 4, {⊥ ↦ dmp}〉, 〈2, 4, {dp ↦ dp, da ↦ da, dsf ↦ dsf}〉])

Figure 14. A set of consolidated examples E∗ for partition P1

φ : b1 · 3 + c1 = 1 ∧ b′1 · 3 + c′1 = 1 ∧ b1 · 4 + c1 = 1 ∧ b′1 · 4 + c′1 = 1 ∧
    b2 · 3 + c2 = D(ExtOf) + 3 ∧ b′2 · 3 + c′2 = D(ExtOf) + 3 ∧ b2 · 4 + c2 = D(ExtOf) + 4 ∧ b′2 · 4 + c′2 = D(ExtOf) + 4 ∧
    b3 · 3 + c3 = 2 ∧ b′3 · 3 + c′3 = 3 ∧ b3 · 4 + c3 = 2 ∧ b′3 · 4 + c′3 = 4

Figure 15. (Simplified) formula φ to check the satisfiability of hypothesis χ1 on set E∗.

Example 10. Recall the motivating example from Section 2 where inputs(E) = {p1, p2, p3, p4}. Let us now consider how we find a predicate for partition P2. Here, we have Π1 = inputs(E2) = {p1, p3} and Π2 = inputs(E) − inputs(E2) = {p2, p4}. After running the ID3 algorithm, we obtain the following decision tree:

[Decision tree: a single test on extOf with edges labeled flac, ogg, and mp3; the flac edge leads to a Π1 leaf, while the ogg and mp3 edges lead to Π2 leaves.]

Hence, we extract the classifier φ2 : ExtOf(x) = flac.

5.4 Soundness of Path Transformation Synthesis

We conclude this section by stating a key property of the INFERPATHTRANS algorithm.

Theorem 4. Let f be a function synthesized by INFERPATHTRANS using path transformation examples E. Then,

∀p ∈ inputs(E). (p′ ∈ f(p)⇔ (p, p′) ∈ E)

This theorem is very important for the soundness of our algorithm for synthesizing tree transformations (recall Section 4). In particular, without this property, we cannot guarantee that, for an input-output example (T1, T2), applying TRANSFORM to T1 will actually yield T2.

6. Implementation

We have implemented the proposed algorithm for synthesizing HDT transformations in a tool called HADES, which currently consists of ∼9,500 lines of C++ code (including our implementation of the ID3 decision tree learning algorithm). The only external tool used by HADES is the Z3 SMT solver [10].

As mentioned in Section 1, the core of HADES is the domain-agnostic synthesis back-end for learning HDT transformations from examples. The synthesis back-end accepts input-output examples in the form of hierarchical data trees and emits path transformation functions in an intermediate language (recall Figure 8).

In order to allow our system to be easily extensible to new domains, HADES also provides an interface for domain-specific plug-ins (front-ends). Our current implementation incorporates two such plug-ins, one for XML transformations in the XSLT scripting language and another for file system transformations as bash scripts. Our XML front-end uses the tinyxml2 library for parsing XML documents.

HADES can be extended to new domains by providing plug-ins that implement the following functionality:

1. Represent input-output examples as HDTs

2. Use the synthesized path transformer to emit tree transformation code in the target language

3. Specify the domain-specific functions that the path transformer is allowed to use; note that such domain-specific functions correspond to the pre-defined fi's from Figure 8

4. Define a feature extraction function as well as any domain-specific features to be used for decision tree learning (recall Section 5.3)

Assuming a plug-in can provide this functionality for domain D, HADES can immediately be used for synthesizing D-specific hierarchical data transformations. A minimal sketch of one possible plug-in interface is shown below.
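The following sketch illustrates one possible shape of such a plug-in interface. The class and method names are hypothetical (the actual front-ends are written in C++), but each method corresponds to one of the four requirements above.

```python
# A minimal sketch (Python) of a domain plug-in interface; names are hypothetical.
from abc import ABC, abstractmethod

class DomainPlugin(ABC):
    @abstractmethod
    def to_hdt(self, example):
        """(1) Represent an input-output example as a pair of hierarchical data trees."""

    @abstractmethod
    def emit_code(self, path_transformer):
        """(2) Emit tree-transformation code (e.g., bash or XSLT) from the synthesized path transformer."""

    @abstractmethod
    def domain_functions(self):
        """(3) The pre-defined data transformers f_i the path transformer may use (Figure 8)."""

    @abstractmethod
    def extract_features(self, path):
        """(4) Domain-specific feature extraction for decision tree learning (Section 5.3)."""
```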

7. Evaluation

We evaluated HADES by using it to synthesize a collection of 36 real-world data transformations in the file system and XML domains. Our examples mostly come from on-line forums such as Stackoverflow and bashscript.org. In addition, some of the examples in the file system domain were suggested to us by teaching assistants who needed to write bash scripts for various tasks in undergraduate computer science courses at our institution.

In addition to assessing whether HADES can successfully synthesize these benchmarks, we also want to evaluate the following:

• Performance: How long does HADES take to synthesize the desired transformation?
• Complexity: How complex are the path transformations synthesized by HADES, both in terms of lines of code as well as other metrics (e.g., number of loops)?
• Usability: How hard is it for users to work with HADES? How many input-output examples do users need to provide before HADES learns the desired transformation?

In order to answer these questions, we performed a user study in which we asked a group of students, some computer science majors and others from different disciplines, to work with HADES and synthesize a script that performs a specific task chosen from Figure 16. We emphasize that none of the students in our study had prior knowledge of HADES; furthermore, no participant in our study was familiar with program synthesis research.


ID | Description | Time: Total (s) | Script: Branches | Script: Segments | Script: LOC | User: Iterations | User: Examples | User: Depth

File System
F1 | Categorize .csv files based on their group | 0.03 | 1 | 2 | 47 | 2 | 2 | 3
F2 | Make all script files executable | 0.01 | 1 | 2 | 51 | 1 | 2 | 2
F3 | Copy all text and bash files to directory temp | 0.05 | 2 | 3 | 56 | 2 | 4 | 3
F4 | Append last 3 directory names to the file name and delete those directories | 0.02 | 1 | 3 | 50 | 2 | 2 | 6
F5 | Put files in directories based on modification year/month/day | 0.02 | 1 | 4 | 52 | 1 | 2 | 4
F6 | Copy files without extension into the “NoExtension” directory | 0.05 | 2 | 3 | 56 | 2 | 7 | 4
F7 | Archive each directory to a tarball with modification month in its name | 0.01 | 1 | 1 | 50 | 1 | 2 | 2
F8 | Make files in “DoNotModify” directory read-only | 0.12 | 2 | 4 | 83 | 2 | 4 | 3
F9 | Convert .mp3, .wma, and .m4a files to .ogg | 0.02 | 1 | 1 | 47 | 1 | 3 | 3
F10 | Change group of text files in “Public” directory to “everyone” | 0.06 | 2 | 2 | 77 | 1 | 3 | 2
F11 | Change directory structure from contest/Sub/uid/pid/sid/ to contest/Sub/pid/sid-uid/ | 0.03 | 1 | 4 | 52 | 2 | 3 | 6
F12 | Convert .zip archives to tarballs | 0.08 | 2 | 2 | 77 | 1 | 3 | 3
F13 | Organize all files based on their extensions | 1.85 | 4 | 10 | 148 | 1 | 9 | 3
F14 | Append modification date to the file name | 0.01 | 1 | 2 | 48 | 1 | 2 | 3
F15 | Convert pdf files to swf files | 0.03 | 2 | 2 | 55 | 2 | 2 | 2
F16 | Delete files which are not modified last month | 0.03 | 2 | 2 | 54 | 2 | 5 | 2
F17 | Convert Video files to audio files and put them in “Audio” directory | 0.01 | 1 | 2 | 49 | 1 | 3 | 5
F18 | Convert xml files ≥ 1kB to text files | 0.09 | 2 | 2 | 77 | 2 | 4 | 2
F19 | In directory “game”: append “largest” to name of largest file, append “big” to name of xml files ≥ 1kB | 17.94 | 3 | 5 | 110 | 2 | 6 | 3
F20 | Extract tarballs to a directory with name combined of file and parent directory name | 0.36 | 2 | 3 | 83 | 2 | 3 | 3
F21 | Append parent directory name to each .c file and copy them in “MOSS” directory | 0.04 | 2 | 3 | 56 | 2 | 4 | 4
F22 | Keep all files older than 5 days | 0.59 | 2 | 2 | 54 | 3 | 7 | 2
F23 | Copy each file to the directory created with its file name | 0.02 | 1 | 2 | 47 | 1 | 3 | 5
F24 | Archive directories which are not older than a week | 0.11 | 2 | 2 | 80 | 3 | 5 | 2

XML
X1 | Add style=’bold’ attribute to parent element of each text | 0.02 | 1 | 2 | 145 | 2 | 3 | 5
X2 | Merge elements with attribute “status” and put that attribute in their children | 0.02 | 1 | 2 | 169 | 3 | 3 | 5
X3 | Remove all attributes | 0.01 | 1 | 1 | 115 | 2 | 3 | 3
X4 | Change the root element of xml | 0.02 | 1 | 2 | 222 | 1 | 2 | 3
X5 | Remove 3rd element and put all nested elements inside it under its parent | 0.02 | 1 | 2 | 129 | 1 | 3 | 4
X6 | Create a table which maps each text to its parent element tag | 0.09 | 2 | 10 | 497 | 1 | 6 | 4
X7 | Remove all texts in element “done” and put all remaining texts to “todo” elements | 0.05 | 2 | 3 | 241 | 4 | 8 | 3
X8 | Generate HTML drop down list from a XML list storing the data | 0.02 | 1 | 4 | 443 | 1 | 3 | 3
X9 | Rename a set of element tags to standard HTML tags | 0.02 | 1 | 4 | 356 | 1 | 2 | 3
X10 | Move attribute “class” to the first element and categorize based on “class” | 0.18 | 2 | 6 | 229 | 1 | 6 | 3
X11 | Categorize elements based on “tag” attribute and put each class under an element with valid HTML tag | 5.55 | 3 | 9 | 517 | 2 | 6 | 3
X12 | Delete all elements with tag “p” | 0.05 | 2 | 1 | 122 | 1 | 5 | 4

Figure 16. File System and XML Benchmarks

Setup. In our user study, we gave each participant six specific benchmarks to synthesize. We asked the participants to come up with a set of input-output examples until HADES synthesized the desired transformation. Since many of the participants were not familiar with bash or XSLT scripts, we also provided them with test cases and asked them to verify the correctness of the synthesized script by running it on the provided tests and confirming that it produces the desired result. The participants were asked to provide HADES with a new set of examples, distinct from those in our test suite, if the synthesized script did not produce the desired output.

Summary of results. Figure 16 summarizes the results of our evaluation for the file system and XML domains. The column labeled “Description” provides a brief summary of each benchmark, and “Time” reports synthesis time in seconds. The columns grouped under “Script” give some statistics about the synthesized script, and the columns grouped under “User” provide some important data related to user interaction. We now describe each of these aspects in more detail.

Performance. To evaluate performance, we utilized the examples provided by non-expert users from our user study.8 All performance experiments were conducted on a MacBook Pro with a 2.6 GHz Intel Core i5 processor and 8 GB of 1600 MHz DDR3 memory, running OS X version 10.10.3.

The column labeled “Time” in Figure 16 reports the total synthesis time in seconds. This time includes all aspects of synthesis, including conversion of examples to HDTs and emission of real bash or XSLT code. On average, HADES takes 0.90 seconds to synthesize a directory transformation and 0.51 seconds to synthesize an XML transformation. Across all benchmarks, HADES is able to synthesize 91.6% of the benchmarks in under 1 second and 97.2% of the benchmarks in under 10 seconds. Observe that no single benchmark takes more than 18 seconds to synthesize. We believe these results demonstrate that HADES is practical and can be used in a real-world setting.

8 When there were multiple rounds of interaction with the user, we used the examples from the last round.


Complexity. The columns grouped under “Script” in Figure 16 report various statistics about the scripts synthesized by HADES. In particular, the column labeled “Branches” reports the number of top-level branches in the synthesized program (i.e., the arity of the path transformer). The column labeled “Segments” reports the number of loops (i.e., the number of segment transformers). Finally, the column labeled “LOC” gives the number of lines of code for the synthesized path transformer. Note that the whole script synthesized by HADES is actually significantly larger; the statistics here only include the path transformation portion (recall that HADES also emits code for furcation and splicing). As summarized by this data, the scripts synthesized by HADES are fairly complex: they contain between 1 and 4 branches, between 1 and 10 loops, and between 47 and 517 lines of code for the path transformer. Furthermore, since our algorithm always generates a simplest path transformer, the reported statistics give a lower bound on the complexity of the required path transformations.

Usability. The last section of Figure 16 reports various statistics from the user study. Specifically, the column labeled “Iterations” reports the number of rounds of tool-user interaction until HADES was able to synthesize the desired script (i.e., pass all test cases). The column labeled “Examples” summarizes the number of furcated examples in the input-output trees provided by the user. Finally, the column labeled “Depth” gives the maximum depth of the input-output trees. We remark that the number of examples reported here is not a lower bound on the number of examples that must be provided by the user. In particular, some users chose to provide more examples than necessary to increase their confidence or reduce the number of back-and-forth interactions with the tool.

We believe these results demonstrate that HADES is quite user-friendly: 88.8% of the benchmarks require only 1-2 rounds of user interaction, with no task requiring more than 4 rounds. Furthermore, 72.2% of the tasks require fewer than 5 examples, and no task requires more than 9. Finally, the trees provided by the users are typically very shallow: by providing example trees with depth 3.2 on average, users are able to obtain scripts that work on trees of unbounded depth.

8. Related work

In recent years, program synthesis has received much attention from the programming languages community. Many of these approaches require either a programmer-provided “template” [7, 33, 34] or a complete logical specification [21] of the target program. In contrast, our method performs synthesis from examples.

There is a rapidly expanding literature on synthesis techniques that require only input-output examples. Many approaches in this area are focused on non-hierarchical data types such as numbers [32], strings [15], and tables [6, 16]. However, there is also an emerging body of work on the synthesis of programs over recursive data structures [5, 13, 23, 28]. Of these approaches, Perelman et al. [28] give a method based on optimized enumerative search for constructing program synthesizers for user-defined domain-specific languages (DSLs). Le and Gulwani [23] present a framework called FlashExtract where programs made from higher-order combinators are synthesized using a custom deductive procedure. Albarghouthi et al. [5] use goal-directed enumerative search to synthesize first-order programs over recursive data structures. Feser et al. [13] offer a method to synthesize higher-order functional programs over recursive data structures using a combination of heuristics for generalization from examples, deduction, and cost-directed enumeration. Osera and Zdancewic [27] study a similar problem and offer a solution based on type-directed deduction.

One difference between these approaches and ours is algorithmic: unlike these approaches, our algorithm is based on a combination of SMT solving and decision tree learning. Also, these approaches aim to synthesize programs written in general programming languages or user-defined DSLs. In contrast, our approach is focused on the synthesis of tree transformations and utilizes particular properties of such transformations. This tighter focus allows us to handle tree transformation benchmarks whose complexity exceeds the benchmarks considered in prior work. For example, our method is able to synthesize transformations that alter the hierarchical structure of an input tree. So far as we know, such benchmarks fall outside the scope of prior approaches to example-driven synthesis.

Example-guided synthesis has a long history in the artificial intelligence community [19, 24, 25]. We build on the tradition of inductive programming [18, 20, 22], where the goal is to synthesize functional or logic programs from examples. Work here falls in two categories: those that inductively generalize examples into functions or relations [20], and those that search for implementations that fit the examples [17, 26]. These approaches are all based on various combinations of enumeration and deduction. In contrast, our approach is based on SMT solving and decision tree learning.

Our subroutine for synthesizing path transformations has parallels with a strategy followed in Gulwani's approach to synthesizing string transformations [15]. Like our algorithm, Gulwani's technique also uses a combination of partitioning and unification. However, the algorithmic details are very different. For example, Gulwani's approach maintains a DAG representation of all string transformations that fit a set of examples. In contrast, we use a numerical representation of the input-output examples and reduce unification to SMT solving.

Also related to this work are algorithms [9, 11, 12] for learning finite-state tree transducers [30] from examples. Applications of such algorithms include learning of XML queries and machine translation. Our approach is more general than these approaches, in that we handle trees whose nodes contain data of arbitrary types and do not impose a priori limitations on changes to paths in an input tree. However, this generality comes at a cost: our approach lacks the crisp complexity guarantees that some of these automata-theoretic algorithms possess [11].

Decision trees are a popular data structure in machine learning and data mining. They have also found applications in program analysis, for example in precondition learning for procedures [31], identification of latent code errors [8], and repair of programs that manipulate relational databases [14]. So far as we know, we are the first to use decision tree learning in example-driven synthesis.

9. Conclusion

We have presented an algorithm for synthesizing hierarchical data transformations from examples. The central idea of our approach is to reduce the generation of tree transformations to the synthesis of transformations on paths of the tree. The path transformations are synthesized using a novel combination of decision tree learning and SMT solving. The reduction from tree to path transformations simplifies the underlying synthesis algorithm, while also allowing us to handle a richer class of tree transformations, including those that change the data hierarchy.

In the longer run, approaches like ours can be embedded into end-user programming tools such as Apple's Automator [1], which offers visual abstractions for everyday scripting tasks. In Automator, the visual notation is only an alternative syntax for deterministic, imperative programs, and this limits the complexity of programs that users can produce. HADES, which allows users to generate complex programs from simple examples, is a plausible way of broadening the scope of such tools.


References

[1] Automator. http://automator.us.
[2] Smartsheet. https://www.smartsheet.com.
[3] TreeSheets. http://strlen.com/treesheets.
[4] XSL Transformations 2.0. http://www.w3.org/TR/xslt20/.
[5] A. Albarghouthi, S. Gulwani, and Z. Kincaid. Recursive program synthesis. In CAV, pages 934–950, 2013.
[6] D. W. Barowy, S. Gulwani, T. Hart, and B. G. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In PLDI, pages 218–228, 2015.
[7] T. A. Beyene, S. Chaudhuri, C. Popeea, and A. Rybalchenko. A constraint-based approach to solving games on infinite graphs. In POPL, pages 221–234, 2014.
[8] Y. Brun and M. D. Ernst. Finding latent code errors via machine learning over program executions. In ICSE, pages 480–490, 2004.
[9] J. Carme, R. Gilleron, A. Lemay, and J. Niehren. Interactive learning of node selecting tree transducer. Machine Learning, 66(1):33–67, 2007.
[10] L. De Moura and N. Bjørner. Z3: An efficient SMT solver. In Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.
[11] F. Drewes and J. Hogberg. Learning a regular tree language from a teacher. In Developments in Language Theory, pages 279–291. Springer, 2003.
[12] J. Eisner. Learning non-isomorphic tree mappings for machine translation. In Proceedings of Meeting on Association for Computational Linguistics, pages 205–208, 2003.
[13] J. K. Feser, S. Chaudhuri, and I. Dillig. Synthesizing data structure transformations from input-output examples. In PLDI, pages 229–239, 2015.
[14] D. Gopinath, S. Khurshid, D. Saha, and S. Chandra. Data-guided repair of selection statements. In ICSE, pages 243–253. ACM, 2014.
[15] S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, pages 317–330. ACM, 2011.
[16] W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, pages 317–328. ACM, 2011.
[17] S. Katayama. Efficient exhaustive generation of functional programs using monte-carlo search with iterative deepening. In Pacific Rim International Conference on Artificial Intelligence, pages 199–210, 2008.
[18] E. Kitzelmann. Inductive programming: A survey of program synthesis techniques. In Approaches and Applications of Inductive Programming, Third International Workshop, pages 50–73, 2009.
[19] E. Kitzelmann. Analytical inductive functional programming. In Logic-Based Program Synthesis and Transformation, pages 87–102. Springer, 2009.
[20] E. Kitzelmann and U. Schmid. Inductive synthesis of functional programs: An explanation based generalization approach. Journal of Machine Learning Research, 7:429–454, 2006.
[21] V. Kuncak, M. Mayer, R. Piskac, and P. Suter. Complete functional synthesis. In PLDI, pages 316–329, 2010.
[22] N. Lavrac and S. Dzeroski. Inductive logic programming. In WLP, pages 146–160. Springer, 1994.
[23] V. Le and S. Gulwani. Flashextract: A framework for data extraction by examples. In PLDI, page 55, 2014.
[24] H. Lieberman. Your wish is my command: Programming by example. Morgan Kaufmann, 2001.
[25] A. Menon, O. Tamuz, S. Gulwani, B. Lampson, and A. Kalai. A machine learning framework for programming by example. In ICML, pages 187–195, 2013.
[26] R. Olsson. Inductive functional programming using incremental program transformation. Artif. Intell., 74(1):55–8, 1995.
[27] P. Osera and S. Zdancewic. Type-and-example-directed program synthesis. In PLDI, pages 619–630, 2015.
[28] D. Perelman, S. Gulwani, D. Grossman, and P. Provost. Test-driven synthesis. In PLDI, page 43, 2014.
[29] J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
[30] W. C. Rounds. Mappings and grammars on trees. Theory of Computing Systems, 4(3):257–287, 1970.
[31] S. Sankaranarayanan, S. Chaudhuri, F. Ivancic, and A. Gupta. Dynamic inference of likely data preconditions over predicates by tree learning. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 295–306. ACM, 2008.
[32] R. Singh and S. Gulwani. Synthesizing number transformations from input-output examples. In Computer Aided Verification, pages 634–651. Springer, 2012.
[33] A. Solar-Lezama, L. Tancau, R. Bodik, S. A. Seshia, and V. A. Saraswat. Combinatorial sketching for finite programs. In ASPLOS, pages 404–415, 2006.
[34] S. Srivastava, S. Gulwani, and J. S. Foster. Template-based program verification and program synthesis. STTT, 15(5-6):497–518, 2013.


Appendix

Proof of Theorem 1. Part 1, ⇒. Suppose T ≡ T′. We show that paths(T) = paths(T′). Since T ≡ T′, we have height(T) = height(T′) = h. Let v = root(T) and v′ = root(T′). Since T ≡ T′, we have L(v) = L(v′) = ` and D(v) = D′(v′) = d. The proof proceeds using induction on h. For the base case, we consider h = 1. In this case, V = {v}, V′ = {v′}, and E = E′ = ∅. Hence, paths(T) = paths(T′) = {(`, d)}. For the inductive case, let h = k + 1 where k ≥ 1. Let C = children(v) and let C′ = children(v′). Since T ≡ T′, L(C) = L′(C′), and since T and T′ are well-formed, there exists a one-to-one correspondence f : C → C′ such that f(vc) = v′c iff L(vc) = L′(v′c). Using this fact and condition (3) of Definition 4, we know that, for every vc ∈ children(v) = C, subtree(T, vc) ≡ subtree(T′, f(vc)). Now, observe that:

paths(T) = ⋃_{vc ∈ C} {((`, d), pi) | pi ∈ paths(subtree(T, vc))}

and paths(T′) is equal to:

⋃_{f(vc) ∈ C′} {((`, d), p′i) | p′i ∈ paths(subtree(T′, f(vc)))}

Using the inductive hypothesis, we have paths(T) = paths(T′).

Part 2, ⇐. Suppose paths(T) = paths(T′), and let v = root(T), v′ = root(T′), L(v) = `, L(v′) = `′, D(v) = d, and D(v′) = d′. We show that T ≡ T′. First, observe that paths(T) = paths(T′) implies height(T) = height(T′). The proof proceeds by induction on h. When h = 1, paths(T) = {(`, d)} and paths(T′) = {(`′, d′)}. Since paths(T) = paths(T′), this implies ` = `′ and d = d′. Hence, T ≡ T′. For the inductive case, let h = k + 1 where k ≥ 1. Let C = children(v) and let C′ = children(v′). Now, we have:

paths(T) = ⋃_{vi ∈ C} {((`, d), pi) | pi ∈ paths(subtree(T, vi))}

and

paths(T′) = ⋃_{v′i ∈ C′} {((`′, d′), p′i) | p′i ∈ paths(subtree(T′, v′i))}

Since paths(T) = paths(T′), this implies ` = `′ and d = d′ and, since T and T′ are well-formed, there must be a one-to-one correspondence f : C → C′ such that, for each vi ∈ C, v′i = f(vi) iff paths(subtree(T, vi)) = paths(subtree(T′, v′i)). Now, for any pair (vi, f(vi)), we have L(vi) = L′(f(vi)) because paths(subtree(T, vi)) = paths(subtree(T′, v′i)). Note that this implies condition (2) of Definition 4. Furthermore, since paths(subtree(T, vi)) = paths(subtree(T′, f(vi))), the inductive hypothesis implies subtree(T, vi) ≡ subtree(T′, v′i). Hence, condition (3) of Definition 4 is also satisfied.

Proof of Theorem 2. Suppose TRANSFORM(T) = T∗. We will show paths(T∗) = paths(T′), which implies T∗ ≡ T′ by Theorem 1. First, we show that, if p′ ∈ paths(T′), then p′ ∈ paths(T∗). By the unambiguity criterion, there exists a unique p ∈ paths(T) such that p ∼ p′. By the correctness requirement for path transformer f, we know p′ ∈ f(p) and p′ ≠ ⊥ since p′ ∈ paths(T′). Hence, p′ ∈ S′ in Figure 7. Since T∗ = SPLICE(S′) and SPLICE guarantees that paths(T∗) = S′, we also have p′ ∈ paths(T∗). Now, we show that, if p∗ ∈ paths(T∗), then p∗ ∈ paths(T′). Since p∗ ∈ paths(T∗), there must exist a p ∈ paths(T) such that p∗ ∈ f(p). Since p ∈ inputs(E′), correctness of f implies there exists some (p, p∗) ∈ E′. Now, suppose p∗ ∉ paths(T′). Since (p, p∗) ∈ E′ but p∗ ∉ paths(T′), there are two possibilities: (i) either p∗ = ⊥, or (ii) there is some other (T1, T2) ∈ E that results in p∗ getting added to E′. Now, (i) is not possible because paths(T∗) cannot contain ⊥, and (ii) is not possible due to the unambiguity requirement.

Proof of Theorem 3. Given a path p, let O(p) be the set:

{p′ | (T, T′) ∈ E ∧ p ∈ paths(T) ∧ p′ ∈ paths(T′) ∧ p ∼ p′}

Now, construct the function f as follows:

f(p) =  ∅      if ∀T ∈ inputs(E). p ∉ paths(T)
        {⊥}    if ∃(T, T′) ∈ E. p ∈ paths(T) ∧ ¬∃p′. (p′ ∈ paths(T′) ∧ p ∼ p′)
        O(p)   otherwise

First, we show (p ∈ inputs(E′) ∧ p′ ∈ f(p)) ⇒ (p, p′) ∈ E′. Since p′ ∈ f(p), we know that f(p) ≠ ∅; hence there exists T ∈ inputs(E) such that p ∈ paths(T). Now, if f(p) = {⊥}, we know that p does not have a corresponding output path; hence, by the definition of FURCATE, we know that (p, ⊥) ∈ E′. If f(p) ≠ ∅ ∧ f(p) ≠ {⊥}, then by the definition of O(p), we know that p′ must satisfy p ∈ paths(T) ∧ p′ ∈ paths(T′) ∧ p ∼ p′ for some (T, T′). Again, by the definition of FURCATE, we have (p, p′) ∈ E′. The argument that (p ∈ inputs(E′) ∧ (p, p′) ∈ E′) ⇒ p′ ∈ f(p) is very similar.

Proof of Theorem 4. Let Φ = {P1, . . . , Pk} be the set of partitions inferred by INFERPATHTRANS at the end of Phase I (see Figure 10), and let each Pi be the triple 〈Ei, χi, φi〉. Then the path transformer f synthesized by INFERPATHTRANS is:

λx. {φ1 → χ1 ⊕ . . . ⊕ φk → χk}

We first prove that ((p, p′) ∈ E) ⇒ (p′ ∈ f(p)). First, observe that, for any k, if PARTITION(∅, E, k) ≠ ∅, we have:

(⋃_{Pi ∈ Φ} Ei) = E

Hence, any (p, p′) ∈ E must belong to the examples of some partition Pi. Furthermore, our classification algorithm guarantees that, for any p ∈ inputs(Ei), we have φi[p/x] ≡ true. Hence, we know that χi[p/x] ∈ f(p). Since UNIFY(Ei) = χi ≠ null, we have χi[p/x] = p′. This implies p′ ∈ f(p).

We now prove the other direction of the theorem, i.e.:

((p ∈ inputs(E) ∧ p′ ∈ f(p))⇒ (p, p′) ∈ E)

Suppose that p ∈ inputs(E) and p′ ∈ f(p). Since p′ ∈ f(p), there must exist some Pi = 〈Ei, χi, φi〉 such that φi[p/x] ≡ true and χi[p/x] = p′. Recall that classification guarantees:

(a) ∀p ∈ inputs(Ei). (φi[p/x] ≡ true)
(b) ∀p ∈ (inputs(E) − inputs(Ei)). (φi[p/x] ≡ false)

Since p ∈ inputs(E), this implies p ∈ inputs(Ei); otherwise we would have φi[p/x] ≡ false. Hence, we must have (p, p′) ∈ Ei, which in turn implies (p, p′) ∈ E.