![Page 1: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/1.jpg)
Working with Trees in the Phyloinformatic Age
William H. Piel
Yale Peabody Museum
Hilmar Lapp
NESCent, Duke University
![Page 2: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/2.jpg)
Dealing with the Growth of Phyloinformatics
• Trees: Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees: Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
![Page 3: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/3.jpg)
Searching Stored Tree
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Transitive Closure
![Page 4: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/4.jpg)
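Dewey system:
[Figure: tree with leaves A, B, C, D, E in which every node is labeled with its Dewey path: root 0; internal nodes 0.1, 0.2, 0.2.1; leaves 0.1.1, 0.1.2, 0.2.1.1, 0.2.1.2, 0.2.2]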
![Page 5: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/5.jpg)
Label Path
Root 0
NULL 0.1
A 0.1.1
B 0.1.2
NULL 0.2
NULL 0.2.1
C 0.2.1.1
D 0.2.1.2
E 0.2.2
Find clade for: Z = (<CS+Ds)
Find common pattern starting from left
SELECT * FROM nodes
WHERE (path LIKE '0.2.1%');
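The path-enumeration lookup above can be tried end to end with a minimal sketch. It assumes an in-memory SQLite database and a two-column `nodes` table (`label`, `path`) holding the slide's rows; the helper names `common_path` and `clade` are illustrative and not part of any tool cited in the talk. The sketch matches on the prefix plus an explicit dot so that a path such as `0.2.10` would not be picked up by accident.

```python
import sqlite3

# Dewey paths from the slide's table; unlabeled internal nodes use None.
ROWS = [
    ("Root", "0"), (None, "0.1"), ("A", "0.1.1"), ("B", "0.1.2"),
    (None, "0.2"), (None, "0.2.1"), ("C", "0.2.1.1"), ("D", "0.2.1.2"),
    ("E", "0.2.2"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (label TEXT, path TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", ROWS)


def common_path(paths):
    """Longest shared prefix, compared component by component from the left."""
    prefix = []
    for parts in zip(*(p.split(".") for p in paths)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


def clade(labels):
    """Return the clade root path and every node at or below it."""
    marks = ",".join("?" * len(labels))
    paths = [row[0] for row in conn.execute(
        f"SELECT path FROM nodes WHERE label IN ({marks})", labels)]
    root = common_path(paths)
    members = conn.execute(
        "SELECT label, path FROM nodes "
        "WHERE path = ? OR path LIKE ? ORDER BY path",
        (root, root + ".%")).fetchall()
    return root, members


root, members = clade(["C", "D"])
print(root)
print(members)
```

Because the match is a plain string prefix, the query needs no recursion, but every path below a node must be rewritten if that node is moved.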
![Page 6: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/6.jpg)
• ATreeGrep
– Uses special suffix indexing to optimize speed
– Shasha, D., J. T. L. Wang, H. Shan and K. Zhang. 2002. ATreeGrep: Approximate Searching in Unordered Trees. Proceedings of the 14th SSDBM, Edinburgh, Scotland, pp. 89-98.
• Crimson
– Uses nested subtrees to avoid long strings
– Zheng, Y., S. Fisher, S. Cohen, S. Guo, J. Kim, and S. B. Davidson. 2006. Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms. 32nd International Conference on Very Large Data Bases, ACM, pp. 1231-1234.
![Page 7: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/7.jpg)
Searching Stored Tree
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Metrics
• Transitive Closure
![Page 8: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/8.jpg)
[Figure: tree with leaves A, B, C, D, E; each node carries a left ID and a right ID, numbered 1 to 18]
Depth-first traversal scoring each node with a left and right ID
![Page 9: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/9.jpg)
Label  Left  Right
-      1     18
-      2     7
A      3     4
B      5     6
-      8     17
-      9     14
C      10    11
D      12    13
E      15    16
SELECT * FROM nodes
INNER JOIN nodes AS include
ON (nodes.left_id BETWEEN include.left_id AND include.right_id)
WHERE include.node_id = 5;
Minimum Spanning Clade of Node 5
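As a rough illustration of the nested-set numbering and the clade query above, the sketch below numbers a tree by depth-first traversal and then runs the same `BETWEEN` join in SQLite. Two things are assumptions: the topology `((A,B),((C,D),E))`, which is read off the left/right table, and the `node_id` values, which are simply assigned in preorder (the slide does not show them). The join alias is `incl` rather than `include` purely to stay clear of keyword clashes in some databases.

```python
import sqlite3

# Assumed topology, read off the slide's left/right table.
TREE = (("A", "B"), (("C", "D"), "E"))


def number(node, counter, rows):
    """Assign left_id on the way down and right_id on the way back up."""
    left = counter[0]
    counter[0] += 1
    label = node if isinstance(node, str) else None
    if label is None:
        for child in node:
            number(child, counter, rows)
    right = counter[0]
    counter[0] += 1
    rows.append((label, left, right))


rows = []
number(TREE, [1], rows)
rows.sort(key=lambda r: r[1])  # preorder, i.e. ordered by left_id

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, label TEXT, "
    "left_id INTEGER, right_id INTEGER)")
# node_id is an assumption: nodes are numbered 1..9 in preorder.
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?)",
    [(i, *row) for i, row in enumerate(rows, start=1)])

clade = conn.execute("""
    SELECT nodes.node_id, nodes.label
    FROM nodes
    INNER JOIN nodes AS incl
      ON (nodes.left_id BETWEEN incl.left_id AND incl.right_id)
    WHERE incl.node_id = 5
    ORDER BY nodes.left_id
""").fetchall()

print(rows)
print(clade)
```

With this numbering, node 5 is the parent of the (C, D) clade and E, so the query returns that node and everything nested inside its left/right interval.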
![Page 10: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/10.jpg)
• PhyloFinder
– Duhong Chen et al.
– http://pilin.cs.iastate.edu/phylofinder/
• Mackey, A. 2002. Relational Modeling of Biological Data: Trees and Graphs. Bioinformatics Technology Conference. http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
![Page 11: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/11.jpg)
Searching Stored Tree
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Metrics
• Transitive Closure
![Page 12: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/12.jpg)
[Figure: tree with leaves A, B, C, D, E and node IDs 1 to 9; each node is annotated with its label, its node ID, and its parent's node ID]
![Page 13: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/13.jpg)
node_label  node_id  parent_id
-           1        -
-           2        1
A           3        2
B           4        2
-           5        1
-           6        5
C           7        6
D           8        6
E           9        5
SQL query to find the parent node of node 'D':
SELECT *
FROM nodes AS parent
INNER JOIN nodes AS child
ON (child.parent_id = parent.node_id)
WHERE child.node_label = 'D';
…but this requires an external procedure to navigate the tree.
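To make the "external procedure" concrete, here is a minimal sketch of what such a procedure might look like when the adjacency list is walked from client code. The node IDs and parent IDs are the ones legible in the slide's figure; the functions `ancestors` and `mrca` are hypothetical helpers, and each dictionary lookup stands in for one round trip to the database.

```python
# (node_id, parent_id, node_label) rows as read from the slide's figure.
NODES = [
    (1, None, None), (2, 1, None), (3, 2, "A"), (4, 2, "B"),
    (5, 1, None), (6, 5, None), (7, 6, "C"), (8, 6, "D"), (9, 5, "E"),
]
parent = {node_id: parent_id for node_id, parent_id, _ in NODES}
by_label = {label: node_id for node_id, _, label in NODES if label}


def ancestors(node_id):
    """Follow parent pointers to the root: one lookup per tree level."""
    path = []
    while parent[node_id] is not None:
        node_id = parent[node_id]
        path.append(node_id)
    return path


def mrca(a, b):
    """Most recent common ancestor of two leaves, found by climbing from both."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return None


print(ancestors(by_label["D"]))
print(mrca(by_label["D"], by_label["E"]))
```

The number of lookups grows with tree depth, which is the cost the later slides avoid by precomputing paths.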
![Page 14: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/14.jpg)
Searching Stored Tree
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Metrics
• Transitive Closure
![Page 15: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/15.jpg)
Searching trees by distance metrics: USim distance
Wang, J. T. L., H. Shan, D. Shasha and W. H. Piel. 2005. Fast Structural Search in Phylogenetic Databases. Evolutionary Bioinformatics Online, 1: 37-46
[Figure: two trees on leaves A, B, C, D, each shown with its distance matrix]
A B C D
A 0 1 2 3
B 1 0 2 3
C 1 1 0 2
D 1 1 1 0
A B C D
A 0 1 2 2
B 1 0 2 2
C 2 2 0 1
D 2 2 1 0
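The two matrices can be reproduced with a short sketch under two assumptions that the transcript does not state: that the trees are `(((A,B),C),D)` and `((A,B),(C,D))`, and that each cell holds the number of edges from the row leaf up to its most recent common ancestor with the column leaf. Both assumptions happen to regenerate the slide's numbers exactly, but the cited paper remains the authority on how USim itself is defined.

```python
def leaf_paths(tree, path=()):
    """Map each leaf label to its tuple of child indices from the root."""
    if isinstance(tree, str):
        return {tree: path}
    out = {}
    for index, child in enumerate(tree):
        out.update(leaf_paths(child, path + (index,)))
    return out


def up_matrix(tree):
    """Cell [u][v] = edges from leaf u up to the common ancestor of u and v."""
    paths = leaf_paths(tree)
    labels = sorted(paths)

    def up(u, v):
        shared = 0
        for x, y in zip(paths[u], paths[v]):
            if x != y:
                break
            shared += 1
        return len(paths[u]) - shared

    return [[up(u, v) for v in labels] for u in labels]


# Assumed tree shapes; they are inferred from the matrices, not from the figure.
ladder = up_matrix(((("A", "B"), "C"), "D"))
balanced = up_matrix((("A", "B"), ("C", "D")))

for row in ladder:
    print(row)
for row in balanced:
    print(row)
```

The matrix is asymmetric whenever the two leaves sit at different depths below their common ancestor, which is why the first matrix differs above and below the diagonal.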
![Page 16: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/16.jpg)
Searching Stored Tree
• Path Enumerations
• Nested Sets
• Adjacency Lists
• Transitive Closure
![Page 17: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/17.jpg)
Transitive Closure
• Finding paths between vertices on a graph
• DB2 and Oracle have special functions:
– From Edge
  Start With (child_id = A and tree_id = T)
  Connect By (Prior parent_id = child_id) And (Prior tree_id = tree_id)
• Nakhleh, L., D. Miranker, F. Barbancon, W. H. Piel, and M. Donoghue. 2003. Requirements of phylogenetic databases. Third IEEE Symposium on Bioinformatics and Bioengineering, p. 141-148.
• Paths can be precomputed and stored: BioSQL
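For databases without a `CONNECT BY` clause, a recursive common table expression gives a comparable on-the-fly traversal. The sketch below is an analogue, not the talk's own method: it uses SQLite's `WITH RECURSIVE`, and the `edge` table, its columns, and the sample tree (the nine-node tree from the adjacency-list slides) are assumptions chosen to mirror the `Start With` / `Connect By` fragment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (tree_id INTEGER, child_id INTEGER, parent_id INTEGER)")
# Edges of the nine-node example tree, all stored under tree_id 1.
conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", [
    (1, 2, 1), (1, 3, 2), (1, 4, 2), (1, 5, 1),
    (1, 6, 5), (1, 7, 6), (1, 8, 6), (1, 9, 5),
])

ANCESTORS = """
WITH RECURSIVE up(node_id, depth) AS (
    -- Start With: the edge whose child is the starting node
    SELECT parent_id, 1 FROM edge
    WHERE child_id = :start AND tree_id = :tree
    UNION ALL
    -- Connect By: the previous row's parent becomes the next row's child
    SELECT e.parent_id, up.depth + 1
    FROM edge e JOIN up ON e.child_id = up.node_id
    WHERE e.tree_id = :tree
)
SELECT node_id, depth FROM up ORDER BY depth
"""

path_to_root = conn.execute(ANCESTORS, {"start": 8, "tree": 1}).fetchall()
print(path_to_root)
```

The traversal is recomputed on every call; storing the resulting (descendant, ancestor, distance) rows once is the precomputed alternative.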
![Page 18: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/18.jpg)
Dealing with the Growth of Phyloinformatics
• Trees Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
![Page 19: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/19.jpg)
BioSQL: http://www.biosql.org/
Schema for persistent storage of sequences and features, tightly integrated with BioPerl (+ BioPython, BioJava, and BioRuby)
• phylodb extension designed at NESCent Hackathon
• Perl command-line interface by Jamie Estill, GSoC
![Page 20: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/20.jpg)
[Figure: tree with leaves A, B, C and node IDs 1 to 5, annotated with the paths between its nodes]
CREATE TABLE node_path (
  child_node_id integer,
  parent_node_id integer,
  distance integer
);
Index of all paths from ancestors to descendants
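One way such an index could be filled is sketched below: climb from every node to the root and emit one row per ancestor. The tree `((A,B),C)` with node 1 as root, node 2 as the parent of A (3) and B (4), and C as node 5 is an assumption about the figure, and the function `closure` is illustrative rather than BioSQL's own loader. Zero-length self paths are left out here; a real schema may or may not store them.

```python
import sqlite3

# Assumed layout of the example tree: child node_id -> parent node_id.
PARENT = {2: 1, 3: 2, 4: 2, 5: 1}


def closure(parent):
    """Emit (child, ancestor, distance) for every ancestor of every node."""
    rows = []
    for child in parent:
        ancestor, distance = parent[child], 1
        while ancestor is not None:
            rows.append((child, ancestor, distance))
            ancestor, distance = parent.get(ancestor), distance + 1
    return rows


paths = closure(PARENT)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_path (child_node_id integer, "
    "parent_node_id integer, distance integer)")
conn.executemany("INSERT INTO node_path VALUES (?, ?, ?)", paths)

stored = conn.execute("SELECT COUNT(*) FROM node_path").fetchone()[0]
print(sorted(paths))
print(stored)
```

The table grows with the sum of all node depths, so it trades storage for queries that need no recursion at all.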
![Page 21: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/21.jpg)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B';
Find all paths where A and B share a common parent_node_id
![Page 22: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/22.jpg)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance
LIMIT 1;
…of those paths, select one that has the shortest path
![Page 23: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/23.jpg)
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
AND pA.child_node_id = nA.node_id
AND nA.node_label = 'A'
AND pB.child_node_id = nB.node_id
AND nB.node_label = 'B'
ORDER BY pA.distance DESC
LIMIT 1;
…of those paths, select one that has the longest path
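The shortest-path and longest-path variants can be compared side by side with a small sketch. It assumes the same `((A,B),C)` example tree as before (node 1 root, node 2 the parent of A and B), loaded into SQLite; the wrapper `common_ancestor` is a hypothetical convenience that only switches the sort direction of the slide's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT);
CREATE TABLE node_path (child_node_id INTEGER, parent_node_id INTEGER,
                        distance INTEGER);
-- Assumed example tree ((A,B),C): 1 = root, 2 = parent of A and B.
INSERT INTO nodes VALUES (1, NULL), (2, NULL), (3, 'A'), (4, 'B'), (5, 'C');
INSERT INTO node_path VALUES
  (2, 1, 1), (3, 2, 1), (3, 1, 2), (4, 2, 1), (4, 1, 2), (5, 1, 1);
""")

COMMON = """
SELECT pA.parent_node_id
FROM node_path pA, node_path pB, nodes nA, nodes nB
WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id AND nA.node_label = ?
  AND pB.child_node_id = nB.node_id AND nB.node_label = ?
ORDER BY pA.distance {direction}
LIMIT 1
"""


def common_ancestor(a, b, nearest=True):
    """Nearest shared ancestor (shortest path) or farthest one (longest path)."""
    sql = COMMON.format(direction="ASC" if nearest else "DESC")
    return conn.execute(sql, (a, b)).fetchone()[0]


print(common_ancestor("A", "B"))                  # shortest path
print(common_ancestor("A", "B", nearest=False))   # longest path
```

On this tree the shortest path picks out the most recent common ancestor of A and B, while the longest path climbs all the way to the root.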
![Page 24: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/24.jpg)
Find the maximum spanning clade (i.e. the subtree) for each tree that includes A and B but not C:
-- Return an adjacency list for each subtree
SELECT e.parent_id AS parent, e.child_id AS child, ch.node_label, pt.tree_id
FROM node_path p, edges e, nodes pt, nodes ch
WHERE e.child_id = p.child_node_id
AND pt.node_id = e.parent_id
AND ch.node_id = e.child_id
-- Get all ancestors shared by A and B
AND p.parent_node_id IN (
  SELECT pA.parent_node_id
  FROM node_path pA, node_path pB, nodes nA, nodes nB
  WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id
  AND nA.node_label = 'A'
  AND pB.child_node_id = nB.node_id
  AND nB.node_label = 'B')
-- Exclude those that are also ancestors to C
AND NOT EXISTS (
  SELECT 1 FROM node_path np, nodes n
  WHERE np.child_node_id = n.node_id
  AND n.node_label = 'C'
  AND np.parent_node_id = p.parent_node_id);
![Page 25: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/25.jpg)
Find trees that contain a clade that includes A and B but not C:
-- List the set of trees with these ancestors
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
AND ch.tree_id = t.tree_id
-- Get all ancestors shared by A and B
AND p.parent_node_id IN (
  SELECT pA.parent_node_id
  FROM node_path pA, node_path pB, nodes nA, nodes nB
  WHERE pA.parent_node_id = pB.parent_node_id
  AND pA.child_node_id = nA.node_id
  AND nA.node_label = 'A'
  AND pB.child_node_id = nB.node_id
  AND nB.node_label = 'B')
-- Exclude those that are also ancestors to C
AND NOT EXISTS (
  SELECT 1 FROM node_path np, nodes n
  WHERE np.child_node_id = n.node_id
  AND n.node_label = 'C'
  AND np.parent_node_id = p.parent_node_id);
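The tree-finding query can be exercised on a toy database. The sketch below loads two hypothetical trees into SQLite, `((A,B),C)` and `((A,C),B)`, with made-up node IDs and with the tree name set to its Newick string for readability; only the first tree should match, because in the second the only ancestor shared by A and B is also an ancestor of C.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE trees (tree_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT,
                    tree_id INTEGER);
CREATE TABLE node_path (child_node_id INTEGER, parent_node_id INTEGER,
                        distance INTEGER);
-- Two hypothetical trees; node IDs are arbitrary but unique across trees.
INSERT INTO trees VALUES (1, '((A,B),C)'), (2, '((A,C),B)');
INSERT INTO nodes VALUES
  (1, NULL, 1), (2, NULL, 1), (3, 'A', 1), (4, 'B', 1), (5, 'C', 1),
  (11, NULL, 2), (12, NULL, 2), (13, 'A', 2), (14, 'C', 2), (15, 'B', 2);
INSERT INTO node_path VALUES
  (2, 1, 1), (3, 2, 1), (3, 1, 2), (4, 2, 1), (4, 1, 2), (5, 1, 1),
  (12, 11, 1), (13, 12, 1), (13, 11, 2), (14, 12, 1), (14, 11, 2),
  (15, 11, 1);
""")

QUERY = """
SELECT DISTINCT t.tree_id, t.name
FROM node_path p, nodes ch, trees t
WHERE ch.node_id = p.child_node_id
  AND ch.tree_id = t.tree_id
  AND p.parent_node_id IN (          -- ancestors shared by A and B
    SELECT pA.parent_node_id
    FROM node_path pA, node_path pB, nodes nA, nodes nB
    WHERE pA.parent_node_id = pB.parent_node_id
      AND pA.child_node_id = nA.node_id AND nA.node_label = 'A'
      AND pB.child_node_id = nB.node_id AND nB.node_label = 'B')
  AND NOT EXISTS (                   -- drop ancestors that also lead to C
    SELECT 1 FROM node_path np, nodes n
    WHERE np.child_node_id = n.node_id
      AND n.node_label = 'C'
      AND np.parent_node_id = p.parent_node_id)
"""

matches = conn.execute(QUERY).fetchall()
print(matches)
```

Because node IDs are unique across trees, the self-join on `parent_node_id` never pairs an A from one tree with a B from another.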
![Page 26: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/26.jpg)
Find trees that contain a clade that includes (A, B, C) but not D or E:
SELECT qry.tree_id, MIN(qry.name) AS "tree_name"
FROM (
  SELECT DISTINCT ON (n.node_id) n.node_id, t.tree_id, t.name
  FROM trees t, nodes n,
    -- Get all ancestors of A, B, C from all trees that have A, B, C
    (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id
     FROM nodes inN, node_path inP
     WHERE inN.node_label IN ('A','B','C')
     AND inP.child_node_id = inN.node_id
     GROUP BY inN.tree_id, inP.parent_node_id
     HAVING COUNT(inP.child_node_id) = 3  -- number of ingroups that share node
     ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
  WHERE n.node_id IN (lca.parent_node_id)
  AND t.tree_id = n.tree_id
  -- Exclude those that are also ancestors to D, E
  AND NOT EXISTS (
    SELECT 1 FROM nodes outN, node_path outP
    WHERE outN.node_label IN ('D','E')
    AND outP.child_node_id = outN.node_id
    AND outP.parent_node_id = lca.parent_node_id)
  -- But make sure that the tree still contains D, E
  AND EXISTS (
    SELECT c.tree_id FROM trees c, nodes q
    WHERE q.node_label IN ('D','E')
    AND q.tree_id = c.tree_id
    AND c.tree_id = t.tree_id
    GROUP BY c.tree_id
    HAVING COUNT(c.tree_id) = 2)  -- number of non-ingroups that must be in tree
) AS qry
GROUP BY (qry.tree_id)
HAVING COUNT(qry.node_id) = 1;  -- number of clades that each tree must satisfy
![Page 27: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/27.jpg)
Here's a faster, cleaner version:
SELECT t.tree_id, t.name
FROM trees t
INNER JOIN (
  SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
  FROM nodes inN, node_path inP
  WHERE inN.node_label IN ('A','B','C')
  AND inP.child_node_id = inN.node_id
  GROUP BY inN.tree_id, inP.parent_node_id
  HAVING COUNT(inP.child_node_id) = 3
  ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca USING (tree_id)
WHERE NOT EXISTS (
  SELECT 1 FROM nodes outN, node_path outP
  WHERE outN.node_label IN ('D','E')
  AND outP.child_node_id = outN.node_id
  AND outP.parent_node_id = lca.parent_node_id)
AND EXISTS (
  SELECT c.tree_id FROM trees c, nodes q
  WHERE q.node_label IN ('D','E')
  AND q.tree_id = c.tree_id
  AND c.tree_id = t.tree_id
  GROUP BY c.tree_id
  HAVING COUNT(c.tree_id) = 2);
![Page 28: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/28.jpg)
Matching a whole tree means querying for all clades
[Figure: tree with leaves A, B, C, D, E and node IDs 1 to 9]
(A, B) but not C, D, E
(C, D) but not A, B, E
(C, D, E) but not A, B
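The list of clade queries for a whole tree can be generated mechanically. The sketch below assumes the topology `((A,B),((C,D),E))` and represents the tree as nested tuples; `clade_queries` is an illustrative helper that returns one (ingroup, outgroup) pair per internal node other than the root, in post-order.

```python
def collect(node, found):
    """Return the leaf set under node, recording one set per internal node."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(collect(child, found) for child in node))
    found.append(leaves)
    return leaves


def clade_queries(tree):
    """One (ingroup, outgroup) pair per internal node, skipping the root."""
    found = []
    everything = collect(tree, found)
    return [(sorted(clade), sorted(everything - clade))
            for clade in found if clade != everything]


# Assumed topology of the slide's five-leaf example.
queries = clade_queries((("A", "B"), (("C", "D"), "E")))
for ingroup, outgroup in queries:
    print(f"({', '.join(ingroup)}) but not {', '.join(outgroup)}")
```

Each pair maps onto one "includes X but not Y" query of the kind shown earlier, so a tree with n internal nodes below the root needs n such queries.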
![Page 29: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/29.jpg)
Dealing with the Growth of Phyloinformatics
• Trees Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
![Page 30: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/30.jpg)
Mining trees for interesting, general, relationship questions:
[Figure: two alternative trees on Sus scrofa, Hippopotamus, Balaenoptera, Equus caballus, Felis catus; one groups Sus scrofa with Hippopotamus, the other groups Hippopotamus with Balaenoptera]
(((Sus_scrofa, Hippopotamus),Balaenoptera),Equus_caballus)
vs
((Sus_scrofa, (Hippopotamus,Balaenoptera)),Equus_caballus)
![Page 31: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/31.jpg)
Even with perfectly resolved OTUs, you will still fail to hit relevant trees:
[Figure: two trees whose leaves differ only at the species level]
Tree 1: Sus scrofa, Hippopotamus, Balaenoptera, Equus caballus, Felis catus
Tree 2: Sus celebensis, Hippopotamus, Balaenoptera, Equus asinus, Felis catus
![Page 32: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/32.jpg)
Step 1: for each clade in all trees in the database, run a stem query on a classification tree (e.g. NCBI)
[Figure: tree with leaves A, B, C, D, E and node IDs 1 to 9]
Stem Queries:
Node 2: (>A, B - C, D, E)
Node 3: (>A - B, C, D, E)
Node 4: (>B - A, C, D, E)
Node 5: (>C, D, E - A, B)
Node 6: (>C, D - A, B, E)
Node 7: (>C - A, B, D, E)
Node 8: (>D - A, B, C, E)
Node 9: (>E - A, B, C, D)
Step 2: label each node with an NCBI taxon id (if there is a match)
Step 3: do the same for the query tree
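A stem query of this kind can be sketched against a toy classification. The reading assumed here is that `(>ingroup - outgroup)` asks for the most inclusive taxon that contains every ingroup member and no outgroup member. The `PARENT` map is a deliberately simplified, hypothetical classification (species, genus, superfamily, Catarrhini), not the NCBI taxonomy, and `stem_query` is an illustrative helper.

```python
# Toy classification: taxon -> parent taxon (simplified, not real NCBI data).
PARENT = {
    "Gorilla gorilla": "Gorilla", "Homo sapiens": "Homo",
    "Pan troglodytes": "Pan", "Pongo pygmaeus": "Pongo",
    "Gorilla": "Hominoidea", "Homo": "Hominoidea",
    "Pan": "Hominoidea", "Pongo": "Hominoidea",
    "Macaca sinica": "Macaca", "Macaca nigra": "Macaca",
    "Macaca irus": "Macaca", "Macaca": "Cercopithecoidea",
    "Hominoidea": "Catarrhini", "Cercopithecoidea": "Catarrhini",
}


def lineage(taxon):
    """The taxon itself followed by its ancestors, nearest first."""
    out = [taxon]
    while out[-1] in PARENT:
        out.append(PARENT[out[-1]])
    return out


def stem_query(ingroup, outgroup):
    """Most inclusive taxon holding all of ingroup and none of outgroup."""
    excluded = set()
    for taxon in outgroup:
        excluded.update(lineage(taxon))
    best = None
    for taxon in lineage(ingroup[0]):
        if taxon in excluded:
            break  # every taxon above this one contains an outgroup member
        if all(taxon in lineage(member) for member in ingroup):
            best = taxon
    return best


apes = ["Gorilla gorilla", "Homo sapiens", "Pan troglodytes"]
macaques = ["Macaca sinica", "Macaca nigra"]
print(stem_query(["Gorilla gorilla"], apes[1:] + macaques))
print(stem_query(apes, macaques))
print(stem_query(["Pongo pygmaeus"], ["Macaca irus"]))
```

Under this reading, a leaf is relabeled with the broadest taxon it can stand for in its own tree, so two trees with different species can still end up sharing labels at a higher rank.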
![Page 33: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/33.jpg)
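Rename nodes according to their deepest stem query…
[Figure: a tree with leaves Gorilla gorilla, Homo sapiens, Pan troglodytes, Macaca sinica, Macaca nigra; after renaming, its nodes read Gorilla, Homo, Pan, Macaca sinica, Macaca nigra, with Hominoidea and Cercopithecoidea as clade labels]
[Figure: a second tree with leaves Pongo pygmaeus and Macaca irus, relabeled Hominoidea and Cercopithecoidea]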
![Page 34: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/34.jpg)
Dealing with the Growth of Phyloinformatics
• Trees Too Many
– Search, organize, triage, summarize, synthesize
• Review existing methods
• Describe queries for BioSQL phylo extension
• Making generic queries
• Trees Too Big
– Visualizing and manipulating large trees
• Demo PhyloWidget
![Page 35: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/35.jpg)
PhyloWidget
• Greg Jordan
– Google Summer of Code student
– Nick Goldman's group, EBI
• Java Applet
– Uses the Processing graphics library
• Originally intended as a graphical phylogenetic query and display tool for TreeBASE, BioSQL, etc.
• Can be used for:
– Manipulating, visualizing large trees
– Building supertrees through pruning & grafting
![Page 36: Working with Trees in the Phyloinformatic Age. WH Piel](https://reader035.vdocument.in/reader035/viewer/2022070315/554e811fb4c9054a698b549f/html5/thumbnails/36.jpg)
Thanks