The Prometheus Taxonomic Database
Cédric Raguenaud, Jessie Kennedy, Peter Barclay
Napier University, Edinburgh
http://www.dcs.napier.ac.uk/~prometheus
Contents
  What is taxonomy?
  What are the features of taxonomic data/processes?
  Which database?
  The Prometheus approach
  Schema example
  Particularities of the model
  Example queries
  Summary & Conclusions
What is plant taxonomy?
[Figure: six example classification hierarchies, labelled (i) to (vi), built from the ranks family, tribe, genus, species and variety. A second illustration groups coloured shapes, first into "Red squares" and "Yellow round shapes", then into "Red squares", "Yellow round shapes" and "Purple diamond shapes".]
Plant Taxonomy Data
The data is hierarchical
Multiple overlapping hierarchies co-exist
  distinct hierarchies need to be identified for manipulation and extraction
  explicit relationships (=> graphs)
  querying is recursive and dependent on the context of the relationships
Nodes in the hierarchy are aggregate objects
  they also have associations to other objects outside the hierarchy
  differentiate between association and aggregation in relationships
  extraction of composite objects required
Levels of the hierarchy bear information
  ranks are biologically significant (e.g. "genus" vs "species")
Domain-specific rules are important
  data is derived based on domain-specific rules
  definition of constraints necessary for defining rules
  positioning of objects in a hierarchy depends on domain-specific constraints (e.g. family names must end with -aceae)
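A domain constraint of this kind can be written as a small validation rule. The sketch below is illustrative only: `RANK_SUFFIX`, `ConstraintViolation` and `check_name` are hypothetical names, not part of Prometheus, and the rule table covers only the regular botanical family ending -aceae.

```python
class ConstraintViolation(Exception):
    """Raised when a name breaks a rank-specific rule (hypothetical helper)."""


# Assumed rule table: rank -> required suffix. Only the family rule is modelled.
RANK_SUFFIX = {"family": "aceae"}


def check_name(rank: str, name: str) -> str:
    """Return the name unchanged if it satisfies the rule for its rank."""
    suffix = RANK_SUFFIX.get(rank)
    if suffix is not None and not name.lower().endswith(suffix):
        raise ConstraintViolation(f"{rank} name {name!r} must end with -{suffix}")
    return name


# Sample names used for illustration.
accepted = check_name("family", "Apiaceae")
unconstrained = check_name("genus", "Apium")
```

Keeping the rules in a table rather than in code paths is one way to let taxonomists add constraints per rank without touching the placement logic.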
Which Database?
Existing taxonomic databases are inadequate due to:
  simplicity of their model of taxonomy
    support single classifications only
  limitations of the underlying database:
    Relational model: limited semantics, no explicit relationships, no recursive querying
    Graph models: limited semantics, often no constraints
    Semi-structured data: limited semantics, no a priori schema
    Object-oriented models: limited support for relationships, no recursive querying
Need an OODB with relationships + graph functionality
OODBs with relationships already exist (e.g. OMS, Albano's, GraphDB), but they are
  limited (e.g. no QL, no semantics for relationships, or no constraints), or
  based on uncommon models (e.g. the collection-based model of Albano)
Prometheus Approach
Prometheus Model
  ODMG model extended with relationships as first-class constructs
  Association & Aggregation
    cardinality, traversability, sharability, dependency ...
  Reduces the gap between design and implementation
  Attributes on relationships are used to distinguish classifications
POOL
  OQL + operators for manipulating relationships and graphs
    query relationship objects
    define a query on aggregation relationships only
    specify a particular path to be followed through a hierarchy
    specify the transitive closure of a relationship
    return a hierarchy as a structure
Prometheus prototype implemented using POET (ODMG OODB) and Java
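To make the idea of relationships as first-class constructs concrete, here is a minimal Python sketch. It is not the Prometheus or POET API: the `Relationship` class, the `classification` helper, the `publication` attribute key and the placeholder taxon names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A relationship stored as an object in its own right (hypothetical)."""
    kind: str              # "aggregation" or "association"
    name: str              # e.g. "theCircumscription"
    origin: str
    destination: str
    attributes: tuple = () # (key, value) pairs carried by the relationship


def classification(relationships, publication):
    """Aggregation edges that belong to one published classification."""
    return [
        r for r in relationships
        if r.kind == "aggregation"
        and dict(r.attributes).get("publication") == publication
    ]


# Two overlapping classifications of the same family, told apart only by
# an attribute on the relationship. All names are placeholders.
rels = [
    Relationship("aggregation", "theCircumscription", "FamilyA", "GenusA",
                 (("publication", "P1"),)),
    Relationship("aggregation", "theCircumscription", "FamilyA", "GenusB",
                 (("publication", "P2"),)),
    Relationship("association", "theRank", "GenusA", "Genus"),
]
in_p1 = [r.destination for r in classification(rels, "P1")]
```

Because the classified objects carry no classification data themselves, the same `FamilyA` object can take part in any number of hierarchies.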
Simple Taxonomic Schema
[Figure: UML class diagram of the simple taxonomic schema.
Classes and attributes: Name (TheValidity, calculatedFullNameNoAuthor, calculatedFullName); Rank (theBinomial, theName); Epithet (theName); Specimen (barCode, herbarium, collectionNumber, latitude, longitude); Note; Type (definition); Author (givenNames, surname, DOB, DOD); AuthorAbbreviation (theAbbreviation); Publication (thePublication, thePage); PublicationAbbreviation (theAbbreviation); ReferenceDatabase (theReference); Date (theDate).
Relationships (mostly 0..n): theCircumscription (class Circumscription, with theCircAuthor and theCircPublication), Placement, LinkToType, LinkToDet, theRank, nextRank, previousRank, theEpithet, theAuthor, theAuthors, authors, collector, theAuthorAbbreviations, thePublicationAbbreviation, thePublication, theDate, theRef.]
Relationships in the DB
The semantics of relationships (e.g. composition) can vary
  Prometheus implements all these semantics by providing a set of behaviours, constraints, and flags that can be combined
  e.g. when a classification is published, it is unchangeable (even if it includes mistakes)
    the theCircumscription relationship implements the "not changeable" behaviour
Directionality of relationships is important
  for propagation of operations (e.g. deletion of a composition)
  as groups at any level contain groups at lower levels
    a family contains several genera, each of which contains several species
Attributes of relationships are important
  classifying is independent from the objects classified
    relationships build the classification
    attributes of relationships differentiate classifications
  the system is a generic classification system
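The two behaviours named above, a "not changeable" flag and deletion that propagates in one direction only, could be combined roughly as follows. This is a sketch under assumptions: `CompositionGraph`, `UnchangeableError` and the `changeable` flag are invented for illustration, each child is assumed to have a single parent, and all taxon names are placeholders.

```python
class UnchangeableError(Exception):
    """Raised when a published (frozen) relationship would be modified."""


class CompositionGraph:
    def __init__(self):
        self.children = {}   # parent -> list of children
        self.frozen = set()  # (parent, child) edges that may not change

    def relate(self, parent, child, changeable=True):
        self.children.setdefault(parent, []).append(child)
        if not changeable:
            self.frozen.add((parent, child))

    def unrelate(self, parent, child):
        if (parent, child) in self.frozen:
            raise UnchangeableError(f"{parent} -> {child} cannot change")
        self.children[parent].remove(child)

    def _subtree_edges(self, node):
        for child in self.children.get(node, []):
            yield (node, child)
            yield from self._subtree_edges(child)

    def delete(self, node):
        """Delete node and everything below it; never propagate upwards."""
        edges = list(self._subtree_edges(node))
        if any(edge in self.frozen for edge in edges):
            raise UnchangeableError(f"subtree of {node} is published")
        removed = [node] + [child for _, child in edges]
        for gone in removed:
            self.children.pop(gone, None)
        for kids in self.children.values():
            if node in kids:
                kids.remove(node)
        return removed


g = CompositionGraph()
g.relate("FamilyA", "GenusA")
g.relate("FamilyA", "GenusB")
g.relate("GenusA", "SpeciesA")
g.relate("FamilyP", "GenusP", changeable=False)  # a published placement
removed = g.delete("GenusA")
```

Deleting `GenusA` removes `SpeciesA` with it but leaves `FamilyA` in place, which is the downward-only propagation described on the slide.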
Downcast operator
  Select the Names whose type is called graveolens.
  select n from Name n where n.LinkToType[Name].theEpithet.theName = "graveolens"
  The type of the object targeted by the destination attribute of the TaxonomicType relationship should be Name, and not TypeDefinition as shown in the model.
  All objects that are not of type Name are discarded, and no error is reported.
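The downcast behaviour, keeping only destination objects of the requested type and silently dropping the rest, can be imitated in a few lines of Python. The classes below are simplified stand-ins for the schema classes, and `downcast`, `names_with_type_called`, `link_to_type` and the bar codes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Epithet:
    theName: str


@dataclass
class Specimen:
    barCode: str


@dataclass
class Name:
    theEpithet: Epithet
    link_to_type: list = field(default_factory=list)  # Names or Specimens


def downcast(objects, cls):
    """Keep objects of the given class; discard the others without error."""
    return [obj for obj in objects if isinstance(obj, cls)]


def names_with_type_called(names, epithet):
    """Rough analogue of n.LinkToType[Name].theEpithet.theName = epithet."""
    return [
        n for n in names
        if any(t.theEpithet.theName == epithet
               for t in downcast(n.link_to_type, Name))
    ]


type_name = Name(Epithet("graveolens"))
typed_by_name = Name(Epithet("apium"), [type_name, Specimen("BC-1")])
typed_by_specimen = Name(Epithet("other"), [Specimen("BC-2")])
result = names_with_type_called(
    [type_name, typed_by_name, typed_by_specimen], "graveolens"
)
```

Without the `downcast` step, reading `theEpithet` on a `Specimen` would raise an `AttributeError`; with it, the specimen-typed name simply does not match.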
Example Queries - 1
Querying relationships
  Select the Names whose rank is Genus.
  select n from Name n where n.theRank.destination.theName = "Genus"
  theRank is a relationship class.
  n is considered the origin of theRank in the query, and the relationship is followed only from source to destination, i.e. there is no reverse traversal of the relationship.
[Figure: schema fragments showing the classes Name, Rank, Specimen, Type and Epithet and the relationships theRank, LinkToType and theEpithet.]
Example Queries - 2
Aggregate operator
  Select the Names whose circumscription contains the specimen whose name is "X"
  select shallow aggregate n from Name n where n.theCircumscription[Specimen].barCode = "X"
  extracts the Name objects that satisfy the criterion, then finds, for each Name object, all objects aggregated to form the concept of Name.
Transitive closure
  Select the Names that contain, or whose subordinate Names contain, the specimen whose name is "X"
  select n from Name n where n.theCircumscription[Name]*.theCircumscription.destination[Specimen].barCode = "X"
  a relationship class is used as a simple regular expression: follow 0 or more theCircumscription relationships to find the Name objects containing the specimen called "X".
  "*": repetition of a path between 0 and n times; "?": an optional path; "+": repetition of a path one or more times
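The three repetition operators can be read as reachability functions over a single relationship. The sketch below is one way to express them in Python; `star`, `plus`, `optional`, the two dictionaries and the placeholder names are assumptions, not POOL internals.

```python
def star(start, step):
    """'*': nodes reachable by following step zero or more times."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in step(node):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def plus(start, step):
    """'+': nodes reachable by following step one or more times."""
    result = set()
    for nxt in step(start):
        result |= star(nxt, step)
    return result


def optional(start, step):
    """'?': nodes reachable by following step zero times or once."""
    return {start} | set(step(start))


# Placeholder data: Name -> subordinate Names, and Name -> specimen bar codes.
sub_names = {"FamilyA": ["GenusA"], "GenusA": ["SpeciesA"], "FamilyB": ["GenusB"]}
specimens = {"SpeciesA": ["X"], "GenusB": ["Z"]}


def step(name):
    return sub_names.get(name, [])


def names_containing(barcode):
    """Names that hold the specimen directly or through subordinate Names."""
    all_names = set(sub_names) | {c for kids in sub_names.values() for c in kids}
    return sorted(
        n for n in all_names
        if any(barcode in specimens.get(m, []) for m in star(n, step))
    )


holders = names_containing("X")
```

Tracking visited nodes in `star` keeps the traversal finite even if the stored relationships happen to contain a cycle.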
[Figure: schema fragments showing Name, Specimen and the theCircumscription relationship (class Circumscription, 0..n to 0..n), together with the classes Author, Epithet, Type, Publication and Rank and the relationships theEpithet, Placement, LinkToType, theAuthors, thePublication and theRank.]
Example Queries - 3
Follow operator
  Select the names hierarchy
  select n, n.theCircumscription from Name n follow theCircumscription
  the query engine knows that Name objects in the resulting set must be related by a theCircumscription relationship object.
  a hierarchy is a directed connected graph; therefore, the answer to such a query is a set of connected graphs.
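Returning "a set of connected graphs" amounts to grouping the relationship edges by connected component. A minimal sketch, assuming edges are plain (origin, destination) pairs with placeholder names; the `hierarchies` function is hypothetical and says nothing about how the Prometheus query engine does it.

```python
from collections import defaultdict


def hierarchies(edges):
    """Group directed edges into connected graphs, one edge list per graph."""
    neighbours = defaultdict(set)
    for origin, destination in edges:
        # Connectivity ignores direction; the edges themselves stay directed.
        neighbours[origin].add(destination)
        neighbours[destination].add(origin)

    seen, result = set(), []
    for node in sorted(neighbours):
        if node in seen:
            continue
        component, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbours[current] - component)
        seen |= component
        result.append(sorted(e for e in edges if e[0] in component))
    return result


edges = [
    ("FamilyA", "GenusA"),
    ("GenusA", "SpeciesA"),
    ("FamilyB", "GenusB"),
]
graphs = hierarchies(edges)
```

Each inner list is one hierarchy that a caller could render as a tree or pass on as a structure.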
[Figure: schema fragment showing Name and Specimen linked by the theCircumscription relationship (class Circumscription, 0..n to 0..n).]
XLINK
  Select the names that have specimen "X" in their circumscription
  select n from Name n where n.theCircumscription[Name]*.theCircumscription[Specimen].barCode = "X" xlink
  finds Name objects that are related to a Specimen whose name is "X" via one or more theCircumscription relationships in a single hierarchy.
  Without xlink, any path relating a Name to a Specimen would be followed and hierarchies would be mixed up.
Example Queries - 4
Integrity of graphs in path expressions
  Select the names containing specimen X in their circumscription, where the classification was published in Y
  select n from Name n, theCircumscription c where n.c[Name]*.c[Specimen].barCode = "X" xlink where c.theCircPublication.thePublication = "Y"
  finds all Name objects containing the specimen in their circumscription at any depth, but only according to the one publication that is declared in the xlink clause.
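The effect of pinning a path to one publication can be shown with a small traversal in which every edge carries its publication. This is an illustrative sketch only: the `contains` function, the edge tuples and the placeholder names are assumptions, and it models the pinned case rather than the full xlink semantics.

```python
def contains(edges, name, barcode, publication=None):
    """True if barcode is reachable from name.

    With a publication given, every edge on the path must carry it;
    with None, edges from different classifications may be mixed.
    """
    stack, seen = [name], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for origin, destination, pub in edges:
            if origin != node:
                continue
            if publication is not None and pub != publication:
                continue
            if destination == barcode:
                return True
            stack.append(destination)
    return False


# (origin, destination, publication); specimen "X" is placed in SpeciesA
# only by publication Z, and in SpeciesB by publication Y.
edges = [
    ("GenusA", "SpeciesA", "Y"),
    ("SpeciesA", "X", "Z"),
    ("GenusB", "SpeciesB", "Y"),
    ("SpeciesB", "X", "Y"),
]
names = ["GenusA", "SpeciesA", "GenusB", "SpeciesB"]
mixed = [n for n in names if contains(edges, n, "X")]
pinned = [n for n in names if contains(edges, n, "X", publication="Y")]
```

`GenusA` appears in `mixed` only because its path combines an edge from Y with an edge from Z; pinning the publication removes that cross-classification match.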
[Figure: schema fragment showing Name and Specimen linked by the theCircumscription relationship (class Circumscription, 0..n to 0..n), with theCircAuthor leading to Author and theCircPublication leading to Publication.]
Summary & Conclusions
New model (schema) of plant taxonomy defined
  extensive use of relationships
Plant taxonomy DBMS implemented using Prometheus DB
  in final stages of testing by taxonomists
  stores all examples of data provided
  can answer all queries posed
  demo via HTTP interface available
  soon available for download
Conclusion
  Explicit relationships in the DB provide ways to improve
    modelling power & mapping between model and implementation
    support for graph structures
  QL support is necessary to profit from relationships
    increased power of ad hoc querying without being domain specific
Acknowledgements
Collaborators
  Dr Mark Watson, Dr Martin Pullan, Dr Mark Newman
  Royal Botanic Garden, Edinburgh
Funding
  UK Engineering and Physical Sciences Research Council and Biotechnology and Biological Sciences Research Council, Bioinformatics Initiative
Project page: http://www.dcs.napier.ac.uk/~prometheus
Demo: http://146.176.18.75:8080