JCR or RDBMS why, when, how?
Bertil Chapuis
12/31/2008
Creative Commons Attribution 2.5 Switzerland License
This paper compares Java content repositories (JCR) and relational database management systems (RDBMS).
The choice between these technologies is often made arbitrarily. The aim is to clarify why this choice should be
discussed, when one technology should be selected instead of another and how the selected technology should
be used. Four levels (data model, specification, project, product) are analyzed to show the impact of this choice
on different scopes. A discussion of the best choice depending on the context follows. This defines the
foundations of a decision framework.
Table of Contents
1 Introduction
1.1 What is compared?
1.2 Why is it comparable?
1.3 What is the purpose of this comparison?
1.4 How will it be compared?
2 State of the art
2.1 Roles
2.2 Domains of responsibility
2.3 Data Models
3 Data model comparison
3.1 Model Definitions
3.2 Structure
3.3 Integrity
3.4 Operations and queries
3.5 Navigation
3.6 Synthesis
4 Specification comparison
4.1 Use Case Definition
4.2 Structure
4.3 Integrity
4.4 Operations and queries
4.5 Navigation
4.6 Transactions
4.7 Inheritance
4.8 Access Control
4.9 Events
4.10 Version control
4.11 Synthesis
5 Development process comparison
5.1 Data Understandability
5.2 Coding Efficiency
5.3 Application Changeability
5.4 Synthesis
6 Product comparison
6.1 Theoretical analysis
6.2 Benchmark
6.3 Synthesis
7 Scenario Analysis
7.1 Survey
7.2 Reservation
7.3 Content management
7.4 Workflow
8 Conclusion
9 Appendix – JCR and design
9.1 Model
9.2 Convention
9.3 Methodology
9.4 Application
10 Appendix – Going further
10.1 Queries in semi-structured models
10.2 Queries on transitive relationships
10.3 Modular and configurable databases
11 Bibliography
University of Lausanne & Day Software AG JCR or RDBMS
1 Introduction
Day Software AG (Day) led the development of a
Java specification which defines a uniform
application programming interface (API) to manage
content. This specification is called the Content Repository
API for Java (JCR) and is part of the Java Community
Process. Implementations of this specification are
currently provided by well-known companies such as
Oracle, Day or Alfresco.
JCR implementations are often used to build high
level content management systems and collaborative
applications. Day also provides an open source
implementation of the specification which is called
Jackrabbit and which is used as a shell for some of
its products.
This diploma thesis takes place in this context. Day
wants to clarify some points which relate to the data
model promoted by their specification. The basic idea
is to compare their approach to managing content
with the approach promoted by competitors at
different levels. The following sections will clarify the
approach adopted to do this and give an overview of
the content developed in this report.
1.1 What is compared?
As explained, the purpose is to locate JCR in the
database world. This will be done by comparing
the relational model and the model promoted by
JCR. The relational model, defined by Codd in the
1970s, is currently the most widely used data model. The
unstructured or semi-structured model underlying
the JCR specification is enjoying growing success
in the content management area. These two models
will be described and analyzed in this report.
1.2 Why is it comparable?
Each data model embodies a philosophy for structuring
and accessing data. On the one hand, the success of
the relational model comes in large part from the
facilities it offers to describe clear data
structures. On the other hand, the success of the
JCR specification relates essentially to the facilities
it offers to express flexible data structures.
These aspects show that the discussion takes
place at the same level. Thus, it makes sense to
compare the two models, to clarify their respective
possibilities and limits, and to give a
clear picture of the philosophy promoted by each of them.
1.3 What is the purpose of this
comparison?
By making this comparison, Day wants to position
more precisely the data model, the specification
and the products which relate to JCR. Doing this
should help people to better understand the main
offerings available on the market and show when it
makes sense to use them.
More precisely, from an external perspective, the goal
is to give clear advice which can help
people to choose the approach which best fits
their needs. Some people are asking whether their
applications should be implemented with a relational
database or a Java content repository. Thus,
clarifying the philosophy promoted by each model
could help in making good decisions and in
understanding the impact of a choice made at the data
model level.
From an internal perspective, some questions relate
to how a Java content repository should be
implemented. Some companies build it on top of
relational databases while others provide
native implementations of the model. Should JCR be
seen as a data model or as an abstraction layer over
an existing data model? Answering this kind of
question can have a strong effect on the future
implementation of the products and also on the best
way to promote them.
1.4 How will it be compared?
First of all, the chapter “State of the art” will try to
give a snapshot of the main data models which have
been described and used during the last four
decades. This will be done with the purpose of
identifying the main influences which have led to the
current market environment. The goal is also to
understand why some data models have
been successful and why others have not.
Then the comparison between the relational
approach and the JCR approach will start. Because
the two approaches show big differences on four
different levels, these are the ones we will examine
and compare, thus avoiding unnecessary discussion
regarding incomparable aspects.
The chapter “Data model comparison” will be the first
level of comparison. In this chapter, the two models,
the relational model and the model used by JCR,
will be formally defined. This should help
the reader to understand the theoretical concepts
behind each model. The purpose of this chapter is
also to show the impact of these theoretical aspects
on real-world problems and to help people
understand more clearly why they should use one
approach instead of the other to solve their problem.
The chapter “specification comparison” will be the
second level of comparison. This chapter will leave
the theoretical point of view for a more practical
perspective. The SQL standard and the JCR
specification will be compared more precisely in this
chapter. This will allow us to show practically in
which context the concepts described in the “Data
model comparison” chapter make sense. Some differences
which relate more to the specification definition will
also be pointed out.
The chapter “Project process comparison” will be the
highest level addressed in this report. On the basis
of the previous chapters, it will discuss the
aspects and notable advantages of each approach which
can significantly influence the development process.
This discussion will try to clarify
parameters such as the efficiency reached with each
approach.
The chapter “Product comparison” will discuss the
impact of data models on the products. The
performance question constantly arises at the product
level. This chapter will try to address this question
with a theoretical cost analysis and a practical
benchmark.
The “Scenario analysis” chapter can be seen as a
synthesis of the main aspects pointed out during the
whole comparison process. Four test cases
characterized by different features will be analyzed
with regard to the significant aspects presented in this
report. The purpose is to set the foundations of a
framework which helps in choosing the best
approach through a quick requirements analysis.
Appendices are also included in this document. They
contain aspects which are not directly linked to the
comparison but which are interesting for the person
who would like to study the subject further.
2 State of the art
The need to separate information from
applications became clear in the 1960s, when many
applications had to access the same set of
information. This separation gave rise to new
concepts and new roles which relate to the activity of
managing information. This chapter will clarify the
main roles and the main domains of responsibility
linked to information management. Some of the
main approaches used to handle
information will also be presented. Basically, the
idea is to build a common language for the following
chapters.
2.1 Roles
People are generally involved in information systems
and data management. Three main roles can almost
always be distinguished when data models and
databases are mentioned:
• The database administrator (DBA) who maintains the database in a usable state.
• The application programmer who writes applications which may access databases.
• The user who uses applications to access, edit, and write data in the database.
Each role generally carries certain responsibilities.
The domains of responsibility come from
disciplines such as design, development or
security. Examples of domains are the
structure, the integrity, the availability or the
confidentiality of data. Choosing a data model
affects these roles by giving them
more or less responsibility.
2.2 Domains of responsibility
Figure 2.2-1 shows four main domains of
responsibility which will be mentioned regularly in
this report. This role/responsibility diagram tries to
convey the classical distribution of responsibilities which is generally
found when relational databases or similar
approaches are used to manage data.
Figure 2.2-1: Classical responsibility repartition
The WordNet semantic lexicon gives the following
definitions of the concepts identified as domains of
responsibility in Figure 2.2-1:
• Content: everything that is included in a collection and that is held or included in something
• Structure: the manner of construction of something and the arrangement of its parts
• Integrity: an undivided or unbroken completeness
• Coherence: logical and orderly and consistent relation of parts
Content and structure are relatively clear concepts.
However, in the context of this report, it makes
sense to be precise about the meaning given to
integrity and coherence. Integrity here
relates to the “state of completeness” of data which
always has to be ensured in the database. This state
is preserved with integrity rules at the database level.
Coherence relates to the logical organization of data
and to its quality. Coherence can be ensured with
constraints at the database level but also
programmatically at the application level. For several
reasons, incoherence can be tolerated for a time
in the database. This is not the case for integrity.
Choosing a data model affects the
distribution of responsibilities in different ways. This report
will try to detail this impact and show the
consequences of this kind of choice on the
different roles.
2.3 Data Models
A data model should be seen as a way to logically
organize, link and access content. Since the 1960s,
several data models have appeared and disappeared
for various reasons. This section gives a brief
overview of the history of the main data models and
of the reasons
for their success.
Hierarchical Model
In a hierarchical model (1), data is organized in tree
structures. Each record has one and only one parent
and can have zero or more children. A pure
hierarchical model allows only this kind of
relationship. If an entry appears in
several parts of the tree, it is simply
replicated. A rooted tree, that is a directed graph without cycles in which
every node except the root has exactly one parent, as
depicted in Figure 2.3-1, probably gives the best
representation of how entries are organized in this
model.
Figure 2.3-1: Tree graph
In general two types of entries are distinguished: the
root record type and the dependent record type.
The first type characterizes a record at level
zero of the hierarchy, which has no parent.
The second type characterizes all the
records which are located under the root record.
They are dependents in the sense that their lifetime
will never be longer than the lifetime of their parent.
In the hierarchical model, each record can generally
hold an arbitrary number of fields in which
data is stored.
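The properties described above (a single parent, any number of children, an arbitrary set of fields, and a lifetime bounded by the parent) can be sketched in a few lines of Python. This is only an illustration: the `Record` class and its method names are hypothetical and do not come from IMS or any other product.

```python
class Record:
    """A record in a toy hierarchical model (illustrative names only)."""

    def __init__(self, name, **fields):
        self.name = name
        self.fields = dict(fields)   # arbitrary number of fields
        self.parent = None           # a root record keeps parent = None
        self.children = []           # zero or more dependent records

    def add_child(self, child):
        # A dependent record has one and only one parent.
        if child.parent is not None:
            raise ValueError(f"{child.name} already has a parent")
        child.parent = self
        self.children.append(child)
        return child

    def delete(self):
        """Delete this record and all its dependents; return the removed names."""
        removed = []
        for child in list(self.children):
            removed.extend(child.delete())
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        removed.append(self.name)
        return removed


root = Record("company", label="Acme")
dept = root.add_child(Record("department", label="R&D"))
emp = dept.add_child(Record("employee", first_name="Ada"))

# Deleting the department also ends the lifetime of its dependents.
removed = dept.delete()
```

Deleting `dept` removes the employee record with it, which is the lifetime rule stated above.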
While some real problems have a tree-like structure,
the assumption that only this kind of
relationship governs the world is too strong. During
its history, the hierarchical model has probably
suffered from this. Some people have probably
abandoned it for models which seem to fit
reality better.
The main implementation of the hierarchical model
was developed by IBM in the 1960s. This database is called IMS,
which stands for Information Management System.
Today, IMS is still used in the industry for very large
scale applications. IBM sells it as a solution for
critical online applications. In fact, IBM continues to
invest in this product and to develop new releases.
Most directory services use concepts inherited
from the hierarchical model. Moreover,
traces of the model are visible on almost every
system: everybody uses hierarchical concepts to
organize files and folders. So every computer user is
more or less familiar with the hierarchical
organization of information.
Furthermore, during the last decade, the hierarchical
model has found new popularity with the
increasing use of formats such as XML or YAML. In
a web browser, the Document Object Model (DOM)
also uses a hierarchy of objects to organize the
elements of a web page. Thus, this model is not in
danger of disappearing. It will probably continue to
be successful in the future as well.
Network Model
Instead of limiting the organization of data to a
tree structure, the network model allows entries to be
linked to one another in any direction. A
directed graph, as shown in Figure 2.3-2, is probably
the best representation of
how data is structured in a network model. The
other properties of this model are shared with the
hierarchical model. Thus we can say that the
hierarchical model is a subset of the network model.
Figure 2.3-2: Network graph
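The claim that the hierarchical model is a subset of the network model can be illustrated with a small check. In the sketch below, a database is reduced to a list of directed (parent, child) links, which is an assumption made for the example; the function name `is_hierarchy` is hypothetical. Any such list is a valid network, but only some of them are also valid hierarchies.

```python
def is_hierarchy(edges):
    """Return True if the directed (parent, child) links form a single tree."""
    nodes = {node for edge in edges for node in edge}
    parents = {}
    for parent, child in edges:
        if child in parents:          # a second parent is only legal in a network
            return False
        parents[child] = parent
    roots = [node for node in nodes if node not in parents]
    if len(roots) != 1:               # a tree has exactly one root record
        return False
    for node in nodes:                # following parents must never loop
        seen = set()
        while node in parents:
            if node in seen:
                return False
            seen.add(node)
            node = parents[node]
    return True


tree = [("root", "a"), ("root", "b"), ("a", "c")]
network = tree + [("b", "c")]         # "c" now has two parents

tree_is_hierarchy = is_hierarchy(tree)
network_is_hierarchy = is_hierarchy(network)
```

The first structure is accepted by both models, while the second is only expressible in the network model.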
Initially developed during the 1970s to overcome the lack
of flexibility of the hierarchical model, the network
model was very successful during that
decade. This model has found hundreds of
applications in different fields of computer science
such as the management of in-memory objects or
bioinformatics applications. However, it seems that
few people currently use it to organize
their data. It still has a following in embedded
applications, whilst large-scale applications built on it
are slowly disappearing.
Relational Model
Before its formal definition by Codd during the 1970s, the
relational model (2) had not met with much
success. However, after this formal work based on
set theory and first-order logic, some
companies chose to implement the
model. IBM was one of the first companies to
take the lead in the market with the DB2 database.
Oracle is now the uncontested leader with its
implementations of the relational model.
The relational model defines the concepts of
relations, domains, tuples and attributes, which are
more often called tables, columns, rows and fields.
Interestingly, today this model is so widely taught
and used that its suitability for
specific cases is rarely questioned.
Figure 2.3-3: Relation, domain, tuple and attribute
Some people link the success of the relational model
to its mathematical foundation. However, the
implementations currently in use are a far cry from the
beautiful concepts defined at the beginning. The
main building blocks are now hidden by features
which are provided to address practical
requirements.
Thus, the success of this data model should be
linked to the practical answers it has
given to problems encountered in the business
world during the 1980s and the 1990s (3). The
normalization principle was used to save storage
space. Furthermore, during this period, information
systems were widely used for automation and
monitoring tasks. The relational model offered a
very good canvas to express and solve problems
such as these.
3 Data model comparison
This chapter will define more clearly the JCR model
and the relational model. Several aspects which
relate to the models’ foundations will be presented
and compared. The main purpose of this chapter is to
understand the philosophy or basis of each model.
The “Model definitions” section briefly presents the
main ideas underlying the models. The “Structure”
and “Integrity” sections will mainly discuss the
respective places of the
content, the structure and the semantics in both data
models. The “Operations and queries” and
“Navigation” sections will show different ways
to retrieve and edit content. Throughout the
chapter, particular attention will be given to the
impact of the choice of data
model and to the reasons which should drive this
choice.
3.1 Model Definitions
Several works and references give definitions of the
different data models currently in use (4) (5). Some
tools are also available to understand the main
concepts of these models. The purpose of this
section is not to enrich these definitions; they are
included simply to draw attention to some theoretical
aspects required in order to build a common
language for the comparison.
JCR Model
To organize records, this model includes concepts
inherited from the hierarchical and from the network
model. Thus, as shown in Figure 3.1-1, records
stored with the JCR data model are primarily
organized in a tree structure. However, the limitations
of the hierarchical model are avoided by giving the
ability to link records horizontally. Attributes
which point to other nodes can be stored at each
level to create network relationships. This type of
model permits the creation of a network within a
tree structure.
Figure 3.1-1: JCR graph
Currently, some explanation of the schema which
relates to the data model definition can be found in
the specification (4) (5). Figure 3.1-2, based on
this information, attempts to express the JCR data model more
formally. It is interesting to note that at
this stage, no differentiation between the content and
the structure can be made. In fact the structure
appears with the instantiation of items.
Figure 3.1-2: JCR class diagram
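To make the idea of a network within a tree more concrete, the following Python sketch models nodes which hold properties and child nodes, with a property that may point to another node. It is a toy stand-in written for this illustration, not the JCR API: the `Node` class, its methods and the sample content are all assumptions.

```python
class Node:
    """Toy stand-in for a repository node: properties plus named child nodes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.properties = {}   # property name -> value, or a Node for a reference
        self.children = {}     # child name -> Node

    def add_node(self, name):
        child = Node(name, parent=self)
        self.children[name] = child
        return child

    def set_property(self, name, value):
        self.properties[name] = value

    @property
    def path(self):
        # The tree gives every node a unique path from the root.
        if self.parent is None:
            return "/"
        prefix = self.parent.path
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name


root = Node("")
authors = root.add_node("authors")
articles = root.add_node("articles")
alice = authors.add_node("alice")
post = articles.add_node("hello-world")

post.set_property("title", "Hello world")
post.set_property("author", alice)    # horizontal link across the tree

post_path = post.path
author_path = post.properties["author"].path
```

No schema is declared anywhere in this sketch: the structure (two branches, one cross-reference) exists only because the items were instantiated.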
Relational Model
The relational model, which was briefly introduced in
the “State of the art” chapter, is based on set
theory. A relation as defined by Codd (2) refers
to the mathematical concept of relation. In
his paper, he gives the following definition of a
relation:
R is a subset of the Cartesian product S1 × S2 × … × Sn
Practically, because all these sets have to be
distinguished from one another, they are identified as
domains. Thus, assuming the domains of first names
F, of last names L and of ages A, a Person relation is
a set of tuples (f, l, a) where f ∈ F, l ∈ L and a ∈ A.
Figure 3.1-3 represents a table view of this
relation. In this representation, each domain
corresponds to a column and each tuple to a row.
Figure 3.1-3: Relation, domain, tuple and attribute
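The definition can be checked mechanically on a small example. The sketch below builds the Cartesian product F × L × A for three tiny domains and verifies that a Person relation is a subset of it. The domain values and the helper name `is_relation` are made up for the illustration.

```python
from itertools import product

# Three small, made-up domains.
first_names = {"Ada", "Alan"}
last_names = {"Lovelace", "Turing"}
ages = {36, 41}


def is_relation(tuples, *domains):
    """True if `tuples` is a subset of the Cartesian product of `domains`."""
    return set(tuples) <= set(product(*domains))


cartesian = set(product(first_names, last_names, ages))

# A Person relation: each tuple (f, l, a) takes one value from each domain.
person = {("Ada", "Lovelace", 36), ("Alan", "Turing", 41)}
person_ok = is_relation(person, first_names, last_names, ages)

# A tuple with values outside the domains is not part of any relation on them.
stranger_ok = is_relation({("Grace", "Hopper", 85)}, first_names, last_names, ages)
```

Here the product contains 2 × 2 × 2 = 8 tuples, of which the Person relation selects two.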
This basic definition does not mention the ability to
create associations between relations. In fact there is
no link between the name of the model and
associations. The ability to express associations
comes later with the join operations defined by
relational algebra. These operations will be
introduced in the next sections.
Figure 3.1-4 shows a class diagram which could
be used to express relations. While the pertinence of
this kind of diagram can be discussed, the purpose is
to give a simple and visual representation of a relation.
Furthermore, parts derived from this diagram will be
reused later to express the intersections between the
relational model and the JCR model.
Figure 3.1-4: Relational class diagram
3.2 Structure
A rich debate around the respective places of data
and structure in data models has been ongoing for
several years both on the web (6) and in academic
circles (3). This debate could be summarized as
follows: should data be driven by the structure or
should the structure be driven by data?
These discussions come from the fact that some
concepts do not really fit into a predefined canvas. A
predefined canvas offers many advantages
and facilities. For example, it is easier to express
integrity constraints on a well-known structure.
Equally, indexing and query optimization (7) can
benefit from the assumption that a clear structure can
always be found for a problem. However, in real-life
situations, there is always an exception which does
not conform to the canvas.
The following sections will situate the two models
in this context. For each approach,
the respective places of the data and the structure will be
presented. When each
strategy makes sense and when it does not will
also be clarified.
JCR model
In Figure 3.1-2, a class diagram shows the main
aspects of the JCR data model. In this figure, the
instantiation of nodes, properties and values leads
to the creation of content. If we try to identify the
structure’s place in this diagram, it appears that no
real differentiation is made between the content itself
and its structure.
Thus, the model proposed by JCR does not require
the definition of a structure to instantiate content.
Instances of nodes, properties and values can be
created before defining any kind of structure. In fact,
the structure appears with the content.
A parallel can be drawn between this approach
and the semi-structured approach described at
the end of the 1990s (8), in which no separation is made
between data and their structure. This provides two
possible advantages: firstly, a dynamic schema which can
store data that does not fit into a predefined canvas,
and secondly, the ability to browse the content without
knowing its structure.
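The second advantage, browsing content without knowing its structure, can be sketched with nested Python dictionaries standing in for semi-structured content. The `walk` function and the sample content are hypothetical; the point is only that the traversal discovers the structure while reading the data.

```python
def walk(content, path=""):
    """Yield (path, value) for every leaf, discovering the structure on the way."""
    for key, value in content.items():
        child_path = f"{path}/{key}"
        if isinstance(value, dict):
            yield from walk(value, child_path)
        else:
            yield child_path, value


# Two entries of the same kind which do not share the same properties.
content = {
    "articles": {
        "hello-world": {"title": "Hello world", "tags": ["intro"]},
        "jcr-notes": {"title": "Notes", "rating": 4},
    }
}

leaves = dict(walk(content))

# The set of property names is derived from the content, not declared up front.
property_names = {path.rsplit("/", 1)[1] for path in leaves}
```

The traversal needs no prior knowledge of which properties an article may carry, so the `rating` property of the second entry is found even though the first entry has none.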
Some modern programming languages such as
Ruby or Python also give the ability to extend objects
on the fly with properties and functions (reflection).
While a part of the structure appears at runtime, it is
possible to define semantics which identify the
main concepts. In JCR this is done with node types.
Basically, defining semantics does not limit the
capacity of a node to store an arbitrary combination of
sub-nodes and properties. Proceeding in this manner
allows records to be created or to evolve when
and as required.
For example, if we want to define a semantic item for
media, there is no real need to take into account all
the possible properties which could appear under this node during
the application life cycle. Each
special case of media item, such as images, videos,
etc., can have specific attributes which do not
affect the whole set of media instances and
which do not necessarily have to be specified at
design time.
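The media example can be sketched as a type which names the concept and a small set of expected properties, while leaving every item free to carry more. The `NodeType` class below is a hypothetical simplification written for this illustration and is not the node type mechanism of the JCR specification; the choice of `title` as the only mandatory property is also an assumption.

```python
class NodeType:
    """A named concept with a few mandatory properties; anything else is allowed."""

    def __init__(self, name, mandatory):
        self.name = name
        self.mandatory = frozenset(mandatory)

    def missing(self, properties):
        """Return the mandatory property names absent from `properties`."""
        return sorted(self.mandatory - set(properties))


media = NodeType("media", {"title"})

# Special cases carry their own attributes without any change to the type.
image = {"title": "Logo", "width": 640, "height": 480}
video = {"title": "Demo", "duration": 93, "codec": "h264"}
untitled = {"width": 10}

image_missing = media.missing(image)
video_missing = media.missing(video)
untitled_missing = media.missing(untitled)
```

Both the image and the video are accepted as media although their specific attributes were never declared, whereas the item without a title is reported as incomplete.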
Relational model
Figure 3.1-4 represents a basic class diagram
describing succinctly the main ideas proposed by the
relational model. We see in this diagram that the
concept of record, which is represented by the
Element class, is separated from the structure.
Note that the paradigm of the relational model is
completely different from the one proposed by JCR. A
structure made of relations and domains has to be
instantiated first. Then, tuples which fit into this structure
can be created. While the DBA can choose the level
of flexibility of the initial structure, it appears that this
kind of model differentiates between the data and its
schema.
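The order imposed by the relational paradigm (structure first, tuples second) can be observed with any SQL database. The sketch below uses the `sqlite3` module from the Python standard library with an in-memory database; the `person` table and its columns are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1. The structure has to be instantiated first.
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT, age INTEGER)")

# 2. Tuples which fit into this structure can then be created.
conn.execute("INSERT INTO person VALUES (?, ?, ?)", ("Ada", "Lovelace", 36))

# 3. A tuple which does not fit into the structure is rejected.
try:
    conn.execute(
        "INSERT INTO person (first_name, nickname) VALUES (?, ?)", ("Alan", "Prof")
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# 4. Storing the new attribute requires changing the schema beforehand.
conn.execute("ALTER TABLE person ADD COLUMN nickname TEXT")
conn.execute(
    "INSERT INTO person (first_name, nickname) VALUES (?, ?)", ("Alan", "Prof")
)

rows = conn.execute(
    "SELECT first_name, nickname FROM person ORDER BY first_name"
).fetchall()
```

The schema change in step 4 affects every tuple of the relation: the existing row for Ada now also has a `nickname` attribute, which is simply empty.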
Differentiating the structure from the data can bring
some benefits. For example, it is
appropriate for a problem-solving approach rather
than a data storage approach. This is evident from the fact that
many developers create an entity-relationship
model during the early phases of defining data
requirements.
However, in real-life situations the assumption that
content and structure can be completely separated is
not always valid. For example, to handle expansion
in a relational database, artificial constructs or
miscellaneous fields are often added to the
relational structure. These can
take the form of fields added to create hierarchies or
fields added to define customized orders in a set of
tuples. These conceptual entities can become
difficult to describe within the confines of the
structure. As the application evolves and new
requirements are added, the management of the
additions can become difficult and dangerous. A
change could even imply a rethink of the whole
structure of the implementation.
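The two kinds of added fields mentioned above can be shown on a small table. In the sketch below, which uses `sqlite3` from the Python standard library, the `page` table, the `parent_id` and `position` columns and the `children` helper are all hypothetical names chosen for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        parent_id INTEGER REFERENCES page(id),  -- added only to encode the hierarchy
        position  INTEGER                       -- added only to encode sibling order
    )
    """
)
conn.executemany(
    "INSERT INTO page (id, title, parent_id, position) VALUES (?, ?, ?, ?)",
    [
        (1, "Home", None, 0),
        (2, "Products", 1, 1),
        (3, "About", 1, 0),
        (4, "Team", 3, 0),
    ],
)


def children(parent_id):
    """Return the titles of the child pages in their customized order."""
    rows = conn.execute(
        "SELECT title FROM page WHERE parent_id = ? ORDER BY position",
        (parent_id,),
    )
    return [title for (title,) in rows]


home_children = children(1)
about_children = children(3)
```

Neither `parent_id` nor `position` describes a page as such. Both exist only to carry the tree and the ordering, and it falls to the application to keep them meaningful when pages are moved or removed.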
Content, structure and responsibility
As shown in the “State of the art” chapter, in classical
situations, the DBA is generally responsible for the
data structure. The application programmer can
influence decisions made in this area but does not
have the final responsibility for the structure. Finally,
the user clearly has no say: his scope is
limited by the functionalities developed by the
application programmer to create, remove and
update data.
As shown in Figure 3.2-1, choosing a content-driven
approach instead of a structure-driven approach
significantly affects the respective roles of the DBA,
the application programmer and the user. In fact the
DBA loses his responsibility as main owner of the structure.
If the structure is driven by data, this ownership is
shared with the application programmer and the
user.
Figure 3.2-1: Responsibility repartition revisited
It is true that a clear separation between the content
and the structure makes some aspects of data
management easier. Clearly splitting the structure
and the content makes it easier to define roles and to
separate duties. The DBA has the ownership of
the database and of all the structures which allow
records to be instantiated. In this context, the application
programmer becomes a kind of super user with
extended rights but the user may only access what is
available in the application.
This kind of scenario gives a lot of responsibility
to the DBA and places him at the centre of database
evolution. Unfortunately he is not necessarily attuned
to the real needs of the users. It would therefore
be advisable for the DBA to be responsible for
aspects such as the integrity, the availability or the
recoverability of data rather than for the structure or the
content. In general these should be left to the joint
definition of the application programmer and the
user.
Choosing the right approach
In a real working environment, some problems
benefit from being driven by a structure whereas
others clearly do not fit into any predefined
structure. A simple analogy may help to explain this
complicated situation. For example, houses are rarely
built from scratch without blueprints. However, if we
take the scope of cities, there are generally no
blueprints which plan their final states. So which
lessons can we learn from this simple example? Are
complex problems driven by data instead of by
structure? Not necessarily.
In the example of the house and of the city the
problem could be seen as follows. For houses,
because budgets and resources available are
generally known in advance, the most effective way
to proceed is to define a structure before the
construction. For cities, because resources and
budgets available are generally not known in
advance and are evolving, the most effective way to
proceed is to let their structure emerge. If necessary,
guidelines can be defined to control their growth.
Since information system problems involve a wide
and growing community of stakeholders, and
providers cannot know what will be done with their
applications, these kinds of questions should be
debated at the outset of the design:
• Are the users known or not?
• Is the behavior of the users known or not?
• Is the final usage of the application known or not?
• Do the entities fit into a canvas or not?
The answers to these questions are probably one of
the best indicators when deciding between the
two approaches.
The JCR model clearly advocates a structure
driven by data. By creating content, items, nodes and
properties, users are building the structure. Database
administrators and application programmers are just
guiding this structure by defining rules and
constraints. In implementations based on a
relational approach, a structure is first defined by the
database administrator and the application
programmer. Then the users can register content
items which fit into this structure.
Depending on the use case, either data model can
be useful. It depends basically on the perspective from which
we wish to view the data: a fixed structure or a more
flexible data-driven model. The choice of model will
be based on the certainty or uncertainty of the
answers to the few decisive questions
listed above.
3.3 Integrity
A strong association between structure and data
integrity is often made. Thus some people are afraid
of letting their users take part in the definition of the
structure. However, it is more correct to say that data
integrity belongs to semantics.
Generally, integrity definitions do not make any
mention of the structure. A structure made of
relations and domains is evidently an elegant way to
express a semantic. It’s also a good basis in which to
declare integrity constraints. Nonetheless integrity
constraints can be defined at a lower level, directly
over a semantic. Advantages could be for example
that all the structures which respect the semantic
constraints can be instantiated in the database and
not only the records which fit into the structure.
Furthermore, as mentioned in the ―state of the art‖
chapter, integrity definitions generally do not make
mention of coherency. In the database environment,
an amalgam is often made between these two
concepts. While data coherency can be preserved by
integrity constraints, the integrity of a dataset is not
necessary lost if incoherent records are present in
the database.
Unquestionably, data integrity means that no
accidental or intentional destruction, alteration, or
loss of data should ever occur. While data integrity
should be ensured at all times during a database’s
lifecycle, the assumption that data coherency should
have the same property is probably too strong.
Some people are in the habit of handling both aspects
directly in the database, expressing everything which
relates to data coherency as integrity constraints. This
ensures that data coherency is preserved in all
cases. However, it also has a cost in terms of
performance, because checks have to be
performed each time a write access is made on the
database. Therefore a tradeoff has to be made
between data integrity and data coherency.
A balanced approach, which can result in a better
user experience, consists in identifying, sometimes
arbitrarily, what relates to integrity and what relates
to coherency. Data integrity is then handled with
constraints at the database level. Data coherency is
handled programmatically at the application level, in
a way which lightens the work load of the system.
JCR Model
An analogy can be made between the JCR model
and a black list. The most generic node accepts any
kind of children, any kind of properties and any kind
of values. A mechanism is provided through the
concept of node-type to let the DBA define integrity
constraints.
In the JCR model, node-types are used to express a
semantic. Declaring constraints on this semantic
allows the declaration of restrictions on the nodes
and on their content. Each node has a primary node-
type and can have several mixin node-types which
extend the primary node-type. Node-types allow for
specifying constraints on the children of a node, on
the properties of a node and on the values of the
properties stored by a node.
Figure 3.3-1: JCR model and integrity
Using several node-types makes it possible to
ensure the integrity of transitive relations in a
hierarchy. For example, it is possible to define a
node-type which supports only children of a specific
type. The latter could also have node-types which
declare constraints for their children. Proceeding in
this fashion makes it possible to require, for example,
that the children of the children of a specific
node have a certain type.
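The chaining of child constraints described above
can be sketched in a few lines. This is a toy model
written in Python, not the JCR API: the names
NodeType, Node and add_child, as well as the type
names (collection, book, chapter), are hypothetical
and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class NodeType:
    """Hypothetical stand-in for a node-type: a name plus the child types it accepts."""
    name: str
    # None means "any child type is accepted" (the most generic node).
    allowed_children: Optional[frozenset] = None


@dataclass
class Node:
    name: str
    node_type: NodeType
    children: list = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        allowed = self.node_type.allowed_children
        if allowed is not None and child.node_type.name not in allowed:
            raise ValueError(
                f"{self.node_type.name} does not accept children "
                f"of type {child.node_type.name}"
            )
        self.children.append(child)
        return child


# Each type only restricts its own children; chained together, the
# restrictions also determine what the children of the children may be.
GENERIC = NodeType("generic")
CHAPTER = NodeType("chapter", frozenset())            # accepts no children
BOOK = NodeType("book", frozenset({"chapter"}))
COLLECTION = NodeType("collection", frozenset({"collection", "book"}))

catalogue = Node("catalogue", COLLECTION)
novel = catalogue.add_child(Node("novel", BOOK))
novel.add_child(Node("chapter-1", CHAPTER))
```

Adding a chapter directly under the collection raises
an error, while a node of the generic type accepts
anything, which matches the black list analogy.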
When integrity is mentioned, we often speak about
entity integrity, referential integrity and domain
integrity. These concepts relate closely to the
relational model but as shown in Figure 3.3-1 we can
find similar ways to express constraints in the JCR
model.
Entity integrity is ensured by the fact that
each node is unique and identified by its location in
the data model or by its UUID. Paths cannot really be
considered as unique identifiers because same-name
siblings are allowed for XML compatibility. Referential
integrity is ensured by the fact that all the reference
properties of a node have to point to a referenceable
node. Furthermore, a referenceable node cannot be
deleted while it is referenced. Domain integrity can
be ensured by forcing nodes to have specific
properties which contain values in predefined ranges.
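The referential rule can be illustrated with a small
in-memory store. It is a sketch under stated
assumptions: Repository, add_reference and
remove_node are hypothetical names, not calls of the
JCR API, and identifiers are plain random UUID
strings.

```python
import uuid


class ReferentialIntegrityError(Exception):
    """Raised when an operation would leave a dangling reference."""


class Repository:
    """Toy in-memory store; not the JCR API."""

    def __init__(self):
        self.nodes = {}        # identifier -> dict of properties
        self.references = {}   # source identifier -> set of target identifiers

    def add_node(self, **properties):
        identifier = str(uuid.uuid4())   # every node gets its own identifier
        self.nodes[identifier] = properties
        self.references[identifier] = set()
        return identifier

    def add_reference(self, source, target):
        # A reference has to point to a node that exists.
        if target not in self.nodes:
            raise ReferentialIntegrityError("target does not exist")
        self.references[source].add(target)

    def remove_node(self, identifier):
        # A node cannot be removed while another node still references it.
        for source, targets in self.references.items():
            if source != identifier and identifier in targets:
                raise ReferentialIntegrityError("node is still referenced")
        del self.nodes[identifier]
        del self.references[identifier]
```

Removing the referencing node first and the
referenced node second succeeds; the reverse order
is refused.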
Data coherence can be checked with integrity
constraints but the model does not provide all the
tools to do a complete coherency check. This proves
that making a separation between the two areas is
beneficial. Integrity should be ensured at the data
model level and data coherency at the application
level.
Relational Model
An analogy between the relational model and a
white list is appropriate. As explained in the last
section, the relational approach makes the
assumption that structure and content have to be
separated. Thus saving content is allowed only if a
structure has been defined. Some integrity
constraints are implicit in the relational structure. The
domain constraints ensure, for example, that all the
values stored in the same domain have the same type.
The entity integrity constraints guarantee that,
thanks to the primary key, all records in a table are
unique.
Furthermore, the structure is generally taken as a
base on which to declare other integrity constraints.
Referential integrity ensures that the values of a
foreign key are a subset of the values of the domain
it points to. In the same way, other integrity
constraints which make use of the operations
proposed by the model can be described.
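A minimal sketch of entity and referential integrity
in practice, assuming SQLite (through the sqlite3
module of the Python standard library) as the
relational engine. The table and column names are
hypothetical, and try_insert is a helper introduced
only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE book (
        isbn  TEXT PRIMARY KEY,                        -- entity integrity
        title TEXT NOT NULL
    );
    CREATE TABLE orderline (
        id   INTEGER PRIMARY KEY,
        isbn TEXT NOT NULL REFERENCES book(isbn)       -- referential integrity
    );
""")
conn.execute("INSERT INTO book VALUES ('isbn-1', 'First book')")


def try_insert(sql):
    """Return True if the statement is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("INSERT INTO orderline (isbn) VALUES ('isbn-1')")
duplicate = try_insert("INSERT INTO book VALUES ('isbn-1', 'Duplicate')")
dangling = try_insert("INSERT INTO orderline (isbn) VALUES ('missing')")
```

Only the first insert is accepted: the white list
rejects the duplicate key and the order line which
points to a book that does not exist.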
Figure 3.3-2: Relational model and integrity
A structure which is known in advance and whose
evolution is controlled is an elegant base on which to
ensure integrity. The syntaxes which permit the
expression of integrity constraints are generally
derived from first-order logic. The fact that the main
building blocks of the relational model are based on
well-known mathematical disciplines, namely set
theory and first-order logic, permits the expression of
implementation models which share these
mathematical properties.
In terms of data integrity, this provides advantages
because the solidity of the implementation model can
be mathematically proven. Thanks to its simplicity,
this way of proceeding also makes it possible to
declare rules and constraints for nearly everything
with short statements. As a result, solid
implementation models can be quickly declared with
a high level of accuracy and a minimum of
programming effort.
However, as mentioned before, the assumption that
each problem can fit in a predefined structure is often
too strong. Furthermore, while the relational model
has the ability to express hierarchies and network
structures, first-order logic is limited when
constraints have to be declared on them. In
conclusion, it’s often difficult to know what should be
managed at the model level and what at the
application level.
Integrity, coherency and responsibility
In general, DBAs tend to declare very
strong structures. Their implementation models are
thought of as white lists which preserve data integrity
and data coherence. However, to build generalized
and flexible implementation models, only data
integrity should be constrained at the model level.
Furthermore, the argument that data integrity and
data coherency should both be the responsibility of
the DBA reflects neither the reality nor the ideal: all
the checks made at the application level to ensure
that users do not inject invalid content into the data
testify to this.
Figure 3.3-3: Responsibility repartition revisited
Clarifying how the responsibility for such checks is
divided would therefore be of enormous benefit. It
would help in giving reasons for choosing a given
model. It would also help to identify shortcuts taken
on aspects of data integrity and to avoid this sort of
pitfall. Furthermore, clearly dividing the
responsibility for integrity and for coherence
could make it easier to design applications which
take into account the cost of the checks made at the
data model level.
Choosing the right approach
The argument that the relational model has
mathematical properties (2) which ensure rock-solid
data integrity is often invoked for the wrong
reasons. In fact these properties are only used for
very specific applications, and the integrity of an
implementation model as understood here is rarely
proven mathematically because it is not a
requirement.
The choice of the best approach should be made
with regard to the responsibility given to the DBA and
to the application programmer. The following two
examples illustrate this idea. On the one hand, a
prison guard must control all the movements of
the people in the prison during the day. In this case,
a rock-solid program conceived as a white list is
ideal. The people may only do the things that they
are allowed to do. On the other hand, a tourist guide
has to ensure that the travelers have a good trip by
directing them and giving them the right information.
In this case, a program conceived as a black list will
probably give more satisfaction to the user.
Some functional cases do not benefit from being
governed by a lot of constraints. Unfortunately, the
relational model often leads DBAs and application
programmers to design restrictive implementation
models. This gives them the feeling that their
application is well thought out, but often it only
frustrates the users.
The following questions should be honestly asked:
Do users have to be guarded or guided?
Does data coherency have to be preserved
at a database level or at an application level?
Therefore choosing the right data model is not only a
question of preference; it should always be based on
an analysis of the use case.
3.4 Operations and queries
Query languages are close to fields such as relational
algebra, first-order logic or simply mathematics.
Depending on the case, queries can be expressed
with declarative calls or with procedural languages.
In general, queries are composed of several
operations which make use of the structure or of the
data semantic.
The operations which can be used in queries, such as
selection, projection, rename and the other set
operations, are inherited from the disciplines
mentioned at the beginning of the section. In addition
to these operations, some query languages provide
statements which allow creating, modifying or
deleting data. This section clarifies the bounds of
each model in terms of queries and operations.
JCR Model
An abstract query model is used as a basis to
retrieve data in the JCR Model (4) (5). This query
model establishes a kind of mapping between the JCR
model and the notions of relations, domains, tuples
and attributes present in the relational model.
Figure 3.4-1 is a modified version of Figure 3.1-4
which shows this mapping visually.
Figure 3.4-1: JCR model, operations and queries
It seems that, in its current state, node-tuples are
seen as relations, properties as domains, nodes as
tuples and values as attributes. Basically, node-tuples
are arbitrary sets of nodes. However, node-types are
used as the main source of node-tuples in queries.
While this kind of mapping cannot be considered
an application of the principles of set theory, it
allows the running of some interesting queries which
can satisfy nearly all requirements.
The operations provided by this query model are
selection and the set operations which permit
joins between node-tuples. The result of a query is
composed of all the nodes which satisfy the selection
condition and the join condition.
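The two operations can be pictured with plain Python
data. This is only an illustration of the idea, not the
abstract query model of the specification: nodes are
modelled as dictionaries, and node_tuples, select and
join are hypothetical helper names. The sample paths
and properties are invented for the example.

```python
# Toy data: each node is a dictionary; "type" plays the role of the node-type.
NODES = [
    {"path": "/books/b1", "type": "editor:book", "title": "First"},
    {"path": "/books/b2", "type": "editor:book", "title": "Second"},
    {"path": "/orders/o1/l1", "type": "editor:orderline",
     "product": "/books/b1", "quantity": 2},
]


def node_tuples(node_type):
    """Source of a query: all the nodes which have a given node-type."""
    return [n for n in NODES if n["type"] == node_type]


def select(source, predicate):
    """Selection: keep the nodes which satisfy the condition."""
    return [n for n in source if predicate(n)]


def join(left, right, condition):
    """Join: keep the pairs of nodes which satisfy the join condition."""
    return [(l, r) for l in left for r in right if condition(l, r)]


books = node_tuples("editor:book")
lines = node_tuples("editor:orderline")
# Books which have been ordered at least once.
ordered = join(books, lines, lambda b, l: b["path"] == l["product"])
# Order lines with a quantity greater than one.
big_lines = select(lines, lambda l: l["quantity"] > 1)
```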
Basically, in the JCR model, queries are seen as a
way to perform search requests. They provide a way
of retrieving records, but the selection criteria
cannot be used to delete or update these records
sequentially. This limitation is not dictated by
conceptual barriers; it could be lifted if required.
As mentioned before, the structure and the schema
are not separated in this model. Thus, some
attributes of the records, such as their depth level or
their hierarchical path, can be viewed as properties.
This makes it easy to perform queries on
things which are generally not taken into account in
other models, such as transitive relationships in
hierarchies.
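Because the path is available as a property, a
transitive question such as "all the descendants of
a node" reduces to a comparison of strings. The
sketch below assumes paths are plain strings
separated by "/"; depth and descendants are
hypothetical helpers and the paths are invented.

```python
PATHS = [
    "/",
    "/collections",
    "/collections/science",
    "/collections/science/physics",
    "/collections/fiction",
    "/orders",
]


def depth(path):
    """Depth level of a node, the root being at level 0."""
    return 0 if path == "/" else path.count("/")


def descendants(paths, ancestor):
    """All paths located below the ancestor, at any depth."""
    prefix = ancestor.rstrip("/") + "/"
    return [p for p in paths if p.startswith(prefix) and p != ancestor]


below_collections = descendants(PATHS, "/collections")
```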
Relational Model
Relational algebra defines the primitive
operations available in the relational model (9).
These operations are mainly selection,
projection, rename, Cartesian product,
union and difference. The power of this query
model lies in the fact that the input and the output of
these operations are always relations. Thus, it’s
possible to express complex statements and
nested expressions.
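The closure property can be made concrete with a
small sketch in which a relation is a Python set of
tuples, each tuple being a frozenset of (attribute,
value) pairs. The representation and the function
names are assumptions made for this example; union
and difference are simply the set operators | and -.

```python
def relation(rows):
    """Build a relation from dictionaries: a set of tuples without duplicates."""
    return {frozenset(row.items()) for row in rows}


def select(rel, predicate):
    return {t for t in rel if predicate(dict(t))}


def project(rel, attributes):
    return {frozenset((a, v) for a, v in t if a in attributes) for t in rel}


def rename(rel, old, new):
    return {frozenset((new if a == old else a, v) for a, v in t) for t in rel}


def product(left, right):
    # Assumes that the two relations have disjoint attribute names.
    return {l | r for l in left for r in right}


books = relation([{"isbn": "1", "title": "First"},
                  {"isbn": "2", "title": "Second"}])
lines = relation([{"product": "1", "quantity": 2},
                  {"product": "1", "quantity": 5}])

# Every operation returns a relation, so the calls can be nested freely.
ordered = project(
    select(product(books, lines), lambda t: t["isbn"] == t["product"]),
    {"title"},
)
never_ordered = project(books, {"title"}) - ordered
```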
In addition to these operations, some mathematical
operators can be used. It’s also possible to specify
additional domains for the output relation. Some
domain operations are also provided to retrieve
information, for example the number of attributes
stored in a domain or the domain’s maximal value.
The query languages which are provided by
relational database implementations generally
propose statements which allow modifying, creating
or deleting data (10). Used in conjunction with the
previously presented operations, these statements
become very useful. They provide a means of
performing sequential changes on data sets which
satisfy precise conditions.
The possibilities given by the usage of these
operations are huge. However, limitations are
encountered when transitive relationships appear
(11). This sort of query cannot be expressed with
first-order logic statements. For example, it is not
possible to define a query which retrieves all of the
descendants of an element. Some other solutions are
available (12). They do however often add
complexity to the implementation models.
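One workaround found in later SQL standards is the
recursive common table expression introduced with
SQL:1999. It goes beyond first-order logic and is not
part of the SQL92 baseline used in this paper, and it
is not necessarily the solution which the cited
reference has in mind. The sketch below assumes
SQLite through the Python standard library; the
collection table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collection (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        parent_id INTEGER REFERENCES collection(id)  -- each row points to its parent
    );
    INSERT INTO collection VALUES
        (1, 'catalogue', NULL),
        (2, 'science',   1),
        (3, 'physics',   2),
        (4, 'fiction',   1);
""")


def descendants(root_id):
    """Names of all the collections located below root_id, at any depth."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id, name) AS (
            SELECT id, name FROM collection WHERE parent_id = ?
            UNION ALL
            SELECT c.id, c.name
            FROM collection c JOIN subtree s ON c.parent_id = s.id
        )
        SELECT name FROM subtree ORDER BY name
    """, (root_id,)).fetchall()
    return [name for (name,) in rows]


below_catalogue = descendants(1)
```

The query works, but the hierarchy has to be encoded
by hand in a parent column and unfolded again at
each read, which illustrates the added complexity
mentioned above.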
Choosing the right approach
While JCR provides a means of carrying out some
operations and queries, the relational model is clearly
more complete in this area. In some situations, this
completeness can become a decision criterion if the
use case implies that complex join operations may
be required.
The ability, offered by most relational databases, to
use these operations in conjunction with update and
delete statements is also a significant advantage of
the relational model. For use cases which involve a
lot of write access, this possibility allows for quick
creation, update and deletion of content. However,
caution should be taken with this type of usage when
complex hierarchies are present.
3.5 Navigation
In the 1970s, Charles W. Bachman described
different ways of accessing records in databases
(13). Focusing on the programmer’s role, he
describes the programmer’s opportunities to access
data as follows:
1. He can start at the beginning of the
database, or at any known record, and
sequentially access the "next" record in the
database until he reaches a record of
interest or reaches the end.
2. He can enter the database with a database
key that provides direct access to the
physical location of a record. (A database
key is the permanent virtual memory address
assigned to a record at the time that it was
created.)
3. He can enter the database in accordance
with the value of a primary data key. (Either
the indexed sequential or randomized
access techniques will yield the same result.)
4. He can enter the database with a secondary
data key value and sequentially access all
records having that particular data value for
the field.
5. He can start from the owner of a set and
sequentially access all the member records.
(This is equivalent to converting a primary
data key into a secondary data key.)
6. He can start with any member record of a set
and access either the next or prior member
of that set.
7. He can start from any member of a set and
access the owner of the set, thus converting
a secondary data key into a primary data
key.
These rules give the programmer the ability to cross
datasets by following the references which
structure the records. The interesting point about this
approach is that the programmer can adopt access
strategies without knowing the whole structure of the
database. As a navigator, he explores the database.
Figure 3.5-1: Navigation path
The rules defined by Charles W. Bachman can be
implemented as procedural calls made over an API
or as declarative statements. The main difference
between the queries mentioned in the previous
section and the navigation principles defined here
is the following. Queries are built over the semantic
or over the structure of the data model. Navigation is
independent of the semantic or of the structure and
directly uses the content. Thus, in our context,
XQuery and XPath should be considered as
navigational languages because they use the content
to navigate in XML files.
JCR Model
In the JCR Model, each record stores properties
which relate to the location of the item in the
database. The level, the path and, under certain
conditions, the unique identifier are good examples
of these specific properties. Nearly all the rules
mentioned before are included in the model and allow
for navigation through the database with different
types of strategies.
The root node can be seen as the beginning of the
database. As mentioned in the first rule, it gives the
ability to sequentially access all the sub-nodes. The
path and the unique identifier properties allow
navigating in a way which respects the second, the
third, and the fourth rules by giving specific entry
points for specific situations. The node types and the
parent nodes can be seen as set owners and thus
allow for the navigation of the database in ways
which respect the fifth, sixth and seventh rules.
These possibilities offered by the JCR Model (4) (5)
give the programmer a lot of flexibility. He is really
able to navigate through the data and adopt
strategies which will allow him to find data in
structures that are unfamiliar.
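These access strategies can be sketched with a small
tree. This is a toy model, not the JCR API: the Node
class, walk, next_sibling and by_path are
hypothetical names, and the mapping of each helper
to a rule follows the reading given above.

```python
import uuid


class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # rule 7: from a member to the owner of its set
        self.children = []            # rule 5: from the owner to the members
        self.identifier = str(uuid.uuid4())
        if parent is not None:
            parent.children.append(self)

    @property
    def path(self):
        if self.parent is None:
            return "/"
        prefix = self.parent.path
        return prefix + self.name if prefix == "/" else prefix + "/" + self.name

    def walk(self):
        """Rule 1: sequential access starting from this node (depth first)."""
        yield self
        for child in self.children:
            yield from child.walk()

    def next_sibling(self):
        """Rule 6: from a member of a set, reach the next member (or None)."""
        siblings = self.parent.children if self.parent else [self]
        position = siblings.index(self)
        return siblings[position + 1] if position + 1 < len(siblings) else None


def by_path(root, path):
    """Enter the tree at a known location; returns None if nothing is found."""
    node = root
    for name in filter(None, path.split("/")):
        node = next((c for c in node.children if c.name == name), None)
        if node is None:
            return None
    return node


root = Node("root")
collections = Node("collections", root)
science = Node("science", collections)
fiction = Node("fiction", collections)
```

None of these helpers needs to know in advance which
node-types or properties exist, which is what allows
the programmer to explore unfamiliar structures.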
Relational Model
In the relational model (2), records are seen as basic
tuples of values. Basically, these data structures do
not know their location in the database and are
not ordered in relations. To enter the database, a
programmer must have a good knowledge of the
schema and of the data organization.
In one sense, we could say that the fifth rule
previously defined is fulfilled. However, because the
records are not ordered, it is not really the case.
Thus, the relational model does not take into account
these rules at all. The relational model only defines a
way to organize data and shifts the navigation
problem to a higher level.
Choosing the right approach
In terms of navigation, the two models are not
comparable. The meaning given to the units of
content is really different. Thus choosing the right
approach depending on the use case is not really
hard. If the use case involves traversal access,
exploration or navigation in data, a model which
includes these concepts is clearly preferable.
3.6 Synthesis
The two data models show fundamental differences.
The choice of approach is closely related to the
degree of flexibility which has to be given to the user.
This choice also relates to the nature of the
requirements, which may involve clear or abstract
entities. The choice of the data model should always
be made on the basis of a good analysis of the use
case.
The selection of an approach also affects the main
roles and responsibilities which relate to data
management. All of the people using a database
should be clearly informed of their roles and be given
usage guidelines. Particular attention should be paid
to existing data usage habits, as they would have to
change or evolve if a new data model is
chosen.
Some users could voice reticence concerning these
factors, as conservative behavior is an obstacle when
deep changes arise. The choice of the data model
should not be affected by this type of reasoning. The
advantages brought by good and coherent
choices are enormous and can have a significant
impact on the application and the development
process.
4 Specification comparison
Specifications describe the features that databases
should support. The main specification for relational
databases is without doubt SQL, which has been
revised several times (SQL92, SQL:1999 and later
revisions) since its first edition and which is more or
less implemented by each relational database
provider. The JCR Specification was released in 2005
(JSR 170) and a second version of the specification is
in incubation (JSR 283). Some companies such as
Day, Alfresco or Oracle provide implementations of
this specification with different levels of compliance.
We could discuss the many aspects of each
specification, which would take a long time, but the
principal objective of this document is to highlight the
philosophy behind the specifications and the
practical answers they provide to common problems.
For this reason, the examples shown in the
following sections are essentially based on the
SQL92 specification and on version 1.0 of JCR.
The first section of this chapter presents a use case
which demonstrates how each specification can give
practical answers to recurring problems. Being well
balanced, it shows the possibilities and limits of each
model. The four following sections essentially
show how the concepts presented in the “Data model
comparison” chapter actually take form in the
specifications. Finally, the last section turns to
practicalities by presenting features which respond to
the more common differences in requirements.
4.1 Use Case Definition
Consider an editor who sells books and wants to
create a system to manage his book collection and
his orders. A book collection is composed of books
and sub-collections. A book can be tagged with
keywords. Through a website, the editor wants to let
anonymous visitors navigate through the whole
catalogue by collection.
He also wants to provide a book preview for the
authenticated customers and partners and let the
partners view the whole digital copy of the books. In
addition to the ability to navigate through collections,
partners and customers should be able to search
products by ISBN number, with full-text criteria, or by
asking for the most successful items.
Figure 4.1-1: Editor use case diagram
Figure 4.1-1 is a draft of the use case diagram of
this application, which summarizes the main actors
and the main features which have been identified
during the design process. In the next sections,
this use case will be used to point to some key
aspects which differentiate relational databases
from Java content repositories.
4.2 Structure
In terms of structure, the two approaches are radically
different. However, it makes sense to understand
how each specification makes use of the basic
concepts presented in the “Data Models” chapter.
This can assist people in developing implementation
models and in solving practical problems.
JCR Specification
Like other unstructured and semi-structured models,
the JCR Model does not make a separation between
data and their structure. Thus, there is no specific
need to identify entities and attributes as required
by relational databases. It is nevertheless important
and useful to identify the semantic beforehand or, in
other words, to identify the concepts represented by
nodes in the content repository. This can be done by
defining a node-type or by specifying an attribute
which declares the type of the node. The schema
depicted in Figure 4.2-1 does not represent the
structure of the repository. It simply shows how the
main concepts which can be found in the structure
should be organized.
Figure 4.2-1: Semantic diagram
The root can be seen as the editor system which is
dealing with persons, orders, order lines, collections,
books and tags. This diagram does not take into
account the additional artifacts which could be added
in the content repository to organize data.
<editor = 'http://www.editor.com/1.0'>
[editor:person] > nt:unstructured
[editor:order] > nt:unstructured
[editor:orderline] > nt:unstructured
[editor:collection] > nt:unstructured
[editor:book] > nt:unstructured
[editor:tag] > nt:unstructured
Table 4.2-1: Node-types
The most intuitive way to design this structure or
organization is to think in terms of composition,
that is, the manner in which one concept is always
a component of another concept. If UML class
diagrams are used during the design phase, the task
consists only of translating the composition
relationships into hierarchies. The various other
associations will be stored as properties holding
references or paths. More tips on how to design JCR
applications are available in the “Appendix – JCR and
design” appendix.
Even when we consider the environment to be
structured, we are often unable to translate this
structure clearly. Consequently, keeping the schema
as weak as possible makes it easy to take new
requirements into account at runtime by simply
recording new data. If node-types are used as
markers, it makes sense to simply let them extend the
nt:unstructured node-type without adding more
constraints.
Thus, at design time there is no real need to fix all
the attributes and all the entities. In this example,
some decisions can be taken later by the application
programmer. The general idea is simply to leave
room for new requirements.
SQL Specification
As explained in the previous chapter, the relational
model implies that data and their schema are
separate. In practice this means that all the tables
and their respective columns have to be identified at
the time of design. During the development process
the entity relationship notations are often used for
this purpose.
For the editor’s use case, this means that some
decisions need to be made which will strongly impact
the future evolution of the application. Data security
and save routines must make use of the predefined
columns. Everything has to be clearly described
beforehand. For example, identifying what an
order is, what a book is and what a customer is
becomes imperative. Hence the final application will
reflect all these decisions, which are often arbitrary.
Figure 4.2-2 shows a database schema which reflects
the decisions taken during the design phase. In this
use case, it is relatively easy to find relations and
domains for the main entities such as person, order,
order line and tag. At design time, their attributes can
clearly be identified and it is quite easy to conceive a
relational schema for them.
However, the book entity is difficult to fit into a table.
For example, this schema only stores the title and
the description of the book. However, the
requirements also call for storing a digital
copy and a preview of the book. The content of the
book could be part of the database or it could be
stored somewhere else in the file system. This kind
of decision is completely arbitrary and has an
enormous impact on the application’s life cycle.
4.3 Integrity
As mentioned, integrity can have different meanings.
In the database vocabulary, integrity generally
relates to the fact that accidental or intentional
destruction, alteration, or loss of data should not
happen. It also relates to the state of completeness of
data which has to be preserved in all cases in the
database. This section gives a quick roundup of
the possibilities proposed by JCR and SQL to deal
with integrity.
JCR Specification
Data integrity can be ensured in JCR with
node-types. Some predefined node-types are specified
by the JCR specification. These represent different
concepts which are often encountered in repositories
such as folders, files, links, unstructured nodes, etc.
These node-types can be extended, and constraints
which the nodes are forced to respect can be
defined.
Figure 4.2-2: Entity relationship diagram
In our use case, the state of completeness of data
which always has to be preserved in the database
does not require a lot of constraints. In a real-life
situation, it could happen that a person places an
order and comes to take direct delivery of the product,
or that a special edition of a book has no ISBN.
We often say that this kind of case has to be
taken into consideration. However, the corresponding
decisions should not be taken at a level which is
detrimental to future requirements.
The only integrity constraints we might choose to
define concern the orders and the order lines. For
legal compliance, it would be necessary that an order
stores a date and that an order line stores a unit
price and a quantity. This is shown in
Table 4.3-1.
<editor = 'http://www.editor.com/1.0'>
[editor:order] > nt:unstructured
- 'created' (Date) mandatory
[editor:orderline] > nt:unstructured
- 'quantity' (double) mandatory
- 'unitprice' (double) mandatory
Table 4.3-1: Node-type and integrity constraints
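The effect of these mandatory properties can be
mimicked in a few lines. The sketch is written in
Python and does not use a repository: MANDATORY
mirrors the definitions of Table 4.3-1, and
missing_or_invalid is a hypothetical helper which
plays the role of the check made when a node is
saved.

```python
from datetime import date

# Mandatory properties per node-type, mirroring Table 4.3-1.
MANDATORY = {
    "editor:order": {"created": date},
    "editor:orderline": {"quantity": float, "unitprice": float},
}


def missing_or_invalid(node_type, properties):
    """Names of the mandatory properties which are absent or of the wrong type."""
    # Types without declared constraints behave like nt:unstructured.
    required = MANDATORY.get(node_type, {})
    return sorted(
        name for name, expected in required.items()
        if not isinstance(properties.get(name), expected)
    )


order_problems = missing_or_invalid(
    "editor:order", {"created": date(2008, 12, 31), "note": "collected in person"}
)
line_problems = missing_or_invalid("editor:orderline", {"quantity": 2.0})
```

The order is accepted even though it carries an
extra property, whereas the order line is reported as
lacking its unit price.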
The fact that an order line can only be found under
an order node cannot be expressed at the repository
level. However, this constraint can be taken into
account at the application level. We might also need
to define a referential integrity constraint between the
ordered product and the order line. The code shown
in Table 4.3-2 demonstrates how this can be done.
[editor:orderline] > nt:unstructured
- 'product' (reference) mandatory
Table 4.3-2: Node-type and referential integrity
The meaning of this kind of association could be
discussed at length, but keeping a strong reference
between the product and the order line, which
implies referential integrity, does not really make
sense. A product can evolve and this sort of
association would lose its meaning. Furthermore,
the editor may want to sell a service instead of a
book in the future. Therefore imposing referential
integrity is probably extreme and we can
more realistically accept broken
references between order line and product. The
same comment can be made for the tags, which are
attached through an association of a similar nature.
SQL Specification
The fact that, in the relational model, the structure is
separated from the content and has to be
described leads to creating data models which are a
representation of what the final usage of the
application will be. Furthermore, because some
integrity rules are implicit in the model, DBAs generally
do not hesitate to define, at design time, all of the
integrity rules needed to preserve the entire data
coherence.
In practice, for the editor’s use case, this means that
some application logic can be translated into integrity
constraints. With check constraints, we could ensure
that the quantity attribute of an order line is always
positive. With referential integrity, we can ensure that
when a tag is deleted, all the links which concern
this tag are also deleted. The statements in Table
4.3-3 and Table 4.3-4 show how this can be
achieved.
CREATE TABLE IF NOT EXISTS `mydb`.`OrderLine` (
  `Order_idOrder` INT NOT NULL ,  -- type missing in the source; INT is an assumption
  `Book_isbn` VARCHAR(45) NOT NULL ,
  `unitprice` DECIMAL(11) NULL CHECK (unitprice > 0) ,
  `quantity` INT NULL CHECK (quantity > 0) ,
  PRIMARY KEY (`Order_idOrder`, `Book_isbn`))
Table 4.3-3: Table and integrity constraints
CREATE TABLE IF NOT EXISTS `mydb`.`Tag_has_Book` (
  `Tag_idTag` INT NOT NULL ,
  `Book_idBook` VARCHAR(45) NOT NULL ,  -- type missing in the source; assumed to match `Book`.`isbn`
  PRIMARY KEY (`Tag_idTag`, `Book_idBook`) ,
  CONSTRAINT `fk_Tag_has_Book_Tag`
    FOREIGN KEY (`Tag_idTag` )
    REFERENCES `mydb`.`Tag` (`idTag` )
    ON DELETE CASCADE
    ON UPDATE CASCADE,
  CONSTRAINT `fk_Tag_has_Book_Book`
    FOREIGN KEY (`Book_idBook` )
    REFERENCES `mydb`.`Book` (`isbn` )
    ON DELETE CASCADE
    ON UPDATE CASCADE)
Table 4.3-4: Table and referential integrity
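The behaviour of these two kinds of constraints can
be observed with a small experiment. It is a sketch
which assumes SQLite through the Python standard
library rather than the MySQL dialect of the tables
above; the schema is simplified and all the table
and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE tag  (id_tag INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE tag_has_book (
        id_tag INTEGER NOT NULL REFERENCES tag(id_tag) ON DELETE CASCADE,
        isbn   TEXT    NOT NULL REFERENCES book(isbn)  ON DELETE CASCADE,
        PRIMARY KEY (id_tag, isbn)
    );
    CREATE TABLE order_line (
        id       INTEGER PRIMARY KEY,
        isbn     TEXT    NOT NULL REFERENCES book(isbn),
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    );
    INSERT INTO tag  VALUES (1, 'databases');
    INSERT INTO book VALUES ('isbn-1', 'First'), ('isbn-2', 'Second');
    INSERT INTO tag_has_book VALUES (1, 'isbn-1'), (1, 'isbn-2');
""")


def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# The check constraint refuses an order line with a quantity of zero.
try:
    conn.execute("INSERT INTO order_line (isbn, quantity) VALUES ('isbn-1', 0)")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

# Deleting the tag removes every link which concerns it.
links_before = count("tag_has_book")
conn.execute("DELETE FROM tag WHERE id_tag = 1")
links_after = count("tag_has_book")
```

A single delete statement on the tag table is enough:
the database removes the links itself, and the books
are left untouched.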
The advantage of referential integrity constraints is
not negligible. They minimize the efforts made at
application level to ensure the coherence of the data
stored in the database. However, in the case of the
tag, if a tag has been assigned a thousand times,
deleting it implies a thousand and one write
accesses. If tags change a lot, the system will
probably not sustain these integrity checks. A better
policy could be to allow incoherent tag attributions to
survive in the database and to delete them during the
next read access which encounters them.
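The lazy policy described above can be sketched in
application code. The following is a minimal in-memory
illustration, not a database implementation: the class
LazyTagStore and its methods are hypothetical names, and
two collections stand in for the Tag and Tag_has_Book
tables. Deleting a tag costs a single write, and dangling
attributions are dropped the next time the tags of a book
are read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the Tag and Tag_has_Book tables. */
final class LazyTagStore {
    private final Set<Integer> tags = new HashSet<>();
    private final Map<String, List<Integer>> tagsByBook = new HashMap<>();

    void addTag(int tagId) {
        tags.add(tagId);
    }

    void attribute(String isbn, int tagId) {
        tagsByBook.computeIfAbsent(isbn, k -> new ArrayList<>()).add(tagId);
    }

    /** Deleting a tag is a single write: its attributions are left untouched. */
    void deleteTag(int tagId) {
        tags.remove(tagId);
    }

    /** Reading the tags of a book first drops attributions whose tag no longer exists. */
    List<Integer> tagsOf(String isbn) {
        List<Integer> attributed = tagsByBook.getOrDefault(isbn, new ArrayList<>());
        attributed.removeIf(tagId -> !tags.contains(tagId));
        return new ArrayList<>(attributed);
    }

    /** Number of attributions currently stored for a book, coherent or not. */
    int storedAttributions(String isbn) {
        return tagsByBook.getOrDefault(isbn, new ArrayList<>()).size();
    }
}
```

The trade-off is that readers must tolerate, and pay for,
the cleanup, so this only makes sense when deletions are
frequent compared to reads.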
Specifying all the integrity constraints at the model
level can lead to performance and scalability
problems, but it also restricts potential uses
which have not been identified at design time.
Implementing a new requirement would impose a
new development cycle which starts from the
implementation model definition and finishes with the
implementation of the user interface.
4.4 Operations and queries
In terms of operations and queries, we could consider
the four following requirements. The editor wants to
identify the top 10 best sellers. He also wants to
change the status of all of the orders which respect
some specific conditions. He wants to be able to
retrieve all the books which are under a specific
collection and finally, he wants to perform full text
search on all items stored in the system.
JCR Specification
The abstract query model of JCR is implemented in
several ways for different uses. Version 1.0
of JCR uses a common subset of XPath and SQL,
which opens up the opportunity for some interesting
requests. The draft of version 2.0 declares
XPath as deprecated and replaces it with a query
language which uses Java objects.
The first requirement, identifying the best sellers,
cannot easily be expressed with JCR in one request.
The reason is that domain operations such as Max and
Min are not included in the specification; joins only
allow the retrieval of books which have been ordered
at least once (Table 4.4-1).
SELECT * FROM editor:book, editor:orderline WHERE editor:book.jcr:path = editor:orderline.product
Table 4.4-1: simple JCR query
As shown in Table 4.4-2, the top 10 can be obtained
by running a query for each book which returns its
number of related orders. The results are then
combined in application code to build the top 10.
This works for simple queries, but if connecting
operations combined with domain operations are
needed, the code becomes considerably more complex.
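The aggregation step which JCR leaves to the
application can be sketched as follows. This is a
minimal illustration under stated assumptions: the
order lines are supposed to have already been read from
the repository, and the BestSellers class, the OrderLine
type and its fields are hypothetical names which do not
belong to the JCR API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BestSellers {
    /** One order line, reduced to the two fields the ranking needs (hypothetical type). */
    static final class OrderLine {
        final String bookPath;
        final int quantity;

        OrderLine(String bookPath, int quantity) {
            this.bookPath = bookPath;
            this.quantity = quantity;
        }
    }

    /** Sums the quantities per book, then returns the paths of the n most sold books. */
    static List<String> top(List<OrderLine> lines, int n) {
        Map<String, Integer> sold = new LinkedHashMap<>();
        for (OrderLine line : lines) {
            sold.merge(line.bookPath, line.quantity, Integer::sum);
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(sold.entrySet());
        // Stable sort: books with equal sales keep their order of first appearance.
        ranked.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : ranked.subList(0, Math.min(n, ranked.size()))) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

Everything the SQL engine would do with GROUP BY, ORDER BY
and LIMIT is written by hand here, which is the extra
complexity the paragraph above refers to.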
The second requirement, changing the status of some
orders, cannot be expressed with a single query.
However, the results of a query can be accessed and
modified through the navigation API. If the selection
criteria involve domain conditions or many
connections, this kind of operation becomes very
complicated.
SELECT * FROM editor:order WHERE date < TIMESTAMP '2008-11-02T00:00:00.000Z'
(…)
NodeIterator ni = queryresult.getNodes();
while (ni.hasNext()) {
    Node n = ni.nextNode();
    n.setProperty("status", "closed");
}
session.save(); // persist the pending changes
Table 4.4-2: JCR query and iteration on the result
Retrieving all the books which are stored under a
collection is very easy to implement (Table 4.4-3).
Some properties which relate to the record (path,
uuid, etc.) are accessible through XPath and SQL.
The strengths of JCR are very evident in this type
of situation.
SELECT * FROM editor:book WHERE jcr:path LIKE '/collections/science/%'
Table 4.4-3: JCR query and hierarchy
JCR offers domain-independent functions which
allow the execution of queries on all the properties
stored in nodes. As mentioned, the JCR model is
unstructured, and the nodes do not have to share
the same properties. This is therefore a very powerful
functionality for all the use cases which require
full-text searches. As illustrated in Table 4.4-4,
retrieving the set of nodes which contain a specific
sequence of characters is very simple.
SELECT * FROM nt:base WHERE CONTAINS(*, '*computer*')
Table 4.4-4: JCR query and full-text search
In conclusion, use cases which are characterized by
a lot of join and domain operations will not really
benefit from the features proposed by JCR. On the
other hand, in terms of operations and queries, if
the use cases characteristically require hierarchical
queries, full-text search queries and search queries
in binary content, a Java content repository would
be advisable.
SQL Specification
As explained in the last chapter, the relational model
shows all of its power when the requirements call for
connecting operations and domain operations.
Furthermore, if the requirements involve a high
volume of sequential changes to large volumes of
records, the possibilities offered by this model
respond well to these needs.
The first requirement, retrieving the top 10
best-selling books, can easily be expressed with SQL.
Table 4.4-5 shows how this can be done with a
simple join and a GROUP BY clause.
SELECT b.isbn, b.title, sum(o.quantity)
FROM editor.book b JOIN editor.orderline o ON o.bookIsbn=b.isbn
GROUP BY b.isbn
ORDER BY sum(o.quantity) DESC LIMIT 10;
Table 4.4-5: SQL query and simple join operation
Updating the status of the orders is also quite easy to
implement with one query (Table 4.4-6). This kind of
statement is very useful when sequential
modifications which depend on complex conditions
have to be performed on the dataset.
UPDATE editor.`order` o
SET o.`status` = 'closed'
WHERE o.`date` <= curdate() - INTERVAL 1 YEAR;
Table 4.4-6: SQL update query
The third requirement is more complicated to realize.
In this case, the depth of the hierarchy of collections
is not known in advance, and it is not possible to
define a single non-recursive SQL query which takes
this unknown parameter into account. A possible way
to proceed is to retrieve the collections recursively
with a statement similar to the code found in Table
4.4-7, and then to run a query on all the books stored
under the retrieved collections.
SELECT c1.id FROM collection AS c1
JOIN collection c2 ON c1.parentId = c2.id
WHERE c2.id = $categoryId;
SELECT * FROM book AS b
WHERE b.collectionId = $categoryId[0]
   OR b.collectionId = $categoryId[1]
   OR b.collectionId = $categoryId[n];
Table 4.4-7: SQL query and recursion limitation
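The application-side loop implied by Table 4.4-7 can be
sketched as follows. This is only an illustration under
stated assumptions: the childrenOf map stands in for the
result of the first query (one lookup per collection),
the CollectionTree class and its method are hypothetical
names, and the hierarchy is assumed to be a tree without
cycles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

final class CollectionTree {
    /**
     * Returns the id of the given collection followed by the ids of all its
     * descendants, in breadth-first order. Each lookup in childrenOf stands
     * for one execution of the "children of this collection" query.
     */
    static List<Integer> withDescendants(int rootId, Map<Integer, List<Integer>> childrenOf) {
        List<Integer> found = new ArrayList<>();
        Deque<Integer> pending = new ArrayDeque<>();
        pending.add(rootId);
        while (!pending.isEmpty()) {
            int id = pending.remove();
            found.add(id);
            pending.addAll(childrenOf.getOrDefault(id, List.of()));
        }
        return found;
    }
}
```

The returned ids are the values which the second query of
Table 4.4-7 has to enumerate. The number of round trips
to the database grows with the size of the subtree.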
Nested sets can be used to avoid recursive calls.
However, the cost of updating the hierarchy is hard to
predict, because inserting or moving a node can
require renumbering a large part of the tree. Nested
intervals (12) partially solve this problem but, like
nested sets, they incur some maintenance complexity.
While relational databases permit the management of
hierarchies, they do not provide effective tools for
their maintenance. Application programmers tend
to use frameworks to manage these requirements in
a more elegant manner.
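Why nested sets remove the need for recursion can be
shown with a small sketch. Each node carries an interval,
and a node is a descendant of another when its interval
lies strictly inside the other's. The NestedSets class
and the field names left and right are hypothetical
choices made for this illustration; in SQL the same test
would be a single range condition on two columns.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedSets {
    /** A collection with its nested-set interval, assigned when the tree is numbered. */
    static final class Node {
        final String name;
        final int left;
        final int right;

        Node(String name, int left, int right) {
            this.name = name;
            this.left = left;
            this.right = right;
        }
    }

    /** Returns the names of the nodes whose interval lies strictly inside the ancestor's. */
    static List<String> descendants(Node ancestor, List<Node> all) {
        List<String> names = new ArrayList<>();
        for (Node node : all) {
            if (node.left > ancestor.left && node.right < ancestor.right) {
                names.add(node.name);
            }
        }
        return names;
    }
}
```

The read is a single pass whatever the depth of the tree.
The price is paid on writes, since adding a node means
shifting the bounds of every node located after it.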
Performing full-text search queries on a relational
database requires a good knowledge of the structures.
In fact, only the columns specified in the statement
are considered in the result. For complex models,
alternative solutions with external indexes are often
used to perform this kind of request.
SELECT * FROM book AS b
WHERE b.title LIKE '%computer%'
   OR b.description LIKE '%computer%';
SELECT * FROM collection AS c
WHERE c.title LIKE '%computer%'
   OR c.description LIKE '%computer%';
SELECT * FROM tag AS t
WHERE t.title LIKE '%computer%'
   OR t.description LIKE '%computer%';
Table 4.4-8: SQL query and full-text search limitation
Table 4.4-9 presents the non-standardized syntax
proposed by MySQL for full-text search.
Unfortunately, the problem linked to the structure is
not really solved, and this solution does not support
full-text search across multiple tables.
SELECT * FROM book AS b
WHERE MATCH (b.title, b.description, b.isbn) AGAINST ('word');
Table 4.4-9: MySQL and full-text search
The first requests in this section show the power that
can be reached by combining different operators in
declarative statements. For complex models which
involve sequential data modification in conjunction
with domain operations, relational databases make
more sense. However, the strength provided by the
structure disappears when the use case involves
features linked to hierarchies, networks and search on
semi-structured data. Therefore, a good knowledge of
the whole use case is required before being able to
make a choice between the two options.
4.5 Navigation
In our use case, the entity "book" has not been
clearly defined. This type of entity is difficult to
pin down. It is composed of other entities, not known
in advance, such as a title, paragraphs, images, pages
or covers. Furthermore, these entities can vary from
one book to the other. For the editor's use case, we
could consider the two following types of books saved
in the system. The first is a novel, essentially
composed of ordered chapters, titles and paragraphs.
The second is a comic, composed of ordered plates or
panels.
JCR Specification
Without a doubt, navigation constitutes the main
feature proposed by the JCR specification. Creating
and exploring a tree or a network structure is not
always easy. Navigation simplifies this.
The API proposed by the JCR specification allows
navigation in and through records with direct access
or traversal access. A session is the main entry point
of the repository. It provides traversal access from
the root node and direct access to each node by its
UUID or path. Each item of the repository also
provides navigational functionalities: direct access
through a relative path, or traversal access through
children, properties, references or parents. This API
also provides write features to the repository.
Table 4.5-1 lists the main access methods of this API.
session.getRootNode();
session.getNodeByUUID("uuid");
session.getItem("path");
Node.getNode("name");
Node.getNodes();
Node.getProperty("name");
Node.getProperties();
Table 4.5-1: JCR navigation API
As mentioned, in our use case the entity "book"
cannot be completely defined at design time. That is
why the application programmer should give the user
the ability to decide what a book is at the time of
input. At the moment of creation, the application
programmer does not need to be concerned with what
types of entities are present in a book. He lets the
user define them at a later stage. The book can be
identified by displaying the configuration of its
components.
public void displayBook(Node book) throws RepositoryException {
    this.traverse(book);
}

public void traverse(Node node) throws RepositoryException {
    NodeIterator nodeIterator = node.getNodes();
    displayNode(node);
    while (nodeIterator.hasNext()) {
        traverse(nodeIterator.nextNode());
    }
}

public void displayNode(Node node) {
    // display logic...
}
Table 4.5-2: JCR traversal access
The methods shown in Table 4.5-2 sketch the
advantages that navigation offers. Several
possibilities are now open to the application
programmer. He could provide tools to let the user
store the display logic directly in the nodes, giving
maximum flexibility. This kind of strategy can be
adopted through features proposed by the JCR
specification. A framework such as Sling can
facilitate this task.
Figure 4.4-1: Unstructured entity
SQL Specification
As mentioned, the relational model does not take
navigation into consideration and forces the
responsibility on the programmer to implement these
features. Furthermore, all the entities have to be
defined at design time and semi-structured data is
not catered for.
For the editor’s use case, the implications are that
the application programmer will face some problems
if he is not able to define an abstract entity for the
content of the book. Figure 4.5-2 shows how the
application programmer could choose to design his
relational model to take into account that the
structure of the book appears and can only be
concretized at the time of input.
Figure 4.5-2: SQL and unstructured entity
SQL does not standardize mechanisms which
simplify the navigation through records during a
session. Furthermore, there is no real notion of
current position in the database which is kept during
a session and which can simply be reused.
To navigate, the application programmer is obliged to
build a mechanism which is able to perform dynamic
queries on the model. Therefore, even if the model is
extremely abstract and able to take into account all
the possible situations, the application programmer is
forced to develop all the application logic to navigate
the structure. This task is by no means trivial.
It is possible to make an implementation model which
adds artifacts or miscellaneous entities to the records
to create hierarchies, networks or explicit orders.
However, this approach exposes the application
programmer to design flaws which are very difficult
to correct once the system is in production.
4.6 Transactions
In the current context, we can identify two levels of
transaction. Local transactions deal with one
resource and ensure that a sequence of changes can
be considered as a unit of work. Global transactions
(14) deal with several resources and require a
coordinator or a transaction manager to make sure
that the changes can be committed to all the
resources concerned.
Figure 4.6-1: Global and local transaction
JCR Specification
The JCR specification covers both cases. Locally, if
the application programmer deals with only one
repository instance, he can ensure that a sequence of
changes is treated as a unit of work: all the changes
made between two save calls form one unit of work.
Session.save();
Item.save();
Table 4.6-1: JCR and local transaction
In an application, a content repository can be used
as a resource in conjunction with other resources
such as a relational database or a messaging service.
The specification mentions that a repository
implementation can be used in conjunction with the
Java Transaction API (JTA). In a Java container, when
the Transaction API is used, the changes made on the
JCR resource become persistent only when the
transaction is committed.
// Get user transaction (for example, through JNDI)
UserTransaction utx = ...
utx.begin();
// Perform some changes in a java content repository
// Perform some changes in a relational database
// Commit the user transaction
utx.commit();
Table 4.6-2: JCR and global transaction
SQL Specification
The SQL specification allows statements to be
grouped into a unit of work. These statements only
become permanent in the database if they all
succeed. Local transactions such as the one shown in
Table 4.6-3 are therefore part of the standard.
START TRANSACTION;
(Statement list…)
COMMIT;
Table 4.6-3: SQL and local transaction
However, using the database in conjunction with
other resources is not taken into account by the
specification. Some implementations provide
statements to manage this kind of scenario, such as
the XA statements of MySQL. All the same, this is
more often handled at a higher and more
standardized level: some APIs provide these
features, and most JDBC drivers can therefore be
used with JTA.
4.7 Inheritance
To enrich our use case with a wider range of
associations, we could consider a new requirement
which involves inheritance. The editor wants to
differentiate between his collaborators, his partners
and his customers, but he also wants to take into
consideration that an individual can have several
roles.
JCR Specification
For the inheritance requirement, node-types and
mixin-types can be used. For example, let us consider
a person node-type and three mixin-types: customer,
collaborator and partner. By taking one or more of
these mixin-types, a node which has been defined as
a person can take on all the roles encountered in the
system.
Figure 4.7-1: inheritance semantic
<editor = 'http://www.editor.com/1.0'>
[editor:partner] > editor:person mixin
[editor:collaborator] > editor:person mixin
[editor:customer] > editor:person mixin
Table 4.7-1: node-types and inheritance
The primary advantage is that queries made on the
person node-type return all nodes of this type,
including nodes which inherit from it. All the
properties of the returned nodes are immediately
accessible, and a node which was not considered as a
person can also acquire this status through the
mixin-type.
SQL Specification
Inheritance tends to be handled at application
level. Some relational databases, for example
PostgreSQL, have extensions which manage
inheritance. However, these tools are not
standardized and tend not to be used in practice.
A classical way to handle this requirement consists
of creating a table for each entity which inherits
characteristics from the person entity. The identifier
of each of these sub-entities is a foreign key which
points to the person table. Figure 4.7-2 shows how
this could be implemented with SQL.
Figure 4.7-2: SQL and inheritance
It is quite easy to create a query which retrieves the
entire set of persons and all their inherited properties.
The one in Table 4.7-2 shows how this can be done
with left outer joins. Additionally, a view can be
created to avoid having to rewrite the query.
SELECT * FROM person p
LEFT OUTER JOIN partner pa ON pa.id = p.id
LEFT OUTER JOIN collaborator co ON co.id = p.id
LEFT OUTER JOIN customer cu ON cu.id = p.id;
Table 4.7-2: SQL query and inheritance
Although JCR seems to offer a more flexible way to
express inheritance, both approaches are
approximately equal in expressing this kind of
association. The real advantage of JCR is that each
node can take on several mixin node-types. Note,
however, that this advantage relates more to the
semi-structured approach than to inheritance itself.
4.8 Access Control
Access control can be defined as the action of
authorizing or denying access to, modification of and
creation of records. While this is nearly always a
requirement in business applications, specifications
rarely address real-world situations.
In the editor’s use case, it was mentioned that a
person should be able to see a digital preview of the
book and under certain conditions the whole book.
This implies that books’ components can have
different access policies.
JCR Specification
Since version 1.0 of JCR (4), access control has been
one of the core features. In its first release, the
specification only declares how to log in to the
repository and how to check the permissions
attributed to the items of the repository. The
hierarchical path of the items stored in the repository
is used as the basis for checking these
permissions. However, the specification does not
specify how access control should be implemented
and managed.
Repository.login(Credentials cred);
Session.checkPermission(String absPath, String actions);
Table 4.8-1: JCR 1.0 and access control
Version 2.0 of the specification (5) defines the
concepts of privileges and access control policies
in the repository. Each item stores properties which
relate to privileges. These properties can be
modified through the API. Thus, access control can
be delegated to the content repository, which is able
to manage the list of permissions at item level.
Session.getUserManager();
UserManager.addUser(…);
UserManager.addGroup(…);
Session.getAccessControlManager();
AccessControlManager.getApplicablePolicies(path);
Policy.addEntry(…);
AccessControlManager.setPolicy(path, policy);
Table 4.8-2: JCR 2.0 and access control
In both cases, this means that for the editor’s use
case, the application programmer will only have to
define the structure and to use the repository
features provided to manage access control. The
access control granularity proposed by the API is
close enough to the data to address all the potential
use cases. Consequently, further access control logic
is not required.
SQL Specification
In SQL, access control is basically managed with the
data stored in the information schema (10). This
provides the ability to grant and revoke privileges at
table or column level. However, while the base
functionalities provided by SQL allow the
declaration of implementation models which manage
permissions at record level, no standard solution is
provided. This comes from the fact that the
identifiers of the records in a relational database can
be distributed across several domains. Preserving this
property makes it difficult to specify a generic way
to manage access control at record level.
For the editor's use case, managing the readability
of the information of which a book is composed
requires access control at record level. Because the
SQL specification does not provide this feature, the
application programmer must include it in his
implementation model.
Figure 4.8-1 shows a solution where each record has
a unique identifier stored in a column. The record
controller table identifies the accessible resources
within the database. The record_accessor table
identifies the persons accessing the database; these
can then be stored elsewhere in the database in a
user or a group table. With this model, the
application programmer must still implement and
manage the logic which performs the privilege
checks.
Figure 4.8-1: JCR and access control
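The privilege check which remains the programmer's
responsibility can be sketched as follows. This is a
minimal in-memory illustration under stated assumptions:
the RecordAccessControl class, its methods and the
identifiers are hypothetical, and two maps stand in for
the tables which link accessors (users or groups) to
record identifiers.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RecordAccessControl {
    // Accessor id (a user or a group) -> ids of the records it may read.
    private final Map<String, Set<Long>> readableRecords = new HashMap<>();
    // User id -> ids of the groups the user belongs to.
    private final Map<String, Set<String>> groupsOfUser = new HashMap<>();

    void grantRead(String accessorId, long recordId) {
        readableRecords.computeIfAbsent(accessorId, k -> new HashSet<>()).add(recordId);
    }

    void addToGroup(String userId, String groupId) {
        groupsOfUser.computeIfAbsent(userId, k -> new HashSet<>()).add(groupId);
    }

    /** A user may read a record if the user, or one of the user's groups, was granted access. */
    boolean canRead(String userId, long recordId) {
        if (isGranted(userId, recordId)) {
            return true;
        }
        for (String groupId : groupsOfUser.getOrDefault(userId, Set.of())) {
            if (isGranted(groupId, recordId)) {
                return true;
            }
        }
        return false;
    }

    private boolean isGranted(String accessorId, long recordId) {
        return readableRecords.getOrDefault(accessorId, Set.of()).contains(recordId);
    }
}
```

Every read path of the application has to call such a
check itself, since the database will not enforce it.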
4.9 Events
Another requirement often encountered concerns the
observation of the changes applied to a dataset. At
the infrastructure level, messaging services are
common examples of components which make use of
these types of events. Some use cases benefit from
being event driven; one such case is the management
of workflows. The editor's use case could also
benefit from this approach. For example, the editor
may want to notify some clients each time a new book
is added to a specific collection.
JCR Specification
The JCR specification provides an EventListener
interface. Its implementation defines the operations
to perform when a specific event occurs. Listeners
can be registered for different types of events, for
example:
- when nodes are added or removed;
- for events which occur under a particular path, at a
  specific level;
- for events which occur on the instances of a
  node-type or on a single node identified by a UUID.
The code example presented in Table 4.9-1 shows
how an event listener can be registered for all the
events which occur when a book is added to the
computer collection.
ObservationManager om = session.getWorkspace().getObservationManager();
EventListener el = new EventListener() {
    @Override
    public void onEvent(EventIterator ei) {
        System.out.println("A book has been added");
    }
};
String[] nt = { "editor:collection" };
om.addEventListener(el, Event.NODE_ADDED,
    "/collections/science/computer", true, null, nt, false);
Table 4.9-1: JCR and observation
This observation mechanism allows listening to
events with a fine granularity. Furthermore, the fact
that the observation mechanism is provided directly
through a Java API instead of a specific procedural
language allows a high level of interaction between
the application and the repository.
However, an important aspect is that the listeners are
not permanent. This means that if the repository is
restarted, all the listeners have to be reregistered. In
certain situations, especially those which occur when
the event listeners are registered at runtime, the
recovery of the application’s state can be difficult and
complex.
SQL Specification
The SQL specification addresses the observation
problem with triggers. One of the main advantages of
triggers is that they persist in the information
schema. This ensures that the state of the database,
including the triggers, can easily be recovered.
Triggers can be registered for insert, update or delete
operations performed on specific tables. The body of
the trigger generally contains procedural calls which
are executed before or after the triggering statement.
CREATE TRIGGER editor.book_insert
AFTER INSERT ON editor.book
FOR EACH ROW
BEGIN
  (Statement list…)
END;
Table 4.9-2: SQL and triggers
For the editor's use case, the trigger shown in Table
4.9-2 listens for the insertion of new books. However,
it is not possible to listen only for the events which
occur in a subset of the table. In addition, there is
no standard way to propagate the event from the
procedural language to the application. Hence
triggers are mainly used to modify data in the
database following inserts or updates.
4.10 Version control
Version control is often an issue when people
collaborate on the same data. It is therefore
prudent to keep the history of an object and to give
the user access to its evolution. For the case in
question, we could imagine that, after a certain lapse
of time, the editor decides to manage the different
versions and editions of the books in the system.
JCR Specification
Versioning is one of the features which characterize
content repositories that are fully compliant with the
JCR specification. The specification includes
versioning as part of the standard. It can be
supported for individual items and for hierarchies of
items. This simplifies the life of application
programmers, who normally have to deal with this
kind of need themselves. As shown in Table 4.10-1,
managing versions of a hierarchy does not require an
enormous effort.
// add the versioning mixin type
book.addMixin("mix:versionable");
session.save();
// version creation
book.checkout();
book.addNode("chapter1");
session.save();
book.checkin();
book.checkout();
book.addNode("chapter2");
session.save();
book.checkin();
book.checkout();
book.setProperty("isbn", "0-85131-041-9");
session.save();
book.save();
book.checkin();
// get the second version
VersionIterator vi = book.getVersionHistory().getAllVersions();
Version v;
v = vi.nextVersion();
v = vi.nextVersion();
// restore the second version
book.checkout();
book.restore(v, true);
Table 4.10-1: JCR and version control
SQL Specification
Some relational database implementations provide
versioning functionalities. However, versioning is not
part of the SQL standard. Anyone wishing to build an
interoperable application has to include versioning in
the implementation model. Properly managing complex
graphs in relational databases is quite difficult. So,
while versioning could be implemented, it is not a
task to undertake lightly with SQL.
4.11 Synthesis
It seems that, for both specifications, the structural
and integrity parts are well defined. However, while
the relational model provides very clear foundations
for operations and queries, the JCR specification
provides them on a less clearly defined basis.
A symmetrical remark can be made for navigation.
While the JCR specification provides a strong
navigational basis, the latest versions of the SQL
specification struggle to provide a coherent set of
features which take navigation into consideration.
Improvements could be made in these areas for both
models, with each borrowing recommendations and
enhancements from the other.
A further difference between the two specifications
is noteworthy. Generally, the JCR specification
appears more pragmatic than the SQL specification.
The features provided by JCR give practical answers
to common and recurrent problems.
Providing a standard way to solve recurring problems
in a natural and elegant manner is not obligatory, but
doing so protects the application programmer from
design flaws, for example in the management of
versioning or access control.
While relational databases built on the SQL
specification have the potential to represent all
types of use cases which could appear in real life,
they are often badly designed because of the
constraints which govern a project's evolution or life
cycle. This does not detract from the fact that the
relational model contains a complete set of main
building blocks for a database. At specification
level, SQL makes extensive use of its base components
to express its various extensions.
Two conclusions can be drawn from this: first, that a
specification's foundation should be able to handle
all kinds of use cases; second, that a specification
should evolve by building onto its foundation and not
away from it.
5 Development process comparison
Another perspective is taken in this chapter to
compare relational databases and Java content
repositories. The purpose is to show the key
differences between the data models which impact the
application's development process. These
differences cannot really be measured but are
significant enough to be mentioned.
Agile development processes such as "Extreme
Programming", "Rational Unified Process" or "Open
Up" divide project life cycles into phases such as
inception, elaboration, construction and transition.
These phases can be executed iteratively. Figure
4.11-1 summarizes a possible segmentation of the
time taken by the Open Up development process. The
following sections refer to these phases. The purpose
is to show where and how both models, the JCR one
and the relational one, can impact this process.
5.1 Data Understandability
Making architectural and implementation models
understandable is one of the key aspects of the
elaboration phase. A clear architecture which can
easily be communicated allows people to get into the
project more quickly. It is also easier to define
tasks and duties if the architecture is clear and made
of separate modules.
Generally the architecture is defined or refined by an
architect or an analyst during the elaboration stage.
This actor takes the requirement identified during the
inception phase as input and delivers blueprints
which explain the behavior of the system at different
levels. At an application level, these blueprints
generally include use case diagrams, collaboration
diagrams or class diagrams. To show how the
application's data persists, these schemas are often
translated into database schemas which take the
properties of the data model into account.
Figure 4.11-1: Agile and iterative development process
JCR development
As mentioned, the structure and the content are
indivisible in JCR. However, it is possible to define
a semantic which shows how data and structure will
be instantiated. In this semantic, some aspects of
the content can be omitted.
For example, if a semantic item has an unstructured
basis, any imaginable property can be saved under it.
Thus, there is no need to mention properties which
are not mandatory or do not have to respect specific
constraints. It is enough to declare them in the
application's schemas, as is done in a class diagram.
Thus, the semantic diagram of a Java content
repository says less than the other architectural
diagrams. This improves its readability: reading the
semantic of a repository gives a snapshot of the
final application and helps to understand its general
behavior.
Figure 5.1-1: JCR translation
Another interesting aspect is that the complexity of
the JCR semantic is not multiplied by many-to-many
relationships. No intermediary nodes or artifacts are
needed to represent these associations. Thus, these
diagrams are very close to the other architectural
schemas. No translation rules are needed to create
them.
Relational development
Class diagrams can be used as input to generate
relational schemas. Entity-relationship diagrams (15)
or Crow's Foot diagrams are often used to represent
them. Translation rules are generally needed to
produce these schemas. Far from summarizing the
architecture, they enumerate in great detail all the
aspects of the final application.
Figure 5.1-2: SQL translation
Everything has to be explicitly mentioned in these
database schemas. Only the records which respect
the data structure can be instantiated in a relational
database. Thus, it is necessary to define this
structure carefully and to make it fit the
application architecture.
Many-to-many associations cannot be represented
in relational database schemas without reification.
This means that many-to-many associations will
always require intermediary entities. Consequently,
the internal complexity of a relational schema
increases faster than the complexity of the other
architectural diagrams. Thus, relational schemas
don’t really help to understand the application. They
are more often used as implementation blueprints.
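The reification described above can be illustrated with a minimal Java sketch. The Article and Tag entities and all class and field names are hypothetical and do not come from this paper. In the relational style the association becomes an entity of its own (one row of an intermediary table), whereas in a repository style a multi-valued property on the node can hold the references directly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: articles and tags in a many-to-many association.
final class ReificationSketch {

    // Relational style: the association is reified as its own record,
    // mirroring one row of an intermediary (junction) table.
    static final class ArticleTag {
        final int articleId;
        final int tagId;

        ArticleTag(int articleId, int tagId) {
            this.articleId = articleId;
            this.tagId = tagId;
        }
    }

    // Finds the tags of an article by going through the intermediary entity.
    static List<Integer> tagsOf(int articleId, List<ArticleTag> junction) {
        List<Integer> result = new ArrayList<>();
        for (ArticleTag row : junction) {
            if (row.articleId == articleId) {
                result.add(row.tagId);
            }
        }
        return result;
    }

    // Repository style: the node itself carries a multi-valued property,
    // so no intermediary artifact appears in the model.
    static final class ArticleNode {
        final List<Integer> tagRefs = new ArrayList<>();
    }
}
```

The intermediary class is the extra artifact which a class diagram does not contain and which the relational schema has to add.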
5.2 Coding Efficiency
The construction phase of a development process is
highly influenced by efficiency. Coding requires time,
resources and money. These parameters are very
sensitive. Furthermore, if developers have to write
twice as much code, there is a high probability that
they will make more than twice as many programming
errors. Thus, efficiency also impacts quality.
Measuring coding efficiency involves some soft
parameters. The programmer’s education and
knowledge should be taken into account.
Furthermore, the semantic and the readability of the
code are also significant. These parameters make it
difficult to judge the technology’s efficiency. Without
going too deep into these questions, the following
sections contain useful information which can be
taken into consideration when making a decision in
this area.
JCR development
Most programmers are not yet familiar with the JCR
API and don’t really know the best practices linked to
content repositories. However, the API is in large
part self-explanatory and people are generally used
to thinking in terms of hierarchies. These
parameters should give JCR a gentle learning
curve.
Some interactions are possible between the query
part of the API and its navigational part. One of the
big advantages of JCR lies in the fact that
these aspects are merged coherently and are not
considered as different abstraction levels.
The amount of code depends heavily on the use case.
If the use case mainly requires complex join
operations, JCR will not be an efficient choice.
However, if navigation is required, the size of the
code will be much smaller. If special requirements
such as versioning or fine-grained access control
are needed, it becomes clearly difficult to reach the
level offered by JCR by other means.
Relational development
Nearly all programmers are familiar with the
relational model and have used it often in
recent years. Thus, SQL and APIs such as JDBC are
part of the common language. In real-world situations,
this general knowledge often favors the relational
model. However, some problems need to be treated
in a specific manner, and the intuitive approach often
gives bad results.
If complex operations are required by the use case,
the relational model should not be bypassed. The
completeness of the queries and the range of
operations make it very efficient in terms of code
quantity. However, if the use case implies
requirements such as navigation or versioning, the
developer will have to add some artifacts to the
implementation model to manage parameters such
as tree structure or order. The developer will also
have to implement a large amount of application
logic. Thus, in terms of efficiency, the choice of
model should be driven by an honest analysis of the
use case’s properties.
5.3 Application Changeability
Requirements which appear during the development
process are often difficult to include in a previously
defined architecture. Modern software development
processes generally address this problem with
iteration cycles (16). Well managed, iterations
should allow new requirements to be included
efficiently. However, because each logic level is
generally impacted by architectural changes made
during the elaboration phase, late iterations are more
expensive than early iterations.
Clearly decoupling the logic levels can reduce this
increasing cost. Thus, data models which can
transparently accept changes are really appreciated.
To make this point, we will consider how simple
changes impact the data logic of a system.
JCR development
As mentioned in the “Schema understandability”
section, repository schemas summarize the other
architectural diagrams. While this could appear
meaningless, it is really not the case. Keeping the
repository as weakly structured as possible allows
new requirements to be included without touching
the data logic level. Only the application logic level
is impacted. Thus, adding a property at the
application level doesn’t necessarily require touching
the repository’s organization.
To be sure, deep changes do impact the data logic,
and JCR does not provide a magic solution for them
either. However, JCR allows most of the data logic
to be decoupled from the application and the
interface levels. It is also interesting to note that
frameworks like Sling allow the application logic to
be decoupled from the interface logic in a similar
manner. This approach is clearly an attractive one,
especially in environments driven by changes and
agility.
Relational development
Nearly every modification made to the overall
architecture will impact the data logic level. This
comes from the fact that relational databases do not
allow the instantiation of elements which have not
been previously defined in the structure. Thus, there
is a great probability that a change made in a form
of the interface or in the application logic will require
changes to the data model logic.
Some frameworks provide tools to automate these
changes. However, if the system has a production
version, the change, once executed, will have a big
footprint on all the database’s items. Furthermore,
classical model-view-controller frameworks do not
really decouple the application level from the
interface. For example, a change made to a
controller will often impact views and models.
5.4 Synthesis
At a project level, people are often looking for
solutions which allow changes to be integrated
quickly into their environment. In situations where
changes have to be performed, the semi-structured
nature of JCR will certainly be
appreciated. Furthermore, built-in features
such as navigation, versioning or access control can
save a lot of time.
Nevertheless, it is important to keep in mind that the
efficiency of both solutions depends largely on
the nature of the use case. The agility of JCR should
not obscure this aspect. Furthermore, agility is
linked in no small way to the project team. Thus,
saying that JCR is a way to achieve agility is too
big a shortcut.
In all cases, the choice of a database technology
should be discussed during the inception and
elaboration phases of the first iteration of the
development process. This can be done by weighing
the different parameters. Changing the persistence
technology cannot easily be achieved after the first
iteration. Consequently, this choice will have a
strong impact on the rest of the project.
6 Product comparison
Choosing between database products implies the
use of different criteria. We can mention the
compliance with a standard, the additional features
proposed by the provider, the support offered by a
company or by a community, or the scalability of the
solution. All these criteria are important. They
should be weighed carefully and a choice made
depending on the situation.
In our context, basic and significant differences
distinguish Java content repositories from relational
databases. Thus, a decision to employ one
technology instead of another should be taken at a
lower level. However, at the product level,
people often ask whether, in terms of performance,
they should use a relational database or a Java
content repository to manage their hierarchical
information.
This section tries to answer this question
by recalling some basic theoretical concepts
which relate to data structures and to the cost of
associations. Then, at a more practical level, a
benchmark of several database products checks
whether these assumptions hold.
6.1 Theoretical analysis
In general, database products use basic data
structures to manage their data. This section recalls
simple concepts which relate to these
structures and to the cost of associations made
between data items. The goal is to determine whether
a product’s performance will be significantly impacted
by the underlying approach.
Hierarchical and network database
In the hierarchical and network models, associations
are made by storing references or pointers between
items. The advantage of this kind of structure is that,
because each node stores direct references to
other nodes, a constant number of read accesses
is needed to go from one node to its target.
Creating an association between two nodes also has
a constant cost because the number of operations
needed to perform this is always the same.
Thus, the cost of crossing and creating associations
is constant and can be written as O(1) in big O
notation. Some people say that these associations
are pre-computed.
Several strategies allow the representation of directed
graphs such as those needed by the hierarchical and
the network models. The most classical
representations are adjacency lists and
adjacency matrices (17). Generally, the choice
between the two approaches is made
simply by analyzing the density of the graph.
If the number of arcs in the graph is close to
the square of the number of nodes, selecting an
adjacency matrix will give a better result. However,
the JCR model is mainly driven by hierarchical
associations. In this context, the number of arcs will
not be much larger than the number of nodes. Thus, an
adjacency list will be more economical with memory,
requiring only the space needed to store
the associations. It is also interesting to note that this
kind of organization makes it fairly easy to give an
order to the children of a node.
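A rough way to see the memory argument: a hierarchy of n nodes has n - 1 parent-child arcs, so an adjacency list stores n - 1 entries while an adjacency matrix reserves n × n cells. The Java sketch below only counts storage slots under that assumption; the class and method names are illustrative.

```java
// Illustrative storage count for a hierarchy (tree) of n nodes.
final class GraphStorageSketch {

    // An adjacency matrix reserves one cell per ordered pair of nodes.
    static long matrixCells(long nodes) {
        return nodes * nodes;
    }

    // An adjacency list stores one entry per arc; a tree has nodes - 1 arcs.
    static long listEntriesForTree(long nodes) {
        return nodes == 0 ? 0 : nodes - 1;
    }
}
```

For a hierarchy of 31 nodes, this gives 961 matrix cells against 30 list entries.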
Figure 6.1-1: A hierarchy and its adjacency matrix
Implementing this with a programming language can
be accomplished by using several data structures
such as arrays, maps or hash tables. Other
solutions could also be presented, but the main idea
is that crossing an association has a constant cost
and that crossing a graph has a cost which is
proportional to the number of arcs and nodes
traversed. Thus, managing this kind of data is cost
effective.
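As a minimal sketch of this idea, assuming a hash map from each node to the ordered list of its children (the class and method names are hypothetical), adding an association is a constant-cost append on average and a traversal visits each node and arc of the sub-hierarchy once:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hierarchy stored as an adjacency list: each node maps to its ordered children.
final class AdjacencyListSketch {
    private final Map<String, List<String>> children = new HashMap<>();

    // Creating an association is an append to the parent's child list.
    void addChild(String parent, String child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    // Depth-first traversal; each node and each arc below `start` is visited
    // once, so the cost grows with the size of the sub-hierarchy traversed.
    List<String> traverse(String start) {
        List<String> visited = new ArrayList<>();
        visit(start, visited);
        return visited;
    }

    private void visit(String node, List<String> visited) {
        visited.add(node);
        List<String> kids = children.getOrDefault(node, Collections.<String>emptyList());
        for (String child : kids) {
            visit(child, visited);
        }
    }
}
```

Because each child list is ordered, the traversal also returns siblings in the order in which they were added.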
Relational database
In the relational model, associations are made
between relations by computing the matching values
stored in two domains. This allows for the expression
of all imaginable associations between two or more
data sets.
What is the cost of computing and
creating associations in a relational database? To
compute an association, a relational database has to
cross the targeted set to find the matching values. In
this case, the cost of the association is O(n),
with n the number of tuples stored in the source and
in the target. However, most database products
provide indexing facilities such as B-tree indexes.
So, in most cases, finding the matching entries has a
cost of O(log(n)). While B-tree indexes are good,
some articles (18) argue that in the network models,
because associations are pre-computed, it is
possible to reach better performance.
However, in most cases there is no need to use
comparison operators other than “=” or “≠” to
express relationships as they are presented in a
hierarchical or network model. Consequently, hash
indexes can be used on the domains which
constitute the association. If the relational database
provides good hash index implementations, the
cost of retrieving data through associations will be
close to O(1). Adding new items to the targeted sets
and to the index also has a constant cost of O(1).
Thus, there are virtually no significant
differences between the associations of the relational
model and those of the hierarchical model.
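The difference between scanning and probing a hash index can be sketched in Java as follows. The row layout {id, parentId} and all names are assumptions made for illustration; a real database engine manages its indexes internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rows of a hypothetical "node" relation, each row being {id, parentId}.
final class ParentLookupSketch {

    // Without an index: every tuple is examined, O(n).
    static List<Integer> childrenByScan(int[][] rows, int parentId) {
        List<Integer> result = new ArrayList<>();
        for (int[] row : rows) {
            if (row[1] == parentId) {
                result.add(row[0]);
            }
        }
        return result;
    }

    // Builds a hash index on the parent column (done once, O(n)).
    static Map<Integer, List<Integer>> buildParentIndex(int[][] rows) {
        Map<Integer, List<Integer>> index = new HashMap<>();
        for (int[] row : rows) {
            index.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }
        return index;
    }

    // With the index: a single hash probe, close to O(1) on average.
    static List<Integer> childrenByIndex(Map<Integer, List<Integer>> index, int parentId) {
        return index.getOrDefault(parentId, Collections.<Integer>emptyList());
    }
}
```

Both lookups return the same children; only the number of tuples examined differs.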
6.2 Benchmark
The previous section has summarized a huge problem
very succinctly, and too quickly. However, the main
point to keep in mind is that intolerable differences
should not appear whether hierarchical data is managed
with a content repository or with a relational database.
The following benchmark has been done to verify this
assumption.
Four products are included in this benchmark. CRX
is a native implementation of the JCR specification.
The persistence of the items is managed with a
proprietary technology which is based on the tar file
format (19) and implemented in Java. H2
and Derby are two open source relational databases
written in Java. MySQL is one of the most widely
used open source databases.
A simple wrapper has been defined for this
benchmark. This wrapper offers basic functions
to create trees made of nodes and properties. The
CRX wrapper directly uses the functionalities
provided by the API. The SQL wrapper uses a simple
database schema. One table stores the nodes and
the other table stores the properties. The
associations between items are managed with a
parent foreign key and the default indexes of the
database are used on all fields. Queries are executed
through JDBC, with prepared statements to avoid
parsing the SQL statements each time.
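The paper does not list the schema, so the following Java sketch is only one plausible shape for it. The table names, column names and types are assumptions, and index definitions are left to the database defaults, as in the text. It shows the two tables, the parent foreign key, and a prepared statement which is created once and reused for every insertion.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

// Hypothetical schema: table and column names are assumptions, not the paper's.
final class SqlTreeWrapperSketch {

    // One table for the nodes, linked to their parent by a foreign key.
    static final String CREATE_NODE =
        "CREATE TABLE node ("
        + "id BIGINT PRIMARY KEY, "
        + "parent_id BIGINT, "
        + "node_name VARCHAR(255) NOT NULL, "
        + "FOREIGN KEY (parent_id) REFERENCES node(id))";

    // One table for the properties, each attached to a node.
    static final String CREATE_PROPERTY =
        "CREATE TABLE property ("
        + "id BIGINT PRIMARY KEY, "
        + "node_id BIGINT NOT NULL, "
        + "prop_name VARCHAR(255) NOT NULL, "
        + "prop_value VARCHAR(255), "
        + "FOREIGN KEY (node_id) REFERENCES node(id))";

    static final String INSERT_NODE =
        "INSERT INTO node (id, parent_id, node_name) VALUES (?, ?, ?)";

    // Prepared once per connection so the SQL text is not parsed on each call.
    static PreparedStatement prepareInsert(Connection connection) throws SQLException {
        return connection.prepareStatement(INSERT_NODE);
    }

    // Inserts one node; a null parentId marks the root of the hierarchy.
    static void insertNode(PreparedStatement stmt, long id, Long parentId, String name)
            throws SQLException {
        stmt.setLong(1, id);
        if (parentId == null) {
            stmt.setNull(2, Types.BIGINT);
        } else {
            stmt.setLong(2, parentId);
        }
        stmt.setString(3, name);
        stmt.executeUpdate();
    }
}
```

Running it requires a JDBC driver for the chosen database on the classpath; the sketch itself depends only on the `java.sql` interfaces.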
The benchmark is composed of four parts which all
measure the time required to perform an operation on
hierarchies of different sizes. Each node of these
base hierarchies has 5 sub-nodes and 5 properties,
except leaves, which only have 5 properties. The first
hierarchy has one level. Each following one
includes one more level. The tests have been
run 5 times on a Dell Latitude D820 running
Windows XP (processor: Intel Core Duo 2.00
GHz, virtual memory: 2.00 GB). The average result is
used in the following diagrams.
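The sizes of the base hierarchies follow directly from this description. As a small worked example (the class and method names are illustrative), the Java sketch below counts the nodes level by level and adds the properties of each node. For one to five levels it gives 36, 186, 936, 4686 and 23436 items, which are the values shown on the horizontal axis of the benchmark charts.

```java
// Counts the items (nodes plus properties) of a complete hierarchy.
final class HierarchySizeSketch {

    // levels: number of levels below the root; every non-leaf node has
    // `fanout` sub-nodes and every node has `properties` properties.
    static long itemCount(int levels, int fanout, int properties) {
        long nodes = 0;
        long nodesAtLevel = 1; // the root
        for (int level = 0; level <= levels; level++) {
            nodes += nodesAtLevel;
            nodesAtLevel *= fanout;
        }
        // Each node counts as one item, plus one item per property.
        return nodes * (1 + properties);
    }
}
```

For example, `itemCount(1, 5, 5)` counts 6 nodes with 5 properties each, that is 36 items.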
Writing the hierarchy
This test measures the time required to create the base hierarchy. The throughputs correspond to the time needed to write one item of the hierarchy. While the differences seem huge, all the throughputs are constant. The assumption that native implementations of JCR and relational databases should be equivalent in terms of performance holds in this case. MySQL cannot be embedded in the application, which has a high impact on its result. H2 does not appear in the chart because its performance for write accesses is too good.
Reading the hierarchy
This test consists of reading once all the items of the base hierarchy, from the root to the leaves. The throughputs displayed in the chart correspond to the average time needed to read one item of the hierarchy. For most databases the results seem to be constant. Derby is simply out of range: when recursive queries are performed on this database, the results are not tolerable.
[Charts for the writing and reading tests: milliseconds (vertical axis) against items (horizontal axis: 36, 186, 936, 4686, 23436); series: crx, h2, mysql, derby]
Randomly writing the hierarchy
The test consists of randomly writing 100 sub-hierarchies in the base hierarchy. Each sub-hierarchy has a depth of 2 levels. Each node has two sub-nodes and two properties. Thus, each sub-hierarchy is composed of 21 items. The throughputs relate to the average time required to create all the items of one sub-hierarchy. The results are quite similar to those of the first test. The good point is that all the databases have constant results.
Randomly reading the hierarchy
The test consists of randomly reading 100 nodes and their descendants on two levels in the base hierarchy. The throughput relates to the average time required to read one node and its descendants. As in the second test, Derby is out of range; the same problem is encountered with recursive queries. It appears that CRX is well optimized for these situations. To be really pertinent, this test should be run on bigger hierarchies. However, the difference between the results is constant and relational databases do not show extremely bad performance for recursive queries.
6.3 Synthesis
As shown in this chapter, performance should not be
used as the main argument to choose one technology
over another. The aspects mentioned in the previous
chapters are more important. The choice should relate
to the nature of the problem which has to be solved
and not to the nature of the product.
The assumption that relational databases are able to
manage hierarchical data effectively is true. However,
this does not mean that Java content repositories
should be implemented as a layer over relational
databases. Some basic concepts of the two
specifications do not match, and a relational
schema for JCR which includes all the aspects of the
specification will look unsuitable. More modularity (3)
in the database world could benefit both
approaches. As long as this goal is not achieved, a
native implementation of JCR is probably the better of
the proposed solutions.
[Charts for the random writing and random reading tests: milliseconds (vertical axis) against items (horizontal axis: 36, 186, 936, 4686, 23436); series: crx, h2, mysql, derby]
7 Scenario Analysis
The following diagram synthesizes the main aspects pointed out during the whole comparison process. Four use
cases characterized by different features are then briefly analyzed with regard to their respective requirements and
to the presented approaches.
Data Model Level
Structure
  JCR: Unstructured; Semi structured; Structured
  RDBMS: Structured
Integrity
  JCR: Entity integrity; Domain integrity; Referential integrity; Transitive integrity in hierarchies
  RDBMS: Entity integrity; Domain integrity; Referential integrity; Tools to manage data coherency
Operations and Queries
  JCR: Selection; Equi-join operations; Full text search operation; Transitive queries on hierarchies
  RDBMS: Selection; Projection; Rename; Join operations; Domain operation; Create, read, update, delete statements
Navigation
  JCR: Navigation API; Traversal access; Direct access; Write access
  RDBMS: Not supported
Specification Level
Inheritance
  JCR: Node types inheritance; Node inheritance
  RDBMS: Not supported
Access control
  JCR: Record level
  RDBMS: Table and Column level; Record level not supported
Observation
  JCR: Record level; Un-persisted event listeners; Application interaction supported
  RDBMS: Table level; Persisted triggers; Application interaction not supported
Version control
  JCR: Supported
  RDBMS: Not supported
Project Level
Schema understandability
  JCR: DataGuides or Graphs; Summarize the architecture; Not impacted by many-to-many associations
  RDBMS: Entity Relationship; Represent the whole architecture; Impacted by many-to-many associations
Code complexity
  JCR: Simple for Navigation; Complex for Operations
  RDBMS: Complex for Navigation; Simple for Operations
Changeability
  JCR: More agile; Decoupled from the application
  RDBMS: More rigid; Coupled with the application
7.1 Survey
An agency wants to implement an application which
is able to carry out surveys over the web. This tool
should make it possible to collect data from
questionnaires, to configure the type of answers,
and to aggregate the survey’s results in a suitable
form.
Main characteristics of the application:
- All the entities can easily be identified at design time. (Structure)
- Some verification has to be made on the data. (Integrity)
- The results aggregation implies complex operations. (Operations and Queries)
- Once in production the application will not evolve to a great degree. (Project)
The choice of a relational database for this kind of
scenario is probably the best alternative. The
features provided by a content repository will not
really be used. Furthermore, programming the
operations on top of a content repository would only
add complexity to the application.
7.2 Reservation
An event organizer wants a portal which gives customers
the opportunity to buy tickets for events. The event
organizer should be able to create events
characterized by a name and a short description.
The customer should be able to browse and search
the event catalogue and to order tickets. In addition,
the event organizer wants to monitor
sales and manage prices depending on the
success of the event.
Main characteristics of the application:
- All the entities can easily be identified at design time. (Structure)
- Some verification has to be made on the data. (Integrity)
- Monitoring the sales can imply some operations on the dataset. (Operations and Queries)
- Browsing and searching the catalogue require traversal and direct access. (Navigation)
- As a strategic application, the application is subject to improvements. (Project)
This application has strong needs which relate to
the relational database world. The clear structure
linked to the management of orders and events
could lead us to conclude that a relational database
is the ideal candidate. However, the need for
navigation and the potential extensions linked to the
catalogue could benefit from the features of a
content repository.
A balanced approach could consist of storing the
orders in a relational database and using a content
repository for the events catalogue. This also fits
particularly well with the fact that the catalogue will
mainly be subject to read access and the ticketing
service to write access. This should not be a
problem because complex interactions between the
JCR and the RDBMS can be managed with the Java
Transaction API. Making hybrid decisions can in
certain contexts allow us to benefit from both
approaches, thus having the best of both worlds.
7.3 Content management
A publisher wants an application able to
manage all the content generated by its
collaborators. The content will be composed of
videos, photos, text or anything else. Several
taxonomies should be available to organize the
content. The main purpose of the publisher is to
offer a coherent set of features which make it easy
to retrieve resources of each type and to
reuse them in different contexts or in
other publications.
Main characteristics of the application:
- The editor wants to take into consideration that new entities of content could appear. (Structure)
- The main verifications on the data concern viruses. (Integrity)
- Searching requires full-text indexing. (Operations and Queries)
- Taxonomies imply simple operations. (Operations and Queries)
- Exploration is needed everywhere. (Navigation)
- Future improvements could imply versioning, observation and access control. (Specification features)
- The system will continuously evolve with the enterprise. (Project)
The flexibility and the features provided by JCR are
typically made for these types of scenarios. Content
as understood here is difficult to store in a relational
database. Furthermore, all the complex
requirements such as versioning or access control
can be included during the application life cycle
without too much of a problem.
7.4 Workflow
An editor wants to manage the interactions of his
collaborators. The situation could be the following:
The editor in chief and the board decide which
subjects have to be treated in the next edition of a
publication. These subjects are communicated to the
workforce (journalists and photographers). Once
edited, the articles are sent for proofreading. Once
they are corrected, the editor in chief is notified. He
decides if the article can be published or not. If the
article is to appear in the publication, it is sent to a
typography service which produces a layout which
includes pictures. Once the publication integrates all
the articles and all the pictures, the editor in chief will
read it once again and take the decision to publish it
or not.
Main characteristics of the application:
- The entities are composite and difficult to design. (Structure)
- The structure mainly involves graphs. (Structure)
- Editing and exploring the process implies traversal access. (Navigation)
- Notifications imply observing local events. (Observation)
- Notifications imply interactions between the data model and the application. (Observation)
This kind of scenario involves semi-structured
models in conjunction with good observation
capabilities. While the other features proposed by
JCR such as versioning or access control do not
directly find an application, the foundations of the
model will really be appreciated in this case. The
workflow structure can be directly designed with
nodes and items and once instantiated the workflow
will clearly benefit from the observation mechanisms
proposed by JCR.
8 Conclusion
The choice of a data model or of a database is often
arbitrary. Sometimes, specific technologies are
imposed by an enterprise policy or simply by
irrational preferences. When the time comes to
choose a technology, the right arguments are
seldom put forward. Furthermore, the myth of a general
multi-purpose database is still ingrained in some
minds and people are always looking for a magic
solution which can be used in all imaginable
circumstances.
Today, the cohabitation of several infrastructure
components can be achieved with minimal effort. A
platform such as J2EE provides tools to manage
distributed resources. In this context, the choice of a
data model or of a database should not be reduced
to an arbitrary decision.
As shown in the “Scenario Analysis” chapter, a
pragmatic analysis gives quick results. The
technology which fits in best with the requirements
can be identified and used to the greatest effect. In
some cases, hybrid strategies can also be adopted.
A coherent choice can lead to significant advantages
and this question should always be discussed during
the early phases of each project.
Relational databases have been used successfully
for many years. However, the growing power of the
user and the rigidity of the relational approach make
it difficult to implement features which are actually
required by some applications. It’s possible to push
the boundaries of the model but the constraints of
time and money make it difficult to do so correctly.
Some frameworks are partially effective in solving
these problems. However, depending on a middleware
layer for features such as access control, navigation, or
versioning only pushes the hot potato to a higher level.
This does not really solve the problem but adds
complexity to the overall environment.
Java content repositories cannot replace relational
databases in every situation. In practice, the features
offered by the API fit very well with the
requirements encountered in content management
and collaborative applications.
Nevertheless, JCR enriches the debate around
databases and data models in relation to two
important aspects. Firstly, JCR includes some
features at the data model and specification levels.
Secondly, the specification is aware of its
environment and takes into account that Java content
repositories can be used in conjunction with other
infrastructure components. This is not the case for a
specification such as SQL.
This tendency seems relatively new but will probably
be consolidated during the next few years. As a
precursor, Day can play an important role
in this debate and gain in reputation. Some
challenges will arise with the growing popularity of
the JCR specification. Selecting good opportunities
should allow for the database field to make its mark.
This in turn will create a footprint that will overflow
into the world of infrastructure components.
9 Appendix – JCR and design
As mentioned in the “data model comparison”
chapter, a Java Content Repository schema is
dynamic and evolves with the content. The structure
appears when nodes and properties are instantiated.
However, during the development process the need
to establish a semantic for the repository appears.
Several publications on semi-structured
approaches propose solutions for representing
these schemas (20) (21). These representations are
called DataGuides (DG) or Approximate DataGuides
(ADG). The less elaborate version can visually
capture the organization of semi-structured
databases. The JCR specification (4) uses graphs to
represent examples of the structures which can be
found in a content repository.
DataGuides and other graph notations fit
particularly well with Java Content Repositories but
are not expressive enough to be used as
implementation blueprints. The goal of this appendix
is to summarize the possibilities offered by JCR to
organize content and to enrich the notation
proposed in the specification so that it can
communicate the whole semantic of a repository.
9.1 Model
The most common relationship provided by the
model is composition. Semantic items can be
instantiated as nodes and properties. A node can be
composed of sub-nodes and properties. A property
can only be composed of values. Except for the root
node, all nodes and properties are
components.
As seen, some properties allow horizontal
relationships to be created between the branches of a
hierarchy. A common relationship is achieved by
storing one or more path values in a node property.
This method has an advantage because the
hierarchical property of the target can be used in
queries. Another relationship consists of storing one or
more UUID values in a node property. This maintains
the validity of the link even if the target is moved.
Either of these approaches can be appropriate
depending on the context.
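The difference between the two kinds of link can be illustrated with a toy in-memory repository. This is not the JCR API, only a sketch with hypothetical names: a stored path stops resolving once the target is moved, whereas a stored UUID still leads to the target at its new location.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy in-memory repository; this is NOT the JCR API, only an illustration.
final class ReferenceSketch {
    private final Map<String, UUID> idByPath = new HashMap<>();
    private final Map<UUID, String> pathById = new HashMap<>();

    // Adds a node at the given path and returns its generated identifier.
    UUID add(String path) {
        UUID id = UUID.randomUUID();
        idByPath.put(path, id);
        pathById.put(id, path);
        return id;
    }

    // Moves a node: its path changes, its identifier does not.
    void move(String oldPath, String newPath) {
        UUID id = idByPath.remove(oldPath);
        if (id == null) {
            throw new IllegalArgumentException("No node at " + oldPath);
        }
        idByPath.put(newPath, id);
        pathById.put(id, newPath);
    }

    // Checks a stored path reference; false once the target has moved.
    boolean pathResolves(String path) {
        return idByPath.containsKey(path);
    }

    // Resolves a stored UUID reference to the target's current path.
    String resolve(UUID id) {
        return pathById.get(id);
    }
}
```

The paths used with such a sketch, for example "/events/2008/concert", are invented for the illustration.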
9.2 Convention
Semantic items which will be instantiated as nodes or
properties are respectively represented by circles
and boxes. The circle’s label refers to the node-type,
the box’s label to the property-type. Without a label,
the circle or the box means that the node can be
found. An empty circle means that everything which
is not mentioned is allowed under the semantic item
(black list). A barred circle means that everything
which is not mentioned is not allowed (white list). An
empty box means that the property is simple. A box
which contains an “M” means that the property can
store multiple values.
Composition associations are represented by
filled arrows which link two semantic items. The
arrow’s label refers to the relative path which links
the two semantic items. Only descendant relative
paths are allowed. Stars (*) and variables
(<variable>) can be used to express patterns in the
path. Without a label, the arrow means that a
semantic item such as the one targeted can be found
anywhere under the source. The arrow can end
with a cardinality (1..N). Without cardinality the
meaning is N.
Horizontal associations are represented by dotted
arrows. They always start from a box and finish on a
circle. No labels are put on these arrows. They are
only used to give implementation information. The
arrow can end with a cardinality (1..N). Without
cardinality the meaning is N.
Inheritance associations between semantic items
can be represented by empty arrows. They should
always go from the bottom to the top. No labels are
put on these arrows. The elements which are
represented with a bold style are mandatory. If
specific constraints have to be declared they can be
shown as comments in the diagram.
9.3 Methodology
A JCR semantic can be designed with different approaches. If a development process is used, the
semantic will be obtained by translating the application’s diagrams. The approach proposed here consists of six
steps which can be executed iteratively and which result in a semantic blueprint which can be implemented in
a repository.
Step 1: Identifying the semantic items
  Input: Existing semantic, Requirement
  Output: Semantic items
  Activity: Identifying the concepts which relate to the requirement and which have to be localized in the repository.
Step 2: Identifying the inheritance relationships
  Input: Existing semantic, Requirement, Semantic items
  Output: Inheritance semantic
  Activity: Identifying inheritance relationships between the semantic items.
Step 3: Identifying the hierarchical relationships
  Input: Existing semantic, Requirement, Semantic items
  Output: Hierarchical semantic
  Activity: Identifying hierarchical relationships between the semantic items. Thinking in terms of composition.
Step 4: Identifying the horizontal relationships
  Input: Existing semantic, Requirement, Semantic items
  Output: Horizontal semantic
  Activity: Identifying horizontal relationships between semantic items. Identifying the relationship types. Thinking in terms of association or aggregation.
Step 5: Defining cool structure artifacts
  Input: Existing semantic, Requirement, Hierarchical semantic, Horizontal semantic
  Output: Organizational semantic
  Activity: Identifying the patterns which link hierarchical semantic items.
Step 6: Carefully defining the integrity rules
  Input: Existing semantic, Requirement, Semantic items, Inheritance semantic, Hierarchical semantic, Horizontal semantic, Organizational semantic
  Output: New semantic
  Activity: Only if necessary, declaring in the semantic the level of coherence which has to be preserved at a repository level.
9.4 Application
Based on a very simple use case, this section shows
how the methodology and the notation previously
defined can be applied. The purpose is to deliver a
blueprint which shows how data is organized and all
data aspects required to build the application.
The specifications of the case are as follows: A blog
application deals with posts. A post always stores its
creation date and should contain some information
such as text, images, etc. A post can belong to zero
or one category and can have zero to an infinite
number of tags. A category can have subcategories.
From any category it should be possible to find all the
posts which relate to it and to its subcategories.
When a category is deleted, the related posts are not
deleted. Anonymous readers can respond to posts
with comments. For navigation, it may be useful to
organize posts by years, months and dates.
Step 1: Identifying the semantic items
  Comment: Properties do not have to be localized in the repository.
Step 2: Identifying the inheritance relationships
  Comment: The requirement does not contain inheritance but we could imagine this kind of relation.
Step 3: Identifying the hierarchical relationships
  Comment: Posts and categories are not linked with a composition relationship.
Step 4: Identifying the horizontal relationships
  Comment: To satisfy the requirement, posts are linked to categories with path values and to tags with UUID values.
Step 5: Defining cool structure artifacts
  Comment: The year, month, day pattern is part of the hierarchical association.
Step 6: Carefully defining the integrity rules
  Comment: In our case we only have to ensure that a post always has a creation date.
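The decisions taken in these steps can be summarized in a small sketch. It is not written against the JCR API: the `BlogRepository` class, its method names and the `/posts` and `/categories` root paths are assumptions made for the illustration. Posts are stored under a year, month, day hierarchy, reference their category by path and their tags by UUID, and must carry a creation date.

```python
import uuid
from datetime import date


class BlogRepository:
    """Toy in-memory model of the blog blueprint (hypothetical names)."""

    def __init__(self):
        self.nodes = {}  # path -> properties of the node
        self.tags = {}   # UUID -> tag name

    def add_category(self, path):
        self.nodes[path] = {"type": "category"}

    def add_tag(self, name):
        tag_id = str(uuid.uuid4())
        self.tags[tag_id] = name
        return tag_id

    def add_post(self, name, created, text, category=None, tags=()):
        # Step 6: the only integrity rule is the mandatory creation date.
        if created is None:
            raise ValueError("a post always stores its creation date")
        # Step 5: posts are organized by year, month and day.
        path = "/posts/%04d/%02d/%02d/%s" % (
            created.year, created.month, created.day, name)
        # Step 4: category stored as a path value, tags as UUID values.
        self.nodes[path] = {"type": "post", "created": created, "text": text,
                            "category": category, "tags": list(tags)}
        return path

    def posts_in_category(self, category_path):
        # The hierarchical nature of the path value covers subcategories.
        prefix = category_path + "/"
        return sorted(
            path for path, props in self.nodes.items()
            if props["type"] == "post" and props["category"] is not None
            and (props["category"] == category_path
                 or props["category"].startswith(prefix)))

    def delete_category(self, category_path):
        # Posts are not children of categories (step 3), so they survive.
        prefix = category_path + "/"
        doomed = [path for path, props in self.nodes.items()
                  if props["type"] == "category"
                  and (path == category_path or path.startswith(prefix))]
        for path in doomed:
            del self.nodes[path]


blog = BlogRepository()
blog.add_category("/categories/tech")
blog.add_category("/categories/tech/java")
jcr_tag = blog.add_tag("jcr")
first = blog.add_post("hello", date(2008, 12, 31), "First post",
                      category="/categories/tech/java", tags=[jcr_tag])
blog.add_post("misc", date(2008, 11, 2), "No category")
tech_posts = blog.posts_in_category("/categories/tech")
blog.delete_category("/categories/tech")
```

In this run `tech_posts` contains only the first post, found through its subcategory, and the post is still present after the category has been deleted. Its category property then holds a path which no longer resolves, which is the expected cost of a path reference.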
10 Appendix – Going further
Only a few subjects have been mentioned in this
report. This appendix presents three fields which
relate to JCR and to databases in general. These
fields could benefit from being studied in more depth.
Furthermore, some existing products could be
improved if these questions were addressed.
10.1 Queries in semi-structured models
In the JCR Model, the notions of sets, relations and
domains, which provide the means of expressing first
order logic statements over the model, are present
but currently not formally defined. It seems that at
present, node-types are seen as relations, properties
as domains, nodes as tuples and properties’ values as
attributes.
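This informal reading can be sketched as follows. The code is only an interpretation under stated assumptions: nodes are plain dictionaries, the `node_type` key and both function names are hypothetical, and nothing here comes from the JCR specification. Each node-type becomes a relation whose tuples are the nodes of that type.

```python
from collections import defaultdict


def nodes_to_relations(nodes):
    """Group nodes by node-type: one relation per type, one tuple per node."""
    relations = defaultdict(list)
    for node in nodes:
        row = {name: value for name, value in node.items()
               if name != "node_type"}
        relations[node["node_type"]].append(row)
    return dict(relations)


def observed_properties(rows):
    """Return the union of the property names found in the tuples.

    In a semi-structured model two nodes of the same type may carry
    different properties, so the attribute list is observed, not declared.
    """
    return sorted({name for row in rows for name in row})


nodes = [
    {"node_type": "post", "title": "Hello", "created": "2008-12-31"},
    {"node_type": "post", "title": "Misc", "author": "anonymous"},
    {"node_type": "category", "name": "tech"},
]
relations = nodes_to_relations(nodes)
post_properties = observed_properties(relations["post"])
```

The two post tuples do not share the same attributes, which is one way of seeing why these notions are harder to define formally than in the relational model.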
The fact that these notions are well defined in
relational databases provides advantages. For
example, on this basis, some database engines are
able to analyze queries and to optimize them with
regard to the structure. In semi-structured databases,
query optimization is a known issue and research is
still being conducted in this area (20).
It is currently not clear whether the mapping proposed by
JCR could ensure more efficiency when queries are
performed. Further work on this question and further
improvements of JCR’s query model could be a
very interesting field of investigation.
10.2 Queries on transitive relationships
The model proposed by JCR stores the hierarchical
path of each node. This allows queries on transitive
relationships in hierarchies to be performed by
using the path property. Assuming a tree structure
limits the total number of paths to the number of
leaves.
Doing the same for horizontal relationships is a bit
more problematic. To summarize, in a network
structure, the cost of pre-computing all the paths is not
proportional to the number of leaves but to the square
of the number of nodes (11). The storage capacity
required to store the transitive paths between the
nodes also grows in a similar manner.
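The difference in growth can be observed on two small structures with the same number of nodes. The sketch below is an illustration only: the function names are hypothetical and the transitive closure is computed with a plain breadth-first search, not with the algorithms discussed in (11).

```python
from collections import deque


def leaf_count(nodes, edges):
    """Count the nodes which have no outgoing edge."""
    parents = {source for source, _ in edges}
    return sum(1 for node in nodes if node not in parents)


def reachable_pairs(nodes, edges):
    """Count ordered pairs (a, b), a != b, where b is reachable from a.

    This is the number of entries a pre-computed transitive closure
    would have to store.
    """
    successors = {node: [] for node in nodes}
    for source, target in edges:
        successors[source].append(target)
    total = 0
    for start in nodes:
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for following in successors[current]:
                if following not in seen:
                    seen.add(following)
                    queue.append(following)
        total += len(seen) - 1
    return total


nodes = list(range(7))
# A binary tree with 7 nodes: paths to the leaves are enough.
tree = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
# A network in which the 7 nodes form a cycle: every node reaches all others.
cycle = [(i, (i + 1) % 7) for i in range(7)]

tree_leaves = leaf_count(nodes, tree)
cycle_pairs = reachable_pairs(nodes, cycle)
```

With 7 nodes the tree has 4 leaves, while the closure of the cycle holds 7 x 6 = 42 pairs. For n nodes the cycle yields n(n-1) pairs, which grows with the square of the number of nodes.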
Some use cases, such as those which involve social
networks, need to store this kind of relationship.
Defining a standardized way to manage this could be
very useful in some situations. However, it demands
that some research be made on finding the best
algorithms and solutions which relate to this problem.
Furthermore, query languages based on first order
logic are limited when having to define queries on
transitive closures and transitive relationships in
general. It is in this area that
improvements still have to be made.
10.3 Modular and configurable databases
As shown in the "product comparison" chapter, the
relational model is able to manage hierarchical
relationships efficiently. Therefore, is it really
necessary or intelligent to implement, from the
ground up, a data model which can be constructed
from another, with approximately the same results?
Some reasons could lead to this conclusion.
However, the basic differences between JCR and
SQL cannot be omitted. For example, does it make
sense to create a procedural API over a declarative
query language which will be retranslated into
declarative calls in the database? While the cost
related to the parsing of a query is insignificant, it is
still a reason indicating that it is preferable not
to proceed in this manner.
In reality, databases are presently used for many
different purposes in many different contexts. A few
applications are embedding databases to manage
small data sets in single client applications while
others are dealing with thousands of connections
and scalability problems. In this context, a
multipurpose monolithic database is unimaginable,
even mythical. Margo Seltzer promotes a more
modular and configurable approach to building
databases (3). These recommendations lead
developers to use database components at
different levels depending on their needs.
JCR and SQL are two high level backend solutions
which have possibilities but also limits. Their
significant differences do not mean they do not have
common denominators. More modularity in their
architecture could give a better understanding of
their behavior. This could also allow them to share
components and to be adapted more easily to
specific requirements and contexts.
11 Bibliography
1. Tsichritzis, D. C. and Lochovsky, F. H. Hierarchical
Data-Base Management: A Survey. New York, New York :
ACM, 1976.
2. Codd, E. F. A Relational Model of Data for Large
Shared Data Banks. San Jose, California : ACM, 1970.
3. Seltzer, Margo. Beyond Relational Databases. ACM
Queue. New York, New York : s.n., 2005.
4. Nuescheler, David and Piegaze, Peeter. Content
Repository API for Java™ Technology Specification. s.l. :
Java Community Process, 11 May 2005. version 1.0.
5. —. Content Repository API for Java™ Technology
Specification. s.l. : Java Community Process, 2 July 2007.
version 2.0 Public Review.
6. Mazzocchi, Stefano. Data First vs. Structure First.
Stefano’s Linotype. [Online] July 28, 2005.
http://www.betaversion.org/~stefano/linotype/news/93/.
7. Chaudhuri, Surajit. An Overview of Query Optimization
in Relational Systems. Redmond, Washington : ACM,
1998.
8. Buneman, Peter. Semistructured Data. Tucson,
Arizona : ACM, 1997.
9. Aho, Alfred V. and Ullman, Jeffrey D. Universality of
data retrieval languages. San Antonio, Texas : ACM,
1979.
10. Database Language SQL. Information Technology.
[Online] July 30, 1992.
http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt.
11. Li, Zhe and Ross, Kenneth A. On the cost of
Transitive Closures in Relational Databases. New York,
New York : Columbia University Press, 1993.
12. Tropashko, Vadim. Trees in SQL: Nested Sets and
Materialized Path. DBAzine.com. [Online] April 13, 2005.
http://www.dbazine.com/oracle/or-articles/tropashko4.
13. Bachman, Charles W. The Programmer as Navigator.
Waltham, Massachusetts : ACM, 1973.
14. Distributed Transaction Processing: The XA
Specification. s.l. : The Open Group for distributed
transaction processing, 1991.
15. Chen, Peter Pin-Shan. The Entity-Relationship
Model-Toward a Unified View of Data. Cambridge,
Massachusetts : ACM, 1976.
16. Introduction to OpenUP. OpenUp. [Online] October 27,
2008. http://epf.eclipse.org/wikis/openup/.
17. Cormen, Thomas H., et al. Introduction to Algorithms,
Second Edition. Cambridge, Massachusetts : The MIT
Press, 2001.
18. Bates, Duncan. Embedded databases: Why not to
use the relational data model. Embedded Computing
Design. [Online] January 01, 2008.
http://www.embedded-computing.com/articles/id/?2569.
19. Müller, Thomas. CRX Tar PM. dev.day.com. [Online]
Day Software AG, November 11, 2008.
http://dev.day.com/microsling/content/blogs/main/tarpm.html.
20. Goldman, Roy and Widom, Jennifer. DataGuides:
Enabling Query Formulation and Optimization in
Semistructured Databases. Palo Alto, California : Stanford
University Press, 1997.
21. —. Approximate DataGuides. Palo Alto, California :
Stanford University Press, 1999.
22. Nuescheler, David. David's Model: A guide for blissful
content modeling. Jackrabbit Wiki. [Online] August 22,
2007. http://wiki.apache.org/jackrabbit/DavidsModel.
23. Mishra, Priti and Eich, Margaret H. Join Processing in
Relational Databases. Dallas, Texas : ACM, 1992.