
JCR or RDBMS why, when, how?

Bertil Chapuis

12/31/2008

Creative Commons Attribution 2.5 Switzerland License

This paper compares Java content repositories (JCR) and relational database management systems (RDBMS). The choice between these technologies is often made arbitrarily. The aim is to clarify why this choice should be discussed, when one technology should be selected instead of the other and how the selected technology should be used. Four levels (Data model, Specification, Project, Product) are analyzed to show the impact of this choice on different scopes. A discussion of the best choice depending on the context follows. This defines the foundations of a decision framework.


Table of Contents

1 Introduction
1.1 What is compared?
1.2 Why is it comparable?
1.3 What is the purpose of this comparison?
1.4 How will it be compared?
2 State of the art
2.1 Roles
2.2 Domains of responsibility
2.3 Data Models
3 Data model comparison
3.1 Model Definitions
3.2 Structure
3.3 Integrity
3.4 Operations and queries
3.5 Navigation
3.6 Synthesis
4 Specification comparison
4.1 Use Case Definition
4.2 Structure
4.3 Integrity
4.4 Operations and queries
4.5 Navigation
4.6 Transactions
4.7 Inheritance
4.8 Access Control
4.9 Events
4.10 Version control
4.11 Synthesis
5 Development process comparison
5.1 Data Understandability
5.2 Coding Efficiency
5.3 Application Changeability
5.4 Synthesis
6 Product comparison
6.1 Theoretical analysis
6.2 Benchmark
6.3 Synthesis
7 Scenario Analysis
7.1 Survey
7.2 Reservation
7.3 Content management
7.4 Workflow
8 Conclusion
9 Appendix – JCR and design
9.1 Model
9.2 Convention
9.3 Methodology
9.4 Application
10 Appendix – Going further
10.1 Queries in semi-structured models
10.2 Queries on transitive relationships
10.3 Modular and configurable databases
11 Bibliography


University of Lausanne & Day Software AG JCR or RDBMS


1 Introduction

Day Software AG (Day) led the development of a Java specification which defines a uniform application programming interface (API) to manage content. This specification is called Content Repository API for Java (JCR) and is part of the Java Community Process. Implementations of this specification are currently provided by well-known companies such as Oracle, Day or Alfresco.

JCR implementations are often used to build high-level content management systems and collaborative applications. Day also provides an open source implementation of the specification, called Jackrabbit, which is used as a shell for some of its products.

This diploma thesis takes place in this context. Day

wants to clarify some points which relate to the data

model promoted by their specification. The basic idea

is to compare their approach to managing content

with the approach promoted by competitors at

different levels. The following sections will clarify the

approach adopted to do this and give an overview of

the content developed in this report.

1.1 What is compared?

As explained, the purpose is to locate JCR in the database world. This work will be done by comparing the relational model and the model promoted by JCR. The relational model defined by Codd in the 70’s is currently the most widely used data model. The unstructured or semi-structured model underlying the JCR specification is enjoying growing success in the content management area. These two models will be described and analyzed in this report.

1.2 Why is it comparable?

Each data model supports a philosophy for structuring and accessing data. On the one hand, the success of the relational model comes in large part from the facilities it offers to describe clear data structures. On the other hand, the success of the JCR specification relates essentially to the facilities it offers to express flexible data structures. These aspects show us that the discussion takes place at the same level. Thus, it makes sense to compare them and to clarify their respective possibilities and limits. It also makes sense to give a clear picture of the philosophy promoted by each of the models.

1.3 What is the purpose of this comparison?

By making this comparison, Day wants to position more precisely the data model, the specification and the products which relate to JCR. Doing this should help people to better understand the main offers available on the market and show when it makes sense to use them.

More precisely, with an external perspective, the goal

is to define and give clear advice, which can help

people to choose the approach which will best fit their needs. Some people are asking if their applications should be implemented with a relational database or a Java content repository. Thus,

clarifying the philosophies promoted by each model

could help in making good decisions and

understanding the impact of a choice made at a data

model level.


With an internal perspective, some questions relate to how a Java content repository should be implemented. Some companies do this on top of relational databases while others provide native implementations of the model. Should JCR be seen as a data model or as an abstraction layer over an existing data model? Answering this kind of question can have a strong effect on the future implementation of the products and also on the best way to promote them.

1.4 How will it be compared?

First of all, the chapter “State of the art” will try to give a snapshot of the main data models which have been described and used during the last four decades. This will be done with the purpose of identifying the main influences which have led to the current market environment. The goal is also to understand why some data models have encountered success and why others have not.

Then the comparison between the relational

approach and the JCR approach will start. Because

the two approaches show big differences on four

different levels, these are the ones we will examine

and compare, thus avoiding unnecessary discussion

regarding incomparable aspects.

The chapter “Data model comparison” will be the first level of comparison. In this chapter, the two models, namely the relational model and the model used by JCR, will be formally defined. This should help the reader to understand the theoretical concepts hidden by each model. The purpose of this chapter is also to show the impact of these theoretical aspects on real world problems and help people to understand more clearly why they should use one approach instead of the other to solve their problem.

The chapter “Specification comparison” will be the second level of comparison. This chapter will leave the theoretical point of view for a more practical perspective. The SQL standard and the JCR specification will be compared more precisely in this chapter. This will allow us to show practically in which context the concepts described in the “Data model comparison” chapter make sense. Some differences which relate more to the specification definition will also be pointed out.

The chapter “Development process comparison” will be the highest level undertaken in this report. On the basis of the previous chapters, it will discuss the aspects and notable advantages which can significantly influence the development process. This discussion will try to clarify parameters such as the efficiency reached with each approach.

The chapter “Product comparison” will discuss the impact of data models on the products. The performance question constantly occurs at a product level. This chapter will try to address this question with a theoretical cost analysis and a practical benchmark.

The “Scenario analysis” chapter can be seen as a synthesis of the main aspects pointed out during the whole comparison process. Four test cases characterized by different features will be analyzed with regard to the significant aspects presented in this report. The purpose is to set the foundations of a framework which helps in choosing the best approach through a quick requirements analysis.

Appendices are also included in this document. They

contain aspects which are not directly linked to the

comparison but which are interesting for the person

who would like to study the subject further.


2 State of the art

The necessity of splitting information from

applications became clear in the 60’s when many

applications had to access the same set of

information. This segregation has given birth to new

concepts and new roles which relate to the activity of

managing information. This chapter will clarify the

main roles and the main domains of responsibility

linked to information management. Some of the

main approaches which are used to handle

information will also be presented. Basically, the

idea is to build a common language for the following

chapters.

2.1 Roles

People are generally involved in information systems

and data management. Three main roles can almost

always be distinguished when data models and

databases are mentioned:

- The database administrator (DBA) who maintains the database in a usable state.
- The application programmer who writes applications which may access databases.
- The user who uses applications to access, edit, and write data in the database.

Each role generally relates to certain responsibilities. Several domains of responsibility come from disciplines such as design, development or security. Examples of such domains are the structure, the integrity, the availability or the confidentiality of data. Choosing a data model impacts these roles by giving them more or less responsibility.

2.2 Domains of responsibility

Figure 2.2-1 shows four main domains of responsibility which will be mentioned regularly in this report. This role/responsibility diagram tries to convey the classical distribution of responsibilities which is generally made when relational databases or similar approaches are used to manage data.

Figure 2.2-1: Classical responsibility repartition

The WordNet semantic lexicon gives the following definitions for the concepts identified as domains of responsibility in Figure 2.2-1:

- Content: everything that is included in a collection and that is held or included in something
- Structure: the manner of construction of something and the arrangement of its parts
- Integrity: an undivided or unbroken completeness
- Coherence: logical and orderly and consistent relation of parts

Content and structure are relatively clear concepts. However, in the context of this report, it makes sense to be precise as to which meaning is given to integrity and to coherence. Integrity here relates to the “state of completeness” of data which always has to be ensured in the database. This state is preserved with integrity rules at a database level. Coherence relates to the logical organization of data and the quality thereof. Coherence can be ensured with


constraints at a database level but also programmatically at an application level. For several reasons, incoherence can be tolerated for a time in the database. This is not the case for integrity.

Choosing a data model has an impact on the

responsibility repartition in different ways. This report

will try to detail this impact and show the

consequences of these kinds of choices on the

different roles.

2.3 Data Models

A data model should be seen as a way to logically

organize, link and access content. Since the 1960’s,

some data models have appeared and disappeared

for several reasons; this section will give us a brief

overview of the history of the main data models. It

will also give an overview of their respective reasons

for success.

Hierarchical Model

In a hierarchical model (1), data is organized in tree

structures. Each record has one and only one parent

and can have zero or more children. A pure

hierarchical model allows only this kind of attribute

relationship. If an entry appears in several parts of the tree, it is simply replicated. A directed graph without cycles, as depicted in Figure 2.3-1, probably gives the best representation of how entries are organized in this model.

Figure 2.3-1: Tree graph

In general two types of entries are distinguished, the

root record type and the dependent record type.

The first type characterizes a record from the level

zero of the hierarchy which has no parental

relationship. The second type characterizes all the

records which are located under the root record.

They are dependants in the sense that their lifetime

will never be longer than the lifetime of their parent.

In the hierarchical model, each record can generally

store an arbitrary number of fields which allow for

storing data.
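The properties listed above (one parent per record, free-form fields, dependents that never outlive their parent) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` class and its method names are hypothetical and do not come from IMS or any other product.

```python
class Record:
    """A record in a pure hierarchical model: one parent, any number of children."""

    def __init__(self, name, **fields):
        self.name = name
        self.fields = dict(fields)  # arbitrary number of fields
        self.parent = None          # stays None only for the root record
        self.children = []

    def add_child(self, child):
        # A record has one and only one parent, so re-attaching is refused.
        if child.parent is not None:
            raise ValueError("a record has one and only one parent")
        child.parent = self
        self.children.append(child)
        return child

    def remove_child(self, child):
        # Detaching a record detaches its whole subtree as well:
        # dependents never outlive their parent in the tree.
        self.children.remove(child)
        child.parent = None

    def walk(self):
        # Depth-first traversal starting at this record.
        yield self
        for child in self.children:
            yield from child.walk()


root = Record("company")                      # root record type
sales = root.add_child(Record("sales", floor=2))
sales.add_child(Record("alice", role="manager"))

names_before = [r.name for r in root.walk()]
root.remove_child(sales)
names_after = [r.name for r in root.walk()]
```

An entry that has to appear under two parents cannot be shared in this sketch; it would have to be created twice, which is the replication mentioned above.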

While some real problems have a tree-like structure, the assumption that only this kind of attribute relationship governs the world is too strong. During its history, the hierarchical model has probably suffered from this. Some people have probably abandoned it for models which seem to fit reality better.

The main implementation of the hierarchical model was developed by IBM in the 60’s. This database is called IMS, which stands for Information Management System. Today, IMS is still used in the industry for very large scale applications. IBM sells it as a solution for critical online applications. In fact, IBM continues to invest in this product and to develop new releases.

Most directory services use concepts inherited from the hierarchical model. Moreover, traces of the model are visible on every system: everybody uses hierarchical concepts to organize files and folders. So every computer user is more or less familiar with the hierarchical organization of information.

Furthermore, during the last decade, the hierarchical model has found a new popularity with the increasing use of formats such as XML or YAML. In a web browser, the Document Object Model (DOM) also uses a hierarchy of objects to organize the elements of a web page. Thus, this model is not in danger of disappearing. It will probably continue to encounter further success in the future.

Network Model

Instead of limiting the organization of data to a tree structure, the network model allows entries to be linked to one another in any direction. A directed graph, as shown in Figure 2.3-2, is probably the best representation of how data is structured in a network model. The


other properties of this model are shared with the

hierarchical model. Thus we can say that the

hierarchical model is a subset of the network model.
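The subset relationship can be made concrete with a small check. In the sketch below, a data set is represented as a list of node names and a list of directed edges; the function name `is_hierarchy` and this representation are assumptions chosen for illustration. Every tree passes the check, while a general network graph usually does not.

```python
def is_hierarchy(nodes, edges):
    """Return True if the directed graph is a tree: exactly one root,
    at most one parent per node, and every node reachable from the root."""
    parents = {n: [] for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        parents[dst].append(src)
        children[src].append(dst)

    roots = [n for n in nodes if not parents[n]]
    if len(roots) != 1 or any(len(p) > 1 for p in parents.values()):
        return False

    # Walk down from the root; unreachable nodes reveal a detached cycle.
    seen, stack = set(), [roots[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return len(seen) == len(nodes)


nodes = ["a", "b", "c", "d"]
tree_edges = [("a", "b"), ("a", "c"), ("b", "d")]
# The same entries, with two extra links that a pure hierarchy forbids.
network_edges = tree_edges + [("c", "d"), ("d", "a")]

tree_ok = is_hierarchy(nodes, tree_edges)
network_ok = is_hierarchy(nodes, network_edges)
```

Here `tree_ok` is true and `network_ok` is false: the second graph is a valid network but not a valid hierarchy.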

Figure 2.3-2: Network graph

Initially developed during the 70’s to overcome the lack of flexibility of the hierarchical model, the network model encountered a lot of success during that decade. This model has found hundreds of applications in different fields of computer science such as the management of in-memory objects or bioinformatics applications. However, it seems that few people currently use it to organize their data. It is still well known in embedded applications, whilst large scale applications built on it are slowly disappearing.

Relational Model

Before its definition by Codd during the 70’s, the relational model (2) had not encountered a lot of success. However, after this formal work based on set theory and first-order logic, some companies chose to implement this model. IBM was one of the first companies to take the lead in the market with the DB2 database. Oracle is now the uncontested leader with its implementations of the relational model.

The relational model defines the concepts of relations, domains, tuples and attributes, which are more often called tables, columns, rows and fields. Interestingly, this model is now so widely taught and used that its suitability for solving specific cases is rarely questioned.

Figure 2.3-3: Relation, domain, tuple and attribute

Some people link the success of the relational model to its mathematical foundation. However, the implementations currently in use are a far cry from the beautiful concepts defined at the beginning. The main building blocks are now hidden by features which are provided to address practical requirements.

Thus, the success of this data model should be linked to the practical answers which have been given to problems encountered in the business world during the 80’s and the 90’s (3). The normalization principle was used to save storage capacity. Furthermore, during this period, information systems were widely used for automation and monitoring tasks. The relational model offered a very good canvas to express and solve problems such as these.


3 Data model comparison

This chapter will define more clearly the JCR model and the relational model. Several aspects which relate to the foundations of the models will be presented and compared. The main purpose of this chapter is to understand the philosophy or basis of each model.

The “Model definitions” section briefly presents the main ideas underlying the models. The “Structure” and “Integrity” sections will mainly discuss the respective places of the content, the structure and the semantics in both data models. The “Operations and queries” and “Navigation” sections will show different ways used to retrieve and edit content. Throughout the whole chapter, an important place will be given to the impact of the choice of data model and to the reasons which should drive this choice.

3.1 Model Definitions

Some works and references give definitions of the different data models currently used (4) (5). Some tools are also available to understand the main concepts of these models. The purpose of this section is not to enrich these definitions; they are included simply to draw attention to some theoretical aspects required in order to build a common language for the comparison.

JCR Model

To organize records, this model includes concepts inherited from the hierarchical and from the network model. Thus, as shown in Figure 3.1-1, records stored with the JCR data model are primarily organized in a tree structure. However, the limitations of the hierarchical model are avoided by giving the ability to link each record horizontally. Attributes which point to other nodes can be stored at each level to create network relationships. This type of model permits the creation of a network within a tree structure.

Figure 3.1-1: JCR graph
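The combination of a tree with horizontal links can be sketched as follows. This is a plain Python illustration and not the javax.jcr API: the `Node` class, its `path` property and its `references` method are hypothetical names chosen for the example, and a reference is simply modelled as a property whose value is another node.

```python
class Node:
    """A node in a tree whose properties may also point to other nodes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}    # hierarchical relationships
        self.properties = {}  # plain values or references to other nodes
        if parent is not None:
            parent.children[name] = self

    @property
    def path(self):
        # The path is derived from the position in the tree.
        if self.parent is None:
            return "/"
        prefix = self.parent.path.rstrip("/")
        return f"{prefix}/{self.name}"

    def references(self):
        # Properties holding a Node form the network relationships.
        return [v for v in self.properties.values() if isinstance(v, Node)]


root = Node("")
articles = Node("articles", root)
authors = Node("authors", root)
bertil = Node("bertil", authors)
paper = Node("jcr-or-rdbms", articles)

paper.properties["title"] = "JCR or RDBMS"
paper.properties["author"] = bertil  # horizontal link across the tree

paper_path = paper.path
linked_paths = [n.path for n in paper.references()]
```

Every node keeps exactly one place in the tree, while the `author` property adds a network relationship that the pure hierarchical model could only express by replication.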

Currently, some explanation of the schema which relates to the data model definition can be found in the specification (4) (5). Figure 3.1-2, based on this information, attempts to express the JCR data model more formally. It is interesting to note that at this stage, no differentiation between the content and the structure can be made. In fact, the structure appears with the instantiation of items.

Figure 3.1-2: JCR class diagram


Relational Model

The relational model, which was briefly introduced in the “State of the art” chapter, is based on set theory. A relation as defined by Codd (2) refers to the mathematical concept of relation. In his paper, he gives the following definition of a relation:

R is a subset of the Cartesian product S1 × S2 × … × Sn

Practically, because all these sets have to be distinguished from one another, they are identified as domains. Thus, assuming the domains of first names F, of last names L and of ages A, a Person relation is a set of tuples (f, l, a) where f ∈ F, l ∈ L and a ∈ A. Figure 3.1-3 represents a table view of this relation. In this representation, each domain corresponds to a column and each tuple to a row.

Figure 3.1-3: Relation, domain, tuple and attribute
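The definition can be checked mechanically on a toy example. The sketch below assumes tiny finite domains for F, L and A (real domains are of course much larger), and the helper name `is_relation` is hypothetical.

```python
from itertools import product


def is_relation(tuples, *domains):
    """True if `tuples` is a subset of the Cartesian product of `domains`."""
    return set(tuples) <= set(product(*domains))


F = {"Ann", "Bob"}   # domain of first names
L = {"Lee", "Kim"}   # domain of last names
A = {30, 41}         # domain of ages

# A Person relation: a set of tuples (f, l, a), one per row of the table view.
person = {("Ann", "Lee", 30), ("Bob", "Kim", 41)}

valid = is_relation(person, F, L, A)
# 99 is not in the domain A, so this set is not a relation over F, L and A.
invalid = is_relation({("Ann", "Lee", 99)}, F, L, A)
product_size = len(set(product(F, L, A)))
```

With two values in each domain, the Cartesian product contains 2 × 2 × 2 = 8 tuples, and the Person relation selects two of them.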

This basic definition does not mention the ability to create associations between relations. In fact, there is no link between the name of the model and associations. The ability to express associations comes later with the join operations defined by relational algebra. These operations will be introduced in the next sections.

Figure 3.1-4 shows a class diagram which could be used to express relations. While the pertinence of this kind of diagram can be discussed, the purpose is to give a simple and visual representation of a relation. Furthermore, parts derived from this diagram will be reused later to express the intersections between the relational model and the JCR model.

Figure 3.1-4: Relational class diagram

3.2 Structure

A rich debate around the respective places of data and structure in data models has been ongoing for several years both on the web (6) and in academic fields (3). This debate could be summarized as follows: should data be driven by the structure or should the structure be driven by data?

These discussions come from the fact that some concepts do not really fit into a predefined canvas. A predefined canvas offers many advantages and facilities. For example, it is easier to express integrity constraints on a well-known structure. Equally, indexing and query optimization (7) can benefit from the assumption that a clear structure can always be found for a problem. However, in real life situations, there is always an exception which does not conform to the canvas.

The following sections will situate the two models in this context. For each approach, the respective places of the data and the structure will be presented. The situations in which each strategy does or does not make sense will also be identified.

JCR model

In Figure 3.1-2, a class diagram shows the main aspects of the JCR data model. In this figure, the instantiation of nodes, properties and values leads to the creation of content. If we try to identify the


structure’s place in this diagram, it appears that no

real differentiation is made between the content itself

and its structure.

Thus, the model proposed by JCR does not require

the definition of a structure to instantiate content.

Instances of nodes, properties and values can be

created before defining any kind of structure. In fact,

the structure appears with the content.

A parallel can be drawn between this approach and the semi-structured approach described at the end of the 90’s (8), in which no separation was made between data and their structure. This provides two possible advantages: firstly, a dynamic schema able to store data which does not fit into a predefined canvas, and secondly, the ability to browse the content without knowing its structure.
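Both advantages can be illustrated with nested dictionaries standing in for nodes and properties. This is only a sketch under that assumption: the function name `describe` and the sample content are hypothetical, and a real content repository offers far more than a dictionary does.

```python
def describe(node, path=""):
    """Yield (path, kind) pairs discovered purely by inspecting the content."""
    for key, value in node.items():
        child_path = f"{path}/{key}"
        if isinstance(value, dict):
            # A nested dictionary plays the role of a sub-node.
            yield (child_path, "node")
            yield from describe(value, child_path)
        else:
            # Anything else is treated as a property value.
            yield (child_path, type(value).__name__)


# No schema is declared: the structure appears as content is added.
content = {}
content["pages"] = {}
content["pages"]["home"] = {"title": "Welcome"}
content["pages"]["news"] = {"title": "Launch", "published": 2008}

structure = list(describe(content))
```

The `news` entry carries a `published` property that `home` does not have, and `describe` still walks everything without being told in advance what to expect.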

Some modern programming languages such as Ruby or Python also give the ability to extend objects on the fly with properties and functions (reflection). While a part of the structure appears at runtime, it is possible to define semantics which identify the main concepts. In JCR this is done with node-types. Basically, defining semantics does not limit the capacity of a node to store an unlimited combination of sub-nodes and properties. Proceeding in this manner allows records to be created or to evolve as and when required.

For example, if we want to define a semantic item for media, there is no real need to take into account all the possible properties which could appear under this node during the application life cycle. Each special case of media item, such as images, videos, etc., can have specific attributes which do not affect the whole set of media instances and which do not necessarily have to be specified at design time.
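The media example can be sketched as follows, assuming a simplified notion of type that only names a few mandatory properties and leaves everything else open. The `NODE_TYPES` table and the `create` function are hypothetical and much simpler than real JCR node-type definitions.

```python
# Hypothetical registry: type name -> properties every instance must have.
NODE_TYPES = {"media": {"title"}}


def create(node_type, **properties):
    """Create an item of the given type; extra properties are always allowed."""
    missing = NODE_TYPES[node_type] - set(properties)
    if missing:
        raise ValueError(f"{node_type} is missing {sorted(missing)}")
    return {"type": node_type, **properties}


# Images and videos share the 'media' semantic but carry different attributes,
# none of which had to be declared when the type was defined.
image = create("media", title="Logo", width=64, height=64)
video = create("media", title="Intro", duration=12.5, codec="h264")
```

The type identifies the concept, while the attributes specific to images or videos stay local to the instances that need them.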

Relational model

Figure 3.1-4 represents a basic class diagram

describing succinctly the main ideas proposed by the

relational model. We see in this diagram that the

concept of record which is represented by the

Element class is separated from the structure.

Note that the paradigm of the relational model is completely different from the one proposed by JCR. A structure made of relations and domains has to be instantiated first. Then, tuples which fit into this structure can be created. While the DBA can choose the level of flexibility of the initial structure, this kind of model clearly differentiates between the data and its schema.
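The structure-first paradigm can be observed with any relational database. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the `person` table and its columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: the structure (relation and domains) must exist before any data.
conn.execute(
    "CREATE TABLE person ("
    "first_name TEXT NOT NULL, last_name TEXT NOT NULL, age INTEGER NOT NULL)"
)

# Step 2: tuples which fit into this structure can be created.
conn.execute("INSERT INTO person VALUES (?, ?, ?)", ("Ann", "Lee", 30))

# A tuple carrying an attribute unknown to the schema is rejected.
try:
    conn.execute(
        "INSERT INTO person (first_name, last_name, age, nickname) "
        "VALUES (?, ?, ?, ?)",
        ("Bob", "Kim", 41, "bk"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The schema has to be changed explicitly before such data can be stored.
conn.execute("ALTER TABLE person ADD COLUMN nickname TEXT")
conn.execute(
    "INSERT INTO person (first_name, last_name, age, nickname) "
    "VALUES (?, ?, ?, ?)",
    ("Bob", "Kim", 41, "bk"),
)
rows = conn.execute(
    "SELECT first_name, nickname FROM person ORDER BY first_name"
).fetchall()
```

The schema change is a separate, deliberate step, which is where the responsibility of the structure owner comes into play.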

Differentiating the structure from the data can bring some benefits. For example, it is appropriate for a problem solving approach rather than a data storage approach. This is evident in the fact that many developers create an entity relationship model during the early phases of defining data requirements.

However, in real life situations, the assumption that content and structure can be completely separated is not always valid. For example, to handle expansion in a relational database, artificial or miscellaneous fields are often added to the relational structure. These can take the form of fields added to create hierarchies or fields added to define customized orders in a set of tuples. These conceptual entities can become difficult to describe within the confines of the structure. As the application evolves and new requirements are added, the management of the additions can become difficult and dangerous. A change could even imply a rethink of the whole structure of the implementation.
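The two kinds of artificial fields mentioned above can be sketched with `sqlite3`. The `page` table, its `parent_id` column (hierarchy) and its `position` column (custom order) are hypothetical names chosen for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE page (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        parent_id INTEGER REFERENCES page(id),  -- artificial field: hierarchy
        position  INTEGER NOT NULL              -- artificial field: custom order
    )
""")
conn.executemany("INSERT INTO page VALUES (?, ?, ?, ?)", [
    (1, "Home", None, 0),
    (2, "Products", 1, 1),
    (3, "About", 1, 0),
    (4, "Pricing", 2, 0),
])

# Children of "Home", in the order chosen by the editor rather than by id.
children_of_home = [row[0] for row in conn.execute(
    "SELECT title FROM page WHERE parent_id = ? ORDER BY position", (1,))]

# Rebuilding the path of one page requires a recursive query over parent_id.
path_to_pricing = [row[0] for row in conn.execute("""
    WITH RECURSIVE ancestors(id, title, parent_id, depth) AS (
        SELECT id, title, parent_id, 0 FROM page WHERE id = ?
        UNION ALL
        SELECT p.id, p.title, p.parent_id, a.depth + 1
        FROM page p JOIN ancestors a ON p.id = a.parent_id
    )
    SELECT title FROM ancestors ORDER BY depth DESC
""", (4,))]
```

Neither the tree nor the ordering is a concept of the relational structure itself; both exist only through the convention attached to these two columns, which is what makes later changes delicate.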

Content, structure and responsibility

As shown in the “State of the art” chapter, in classical situations, the DBA is generally responsible for the data structure. The application programmer can influence decisions made in this area but does not have the final responsibility for the structure. Finally, the user clearly has no say: his scope is limited by the functionalities developed by the application programmer to create, remove and update data.

As shown in Figure 3.2-1, choosing a content-driven approach instead of a structure-driven approach significantly impacts the respective roles of the DBA, the application programmer and the user. In fact, the DBA loses his responsibility as main owner of the structure. If the structure is driven by data, this ownership is shared with the application programmer and the user.


Figure 3.2-1: Responsibility repartition revisited

It is true that a clear separation between the content

and the structure makes some aspects of data

management easier. Clearly splitting the structure and the content makes it easier to define roles and to separate duties. The DBA owns the database and all the structures from which records can be instantiated. In this context, the application programmer becomes a kind of super user with extended rights, but the user may only access what is available in the application.

This kind of scenario gives a lot of responsibility to the DBA and places him at the centre of database evolution. Unfortunately, he is not necessarily attuned to the real needs of the users. It would therefore be advisable for the DBA to be responsible for data integrity, availability and recoverability rather than for the structure or the content. In general, these should be defined jointly by the application programmer and the user.

Choosing the right approach

In a real working environment, some problems benefit from being driven by a structure whereas others clearly do not fit into any predefined structure. A simple analogy may help to explain this situation. Houses are rarely built from scratch without blueprints. However, at the scale of cities, there are generally no blueprints which plan their final state. So which lessons can we learn from this simple example? Are complex problems driven by data instead of by structure? Not necessarily.

In the example of the house and of the city, the problem could be seen as follows. For houses,

because budgets and resources available are

generally known in advance, the most effective way

to proceed is to define a structure before the

construction. For cities, because resources and

budgets available are generally not known in

advance and are evolving, the most effective way to

proceed is to let their structure emerge. If necessary,

guidelines can be defined to control their growth.

Since information system problems involve a wide and growing community of stakeholders, and providers cannot know what will be done with their applications, these kinds of questions should be debated at the onset of the design:

Are the users known or not?

Is the behavior of the users known or not?

Is the final usage of the application known or not?

Do the entities fit into a canvas or not?

The response to these questions is probably one of

the best indicators when deciding upon one of the

two approaches.

The JCR model advocates clearly for a structure

driven by data. By creating content, items, nodes and

properties, users are building the structure. Database

administrators and application programmers are just

guiding this structure by defining rules and

constraints. In model implementations made with a

relational approach, a structure is first defined by the

database administrator and the application

programmer. The users can then register content items which fit this structure.
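The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `book` table, the `subtitle` field and the dictionary-based node are hypothetical, and the dictionary merely imitates the idea of a node holding free-form properties rather than any real JCR API.

```python
import sqlite3

# Structure-driven: the schema is fixed before any content is stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (title TEXT NOT NULL)")
try:
    conn.execute("INSERT INTO book (title, subtitle) VALUES ('JCR', 'A primer')")
    relational_accepted = True
except sqlite3.OperationalError:
    # 'subtitle' was not planned in the schema, so the record is rejected.
    relational_accepted = False

# Content-driven: a node is a free-form bag of properties and children.
book_node = {"properties": {"title": "JCR"}, "children": {}}
book_node["properties"]["subtitle"] = "A primer"  # the structure grows with the content
```

In the first case the structure has to be changed before the new information can be stored; in the second case storing the information is what changes the structure.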

Depending on the use case, either data model could be useful. It basically depends on the perspective from which we wish to view the data: a fixed structure or a more flexible data-driven model. The choice of model will be based on the certainty or uncertainty of the answers to the few decisive questions listed above.

3.3 Integrity

A strong association between structure and data integrity is often made. Thus some people are afraid of letting their users take part in the definition of the structure. However, it is more correct to say that data integrity belongs to semantics.

Generally, integrity definitions do not make any mention of the structure. A structure made of relations and domains is evidently an elegant way to express a semantic. It is also a good basis on which to declare integrity constraints. Nonetheless, integrity constraints can be defined at a lower level, directly over a semantic. One advantage could be, for example, that all the structures which respect the semantic constraints can be instantiated in the database, and not only the records which fit into the structure.

Furthermore, as mentioned in the "state of the art" chapter, integrity definitions generally do not make mention of coherency. In the database environment, these two concepts are often conflated. While data coherency can be preserved by integrity constraints, the integrity of a dataset is not necessarily lost if incoherent records are present in the database.

Unquestionably data integrity means that no

accidental or intentional destruction, alteration, or

loss of data should ever occur. While data integrity

should be ensured at all times during a database’s

lifecycle the assumption that data coherency should

have the same property is probably too strong.

Some people have the habit of treating both aspects directly in the database: everything which relates to data coherency along with integrity constraints. This ensures that data coherency is preserved in all cases. However, this also has a cost in terms of performance, since checks have to be performed each time a write access is made on the database. Therefore a tradeoff has to be made between data integrity and data coherency.

A balanced approach which can result in a better user experience consists in identifying, sometimes arbitrarily, what relates to integrity and what relates to coherency. Data integrity is then treated with constraints at the database level. Data coherency is treated programmatically at the application level in a way which alleviates the workload of the system.
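The split described above can be sketched as follows with Python's standard `sqlite3` module. The `order_line` table, its columns and the `order_is_coherent` helper are hypothetical names chosen for illustration. Which rule counts as integrity and which as coherency is, as noted, a partly arbitrary decision; here a positive quantity is treated as integrity and an announced order total as coherency.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Integrity: declared once as constraints, enforced on every write.
conn.execute("""
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL,
        quantity   INTEGER NOT NULL CHECK (quantity > 0),
        unit_price REAL    NOT NULL CHECK (unit_price >= 0)
    )
""")
conn.execute("INSERT INTO order_line VALUES (1, 2, 10.0)")

try:
    conn.execute("INSERT INTO order_line VALUES (1, 0, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database refuses a non-positive quantity

def order_is_coherent(conn, order_id, announced_total):
    """Coherency: checked by the application, only when it is needed."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(quantity * unit_price), 0) "
        "FROM order_line WHERE order_id = ?",
        (order_id,),
    ).fetchone()
    return abs(total - announced_total) < 1e-9
```

The constraint costs something on every write, whereas the coherency check is paid only when the application decides to call it.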

JCR Model

An analogy can be made between the JCR model and a black list. The most generic node accepts any kind of children, any kind of properties and any kind of values. A mechanism is provided through the concept of node-type to let the DBA define integrity constraints.

In the JCR model, node-types are used to express a

semantic. Declaring constraints on this semantic

allows the declaration of restrictions on the nodes

and on their content. Each node has a primary node-

type and can have several mixin node-types which

extend the primary node-type. Node-types allow for

specifying constraints on the children of a node, on

the properties of a node and on the values of the

properties stored by a node.

Figure 3.3-1: JCR model and integrity

Using several node-types makes it possible to ensure the integrity of transitive relations in a hierarchy. For example, it is possible to define a node-type which supports only children of a specific type. The latter could also have node-types which declare constraints for their children. Proceeding in this fashion makes it possible to require, for instance, that the children of the children of a specific node have a certain type.
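The idea of chaining child constraints can be sketched in plain Python. This is not the JCR API: the `ALLOWED_CHILDREN` registry, the `Node` class and the type names (including `chapter`) are assumptions made for illustration only.

```python
# Hypothetical node-type registry: each type lists the child types it accepts.
# None means "any type", as for the most generic node.
ALLOWED_CHILDREN = {
    "generic": None,
    "collection": {"collection", "book"},
    "book": {"chapter"},
    "chapter": set(),
}

class Node:
    def __init__(self, name, node_type):
        self.name = name
        self.node_type = node_type
        self.children = []

    def add_child(self, child):
        allowed = ALLOWED_CHILDREN[self.node_type]
        if allowed is not None and child.node_type not in allowed:
            raise ValueError(
                f"{self.node_type} does not accept children of type {child.node_type}"
            )
        self.children.append(child)
        return child

catalogue = Node("catalogue", "collection")
book = catalogue.add_child(Node("book-a", "book"))
book.add_child(Node("introduction", "chapter"))

try:
    catalogue.add_child(Node("stray", "chapter"))
    stray_accepted = True
except ValueError:
    stray_accepted = False
```

Because a collection accepts only collections and books, and a book accepts only chapters, the grandchildren of a collection are constrained without any rule mentioning them directly.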

When integrity is mentioned, we often speak about

entity integrity, referential integrity and domain

integrity. These concepts relate closely to the

relational model but as shown in Figure 3.3-1 we can

find similar ways to express constraints in the JCR

model.

Entity integrity is ensured by the fact that each node is unique and identified by its location in the data model or by its UUID. Paths cannot really be considered as unique identifiers because same-name siblings are allowed for XML compatibility. Referential integrity is ensured by the fact that all the reference properties of a node have to point to a referenceable node. Furthermore, a referenceable node cannot be deleted while it is referenced. Domain integrity can be ensured by forcing nodes to have specific properties which contain values in predefined ranges.

Data coherence can be checked with integrity constraints, but the model does not provide all the tools to do a complete coherency check. This supports the idea that separating the two areas is beneficial: integrity should be ensured at the data model level and data coherency at the application level.

Relational Model

An analogy between the relational model and a white list is appropriate. As explained in the last section, the relational approach makes the assumption that structure and content have to be separated. Thus saving content is allowed only if a structure has been defined. Some integrity constraints are implicit in the relational structure. The domain constraints ensure, for example, that all the values stored in the same domain have the same type. The entity integrity constraints guarantee that, thanks to the primary key, all records in a table are unique.

Furthermore, the structure is generally taken as a base on which to declare other integrity constraints. Referential integrity ensures that a foreign key domain is a subset of the referenced domain. In the same way, other integrity constraints which make use of the operations proposed by the model can be described.
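Entity and referential integrity can be observed with a short sketch based on Python's standard `sqlite3` module. The `person` and `purchase` tables and the `violates` helper are hypothetical. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and that it is loosely typed, so domain typing is not demonstrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE purchase (
        id        INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(id)
    )
""")
conn.execute("INSERT INTO person VALUES (1, 'Alice')")
conn.execute("INSERT INTO purchase VALUES (10, 1)")

def violates(sql):
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

duplicate_key = violates("INSERT INTO person VALUES (1, 'Bob')")  # entity integrity
dangling_ref = violates("INSERT INTO purchase VALUES (11, 99)")   # referential integrity
orphaning = violates("DELETE FROM person WHERE id = 1")           # row is still referenced
```

All three statements are refused by the database itself, without any check written in the application.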

Figure 3.3-2: Relational model and integrity

A structure which is known in advance and whose evolution is controlled is an elegant base on which to ensure integrity. The syntaxes which permit the expression of integrity constraints are generally derived from first-order logic. Because the main building blocks of the relational model are based on well-known mathematical disciplines, namely set theory and first-order logic, implementation models can be expressed which share these mathematical properties.

In terms of data integrity, this provides advantages because the solidity of the implementation model can be mathematically proven. This simple way of proceeding also makes it possible to declare rules and constraints for nearly everything with short statements. As a result, solid implementation models can be quickly declared with a high level of accuracy and a minimum of programming effort.

However, as mentioned before, the assumption that each problem can fit in a predefined structure is often too strong. Furthermore, while the relational model

has the ability to express hierarchies and network

structures, first-order logic is limited when constraints have to be declared on them. In conclusion, it is often difficult to know what should be managed at the model level and what at the application level.

Integrity, coherency and responsibility

In general, DBAs tend to declare very strong structures. Their implementation models are thought of as white lists which preserve data integrity and data coherence. However, to build generalized and flexible implementation models, only data integrity should be constrained at the model level.

Furthermore, the argument that data integrity and data coherency should both be the responsibility of the DBA reflects neither the reality nor the ideal. All the checks made at the application level to validate what users inject into the data testify to this.

Figure 3.3-3: Responsibility repartition revisited

Therefore, clarifying how the responsibility for such checks is divided would be of great benefit. It would help in defining the reasons for choosing a given model. Equally, it would identify any shortcuts taken on aspects of data integrity and help to avoid this sort of pitfall. Furthermore, clearly dividing the responsibility for integrity and for coherence could make it easier to design applications which take into account the cost of the checks made at the data model level.

Choosing the right approach

The argument that the relational model has

mathematical properties (2) which will ensure rock

solid data integrity is often selected for the wrong

reasons. In fact these properties are only used for

very specific applications and the integrity of an

implementation model as understood here is rarely

proven mathematically because it is not a

requirement.

The choice of the best approach should be made

with regard to the responsibility given to the DBA and

to the application programmer. The following two

examples illustrate this idea. On the one hand, a prison guard must control all the movements of the people in the prison during the day. In this case,

a rock solid program conceived as a white list is

ideal. The people may only do the things that they

are allowed to do. On the other hand, a tourist guide

has to ensure that the travelers have a good trip by

directing them and giving them the right information.

In this case, a program conceived as a black list will

probably give more satisfaction to the user.

Some functional cases do not benefit from being governed by a lot of constraints. Unfortunately, the relational model often leads DBAs and application programmers to design restrictive implementation models. This gives them the feeling that their application is well thought out, but often it only frustrates the users.

The following questions should be honestly asked:

Do users have to be guarded or guided?

Does data coherency have to be preserved at a database level or at an application level?

Therefore, choosing the right data model is not only a question of preference: the choice should always be based on an analysis of the use case.

3.4 Operations and queries

Query languages are close to fields such as relational algebra, first-order logic or simply mathematics. Depending on the case, queries can be expressed with declarative calls or with procedural languages.

In general, queries are composed of several

operations which make use of the structure or of the

data semantic.

Some operations can be used in queries. These operations, such as selection, projection, rename and other set operations, are inherited from the disciplines mentioned at the beginning of the section. In addition to these operations, some query languages provide statements which allow data to be created, modified or deleted. This section clarifies the bounds of each model in terms of queries and operations.

JCR Model

An abstract query model is used as a basis to retrieve data in the JCR model (4) (5). This query model makes a kind of mapping between the JCR model and the notions of relations, domains, tuples and attributes present in the relational model. Figure 3.4-1 is a modified version of Figure 3.1-4 which shows this mapping visually.

Figure 3.4-1: JCR model, operations and queries

It seems that, in the current state, node-tuples are seen as relations, properties as domains, nodes as tuples and values as attributes. Basically, node-tuples are arbitrary sets of nodes. However, node-types are used as the main source of node-tuples in queries. While this kind of mapping cannot be considered an application of the principles of set theory, it allows some interesting queries to be run which can satisfy nearly all requirements.

The operations provided by this query model are selection and the set operations which allow joins to be performed between sets of node-tuples. The result of a query is composed of all the nodes which satisfy the selection condition and the join condition.

Basically, in the JCR model, queries are seen as a way to perform search requests. They provide a way of retrieving records, but the selection criteria cannot be used to delete or update those records sequentially. This limitation is not dictated by conceptual barriers; it could be lifted if required.

As mentioned before, the data and their structure are not separated in this model. Thus, some attributes of the records, such as their depth level or their hierarchical path, can be viewed as properties. This makes it easy to perform queries on things which are generally not taken into account in other models, such as transitive relationships in hierarchies.
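A minimal sketch, in plain Python rather than in a JCR query language, shows why treating the path as a property makes transitive queries simple. The `nodes` dictionary, its paths and the `descendants` helper are hypothetical.

```python
# Each node is stored under its path; the depth is derived from the path.
nodes = {
    "/catalogue": {"type": "collection"},
    "/catalogue/science": {"type": "collection"},
    "/catalogue/science/book-a": {"type": "book"},
    "/catalogue/fiction": {"type": "collection"},
    "/catalogue/fiction/classics": {"type": "collection"},
    "/catalogue/fiction/classics/book-b": {"type": "book"},
}

def depth(path):
    """Depth level of a node, counted from the root."""
    return path.count("/")

def descendants(path, node_type=None):
    """All nodes below `path`, at any depth, optionally filtered by type."""
    prefix = path.rstrip("/") + "/"
    return sorted(
        p for p, node in nodes.items()
        if p.startswith(prefix) and (node_type is None or node["type"] == node_type)
    )

books = descendants("/catalogue", node_type="book")
```

The transitive relationship "is somewhere below" reduces to a comparison on the path property, whatever the depth of the hierarchy.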

Relational Model

Relational algebra defines the primitive operations available in the relational model (9). These operations are mainly selection, projection, rename, Cartesian product, union and difference. The power of this query model lies in the fact that the input and the output of these operations are always relations. Thus, it is possible to express complex and nested statements.
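This closure property can be illustrated with a toy model in Python, where a relation is a set of rows. The `relation`, `select` and `project` helpers and the sample data are assumptions made for this sketch; they are not a complete relational algebra.

```python
# A relation is modelled as a frozenset of rows; each row is a frozenset of
# (attribute, value) pairs so that rows are hashable and duplicates vanish.
def relation(rows):
    return frozenset(frozenset(row.items()) for row in rows)

def select(rel, predicate):
    """Selection: keep the rows which satisfy the predicate."""
    return frozenset(row for row in rel if predicate(dict(row)))

def project(rel, attributes):
    """Projection: keep only the given attributes of each row."""
    return frozenset(
        frozenset((k, v) for k, v in row if k in attributes) for row in rel
    )

books = relation([
    {"title": "A", "year": 2005, "collection": "science"},
    {"title": "B", "year": 2008, "collection": "science"},
    {"title": "C", "year": 2008, "collection": "fiction"},
])

# The output of select is itself a relation, so it can feed project directly.
recent_collections = project(select(books, lambda r: r["year"] == 2008), {"collection"})
result = sorted(dict(row)["collection"] for row in recent_collections)
```

Since every operation takes relations and returns a relation, expressions can be nested to any depth.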

In addition to these operations, some mathematical

operators can be used. It’s also possible to specify

additional domains for the output relation. Some domain operations are also provided to retrieve information such as the number of attributes stored in a domain or the domain's maximal value.

The query languages provided by relational database implementations generally offer statements which allow data to be modified, created or deleted (10). Used in conjunction with the previously presented operations, these statements become very useful. They provide a means of performing sequential changes on data sets which satisfy precise conditions.
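As an illustration, the following sketch uses Python's standard `sqlite3` module and a hypothetical `book` table to combine a selection condition with update and delete statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, price REAL, stock INTEGER)")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [(1, 20.0, 0), (2, 30.0, 5), (3, 40.0, 0)],
)

# One statement changes every row which satisfies the condition.
updated = conn.execute("UPDATE book SET price = price * 0.5 WHERE stock > 0").rowcount
deleted = conn.execute("DELETE FROM book WHERE stock = 0").rowcount
remaining = conn.execute("SELECT id, price FROM book").fetchall()
```

No loop over the records is written by the programmer: the condition selects the set and the statement is applied to all of its members.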

The possibilities given by the usage of these operations are huge. However, limitations are encountered when transitive relationships appear (11). This sort of query cannot be expressed with first-order logic statements. For example, it is not possible to define a query which retrieves all of the descendants of an element. Some other solutions are available (12). They do however often add complexity to the implementation models.
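One such solution, which goes beyond SQL92, is the recursive common table expression standardized in SQL:1999. The sketch below uses Python's standard `sqlite3` module (recursive queries require SQLite 3.8.3 or later); the `collection` table and its content are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO collection VALUES (?, ?, ?)",
    [(1, "catalogue", None), (2, "science", 1), (3, "physics", 2), (4, "fiction", 1)],
)

# The recursive part joins the table with the rows found so far,
# until no new descendant is produced.
descendants = [row[0] for row in conn.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM collection WHERE parent_id = ?
        UNION ALL
        SELECT c.id, c.name FROM collection c JOIN subtree s ON c.parent_id = s.id
    )
    SELECT name FROM subtree ORDER BY name
""", (1,))]
```

The query works for any depth, but it relies on an artificial `parent_id` column and on a construct which is noticeably harder to read than an ordinary selection.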

Choosing the right approach

While JCR provides a means of carrying out some operations and queries, the relational model is clearly more complete in this area. In some situations, this completeness can become a decision criterion if the use case implies that complex join operations may be required.

The ability, offered by most relational databases, to use these operations in conjunction with update and delete statements is also a significant advantage of the relational model. For use cases which involve a lot of write access, it allows content to be created, updated and deleted quickly. However, caution should be taken with this type of usage when complex hierarchies are present.

3.5 Navigation

During the 1970s, Charles W. Bachman described different ways of accessing records in databases (13). Focusing on the programmer's role, he described the programmer's opportunities to access data as follows:

1. He can start at the beginning of the database, or at any known record, and sequentially access the "next" record in the database until he reaches a record of interest or reaches the end.

2. He can enter the database with a database key that provides direct access to the physical location of a record. (A database key is the permanent virtual memory address assigned to a record at the time that it was created.)

3. He can enter the database in accordance with the value of a primary data key. (Either the indexed sequential or randomized access techniques will yield the same result.)

4. He can enter the database with a secondary data key value and sequentially access all records having that particular data value for the field.

5. He can start from the owner of a set and sequentially access all the member records. (This is equivalent to converting a primary data key into a secondary data key.)

6. He can start with any member record of a set and access either the next or prior member of that set.

7. He can start from any member of a set and access the owner of the set, thus converting a secondary data key into a primary data key.

These rules give the programmer the ability to traverse datasets by following the references which structure the records. The interesting point about this approach is that the programmer can adopt access strategies without knowing the whole structure of the database. As a navigator, he explores the database.

Figure 3.5-1: Navigation path

Rules, as defined by Charles W. Bachman, can be implemented as procedural calls made over an API or as declarative statements. The main difference between the queries mentioned in the previous section and the navigation principles defined here is the following. Queries are built over the semantic or over the structure of the data model. Navigation is independent of the semantic and of the structure and directly uses the content. Thus, in our context, XQuery and XPath should be considered as navigational languages because they use the content to navigate in XML files.

JCR Model

In the JCR model, each record stores properties which relate to the location of the item in the database. The level, the path and, under certain conditions, the unique identifier are good examples of these specific properties. Nearly all the rules mentioned before are included in the model and allow navigation through the database with different types of strategies.

The root node can be seen as the beginning of the database. As mentioned in the first rule, it gives the ability to sequentially access all the sub-nodes. The path and the unique identifier properties allow navigation in a way which respects the second, third and fourth rules by giving specific entry points for specific situations. The node types and the parent nodes can be seen as set owners and thus allow navigation of the database in ways which respect the fifth, sixth and seventh rules.

These possibilities offered by the JCR Model (4) (5)

give the programmer a lot of flexibility. He is really

able to navigate through the data and adopt

strategies which will allow him to find data in

structures that are unfamiliar.
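These navigation strategies can be sketched with a small tree in plain Python. This is an illustration only, not the JCR API: the `Node` class, the helper functions and the sample identifiers are all hypothetical, and the mapping to the numbered rules in the comments is approximate.

```python
class Node:
    def __init__(self, name, parent=None, uuid=None):
        self.name, self.parent, self.uuid = name, parent, uuid
        self.children = []  # ordered, so "next" and "prior" are meaningful
        if parent is not None:
            parent.children.append(self)

    @property
    def path(self):
        if self.parent is None:
            return "/"
        return self.parent.path.rstrip("/") + "/" + self.name

def walk(node):
    """Rule 1: sequential access from a starting point (depth-first)."""
    yield node
    for child in node.children:
        yield from walk(child)

def by_path(root, path):
    """Direct entry point by path (compare rules 2 to 4)."""
    node = root
    for name in filter(None, path.split("/")):
        node = next(c for c in node.children if c.name == name)
    return node

def by_uuid(root, uuid):
    """Direct entry point by unique identifier (compare rules 2 to 4)."""
    return next(n for n in walk(root) if n.uuid == uuid)

def next_sibling(node):
    """Rule 6: from a member of a set to the next member of that set."""
    siblings = node.parent.children
    i = siblings.index(node)
    return siblings[i + 1] if i + 1 < len(siblings) else None

root = Node("")
science = Node("science", root)
book_a = Node("book-a", science, uuid="id-a")
book_b = Node("book-b", science, uuid="id-b")
```

Rule 5 corresponds to reading `science.children`, and rule 7 to reading `book_b.parent`. None of these calls requires any knowledge of a schema.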

Relational Model

In the relational model (2), records are seen as basic tuples of values. Basically, these data structures do not know their location in the database and are not ordered in relations. To enter the database, a programmer must have a good knowledge of the schema and of the data organization.

In one sense, we could say that the fifth rule

previously defined is fulfilled. However, because the

records are not ordered, it is not really the case.

Thus, the relational model does not take into account

these rules at all. The relational model only defines a

way to organize data and shifts the navigation

problem to a higher level.

Choosing the right approach

In terms of navigation, the two models are not comparable. The meaning given to the units of content is really different. Thus, choosing the right approach depending on the use case is not really hard. If the use case involves traversal access, exploration or navigation in data, a model which includes these concepts is always superior.

3.6 Synthesis

The two data models show fundamental differences. The choice of approach is closely related to the degree of flexibility which has to be given to the user. This choice also relates to the nature of the requirements, which may involve clear or abstract entities. The choice of the data model should always be based on a good analysis of the use case.

The selection of an approach also affects the main roles and responsibilities which relate to data management. All the people using a database should therefore be clearly informed of their roles and given usage guidelines. Particular attention should be paid to existing data usage habits, as they will have to change or evolve if a new data model is chosen.

Some users could voice reticence concerning these factors, as conservative behavior is an obstacle when deep changes arise. The choice of data model should not be affected by this type of reasoning. The advantages engendered through good and coherent choices are enormous and can have a significant impact on the application and the development process.

4 Specification comparison

Specifications describe the features that databases should support. The main specification for relational databases is without doubt SQL, which has been revised several times (SQL92, SQL99 and later revisions) since its first edition and which is more or less implemented by each relational database provider. The JCR specification was released in 2005 (JSR 170) and a second version of the specification is in incubation (JSR 283). Some companies such as Day, Alfresco or Oracle provide implementations of this specification with different levels of compliance.

Discussing the many aspects of each specification would take a long time. The principal objective of this document is rather to highlight the philosophy behind the specifications, which provide practical answers to common problems. For this reason, the examples shown in the following sections are essentially based on the SQL92 specification and on version 1.0 of JCR.

The first section of this chapter presents a use case which demonstrates how each specification can give practical answers to common problems. Being well balanced, it shows the possibilities and limits of each model. The four following sections essentially show how the concepts presented in the "Data model comparison" chapter actually take form in the specifications. Finally, the last section presents practical features which respond to the more common requirements.

4.1 Use Case Definition

Consider an editor who sells books and wants to create a system to manage his book collection and his orders. A book collection is composed of books and sub-collections. A book can be tagged with keywords. Through a website, the editor wants to let anonymous visitors navigate through the whole catalogue by collection.

He also wants to provide a book preview for authenticated customers and partners, and to let partners view the whole digital copy of the books. In addition to the ability to navigate through collections, partners and customers should be able to search for products by ISBN number, with full-text criteria, or by asking for the most successful items.

Figure 4.1-1: Editor use case diagram

Figure 4.1-1 is a draft of the use case diagram of this application. It summarizes the main actors and the main features which have been identified during the design process. In the next sections, this use case will be used to point out some key aspects which differentiate relational databases from Java content repositories.

4.2 Structure

In terms of structure, the two approaches are radically different. However, it makes sense to understand how each specification makes use of the basic concepts presented in the "Data Models" chapter.

This can assist people in developing implementation models and in solving practical problems.

JCR Specification

Like other unstructured and semi-structured models, the JCR model does not make a separation between data and their structure. Thus, there is no specific need to identify entities and attributes as required by relational databases. It is nevertheless important and useful to identify the semantic beforehand or, in other words, to identify the concepts represented by nodes in the content repository. This can be done by defining a node-type or by specifying an attribute which declares the type of the node. The schema depicted in Figure 4.2-1 does not represent the structure of the repository. It simply shows how the main concepts which can be found in the structure should be organized.

Figure 4.2-1: Semantic diagram

The root can be seen as the editor system which is

dealing with persons, orders, order lines, collections,

books and tags. This diagram does not take into

account the additional artifacts which could be added

in the content repository to organize data.

<editor = 'http://www.editor.com/1.0'>
[editor:person] > nt:unstructured
[editor:order] > nt:unstructured
[editor:orderline] > nt:unstructured
[editor:collection] > nt:unstructured
[editor:book] > nt:unstructured
[editor:tag] > nt:unstructured

Table 4.2-1: Node-types

The most intuitive way to design this structure or organization is to think in terms of composition, that is, the manner in which one concept is always a component of another concept. If UML class diagrams are used during the design phase, it consists only of translating the composition relationships into hierarchies. The other associations are stored as properties holding references or paths. More tips on how to design JCR applications are available in the "Appendix – JCR and design" appendix.
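The translation rule can be sketched with a path-keyed dictionary in Python. The node-type names come from Table 4.2-1; the paths, the property names (`title`, `quantity`, `book`) and the two helper functions are assumptions made for illustration and do not use the JCR API.

```python
# Composition becomes hierarchy: an order owns its order lines.
# Association becomes a reference: an order line points to a book by path.
repository = {
    "/books/book-a": {"type": "editor:book", "title": "Book A"},
    "/orders/order-1": {"type": "editor:order"},
    "/orders/order-1/line-1": {
        "type": "editor:orderline",
        "quantity": 2,
        "book": "/books/book-a",  # association stored as a path property
    },
}

def children(path):
    """Direct children of a node, found through the hierarchy."""
    prefix = path + "/"
    return [p for p in repository if p.startswith(prefix) and "/" not in p[len(prefix):]]

def resolve(path, prop):
    """Follow a path-valued property to the node it designates."""
    return repository[repository[path][prop]]

ordered_titles = [resolve(line, "book")["title"] for line in children("/orders/order-1")]
```

Deleting an order would naturally remove its lines, since they live below it, whereas the referenced book lives elsewhere and is left untouched.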

Even when we consider the environment to be structured, we are often unable to translate this structure clearly. Consequently, keeping the schema as weak as possible makes it easy to take new requirements into account at runtime by simply recording new data. If node-types are used as markers, it makes sense to simply let them extend the nt:unstructured node-type without adding more constraints.

Thus, at design time there is no real need to fix all the attributes and all the entities. In this example, some decisions can be taken later by the application programmer. The general idea is simply to leave room for new requirements.

SQL Specification

As explained in the previous chapter, the relational

model implies that data and their schema are

separate. In practice this means that all the tables

and their respective columns have to be identified at

the time of design. During the development process

the entity relationship notations are often used for

this purpose.
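To give an idea of what has to be fixed at design time, here is a hypothetical schema for the editor use case, created with Python's standard `sqlite3` module. It is a sketch under stated assumptions and not the schema of this paper: all table and column names are invented for illustration, and `purchase` stands for the order entity because ORDER is a reserved word in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every table and every column must be decided before any content is stored.
conn.executescript("""
    CREATE TABLE person     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                             parent_id INTEGER REFERENCES collection(id));
    CREATE TABLE book       (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                             description TEXT,
                             collection_id INTEGER NOT NULL REFERENCES collection(id));
    CREATE TABLE tag        (id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
    CREATE TABLE book_tag   (book_id INTEGER NOT NULL REFERENCES book(id),
                             tag_id  INTEGER NOT NULL REFERENCES tag(id),
                             PRIMARY KEY (book_id, tag_id));
    CREATE TABLE purchase   (id INTEGER PRIMARY KEY,
                             person_id INTEGER NOT NULL REFERENCES person(id));
    CREATE TABLE order_line (purchase_id INTEGER NOT NULL REFERENCES purchase(id),
                             book_id     INTEGER NOT NULL REFERENCES book(id),
                             quantity    INTEGER NOT NULL CHECK (quantity > 0));
""")

tables = sorted(row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
))
```

Seven tables are needed for six concepts, because the many-to-many association between books and tags requires a table of its own.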

For the editor's use case, this means that some decisions need to be made which will strongly impact the future evolution of the application. Data security and save routines must make use of the predefined columns. Everything has to be clearly described beforehand. For example, it is imperative to identify what an order is, what a book is and what a customer is. Hence the final application must and will reflect all these decisions, which are often arbitrary.

Figure 4.2-2 shows a

database schema which reflects the decisions

taken during the design phase. In this use

case, it is relatively easy to find relations and

domains for the main entities such as person, order, order

line and tag. At design time, their attributes can

clearly be identified and it is quite easy to conceive a

relational schema for them.

However, the book entity is difficult to fit into a table.

For example, this schema only stores the title and

the description of the book. However, the

requirements also call for storing a digital

copy and a preview of the book. The content of the

book could be part of the database or it could be

stored somewhere else in the file system. This kind

of decision is completely arbitrary and has an

enormous impact on the application’s life cycle.

4.3 Integrity

As mentioned, integrity can have different meanings.

In the database vocabulary, integrity generally

relates to the fact that accidental or intentional

destruction, alteration, or loss of data should not

happen. It also relates to the state of completeness of

data which has to be preserved in all cases in the

database. This section gives a quick overview of

the possibilities proposed by JCR and SQL to deal

with integrity.

JCR Specification

Data integrity can be ensured in JCR with node-

types. Some predefined node types are specified by

the JCR specification. These represent different

concepts which are often encountered in repositories

such as folders, files, links, unstructured nodes, etc.

These node-types can be extended, and constraints which

the nodes must respect can be

defined.

Figure 4.2-2: Entity relationship diagram

In our use case, the state of completeness of data

which always has to be preserved in the database

does not require a lot of constraints. In a real-life

situation, it could happen that a person places an

order and takes direct delivery of the product,

or that a special edition of a book has no ISBN.

Such cases have to be taken into consideration.

However, the corresponding decisions should not be

fixed at a level which is detrimental to future

requirements.

The only integrity constraints we might choose to

define concern the orders and the order lines. For

legal compliance, it would be necessary that an order

stores a date and that an order line stores

a unit price and a quantity. This is shown in

Table 4.3-1.

<editor = 'http://www.editor.com/1.0'>

[editor:order] > nt:unstructured
  - 'created' (Date) mandatory

[editor:orderline] > nt:unstructured
  - 'quantity' (double) mandatory
  - 'unitprice' (double) mandatory

Table 4.3-1: Node-type and integrity constraints

The fact that an order line can only be found under

an order node cannot be expressed at the repository

level. However, this constraint can be taken into

account at an application level. We might also need

to define a referential integrity constraint between the

ordered product and the order line. The code shown

in Table 4.3-2 demonstrates how this can be done.

[editor:orderline] > nt:unstructured
  - 'product' (reference) mandatory

Table 4.3-2: Node-type and referential integrity

The meaning of this kind of association could be

discussed at length, but keeping a strong reference

between the product and the order line, which

implies referential integrity, does not really make

sense. A product can evolve, and this sort of

association would then lose its meaning. Furthermore,

the editor may in the future want to sell a service

instead of a book. Imposing referential

integrity is therefore probably excessive, and we can

more realistically accept broken

references between order line and product. The

same comment applies to the tags, which rely on

an association of a similar nature.

SQL Specification

The fact that, in the relational model, the structure is

separated from the content and that it has to be

described leads to creating data models which are a

representation of what will be the final usage of the

application. Furthermore, because some integrity

rules are implicit in the model, DBAs generally do not

hesitate to define, at design time, all of the integrity

rules needed to preserve the coherence of the entire

data set.

In practice for the editor’s use case, this means that

some application logic can be translated into integrity

constraints. With check constraints, we could ensure

that the quantity attribute of an order line is always

positive. With referential integrity, we can ensure that

when a tag is deleted, all the links which concern

this tag are also deleted. The statements in Table

4.3-3 and Table 4.3-4 show how this can be

achieved.

CREATE TABLE IF NOT EXISTS `mydb`.`OrderLine` (
  `Order_idOrder` INT NOT NULL,  -- column type missing in the source; INT is assumed
  `Book_isbn` VARCHAR(45) NOT NULL,
  `unitprice` DECIMAL(11) NULL CHECK (unitprice > 0),
  `quantity` INT NULL CHECK (quantity > 0),
  PRIMARY KEY (`Order_idOrder`, `Book_isbn`))

Table 4.3-3: Table and integrity constraints

CREATE TABLE IF NOT EXISTS `mydb`.`Tag_has_Book` (
  `Tag_idTag` INT NOT NULL,
  `Book_idBook` VARCHAR(45) NOT NULL,  -- type missing in the source; assumed to match `Book`.`isbn`
  PRIMARY KEY (`Tag_idTag`, `Book_idBook`),
  CONSTRAINT `fk_Tag_has_Book_Tag`
    FOREIGN KEY (`Tag_idTag`) REFERENCES `mydb`.`Tag` (`idTag`)
    ON DELETE CASCADE ON UPDATE CASCADE,
  CONSTRAINT `fk_Tag_has_Book_Book`
    FOREIGN KEY (`Book_idBook`) REFERENCES `mydb`.`Book` (`isbn`)
    ON DELETE CASCADE ON UPDATE CASCADE)

Table 4.3-4: Table and referential integrity

The advantage of referential integrity constraints is

not negligible. They minimize the efforts made at

application level to ensure the coherence of the data

stored in the database. However in the case of the

tag, if the tag has been assigned a thousand times, deleting

it will imply a thousand and one write

accesses. If tags change frequently, the system will

probably not sustain these integrity checks. A better

policy could be to allow incoherent tag attributions to

survive in the database and to delete them if they are

incoherent during the next read access.
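The lazy policy described above can be sketched at application level. This is a minimal illustration, not the paper's implementation: the class LazyTagCleanup and its in-memory collections are hypothetical stand-ins for the Tag and Tag_has_Book tables, and the method names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the Tag and Tag_has_Book tables. */
final class LazyTagCleanup {
    private final Set<Integer> tags = new HashSet<>();
    private final Map<String, List<Integer>> attributions = new HashMap<>();

    void addTag(int tagId) {
        tags.add(tagId);
    }

    void attribute(String isbn, int tagId) {
        attributions.computeIfAbsent(isbn, k -> new ArrayList<>()).add(tagId);
    }

    /** Deleting a tag is a single write; its attributions are left dangling. */
    void deleteTag(int tagId) {
        tags.remove(tagId);
    }

    /** Reading a book's tags removes the attributions whose tag is gone. */
    List<Integer> readTags(String isbn) {
        List<Integer> stored = attributions.get(isbn);
        if (stored == null) {
            return new ArrayList<>();
        }
        stored.removeIf(tagId -> !tags.contains(tagId));
        return new ArrayList<>(stored);
    }

    /** Attributions currently stored for a book, dangling ones included. */
    int storedAttributions(String isbn) {
        List<Integer> stored = attributions.get(isbn);
        return stored == null ? 0 : stored.size();
    }

    public static void main(String[] args) {
        LazyTagCleanup store = new LazyTagCleanup();
        store.addTag(1);
        store.addTag(2);
        store.attribute("isbn-1", 1);
        store.attribute("isbn-1", 2);
        store.deleteTag(1);
        System.out.println(store.readTags("isbn-1"));
    }
}
```

The cost of a deletion no longer depends on how often the tag was assigned; the price is that readers must tolerate, and repair, dangling attributions.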

Specifying all the integrity constraints at the model

level can lead to performance and scalability

problems, and it also restricts potential uses

which have not been identified at design time.

Implementing a new requirement would impose a

new development cycle which starts from the

implementation model definition and finishes with the

implementation of the user interface.

4.4 Operations and queries

In terms of operations and queries, we could consider

the four following requirements. The editor wants to

identify the top 10 best sellers. He also wants to

change the status of all of the orders which respect

some specific conditions. He wants to be able to

retrieve all the books which are under a specific

collection and finally, he wants to perform full text

search on all items stored in the system.

JCR Specification

The abstract query model of JCR is implemented in

several ways for different uses. Version 1.0

of JCR uses subsets of XPath and SQL,

which opens up the opportunity for some interesting

queries. The draft of version 2.0 declares

XPath deprecated and adds a query

object model which uses Java objects.

The first requirement, identifying the

best sellers, cannot easily be expressed with JCR in

one query. The reason is that domain

operations such as Max and Min are not included in the

specification, and joins only allow the retrieval of books

which have been ordered at least once (Table 4.4-1).

SELECT * FROM editor:book, editor:orderline
WHERE editor:book.jcr:path = editor:orderline.product

Table 4.4-1: simple JCR query

Using the iteration technique shown in Table 4.4-2, the top 10

can be produced by running a query for each book which yields its

number of related orders. The summed

results can then be used to build the top 10. This works

for simple queries, but if joins which include

domain operations are needed, the code becomes

considerably more complex.
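The client-side aggregation can be sketched as follows. This is only an illustration under stated assumptions: the OrderLine class is a hypothetical value object standing for the order-line nodes returned by the repository, and the sketch sums quantities per book (as the SQL version does) rather than calling the JCR API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopSellers {
    /** Hypothetical value object: one order line read from the repository. */
    static final class OrderLine {
        final String bookPath;
        final int quantity;

        OrderLine(String bookPath, int quantity) {
            this.bookPath = bookPath;
            this.quantity = quantity;
        }
    }

    /** Sums the quantities per book and returns the n best-selling book paths. */
    static List<String> top(List<OrderLine> lines, int n) {
        Map<String, Integer> sold = new HashMap<>();
        for (OrderLine line : lines) {
            sold.merge(line.bookPath, line.quantity, Integer::sum);
        }
        List<String> books = new ArrayList<>(sold.keySet());
        // Highest total first; ties are broken by path to keep the result stable.
        books.sort((a, b) -> {
            int byTotal = Integer.compare(sold.get(b), sold.get(a));
            return byTotal != 0 ? byTotal : a.compareTo(b);
        });
        return new ArrayList<>(books.subList(0, Math.min(n, books.size())));
    }

    public static void main(String[] args) {
        List<OrderLine> lines = Arrays.asList(
                new OrderLine("/books/a", 2),
                new OrderLine("/books/b", 5),
                new OrderLine("/books/a", 4));
        System.out.println(top(lines, 10));
    }
}
```

Everything the declarative GROUP BY and ORDER BY clauses would do in SQL has to be written, tested and maintained by the application programmer here.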

The second requirement which is aimed at changing

the status of some orders cannot be expressed with

a single query. However, the results can be

accessed and modified through the navigation API. If

the selection criteria involve domain conditions or

many connections this kind of query becomes very

complicated.

SELECT * FROM editor:order
WHERE date < '+2008-11-02T00:00:00:000TZD'

(…)

NodeIterator ni = queryresult.getNodes();
while (ni.hasNext()) {
    Node n = ni.nextNode();
    n.setProperty("status", "closed");
}

Table 4.4-2: JCR query and iteration on the result

Retrieving all the books which are stored under a

collection is very easy to implement (Table 4.4-3).

Some properties which relate to the record (path,

uuid, etc.) are accessible through XPATH and SQL.

The strengths of JCR and its features are very

evident in this type of situation.

SELECT * FROM editor:book WHERE jcr:path LIKE '/collections/science/%'

Table 4.4-3: JCR query and hierarchy

JCR offers domain independent functions which

allow the execution of queries on all the properties

stored in nodes. As mentioned, the JCR model is

unstructured, and the nodes do not have to reflect

the same properties. Therefore this is a very powerful

functionality for all the use cases which require full

text searches. As illustrated in Table 4.4-4, retrieving

the set of nodes which contain a specific sequence of

characters is very simple.

SELECT * FROM nt:base WHERE CONTAINS(*, '*computer*')

Table 4.4-4: JCR query and full-text search

In conclusion, use cases which are

characterized by a lot of join and domain operations

will not really benefit from the features proposed by

JCR. On the other hand, in terms of operations and

queries, if the use cases characteristically require

hierarchical queries, full text search queries and

search queries in binary content, a java content

repository would be advisable.

SQL Specification

As explained in the last chapter, the relational model

shows all of its power when the requirements call for

connecting operations (joins) and domain operations.

Furthermore, if the requirements involve a

high volume of sequential changes to large volumes

of records, the possibilities offered by this model

respond favorably to these needs as well.

The first requirement, retrieving a top 10 of the most

sold books, can easily be expressed with SQL.

Table 4.4-5 shows how this can be done with a

simple join and a GROUP BY clause.

SELECT b.isbn, b.title, sum(o.quantity)
FROM editor.book b
JOIN editor.orderline o ON o.bookIsbn = b.isbn
GROUP BY b.isbn
ORDER BY sum(o.quantity) DESC
LIMIT 10;

Table 4.4-5: SQL query and simple join operation

Updating the status of the orders is also quite easy to

implement with one query (Table 4.4-6). This kind of

statement is very useful when sequential

modifications which depend on complex conditions

have to be performed on the dataset.

UPDATE editor.`order` o
SET o.`status` = 'closed'
WHERE o.`date` <= curdate() - INTERVAL 1 YEAR;

Table 4.4-6: SQL update query

The third requirement is more complicated to realize.

In this case, the depth of the hierarchy of collections

is not known in advance and, without recursive query

support (standardized in SQL:1999 but not available in

every product), it is not possible to define a single SQL

query which takes this unknown parameter into account. One

possible way to proceed is to recursively retrieve the

collections with a statement similar to the code in

Table 4.4-7, and then to run a query on all the books

stored under the retrieved collections.

SELECT c1.id FROM collection AS c1
JOIN collection c2 ON c1.parentId = c2.id
WHERE c2.id = $categoryId;

SELECT * FROM book AS b
WHERE b.collectionId = $categoryId[0]
   OR b.collectionId = $categoryId[1]
   OR b.collectionId = $categoryId[n];

Table 4.4-7: SQL query and recursion limitation
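The application-side recursion can be sketched as follows. This is a minimal illustration under stated assumptions: the CollectionTree class is hypothetical, its map stands in for the collection table, childrenOf plays the role of the first statement of Table 4.4-7, and the hierarchy is assumed to contain no cycles.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CollectionTree {
    // collection id -> parent id (null for a root); stands in for the collection table
    private final Map<Integer, Integer> parentOf = new HashMap<>();

    void add(int id, Integer parentId) {
        parentOf.put(id, parentId);
    }

    /** Stand-in for one query: SELECT id FROM collection WHERE parentId = ? */
    List<Integer> childrenOf(int id) {
        List<Integer> children = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : parentOf.entrySet()) {
            Integer parent = entry.getValue();
            if (parent != null && parent == id) {
                children.add(entry.getKey());
            }
        }
        Collections.sort(children); // deterministic order for the example
        return children;
    }

    /** Returns the collection and all its descendants, one "query" per visited node. */
    List<Integer> withDescendants(int id) {
        List<Integer> result = new ArrayList<>();
        collect(id, result);
        return result;
    }

    private void collect(int id, List<Integer> result) {
        result.add(id);
        for (int child : childrenOf(id)) {
            collect(child, result);
        }
    }

    public static void main(String[] args) {
        CollectionTree tree = new CollectionTree();
        tree.add(1, null);
        tree.add(2, 1);
        tree.add(3, 2);
        System.out.println(tree.withDescendants(1));
    }
}
```

The resulting list of identifiers would then feed the second statement of Table 4.4-7. The number of round trips grows with the size of the subtree, which is the limitation the text points out.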

Nested sets can be used to avoid recursive calls.

However, the cost of updating

the hierarchy is hard to predict. Nested intervals (12)

partially solve this problem but, like nested sets, they

incur some maintenance complexity. While relational

databases permit the management of hierarchies,

they do not provide effective tools

for maintaining them. Application programmers tend

to use frameworks to manage these requirements

more elegantly.
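The trade-off of nested sets can be illustrated with a small sketch. It is an assumption-laden illustration, not taken from the paper: the NestedSets class, its Node type and the in-memory list are hypothetical stand-ins for a table with left and right bound columns.

```java
import java.util.ArrayList;
import java.util.List;

final class NestedSets {
    /** Hypothetical row: a collection with its left and right bounds. */
    static final class Node {
        final String name;
        int left;
        int right;

        Node(String name, int left, int right) {
            this.name = name;
            this.left = left;
            this.right = right;
        }
    }

    /** A subtree is read without recursion: descendants lie strictly inside the bounds. */
    static List<String> descendants(List<Node> nodes, Node ancestor) {
        List<String> names = new ArrayList<>();
        for (Node n : nodes) {
            if (n.left > ancestor.left && n.right < ancestor.right) {
                names.add(n.name);
            }
        }
        return names;
    }

    /** Adds a leaf as last child of parent; returns how many existing rows had to be rewritten. */
    static int insertLeaf(List<Node> nodes, Node parent, String name) {
        int pivot = parent.right;
        int rewritten = 0;
        for (Node n : nodes) {
            boolean changed = false;
            if (n.left >= pivot) {
                n.left += 2;
                changed = true;
            }
            if (n.right >= pivot) {
                n.right += 2;
                changed = true;
            }
            if (changed) {
                rewritten++;
            }
        }
        nodes.add(new Node(name, pivot, pivot + 1));
        return rewritten;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        Node root = new Node("collections", 1, 4);
        nodes.add(root);
        nodes.add(new Node("science", 2, 3));
        System.out.println(insertLeaf(nodes, root, "arts") + " row(s) rewritten");
    }
}
```

Reading a subtree is a single range comparison, but the number of rows rewritten by an insertion depends on where the new node lands in the tree, which is why the update cost is difficult to anticipate.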

Performing full text search queries on a relational

database requires a good knowledge of its structure.

In fact, only the columns specified in the statement

will be searched. For complex models,

alternative solutions with external indexes are often

used to perform this kind of request.

SELECT * FROM book AS b
WHERE b.title LIKE '%computer%'
   OR b.description LIKE '%computer%';

SELECT * FROM collection AS c
WHERE c.title LIKE '%computer%'
   OR c.description LIKE '%computer%';

SELECT * FROM tag AS t
WHERE t.title LIKE '%computer%'
   OR t.description LIKE '%computer%';

Table 4.4-8: SQL query and full-text search limitation

Table 4.4-9 presents the non-standardized syntax

proposed by MySQL for full text search.

Unfortunately, the problem linked to the structure is not

really solved, and this solution does not support full

text search across multiple tables.

SELECT * FROM book AS b
WHERE MATCH (b.title, b.description, b.isbn)
      AGAINST ('word');

Table 4.4-9: MySQL and full-text search

The first requests in this section show the power that

can be reached by combining different operators in

declarative statements. For complex models which

imply sequential data modification in conjunction with

domain operations, relational databases make more

sense. However, the strength provided by a fixed structure

disappears when the use case involves features

linked to hierarchies, networks and search on

semi-structured data. Therefore a good knowledge of the

whole use case is required before being able to

make a choice between the two options.

4.5 Navigation

In our use case, the entity "book" has not been

clearly defined. This type of entity is difficult to

pin down. It is composed of other entities which are not

known in advance, such as titles, paragraphs, images, pages or

covers. Furthermore these entities can vary from one

book to the other. For the editor’s use case, we could

consider the two following types of books saved in

the system. The first is a novel,

essentially composed of ordered chapters, titles, and

paragraphs. The second is a comic,

composed of ordered plates.

JCR Specification

Without a doubt, navigation constitutes the main

feature proposed by the JCR specification. Creating

and exploring a tree or a network structure is not

always easy. Navigation simplifies this.

The API proposed by the JCR specification allows

navigation in and through records with direct access

or traversal access. A session is the main entry point

of the repository and provides a traversal access to

the root node and a direct access to each node by

using their uuid or path. Each item of the repository

also provides navigational functionalities which make

use of direct access through relative path or traversal

access through children, properties, references or

parents. This API also provides write features to the

repository, so that new nodes, properties and values

can easily be created and saved. Table 4.5-1 lists

the main navigation methods.

session.getRootNode();
session.getNodeByUUID("uuid");
session.getItem("path");
Node.getNode("name");
Node.getNodes();
Node.getProperty("name");
Node.getProperties();

Table 4.5-1: JCR navigation API

As mentioned, in our use case the entity "book"

cannot be completely defined at design time. That is

why the application programmer should give the user

the ability to decide what a book is at input time.

At the moment of creation, the application

programmer does not need to know what types of

entities are present in a book. He will let the user

define them at a later stage. The book can then be

displayed by traversing the configuration of its

components.

public void displayBook(Node book) throws RepositoryException {
    this.traverse(book);
}

public void traverse(Node node) throws RepositoryException {
    NodeIterator nodeIterator = node.getNodes();
    displayNode(node);
    while (nodeIterator.hasNext()) {
        traverse(nodeIterator.nextNode());
    }
}

public void displayNode(Node node) {
    // display logic...
}

Table 4.5-2: JCR traversal access

The methods shown in Table 4.5-2 sketch

the advantages that can be gained by using

navigation. Several possibilities are now

open to the application programmer. He could

provide tools to let the user store the display logic

directly in the nodes, giving maximum flexibility.

This kind of strategy can be adopted through

features proposed by the JCR specification. A

framework such as Sling can facilitate this task.

Figure 4.4-1: Unstructured entity

SQL Specification

As mentioned, the relational model does not take

navigation into consideration and places the

responsibility for implementing these features

on the programmer. Furthermore, all the entities have to be

defined at design time and semi-structured data is

not catered for.

For the editor’s use case, the implications are that

the application programmer will face some problems

if he is not able to define an abstract entity for the

content of the book. Figure 4.5-2 shows how the

application programmer could choose to design his

relational model to take into account that the

structure of the book only becomes known

at input time.

Figure 4.5-2: SQL and unstructured entity

SQL does not standardize mechanisms which

simplify the navigation through records during a

session. Furthermore, there is no real context of

position in the database which is preserved during a

session and which can simply be reused.

To navigate, the application programmer is obliged to

build a mechanism which is able to perform dynamic

queries on the model. Therefore even if the model is

extremely abstract and able to take into account all

the possible situations, the application programmer is

forced to develop all the application logic to navigate

the structure. This task is by no means trivial.

It is possible to make an implementation model which

adds artifacts or miscellaneous entities to the records

to create hierarchies, networks or explicit orders.

However, this approach exposes the application

programmer to design flaws, which are

very difficult to correct once the system is in

production.

4.6 Transactions

In the current context, we can identify two levels of

transaction. The transactions which deal with one

resource and ensure that a sequence of changes can

be considered as a unit of work can be considered as

local. The others, referred to as global transactions

(14), deal with several resources and require a

coordinator or a transaction manager to make sure

that the changes can be committed to the pertinent

resources.

Figure 4.6-1: Global and local transaction

JCR Specification

The JCR specification covers both cases. Locally,

if the application programmer deals with

only one repository instance, he can ensure that a

sequence of changes is treated as a unit of

work. All the changes between two save calls can be

considered as a unit of work.

Session.save();
Item.save();

Table 4.6-1: JCR and local transaction

In an application, a content repository can be used

as a resource in conjunction with other resources such as

a relational database, a messaging service or

something else. The specification mentions that a

repository implementation can be used in conjunction

with the Java Transaction API (JTA). In a Java

container, when the Transaction API is used, the

changes made on the JCR resource are made permanent

only at the end of the transaction.

// Get user transaction (for example, through JNDI)
UserTransaction utx = ...

// Perform some changes in a java content repository
// Perform some changes in a relational database

// Commit the user transaction
utx.commit();

Table 4.6-2: JCR and global transaction

SQL Specification

The SQL specification allows the grouping of

statements as a unit of work. These statements will

only be made permanent in the database if they all

succeed. Local transactions such as

the one shown in Table 4.6-3 are therefore part of the

standard.

START TRANSACTION;
(Statement list…)
COMMIT;

Table 4.6-3: SQL and local transaction

However, using the database in conjunction with

other resources is not taken into account by the

specification. Some implementations provide

statements to manage this kind of scenario, such as

the XA statements of MySQL. All the same, this

is more often handled at a higher and more

standardized level. Some APIs provide these

features and most JDBC drivers can therefore be

used with JTA.

4.7 Inheritance

To enrich our use case with a wider panel of

associations, we could consider a subsequent new

requirement which involves inheritance features.

The editor wants to differentiate between his

collaborators, his partners and his customers but he

also wants to take into consideration that an

individual can have several roles.

JCR Specification

For the inheritance requirement, node-types and

mixin-types can be used. For example let us consider

a Person node-type which has three mixin-types

respectively customer, collaborator and partner. By

taking one or more mixin-type, a node which has

been defined as a person can take on all the roles

encountered in the system.

Figure 4.7-1: inheritance semantic

<editor = 'http://www.editor.com/1.0'>

[editor:partner] > editor:person mixin

[editor:collaborator] > editor:person mixin

[editor:customer] > editor:person mixin

Table 4.7-1: node-types and inheritance

The primary advantage is that queries made on the

person node-type will return all nodes of this type,

including nodes which inherit from this

node-type. All the properties of the returned nodes

are immediately accessible and a node which was

not considered as a person can also acquire this

status through the mixin-type.

SQL Specification

Inheritance tends to be encountered at application

level. Some relational databases, for

example PostgreSQL, provide extensions which

manage inheritance. However these tools are not

standardized and tend not to be used in practice.

A classical way to handle this requirement

consists of creating a table for each entity

which inherits characteristics from the person

entity. The identifier of each of these sub-entities is

a foreign key which points to the person table. Figure

4.7-2 shows how this could be

implemented with SQL.

Figure 4.7-2: SQL and inheritance

It is quite easy to create a query which retrieves the

entire set of persons and all their inherited properties.

The one depicted in Table 4.7-2

shows how this can be done with left

outer joins. Additionally a view can be created to

avoid having to rewrite the query.

SELECT * FROM person p
LEFT OUTER JOIN partner pa ON pa.id = p.id
LEFT OUTER JOIN collaborator co ON co.id = p.id
LEFT OUTER JOIN customer cu ON cu.id = p.id;

Table 4.7-2: SQL query and inheritance

While JCR seems to offer a more flexible way to express

inheritance, both approaches are

approximately equal in expressing

this kind of association. The real

advantage of JCR is that each

node can inherit from several mixin node-types. Note,

however, that this advantage relates more to

the semi-structured approach than to inheritance

itself.

4.8 Access Control

Access control can be defined as the action of

authorizing or denying access, modification and

creation of records. While this is nearly always a

requirement in business applications, specifications

rarely address real-world situations.

In the editor’s use case, it was mentioned that a

person should be able to see a digital preview of the

book and under certain conditions the whole book.

This implies that books’ components can have

different access policies.

JCR Specification

Since the 1.0 version of JCR (4), access control is

one of the core features. In its first release, the

specification only declares how to login to the

repository and how to check the permissions

attributed to the items of the repository. The

hierarchical path of the items stored in the repository

is used as the basis on how to check these

permissions. However, the specification does not

specify how access control should be implemented

and managed.

Repository.login(Credentials cred);

Session.checkPermission(String absPath, String actions);

Table 4.8-1: JCR 1.0 and access control

Version 2.0 of the specification (5) defines how

the concepts of privileges and access control policies

function in the repository. Each item stores

properties which relate to privileges. These

properties can be modified through the API. Thus the

access control feature can be delegated to the

content repository which is able to manage the list of

permissions at an item level.

Session.getUserManager();
UserManager.addUser(…);
UserManager.addGroup(…);

Session.getAccessControlManager();
AccessControlManager.getApplicablePolicies(path);
Policy.addEntry(…);
AccessControlManager.setPolicy(path, policy);

Table 4.8-2: JCR 2.0 and access control

In both cases, this means that for the editor’s use

case, the application programmer will only have to

define the structure and to use the repository

features provided to manage access control. The

access control granularity proposed by the API is

close enough to the data to address all the potential

use cases. Consequently, further access control logic

is not required.

SQL Specification

In SQL, access control is basically managed with the

data stored in the information schema (10). This

provides the ability to grant and deny privileges at a

table or a column level. However, while the base

functionalities provided by SQL allow the

declaration of implementation models which manage

permissions at a record level, there is no inherent

standard solution provided. This comes from the fact

that the identifiers of the records in a relational

database can be distributed across several domains.

Conserving this property makes it difficult to specify a

generic way to manage access control at a record

level.

For the editor’s use case, controlling who can read

the different pieces of information of which a book is

composed requires access control to be

administered at a record level. Because

the SQL specification does not provide this

feature, the application programmer must

include it in his implementation model.

Figure 4.8-1 shows a solution in which each

record has a unique identifier stored in a column. The

record controller table identifies the

accessible resources within the database. The

record_accessor table identifies

the parties accessing the database, which can then

be stored elsewhere in the database in a user or a

group table. With this model, the

application programmer must still implement and

maintain the logic which performs the privilege

checks.

Figure 4.8-1: JCR and access control
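The privilege check which remains the application programmer's responsibility can be sketched as follows. This is only an illustration under stated assumptions: the RecordAccessControl class, the Privilege values and the identifier strings are hypothetical, and the nested map stands in for the record controller and record_accessor tables.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RecordAccessControl {
    enum Privilege { READ, WRITE }

    // record id -> accessor id -> granted privileges
    private final Map<String, Map<String, Set<Privilege>>> grants = new HashMap<>();

    void grant(String recordId, String accessorId, Privilege privilege) {
        grants.computeIfAbsent(recordId, k -> new HashMap<>())
              .computeIfAbsent(accessorId, k -> new HashSet<>())
              .add(privilege);
    }

    /** Access is denied unless a matching grant has been recorded. */
    boolean isAllowed(String recordId, String accessorId, Privilege privilege) {
        Map<String, Set<Privilege>> byAccessor =
                grants.getOrDefault(recordId, Collections.emptyMap());
        Set<Privilege> granted =
                byAccessor.getOrDefault(accessorId, Collections.emptySet());
        return granted.contains(privilege);
    }

    public static void main(String[] args) {
        RecordAccessControl acl = new RecordAccessControl();
        acl.grant("book-1/preview", "customer-7", Privilege.READ);
        System.out.println(acl.isAllowed("book-1/preview", "customer-7", Privilege.READ));
        System.out.println(acl.isAllowed("book-1/content", "customer-7", Privilege.READ));
    }
}
```

Every read or write path of the application has to call such a check explicitly, whereas a content repository applies its access control to each item without additional application code.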

4.9 Events

Another requirement often encountered concerns the

observation of the changes which can be applied to a

dataset. At the infrastructure level, messaging

services are common examples of components

which make use of these types of events. Some use

cases benefit from being event driven; one such case

would be the management of workflows. The editor’s use

case could also benefit from this type of

methodology. For example, the editor may want to

notify some clients each time a new book is added to

a specific collection.

JCR Specification

The JCR specification provides an EventListener

interface whose implementations define the operations

to be performed when a specific event

occurs. These listeners can be registered for different

types of event for example:

when nodes are added or removed

for events which occur under a particular

path, at a specific level

for events which occur on the instances of a

node-type or on a single node identified by a

UUID.

The code example presented in Table 4.9-1 shows

how an event listener can be registered for all the

events which occur when a book is added to the

computer collection.

ObservationManager om =
    session.getWorkspace().getObservationManager();

EventListener el = new EventListener() {
    @Override
    public void onEvent(EventIterator ei) {
        System.out.println("A book has been added");
    }
};

String[] nt = { "editor:collection" };
om.addEventListener(el, Event.NODE_ADDED,
    "/collections/science/computer", true, null, nt, false);

Table 4.9-1: JCR and observation

This observation mechanism allows listening in on

events with a fine granularity. Furthermore, the fact

that the observation mechanism is provided directly

through a Java API instead of a specific procedural

language allows a high level of interaction between

the application and the repository.

However, an important aspect is that the listeners are

not permanent. This means that if the repository is

restarted, all the listeners have to be reregistered. In

certain situations, especially those which occur when

the event listeners are registered at runtime, the

recovery of the application’s state can be difficult and

complex.
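One way to soften this re-registration problem is to keep a record of every registration made at runtime and to replay that record when the application starts again. The following is only a minimal sketch under stated assumptions: `ListenerRegistration` and `ListenerRegistry` are hypothetical names that are not part of the JCR API, the record is kept in memory (a real application would have to persist it), and the actual call to the observation manager is replaced by a callback so that the example runs with the standard library alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical value object describing one listener registration. */
final class ListenerRegistration {
    final int eventTypes;      // bit mask of event types
    final String absPath;      // path under which events are observed
    final boolean deep;        // true to observe the whole subtree
    final String[] nodeTypes;  // node type filter, may be null

    ListenerRegistration(int eventTypes, String absPath, boolean deep, String[] nodeTypes) {
        this.eventTypes = eventTypes;
        this.absPath = absPath;
        this.deep = deep;
        this.nodeTypes = nodeTypes;
    }
}

/** Hypothetical registry which remembers registrations so that they can be replayed. */
final class ListenerRegistry {
    private final List<ListenerRegistration> registrations = new ArrayList<>();

    /** Records a registration made at runtime. */
    void remember(ListenerRegistration registration) {
        registrations.add(registration);
    }

    /** Replays every remembered registration through the callback and returns how many were replayed. */
    int replay(Consumer<ListenerRegistration> register) {
        for (ListenerRegistration r : registrations) {
            register.accept(r);
        }
        return registrations.size();
    }
}
```

In a real application the callback passed to `replay` would call the observation manager with the stored arguments after each restart of the repository.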

SQL Specification

The SQL specification addresses the observation

problem with triggers. One of the main advantages of

triggers is that they remain in the information

schema. This ensures that the state of the database

including the triggers can be easily recovered.

Triggers can be registered for insert, update or delete

operations which are visible on specific tables. The

body of the trigger generally contains procedural

calls which can be launched before or after queries.

CREATE TRIGGER editor.book_insert
AFTER INSERT ON editor.book
FOR EACH ROW
BEGIN
    (Statement list…)
END;

Table 4.9-2: SQL and triggers

For the editor’s use case, the trigger shown in Table
4.9-2 listens in on the registration
of new books. However, it is not possible to listen in

on only the events which occur in a subset of the

table. In addition, there is no standard way to

propagate the event from the procedural language to

the application. Hence triggers are mainly used to

modify data in the database following inserts or

updates.

4.10 Version control

Version control is often an issue when people are

collaborating on the same data. It is therefore
prudent to keep the history of an object
and to give the user access to the evolution of that
object. For the case in question, we could imagine

that after a certain lapse of time, the editor decides to

manage in the system the different versions and

editions of the books.

JCR Specification

Support for version control is one of the features that characterizes content repositories which are fully compliant with the JCR specification. The JCR specification includes versioning as a part of the standard. It can be supported for individual items and for hierarchies of items. This simplifies the life of application programmers, who normally have to deal with this kind of need themselves. As shown in Table 4.10-1, managing versions of a hierarchy does not require an enormous effort.

// add the versioning mixin type
book.addMixin("mix:versionable");
session.save();

// version creation
book.checkout();
book.addNode("chapter1");
session.save();
book.checkin();

book.checkout();
book.addNode("chapter2");
session.save();
book.checkin();

book.checkout();
book.setProperty("isbn", "0-85131-041-9");
session.save();
book.checkin();

// get the second version
VersionIterator vi = book.getVersionHistory().getAllVersions();
Version v;
v = vi.nextVersion();
v = vi.nextVersion();

// restore the second version
book.checkout();
book.restore(v, true);

Table 4.10-1: JCR and version control

SQL Specification

Some relational database implementations provide versioning functionalities. However, versioning is not part of the SQL standard. Anyone wishing to build an interoperable application has to include versioning in the implementation model. Properly managing complex graphs in relational databases is quite difficult. So while versioning could be implemented, the task could not be carried out with SQL alone.

4.11 Synthesis

It seems that for both specifications the structural

part and the integrity parts are well defined.

However, while the relational model provides very

clear foundations for operations and queries, the

JCR specification seems to provide operations and

queries on a relatively obscure basis.

The same remark can be made for navigation. While
the JCR specification provides a strong navigational
basis, the latest versions of the SQL specification
struggle to provide a coherent set of features which
take this factor into consideration. Improvements

could be made in these areas for both models with

recommendations and enhancements being shared

mutually.

As an additional key aspect, the difference in approach
between the two specifications is noteworthy. Generally, it
appears that the JCR specification is pragmatic in
comparison with the SQL specification. The features

provided by JCR give practical answers to common

and recurrent problems.

Providing a standard way to solve recurring problems in a natural and elegant manner is not obligatory, but doing so protects the application programmer from design failures, such as those which could relate to the management of versioning or access control.

While relational databases implemented on the SQL
specifications have the potential to represent all
types of use cases which could appear in real life,
they are often badly constructed due to the
constraints which govern a project’s
evolution or lifecycle. This does not detract from the

fact that the relational model does contain a

complete set of main building blocks for a database.

At specification level, SQL makes extensive use of its

base components to express its various extensions.

Conclusions can be drawn from this, principally that a

specification’s foundation should be able to handle

and manage all kinds of use cases and secondly that

a specification should evolve and build onto its

foundation and not away from it.


5 Development process comparison

Another perspective is taken in this chapter to

compare relational databases and java content

repositories. The purpose is to show the key

differences between data models which impact the

application’s development process. These

differences cannot really be measured but are

significant enough to be mentioned.

Agile development processes such as "Extreme
Programming", "Rational Unified Process" or "Open
Up" divide project life cycles into steps such as
inception, elaboration, construction and transition.
These phases can be executed iteratively. The
process depicted in Figure 4.11-1: Agile and iterative

development process summarizes a possible

segmentation of the time taken for the Open Up

development process. The following sections will

make reference to these steps. The purpose is to

show where and how both models, the JCR one and

the relational one, can respectively impact this

process.

5.1 Data Understandability

Making architectural and implementation models

understandable is one of the key aspects of the

elaboration phase. Clear architecture which can

easily be communicated allows people to enter more

quickly into the project. It is also easier to define

tasks and duties if the architecture is clear and made

of separate modules.

Generally the architecture is defined or refined by an

architect or an analyst during the elaboration stage.

This actor takes the requirements identified during the

inception phase as input and delivers blueprints

which explain the behavior of the system at different

levels. At an application level, these blueprints

generally include use case diagrams, collaboration

diagrams or class diagrams. To show how the

application’s data persists, these schemas are often

Figure 4.11-1: Agile and iterative development process


translated into database schemas which take the

properties of the data model into account.

JCR development

As mentioned, the structure and the content are

indivisible in JCR. However it is possible to define a

semantic which shows how data and structure will

be instantiated. In this semantic, some aspects of

the content can be omitted.

For example, if a semantic item has an unstructured

basis, all the possible and imaginable properties can

be saved under it. Thus, there is no need to mention

them if they are not mandatory or don’t have to

respect specific constraints. It is enough to declare

them in the application’s schemas, as is done in a

class diagram. Thus, the semantic diagram of a java

content repository says less than the other

architectural diagrams. This impacts its readability.

In fact, reading the semantic of a repository gives a

snapshot of the final application and helps to

understand its general behavior.

Figure 5.1-1: JCR translation

Another interesting aspect is that the complexity of the JCR semantic is not increased by many-to-many relationships. No intermediary nodes or artifacts are needed to represent these associations. Thus, these diagrams stay very close to the other architectural schemas. No translation rules are needed to create them.

Relational development

Class diagrams can be used as input to generate

relational schemas. Entity-relationship diagrams (15)

or Crow's Foot diagrams are often used to represent

them. Translation rules are generally needed to

produce these schemas. Far from summarizing the

architecture, they enumerate to a high degree all the

aspects of the final application.

Figure 5.1-2: SQL translation

Everything has to be explicitly mentioned in these

database schemas. Only the records which respect

the data structure can be instantiated in a relational

database. Thus, it is necessary to carefully define

this structure and make it fit in perfectly with the

application architecture.

Many-to-many associations cannot be represented

in relational database schemas without reification.

This means that many-to-many associations will

always require intermediary entities. Consequently,

the internal complexity of a relational schema

increases faster than the complexity of the other

architectural diagrams. Thus, they don’t really help

to understand the application. They are more often

used as implementation blueprints.

5.2 Coding Efficiency

The construction phase of a development process is

highly influenced by efficiency. Coding requires time,

resources and money. These parameters are very

sensitive. Furthermore, if developers have to write
twice as much code, there is a high probability that they will
make more than double the programming errors.

Thus, efficiency also impacts quality.


Measuring coding efficiency implies some soft

parameters. The programmer’s education and

knowledge should be taken into account.

Furthermore, the semantic and the readability of the

code are also significant. These parameters make it

difficult to judge the technology’s efficiency. Without

going too deep into these questions, the following

sections contain useful information which can be

taken into consideration when making a decision in

this area.

JCR development

Programmers are not really familiar with the JCR

API and don’t really know the best practice linked to

content repositories. However, the API is in large

part self-explanatory and people generally have the

habit of thinking in terms of hierarchies. These
factors should give JCR a gentle learning
curve.

Some interactions are possible between the query part of the API and its navigational part. One of the big advantages of JCR lies in the fact that these aspects are merged coherently and are not considered as different abstraction levels.

The code quantity highly relates to the use case. If

complex joining operations are mainly required, JCR

will not be an efficient choice. However, if navigation

is required, the size of the code will be much

smaller. If special requirements such as versioning

or fine grained access control are needed, it

becomes clearly difficult to reach the same level as

the one proposed by JCR.

Relational development

Nearly all programmers are familiar with the

relational model and people have often used it in

recent years. Thus, SQL and APIs such as JDBC are part
of the common language. In real world situations,
this general knowledge often favors the relational
model. However, some problems need to be treated in a
specific manner, and the intuitive approach often
gives bad results.

If complex operations are required by the use case,

the relational model should not be bypassed. The

completeness of the queries and the range of
operations make it very efficient in terms of code
quantity. However, if the use case implies
requirements such as navigation or versioning, the
developer will have to add some artifacts into the
implementation model to manage parameters such
as tree structure or order. The developer will also face the
problem of having to implement a large amount of application
logic. Thus, in terms of efficiency, the model’s choice

should be driven by an honest analysis of the use

case’s properties.

5.3 Application Changeability

Requirements which appear during the development

process are often difficult to include in previously

defined architecture. Modern software development

processes generally address this problem with

iteration cycles (16). Well managed, iterations
should make it possible to include new requirements efficiently.
However, because each logic level is generally
impacted by architectural changes made during the
elaboration phase, late iterations are more
expensive than early iterations.

Clearly decoupling the logic levels can reduce this increasing cost. Thus, data models which can transparently accept changes are really appreciated. To make this point, we will consider how simple changes impact the data logic of a system.

JCR development

As mentioned in the "Data Understandability" section, a repository’s schema summarizes the other architectural diagrams. While this could appear meaningless, it is really not the case. Keeping the repository schema as weak as possible allows new requirements to be included without touching the data logic level. Only the application logic level is impacted. Thus, adding a property at the application level doesn’t necessarily require touching the repository’s organization.
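The contrast between a weakly structured node and a record bound to a fixed structure can be sketched in a few lines. This is only an illustration under stated assumptions: `UnstructuredNode` and `FixedSchemaRow` are hypothetical classes which mimic the two behaviors in memory, they are not taken from the JCR API or from any database product.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical node with an unstructured basis: any property can be saved under it. */
final class UnstructuredNode {
    private final Map<String, Object> properties = new HashMap<>();

    void setProperty(String name, Object value) {
        properties.put(name, value);
    }

    Object getProperty(String name) {
        return properties.get(name);
    }
}

/** Hypothetical record bound to a structure which must be declared beforehand. */
final class FixedSchemaRow {
    private final Set<String> columns;
    private final Map<String, Object> values = new HashMap<>();

    FixedSchemaRow(String... columns) {
        this.columns = new HashSet<>(Arrays.asList(columns));
    }

    void set(String column, Object value) {
        // A value can only be stored if its column was declared in the structure.
        if (!columns.contains(column)) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        values.put(column, value);
    }

    Object get(String column) {
        return values.get(column);
    }
}
```

Adding a new property to `UnstructuredNode` only touches the application code which sets it, whereas `FixedSchemaRow` rejects the value until the declared structure itself has been changed.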

To be sure, deep changes still impact the data logic, and
JCR does not provide a magic solution
either. However, JCR allows most of
the data logic to be decoupled from the application and interface
levels. It is also interesting to note that frameworks
like Sling allow the application logic to be decoupled
from the interface logic in a similar manner. This


approach is clearly an attractive one, especially in

environments driven by changes and agility.

Relational development

Nearly every modification made to the overall architecture will impact the data logic level. This comes from the fact that relational databases do not allow the instantiation of elements which have not been previously defined in the structure. Thus, there is a great probability that a change made to a form of the interface or to the application logic will require changes to the data model.

Some frameworks provide tools to automate these

changes. However, if the system has a production
version, once executed the change will have a large
footprint on all the database’s items. Furthermore,
classical model-view-controller frameworks do not
really decouple the application level from the
interface. For example, a change made to a
controller will often impact views and models.

5.4 Synthesis

At a project level, people are often looking for

solutions which will allow for the quick integration of

changes into their environment. In situations where

some changes have to be performed the semi-

structured nature of JCR will certainly be

appreciated. Furthermore, the inclusion of features

such as navigation, versioning or access control can

save us a lot of time.

Nevertheless, it is important to keep in mind that the

efficiency of both solutions relates in a large way to

the nature of the use case. The agility of JCR should

not influence this aspect. Furthermore, agility is
linked in no small way to the project team. Thus,
saying that JCR is a way to achieve agility is too
big a shortcut.

In all cases, the choice of a database technology

should always be discussed during the inception and

elaboration phases of the first iteration of the

development process. This can be done by weighing

the different parameters. Changing the persistence

technology cannot easily be achieved after the first

iteration. Consequently, this choice will have a

strong impact for the rest of the project.


6 Product comparison

Choosing between database products implies that

we use different criteria. We can mention the

compliance with a standard, the additional features

proposed by the provider, the support offered by a

company or by a community or the scalability of the

solution. All these criteria have an importance. They

should be weighed carefully and a choice made

depending on the situation.

In our context, basic and significant differences

distinguish java content repositories from relational

databases. Thus, a decision to employ one

technology instead of another should be taken at a

lower level. However, in relation to the product,

people often ask in terms of performance, if they

should use a relational database or a java content

repository to manage their hierarchical information.

This section will try to address, and answer this issue

by reminding us of some basic theoretical concepts

which relate to data structures and to the cost of

associations. Then, at a more practical level, a

benchmark of several database products will verify if

these assumptions can be proved.

6.1 Theoretical analysis

In general, database products use basic data

structures to manage their data. This section reminds

us of simple concepts which relate to these

structures and to the cost of associations made

between data items. The goal is to determine if the

product’s performance will be significantly impacted
by the underlying approach.

Hierarchical and network database

In the hierarchical and network models, associations

are made by storing references or pointers between

items. The advantage of this kind of structure is that,

because each node stores direct references with

other nodes, a constant number of read accesses

are needed to go from one node to its target.

Creating an association between two nodes also has

a constant cost because the number of operations

needed to perform this is always the same.

Thus, the cost of crossing and creating associations

is constant and could be noted as O(1) in big O

notation. Some people say that these associations

are pre-computed.

Several strategies allow the representation of directed graphs such as those needed by the hierarchical and the network models. The most classical representations are adjacency lists and adjacency matrices (17). Generally, the choice between the two approaches is made simply by analyzing the density of the graph.

If the number of arcs in the graph is close to the square of the number of nodes, an adjacency matrix will give a better result. However, the JCR model is mainly driven by hierarchical associations. In this context, the number of arcs will not be much larger than the number of nodes. Thus, an adjacency list will be more economical with memory, requiring only the space needed to store the associations. It is also interesting to note that this kind of organization makes it easy to give an order to the children of a node.


Figure 6.1-1: A hierarchy and its adjacency matrix

Implementing this with a programming language can

be accomplished by using several data structures

such as arrays, maps or hash-tables. Some other

solutions could also be presented but the main idea

is that crossing an association has a constant cost
and that crossing a graph has a cost which is
proportional to the number of nodes and arcs
traversed. Thus, managing this kind of data is cost
effective.
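The adjacency list idea can be illustrated with a hash map that stores, for each node, the ordered list of its children. The sketch below is an assumption made for illustration and not the way any particular repository is implemented; `AdjacencyListTree` and its methods are hypothetical names. Looking up the children of a node is an expected constant-time map access, and a traversal visits each node and each arc once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical hierarchy stored as an adjacency list: node id -> ordered child ids. */
final class AdjacencyListTree {
    private final Map<String, List<String>> children = new HashMap<>();

    /** Creating an association appends the child to the parent's list. */
    void addChild(String parent, String child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
        children.computeIfAbsent(child, k -> new ArrayList<>());
    }

    /** Crossing an association is a single map lookup; insertion order is preserved. */
    List<String> childrenOf(String node) {
        return children.getOrDefault(node, Collections.<String>emptyList());
    }

    /** Depth-first traversal from the given root, returning nodes in visit order. */
    List<String> traverse(String root) {
        List<String> visited = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            String node = stack.pop();
            visited.add(node);
            List<String> kids = childrenOf(node);
            // Push in reverse so that the first child is visited first.
            for (int i = kids.size() - 1; i >= 0; i--) {
                stack.push(kids.get(i));
            }
        }
        return visited;
    }
}
```

Because each child list is an ordinary ordered list, the order of the children of a node comes for free with this representation.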

Relational database

In the relational model, associations are made

between relations by computing the matching values

stored in two domains. This allows for the expression

of all imaginable associations between two or more

data sets.

What is the cost implication of computing and

creating associations in a relational database? To

compute an association, a relational database has to

cross the targeted set to find the matching values. In

this case, the cost of the association equals O(n),

with n the number of tuples stored in the source and

in the target. However, most database products
provide indexing facilities such as B-tree indexes.
So, in most cases, finding the matching entries has a
cost of O(log(n)). While B-tree indexes are good,

some articles (18) argue that in the network models,

because associations are pre-computed, it is

possible to reach better performance.

However, in most cases there is no need to use
comparison operators other than "=" or "≠" to
express relationships as they are presented in a
hierarchical or network model. Consequently, hash
indexes can be used on the domains which
constitute the association. If the relational database
provides good hash index implementations, the

cost of retrieving data through associations will be

close to O(1). It also results in a constant cost of O(1)

when new items are added to the targeted sets and

in the index. Thus, there are virtually no significant

differences between the associations of the relational

model and of the hierarchical model.
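The argument can be made concrete with an in-memory stand-in for a table of nodes which carries a parent foreign key. The sketch below is only an illustration under stated assumptions: `NodeTable` and `Row` are hypothetical names, the list plays the role of the table and the hash map plays the role of a hash index on the parent column. Finding the children of a node by scanning the table costs O(n), while the same lookup through the hash index has an expected constant cost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory table of nodes with a parent foreign key. */
final class NodeTable {
    static final class Row {
        final long id;
        final Long parentId; // null for the root
        final String name;

        Row(long id, Long parentId, String name) {
            this.id = id;
            this.parentId = parentId;
            this.name = name;
        }
    }

    private final List<Row> rows = new ArrayList<>();
    // Stand-in for a hash index on the parent column.
    private final Map<Long, List<Row>> parentIndex = new HashMap<>();

    /** Inserting a row also updates the index, at an expected constant cost. */
    void insert(Row row) {
        rows.add(row);
        if (row.parentId != null) {
            parentIndex.computeIfAbsent(row.parentId, k -> new ArrayList<>()).add(row);
        }
    }

    /** Without an index: every row is compared with the searched value, O(n). */
    List<Row> childrenByScan(long parentId) {
        List<Row> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.parentId != null && row.parentId.longValue() == parentId) {
                result.add(row);
            }
        }
        return result;
    }

    /** With the hash index: a single lookup, expected O(1). */
    List<Row> childrenByIndex(long parentId) {
        return parentIndex.getOrDefault(parentId, Collections.<Row>emptyList());
    }
}
```

Both methods return the same rows; only the number of comparisons needed to find them differs.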

6.2 Benchmark

The previous section has summarized a huge problem very succinctly, and too quickly. However, the main point to keep in mind is that intolerable differences should not appear whether hierarchical data is managed with a content repository or a relational database.

The following benchmark has been done to verify this

assumption.

Four products are included in this benchmark. CRX

is a native implementation of the JCR specification.

The persistence of the items is managed with a
proprietary technology which is based on tar
files (19) and implemented in Java. H2
and Derby are two open source relational databases
written in Java. MySQL is one of the most widely

used open source databases.

A simple wrapper has been defined for this

benchmark. This wrapper proposes basic functions

to create trees made of nodes and properties. The

CRX wrapper directly uses the functionalities
provided by the API. The SQL wrapper uses a simple

database schema. One table stores the nodes and

the other table stores the properties. The

associations between items are managed with a

parent foreign key and the default indexes of the


database are used on all fields. Queries are executed
through JDBC with prepared statements to avoid
parsing the SQL statements each time.

The benchmark is composed of four parts which all

measure the time required to perform an operation in

hierarchies of different sizes. Each node of these

base hierarchies has 5 sub-nodes and 5 properties

except leaves which only have 5 properties. The first

hierarchy has one level. The following ones always

include one more level. The tests have been
launched 5 times on a Dell Latitude D820 running
Windows XP (processor: Intel Core Duo 2.00
GHz, virtual memory: 2.00 GB). The average result is

used in the following diagrams.
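The sizes of the base hierarchies follow directly from this description. As a quick check, the sketch below counts nodes level by level and adds their properties. It assumes that the root is counted as a node, that leaves carry properties too, and that each property counts as one item; the class and method names are illustrative only. Under these assumptions the function reproduces the item counts shown on the horizontal axis of the charts (36, 186, 936, 4686, 23436).

```java
/** Hypothetical helper which computes the number of items in a base hierarchy. */
final class HierarchySize {
    /**
     * Counts nodes plus properties in a tree where every node has
     * {@code properties} properties and every non-leaf node has
     * {@code branching} sub-nodes, with {@code depth} levels below the root.
     */
    static long itemCount(int branching, int properties, int depth) {
        long nodes = 0;
        long nodesAtLevel = 1; // the root
        for (int level = 0; level <= depth; level++) {
            nodes += nodesAtLevel;
            nodesAtLevel *= branching;
        }
        // Each node contributes itself plus its properties.
        return nodes * (1 + properties);
    }
}
```

For example, `HierarchySize.itemCount(5, 5, 1)` counts 6 nodes (the root and its 5 sub-nodes) with 5 properties each, which gives 36 items.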

Writing the hierarchy

This test measures the time required to create the base hierarchy. The throughputs correspond to the time needed to write one item of the hierarchy. While the differences seem huge, all the throughputs are constant. The assumption that native implementations of JCR and relational databases should be equivalent in terms of performance holds in this case. MySQL cannot be embedded in the application, which has a high impact on its result. H2 does not appear in the chart because its write performance is too good to be visible at this scale.

Reading the hierarchy

This test consists of reading once all the items of the base hierarchy, from the root to the leaves. The throughputs displayed in the chart correspond to the average time needed to read one item of the hierarchy. For most databases the results seem to be constant. Derby is simply out of range: when recursive queries are performed on this database, the results are not tolerable.

[Chart: milliseconds (0.00 to 18.00) by number of items (36, 186, 936, 4686, 23436); series: crx, h2, mysql, derby]

[Chart: milliseconds (0.00 to 0.20) by number of items (36, 186, 936, 4686, 23436); series: crx, h2, mysql, derby]


Randomly writing the hierarchy

The test consists of randomly writing 100 sub-hierarchies in the base hierarchy. Each sub-hierarchy has a depth of 2 levels. Each node has two properties and, except for the leaves, two sub-nodes. Thus, each sub-hierarchy is composed of 21 items. The throughputs relate to the average time required to create all the items of one sub-hierarchy. The results are quite similar to those of the first test. The good point is that all the databases have constant results.

Randomly reading the hierarchy

The test consists of randomly reading 100 nodes and their descendants on two levels in the base hierarchy. The throughput relates to the average time required to read one node and its descendants. As in the second test, Derby is simply out of range. The same problem is encountered with recursive queries. It appears that CRX is well optimized for these situations. To be really pertinent, this test should be launched on bigger hierarchies. However, the difference between the results is constant and relational databases are not showing extremely bad performance for recursive queries.

6.3 Synthesis

As shown in this chapter, performance should not be

used as the main argument to choose one technology

over another. The aspects mentioned in the previous

chapters are more important. The choice should relate

to the nature of the problem which has to be solved

and not to the nature of the product.

The assumption that relational databases are able to

effectively manage hierarchical data is true. However,

this does not mean that java content repositories

should be implemented as a layer over relational

databases. Some base concepts of the two
specifications do not match, so a relational
schema for JCR which includes all the aspects of the
specification will look unsuitable. More modularity (3)
in the database world could benefit both
approaches. Until this goal is achieved, a native
implementation of JCR is probably the better of the
proposed solutions.

[Chart: milliseconds (0.00 to 350.00) by number of items (36, 186, 936, 4686, 23436); series: crx, h2, mysql, derby]

[Chart: milliseconds (0.00 to 10.00) by number of items (36, 186, 936, 4686, 23436); series: crx, h2, mysql, derby]


7 Scenario Analysis

The following diagram synthesizes the main aspects pointed out during the whole comparison process. Four use cases characterized by different features will be briefly analyzed with regard to their respective requirements and to the presented approaches.

Criterion | JCR | RDBMS

Data Model Level
Structure | Unstructured; Semi structured; Structured | Structured
Integrity | Entity integrity; Domain integrity; Referential integrity; Transitive integrity in hierarchies | Entity integrity; Domain integrity; Referential integrity; Tools to manage data coherency
Operations and Queries | Selection; Equi-join operations; Full text search operation; Transitive queries on hierarchies | Selection; Projection; Rename; Join operations; Domain operation; Create, read, update, delete statements
Navigation | Navigation API; Traversal access; Direct access; Write access | Not supported

Specification Level
Inheritance | Node types inheritance; Node inheritance | Not supported
Access control | Record level | Table and Column level; Record level not supported
Observation | Record level; Un-persisted event listeners; Application interaction supported | Table level; Persisted triggers; Application interaction not supported
Version control | Supported | Not supported

Project Level
Schema understandability | DataGuides or Graphs; Summarize the architecture; Not impacted by many-to-many associations | Entity Relationship; Represent the whole architecture; Impacted by many-to-many associations
Code complexity | Simple for Navigation; Complex for Operations | Complex for Navigation; Simple for Operations
Changeability | More agile; Decoupled from the application | More rigid; Coupled with the application


7.1 Survey

An agency wants to implement an application which

is able to carry out surveys over the web. This tool

should be able to allow for the collection of data from

questionnaires, to configure the type of answers,

and to aggregate the survey’s results in a suitable

form.

Main characteristics of the application:

All the entities can easily be identified at
design time. (Structure)

Some verification has to be made on the

data. (Integrity)

The results aggregation implies complex

operations. (Operations and Queries)

Once in production the application will not

evolve to a great degree. (Project)

The choice of a relational database for this kind of

scenario is probably the best alternative. The

features provided by a content repository will not

really be used. Furthermore, having to program the
aggregation operations by hand would only add complexity to the
application.

7.2 Reservation

An event organizer wants a portal which gives the

opportunity to buy tickets for events. The event

organizer should be able to create the events

characterized by a name and a short description.

The customer should be able to browse and search

the event’s catalogue and to order tickets. On the

other hand, the event organizer wants to monitor his

sales and manage his prices depending on the

success of the event.

Main characteristics of the application:

All the entities can easily be identified at design time. (Structure)

Some verification has to be made on the data. (Integrity)

Monitoring the sales can imply some operations on the dataset. (Operations and Queries)

Browsing and searching the catalogue require traversal and direct access. (Navigation)

As a strategic application, the application is subject to improvements. (Project)

This application has strong needs which relate to

the relational database world. The clear structure

linked to the management of orders and events

could lead us to conclude that a relational database

is the ideal candidate. However, the need for

navigation and the potential extensions linked to the

catalogue could benefit from the features of a

content repository.

A balanced approach could consist of storing the

orders in a relational database and using a content

repository for the events catalogue. This also fits in

particularly well with the fact that the catalogue will

mainly be subject to read access and the ticketing

service to write access. This should not be a

problem because complex interactions between the

JCR and the RDBMS can be managed with the Java

Transaction API. Making hybrid decisions can in

certain contexts allow us to benefit from both

technologies, thus having the best of both worlds.

7.3 Content management

A publisher wants an application to be able to

manage all the content generated by its

collaborators. The content will be composed of

videos, photos, text or anything else. Several

taxonomies should be available to organize the

content. The main purpose of the publisher is to

offer a coherent set of features which allow for the

easy retrieval of resources of each type and to

enable their reuse in different contexts or in

other publications.

Main characteristics of the application:

The editor wants to take into consideration that new entities of content could appear. (Structure)

The main verification regarding data concerns viruses. (Integrity)

Searching requires full-text indexing. (Operations and Queries)

Taxonomies imply simple operations. (Operations and Queries)

Exploration is needed everywhere. (Navigation)

Future improvements could imply versioning, observation and access control. (Specification features)

The system will continuously evolve with the enterprise. (Project)

The flexibility and the features provided by JCR are

typically made for these types of scenarios. Content

as understood here is difficult to store in a relational

database. Furthermore, all the complex

requirements such as versioning or access control

can be included during the application life cycle

without too much of a problem.

7.4 Workflow

An editor wants to manage the interactions of his

collaborators. The situation could be the following:

The editor in chief and the board decide which

subjects have to be treated in the next edition of a

publication. These subjects are communicated to the

workforce (journalists and photographers). Once

edited, the articles are sent for proofreading. Once

corrected, the editor in chief is notified. He decides if

the article can be published or not. If the article is to

appear in the publication, it is sent to a typography

service which produces a layout which includes

pictures. Once the publication integrates all the

articles and all the pictures, the editor in chief will

read it once again and take the decision to publish it

or not.

Main characteristics of the application:

The entities are composite and difficult to design. (Structure)

The structure mainly involves graphs. (Structure)

Editing and exploring the process implies traversal access. (Navigation)

Notifications imply observing local events. (Observation)

Notifications imply interactions between the data model and the application. (Observation)

This kind of scenario involves semi-structured

models in conjunction with good observation

capabilities. While the other features proposed by

JCR such as versioning or access control do not

directly find an application, the foundations of the

model will really be appreciated in this case. The

workflow structure can be directly designed with

nodes and items and once instantiated the workflow

will clearly benefit from the observation mechanisms

proposed by JCR.
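The observation requirement can be illustrated with a small sketch. It does not use the JCR API: the Node class, its add_listener method and the workflow paths are hypothetical names chosen for illustration. It only shows the idea of an un-persisted, application-level listener which is notified of changes below an observed node.

```python
class Node:
    """A hypothetical tree node holding properties and child nodes."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}
        self.properties = {}
        self.listeners = []  # callbacks registered on this node

    @property
    def path(self):
        if self.parent is None:
            return "/"
        prefix = self.parent.path.rstrip("/")
        return f"{prefix}/{self.name}"

    def add_node(self, name):
        child = Node(name, parent=self)
        self.children[name] = child
        return child

    def add_listener(self, callback):
        """Register a callback for changes on this node or below it."""
        self.listeners.append(callback)

    def set_property(self, key, value):
        self.properties[key] = value
        # Notify listeners on this node and on every ancestor.
        node = self
        while node is not None:
            for callback in node.listeners:
                callback(self.path, key, value)
            node = node.parent


root = Node("")
edition = root.add_node("edition-12")
article = edition.add_node("article-1")

notifications = []
edition.add_listener(
    lambda path, key, value: notifications.append((path, key, value))
)

article.set_property("status", "proofread")  # below the observed node
root.set_property("title", "Publisher")      # outside the observed subtree
```

Only the change to the article reaches the listener registered on the edition, which mirrors the need to notify the editor in chief about local events in the workflow.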

8 Conclusion

The choice of a data model or of a database is often

arbitrary. Sometimes, specific technologies are

imposed by an enterprise policy or simply by

irrational preferences. When the time comes to

choose a technology, good arguments are seldom

put forward. Furthermore, the myth of a general

multi-purpose database is still ingrained in some

minds and people are always looking for a magic

solution which can be used in all imaginable

circumstances.

Today, the cohabitation of several infrastructure

components can be achieved with minimal effort. A

platform such as J2EE provides tools to manage

distributed resources. In this context, the choice of a

data model or of a database should not be reduced

to an arbitrary decision.

As shown in the “Scenario Analysis” chapter, a

pragmatic analysis gives quick results. The

technology which fits in best with the requirements

can be identified and used to the greatest effect. In

some cases, hybrid strategies can also be adopted.

A coherent choice can lead to significant advantages

and this question should always be discussed during

the early phases of each project.

Relational databases have been successfully used

for several years. However, the growing power of the

user and the rigidity of the relational approach make

it difficult to implement features which are actually

required by some applications. It’s possible to push

the boundaries of the model but the constraints of

time and money make it difficult to do so correctly.

Some frameworks are partially effective in solving

these problems. Depending on a middleware layer

for features such as access control, navigation, or

versioning only pushes the hot potato to a higher level.

This does not really solve the problem but adds

complexity to the overall environment.

Java content repositories cannot replace relational

databases in every situation. Actually, the features

proposed by the API fit very well with all the

requirements encountered in content management

and collaborative applications.

Nevertheless, JCR enriches the debate around

databases and data models in relation to two

important aspects. Firstly, JCR includes some

features at a data model and specification level.

Secondly, the specification is aware of its

environment and takes into account that java content

repositories can be used in conjunction with other

infrastructure components. This is not the case for a

specification such as SQL.

This tendency seems relatively new but will probably

be consolidated during the next few years. As a

precursor, Day can play an important role

in this debate and will gain in recognition. Some

challenges will arise with the growing popularity of

the JCR specification. Selecting good opportunities

should allow for the database field to make its mark.

This in turn will create a footprint that will overflow

into the world of infrastructure components.

9 Appendix – JCR and design

As mentioned in the “data model comparison”

chapter, a Java Content Repository schema is

dynamic and evolves with the content. The structure

appears when nodes and properties are instantiated.

However, during the development process the need

to establish a semantic for the repository appears.

Several publications on semi-structured

approaches propose ways to represent

these schemas (20) (21). These representations are

called DataGuide (DG) or Approximate DataGuide

(ADG). The less elaborate version can visually

capture the organization of semi-structured

databases. The JCR specification (4) uses graphs to

represent examples of the structure which can be

found in the content repository.

DataGuides and other graph notations fit

particularly well with Java Content Repositories but

are not expressive enough to be used as

implementation blueprints. The goal of this appendix

is to summarize the possibilities offered by JCR to

organize content and to enrich the notation

proposed in the specification so that it can

communicate the whole semantic of a repository.

9.1 Model

The most common relationship provided by the

model is the composition. Semantic items can be

instantiated as nodes and properties. A node can be

composed of sub-nodes and properties. A property

can only be composed of values. Except for the root

node, all other nodes and properties are

components.

As seen, some properties allow for the creation of horizontal

relationships between the branches of a hierarchy. A

common relationship is achieved by storing one or

more path values in a node property. This method

has an advantage because the hierarchical property

of the target can be used in queries. Another

relationship consists of storing one or more UUID

values in a node property. This maintains the validity

of the link even if the target is moved. Any one of

these approaches could be used and be appropriate

depending on the context.
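The trade-off between the two kinds of horizontal relationship can be sketched as follows. This is not the JCR API: the Repository class and its methods are hypothetical and only model a store which indexes nodes both by path and by UUID. For brevity, the move operation handles a single node without descendants.

```python
import uuid

class Repository:
    """A hypothetical in-memory store indexing nodes by path and by UUID."""

    def __init__(self):
        self.by_path = {}  # path -> uuid
        self.by_uuid = {}  # uuid -> path

    def add(self, path):
        identifier = str(uuid.uuid4())
        self.by_path[path] = identifier
        self.by_uuid[identifier] = path
        return identifier

    def move(self, old_path, new_path):
        """Move a node: its path changes, its UUID does not."""
        identifier = self.by_path.pop(old_path)
        self.by_path[new_path] = identifier
        self.by_uuid[identifier] = new_path

    def resolve_path(self, path):
        return path if path in self.by_path else None

    def resolve_uuid(self, identifier):
        return self.by_uuid.get(identifier)


repo = Repository()
tag_id = repo.add("/tags/java")

path_reference = "/tags/java"  # horizontal link stored as a path value
uuid_reference = tag_id        # horizontal link stored as a UUID value

# A path value exposes the position of the target in the hierarchy.
in_tags_branch = path_reference.startswith("/tags/")

repo.move("/tags/java", "/tags/languages/java")

broken = repo.resolve_path(path_reference)  # None: the stored path is stale
intact = repo.resolve_uuid(uuid_reference)  # still resolves after the move
```

The path value can be queried hierarchically but becomes stale after the move, while the UUID value still resolves to the new location.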

9.2 Convention

Semantic items which will be instantiated as nodes or

properties are respectively represented by circles

and boxes. The circle’s label refers to the node-type,

the box label to the property-type. Without a label,

the circle or the box means that any node or property can be

found. An empty circle means that everything which

is not mentioned is allowed under the semantic item

(black list). A barred circle means that everything

which is not mentioned is not allowed (white list). An

empty box means that the property is simple. A box

which contains an “M” means that the property can

store multiple values.

Composition associations are represented by

filled arrows which link two semantic items. The

arrow’s label refers to the relative path which links

the two semantic items. Only descendant relative

paths are allowed. Stars (*) and variables

(<variable>) can be used to express patterns in the

path. Without a label, the arrow means that a

semantic item, such as the one targeted, can be found

anywhere under the source. The arrow can end

with a cardinality (1..N). Without cardinality the

meaning is N.

Horizontal associations are represented by dotted

arrows. They always start from a box and finish on a

circle. No labels are put on these arrows. They are

only used to give implementation information. The

arrow can end with a cardinality (1..N). Without

cardinality the meaning is N.

Inheritance associations between semantic items

can be represented by empty arrows. They should

always go from the bottom to the top. No labels are

put on these arrows. The elements which are

represented with a bold style are mandatory. If

specific constraints have to be declared they can be

shown as comments in the diagram.

9.3 Methodology

Designing a JCR semantic can be done with different approaches. If a development process is used, the

semantic will be obtained by translating the application’s diagrams. The approach proposed here consists of six

steps which can be executed iteratively and which result in a semantic blueprint which can be implemented in

a repository.

Step 1: Identifying the semantic items

Input: Existing semantic; Requirement

Output: Semantic items

Activity: Identifying the concepts which relate to the requirement and which have to be localized in the repository.

Step 2: Identifying the inheritance relationships

Input: Existing semantic; Requirement; Semantic items

Output: Inheritance semantic

Activity: Identifying inheritance relationships between the semantic items.

Step 3: Identifying the hierarchical relationships

Input: Existing semantic; Requirement; Semantic items

Output: Hierarchical semantic

Activity: Identifying hierarchical relationships between the semantic items. Thinking in terms of composition.

Step 4: Identifying the horizontal relationships

Input: Existing semantic; Requirement; Semantic items

Output: Horizontal semantic

Activity: Identifying horizontal relationships between semantic items. Identifying the relationship types. Thinking in terms of association or aggregation.

Step 5: Defining cool structure artifacts

Input: Existing semantic; Requirement; Hierarchical semantic; Horizontal semantic

Output: Organizational semantic

Activity: Identifying the patterns which link hierarchical semantic items.

Step 6: Carefully defining the integrity rules

Input: Existing semantic; Requirement; Semantic items; Inheritance semantic; Hierarchical semantic; Horizontal semantic; Organizational semantic

Output: New semantic

Activity: Only if necessary, declaring in the semantic the level of coherence which has to be preserved at a repository level.

9.4 Application

Based on a very simple use case, this section shows

how the methodology and the notation previously

defined can be applied. The purpose is to deliver a

blueprint which shows how data is organized and all

data aspects required to build the application.

The specifications of the case are as follows: A blog

application deals with posts. A post always stores its

creation date and should contain some information

such as text, images, etc. A post can belong to zero

or one category and can have zero to an infinite

number of tags. A category can have subcategories.

From any category it should be possible to find all the

posts which relate to it and to its subcategories.

When a category is deleted, the related posts are not

deleted. Anonymous readers can respond to posts

with comments. For navigation, it may be useful to

organize posts by year, month and day.

Output Comments

Step 1 Identifying the semantic items

Properties do not have to be localized in the repository.

Step 2 Identifying the inheritance relationships

The requirement does not contain inheritance but we could imagine this kind of relation.

Step 3 Identifying the hierarchical relationships

Posts and categories are not linked with a composition relationship.

Step 4 Identifying the horizontal relationships

To satisfy the requirement, posts are linked to categories with path values and to tags with UUID values.

Step 5 Defining cool structure artifacts

The year, month, day pattern is part of the hierarchical association.

Step 6 Carefully defining the integrity rules

In our case we only have to ensure that a post always has a creation date.
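The outcome of the six steps above can be sketched with plain Python data structures instead of a real repository. The paths (/blog, /categories), the function names and the dictionary layout are assumptions made for illustration only. The sketch shows the year, month, day pattern of step 5, the path-valued link from posts to categories of step 4 and the creation date rule of step 6.

```python
from datetime import date

def post_path(created, name):
    """Place a post under a year/month/day hierarchy (hypothetical layout)."""
    return f"/blog/{created.year:04d}/{created.month:02d}/{created.day:02d}/{name}"

def posts_in_category(posts, category_path):
    """Return paths of posts linked to a category or to any of its subcategories."""
    prefix = category_path.rstrip("/") + "/"
    return sorted(
        path
        for path, props in posts.items()
        if props.get("category") is not None
        and (props["category"] == category_path
             or props["category"].startswith(prefix))
    )

posts = {}

def add_post(created, name, category=None, tags=()):
    if created is None:
        # Integrity rule from step 6: a post always has a creation date.
        raise ValueError("a post always has a creation date")
    path = post_path(created, name)
    posts[path] = {"created": created, "category": category, "tags": list(tags)}
    return path

first = add_post(date(2008, 12, 31), "jcr-or-rdbms",
                 category="/categories/tech/databases")
second = add_post(date(2009, 1, 5), "holidays", category="/categories/life")
third = add_post(date(2009, 1, 6), "untitled")  # zero categories is allowed

tech_posts = posts_in_category(posts, "/categories/tech")
```

Because the category link is a path value, a prefix test is enough to find the posts of a category and of all its subcategories. Removing a category entry would leave the posts themselves untouched, since they are not composed under it.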

10 Appendix – Going further

Only a few subjects have been mentioned in this

report. This appendix presents three fields which

relate to JCR and to databases in general. These

fields could benefit from being studied in more depth.

Furthermore, some existing products could be

improved if these questions were addressed.

10.1 Queries in semi-structured models

In the JCR Model, the notions of sets, relations and

domains, which provide the means of expressing first

order logic statements over the model, are present

but currently not formally defined. It seems that at

present, node-types are seen as relations, properties

as domains, nodes as tuples and properties’ values as

attributes.

The fact that these notions are well defined in

relational databases brings advantages. For

example, on this basis, some database engines are

able to analyze queries and to optimize them with

regard to the structure. In semi-structured databases,

query optimization is a known issue and research is

still being conducted in this area (20).

It is currently not clear if the mapping proposed by

JCR could ensure more efficiency when queries are

performed. More work on this question and further

improvements of JCR’s query model could be a

very interesting field of investigation.

10.2 Queries on transitive relationships

The model proposed by JCR stores the hierarchical

paths of each node. This allows

queries on transitive relationships in hierarchies to be performed by

using the path property. Assuming a tree structure

limits the total number of paths to the number of

leaves.

Doing the same for horizontal relationships is a bit

more problematic. To summarize, in a network

structure, the cost of pre-computing all the paths is not

proportional to the number of leaves but to the square

of the number of nodes (11). The storage capacity

required to store the transitive paths between the

nodes also grows in a similar manner.
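The quadratic growth can be illustrated on a worst-case example. The sketch below is not the method analysed in (11). It simply counts, by breadth-first search, the reachable pairs which a pre-computed transitive closure would have to store. The function names and the directed ring used as the network are assumptions made for illustration.

```python
from collections import deque

def reachable_pairs(edges, nodes):
    """Count ordered pairs (a, b), a != b, where b is reachable from a."""
    adjacency = {n: [] for n in nodes}
    for source, target in edges:
        adjacency[source].append(target)
    total = 0
    for start in nodes:
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        total += len(seen) - 1  # exclude the start node itself
    return total

def ring(n):
    """A directed ring of n nodes: every node reaches every other node."""
    nodes = list(range(n))
    edges = [(i, (i + 1) % n) for i in nodes]
    return edges, nodes

pairs_10 = reachable_pairs(*ring(10))  # 10 * 9 pairs
pairs_20 = reachable_pairs(*ring(20))  # 20 * 19 pairs
```

Doubling the number of nodes from 10 to 20 raises the number of stored pairs from 90 to 380, roughly a factor of four, whereas a flat tree with one root and four leaves only yields four pairs.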

Some use cases such as those which involve social

networks need to store this kind of relationship.

Defining a standardized way to manage this could be

very useful in some situations. However, it requires

research into the best

algorithms and solutions for this problem.

Furthermore, query languages based on first order

logic are limited when having to define queries on

transitive closures and transitive relationships in

general. Improvements are still

needed in this area.

10.3 Modular and configurable databases

As shown in the “product comparison” chapter, the

relational model is able to manage hierarchical

relationships efficiently. Therefore, is it really

necessary or intelligent to implement, from the

ground up, a data model which can be constructed

from another, with approximately the same results?

Some reasons could lead to this conclusion.

However, the base differences between JCR and

SQL cannot be omitted. For example, does it make

sense to create a procedural API over a declarative

query language which will be retranslated into

declarative calls in the database? Although the cost

of parsing a query is small, it is still

a good reason not

to proceed in this manner.

In reality, databases are presently used for many

different purposes in many different contexts. Some

applications embed databases to manage

small data sets in single client applications while

others are dealing with thousands of connections

and scalability problems. In this context, a

multipurpose monolithic database is unimaginable,

even mythical. Margo Seltzer promotes a more

modular and configurable approach to building

databases (3). These recommendations lead

developers into using database components at

different levels depending on their needs.

JCR and SQL are two high level backend solutions

which have possibilities but also limits. Their

significant differences do not mean they do not have

common denominators. More modularity in their

architecture could give a better understanding of

their behavior. This could also allow them to share

components and to be adapted more easily to

specific requirements and contexts.

11 Bibliography

1. Tsichritzis, D. C. and Lochovsky, F. H. Hierarchical Data-Base Management: A Survey. New York, New York : ACM, 1976.

2. Codd, E. F. A Relational Model of Data for Large Shared Data Banks. San Jose, California : ACM, 1970.

3. Seltzer, Margo. Beyond Relational Databases. ACM Queue. New York, New York : s.n., 2005.

4. Nuescheler, David and Piegaze, Peeter. Content Repository API for Java™ Technology Specification. s.l. : Java Community Process, 11 May 2005. version 1.0.

5. Nuescheler, David and Piegaze, Peeter. Content Repository API for Java™ Technology Specification. s.l. : Java Community Process, 2 July 2007. version 2.0 Public Review.

6. Mazzocchi, Stefano. Data First vs. Structure First. Stefano’s Linotype. [Online] July 28, 2005. http://www.betaversion.org/~stefano/linotype/news/93/.

7. Chaudhuri, Surajit. An Overview of Query Optimization in Relational Systems. Redmond, Washington : ACM, 1998.

8. Buneman, Peter. Semistructured Data. Tucson, Arizona : ACM, 1997.

9. Aho, Alfred V. and Ullman, Jeffrey D. Universality of data retrieval languages. San Antonio, Texas : ACM, 1979.

10. Database Language SQL. Information Technology. [Online] July 30, 1992. http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt.

11. Li, Zhe and Ross, Kenneth A. On the cost of Transitive Closures in Relational Databases. New York, New York : Columbia University Press, 1993.

12. Tropashko, Vadim. Trees in SQL: Nested Sets and Materialized Path. DBAzine.com. [Online] April 13, 2005. http://www.dbazine.com/oracle/or-articles/tropashko4.

13. Bachman, Charles W. The Programmer as Navigator. Waltham, Massachusetts : ACM, 1973.

14. Distributed Transaction Processing: The XA Specification. s.l. : The Open Group for distributed transaction processing, 1991.

15. Chen, Peter Pin-Shan. The Entity-Relationship Model-Toward a Unified View of Data. Cambridge, Massachusetts : ACM, 1976.

16. Introduction to OpenUP. OpenUp. [Online] October 27, 2008. http://epf.eclipse.org/wikis/openup/.

17. Cormen, Thomas H., et al. Introduction to Algorithms, Second Edition. Cambridge, Massachusetts : The MIT Press, 2001.

18. Bates, Duncan. Embedded databases: Why not to use the relational data model. Embedded Computing Design. [Online] January 01, 2008. http://www.embedded-computing.com/articles/id/?2569.

19. Müller, Thomas. CRX Tar PM. dev.day.com. [Online] Day Software AG, November 11, 2008. http://dev.day.com/microsling/content/blogs/main/tarpm.html.

20. Goldman, Roy and Widom, Jennifer. DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases. Palo Alto, California : Stanford University Press, 1997.

21. Goldman, Roy and Widom, Jennifer. Approximate DataGuides. Palo Alto, California : Stanford University Press, 1999.

22. Nuescheler, David. David's Model: A guide for blissful content modeling. Jackrabbit Wiki. [Online] August 22, 2007. http://wiki.apache.org/jackrabbit/DavidsModel.

23. Mishra, Priti and Eich, Margaret. Join Processing in Relational Databases. Dallas, Texas : ACM, 1992.