
Assisting Migration and Evolution of Relational Legacy Databases

by

G.N. Wikramanayake

Department of Computer Science,
University of Wales Cardiff,
Cardiff

September 1996

Abstract

The research work reported here is concerned with enhancing and preparing databases with limited DBMS capability for migration, so that they can keep up with current database technology. In particular, we have addressed the problem of re-engineering heterogeneous relational legacy databases to assist them in a migration process. Special attention has been paid to the case where the legacy database service lacks the specification, representation and enforcement of integrity constraints. We have shown how knowledge in the form of constraints, a capability of modern DBMSs, can be incorporated into these systems to ensure that when migrated they can benefit from current database technology. To this end, we have developed a prototype conceptual constraint visualisation and enhancement system (CCVES) to automate as efficiently as possible the process of re-engineering for a heterogeneous distributed database environment, thereby assisting the global system user in preparing their heterogeneous database systems for a graceful migration.

Our prototype system has been developed using a knowledge-based approach to support the representation and manipulation of the structural and semantic information about schemas that the re-engineering and migration process requires. It has a graphical user interface, including graphical visualisation of schemas with constraints using the user's preferred modelling technique, for the convenience of the user. The system has been implemented using meta-programming technology because of the proven power and flexibility that this technology offers to this type of research application.

The important contributions resulting from our research include extending the benefits of meta-programming technology to the very important application area of evolution and migration of heterogeneous legacy databases. In addition, we have provided an extension to various relational database systems to enable them to overcome their limitations in the representation of meta-data. These extensions contribute towards the automation of the reverse-engineering process for legacy databases, while allowing the user to analyse them using extended database modelling concepts.


CHAPTER 1

Introduction

This chapter introduces the thesis. Section 1.1 is devoted to the background and motivations of the research undertaken. Section 1.2 presents the broad goals of the research. The original achievements which have resulted from the research are summarised in Section 1.3. Finally, the overall organisation of the thesis is described in Section 1.4.

1.1 Background and Motivations of the Research

Over the years rapid technological changes have taken place in all fields of computing. Most of these changes have been due to advances in data communications, computer hardware and software [CAM89], which together have provided a reliable and powerful networking environment (i.e. standard local and wide area networks) that allows the management of data stored in computing facilities at many nodes of a network [BLI92]. These changes have moved hardware technology on from centralised mainframes to networked file-server and client-server architectures [KHO92], which support various ways to use and share data. Modern computers are much more powerful than the previous generations and perform business tasks at a much faster rate by using their increased processing power [CAM88, CAM89]. Simultaneous developments in the software industry have produced techniques (e.g. for system design and development) and products capable of utilising the new hardware resources (e.g. multi-user environments with GUIs). These new developments are being used for a wide variety of applications, including modern distributed information processing applications such as office automation, where users can create and use databases with forms and reports with minimal effort compared with the development effort required using 3GLs [HIR85, WOJ94]. Such applications are being developed with the aid of database technology [ELM94, DAT95], as this field too has advanced to allow users to represent and manipulate more advanced forms of data and their functionalities. Due to the program-data independence feature of DBMSs, the maintenance of database application programs has become easier, as functionalities that were traditionally performed by procedural application routines are now supported declaratively using database concepts such as constraints and rules.

In the field of databases, the recent advances resulting from this technological transformation cover many areas, such as the use of distributed database technology [OZS91, BEL92], object-oriented technology [ATK89, ZDO90], constraints [DAT83, GRE93], knowledge-based systems [MYL89, GUI94], 4GLs and CASE tools [COMP90, SCH95, SHA95]. Meanwhile, the older technology was dealing with files and primitive database systems which now appear inflexible, as the technology itself prevents them from being adapted to meet the current changing business needs catalysed by newer technologies. The older systems, which have been developed using 3GLs and have been in operation for many years, often suffer from failures, inappropriate functionality, lack of documentation and poor performance, and are referred to as legacy information systems [BRO93, COMS94, IEE94, BRO95, IEEE95]. The current technology is much more flexible, as it supports methods to evolve (e.g. 4GLs, CASE tools, GUI toolkits and reusable software libraries [HAR90, MEY94]) and can share resources through software that allows interoperability (e.g. ODBC [RIC94, GEI95]).


This evolution reflects the changing business needs. However, modern systems need to be properly designed and implemented to benefit from this technology, and even this may not prevent such systems themselves from being considered legacy information systems in the near future, owing to the advent of the next generation of technology with its own special features. The only salvation would appear to be to build evolution paths into current systems.

The increasing power of computers and their software has meant that they have already taken over many day-to-day functions, and they are taking over more of these tasks as time passes. Thus computers are managing a larger volume of information in a more efficient manner. Over the years most enterprises have adopted the computerisation option to enable them to perform their business tasks efficiently and to be able to compete with their counterparts. As the performance ability of computers has increased, enterprises still using early computer technology face serious problems due to the difficulties that are inherent in their legacy systems. This means that new enterprises using systems purely based on the latest technology have an advantage over those which need to continue to use legacy information systems (ISs), as modern ISs have been developed using current technology which provides not only better performance but also the benefits of improved functionality. Hence, managers of legacy IS enterprises want to retire their legacy code and use modern database management systems (DBMSs) in the latest environment to gain the full benefits of this newer technology. However, they want to use this technology on the information and data they already hold, as well as on data yet to be captured. They also want to ensure that any attempt to incorporate the modern technology will not adversely affect the ongoing functionality of their existing systems. This means legacy ISs need to be evolved and migrated to a modern environment in such a way that the migration is transparent to the current users. The theme of this thesis is how we can support this form of system evolution.

1.1.1 The Barriers to Legacy Information System Migration

Legacy ISs are usually those systems that have stood the test of time and have become a core service component for a business's information needs. These systems are a mix of hardware and software, sometimes proprietary, often out of date, and built to earlier styles of design, implementation and operation. Although they were productive and fulfilled their original performance criteria and requirements, these systems lack the ability to change and evolve. The following can be seen as barriers to evolution in legacy ISs [IEE94]:

• The technology used to build and maintain the legacy IS is obsolete,
• The system is unable to reflect changes in the business world and to support new needs,
• The system cannot integrate with other sub-systems,
• The cost, time and risk involved in producing new alternative systems to the legacy IS.

The risk factor is that a new system may not provide the full functionality of the current system for a period because of teething problems. Due to these barriers, large organisations [PHI94] prefer to write independent sub-systems to perform new tasks using modern technology which will run alongside the existing systems, rather than attempt to achieve this by adapting existing code or by writing a new system that replaces the old and has new facilities as well. We see the following immediate advantages of this low risk approach.


• The performance, reliability and functionality of the existing system is not affected,
• New applications can take advantage of the latest technology,
• There is no need to retrain those staff who only need the facilities of the old system.

However, with this approach, as business requirements evolve with time, more and more new needs arise, resulting in the development and regular use of many diverse systems within the same organisation. Hence, in the long term the above advantages are overshadowed by the more serious disadvantages of this approach, such as:

• The existing systems continue to exist as legacy ISs running on older and older technology,
• The need to maintain many different systems to perform similar tasks increases the maintenance and support costs of the organisation,
• Data becomes duplicated in different systems, which implies the maintenance of redundant data with its associated increased risk of inconsistency between the data copies if updating occurs,
• The overall maintenance cost for hardware, software and support personnel increases as many platforms have to be supported,
• The performance of the integrated information functions of the organisation decreases due to the need to interface many disparate systems.

To address the above issues, legacy ISs need to be evolved and migrated to new computing environments when their owning organisation upgrades. This migration should occur within a reasonable time after the upgrade, since legacy ISs must be migrated to new target environments in order to allow the organisation to dispose of technology which is becoming obsolete. Managers of some enterprises have chosen an easy way to overcome this problem, by emulating [CAM89, PHI94] the current environment on the new platforms (e.g. AS/400 emulators for IBM S/360, and ICL's DME emulators for 1900 and System 4 users). An alternative strategy is to translate [SHA93, PHI94, SHE94, BRO95] the software to run in the new environment (i.e. code-to-code level translation). The emulator approach perpetuates all the software deficiencies of the legacy IS, although it successfully removes the old-fashioned hardware technology and so does enjoy the increased processing power of the new hardware. The translation approach takes advantage of some of the modern technological benefits of the target environment, as conversions - such as IBM's JCL and ICL's SCL code to Unix shell scripts, Assembler to COBOL, COBOL to COBOL embedded with SQL, and COBOL data structures to relational DBMS tables - are done as part of the translation process. This approach, although a step forward, still carries over most of the legacy code, as legacy systems are not evolved by this process: for example, the basic design is not changed. Hence the barrier to change and/or integration with a common sub-system still remains, and because the translated systems were not designed for the environment they are now running in, they may not be compatible with it. There are other approaches to overcoming this problem which have been used by enterprises [SHA93, BRO95]. These include re-implementing systems under the new environment and/or upgrading existing systems to achieve performance improvements.


As computer technology continues to evolve at an ever quicker pace, the need to migrate arises more rapidly. This means that most small organisations and individuals are left behind and are forced to work in a technologically obsolete environment, mainly due to the high cost of frequently migrating to newer systems and/or upgrading existing software, as this process involves time and manpower which cost money. The gap between the older and newer system users will very soon create a barrier to information sharing unless tools are developed to assist the older technology users' migration to new technology environments. This assistance for the older technology users may take many forms, including tools for: analysing and understanding existing systems; enhancing and modifying existing systems; and migrating legacy ISs to newer platforms. The complete migration process for a legacy IS needs to consider these requirements and many other aspects, as recently identified by Brodie and Stonebraker in [BRO95]. Our work was primarily motivated by these business-oriented legacy database issues and by work in the area of extending relational database technology to enable it to represent more knowledge about its stored data [COD79, STO86a, STO86b, WIK90]. This second consideration is an important aspect of legacy system migration, since if a graceful migration is to be achieved we must be able to enhance a legacy relational database with such knowledge to take full advantage of the new system environment.

1.1.2 Heterogeneous Distributed Environments

As well as the problem of having to use legacy ISs, most large enterprises are faced with the problem of heterogeneity and the need for interoperability between existing ISs [IMS91]. This arises due to the increased use, over time, of different computer systems and software tools for information processing within an organisation. The development of networking capabilities to manage and share information stored over a network has made interoperability a requirement, and the broad acceptance of local area networks in business enterprises has increased the need to perform this task within organisations. Network file servers, client-server technology and the use of distributed databases [OZS91, BEL92, KHO92] are results of these challenging innovations. This technology is currently being used to create and process information held in heterogeneous databases, which involves linking different databases in an interoperable environment. An aspect of this work is legacy database interoperation, since as time passes these databases will have been built using different generations of software.

In recent years, the demand for distributed database capabilities has been fuelled mostly by the decentralisation of business functions in large organisations to address customer needs, and by mergers and acquisitions in the corporate world. As a consequence, there is a strong requirement among enterprises for the ability to cross-correlate data stored in different existing heterogeneous databases. This has led to the development of products referred to as gateways, which enable users to link different databases together; e.g. Microsoft's Open Database Connectivity (ODBC) drivers can link Access, FoxPro, Btrieve, dBASE and Paradox databases together [COL94, RIC94]. There are similar products from other database vendors, such as Oracle's gateways to IBM's DB2, UNISYS's DMS and DEC RMS [HOL93], and others linking INGRES, SYBASE, Informix and other popular SQL DBMSs [PUR93, SME93, RIC94, BRO95]. Database vendors have targeted cross-platform compatibility via SQL access protocols to support interoperability in a heterogeneous environment.



As heterogeneity in distributed systems may occur in various forms, ranging from different hardware platforms, operating systems and networking protocols to differing local database systems, cross-platform compatibility via SQL provides only a simple form of heterogeneous distributed database access. The biggest challenge comes in addressing heterogeneity due to differences in the local databases [OZS91, BEL92]. This challenge is also addressed in the design and development of our system.

Distributed DBMSs have become increasingly popular in organisations as they offer the ability to interconnect existing databases, as well as having many other advantages [OZS91, BEL92]. The interconnection of existing databases leads to two types of distributed DBMS, namely homogeneous and heterogeneous distributed DBMSs. In homogeneous systems all of the constituent nodes run the same DBMS and the databases can be designed in harmony with each other. This simplifies both the processing of queries at different nodes and the passing of data between nodes. In heterogeneous systems the situation is more complex, as each node can be running a different DBMS and the constituent databases can be designed independently. This is the normal situation when we are linking legacy databases, as the DBMSs and databases used are more likely to be heterogeneous since they were usually implemented for different platforms during different technological eras. In such a distributed database environment, heterogeneity may occur in various forms, at different levels [OZS91, BEL92], namely:

• The logical level (i.e. involving different database designs),
• The data management level (i.e. involving different data models),
• The physical level (i.e. involving different hardware, operating systems and network protocols), and
• All three or any pair of these levels.

1.1.3 The Problems and Search for a Solution

The concept of heterogeneity itself is valuable, as it allows designers a freedom of choice between different systems and design approaches, thus enabling them to identify those most suitable for different applications. The exploitation of this freedom over the years in many organisations has resulted in the creation of multiple local and remote information systems which now need to be made interoperable to provide an efficient and effective information service to the enterprise managers. Open Database Connectivity (ODBC) [RIC94, GEI95] and its standards have been proposed to support interoperability among databases managed by different DBMSs. Database vendors such as Oracle, INGRES, INFORMIX and Microsoft have already produced tools, engines and connectivity products to fulfil this task [HOL93, PUR93, SME93, COL94, RIC94, BRO95]. These products allow limited data transfer and query facilities among databases to support interoperability among heterogeneous DBMSs. These features, although they permit easy, transparent heterogeneous database access, still do not provide a solution for legacy ISs, where a primary concern is to evolve and migrate the system to a target environment so that obsolete support systems can be retired. Furthermore, the ODBC facilities are developed for current DBMSs and hence may not be capable of accessing older generation DBMSs and, where they are, are unlikely to be able to enhance them to take advantage of the newer technologies. Hence there is a need to create tools that provide ODBC-equivalent functionality for older generation DBMSs. Our work provides such functionality for all the DBMSs we have chosen for this research. It also provides the ability to enhance and evolve legacy databases.


In order to evolve an information system, one needs to understand the existing system’s structure and code. Most legacy information systems are not properly documented and hence understanding such systems is a complex process. This means that changing any legacy code involves a high risk as it could result in unexpected system behaviour. Therefore one needs to analyse and understand existing system code before performing any changes to the system. Database system design and implementation tools have appeared recently which have the aim of helping new information system development. Reverse and re-engineering tools are also appearing in an attempt to address issues concerned with existing databases [SHA93, SCH95]. Some of these tools allow the examination of databases built using certain types of DBMSs, however, the enhancements they allow are done within the limitation of that system. Due to continuous ongoing technology changes, most current commercial DBMSs do not support the most recent software modelling techniques and features (e.g. Oracle version 7 does not support Object-Oriented features). Hence a system built using current software tools is guaranteed to become a legacy system in the near future (i.e. when new products with newer techniques and features begin to appear in the commercial market place). Reverse engineering tools [SHA93] are capable of recreating the conceptual model of an existing database and hence they are an ideal starting point when trying to gain a comprehensive understanding of the information held in the database and its current state, as they create a visual picture of that state. However, in legacy systems the schemas are basic, since most of the information used to compose a conceptual model is not available in these databases. Information such as constraints that show links between entities is usually embedded in the legacy application code and users find it difficult to reverse engineer these legacy ISs. Our work addresses these issues while assisting in overcoming this barrier within the knowledge representation limitations of existing DBMSs. 1.1.4 Primary and Secondary Motivations The research reported in this thesis therefore was primarily promoted by the need to provide, for a logically heterogeneous distributed database environment, a design tool that allows users not only to understand their existing systems but also to enhance and visualise an existing database’s structure using new techniques that are either not yet present in existing systems or not supported by the existing software environment. It was also motivated by: a) Its direct applicability in the business world, as the new technique can be applied to incrementally

enhance existing systems and prepare them to be easily migrated to new target environments, hence avoiding continued use of legacy information systems in the organisation.

Although previous work and some design tools address the issue of legacy information system analysis, evolution and migration, these are mainly concerned with 3GL languages such as COBOL and C [COMS94, BRO95, IEEE95]. Little work has been reported which addresses the new issues that arise due to the Object-Oriented (O-O) data model or the extended relational data model [CAT94]. There are no reports yet of enhancing legacy systems so that they can migrate to O-O or extended relational environments in a graceful migration from a relational system. There has been


There has been some work in the related areas of identifying extended entity relationship structures in relational schemas, and some attempts at reverse-engineering relational databases [MAR90, CHI94, PRE94].

b) The lack of previous research in visualising pre-existing heterogeneous database schemas and evolving them by enhancing them with modern concepts supported in more recent releases of software.

Most design tools [COMP90, SHA93] which have been developed to assist in Entity-Relationship (E-R) modelling [ELM94] and Object Modelling Technique (OMT) modelling [RUM91] are used in a top-down database design approach (i.e. forward engineering) to assist in developing new systems. However, relatively few tools attempt to support a bottom-up approach (i.e. reverse engineering) to allow visualisation of pre-existing database schemas as E-R or OMT diagrams. Among these tools only a very few allow enhancement of the pre-existing database schemas, i.e. they apply forward engineering to enhance a reverse-engineered schema. Even those which do permit this action to some extent, always operate on a single database management system and work mostly with schemas originally designed using such systems (e.g. CASE tools). The tools that permit only the bottom-up approach are referred to as reverse-engineering tools and those which support both (i.e. bottom-up and top-down) are called re-engineering tools [SHA93]. This thesis is primarily concerned with creating re-engineering tools that assist legacy database migration.

The commercially available re-engineering tools are customised for particular DBMSs and are not easily usable in a heterogeneous environment. This barrier against widespread usability of re-engineering tools means that a substantial adaptation and reprogramming effort (costing time and money) is involved every time a new DBMS appears in a heterogeneous environment. An obvious example that reflects this limitation arises in a heterogeneous distributed database environment where there may be a need to visualise each participant database’s schema. In such an environment if the heterogeneity occurs at the database management level (where each node uses a different DBMS, for example, one node uses INGRES [DAT87] and another uses Oracle [ROL92]), then we have to use two different re-engineering tools to display these schemas. This situation is exacerbated for each additional DBMS that is incorporated into the given heterogeneous context. Also, legacy databases are migrated to different DBMS environments as newer versions and better database products have appeared since the original release of their DBMS. This means that a re-engineering tool that assists legacy database migration must work in an heterogeneous environment so that its use will not be restricted to particular types of ISs.

Existing re-engineering tools provide a single target graphical data model (usually the E-R model or a variant of it), which may differ in presentation style between tools and therefore inhibits the uniformity of visualisation that is highly desirable in an interoperable heterogeneous distributed database environment. This limitation means that users may need to use different tools to provide the required uniformity of display in such an environment. The ability to visualise the conceptual model of an information system using a user-preferred graphical data model is important as it ensures that no inaccurate enhancements are made to the system due to any misinterpretation of graphical notations used.

c) The need to apply rules and constraints to pre-existing databases to identify and clean inconsistent legacy data, as preparation for migration or as an enhancement of the database's quality.


The inability to define and apply rules and constraints in early database systems, due to system limitations, resulted in these systems not using constraints to increase the accuracy and consistency of the data they held. This limitation is now a barrier to information system migration, as a new target DBMS is unable to enforce constraints on a migrated database until all violations are investigated and resolved, either by omitting the violating data or by cleaning it. This investigation may also show that a constraint has to be adjusted because the violating data is needed by the organisation. The enhancement of such a system with rules and constraints provides knowledge that can be used to determine possible data violations. The process of detecting constraint violations may be carried out by applying queries that are generated from these enhanced constraints. Similar methods have been used to implement integrity constraints [STO75], optimise queries [OZS91] and obtain intensional answers [FON92, MOT89]. This is essential, as constraints may have been implemented at the application coding level, which can lead to their inconsistent application.
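To make this idea concrete, the sketch below shows how such a violation-detection query could be generated from an enhanced referential constraint; the tables and columns used (orders, order_item) are purely illustrative assumptions and are not drawn from the test databases used in this thesis.

    -- Enhanced (but unenforced) constraint: every order_item.order_no
    -- should reference an existing orders.order_no.
    -- The generated violation-detection query lists the rows that
    -- break the constraint:
    SELECT oi.*
    FROM   order_item oi
    WHERE  oi.order_no IS NOT NULL
      AND  NOT EXISTS (SELECT 1
                       FROM   orders o
                       WHERE  o.order_no = oi.order_no);

    -- A query generated from a candidate primary key constraint on
    -- orders(order_no) lists any duplicated key values:
    SELECT order_no, COUNT(*)
    FROM   orders
    GROUP  BY order_no
    HAVING COUNT(*) > 1;

Rows returned by such queries are the candidates for cleaning, omission or constraint adjustment discussed above.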

d) An awareness of the potential contribution that knowledge-based systems and meta-programming technologies, in association with extended relational database technology, have to offer in coping with semantic heterogeneity.

The successful production of a conceptual model is highly dependent on the semantic information available, and on the ability to reason about these semantics. A knowledge-based system can be used to assist in this task, as the effective exploitation of semantic information for pre-existing heterogeneous databases involves three sub-processes, namely: knowledge acquisition, knowledge representation and knowledge manipulation. The knowledge acquisition process extracts the existing knowledge from a database's data dictionaries. This knowledge may include subsequent enhancements made by the user, as using the database itself to store such knowledge provides easy access to this information along with the original knowledge. The knowledge representation process represents both the existing and the enhanced knowledge. The knowledge manipulation process is concerned with deriving new knowledge and ensuring consistency of existing knowledge. These stages are addressable using specific processes. For instance, the reverse-engineering process used to produce a conceptual model can perform the knowledge acquisition task. Then the derived and enhanced knowledge can be stored in the same database, by adopting a process that allows us to distinguish this knowledge from the original meta-data. Finally, knowledge manipulation can be done with the assistance of a Prolog-based system [GRA88], while data and knowledge consistency can be verified using the query language of the database.

1.2 Goals of the Research

The broad goals of the research reported in this thesis are highlighted here, with detailed aims and objectives presented in section 2.4. These goals are to investigate interoperability problems, schema enhancement and migration in a heterogeneous distributed database environment, with particular emphasis on extended relational systems. This should provide a basis for the design and implementation of a prototype software system that brings together new techniques from the areas of knowledge-based systems, meta-programming and O-O conceptual data modelling, with the aim of facilitating schema enhancement by means of generalising the efficient representation of constraints using the current standards.


Such a system would be a valuable asset in a logically heterogeneous distributed extended relational database environment, as it would make it possible for global users to incrementally enhance legacy information systems. This offers the potential for users in this type of environment to work in terms of such a global schema, through which they can prepare their legacy systems to migrate easily to target environments and so gain the benefits of modern computer technology.

1.3 Original Achievements of the Research

The importance of this research lies in establishing the feasibility of enhancing, cleaning and migrating heterogeneous legacy databases using meta-programming technology, knowledge-based system technology, database system technology and O-O conceptual data modelling concepts, to create a comprehensive set of techniques and methods that form an efficient and useful generalised database re-engineering tool for heterogeneous sets of databases. The benefits such a tool can bring are also demonstrated and assessed. A prototype Conceptual Constraint Visualisation and Enhancement System (CCVES) [WIK95a] has been developed as a result of the research. To be more specific, our work has made four important contributions to progress in the database topic area of Computer Science:

1) CCVES is the first system to bring the benefits of meta-programming technology to the very important application area of enhancing and evolving heterogeneous distributed legacy databases to assist the legacy database migration process [GRA94, WIK95c].

2) CCVES is also the first system to enhance existing databases with constraints to improve their visual presentation and hence provide a better understanding of existing applications [WIK95b]. This process is applicable to any relational database application, including those which are unable to naturally support the specification and enforcement of constraints. More importantly, this process does not affect the performance of an existing application.

3) As will be seen later, we have chosen the current SQL-3 standards [ISO94] as the basis for knowledge representation in our research. This project provides an extension to the representation of the relational data model to cope with automated reuse of knowledge in the re-engineering process. In order to cope with technological changes that result from the emergence of new systems or new versions of existing DBMSs, we also propose a series of extended relational system tables conforming to the SQL-3 standards to enhance existing relational DBMSs [WIK95b] (an illustrative sketch of such a table is given after this list).

4) The generation of queries using the constraint specifications of the enhanced legacy systems is an easy and convenient method of detecting any constraint-violating data in existing systems. The application of this technique in the context of a heterogeneous environment for legacy information systems is a significant step towards detecting and cleaning inconsistent data in legacy systems prior to their migration. This is essential if a graceful migration is to be effected [WIK95c].
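As a purely illustrative sketch of what one such extended system table might look like - the names below are assumptions loosely modelled on the SQL-92/SQL-3 information schema rather than the actual tables defined later in this thesis - a table recording enhanced referential constraints for a DBMS that keeps no constraint meta-data of its own could be declared as:

    -- Hypothetical extended system table holding enhanced referential
    -- constraints alongside a legacy database's own meta-data.
    CREATE TABLE x_referential_constraints (
        constraint_name  VARCHAR(32) NOT NULL,  -- user-supplied name
        child_table      VARCHAR(32) NOT NULL,  -- referencing table
        child_column     VARCHAR(32) NOT NULL,  -- referencing column
        parent_table     VARCHAR(32) NOT NULL,  -- referenced table
        parent_column    VARCHAR(32) NOT NULL,  -- referenced column
        enforced         CHAR(1) DEFAULT 'N',   -- 'Y' once violations are resolved
        PRIMARY KEY (constraint_name)
    );

Because such enhanced knowledge is held in ordinary tables within the database itself, it survives between re-engineering sessions and remains accessible to other tools, which is the motivation given later for storing this knowledge in the database rather than in a separate knowledge base.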

1.4 Organisation of the Thesis


The thesis is organised into 8 chapters. This first chapter has given an introduction to the research done, covering background and motivations, and outlining original achievements. The rest of the thesis is organised as follows:

Chapter 2 is devoted to presenting an overview of the research together with detailed aims and objectives for the work undertaken. It begins by identifying the scope of the work in terms of research constraints and development technologies. This is followed by an overview of the research undertaken, where a step by step discussion of the approach adopted and its role in a heterogeneous distributed database environment is given. Finally, detailed aims and objectives are drawn together to conclude the chapter.

Chapter 3 identifies the relational data model as the current dominant database model and presents its development along with its terminology, features and query languages. This is followed by a discussion of conceptual data models with special emphasis on the data models and symbols used in our project. Finally, we pay attention to key concepts related to our project, mainly the notion of semantic integrity constraints and extensions to the relational model. Here, we present important integrity constraint extensions to the relational model and their support in the different SQL standards.

Chapter 4 addresses the issue of legacy information system migration. The discussion commences with an introduction to legacy and our target information systems. This is followed by migration strategies and methods for such ISs. Finally, we conclude by referring to current techniques and identifying the trends and existing tools applicable to database migration.

Chapter 5 addresses the re-engineering process for relational databases. Techniques currently used for this purpose are identified first. Our approach, which uses constraints to re-engineer a relational legacy database, is described next. This is followed by a process for detecting possible keys and structures of legacy databases. Our schema enhancement and knowledge representation techniques are then introduced. Finally, we present a process to detect and resolve conflicts that may occur due to schema enhancement.

Chapter 6 introduces some example test databases which were chosen to represent a legacy heterogeneous distributed database environment and its access processes. Initially, we present the design of our test databases, the selection of our test DBMSs and the prototype system environment. This is followed by the application of our re-engineering approach to our test databases. Finally, the organisation of relational meta-data and its access is described using our test DBMSs.

Chapter 7 presents the internal and external architecture and operation of our conceptual constraint visualisation and enhancement system (CCVES) in terms of the design, structure and operation of its interfaces, and its intermediate modelling system. The internal schema mappings, e.g. mapping from INGRES QUEL to SQL and vice versa, and the internal database migration processes are presented in detail here.

Chapter 8 provides an evaluation of CCVES, identifying its limitations and improvements that could be made to the system. A discussion of potential applications is presented. Finally, we conclude the chapter by drawing conclusions about the research project as a whole.


CHAPTER 2

Research Scope, Approach, Aims and Objectives

This chapter describes, in some detail, the aims and objectives of the research that has been undertaken. Firstly, the boundaries of the research are defined in section 2.1, which considers the scope of the project. Secondly, an overview of the research approach we have adopted in dealing with heterogeneous distributed legacy database evolution and migration is given in section 2.2. Next, in section 2.3, the discussion is extended to the wider aspects of applying our approach in a heterogeneous distributed database environment using the existing meta-programming technology developed at Cardiff in other projects. Finally, the research aims and objectives are detailed in section 2.4, illustrating what we intend to achieve and the benefits expected from achieving the stated aims.

2.1 Scope of the Project

We identify the scope of the work in terms of research constraints and the limitations of current development technologies. An overview of the problem is presented, along with the drawbacks and limitations of database software development technology in addressing the problem. This will assist in identifying our interests and focusing the issues to be addressed.

2.1.1 Overview of the Problem

In most database designs, a conceptual design and modelling technique is used in developing the specifications at the user requirements and analysis stage of the design. This stage usually describes the real world in terms of object/entity types that are related to one another in various ways [BAT92, ELM94]. Such a technique is also used in reverse-engineering to portray the current information content of existing databases, as the original designs are usually either lost or inappropriate because the database has evolved from its original design. The resulting pictorial representation of a database can be used for database maintenance, re-design, enhancement, integration or migration, as it gives its users a sound understanding of an existing database's architecture and contents. Only a few current database tools [COMP90, BAT92, SHA93, SCH95] allow the capture and presentation of database definitions from an existing database, and the analysis and display of this information at a higher level of abstraction. Furthermore, these tools are either restricted to accessing a specific database management system's databases, or permit modelling with only a single given display formalism, usually a variant of the EER [COMP90]. Consequently there is a need to cater for multiple database platforms with different user needs, allowing access to the set of databases comprising a heterogeneous database and providing a facility to visualise those databases using a preferred conceptual modelling technique which is familiar to the different user communities of the heterogeneous system.

The fundamental modelling constructs of current reverse and re-engineering tools are entities, relationships and associated attributes. These constructs are useful for database design at a high level of abstraction.


However, the semantic information now available in the form of rules and constraints in modern DBMSs provides their users with a better understanding of the underlying database, as its data conforms to these constraints. This may not necessarily be true for legacy systems, which may have constraints defined that were not enforced. The ability to visualise rules and constraints as part of the conceptual model increases user understanding of a database. Users could also exploit this information to formulate queries that more effectively utilise the information held in a database. With these features in mind, we concentrated on providing a tool that permits specification and visualisation of constraints as part of the graphical display of the conceptual model of a database. With modern technology increasing the number of legacy systems, and with increasing awareness of the need to use legacy data [BRO95, IEEE95], the availability of such a visualisation tool will be more important in future, as it will let users see the full definition of the contents of their databases in a familiar format.

Three types of abstraction mechanism, namely classification, aggregation and generalisation, are used in conceptual design [ELM94]. However, most existing DBMSs do not maintain sufficient meta-data to assist in identifying all these abstraction mechanisms within their data models. This means that reverse and re-engineering tools are semi-automated, in that they extract information, but users have to guide them and decide what information to look for [WAT94]. This requires interaction with the database designer in order to obtain missing information and to resolve possible conflicts. Such additional information is supplied by the tool users when performing the reverse-engineering process. As this additional information is not retained in the database, it must be re-entered every time a reverse-engineering process is undertaken if the full representation is to be achieved. To overcome this problem, knowledge bases are being used to retain this information when it is supplied. However, this approach restricts the use of this knowledge by other tools which may exist in the database's environment. The ability to hold this knowledge in the database itself would enhance an existing database with information that can be widely used. This would be particularly useful in the context of legacy databases as it would enrich their semantics. One of the issues considered in this thesis is how this can be achieved.

Most existing relational database applications record only entities and their properties (i.e. attribute names and data types) as system meta-data. This is because these systems conformed to early database standards (e.g. the SQL/86 standard [ANSI86], supported by INGRES version 5 and Oracle version 5). However, more recent relational systems record additional information, such as constraint and rule definitions, as they conform to the SQL/92 standard [ANSI92] (e.g. Oracle version 7). This additional information includes, for example, primary and foreign key specifications, and can be used to identify the classification and aggregation abstractions used in a conceptual model [CHI94, PRE94, WIK95b]. However, the SQL/92 standard does not capture the full range of modelling abstractions, e.g. inheritance representing generalisation hierarchies.
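For illustration, the hypothetical SQL/92-style fragment below (not one of the test schemas used in this thesis) shows the kind of declarations whose presence allows a reverse-engineering tool to recover abstractions automatically: the primary keys identify entity types (classification) and the foreign key reveals a relationship (aggregation) between them, while a generalisation hierarchy has no corresponding declaration.

    CREATE TABLE department (
        dept_no    INTEGER NOT NULL,
        dept_name  VARCHAR(30),
        PRIMARY KEY (dept_no)                -- entity type: Department
    );

    CREATE TABLE employee (
        emp_no     INTEGER NOT NULL,
        emp_name   VARCHAR(30),
        dept_no    INTEGER,
        PRIMARY KEY (emp_no),                -- entity type: Employee
        FOREIGN KEY (dept_no)
            REFERENCES department (dept_no)  -- relationship: Employee works in Department
    );

    -- An SQL/86-era system records only the table and column definitions,
    -- so the Employee-Department link would have to be supplied as
    -- enhancement knowledge during re-engineering.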
This means that early relational database applications are now legacy systems, as they fail to naturally represent additional information such as constraint and rule definitions. Such legacy database systems are being migrated to modern database systems, not only to gain the benefits of the current technology but also to be compatible with new applications built with the modern technology.


The SQL standards are currently subject to review to permit the representation of extra knowledge (e.g. object-oriented features), and we have anticipated some of these proposals in our work; i.e. SQL-3 [ISO94] (during the lifetime of this project the SQL-3 standards moved from a preliminary draft through several modifications before being finalised in 1995) will be adopted by commercial systems, and thus the current modern DBMSs will become legacy databases in the near future, or may already be considered to be legacy databases in that their data model type will have to be mapped onto the newer version. Having experienced the development process of recent DBMSs, it is inevitable that most current databases will have to be migrated, either to a newer version of the existing DBMS or to a completely different newer-technology DBMS, for a variety of reasons. Thus the migration of legacy databases is perceived to be a continuing requirement in any organisation as technology advances continue to be made.

Most migrations currently being undertaken are based on code-to-code level translations of the applications and associated databases to enable the older system to be functional in the target environment. Minimal structural changes are made to the original system and database, thus the design structures of these systems are still old-fashioned, although they are running in a modern computing environment. This means that such systems are inflexible and cannot be easily enhanced with new functions or integrated with other applications in their new environment. We have also observed that more recent database systems have often failed to benefit from modern database technology due to inherent design faults that have resulted in the use of unnormalised structures, which cause the omission of the features enforcing integrity constraints even when this is possible. The ability to create and use databases without the benefit of a database design course is one reason for such design faults. Hence there is a need to assist existing systems to evolve, not only to perform new tasks but also to improve their structure, so that these systems can maximise the gains they receive from their current technology environment and from any environment they migrate to in the future.

2.1.2 Narrowing Down the Problem

Technological advances in both hardware and software have improved the performance and maintenance functionality of information systems (ISs), and as a result older ISs suffer from comparatively poor performance and inappropriate functionality when compared with more modern systems. Most of these legacy systems are written in a 3GL such as COBOL, have been around for many years, and run on old-fashioned mainframes. Problems associated with legacy systems are being identified and various solutions are being developed [BRO93, SHE94, BRO95]. These systems basically have three functional components, namely: interface, application and database service, which are sometimes inter-related with each other, depending on how they were used during the design and implementation stages of the IS development. This means that the complexity of a legacy IS depends on what occurred during the design and implementation of the system. Such systems may range from a simple single-user database application using separate interfaces and applications, to a complex multi-purpose unstructured application. Due to the complex nature of the problem area we do not address this issue as a whole, but focus only on problems associated with one sub-component of such legacy information systems, namely the database service. This in itself is a wide field, and we have further restricted ourselves to legacy ISs using a specific DBMS for their database service. We considered data models ranging from original flat-file and relational systems, to modern relational DBMSs and object-oriented DBMSs. From these data models we have chosen the traditional relational model for the following reasons:

• The relational model is currently the most widely used database model.


• During the last two decades the relational model has been the most popular model; therefore it has been used to develop many database applications and most of these are now legacy systems.

• There have been many extensions and variations of the relational model, which has resulted in many heterogeneous relational database systems being used in organisations.

• The relational model can be enhanced to represent additional semantics currently supported only by modern DBMSs (e.g. extended relational systems [ZDO90, CAT94]).

As most business requirements change with time, the need to enhance and migrate legacy information systems exists for almost every organisation. We address the problems faced by these users while seeking a solution that prevents new systems from becoming legacy systems in the near future. The selection of the relational model as the database service with which to demonstrate how one could achieve these needs means that we shall be addressing only relational legacy database systems and not looking at any other type of legacy information system. This decision means we are not considering many of the common legacy IS migration problems identified by Brodie [BRO95] (e.g. migration of legacy database services such as flat-file structures or hierarchical databases into modern extended relational databases; migration of legacy applications with millions of lines of code written in some COBOL-like language into a modern 4GL/GUI environment). However, as shown later, addressing the problems associated with relational legacy databases has enabled us to identify and solve problems associated with more recent DBMSs, and it also assists in identifying precautions which, if implemented by designers of new systems, will minimise the chance of similar problems being faced by these systems as IS developments occur in the future.

2.2 Overview of the Research Approach

Having presented an overview of our problem and narrowed it down, we identify the following as the main functionalities that should be provided to fulfil our research goal:

• Reverse-engineering of a relational legacy database to fully portray its current information content.

• Enhancing a legacy database with new knowledge to identify modelling concepts that should be available to the database concerned or to applications using that database.

• Determining the extent to which the legacy database conforms to its existing and enhanced descriptions.

• Ensuring that the migrated IS will not become a legacy IS in the future.

We need to consider the heterogeneity issue in order to be able to reverse-engineer any given relational legacy database. Three levels of heterogeneity are present for a particular data model, namely: the physical, logical and data management levels. The physical level of heterogeneity usually arises due to different data model implementation techniques, the use of different computer platforms and the use of different DBMSs. The physical / logical data independence of DBMSs hides implementation differences from users, hence we need only address how to access databases that are built using different DBMSs running on different computer platforms.


Differences in DBMS characteristics lead to heterogeneity at the logical level. Here, the different DBMSs conform to particular standards (e.g. SQL/86 or SQL/92), support particular database query languages (e.g. SQL or QUEL) and offer different relational data model features (e.g. handling of integrity constraints and availability of object-oriented features). To tackle heterogeneity at the logical level, we need to be aware of different standards, and to model ISs supporting different features and query languages. Heterogeneity at the data management level arises due to the physical limitations of a DBMS, differences in the logical design and inconsistencies that occurred when populating the database.

Logical differences in different database schemas have to be resolved only if we are going to integrate them. The schema integration process is concerned with merging different related database applications. Such a facility can assist the migration of heterogeneous database systems. However, any attempt to integrate legacy database schemas prior to the migration process complicates the entire process, as it is similar to attempting to provide new functionalities within the system which is being migrated. Such attempts increase the chance of failure of the overall migration process. Hence we consider any integration or enhancements in the form of new functionalities only after successfully migrating the original legacy IS. However, the physical limitations of a DBMS and data inconsistencies in the database need to be addressed beforehand to ensure a successful migration.

Our work addresses the heterogeneity issues associated with database migration by adopting an approach that allows its users to incrementally increase the number of DBMSs it can handle without having to reprogram its main application modules. Here, the user needs to supply specific knowledge about DBMS schema and query language constructs. This is held together with the knowledge of the DBMSs already supported and has no effect on the application’s main processing modules.

2.2.1 Meta-Programming

Meta-programming technology allows the meta-data (schema information) of a database to be held and processed independently of its source specification language. This allows us to work in a database language independent environment and hence overcome many logical heterogeneity issues. Prolog-based meta-programming technology has been used in previous research at Cardiff in the area of logical heterogeneity [FID92, QUT94]. Using this technology the meta-translation of database query languages [HOW87] and database schemas [RAM91] has been performed. This work has shown how the heterogeneity issues of different DBMSs can be addressed without having to reprogram the same functionality for each and every DBMS.

We use meta-programming technology for our legacy database migration approach as we need to be able to start with a legacy source database and end with a modern target database, where the respective database schema and query languages may be different from each other. In this approach the source database schema or query language is mapped on input into an internal canonical form. All the required processing is then done using the information held in this internal form. This information is finally mapped to the target schema or query language to produce the desired output. The advantage of this approach is that processing is not affected by heterogeneity, as it is always performed on data held in the canonical form. This canonical form is an enriched collection of semantic data modelling features.
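To make the role of the canonical form more concrete, the sketch below records the kind of schema facts it must capture, rendered purely for illustration as SQL tables populated with sample rows. The actual internal form used in our work is Prolog-based and is not defined by these table layouts; the names meta_entity, meta_attribute and meta_constraint are ours alone.

    -- Illustrative only: a relational rendering of the kind of schema facts the
    -- canonical internal form must hold, independently of the source DDL dialect.
    CREATE TABLE meta_entity (
        ent_name     VARCHAR(30) PRIMARY KEY,   -- table / entity name
        source_dbms  VARCHAR(20)                -- e.g. 'INGRES', 'ORACLE'
    );

    CREATE TABLE meta_attribute (
        ent_name     VARCHAR(30),
        att_name     VARCHAR(30),
        att_domain   VARCHAR(20),               -- canonical type, e.g. 'character(6)'
        is_key       CHAR(1),                   -- 'Y' if part of the primary key
        PRIMARY KEY (ent_name, att_name)
    );

    CREATE TABLE meta_constraint (
        ent_name     VARCHAR(30),
        cons_name    VARCHAR(30),
        cons_kind    VARCHAR(12),               -- 'PRIMARY KEY', 'FOREIGN KEY', 'CHECK'
        cons_text    VARCHAR(240),              -- logical expression in canonical syntax
        PRIMARY KEY (ent_name, cons_name)
    );

    -- A fragment of an Employee table as it might appear once mapped into this store.
    INSERT INTO meta_entity VALUES ('Employee', 'INGRES');
    INSERT INTO meta_attribute VALUES ('Employee', 'EmpNo',    'character(6)', 'Y');
    INSERT INTO meta_attribute VALUES ('Employee', 'WorksFor', 'character(4)', 'N');
    INSERT INTO meta_constraint VALUES ('Employee', 'emp_dept_fk', 'FOREIGN KEY',
        'WorksFor references Department(DeptNo)');

Whatever the source DDL, the same facts would be recorded in this single form, which is what insulates the later processing from heterogeneity.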


2.2.2 Application

We view our migration approach as consisting of a series of stages, with the final stage being the actual migration and the earlier stages being preparatory. At stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display (cf. paths A-1 and A-2 of figure 2.1). However, in legacy systems much of the information needed to present the database schema in this way is not available as part of the database meta-data, and hence links which are implicitly present in the database cannot be shown in this conceptual model. In modern systems such links can be identified using constraint specifications. Thus, if the database does not have any explicit constraints, or it does but these are incomplete, new knowledge about the database needs to be entered at stage 2 (cf. path B-1 of figure 2.1), which will then be reflected in the enhanced schema appearing in the graphical display (cf. path B-2 of figure 2.1). This enhancement will identify new links that should be present for the database concerned. These new database constraints can next be applied experimentally to the legacy database to determine the extent to which it conforms to them. This process is done at stage 3 (cf. paths C-1 and C-2 of figure 2.1). The user can then decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration. At this point the three preparatory stages in the application of our approach are complete. The actual migration process is then performed. All stages are further described below to enable us to identify the main processing components of our proposed system as well as to explain how we deal with different levels of heterogeneity.

Stage 1: Reverse Engineering

In stage 1, the data definition of the selected database is reverse-engineered to produce a graphical display of the database. To perform this task, the database’s meta-data must be extracted (cf. path A-1 of figure 2.1). This is achieved by connecting directly to the heterogeneous database. The accessed meta-data needs to be represented using our internal form. This is achieved through a schema mapping process as used in the SMTS (Schema Meta-Translation System) of Ramfos [RAM91]. The meta-data in our internal formalism then needs to be processed to derive the graphical constructs present for the database concerned (cf. path A-2 of figure 2.1). These constructs are in the form of entity types and relationships, and their derivation process is the main processing component of stage 1. The identified graphical constructs are mapped to a display description language to produce a graphical display of the database.
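As an illustration of paths A-1 and A-2, the hedged sketch below assumes an Oracle-style data dictionary (the USER_TABLES, USER_CONSTRAINTS and USER_CONS_COLUMNS views); other relational DBMSs expose equivalent catalogs under different names and layouts, which is exactly the heterogeneity the database connectivity and schema meta-translation modules are designed to hide.

    -- Path A-1 (sketch): pull the raw meta-data needed for reverse engineering.
    -- Candidate entity types: every base table owned by the application.
    SELECT table_name
    FROM   user_tables;

    -- Declared keys: primary key ('P') and referential ('R') constraints.
    SELECT c.table_name,
           c.constraint_name,
           c.constraint_type,          -- 'P' = primary key, 'R' = foreign key
           c.r_constraint_name         -- for 'R', the referenced key constraint
    FROM   user_constraints c
    WHERE  c.constraint_type IN ('P', 'R');

    -- Path A-2 (sketch): derive 1:M relationship links for the graphical model.
    -- Each foreign key becomes an edge from the referencing (child) entity to the
    -- entity owning the referenced primary key; one row per linking column.
    SELECT fk.table_name         AS child_entity,
           pk.table_name         AS parent_entity,
           col.column_name       AS linking_attribute
    FROM   user_constraints  fk,
           user_constraints  pk,
           user_cons_columns col
    WHERE  fk.constraint_type  = 'R'
    AND    pk.constraint_name  = fk.r_constraint_name
    AND    col.constraint_name = fk.constraint_name;

Where such declarations are missing from the dictionary, as is typical of a legacy database service, the corresponding links simply cannot be derived at this stage; this is what motivates stage 2.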


Figure 2.1: Information flow in the 3 stages of our approach prior to migration. [The figure shows the heterogeneous databases linked to an internal processing component and to the schema visualisation (EER or OMT) with constraints: Stage 1 (Reverse Engineering) uses paths A-1 and A-2; Stage 2 (Knowledge Augmentation) uses paths B-1, B-2 and B-3 for the enhanced constraints; and Stage 3 (Constraint Enforcement) uses paths C-1 and C-2 for the enforced constraints.]

a) Database connectivity for heterogeneous database access

Unlike the previous Cardiff meta-translation systems [HOW87, RAM91, QUT92], which addressed heterogeneity at the logical and data management levels, our system looks at the physical level as well. While these previous systems processed schemas in textual form and did not access actual databases to extract their DDL specification, our system addresses physical heterogeneity by accessing databases running on different hardware / software platforms (e.g. computer systems, operating systems, DBMSs and network protocols). Our aim is to directly access the meta-data of a given database application by specifying its name, the name and version of the host DBMS, and the address of the host machine4. If this database access process can produce a description of the database in DDL formalism, then this textual file is used as the starting point for the meta-translation process, as in previous Cardiff systems [RAM91, QUT92]. We found that it is not essential to produce such a textual file, as the required intermediate representation can be directly produced by the database access process. This means that we could also by-pass the meta-translation process that performs the analysis of the DDL text to translate it into the intermediate representation5. However, the DDL formalism of the schema can be used for optional textual viewing and could also serve as the starting point for other tools6 developed at Cardiff for meta-programming database applications. The initial functionality of the Stage 1 database connectivity process is to access a heterogeneous database and supply the accessed meta-data as input to our schema meta-translator (SMTS).

4 We assume that access privileges for this host machine and DBMS have been granted.
5 A list of tokens ready for syntactic analysis in the parsing phase is produced and processed based on the BNF syntax specification of the DDL [QUT92].
6 e.g. the Schema Meta-Integration System (SMIS) of Qutaishat [QUT92].


This module needs to deal with heterogeneity at the physical and data management levels. We achieve this by using DML commands of the specific DBMS to extract the required meta-data held in database data dictionaries treated like user-defined tables. Relatively recently, the functionalities of a heterogeneous database access process have been provided by means of drivers such as ODBC [RIC94]. Use of such drivers will allow access to any database supported by them and hence obviate the need to develop specialised tools for each database type, as happened in our case. These driver products were not available when we undertook this stage of our work.

b) Schema meta-translation

The schema meta-translation process [RAM91] accepts as input any database schema irrespective of its DDL and features. The information captured during this process is represented internally to enable it to be mapped from one database schema to another, or to further process and supply information to other modules such as the schema meta-visualisation system (SMVS) [QUT93] and the query meta-translation system (QMTS) [HOW87]. Thus, the use of an internal canonical form for meta representation has successfully accommodated heterogeneity at the data management and logical levels.

c) Schema meta-visualisation

Schema visualisation using graphical notation and diagrams has proved to be an important step in a number of applications, e.g. during the initial stages of the database design process, for database maintenance, for database re-design, for database enhancement, for database integration, or for database migration, as it gives users a sound understanding of an existing database’s structure in an easily assimilated format [BAT92, ELM94]. Database users need to see a visual picture of their database structure instead of textual descriptions of the defining schema, as it is easier for them to comprehend a picture. This has led to the production of graphical representations of schema information, effected by a reverse engineering process. Graphical data models of schemas employ a set of data modelling concepts and a language-independent graphical notation (e.g. the Entity Relationship (E-R) model [CHE76], the Extended/Enhanced Entity Relationship (EER) model [ELM94] or the Object Modelling Technique (OMT) [RUM91]). In a heterogeneous environment different users may prefer different graphical models, and may want an understanding of the database structure and architecture beyond that given by the traditional entities and their properties. Therefore, there is a need to produce graphical models of a database’s schema using different graphical notations, such as either E-R/EER or OMT, and to accompany them with additional information such as a display of the integrity constraints in force in the database [WIK95b]. The display of integrity constraints allows users to look at intra- and inter-object constraints and gain a better understanding of domain restrictions applicable to particular entities. Current reverse engineering tools do not support this type of display. The generated graphical constructs are held internally in a similar form to the meta-data of the database schema. Hence, using a schema meta-visualisation process (SMVS), it is possible to map the internally held graphical constructs into appropriate graphical symbols and coordinates for the graphical display of the schema. This approach has a similarity to the SMTS, the main difference being that the output is graphical rather than textual.


Stage 2: Knowledge Augmentation

In a heterogeneous distributed database environment, evolution is expected, especially in legacy databases. This evolution can affect the schema description and in particular schema constraints that are not reflected in the stage 1 (path A-2) graphical display, as they may be implicit in applications. Thus our system is designed to accept new constraint specifications (cf. path B-1 of figure 2.1) and add them to the graphical display (cf. path B-2 of figure 2.1) so that these hidden constraints become explicit. The new knowledge accepted at this point is used to enhance the schema and is retained in the database using a database augmentation process (cf. path B-3 of figure 2.1). The new information is stored in a form that conforms to the enhanced target DBMS’s methods of storing such information. This assists the subsequent migration stage.

a) Schema enhancement

Our system needs to permit a database schema to be enhanced by specifying new constraints applicable to the database. This process is performed via the graphical display. These constraints, which are in the form of integrity constraints (e.g. primary key, foreign key, check constraints) and structural components (e.g. inheritance hierarchies, entity modifications), are specified using a GUI. When they are entered they will appear in the graphical display.

b) Database augmentation

The input data to enhance a schema provides new knowledge about a database. It is essential to retain this knowledge within the database itself if it is to be readily available for any further processing. Typically, this information is retained in the knowledge base of the tool used to capture the input data, so that it can be reused by the same tool. This approach restricts the use of this knowledge by other tools and hence it must be re-entered every time the re-engineering process is applied to that database. This makes it harder for the user to gain a consistent understanding of an application, as different constraints may be specified during two separate re-engineering processes. To overcome this problem, we augment the database itself using the techniques proposed in SQL-3 [ISO94], wherever possible. When it is not possible to use SQL-3 structures we store the information in our own augmented table format, which is a natural extension of the SQL-3 approach. When a database is augmented using this method, the new knowledge is available in the database itself. Hence, any further re-engineering processes need not make requests for the same additional knowledge. The augmented tables are created and maintained in a similar way to user-defined tables, but have a special identification to distinguish them. Their structure is in line with the international standards and the newer versions of commercial DBMSs, so that the enhanced database can be easily migrated to either a newer version of the host DBMS or to a different DBMS supporting the latest SQL standards. Migration should then mean that the newer system can enforce the constraints. Our approach should also mean that it is easy to map our tables for holding this information into the representation used by the target DBMS even if it is different, as we are mapping from a well defined structure.
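By way of illustration, and reusing hypothetical Employee and Department tables (column names are assumed), the sketch below shows the two forms augmentation can take: a constraint declared directly where the host or target DBMS can hold it, and an augmented table retaining the same knowledge as data where it cannot. The table ccves_constraints and its columns are hypothetical, shown only to convey the idea rather than the exact layout used by CCVES.

    -- (i) Where the DBMS can store the definition: declare it directly,
    --     in the style of the SQL standards followed by newer DBMS releases.
    ALTER TABLE Employee
      ADD CONSTRAINT emp_dept_fk
      FOREIGN KEY (WorksFor) REFERENCES Department (DeptNo);

    ALTER TABLE Employee
      ADD CONSTRAINT emp_no_format
      CHECK (EmpNo LIKE 'E%');

    -- (ii) Where it cannot: retain the same knowledge in a specially named
    --      augmented table, created and maintained like a user-defined table.
    --      (Hypothetical layout, for illustration only.)
    CREATE TABLE ccves_constraints (
        table_name      VARCHAR(30),
        constraint_name VARCHAR(30),
        constraint_type VARCHAR(12),     -- 'PRIMARY KEY', 'FOREIGN KEY', 'CHECK'
        definition      VARCHAR(240)     -- textual / logical form of the rule
    );

    INSERT INTO ccves_constraints
    VALUES ('Employee', 'emp_dept_fk', 'FOREIGN KEY',
            'WorksFor REFERENCES Department(DeptNo)');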


Legacy databases that do not support explicit constraints can be enhanced by using the above knowledge augmentation method. This requirement is less likely to occur for databases managed by more recent DBMSs, as they already hold some constraint specification information in their system tables. The direction taken by Oracle version 6 was a step towards our augmentation approach, as it allowed the database administrator to specify integrity constraints such as primary and foreign keys, but did not yet enforce them [ROL92]. The next release of Oracle, i.e. version 7, implemented this constraint enforcement process.

Stage 3: Constraint Enforcement

The enhanced schema can be held in the database, but the DBMS can only enforce these constraints if it has the capability to do so. This will not normally be the case in legacy systems. In this situation, the new constraints may be enforced via a newer version of the DBMS or by migrating the database to another DBMS supporting constraint enforcement. However, the data being held in the database may not conform to the new constraints, and hence existing data may be rejected by the target DBMS in the migration, thus losing data and / or delaying the migration process. To address this problem and to assist the migration process, we provide an optional constraint enforcement process module which can be applied to a database before it is migrated. The objective of this process is to give users the facility to ensure that the database conforms to all the enhanced constraints before migration occurs. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy data prior to its migration, whether it is best left as it stands, or whether the new constraints are too severe. The constraint definitions in the augmented schema are employed to perform this task. As all constraints held have already been internally represented in the form of logical expressions, these can be used to produce data manipulation statements suitable for the host DBMS. Once these statements are produced, they are executed against the current database to identify the existence of data violating a constraint.

Stage 4: Migration Process

The migration process itself is performed incrementally, by initially creating the target database and then copying the legacy data over to it. The schema meta-translation (SMTS) technique of Ramfos [RAM91] is used to produce the target database schema. The legacy data can be copied using the import / export tools of the source and target DBMSs or the DML statements of the respective DBMSs. During this process, the legacy applications must continue to function until they too are migrated. To achieve this, an interface can be used to capture and process all database queries of the legacy applications during migration. This interface can decide how to process database queries against the current state of the migration and re-direct to the target database those queries that relate to data which has already been migrated. The query meta-translation (QMTS) technique of Howells [HOW87] can be used to convert these queries to the target DML. This approach will facilitate transparent migration for legacy databases.
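The sketch below illustrates, for the same hypothetical Employee / Department example, how a held foreign key definition can drive stages 3 and 4: the first query reports rows that violate the enhanced constraint (and would therefore be rejected on migration), while the remaining statements show the create-then-copy pattern of the incremental migration. In practice the equivalent statements are generated via SMTS and QMTS in the dialects of the DBMSs actually involved.

    -- Stage 3 (sketch): find existing data that violates an enhanced constraint
    -- before migration, so the user can decide whether to clean the data or
    -- relax the constraint.
    SELECT e.EmpNo, e.WorksFor
    FROM   Employee e
    WHERE  e.WorksFor IS NOT NULL
    AND    NOT EXISTS (SELECT 1
                       FROM   Department d
                       WHERE  d.DeptNo = e.WorksFor);

    -- Stage 4 (sketch): create each table in the target DBMS, this time with the
    -- constraints enforced, then copy the (now conforming) legacy data across.
    CREATE TABLE Department (
        DeptNo   CHAR(4)     PRIMARY KEY,
        DeptName VARCHAR(40)
    );

    CREATE TABLE Employee (
        EmpNo    CHAR(6)     PRIMARY KEY,
        Name     VARCHAR(30),
        WorksFor CHAR(4)     REFERENCES Department (DeptNo)
    );

    -- The copy itself; how the source database is named or linked is DBMS-specific.
    INSERT INTO Department SELECT * FROM source_db.Department;
    INSERT INTO Employee   SELECT * FROM source_db.Employee;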


Our work does not involve the development of an interface to capture and process all database queries, as interaction with the query interface of the legacy IS is embedded in the legacy application code. However, we demonstrate how to create and populate a legacy database schema in the desired target environment while showing the role of SMTS and QMTS in such a process.

2.3 The Role of CCVES in the Context of Heterogeneous Distributed Databases

Our approach described in section 2.2 is based on preparing a legacy database schema for graceful migration. This involves visualisation of database schemas with constraints and enhancing them with constraints to capture more knowledge. Hence we call our system the Conceptualised Constraint Visualisation and Enhancement System (CCVES). CCVES has been developed to fit in with the previously developed schema (SMTS) [RAM91] and query (QMTS) [HOW87] meta-translation systems, and the schema meta-visualisation system (SMVS) [QUT93]. This allows us to consider the complementary roles of CCVES, SMTS, QMTS and SMVS during heterogeneous distributed database access in a uniform way [FID92, QUT94]. The combined set of tools achieves semantic coordination and promotes interoperability in a heterogeneous environment at the logical, physical and data management levels.

Figure 2.2 illustrates the architecture of CCVES in the context of heterogeneous distributed databases. It outlines in general terms the process of accessing a remote (legacy) database to perform various database tasks, such as querying, visualisation, enhancement, migration and integration. There are seven sub-processes: the schema mapping process [RAM91], the query mapping process [HOW87], the schema integration process [QUT92], the schema visualisation process [QUT93], the database connectivity process, the database enhancement process and the database migration process. The first two processes together have been called the Integrated Translation Support Environment [FID92], and the first four processes together have been called the Meta-Integration/Translation Support Environment [QUT92]. The last three processes were introduced as CCVES to perform database enhancement and migration in such an environment.

The schema mapping process, referred to as SMTS, translates the definition of a source schema to a target schema definition (e.g. an INGRES schema to a POSTGRES schema). The query mapping process, referred to as QMTS, translates a source query to a target query (e.g. an SQL query to a QUEL query). The meta-integration process, referred to as SMIS, tackles heterogeneity at the logical level in a distributed environment containing multiple database schemas (e.g. Ontos and Exodus local schemas with a POSTGRES global schema) - it integrates the local schemas to create the global schema. The meta-visualisation process, referred to as SMVS, generates a graphical representation of a schema. The remaining three processes, namely database connectivity, enhancement and migration, together with their associated processes SMVS, SMTS and QMTS, are the subject of the present thesis, as they together form CCVES (centre section of figure 2.2).


The database connectivity process (DBC) queries meta-data from a remote database (route A-1 in figure 2.2) to supply meta-knowledge (route A-2 in figure 2.2) to the schema mapping process referred to as SMTS. SMTS translates this meta-knowledge to an internal representation which is based on SQL schema constructs. These SQL constructs are supplied to SMVS for further processing (route A-3 in figure 2.2), which results in the production of a graphical view of the schema (route A-4 in figure 2.2). Our reverse-engineering techniques [WIK95b] are applied to identify entity and relationship types to be used in the graphical model. Meta-knowledge enhancements are solicited at this point by the database enhancement process (DBE) (route B-1 in figure 2.2), which allows the definition of new constraints and changes to the existing schema. These enhancements are reflected in the graphical view (routes B-2 and B-3 in figure 2.2) and may be used to augment the database (routes B-4 to B-8 in figure 2.2). This approach to augmentation makes use of the query mapping process, referred to as QMTS, to generate the required queries to update the database via the DBC process. At this stage any existing or enhanced constraints may be applied to the database to determine the extent to which it conforms to the new enhancements. Carrying out this process will also ensure that legacy data will not be rejected by the target DBMS due to possible violations. Finally, the database migration process, referred to as DBMI, assists migration by incrementally migrating the database to the target environment (routes C-1 to C-6 in figure 2.2). Target schema constructs for each migratable component are produced via SMTS, and DDL statements are issued to the target DBMS to create the new database schema. The data for these migrated tables are extracted by instructing the source DBMS to export the source data to the target database via QMTS. Here too, the queries which implement this export are issued to the DBMS via the DBC process.

2.4 Research Aims and Objectives

Our relational database enhancement and augmentation approach is important in three respects, namely:

1) by holding the additional defining information in the database itself, this information is usable by any design tool in addition to assisting the full automation of any future re-engineering of the same database;

2) it allows better user understanding of database applications, as the associated constraints are shown in addition to the traditional entities and attributes at the conceptual level;


3) the process which assists a database administrator to clean inconsistent legacy data ensures a safe migration. To perform this latter task in a real world situation without an automated support tool is a very difficult, tedious, time consuming and error prone task.

Therefore the main aim of this project has been the design and development of a tool to assist database enhancement and migration in a heterogeneous distributed relational database environment. Such a system is concerned with enhancing the constituent databases in this type of environment to exploit potential knowledge, both to automate the re-engineering process and to assist in evolving and cleaning the legacy data to prevent data rejection, possible losses of data and/or delays in the migration process. To this end, the following detailed aims and objectives have been pursued in our research:

1. Investigation of the problems inherent in schema enhancement and migration for a heterogeneous distributed relational legacy database environment, in order to fully understand these processes.

2. Identification of the conceptual foundation on which to successfully base the design and development of a tool for this purpose. This foundation includes:

• A framework to establish meta-data representation and manipulation.
• A real world data modelling framework that facilitates the enhancement of existing working systems and which supports applications during migration.
• A framework to retain the enhanced knowledge for future use which is in line with current international standards and techniques used in newer versions of relational DBMSs.
• Exploiting existing databases in new ways, particularly linking them with data held in other legacy systems or more modern systems.
• Displaying the structure of databases in a graphical form to make it easy for users to comprehend their contents.
• The provision of an interactive graphical response when enhancements are made to a database.
• A higher level of data abstraction for tasks associated with visualising the contents, relationships and behavioural properties of entities and constraints.
• Determining the constraints on the information held and the extent to which the data conforms to these constraints.
• Integrating with other tools to maximise the benefits of the new tool to the user community.

3. Development of a prototype tool to automate the re-engineering process and the migration assisting tasks as far as possible. The following development aims have been chosen for this system:


• It should provide a realistic solution to the schema enhancement and migration assistance process.
• It should be able to access and perform this task for legacy database systems.
• It should be suitable for the data model at which it is targeted.
• It should be as generic as possible so that it can be easily customised for other data models.
• It should be able to retain the enhanced knowledge for future analysis by itself and other tools.
• It should logically support a model using modern data modelling techniques irrespective of whether it is supported by the DBMS in use.
• It should make extensive use of modern graphical user interface facilities for all graphical displays of the database schema.
• Graphical displays should also be as generic as possible so that they can be easily enhanced or customised for other display methods.


CHAPTER 3

Database Technology, Relational Model, Conceptual Modelling and Integrity Constraints

The origins and historical development of database technology are presented first, to bring into focus the evolution of ISs and the emergence of database models. The relational data model is identified as currently the most commonly used database model, and some terminology for this data model, along with its features including query languages, is then presented. A discussion of conceptual data models with special emphasis on EER and OMT is provided to introduce these data models and the symbols used in our project. Finally, we pay attention to crucial concepts relating to our work, namely the notion of semantic integrity constraints, with special emphasis on those used in semantic extensions to the relational model. The relational database language SQL is also discussed, identifying how and when it supports the implementation of these semantic integrity constraints.

3.1 Origins and Historical Developments

The origin of data management goes back to the 1950’s and hence this section is subdivided into two parts: the first part describes database technology prior to the relational data model, and the second part describes developments since. This division was chosen as the relational model is currently the most dominant database model for information management [DAT90].

3.1.1 Database Technology Prior to the Relational Data Model

Database technology emerged from the need to manipulate large collections of data for frequently used data queries and reports. The first major step in the mechanisation of information systems came with the advent of punched card machines, which worked sequentially on fixed-length fields [SEN73, SEN77]. With the appearance of stored program computers, tape-oriented systems were used to perform these tasks with an increase in user efficiency. These systems used sequential processing of files in batch mode, which was adequate until peripheral storage with random access capabilities (e.g. DASD) and time sharing operating systems with interactive processing appeared to support real-time processing in computer systems. Access methods such as direct and indexed sequential access methods (e.g. ISAM, VSAM) [BRA82, MCF91] were used to assist with the storage and location of physical records in stored files. Enhancements were made to procedural languages (e.g. COBOL) to define and manage application files, making the application program dependent on the organisation of the file. This technique caused data redundancy, as several files were used in different systems to hold the same data (e.g. emp_name and address in a payroll file; insured_name and address in an insurance file; and depositors_name and address in a bank file). These stored data files used in the applications of the 1960's are now referred to as conventional file systems, and they were maintained using third generation programming languages such as COBOL and PL/1. This evolution of mechanised information systems was influenced by the hardware and software developments which occurred in the 1950’s and early 1960’s. Most long existing legacy ISs are based on this technology. Our work does not address this type of IS as they do not use a DBMS for their data management.


The evolution of databases and database management systems [CHA76, FRY76, SIB76, SEN77, KIM79, MCG81, SEL87, DAT90, ELM94] was to a large extent the result of addressing the main deficiencies in the use of files, i.e. by reducing data redundancy and making application programs less dependent on file organisation. An important factor in this evolution was the development of data definition languages which allowed the description of a database to be separated from its application programs. This facility allowed the data definition (often called a schema) to be shared and integrated to provide a wide variety of information to the users. The repository of all data definitions (meta data) is called a data dictionary, and its use allows data definitions to be shared and made widely available to the user community.

In the late 1960's applications began to share their data files using an integrated layer of stored data descriptions, making the first true database, e.g. the IMS hierarchical database [MCG77, DAT90]. This type of database was navigational in nature and applications explicitly followed the physical organisation of records in files to locate data, using commands such as GNP - get next under parent. These databases provided centralised storage management, transaction management, recovery facilities in the event of failure and system maintained access paths. These were the typical characteristics of early DBMSs.

Work on extending COBOL to handle databases was carried out in the late 60s and 70s. This resulted in the establishment of the DBTG (i.e. DataBase Task Group) of CODASYL and the formal introduction of the network model along with its data manipulation commands [DBTG71]. The relational model was proposed during the same period [COD70], followed by the 3 level ANSI/SPARC architecture [ANSI75], which made databases more independent of applications and became a standard for the organisation of DBMSs. Three popular types of commercial database system7, classified by their underlying data model, emerged during the 70s [DAT90, ELM94], namely:

• hierarchical
• network
• relational

and these have been the dominant types of DBMS from the late 60s on into the 80s and 90s.

3.1.2 Database Technology Since the Relational Data Model

At the same time as the relational data model appeared, database systems introduced another layer of data description on top of the navigational functionality of the early hierarchical and network models to bring extra logical data independence8. The relational model also introduced the use of non-procedural (i.e. declarative) languages such as SQL [CHA74]. By the early 1980's many relational database products, e.g. System R [AST76], DB2 [HAD84], INGRES [STO76] and Oracle, were in use and, due to their growing maturity in the mid 80s and the complexity of programming, navigating, and changing data structures in the older DBMS data models, the relational data model was able to take over the commercial database market, with the result that it is now dominant.

7 Other types, such as flat file and inverted file systems, were also used.
8 This allows changes to the logical structure of data without changing the application programs.


The advent of inexpensive and reliable communication between computer systems, through the development of national and international networks, has brought further changes in the design of these systems. These developments led to the introduction of distributed databases, where a processor uses data at several locations and links it as though it were at a single site. This technology has led to distributed DBMSs and the need for interoperability among different database systems [OZS91, BEL92].

Several shortcomings of the relational model have been identified, including its inability to perform efficiently in compute-intensive applications such as simulation, to cope with computer-aided design (CAD) and programming language environments, and to represent and manipulate effectively concepts such as [KIM90]:

• Complex nested entities (e.g. design and engineering objects),
• Unstructured data (e.g. images, textual documents),
• Generalisation and aggregation within a data structure,
• The notion of time and versioning of objects and schemas,
• Long duration transactions.

The notion of a conceptual schema for application-independent modelling introduced by the ANSI/SPARC architecture led to another data model, namely the semantic model. One of the most successful semantic models is the entity-relationship (E-R) model [CHE76]. Its concepts include entities, relationships, value sets and attributes. These concepts are used in traditional database design as they are application-independent. Many modelling concepts based on variants of or extensions to the E-R model have appeared since Chen’s paper. The enhanced/extended entity-relationship model (EER) [TEO86, ELM94], the entity-category-relationship model (ECR) [ELM85], and the Object Modelling Technique (OMT) [RUM91] are the most popular of these. The DAPLEX functional model [SHI81] and the Semantic Data Model [HAM81] are also semantic models. They capture a richer set of semantic relationships among real-world entities in a database than the E-R based models. Semantic relationships such as the generalisation / specialisation relationship between a superclass and its subclass, the aggregation relationship between a class and its attributes, the instance-of relationship between an instance and its class, the part-of relationship between objects forming a composite object, and the version-of relationship between abstracted versioned objects are semantic extensions supported in these models. The object-oriented data model, with its notions of class hierarchy, class-composition hierarchy (for nested objects) and methods, could be regarded as a subset of this type of semantic data model in terms of its modelling power, except for the fact that the semantic data model lacks the notion of methods [KIM90], which is an important aspect of the object-oriented model.

The relational model of data and the relational query language have been extended [ROW87] to allow modelling and manipulation of additional semantic relationships and database facilities. These extensions include data abstraction, encapsulation, object identity, composite objects, class hierarchies, rules and procedures. However, these extended relational systems are still being evolved to fully incorporate features such as implementation of domain and extended data types, enforcement of primary and foreign key and referential integrity checking, prohibition of duplicate rows in tables and views, handling missing information by supporting four-valued predicate logic (i.e. true, false, unknown, not applicable) and view updatability [KIV92], and they are not yet available as commercial products.


The early 1990's saw the emergence of new database systems through a natural evolution of database technology, with many relational database systems being extended and other data models (e.g. the object-oriented model) appearing to satisfy more diverse application needs. This opened opportunities to use databases for a greater diversity of applications which had not been previously exploited, as they were not perceived as tractable by a database approach (e.g. image, medical, document management, engineering design and multi-media information, used in complex information processing applications such as office automation (OA), computer-aided design (CAD), computer-aided manufacturing (CAM) and hyper media [KIM90, ZDO90, CAT94]). The object-oriented (O-O) paradigm represents a sound basis for making progress in these areas and as a result two types of DBMS are beginning to dominate in the mid 90s [ZDO90], namely: the object-oriented DBMS and the extended relational DBMS.

There are two styles of O-O DBMS, depending on whether they have evolved from extensions to an O-O programming language or by evolving a database model. Extensions have been created for two database models, namely the relational and the functional models. The extensions to existing relational DBMSs have resulted in the so-called Extended Relational DBMSs which have O-O features (e.g. POSTGRES and Starburst), while extensions to the functional model have produced PROBE and OODAPLEX. The approach of extending O-O programming language systems with database management features has resulted in many systems (e.g. Smalltalk into GemStone and ALLTALK, and C++ into many DBMSs including VBase / ONTOS, IRIS and O2). References to these systems, with additional information, can be found in [CAT94]. Research is currently taking place into other kinds of database such as active, deductive and expert database systems [DAT90].

This thesis focuses on the relational model and possible extensions to it which can represent semantics in existing relational database information systems in such a way that these systems can be viewed in new ways and easily prepared for migration to more modern database environments.

3.2 Relational Data Model

In this section we introduce some of the commonly used terminology of the relational model. This is followed by a selective description of the features and query languages of this model. Further details of this data model can be found in most introductory database text books, e.g. [MCF91, ROB93, ELM94, DAT95].

A relation is represented as a table (entity) in which each row represents a tuple (record), the number of columns being the degree of the relation and the number of rows being its cardinality. An example of this representation is shown in figure 3.1, which shows a relation holding Student details, with degree 3 and cardinality 5. This table and each of its columns are named, so that a unique identity for a table column of a given schema is achieved via its table name and column name. The columns of a table are called attributes (fields), each having its own domain (data type) representing its pool of legal data. Basic types of domains are used (e.g. integer, real, character, text, date) to define the domains of attributes. Constraints may be enforced to further restrict the pool of legal values for an attribute.


Tables which actually hold data are called base tables, to distinguish them from view tables which can be used for viewing data associated with one or more base tables. A view table can also be an abstraction from a single base table which is used to control access to parts of the data. A column or set of columns whose values uniquely identify a row of a relation is called a candidate key (key) of the relation. It is customary to designate one candidate key of a relation as a primary key (e.g. SNO in figure 3.1). The specification of keys restricts the possible values the key attribute(s) may hold (e.g. no duplicate values), and is a type of constraint enforceable on a relation. Additional constraints may be imposed on an attribute to further restrict its legal values. In such cases, there should be a common set of legal values satisfying all the constraints of that attribute, ensuring its ability to accept some data. For example, a pattern constraint which ensures that the first character of SNO is ‘S’ further restricts the possible values of SNO - see figure 3.1. Many other concepts and constraints are associated with the relational model, although most of them are not supported by early relational systems nor, indeed, by some of the more recent relational systems (e.g. a value set constraint for the Address field as shown in figure 3.1).
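In SQL, the Student relation of figure 3.1 and the constraints just described could be declared as follows. This is a sketch in which the column sizes and the particular value set are our own choices, and the view is added simply to illustrate the distinction between base tables and view tables.

    CREATE TABLE Student (
        SNO     CHAR(3)      PRIMARY KEY,                 -- candidate key chosen as primary key
        Name    VARCHAR(20),
        Address VARCHAR(20),
        CONSTRAINT sno_pattern CHECK (SNO LIKE 'S%'),     -- pattern constraint on SNO
        CONSTRAINT addr_values CHECK (Address IN          -- value set constraint on Address
            ('Cardiff', 'Bristol', 'Swansea', 'Newport'))
    );

    -- The five tuples of figure 3.1 (degree 3, cardinality 5).
    INSERT INTO Student VALUES ('S1', 'Jones', 'Cardiff');
    INSERT INTO Student VALUES ('S2', 'Smith', 'Bristol');
    INSERT INTO Student VALUES ('S3', 'Gray',  'Swansea');
    INSERT INTO Student VALUES ('S4', 'Brown', 'Cardiff');
    INSERT INTO Student VALUES ('S5', 'Jones', 'Newport');

    -- A view table abstracting part of the base table, e.g. to control access.
    CREATE VIEW CardiffStudent AS
        SELECT SNO, Name
        FROM   Student
        WHERE  Address = 'Cardiff';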

Figure 3.1: The Student relation (degree 3, cardinality 5)

    SNO   Name    Address
    ---   -----   -------
    S1    Jones   Cardiff
    S2    Smith   Bristol
    S3    Gray    Swansea
    S4    Brown   Cardiff
    S5    Jones   Newport

[The figure labels the columns as attributes drawn from domains (e.g. type character), the rows as tuples, the column count as the degree and the row count as the cardinality. SNO is annotated as the primary key (unique values) with a pattern constraint (all values begin with 'S'), and Address with a value set constraint.]

3.2.1 Requisite Features of the Relational Model

During the early stages of the development of relational database systems, many requisite features were identified which a comprehensive relational system should have [KIM79, DAT90]. We shall now examine these features to illustrate what was expected from early relational database management systems. They included support for:

• Recovery from both soft and hard crashes,
• A report generator for formatted display of the results of queries,
• An efficient optimiser to meet the response-time requirements of users,
• User views of the stored database,
• A non-procedural language for query, data manipulation / definition / control,
• Concurrency control to allow sharing of a database by multiple users and applications,
• Transaction management,
• Integrity control during data manipulation,
• Selective access control to prevent one user’s database being accessed by unauthorised users,
• Efficient file structures to store the database, and
• Efficient access paths to the stored data.

Many early relational DBMSs originated at universities and research institutes, and none of them were able to provide all the above features. These systems mainly focussed on optimising techniques for query processing and on recovery from soft and hard crashes, and did not pay much attention to the other features. Few of these database systems were commercially available, and for those that were, the marketing was based on specific features such as report generation (e.g. MAGNUM) and views with selective access control (e.g. QBE). The early commercial systems did not support the full range of features either.

Since the mid 1980’s many database products have appeared which aim to provide most of the above features. The enforcement of features such as concurrency control was embodied in these systems, while features such as views, access and integrity control were provided via non-procedural language commands. Systems which were unable to provide these features via a non-procedural language offered procedural extensions (e.g. C with embedded SQL) to perform such tasks. This resulted in the use of two types of data manipulation language, i.e. procedural and non-procedural, to perform database system functions. In procedural languages a sequence of statements is issued to specify the navigation path needed in the database to retrieve the required data; thus they are navigational languages. This approach was used by all hierarchical and network database systems and by some relational systems. However, most relational systems offer a non-procedural (i.e. non-navigational) language. This allows retrieval of the required data by using a single retrieval expression, which in general has a degree of complexity corresponding to the complexity of the retrieval (e.g. SQL).

3.2.2 Query Language for the Relational Model

Querying, or the retrieval of information from a database, is perhaps the aspect of relational languages which has received the most attention. A variety of approaches to querying has emerged, based on relational calculus, relational algebra, mapping-oriented languages and graphic-oriented languages [CHA76, DAT90]. During the first decade of relational DBMSs, there were many experimental implementations of relational systems in universities and industry, particularly at IBM. The initial projects were aimed at proving the feasibility of relational database systems supporting high-level non-procedural retrieval languages. The Structured Query Language (SQL9) [AST75] emerged from an IBM research project. Later projects created more comprehensive relational DBMSs, and among the more important of these systems were probably the System R project at IBM [AST76] and the INGRES project (with its QUEL query language) at the University of California at Berkeley [STO76].

9 Initially called SEQUEL; the name SQL is still commonly pronounced as SEQUEL.


Standards for relational query languages were introduced [ANSI86] so that a common language could be used to retrieve information from a database. SQL became the standard query language for relational databases. These standards were reviewed regularly [ANSI89a, ANSI92] and are still being reviewed [ISO94] to incorporate technological changes that meet modern database requirements. Hence, the standard query language SQL is evolving, and although some of the recent database systems conform to the [ANSI92] standards, they will have to be upgraded to incorporate even more recent advances such as the object-oriented paradigm additions to the language [ISO94]. This means that different database system query languages conform to different standards, and provide different features and facilities to their users even though they are of the same type. Hence, information systems developed during different eras will have used different techniques to perform the same task, with early systems being more procedural in their approach than more recent ones.

Query languages, including SQL, have three categories of statements, i.e. the data manipulation language (DML) statements to perform all retrievals, updates, insertions and deletions, the data definition language (DDL) statements to define the schema and its behavioural functions such as rules and constraints, and the data control language (DCL) statements to specify access control, which is concerned with the privileges to be given to database users.

3.3 Conceptual Modelling

The conceptual model is a high level representation of a data model, providing an identification and description of the main data objects (avoiding details of their implementation). This model is hardware and software independent, and is represented using a set of graphical symbols in a diagrammatic form. As noted in part ‘c’ of stage 1 of section 2.2.2, different users may prefer different graphical models and hence it is useful to provide them with a choice of models. We consider two types of conceptual model in this thesis, namely: the enhanced entity-relationship model (EER), which is based on the popular entity-relationship model, and the object-modelling technique (OMT), which uses the more recent concepts of object-oriented modelling as opposed to the entities of the E-R model. These were chosen as they are among the most widely used conceptual modelling approaches currently in use, and they allow representation of modelling concepts such as generalisation hierarchies.

3.3.1 Enhanced Entity-Relationship Model (EER)

The entity-relationship approach is considered to be the first useful proposal [CHE76] on the subject of conceptual modelling. It is concerned with creating an entity model which represents a high-level conceptual data model of the proposed database, i.e. it is an abstract description of the structure of the entities in the application domain, including their identity, relationships to other entities and attributes, without regard for eventual implementation details. Thus an E-R diagram describes entities and their relationships using distinctive symbols, e.g. an entity is a rectangle and a relationship is a diamond. Distinctive symbols for recent modelling concepts such as generalisation, aggregation and complex structures have been introduced into these models by practitioners. Despite its popularity, no standard has emerged or been defined for this model. Hence different authors use different notations to represent the same concept.
Therefore we have to define the symbols we use for these concepts: we have based our definitions on [ROB93] and [ELM94].


a) Entity

An entity in the E-R model corresponds to a table in the relational environment and is represented by a rectangle containing the entity name, e.g. the entity Employee of figure 3.2.

b) Attributes

Attributes are represented by a circle that is connected to the corresponding entity by a line. Each attribute has a name located near the circle10, e.g. the attributes EmpNo, Name and Address of the Employee entity in figure 3.2. Key attributes of a relation are indicated by using a colour to fill in the circle (red on the computer screen, or shaded dark in this thesis), e.g. the attribute EmpNo of Employee in figure 3.2. Attributes usually have a single value in an entity occurrence, although multivalued attributes can occur and other types such as derived attributes can be represented in the conceptual model (see appendix B for a comprehensive list of the symbols used in EER models in this thesis).

c) Relationships

A relationship is an association between entities. Each relationship is named and represented by a diamond-shaped symbol. Three types of relationships (one-to-many or 1:M, many-to-many or M:N, and one-to-one or 1:1) are used to describe the association between entities. Here 1 means that an instance of this entity relates to only one instance of the other entity (e.g. an employee works for only one department), and M or N means that an instance of an entity may relate to more than one instance of the other entity (e.g. a department can have many employees working for it - see figure 3.2) through this relationship (the same entities can be linked in more than one relationship). The relationship type is determined by the participating entities and their associated properties. In the relational model a separate entity is used for M:N relationship types (e.g. a composite entity, as in the case of the entity ComMem of figure 3.2), and the other relationship types (i.e. 1:1 and 1:M) are represented by repeated attributes (e.g. the relationship WorksFor of figure 3.2 is established from the attribute WorksFor of the entity Employee).

10 We do not place the attribute name inside the circle to avoid the use of large circles or ovals in our diagrams.


Figure 3.2: EER diagram for part of the University Database

[The diagram shows: the entity Employee with attributes EmpNo (key), Name and Address; the generalised entity Office with Department and Faculty as its disjoint (d) specialised entities; the weak entity Committee (with attribute Title) attached via the weak relationship Fcom, with cardinality (1,N); the composite entity ComMem (with attribute YearJoined) representing an M:N relationship, with cardinalities (1,N); the WorksFor relationship between Employee and Department with connectivity N:1 and cardinalities (1,1) and (4,N); and a Head relationship with cardinalities (1,1) and (0,1). The legend identifies the symbols used for entities, weak entities, composite entities, generalised and specialised entities, relationships, weak relationships, attributes and keys.]

A relationship’s degree indicates the number of associated entities (or participants) there are in the relationship. Relationships with 1, 2 and 3 participants are called unary, binary and ternary relationships, respectively. In practice most relationships are binary (e.g. the relationship WorksFor in figure 3.2) and relationships of higher order (e.g. four) occur very rarely, as they are usually simplified to a series of binary relationships. The term connectivity is used to describe the relationship classification and it is represented in the diagram by using 1 or N near the related entity (see, for example, the WorksFor relationship in figure 3.2). Alternatively, a more detailed description of the relationship is specified using cardinality, which expresses the specific number of entity occurrences associated with one occurrence of the related entity. The actual number of occurrences depends on the organisation’s policy and hence can differ from that of another organisation, although both may model the same information. The cardinality has upper and lower limits indicating a range and is represented in the diagram within brackets near the related entity (see the WorksFor relationship in figure 3.211). Cardinality is a type of constraint and in appendix B.2 we provide more details about the symbols and notations used to represent these types of constraints. Thus in the WorksFor relationship:

(1,1) indicates an employee must belong to a department
(4,N) indicates a department must have at least 4 employees
N indicates a department has many employees
1 indicates an employee may work for only one department

d) Other Relationship and Entity Types

The original E-R model of Chen did not contain relationship attributes and did not use the concept of a composite entity. We use this concept as in [ROB93], because the relational model requires the use of an entity composed of the primary keys of other entities to connect and represent M:N relationships.

11 In practice only one of these types is shown in a diagram, depending on the availability of information on cardinality limits.


Hence, a composite entity (also called a link [RUM91] or regular [CHI94] entity) representing an M:N relationship is represented using a diamond inside the rectangle, indicating that it is an entity as well as a relationship (e.g. ComMem of figure 3.2). In this type of relationship, the primary key of the composite entity is created by using the keys of the entities which it connects. This is usually a binary or 2-ary relationship involving two referenced entities, and is a special case of the n-ary relationship which connects n entities.

Some entity occurrences cannot exist without an entity occurrence with which they have a relationship. Such entities are called weak entities and are represented by a double rectangle (e.g. Committee in figure 3.2). The relationship formed with this entity type is called a weak relationship and is represented by a double diamond (e.g. the Fcom relationship of figure 3.2). In this type of relationship, the key of the entity on which the weak entity depends forms a proper subset of the weak entity’s primary key, and the remaining attributes of that primary key (called dangling attributes) do not contain a key of any other entity. When a relationship exists between occurrences of the same entity set (e.g. a unary relationship) it forms a recursive relationship (e.g. a course may have pre-requisite courses).

e) Generalisation / Specialisation / Inheritance

Most organisations employ people with a wide range of skills and special qualifications (e.g. a university employs academics, secretaries, porters, research associates, etc.) and it may be necessary to record additional information for certain types of employee (e.g. the qualifications of academics). Representing such additional information in the employee table results in the use of null values in this attribute for other employees, as this additional information is not applicable to them. To overcome this, common characteristics of all employees are chosen to define the employee entity as a generalised entity, and the additional information is put in a separate entity, called a specialised entity, which inherits all the properties of its parent entity (i.e. the generalised entity), creating a parent-child or is-a relationship (also called a generalised hierarchy). The higher level of this relationship is a supertype entity (i.e. the generalised entity) and the lower level is a subtype entity (i.e. the specialised entity). A supertype entity set is usually composed of several unique and disjoint (non-overlapping) subtype entity sets. However, some supertypes contain overlapping subtypes (e.g. an employee may also be a student and hence we get two subtypes of person in an overlapping relationship). There are constraints applicable to generalised hierarchies and special symbols / notations are used in these cases (see appendices B.1 figure ‘e’ and B.2 figure ‘b’). In figure 3.2, the entities Office, Department and Faculty form a generalised hierarchy, with Office being the supertype entity and Department and Faculty being the subtype entities. Subtype and supertype entities have a 1:1 relationship, although we view it differently, i.e. as a hierarchy.

The subtypes described above inherit from a single supertype entity. However, there may be cases where a subtype inherits from multiple supertypes (e.g. an empstudent entity representing employees who are also students may inherit from both the employee and student entities). This is known as multiple inheritance. In such cases the subtype may represent either an intersection or a union.
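To relate this notation back to relational tables, the sketch below gives one plausible SQL rendering of part of figure 3.2: Faculty as a subtype sharing the key of its Office supertype, Committee as a weak entity whose key combines its owner's key with the dangling attribute Title, and ComMem as a composite entity keyed by the keys of the entities it connects. Column names and types not visible in the figure (e.g. OfficeNo) are assumed for illustration only.

    CREATE TABLE Office (                     -- generalised (supertype) entity
        OfficeNo   CHAR(4)  PRIMARY KEY,
        Address    VARCHAR(40)
    );

    CREATE TABLE Faculty (                    -- specialised (subtype) entity sharing the supertype key
        OfficeNo   CHAR(4)  PRIMARY KEY REFERENCES Office (OfficeNo)
    );

    CREATE TABLE Employee (
        EmpNo      CHAR(6)  PRIMARY KEY,
        Name       VARCHAR(30)
    );

    CREATE TABLE Committee (                  -- weak entity, assumed owned by Faculty via Fcom
        OfficeNo   CHAR(4)  REFERENCES Faculty (OfficeNo),
        Title      VARCHAR(30),               -- dangling attribute completing the key
        PRIMARY KEY (OfficeNo, Title)
    );

    CREATE TABLE ComMem (                     -- composite entity for the M:N committee membership
        EmpNo      CHAR(6)  REFERENCES Employee (EmpNo),
        OfficeNo   CHAR(4),
        Title      VARCHAR(30),
        YearJoined INTEGER,
        PRIMARY KEY (EmpNo, OfficeNo, Title), -- key built from the keys it connects
        FOREIGN KEY (OfficeNo, Title) REFERENCES Committee (OfficeNo, Title)
    );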
The concept of inheritance was taken from the O-O paradigm and hence it does not occur in the original E-R model, but is included in the EER model. 3.3.2 Object Modelling Technique (OMT)


The Object Modelling Technique (OMT) is an O-O development methodology. It creates a high-level conceptual data model of a proposed system without regard for eventual implementation details. This model is based on objects. The notations of OMT used here are taken from [RUM91] and those used in our work are described in appendix B, where they are compared with their EER equivalents; hence we do not describe this model in depth here. The diagrams produced by this method are known as object diagrams. They combine O-O concepts (i.e. classes and inheritance) with information modelling concepts (i.e. entities and relationships). Although the terminology differs from that used in the EER model, both create similar conceptual models, using different graphical notations. The main notations used in OMT are rectangles with text inside (e.g. for classes and their properties, as opposed to the EER model, where attributes appear outside the entity). This makes OMT easier to implement than EER in a graphical computing environment. OMT is used for most O-O modelling (e.g. in [COO93, IDS94]), and so it is a widely known technique.

3.4 Semantic Integrity Constraints

A real world application is always governed by many rules which define the application domain; these are referred to as integrity constraints [DAT83, ELM94]. An important activity when designing a database application is to identify and specify the integrity constraints for that database and, if possible, to enforce them using the DBMS constraint facilities.

The term integrity refers to the accuracy, correctness or validity of a database. The role of the integrity constraint enforcer is to ensure that the data in the database is accurate by guarding it against invalid updates, which may be caused by errors in data entry, mistakes on the part of the operator or the application programmer, system failures, and even deliberate falsification by users. This latter case is the concern of the security system, which protects the database from unauthorised access (i.e. it implements authorisation constraints). The integrity system uses integrity rules (integrity constraints) to protect the database from invalid updates supplied by authorised users and to maintain the logical consistency of the database.

Integrity is sometimes used to cover both semantic and transaction integrity. The latter deals with concurrency control (i.e. the prevention of inconsistencies caused by concurrent access to a database by multiple users or applications) and with recovery techniques which prevent errors due to malfunctioning of system hardware and software. Protection against this type of integrity violation is provided by most commercially available systems and is not an issue of this thesis. Here we shall use the terms integrity and constraints to refer only to semantic integrity constraints.

Integrity rules cannot detect all types of errors. For instance, when dealing with percentage marks there is no way that the computer can detect that an input value of 45 for a student mark should really be 54; on the other hand, a value of 455 could be detected and corrected. Consistency is another term used for integrity. However, this is normally used in cases where two or more values in the database are required to be in agreement with each other in some way. For example, the DeptNo in an Employee record should tally with the DeptNo appearing in some Department record (referential integrity in relational systems), or the Age of a Person must be


equal to the difference in years between today's date and their date of birth (a property of a derived attribute).

In order to check for invalid data, DBMSs use an integrity subsystem to monitor transactions and detect integrity violations. In the event of a violation the DBMS takes appropriate actions such as rejecting the operation, reporting the violation, or assisting in correcting the error. To perform such a task, the integrity subsystem must be provided with a set of rules that define what errors to check for, when to do the checking, and what to do if an error is detected.

Most early DBMSs did not have an integrity subsystem (mainly due to unacceptable database system performance when integrity checking was performed in older technological environments) and hence such checking was not implemented in their information systems. These information systems performed integrity checking using procedural language extensions of the database to check for invalid entries during the capture of data via their user interface (e.g. data entry forms). Here too, due to technological limitations and poor database performance, only specific types of constraints (e.g. range check, pattern matching), and a limited number of checks, were allowed for an attribute. As these rules were coded in application programs they violated program / data (rule) independence for constraint specification. However, most recent DBMSs attempt to support such specifications using their DDL and hence they achieve program / rule independence. The original SQL standard specifications [ANSI86] were subsequently enhanced so that constraints could be specified using SQL [ANSI89a]. Current commercial DBMSs are seeking to meet these standards by targeting the implementation of the SQL-2 standards [ANSI92] in their latest releases. Systems such as Oracle now conform to these standards, while others such as INGRES and POSTGRES have taken a different path by extending their systems with a rule subsystem, which performs similar tasks but using a procedural style approach where the rules and procedures are retained in data dictionaries.

Integrity constraints can be identified for the properties of a data model and for the values of a database application. We examine both to present a detailed description of the types of constraint associated with databases and in particular those used for our work.

3.4.1 Integrity Constraints of a Data Model

Some constraints are automatically supported by the data model itself. These constraints are assumed to hold by the definition of that data model (i.e. they are built into the system and not specified by a user). They are called the inherent constraints of the data model. There are also constraints that can be specified and represented in a data model. These are called the implicit constraints of the model and they are specified using DDL statements in a relational schema, or graphical constructs in an E-R model. Table 3.1 gives some examples of implicit and inherent constraints for the relational and EER data models. The constraint types used in this table are described in detail in section 3.5. The structure of a data model represents inherent constraints implicitly and is also capable of representing implicit constraints. Hence, constraints represented in these two ways are referred to as structural constraints.

Data models differ in the way constraints are handled. Hierarchical and network database constraints are handled by being tightly linked to structural concepts (records, sets,


segment definitions), of which the parent-child and owner-member relationships are logical examples. The classical relational model, on the other hand, has two constraints represented structurally by its relations or tables (namely, relations consist of a certain number of simple attributes and have no duplicate rows). Hence only specific types of structural constraint are defined for a particular data model (e.g. parent-child relationships are not defined for the relational model).

Table 3.1: Structural constraints of selected data models

Relational
  Implicit constraints: domain constraints; key constraints; relationship structural constraints.
  Inherent constraints: a relation consists of a certain number of simple attributes; an attribute value is atomic; no duplicate tuples are allowed.

EER
  Implicit constraints: primary key attributes; attribute structural constraints; relationship structural constraints; superclass / subclass relationships; disjointness / totality constraints on specialisation / generalisation.
  Inherent constraints: every relationship instance of an n-ary relationship type R relates exactly one entity from each entity type participating in R in a specific role; every entity instance in a subclass must also exist in its superclass.

Every data model has a set of concepts and rules (or assertions) that specify the structure and the implicit constraints of a database describing a miniworld. A given implementation of a data model by a particular DBMS will usually support only some of the structural (inherent and implicit) constraints of the data model automatically, and the rest must be defined explicitly. These additional constraints are called explicit or behavioural constraints. They are defined using either a procedural or a declarative (non-procedural) approach, which is not part of the data model per se.

3.4.2 Integrity Constraints of a Database Application

In database applications, integrity constraints are used to ensure the correctness of a database. A database changes during an update transaction, and constraints are used at this stage to ensure that the database is in a consistent state before and after that transaction. This type of constraint is called a state (static) constraint, as it applies to a particular state of the database and should hold for every state in which the database is not in transition, i.e. not in the process of being updated. Constraints that apply to the change of a database from one state to another are called transition (dynamic) constraints (e.g. the age of a person can only be increased, meaning that the new value of age must be greater than the old value). In general, transition constraints occur less frequently than state constraints and are usually specified explicitly.

The discussion above classifies the types of semantic integrity constraints used in data models and database applications. We summarise them in figure 3.3 to highlight the basic classification of integrity constraints. We separate the two approaches using a dotted line as they are independent of each other. However, most constraints are common to both categories as they are implemented using a particular data model for a database application. Data models used for conceptual modelling are more concerned with structural constraints, as opposed to the value constraints of database applications.


Figure 3.3: Classification of integrity constraints. Data model constraints comprise structural constraints (inherent and implicit constraints) together with explicit (behavioural) constraints; database application constraints comprise state (static) and transition (dynamic) constraints.
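To make the state / transition distinction concrete, the following is a minimal sketch (the constraint and trigger names are chosen purely for illustration, and the trigger follows the draft SQL-3 style, so the exact syntax of the action part will vary between products): a state constraint can be declared as a CHECK clause, whereas a transition constraint needs access to both the old and new values and is therefore typically expressed as a trigger or rule.

    -- State constraint: must hold in every (non-transitional) state
    ALTER TABLE Person
        ADD CONSTRAINT ck_person_age CHECK (Age >= 0);

    -- Transition constraint: the Age of a person may only increase
    CREATE TRIGGER tr_person_age
        BEFORE UPDATE OF Age ON Person
        REFERENCING OLD AS o NEW AS n
        FOR EACH ROW
        WHEN (n.Age < o.Age)
            SIGNAL SQLSTATE '45000';   -- illustrative rejection action

In a DBMS without such trigger support, the same transition constraint would have to be enforced procedurally in every transaction that updates Age, which is exactly the situation discussed for legacy systems in section 3.6.1.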

3.5 Constraint Types

We consider constraint types in more detail here so that we can later relate them to data models and database applications. Initially, we describe value constraints (i.e. domain and key constraints), which are applicable to database values (i.e. attributes). Then, we describe structural constraints, namely: attribute structural, relationship structural and superclass / subclass structural constraints. These constraints are often associated with data models and some of them have been mentioned in section 3.4.1. In this section, we look at them with respect to their structural properties, and are concerned with identifying differences within a structure, in addition to the relationships (e.g. between entities) formed by them. Finally, constraints that do not fall into either of these categories are described. As most of these constraints are state constraints, we shall refer to the constraint type only when a type distinction is necessary.

All structural constraints are shown in a conceptual model, as this model is used to describe the structure of a database. Not all value constraints (e.g. check constraints) are shown, as they are not associated with the structure of a database and are described using a DML. However, our work includes presenting them at optional lower levels of abstraction which involve software dependent code. This code is based on current SQL standards and may be replaced using equivalent graphical constructs if necessary12. Here, for each type of SQL statement, we could introduce a suitable graphical representation and hence increase its readability. All value constraints are implicitly or explicitly defined when implementing an application. Most constraints considered here are implicit constraints, as they may be specified using the DDL of a modern DBMS. In such cases the DBMS will monitor all changes to the data in the database to ensure that no constraint has been violated by these changes.

12 This idea is beyond the scope of this thesis.


3.5.1 Domain Constraints

Domain constraints are specified by associating every simple attribute type with a value set. The value set of a simple attribute is initially defined using its data type (integer, real, char, boolean) and length, and is later further restricted using appropriate constraints such as a range (minimum, maximum) or a pattern (letters and digits). For example, the value set for the Deptno attribute of the entity Department could be specified as data type character of length 5, and the Salary attribute of the entity Staff as data type decimal of length 8 with 2 decimal places, in the range 3000 to 20000. Nonnull constraints can be seen as a special case of domain constraints, since they too restrict the domain of attributes. These constraints are used to eliminate the possibility of missing or unknown values of an attribute occurring in the database. A domain constraint is usually used to restrict the value of an attribute, e.g. an employee's age is ≥ 18 (i.e. a state constraint); however, it may also be used to compare values of two states, e.g. an employee's new salary is ≥ their current salary (i.e. a transition constraint).

3.5.2 Key Constraints

Key constraints specify the key attribute(s) that uniquely identify an instance of an entity. These constraints are also called candidate key or uniqueness constraints. For example, stating that Deptno is a key of Department will ensure that no two departments have the same Deptno. When a set of attributes forms a key, that key is called a composite key, as we are dealing with a composite attribute. When a nonnull constraint is added to a key uniqueness constraint, such a key is referred to as a primary key. An entity may have several candidate keys; in such cases one is called the primary key and the others alternate keys. Primary key attributes are shown in the EER model (see appendix B.2, figure 'b'). The OMT model uses object identities (oids) to uniquely identify objects and, as they are usually system generated, they are not shown in this model. However, when modelling relational databases we do not use the concept of oid; instead we have primary keys, which are shown in our diagrams (see appendix B.2, figure 'b') as they carry important information about a relational database.

3.5.3 Structural Constraints on Attributes

Attribute structural constraints specify whether an attribute is single valued or multivalued. Multivalued attributes with a fixed number of possible values are sometimes defined as composite attributes. For example, name can be a composite attribute composed of first name, middle initial and last name. However, composite attributes cannot be constructed for multivalued attributes like a student's course set, where the student can attend several courses. In such a case one would have to use an alternative solution, such as recording all possibilities in one long string or using a separate data type like sets. This type of constraint is not generally supported by most traditional DBMSs. In the relational model we use a separate entity to hold the multiple values, and this is related to the original entity through an identical primary key [ELM94], as in the sketch below.
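As an illustration of sections 3.5.2 and 3.5.3, the following is a minimal SQL sketch (the Student and Attends tables and their columns are hypothetical, and are not part of the example database used elsewhere in this thesis) showing a primary key, an alternate key, and a multivalued attribute held in a separate relation:

    CREATE TABLE Student (
        StudentNo CHAR(7)  PRIMARY KEY,       -- primary key
        RegNo     CHAR(10) NOT NULL UNIQUE,   -- alternate (candidate) key
        Name      CHAR(30) NOT NULL );

    -- The multivalued attribute "courses attended" is held in a separate
    -- relation whose composite primary key includes the key of Student.
    CREATE TABLE Attends (
        StudentNo CHAR(7) REFERENCES Student (StudentNo),
        CourseNo  CHAR(6),
        PRIMARY KEY (StudentNo, CourseNo) );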


3.5.4 Structural Constraints on Relationships

Structural constraints on relationships specify limitations on the participation of entities in relationship instances. Two types of relationship constraint occur frequently: cardinality ratio constraints and participation constraints. The cardinality ratio constraint specifies the number of relationship instances that an entity can participate in, using 1 and N (many). For example, the constraints "every employee works for exactly one department" and "a department can have many employees working for it" give the cardinality ratio 1:N, meaning that each department entity can be related to numerous employee entities.

A participation constraint specifies whether the existence of an entity depends on its being related to another entity via a certain relationship type. If all the instances of an entity participate in relationships of this type then the entity has total participation (existence dependency) in the relationship. Otherwise the participation is partial, meaning that only some of the instances of that entity participate in a relationship of this type. For example, the constraint "every employee works for exactly one department" means that the Employee entity has total participation in the relationship WorksFor (see figure 3.2), and the constraint "an employee can be the head of a department" means that the Employee entity has partial participation in the relationship Head (see figure 3.2) (i.e. not all employees are heads of departments).

Referential integrity constraints are used to specify a type of structural relationship constraint. In relational databases, foreign keys are used to define referential integrity constraints. A foreign key is defined on attributes of a relation, known as the referencing table. The foreign key attribute(s) of the referencing table (e.g. WorksFor of Employee in figure 3.4) always refer to attribute(s) of another relation (e.g. DeptCode of Department in figure 3.4), which we call the referenced table. The referenced attribute(s) have a uniqueness property, being the primary or an alternate key of that relation. Thus references from one relation to another are achieved by using foreign keys, which indicate a relationship between two entities. This also establishes an inclusion dependency between the two entities: the values of the attribute of the referencing entity (e.g. Employee.WorksFor) are a subset of the values of the attribute of the referenced entity (e.g. Department.DeptCode). Only recent DBMSs such as Oracle version 7 support the specification of foreign keys using DDL statements.
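A minimal sketch of how these constraints map onto SQL DDL, assuming a simplified version of the Employee / Department schema of figure 3.4 (only the columns needed for the illustration are shown, and Department.Head is shown without its own foreign key for brevity):

    CREATE TABLE Department (
        DeptCode CHAR(5) PRIMARY KEY,
        Head     CHAR(5) );              -- nullable: partial participation in Head

    CREATE TABLE Employee (
        EmpNo    CHAR(5) PRIMARY KEY,
        WorksFor CHAR(5) NOT NULL        -- total participation in WorksFor
                 REFERENCES Department (DeptCode) );

    -- The foreign key sits on the "many" side, giving the 1:N cardinality of
    -- WorksFor; declaring it UNIQUE as well would restrict the ratio to 1:1.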

Employee                          Department
...Attributes...   WorksFor       DeptCode   ...Attributes...
                   COMMA          COMMA
                   MATHS          ELSYM
                   COMMA          MATHS
                   COMMA
                   MATHS
(5 records)                       (3 records)

WorksFor is a foreign key attribute of the referencing entity Employee. This attribute refers to the referenced entity Department, whose attribute DeptCode is its primary key.

Figure 3.4: A foreign key example

3.5.5 Structural Constraints on Specialisation/Generalisation

Disjointness and completeness constraints are defined on specialisation / generalisation structures. The disjointness constraint specifies that the subclasses of a specialisation (generalisation) must be disjoint. This means that an entity can be a member of at most one of the subclasses of the specialisation. When an entity is a member of more than one of the subclasses, we get an overlapping situation. The completeness constraint on a specialisation (generalisation) defines total / partial specialisation (generalisation). A total specialisation specifies


the constraint that every entity in the superclass must be a member of some subclass in the specialisation. For example, Lecturer, Secretary and Technician are some of the job types of an Employee. They describe disjoint subclasses of the entity Employee, forming a partial specialisation, as there could be an employee with another job type. Generalisation is a feature supported by many object-oriented (O-O) systems, but it has yet to be adopted by commercial relational DBMSs. However, with O-O features being incorporated into the relational model (e.g. SQL-3), we can expect to see this feature in many RDBMSs in the near future.

3.5.6 General Semantic Constraints

There are general semantic integrity constraints that do not fall into any of the above categories (e.g. the constraint that the salary of an employee must not be greater than the salary of the head of the department that the employee works for, or that the salary attribute of an employee can only be increased). These constraints can be either state or transition constraints, and are generally specified as explicit constraints. The transition constraint mentioned above is a single-step transition constraint. Here, a constraint is evaluated on a pair of pre-transaction and post-transaction states of a database; in the employee's salary example, the current salary is the pre-transaction state and the new salary becomes the post-transaction state. However, there are transition constraints that are not limited to a single step, e.g. temporal constraints specified using the temporal qualifiers always and sometimes [CHO92]. Other forms of constraint exist, including those defined for incomplete data (e.g. employees having similar jobs and experience must have almost equal salaries) [RAJ88]. These can also be considered as a type of semantic constraint, mainly because they are not implicitly supported by the most frequently used (i.e. relational) data model. They may need a special constraint specification language to support them.

3.6 Specifying Explicit Constraints

Explicit constraints are generally defined using either a procedural or a declarative approach.

3.6.1 Procedural Approach

In the procedural approach (or the coded constraints technique), constraint checking statements are coded into the appropriate update transactions of the application by the programmer. For example, to enforce the constraint that the salary of an employee must not be greater than the salary of the head of the department that the employee works for, one has to check every update transaction that may violate this constraint. This includes any transaction that modifies the salary of an employee, any transaction that links an employee to a department, and any transaction that assigns a new employee or manager to a department. Thus, in all such transactions appropriate code has to be included to check for possible violations of this constraint. When a violation occurs the transaction has to be aborted, and this too is done by including appropriate code in the application.

The above technique for handling explicit constraints is used by many existing DBMSs. This


technique is general because the checks are typically programmed in a general-purpose programming language, and it allows the programmer to code the checks in an effective way. However, it is not very flexible and places an extra burden on the programmer, who must know where the constraints may be violated and include checks at each and every place where a violation may occur. Hence, a misunderstanding, omission or error by the programmer may leave the database able to reach an inconsistent state.

Another drawback of specifying constraints procedurally is that constraints change with time as the rules of the real world situation change. If a constraint changes, it is the responsibility of the DBA to instruct the appropriate programmers to recode all the transactions that are affected by the change. This again opens up the possibility of overlooking some transactions, and hence the chance that errors in constraint representation will render the database inconsistent. Another source of error is that it is possible to include conflicting constraints in procedural specifications, which will cause the incorrect abortion of correct transactions. This error may occur in other types of constraint specification too, e.g. whenever we attempt to define multiple constraints on the same entity. However, such errors can be detected more easily in a declarative approach, as it is possible to evaluate constraints defined in that form to identify conflicts between them.

The procedural approach is usually adopted only when the DBMS cannot support the same constraint declaratively. In all early DBMSs the procedural code was part of the application code and was not retained in the database's system catalog. However, some recent DBMSs (e.g. INGRES) provide a rule subsystem where all defined procedures are stored in system catalogs and are fired using rules which detect the update transactions associated with a particular constraint. This approach is a step towards the declarative approach, as it overcomes some of the deficiencies of the procedural approach described above (e.g. the maintenance of constraints), although the code is still procedural, which, for example, prevents the detection of conflicting constraints.

Some DBMSs (e.g. INGRES) do not support the specification of foreign key constraints through their DDL. Hence, for these systems such constraints have to be explicitly defined using a procedural approach. In section 'a' of table 3.2, we show how a procedure is used in INGRES to implement a foreign key constraint. Here the existence of a value in the referenced table is checked and the statement is rejected if it does not exist. For comparison purposes, we include the definition of the same constraint using an SQL-3 DDL specification (implicit) in section 'b' of table 3.2, and using the declarative approach (explicit) in section 'c' of table 3.2. When comparing these three approaches, it is clear that the procedural one is the most unwieldy and the most error-prone. The DDL approach is the best and most efficient approach, as the constraint is specified and managed implicitly by the DBMS.


Table 3.2: Different Approaches to defining a Constraint

(a) Procedural Approach (Explicit):

    CREATE PROCEDURE Employee_FK_Dept (WorksFor CHAR(5)) AS
    DECLARE
        msg = VARCHAR(70) NOT NULL;
        check_value = INTEGER;
    BEGIN
        IF WorksFor IS NOT NULL THEN
            SELECT COUNT(*) INTO :check_value FROM Department
                WHERE DeptCode = :WorksFor;
            IF check_value = 0 THEN
                msg = 'Error 1: Invalid Department Code: ' + :WorksFor;
                RAISE ERROR 1 :msg;
                RETURN;
            ENDIF;
        ENDIF;
    END;

(b) DDL Approach (Implicit):

    CONSTRAINT Employee_FK_Dept
        FOREIGN KEY (WorksFor) REFERENCES Department (DeptCode);

    Note: This constraint is defined in the Employee table.

(c) Declarative Approach (Explicit):

    CREATE ASSERTION Employee_FK_Dept
        CHECK (NOT EXISTS (SELECT * FROM Employee
            WHERE WorksFor NOT IN (SELECT DeptCode FROM Department)));

    Note: The schema on which these constraints are defined is in figure 6.2.

3.6.2 Declarative Approach

A more formal technique for representing explicit constraints is to use a constraint specification language, usually based on some variation of relational calculus, to specify or declare all the explicit constraints. In this declarative approach there is a clean separation between the constraint base, in which the constraints are stored in a suitably encoded form, and the integrity control subsystem of the DBMS, which accesses the constraints to apply them to transactions. When using this technique, constraints are often called integrity assertions, or simply assertions, and the specification language is called an assertion specification language. The term assertion is used in place of explicit constraint, and the assertions are specified as declarative statements. These constraints are not attached to a specific table, as in the case of the implicit constraint types introduced in section 3.5. This approach is supported by SQL-3. For example, the constraint that a head of department's salary must be greater than that of his employees can be specified as:

    CREATE ASSERTION Salary_Constraint
        CHECK (NOT EXISTS (SELECT * FROM Employee E, Employee H, Department D
            WHERE E.WorksFor = D.DeptCode AND D.Head = H.EmpNo
              AND E.EmpNo <> D.Head AND E.Salary >= H.Salary));

Assertions can also be used to define implicit constraints, like an examination mark being between 0 and 100, or referential integrity constraints, as shown in section 'c' of table 3.2. However, it is easier and more efficient (i.e. it consumes less computer resources) to monitor and enforce implicit constraints using the DDL approach, as such constraints are attached to an entity and checked only when an update transaction is applied to that entity, as opposed to being checked whenever an update transaction is performed on the database in general.


In many cases it is convenient to specify the type of action to be taken when a constraint is violated or satisfied, rather than just having the options of aborting or silently performing the transaction. In such cases, it is useful to include the option of informing a responsible person of the need to take action, or notifying them of the occurrence of that transaction (e.g. with referential constraints it is sometimes necessary to perform an action like an update or delete on a table to amend its contents, instead of aborting the transaction). This facility is supported either by an optional trigger option on an existing DDL statement or by defining triggers [DAT93]. Triggers can combine the declarative and procedural approaches, as the action part can be procedural, while the condition part is always declarative (INGRES rules are a form of trigger). A trigger can be used to activate a chain of associated updates that will ensure database integrity (e.g. update the total quantity when new stock arrives or when stock is dispatched). An alerter, which is a variant of the trigger mechanism, is used to notify users of important events in the database. For example, we could send a message to the head of department calling to his attention a purchase transaction of £1,000 or more made from department funds. The triggers and alerters introduced in this section appear in INGRES and also in other DBMSs. These constructs provide further information about database contents, but are beyond the scope of this project; as they are related to constraints, they could be utilised in a similar fashion.
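As an illustration of the stock example above, the following is a minimal trigger sketch in the style of the draft SQL-3 standard (the StockDelivery and StockTotal tables are hypothetical, and the exact trigger syntax, particularly the REFERENCING clause and the action part, varies between products and standard revisions):

    -- Chain an associated update: keep a running total when new stock arrives.
    CREATE TRIGGER Stock_Arrival
        AFTER INSERT ON StockDelivery
        REFERENCING NEW AS n
        FOR EACH ROW
        UPDATE StockTotal
            SET Quantity = Quantity + n.Quantity
            WHERE ItemNo = n.ItemNo;

An alerter would take the same form, with the action part replaced by a statement that records or sends a notification rather than updating the data.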
3.7 SQL Approach to Implementing Constraints

In SQL-3, a constraint is either a domain constraint, a table constraint or a general constraint [ISO94]. It is described by a constraint descriptor, which is either a domain constraint descriptor (cf. sections 3.7.1 and A.11), a table descriptor (cf. sections 3.7.2 and A.4) or a general descriptor (cf. sections 3.7.3 and A.12). Every constraint descriptor includes: the name of the constraint, an indication of whether or not the constraint is deferrable, and an indication of whether the initial constraint mode is deferred or immediate (see section A.3). Constraint names are optional, in that system generated names can be assigned (except in the general constraint case, where a name must be given). A constraint's checking is either deferrable or non-deferrable, and this can be set using the constraint mode option (see section A.13). Most constraints are immediate, as the default constraint mode is immediate, and in these cases they are checked at the end of each SQL statement; deferred constraints are effectively checked at the end of the SQL transaction, i.e. when an SQL commit statement is executed. Whenever a constraint is detected as being violated, an exception condition is raised and the transaction concerned is terminated by an implicit SQL rollback statement to ensure the consistency of the database system.

3.7.1 SQL Domain Constraints

In SQL, domain constraints are specified by means of the CREATE DOMAIN statement (see section A.11) and can be added to or dropped from an existing domain by means of the ALTER DOMAIN statement [DAT93]. These constraints are associated with a specific domain and apply to every column that is defined using that domain. They allow users to define new data types, which in turn are used to define the structure of a table. For example, a domain Marks may be defined as shown in figure 3.5. This means SQL will recognise the data type Marks, which permits integers


between 0 and 100, thus giving a natural look to that data type when it is used.

Figure 3.5: An SQL domain constraint

    CREATE DOMAIN Marks INTEGER
        CONSTRAINT icmarks CHECK (VALUE BETWEEN 0 AND 100);

3.7.2 SQL Table Constraints

In SQL, table constraints (i.e. constraints on base tables) are initially specified by means of the CREATE TABLE statement (see section A.4) and can be added to or dropped from an existing base table by means of the ALTER TABLE statement [DAT93]. These constraints are associated with a specific table, as they cannot exist without a base table. However, this does not mean that such constraints cannot span multiple tables, as in the case of foreign keys. Constraints defined on specific columns of a base table are a type of table constraint, but are usually referred to as column constraints. Three types of table constraint are defined here, namely: candidate key constraints, foreign key constraints and check constraints. Their definitions may appear next to their respective column definitions or at the end (i.e. after all the column definitions).

We now describe an example that uses all three types of constraint, using figure 3.6. PRIMARY KEY (only one per table) (see section A.6) and UNIQUE (the value of the column, or column combination, must be unique across rows) (see section A.5) are used to define candidate keys. A FOREIGN KEY definition (see section A.8) defines a referential integrity constraint and may also include an action part (not used in figure 3.6, for simplicity). Check constraints are defined using a CHECK clause (see section A.9) and may contain any logical expression. The check constraint CHECK (Name IS NOT NULL) is usually defined using the shorthand form NOT NULL next to the column Name, as shown in figure 3.6. We have also included a check constraint spanning multiple tables in figure 3.6. Such table constraints can be included only after all the tables involved have been created, and hence in practice they are added using ALTER TABLE statements, as illustrated after figure 3.6.

Figure 3.6: SQL table constraints

    CREATE TABLE Employee(
        EmpNo    CHAR(5)  PRIMARY KEY,
        Name     CHAR(20) NOT NULL,
        Address  CHAR(80),
        Age      INTEGER  CHECK (Age BETWEEN 21 AND 65),
        WorksFor CHAR(5)  REFERENCES Department (DeptCode),
        Salary   DECIMAL,
        CHECK (Salary <= (SELECT H.Salary
                          FROM Employee H, Department D
                          WHERE D.Head = H.EmpNo
                            AND D.DeptCode = WorksFor)),
        UNIQUE (Name, Address) );
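In practice, as noted above, the multi-table check of figure 3.6 would be attached once both tables exist. A minimal sketch of how this might be done (the constraint name Emp_Salary_Check is chosen here purely for illustration, and the exact ALTER TABLE syntax accepted varies between products and standard levels):

    ALTER TABLE Employee
        ADD CONSTRAINT Emp_Salary_Check
        CHECK (Salary <= (SELECT H.Salary
                          FROM Employee H, Department D
                          WHERE D.Head = H.EmpNo
                            AND D.DeptCode = WorksFor));

    -- The constraint can later be removed by name:
    ALTER TABLE Employee DROP CONSTRAINT Emp_Salary_Check;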

3.7.3 SQL General Constraints

In SQL, general constraints are specified by means of the CREATE ASSERTION statement (see section A.12) and are dropped by means of the DROP ASSERTION statement. These constraints must be associated with a user defined constraint name, as they are not attached to a specific table


and are used to constrain arbitrary combinations of columns in arbitrary combinations of base tables in a database. The constraint defined in section 'c' of table 3.2 belongs to this type. Domain and table constraints are implicit constraints of a database, while the assertions used to define general constraints are explicit constraints (using a declarative approach). SQL data types have their own constraint checking, which rejects, for example, string values being entered into a numeric column. This type of constraint checking can be considered inherent, as it is supported by the SQL language itself.

All the integrity constraints discussed above are deterministic and independent of the application and system environments. Hence, no parameters, host variables or built-in system functions (e.g. CURRENT_DATE) are allowed in these definitions, as they could make the database inconsistent. For example, CURRENT_DATE gives different values on different days, which means the validity of a data entry might not hold throughout its lifetime even though no changes are made to the original entry. This makes the task of maintaining the consistency of the database more difficult, and also makes it difficult to distinguish these errors from the traditional errors discussed in section 3.5. Hence attributes such as age, which involve the use of CURRENT_DATE, should be derived attributes whose values are computed during retrieval.
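One way of realising such a derived attribute is through a view, so that the age is recomputed at retrieval time rather than stored and constrained. A minimal sketch, assuming a hypothetical BirthDate column on the Person table (the year-difference calculation is deliberately approximate, as it ignores whether the birthday has already passed this year):

    CREATE VIEW PersonWithAge AS
        SELECT P.*,
               EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM P.BirthDate) AS Age
        FROM Person P;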
3.8 SQL Standards

To conclude this chapter, we compare the different SQL standards to chronicle when the respective constraint specification statements were introduced into the language. A standard for the database language SQL was first introduced in 1986 [ANSI86], and this is now called SQL/86. The SQL/86 standard specified two levels, namely level 1 and level 2 (shown as levels 1 and 2 in table 3.3), where level 2 defined the complete SQL database language and level 1 was a subset of level 2. In 1989, the original standard was extended to include the integrity enhancement feature [ANSI89a]. This standard is called SQL/89 and is shown as level Int. in table 3.3. The current standard, SQL/92 [ANSI92], is also referred to as SQL-2. This standard defines three levels, namely Entry, Intermediate and Full SQL (shown as levels E, I and F in table 3.3), where Full SQL is the complete SQL database language, Intermediate SQL is a proper subset of Full SQL, and Entry SQL is a proper subset of Intermediate SQL. The purpose of introducing three levels was to enable database vendors who had incorporated the SQL/89 extensions into their original SQL/86 implementations to claim SQL/92 Entry level status. As the Intermediate extensions were more straightforward additions than the rest, they were separated from the Full SQL/92 extensions. SQL/92 is also constantly being reviewed [ISO94], mainly to incorporate O-O features into the language, and this latest release is called SQL-3 (shown as level O-O in table 3.3).

Until recently, relational DBMSs supported only the SQL/86 standard, and even now most support only up to the Entry level of SQL/92. Hence ISs developed using these relational DBMSs are not capable of supporting the modern features introduced in the newest standards. Our work focuses on providing these newer features for existing relational legacy database systems, so that features such as primary / foreign key specification can be supported for relational databases conforming to SQL/86 standards, and sub / super type features can be specified for all relational products.


Table 3.3: SQL Standards and introduction of constraints

    SQL Version         SQL/86        SQL/89    SQL/92            SQL-3
    Level               1      2      Int.      E     I     F     O-O
    Data Type           x      +      =         =     +     +     +
    Identifier length   x      +      =         =     +     =     =
    Not Null            x      +      =         =     =     =     =
    Unique Key          -      x      =         =     +     =     =
    Primary Key         -      -      x         =     +     =     =
    Foreign Key         -      -      x         =     =     +     +
    Check Constraint    -      -      x         =     =     +     =
    Default Value       -      -      x         =     =     =     =
    Domain              -      -      -         -     x     +     =
    Assertion           -      -      -         -     -     x     +
    Trigger             -      -      -         -     -     -     x
    Sub/SuperType       -      -      -         -     -     -     x

The integrity features discussed in previous sections were thus gradually introduced into the SQL language standards as we can see from table 3.3. In this table ‘x’ indicates when the feature was first introduced. The ‘+’ sign indicates that some enhancements were made to a previous version, the ‘=’ sign indicates that the same feature was used in a later version, and the ‘-’ sign indicates that the feature was not provided in that version. For example, the Primary Key constraint for a table was first introduced in SQL/89 (cf. appendix A.6) and later enhanced (i.e. in SQL/92 Intermediate) by merging it with the Unique constraint (cf. appendix A.5) to introduce a candidate key constraint (cf. appendix A.7). Thus, SQL standards for Primary Key are shown in table 3.3 as: ‘-’ for SQL/86; ‘x’ for SQL/89; ‘=’ for SQL/92 Entry level; ‘+’ for SQL/92 Intermediate level; and ‘=’ for all subsequent versions. Simple attributes are defined using their data type and length (cf. section 3.5.1). These specifications are used as inherent domain constraints. The first two rows of table 3.3 show that they were among the first constraint features introduced via SQL standards (i.e. SQL/86). The Not Null constraint, which is a special domain constraint, was also introduced in the initial SQL standard. The key constraints (cf. section 3.5.2), which specify unique and primary keys, were introduced in a subsequent standard (i.e. SQL/89) as shown in table 3.3. The referential constraint which specifies a type of a structural relationship constraint uses foreign keys, and this constraint was also introduced in SQL/89, along with default values for attributes and check constraints. Later, more complex forms of constraints were introduced in SQL/92. These include defining new domains for an attribute (e.g. child as a domain having an inherent constraint of age being less than 18 years), and specifying domain constraints spanning multiple tables (i.e. assertions). Constraints which are activated when an event occurs (i.e. triggers), and structural constraints on specialisation / generalisation (i.e. sub/super type, cf. section 3.5.5) are among other enhancements proposed in the draft SQL-3 standards. A detailed description of the syntax of statements used to provide the features identified in table 3.3 is given in appendix A. For further details we refer the reader to the standards themselves [ANSI86, ANSI89a, ANSI92, ISO94].


CHAPTER 4

Migration of Legacy Information Systems

This chapter addresses in detail the migration process and the issues concerning legacy information systems (ISs). We identify the characteristics and components of a typical legacy IS and present the expected features and functions of a migration target IS. An overview of some of the strategies and methods used for migrating a legacy IS to a target IS is presented, along with a detailed study of migration support tools. Finally, we introduce our tool to assist the enhancement and migration of relational legacy databases.

4.1 Introduction

Rapid technological advances in recent years have changed the standard computer hardware technology from centralised mainframes to network file-server and client/server architectures, and software data management technology from files and primitive database systems to powerful extended relational distributed DBMSs, 4GLs, CASE tools, non-procedural application generators and end-user computing facilities. Information systems (ISs) built using old-fashioned technology are inflexible, as that technology prevents them from being changed and evolving to meet changing business needs, which themselves adjust rapidly to the potential of technological advances. It also means they are expensive to maintain, as older systems suffer from failures, inappropriate functionality, lack of documentation, poor performance and problems in training support staff. Such older systems are called legacy ISs [BRO93, PHI94, BEN95, BRO95], and they need to be evolved and migrated to a modern technological environment so that their existence remains beneficial to their user community and organisation, and their full potential to the organisation can be realised.

4.2 Legacy Information Systems

Technological advances in hardware and software have improved the performance and maintainability of new information systems. The equipment and techniques used by older ISs are obsolete and prone to frequent breakdowns, along with ever increasing maintenance costs. In addition, older ISs may have other deficiencies depending on the type of system. Common characteristics of these systems are [BRO93, PHI94, BEN95, BRO95] that they:

• are scientifically old, as they were built using older technology,
• are written in a 3GL,
• use an old fashioned database service (e.g. a hierarchical DBMS),
• have a dated style of user interface (e.g. command driven).

Furthermore, in very large organisations additional negative characteristics may be present, making the intended migration process even more complex and difficult. These include [BRO93, AIK94, BEN95, BRO95]: being very large (e.g. having millions of lines of code); being mission critical (e.g. an on-line monitoring system like customer billing); and being operational all the time (i.e. 24 hours a day and 7 days a week). These characteristics are not present in most smaller scale legacy information systems, and hence the latter are less complex to maintain. Our work may not


assist all types of legacy IS, as it deals with only one particular component of a legacy IS (i.e. the database service).

Information systems consist of three major functional components, namely: interfaces, applications and a database service. In the context of a legacy IS these components are, accordingly, referred to as [BRO93, BRO95] the:

• legacy interface,
• legacy application,
• legacy database service.

These functional components are sometimes inter-related, depending on how they were designed and implemented during the IS's development. They may exist independently of each other, having no functional dependencies (i.e. all three components are decomposable - see section 'a' of figure 4.1); they may be semi-decomposable (e.g. the interface may be separate from the rest of the system - see section 'b' of figure 4.1); or they may be totally non-decomposable (i.e. the functional components are not discrete but are intertwined and used as a single unit - see section 'c' of figure 4.1). This variety makes the legacy IS environment complex to deal with.

Due to the potential complexity of entire legacy ISs, we have focussed on one particular functional component only, namely the legacy database service. In order to restrict our attention to the database service component, we have to treat the other components, namely the interface and application, as black boxes. This can be done successfully when a legacy IS has decomposable modules, as in section 'a' of figure 4.1. However, when the legacy database service is not fully decomposable from both the legacy interface and the legacy application, treating them as black boxes may result in loss of information, since relevant database code may also appear in other components. In such cases, attempts must be made by the designer to decompose or restructure the legacy code. The designer needs to investigate the legacy code of the interface and application modules to detect any database service code, and then move it to the database service module (e.g. legacy code used to specify and enforce integrity constraints in the interface or application components is identified and transferred to this module). This will assist in the conversion of legacy ISs of the types shown in sections 'b' and 'c' of figure 4.1 to conform to the structure of the IS type in section 'a' of figure 4.1. The identification and transfer of any legacy database code left in the legacy application or interface modules can be done at any stage (e.g. even after the initial migration), as the migration process can be repeated iteratively. Also, the existence of legacy database code in the application does not affect the operation of the IS, as we are not going to change any existing functionality during the migration process. Hence, treating a legacy interface or a legacy application containing legacy database code as a black box does not harm migration.


Figure 4.1: Functional Components of a Legacy IS (I = Interface, A = Application, D = Database Service): (a) fully decomposable, with I, A and D separate; (b) semi-decomposable, with I separate from a combined A/D; (c) non-decomposable, a single I/A/D unit.

4.2.1 Legacy Interfaces

Early information systems were developed for system users who were computer literate. These systems did not have system or user level interfaces, because such interfaces were not regarded as essential: the system users performed these tasks themselves. However, when the business community and others wanted to use these systems, the need for user interfaces was identified and they have been incorporated into more recent ISs.

The introduction of DBMSs in the late 1960s facilitated easy access to computer held data. However, with the early DBMSs the end user had no direct access to their database, and their interactions with the database were done through verbal communication with a skilled database programmer [ELM94]. All user requests were made via the programmer, who then coded the task as a batch program using a procedural language. Since the introduction of query languages such as SQL [CHA74, CHA76], QBE [ZLO77] and QUEL [HEL75], the provision of interfaces for database access has rapidly improved. These interfaces are provided not only to encourage laymen to use the system but also to hide technical details from users. A presentation layer consisting of forms [ZLO77] was the initial approach to supporting interaction with the end user. Modern user interfaces rely on multiple screen windows and iconic (pictorial) representations of entities manipulated by pull-down or pop-up menus and pointing devices such as mice [SHN86, NYE93, HEL94, QUE93]. The current trend is towards using separate interfaces for all decomposable application modules of an IS. Some Graphical User Interface (GUI) development tools (e.g. Vision for graphics and user interfaces [MEY94]) allow the construction of advanced GUIs supporting portability to various toolkits. This is an important step towards building the next generation of ISs. Changes in the user interface and operating environment result in the need for user training, an additional factor in system evolution costs.

As defined by Brodie [BRO93, BRO95], we shall use the term legacy interface in the context of all ISs whose applications have no interfaces or use old fashioned user / system interfaces. In figures 4.1a and 4.1b, these interfaces are distinct and separable from the rest of the system as they are decomposable modules. However, interfaces can be non-decomposable, as shown in figure 4.1c. Migration issues concerning user interfaces have been addressed in [BRO93, BRO95, MER95], and, as mentioned in section 4.2, our work does not address the problems associated with such interface migration.

4.2.2 Legacy Applications


Originally, ISs were written using 3GLs, usually the COBOL or PL/1 programming languages. These languages had many software engineering deficiencies due to the state of the technology at the time of their development. Techniques such as structured programming, modularity, flexibility, reusability, portability, extensibility and tailorability [YOU79, SOM89, BOO94, ELM94] were not readily available until subsequent advances in software engineering occurred. The lack of these has made 3GL based ISs appear inflexible and, hence, difficult and expensive to maintain and evolve to meet changing business needs. The unstructured and non-modular nature of 3GLs resulted in the use of non-decomposable application modules13 in the development of early ISs. However, with the introduction of software engineering techniques such as structured modular programming these 3GLs were enhanced, and new languages, such as Pascal [WIR71], Simula [BIR73], and more recently C++ [STR91] and Eiffel [MEY92], gradually emerged to support these changing software engineering requirements.

The emergence of query languages in the 1970s as standard interfaces to databases saw the development and use of query languages embedded in host programming languages for large software application development. Embedded QUEL for INGRES [RTI90a] and Embedded SQL for many relational DBMSs [ANSI89b] are examples of this genre. The emergence of 4GLs is an evolution which allows users to give a high-level specification of an application expressed entirely in 4GL code; a tool then automatically generates the corresponding application code from the 4GL code. For example, in the INGRES Application-by-Forms interface [RTI90b], the application designer develops a system by using forms displayed on the screen, instead of writing a program. Similar software products are offered by other vendors, such as Oracle [KRO93].

Information systems developed recently have partially or totally adopted modern software engineering practices. As a result, decomposable modules exist in some recent ISs, i.e. their architecture is as in section 'a' of figure 4.1. Applications which do not use the concept of modularity are non-decomposable (e.g. section 'c' of figure 4.1), while those partially using it are semi-decomposable (section 'b' of figure 4.1). Semi- and non-decomposable applications are referred to as legacy applications and need to be converted into fully-decomposable systems to simplify maintenance and make it easier for them to evolve and support future business needs. Some aspects of legacy application migration need tools to analyse code. These are discussed in [BIG94, NIN94, BRA95, WON95]. They are beyond the scope of this thesis, except insofar as we are interested in any legacy code that is relevant to the provision of a modern database service (e.g. integrity constraints).

4.2.3 Legacy Database Service

Originally, many ISs were developed on centralised mainframes using COBOL and PL/1 based file systems [FRY76, HAN85]. These ISs had no DBMS and their data was managed by the system using separate files and programs for each file handling task [HAN85]. Later ISs were based

13 often containing calculated or parameter-driven GOTO statements, preventing a reasonable analysis of their structure.


on database technology with limited DBMS capabilities. These systems typically used hierarchical or network DBMSs for their data management [ELM94, DAT95], such as IMS [MCG77] and IDMS [DAT86, ELM94]. The introduction and rapid acceptance of the relational model for DBMSs in recent years has resulted in most applications now being developed with original relational DBMSs (e.g. System R [AST76], DB2 [HAD84], SQL/DS [DAT88], INGRES [STO76, DAT87]). The steady evolution of the relational data model has resulted in the emergence of extended relational DBMSs (e.g. POSTGRES [STO91]) and newer versions of existing products (e.g. Oracle [ROL92], INGRES [DAT87] and SYBASE [DAT90]) which have been used for most recent database applications. The relational data model has been widely accepted as the dominant current generation standard for supporting ISs.

The rapidly expanding horizon of applications means that DBMSs are now expected to cater for diverse data handling needs, such as the management of image, spatial, statistical or temporal databases [ELM94, DAT95], and it is in their support of these that they are often weak. This highlights the different ranges of functionality supported by various DBMSs. Applications using older database services thus provide modern database functionality by means of application modules. This is a typical characteristic of a legacy IS. Hence, the structure of such ISs is more complex and is poorly understood, as it is not adequately engineered in accordance with current technology.

The database services offered by most hierarchical, network and original relational DBMSs are now considered to be primitive, as they fail to support many functions (e.g. backup, recovery, transaction support, increased data independence, security, performance improvements and views [DAT77, DAT81, DAT86, DAT90, ELM94]) found in modern DBMSs. These functions facilitate the system maintenance of databases developed using modern DBMSs. Hence, the database services provided by early DBMSs and file based systems are now referred to as legacy database services, since they do not fulfil many current requirements and expectations of such services. The specifications of a database service are described by a database schema, which is held in the database using data dictionaries. Analysis of the contents of these data dictionaries will provide information that is useful in constructing a conceptual model for a legacy system. Our approach focuses on using the data dictionaries of a relational legacy database to extract as much information as possible about the specifications of that database.
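To give a flavour of this, the dictionary of an Oracle style relational DBMS can be queried directly with SQL to recover table, column and constraint definitions. The following is a minimal sketch using Oracle's USER_TAB_COLUMNS and USER_CONSTRAINTS dictionary views (other systems considered in this thesis, such as INGRES and POSTGRES, hold equivalent information under differently named catalogs, so the exact queries differ):

    -- Columns and data types of a legacy table
    SELECT column_name, data_type, data_length, nullable
    FROM   user_tab_columns
    WHERE  table_name = 'EMPLOYEE';

    -- Declared constraints (P = primary key, R = referential, C = check)
    SELECT constraint_name, constraint_type, search_condition
    FROM   user_constraints
    WHERE  table_name = 'EMPLOYEE';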
This may be due to non-existent, out of date or lost documentary materials. The extent of this deficiency was only


realised recently when people tried to transform ISs. To address this problem in the future, CASE tools are being developed to automatically produce suitable documentation for current ISs developed using them [COMP90]. However, this solution does not apply to legacy ISs as they were not built using such tools and it is impossible to use these tools on the legacy systems. Thus we must solve this problem in another way.

Sometimes, certain critical functions of an IS are written for high performance, often using a specific, machine-dependent set of instructions on some obsolete computer. This results in the use of mysterious and complex machine code constructs which need to be deciphered to enable the code to be ported to other computer systems. Such code is usually not convertible using generalised translation tools. In general, the performance of legacy ISs is poor as most of their functions are not optimised. This is inevitable, due to the state of the technology at the time of their original development. Thus problems arise when we try to translate 3GL code into equivalent 4GL code in a straightforward manner.

Solving the problems identified above is the overall concern when assisting the migration and evolution of legacy ISs. However, our aim is to address only some of the problems concerning legacy ISs, as the complete task is beyond the scope of our project.

4.2.5 Legacy Information System Architecture

Having considered the characteristics of the components of legacy ISs, we can conclude that a typical IS consists of many application modules, which may or may not use an interface for user / system interactions, and may use a legacy database service to manage legacy data. This database service may use a DBMS to manage its database. Hence, in general, the architecture of most legacy ISs is not strictly decomposable, semi-decomposable or non-decomposable, as they may have evolved several times during their lifetime. As a result, parts of the system may belong to any of the three categories shown in figure 4.2. This means that the general architecture of a legacy IS is a hybrid one, as defined in [BRO93, BRO95, KAR95]. Figure 4.2 suggests that some interfaces and application modules are inseparable from the legacy database service while others are modular and independent of each other. This legacy IS architecture highlights the database service component, as our interactions are with this layer to extract the legacy database schema and any other database services required.

4.3 Target Information System

A legacy IS can be migrated to a target IS with an associated computing environment. This target environment is intended to take maximum advantage of the benefits of rightsized computers, client/server network architecture, and modern software including relational DBMSs, 4GLs and CASE tools. In this section we present the hardware and software environments needed for the target ISs.

4.3.1 Hardware Environment

The target environment must be equipped with modern technology supporting current


business needs which should be flexible enough to evolve and fulfil future requirements. The fundamental goal of a legacy IS migration is that the target IS must not itself become a legacy IS in the near future. Thus, the target hardware environment needs to be flexibly networked (e.g. client-server architecture) to support a distributed multi-user community. This type of environment includes a desk top computer for each target IS user with an appropriate server machine(s) controlling and resourcing the network provision. A PC (e.g. IBM PC or compatible) or a workstation (e.g. Sun SPARC) may be used as the desk top computer (i.e. client / local machine), each being connected using a local area network (LAN) (e.g. Ethernet) to the server(s).

Figure 4.2 : Legacy IS Architecture
(Diagram: a hybrid legacy IS whose non-decomposable, semi-decomposable and decomposable portions - interface modules I1..In, application modules A1..Am and decomposed application modules Mm+1..Mn - all sit on a common legacy database service holding the legacy databases and data. Legend: I - interface module; A - application module with legacy database services; M - decomposed application module.)

4.3.2 Software Environment

The target database software is typically based on a relational DBMS with a 4GL, SQL, report writers and CASE tools (e.g. Oracle v7 with Oracle CASE). Use of such software provides many benefits to its users, such as an increase in program / data independence, introduction of modularised software components, graphical user interfaces, reduction in code, ease of maintenance, support for future evolution and integration of heterogeneous systems. The target database can be centralised on a single server machine or distributed over multiple servers in a networked environment. The target system may consist of application modules representing the decomposed system components, each having its corresponding graphical user interface (see figure 4.3). A typical architecture for a modern IS consists of layers for each of the system functions (e.g. interface, application, database, network) as identified in [BRO93, BRO95]. In figure 4.3 we introduce such an architecture with special emphasis on the database service, which will be a modern DBMS.


Figure 4.3 : Target IS Architecture
(Diagram: decomposed application modules M1..Mn, each with its own graphical user interface GUI1..GUIn, accessing the target databases through a target DBMS. Legend: GUI - graphical user interface module; M - decomposed application module.)

The complete migration process involves significant changes, not only in the hardware and software of the applications but also in the skills required by users and management, who will have to be trained or replaced to operate the target IS. These changes must be done in some organised manner as the complete migration process itself is complex, and may take months or even years depending on the size and complexity of the legacy IS. The number of persons involved in the migration process and the resources available also contribute towards determining the ultimate duration and cost of the migration.

4.4 Migration Strategies

The migration process for legacy ISs may take one of two main forms [BRO93, BRO95]. The first approach involves rewriting a legacy IS from scratch to produce the target IS using modern software techniques (i.e. a complete migration). The other approach involves gradually migrating the legacy IS in small steps until the desired long term objective is reached (i.e. incremental migration). The approach of complete rewriting carries substantial risks and has failed many times in large organisations, as it is not an easy task to perform, especially when dealing with systems that must remain operational throughout the process, or with large ISs [BRO93, BEN95, BRO95]. If the incremental approach fails, by contrast, only the failed step must be repeated rather than the entire migration being redone. Hence, it is argued [BRO95] that the latter approach involves a lower risk and is more appropriate in most situations. These approaches are further described in the next two subsections. Our work is directed towards assisting the incremental migration approach.

4.4.1 Complete Migration

The process of complete migration involves rewriting a legacy IS from scratch to produce the intended target IS. This approach carries a substantial risk. We discuss some of the reasons for this risk to explain why we do not consider this approach suitable. These are:

a) A better system is expected.


A one-to-one rewrite of a complex IS is nearly impossible to undertake, as additional functions not present in the original system are expected to be provided by the target IS. Besides, it is a significant problem to evolve a developing replacement IS in step with an evolving legacy IS and to incorporate ongoing changes in business requirements in both. Changes to ISs, and requests to evolve them, may occur at any time, without warning, and hence it is difficult to incorporate minor or major ad hoc changes into the new system as they may not fit into its design. Also, an attempt to change this design may violate its original goals and contribute towards a never ending cycle of development changes.

b) Specifications rarely exist for the current system.

Documentation for the current system is often non-existent and typically available only in the form of the code14 itself. Due to the evolution of the IS many undocumented dependencies will have been added to the system without the knowledge of the legacy IS owners (i.e. uncontrolled enhancements have occurred). These additions and dependencies must be identified and accommodated when rewriting from scratch. This adds to the complexity of a complete rewriting process, and raises the risk of unpredicted failure of dependent ISs when we rewrite a legacy system, as they rely on undocumented features of that system.

c) The information system is too critical to the organisation.

Many legacy ISs must be operational almost all the time and cannot be dormant during evolution. This means that migrating live data from a legacy IS to the target IS may take more time than the business can afford to be without its mission critical information. Such situations often prohibit complete rewriting altogether and make this approach a non-starter. It also means that a carefully thought out staged migration plan must be followed in this situation.

d) Management of large projects is hard.

The management of large projects involves managing more and more people. This normally results in less and less productive work because of the effort required to manage organisational complexity. As a result the timing of most large projects is seriously under-estimated. Frequently, this results in partial or complete abortion of the project, as the inability to keep up with the original targets due to time lost is not always tolerated by an impatient company management.

4.4.2 Incremental Migration

An incremental migration process involves a series of steps, each requiring a relatively small resource allocation (e.g. a few person weeks or months in the case of small or medium scale systems), and a short time to produce a specific small result towards the desired goal. This is in sharp contrast to the complete rewrite approach which may involve a large resource allocation (e.g. several person months or years), and a development project spanning several years to produce a single massive result. To perform a migration involving a series of steps, it is important to identify

14 This code is sometimes provided only in the form of executable code, as ISs are often in-house developments.


independent increments (i.e. portions of the legacy interfaces, applications and databases that can be migrated independently of each other), and sequence them to achieve the desired goal. However, as legacy ISs have a wide spectrum of forms from well-structured to unstructured, this process is complex and usually has to deal with unavoidable problems due to dependencies between migration steps. The following are the most important steps to apply in an incremental migration approach: a) Iteratively migrate the computing environment. The target environment must be selected, tested and established based on the total target IS requirements. To determine the target IS requirements, it may be necessary to partially or totally analyse and decompose the legacy IS. The installation of the target environment typically involves installing a desk top computer for each target IS user and appropriate server machines, as identified in section 4.3.1. The process of replacing dumb terminals with a PC or a workstation and connecting them with a local area network can be done incrementally. This process allows the development of the application modules and GUIs on an inexpensive local machine by downloading the relevant code from a server machine, rather than by working on the server itself to develop this software. Software and hardware changes are gradually introduced in this approach along with the necessary user and management training. Hence, although we explicitly refer to a particular process there are many processes that take place simultaneously. This is due to the involvement of many people in the overall migration activity, with each person contributing towards the desired migration goal in a controlled and measurable way. Our work is concerned with iteratively migrating part of the legacy software (i.e. the database service) and not the computing environment. Therefore we worked on a preinstalled target software and hardware environment. b) Iteratively analyse the legacy information system. The purpose of this process is to understand the legacy IS in detail so that ultimately the system can be modified to consist of decomposable modules. Any existing documentation, along with the system code are used for this analysis. Knowledge and experience from people who support and manage the legacy IS is also used to document the existing and the target IS requirements. This knowledge has played a key role in other migration projects [DED95]. Some existing code analysis tools such as Reasoning Systems' Software Refinery and Bachman Information Systems' Product Set [COMP90, CLA92, BRO93, MAR94] can be used to assist in this process. It may be useful to conduct experiments to examine the current system using its known interfaces and available tools (e.g. CASE tools), so that the information gathered with one tool can be reused by other tools. Here, functions and the data content of the current system are analysed to extract as much information as possible about the legacy IS. Other available information for this type of analysis includes: documentation, discussions with users, dumps (system backups), the history of system operation and the services it provides. We do not perform any code analysis as part of our work. However, the analysis we do by automated conceptual modelling identifies the usage of the data structures of the legacy IS. Our


analysis assists in identifying the structural components of the legacy IS and their functional dependencies. This information may then be used to restructure the legacy code. c) Iteratively decompose the legacy information system structure. The objective of this process is to construct well-defined interfaces between the modules and the database services of the legacy IS. The process may involve restructuring the legacy IS and removing dependencies between modules. It will thereby simplify the migration, that otherwise must support all these dependencies. This step may be too complex in the worst case, when the legacy IS will have to remain in its original form. Such a situation will complicate the migration process and may result in increased cost, reduced performance and additional risk. However, in such cases an attempt to perform even limited restructuring could facilitate the migration, and is preferable to totally avoiding the decomposition step altogether. We investigate supporting some structural changes in order to improve the existing structures of the legacy database (e.g. introduction of inheritance to represent hierarchical structures and specification of relationship structures). These changes eventually affect the application modules and the interfaces of the legacy IS. Hence there is a direct dependency with respect to decomposing the legacy database service and an indirect dependency with respect to decomposing the other components of the legacy IS. The actual testing of this indirect dependency was not considered due to its involvement with the application module. However, the ability to define referential integrity constraints and assertions spanning multiple tables allows us to redefine functional dependencies in the form of constraints or rules. When these constraints are stored in the database, it is possible to remove such dependencies from the legacy applications. This assists decomposition of some functional components of a legacy IS. d) Iteratively migrate the identified portions. An identified portion of the legacy IS may be an interface, application or a database service. These components are individually migrated to the target environment. When this is done the migrated portion will then run in the target environment with the remaining parts of the legacy system continuing to operate. A gateway is used to encapsulate system components undergoing changes. The objective of this gateway is to hide the ongoing changes in the application and the database service from the system users. Obviously any change made to the appearance of any user interface components will be noticeable, along with any significant performance improvements in application modules processing large volumes of data. Our work is applicable only to a legacy database service and hence any incremental migration of interfaces or application modules is not considered at this stage. The complete migration of legacy data takes a significant amount of time from hours to days depending on the volume of data held. During this process no updates or additions can be made to the data as they cause inconsistency problems. This means all functions of the database application have to be stopped to perform a complete data migration in one go. For large organisations this type of action is not appropriate. Hence iterative migration of selected data portions is desirable. 
To ensure a successful migration, each chosen portion needs to be validated for consistency and guarded against being rejected by the target database. When migrating data in stages it is necessary to be aware of


the two sources of data, as queries involving a migrated portion need to be re-directed to the target system while other queries must continue to access the legacy database. This process may cause a delay when the response to a query involves both the legacy and target databases. Hence it is important to minimise this delay by choosing independent portions wherever possible for the migration process.

4.5 Migration Method

A formal approach to migrating legacy ISs has been proposed by Brodie [BRO93, BRO95], based on his experience in the field of telecommunications and other related projects. The methods proposed, referred to as forward, backward/reverse and general (a combination of forward and backward) migration, are based on his “chicken little” incremental migration process. A forward migration incrementally analyses the legacy IS and then decomposes its structure, while incrementally designing the target IS or installing the target environment. In this approach the database is migrated prior to the other IS components and hence unchanged legacy applications are migrated forward onto a modern DBMS before they are migrated to new target applications. Conversely, backward migration creates the target applications and allows them to access the legacy data, as the database migration is postponed to the end. General migration is more complex as it uses a combination of both these methods based on the characteristics of the legacy applications and databases. However, this is more suitable for most ISs as the approach can be tailored at will. The incremental migration process consists of a number of migration steps that together achieve the desired migration. Each step is responsible for a specific aspect of the migration (i.e. computer environment, legacy application, legacy data, system and user interfaces). The selection and ordering of each aspect of the migration may differ as it depends on the application, as well as the money and effort allocated for each process. Independent components can be migrated sequentially or in parallel. As we see here, the migration methods of Brodie deal with all components of a legacy IS. Our interest in this project is to focus on a particular component, namely the database service, and as a result a detailed review of Brodie's migration methods is not relevant here. However, our approach has taken account of his forward migration approach as it first deals with the migration of the legacy database service and then allows the legacy applications to access the post-migrated data management environment through a forward database gateway.

4.6 Migration Support Tools

There is no tool that is capable of performing a complete automatic migration for a given IS. However, there are tools that can assist at various stages of this process. Hence, categorising tools by their functions according to the stages of a migration process can help in identifying and selecting those most appropriate. There are three main types of tools, namely: gateways, analysers and migration tools, which can be of use at different stages of migration [BRO95]. For large migration projects, testing and configuration management tools are also of use.

a) Gateways


The function of a gateway is to coordinate the different components of an IS and hide ongoing changes (i.e. to interfaces, data, applications and other system components being migrated) from users. One of the main functions of these tools is to intercept calls on an application or database service and direct them to the appropriate part of the legacy or target IS. To incrementally migrate a legacy IS to a target IS, we need to select independent manageable portions, replicate them in the target environment and give control to the new target modules while the legacy system is still operational. To perform such a transition in a fashion transparent to users, we need a software module (i.e. a gateway) which encapsulates system components that are undergoing change behind an unchanged interface. Such a software module manages information flow between two different environments, the legacy and target environments. Functions such as retrieval, processing, management and representation of data from various systems are expected from gateways. These expectations of a gateway managing a migration process are similar to those we have of DBMSs for managing data. DBMSs were designed to provide general purpose data management and, similarly, the gateway needs to manage the migration process in a generalised form. Development of such a gateway is beyond the scope of this project as it may take several man years of effort. Hence our work will focus on some selected functionalities of a gateway, which are sufficient to produce a realistic prototype. We aim to provide a simplified form of the functionality of a gateway, which permits the evolution of an existing IS at the logical level, by creating a target database and managing an incremental migration of the existing database service in a way transparent to its users. This facility should be provided not only for centralised database systems, but also for heterogeneous distributed databases. This means our gateway functionality should support databases built using different types of DBMS. We expect some of this functionality to be incorporated in future DBMSs as part of their system functionality. Functions such as schema evolution, management of heterogeneous distributed databases and schema integration are expected capabilities of modern database products.

b) Analysers
In this case, Oracle failed to successfully convert date functions used to check the constraints of its version 6 databases to the equivalent coding in version 7 (Note: Oracle version 6’s use of non-standard date functions was the cause of this problem).
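The kind of recoding left to the administrator can be illustrated with a small, hypothetical sketch (the table, column and message below are invented and are not taken from our test databases). Oracle version 7 does not allow a CHECK constraint to call SYSDATE, so a version 6 style date rule of that form has to be re-expressed by hand, for instance as a trigger:

    -- A version 6 style rule such as
    --   CHECK (regYear <= TO_NUMBER(TO_CHAR(SYSDATE, 'YYYY')))
    -- is rejected by Oracle 7, whose CHECK constraints may not call SYSDATE.
    -- One possible manual recoding is an equivalent trigger:
    CREATE OR REPLACE TRIGGER student_regyear_check
    BEFORE INSERT OR UPDATE OF regYear ON Student
    FOR EACH ROW
    BEGIN
      IF :NEW.regYear > TO_NUMBER(TO_CHAR(SYSDATE, 'YYYY')) THEN
        RAISE_APPLICATION_ERROR(-20001, 'regYear cannot be later than the current year');
      END IF;
    END;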


c) Migration tools

These tools are responsible for creating the components of the target IS, including interfaces, data, data definitions and applications.

d) Testing

An important task is to ensure that the migrated target IS performs in the same way as its legacy original, with possibly better performance. For this task we need test beds that exercise as much of the system's logic as possible using as little data as possible. There are tools that allow for easy manipulation of testing functions like break points and data values. However, they do not help with the generation of test beds or validation of the adequacy of the testing process. Comparing the results generated by the two systems will help to achieve a reasonable level of testing. This may not be sufficient to test new features such as the introduction of distributed computing functionality to our systems. It is up to the person involved to ensure that a reasonable amount of testing has been done to confirm the functionality and the accuracy of the new IS.

e) Configuration management

This type of tool is needed for large migration projects involving many people, to coordinate functions such as documentation, synchronisation, keeping track of changes made (auditing), management of revisions to system elements (version control), and automatic building of a particular version of a system component.

Our work focuses on bringing these tools together into a single environment. We wish to analyse a legacy database service, hence the functions of reverse and forward engineering are of particular interest. We integrate these functions with some forward gateway and migration functions as they are the relevant components for us to address the enhancement and migration of a database service. Thus, we are not interested in all the features associated with migration support tools. The classification of reverse and re-engineering tools given in [SHA93, BRO95] provides a broad description of the functions of existing CASE tools. These include maintaining, understanding and enhancing existing systems; converting / migrating existing systems to new technology; and building new replacement systems. There are many tools which perform some of these functions. However, none of them is capable of performing all of the above functions in an integrated manner. This is one of the important requirements for future CASE tools. As it is practically impossible to produce a single tool to perform all these tasks, the way to overcome this deficiency is to provide a gateway that permits multiple tools to exchange information and hence provide the required integrated facility. The need to integrate different software components (i.e. database, spreadsheet, word processing, e-mail and graphic applications) has resulted in the production of some integration tools, such as DEC's Link Works and Dictionary Software's InCASE [MAY94]. However, what we need is to integrate data extraction and downloading tools with schema enhancement and evolution functions as they are together vital in the context of enhancing and migrating legacy databases.


Support for interoperability among various DBMSs and the ability to re-engineer a DBMS are important functions for a successful migration process. Of these two, the former has not been given any attention until very recently, and there has been some progress relating to the latter in the form of CASE tools. However, among the many CASE tools available only a handful support the re-engineering process. The reason for this is that most CASE tools focus on forward-engineering. In this situation, new or replacement software systems are being designed and appropriate code generated. The re-engineering process is a combination of both forward-engineering and reverse-engineering. The reverse-engineering process analyses the data structures of the databases of existing systems to produce a higher level representation of these systems. This higher level representation is usually in diagrammatic form and may be an entity-relationship diagram, data-flow diagram, cross-reference diagram or structure chart. We came across some tools that are commercially available for performing various tasks of the migration process. These include data access and / or extraction tools for Oracle [BAR90, HOL93, KRO93] and INGRES [RTI92] - two of our test DBMSs. Some other tools, mainly those capable of performing the initial reverse engineering task, are also identified here. These tools are not suitable for legacy ISs in general, as they fail to support a variety of DBMSs or the re-engineering of most pre-existing databases. Among the different tools available, tools such as gateways play a more significant role than others. When different database products are used in an organisation, there may be a need to use multiple tools for a single step of a migration process, conversely some tools may be of use for multiple steps. The process of using multiple tools for a migration is complex and difficult as most vendors have not yet addressed the need for tool interoperability. The survey carried out in [COMP90] identifies many reverse-engineering products. Among the 40 vendor products listed there, only three claimed to be able to reverse engineer Oracle, INGRES or POSTGRES databases (our test databases - see section 6.1) or any SQL based database products. These three products were: Deft CASE System, Ultrix Programming Environment and Foundation Vista. Of these products only Deft and Vista produced E-R diagrams. None of the products in the complete list supported POSTGRES, which was then a research prototype. Of the two products identified above, only Deft was able to read both Oracle and INGRES databases, while Vista could read only INGRES databases. This analysis indicated that interoperability among our preferred databases was rare and that it is not easy to find a tool that will perform the re-engineering process and support interoperability among existing DBMSs. Although the information published in [COMP90] may be now outdated, the literature published since then [SHA93, MAY94, SCH95] does not show that modern CASE tools have addressed the task of re-engineering existing ISs along with interoperability, both of which are essential for a successful migration process. However, the functionality of accessing and sharing information from various DBMSs via gateways like ODBC is a step towards achieving this task. One of the reasons for progress limitation is the inability to customise existing tools, which in turn prevents them being used in an integrated environment. 
This is confirmed to some extent by the failure of the leading Unix database vendor - Oracle - to provide such tools. Brodie and Stonebraker, in their book [BRO95], present a study of the migration of large legacy systems. It identifies an approach (chicken-little) and the commercial tools required to


support this approach for legacy ISs in general. In this project we have developed a set of tools to support an alternative approach for migrating legacy database services in particular. Thus Brodie and Stonebraker take account of the need to migrate the application processes with a database, using commercial tools, while in this thesis we concentrate on the development of integrated tools for enhancing and migrating a database service.

4.7 The Migration Process

Having identified the migration strategies and methods applicable to our work, we can review our migration process. This process must start with a legacy IS as in figure 4.2 and end with a target IS as shown in figure 4.3. However, as we are not addressing the application and interface components of a legacy IS, their conversion is not part of this project. Our conceptual constraint visualisation and enhancement system (CCVES), described in section 2.2, was designed to assist in preparing legacy databases for such a migration. Hence our migration process can be performed by connecting the legacy and target ISs using CCVES. This is shown in figure 4.4. The three essential steps performed by CCVES before the actual migration process occurs are shown using the paths highlighted in this figure as A, B and C, respectively. These are the same paths that were described in section 2.2. The identification of all legacy databases used by an application is made prior to the commencement of path A of figure 4.4. The reverse engineering process is then performed on any selected database. This process commences when the database schema and its initial constraints are extracted from the selected database and is completed when the database schema is graphically displayed in a chosen format. Any missing or new information is supplied via path B in the form of enhanced constraints, to allow further relationships and constraints to appear in the conceptual model. The constraint enforcement process of path C is responsible for issuing queries and applying these constraints to the legacy data, and taking the necessary actions whenever a violation is detected, before any migration occurs. This ensures that the legacy data is consistent with its enhanced constraints before migration. Once these steps are completed, a graceful transparent migration process can be undertaken via path D. Our work focuses only on evolving and migrating database services, hence path X, representing the application migration, is not done via CCVES. The evolution of database services includes increasing IS program / data independence by identifying legacy application services which are concerned with data management functions, like enforcement of referential constraints, integrity constraints, rules, triggers, etc., and transferring them from the application to the database service. Our migration process performs the transformation of the legacy database to the target environment and passes responsibility for enforcing the newly identified constraints to this system. Figure 4.4 indicates that our approach commences with a reverse engineering process. This is followed by a knowledge augmentation process which is itself a function of a forward engineering process. These two stages together are referred to as re-engineering (see section 5.1). The constraint enforcement process is the next stage of our approach. This is associated with the enhanced constraints of the previous stage as it is necessary to validate the existing and enhanced constraint specifications against the data held.
These three preparatory stages are described in chapter 5. The final stage of our approach is the database migration process. This is described later after we have


fully discussed the application of the earlier stages in relation to our test databases.
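As a rough illustration of paths C and D, the sketch below shows the style of SQL that could be issued at these two stages: first a query that reports legacy rows violating an enhanced referential constraint (path C), then the copying of one validated, independent portion of the data to the target database (path D). The table names, column names, the selection condition and the database link named target are all hypothetical.

    -- Path C: report legacy rows violating an enhanced referential constraint
    -- (every non-null Student.tutor should reference an existing Employee)
    SELECT s.collegeNo, s.tutor
      FROM Student s
     WHERE s.tutor IS NOT NULL
       AND NOT EXISTS (SELECT 1 FROM Employee e WHERE e.empNo = s.tutor);

    -- Path D: once a portion is known to be consistent, copy it to the
    -- target database (here through a database link named target)
    INSERT INTO Student@target
      SELECT * FROM Student
       WHERE department = 'COMP';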


CHAPTER 5

Re-engineering Relational Legacy Databases

This chapter addresses the re-engineering process and issues concerned with relational legacy DBMSs. Initially, the reverse-engineering process for relational databases is overviewed. Next, we introduce our re-engineering approach, highlighting its important stages and the role of constraints in performing these stages. We then present how existing legacy databases can be enhanced with modern concepts and introduce our knowledge representation techniques which allow the enhanced knowledge to be held in the legacy database. Finally, we describe the optional constraint enforcement process which allows validation of existing and enhanced constraint specifications against the data held.

5.1 Re-engineering Relational Databases

Software such as programming code and databases is re-engineered for a number of reasons: for example, to allow reuse of past development efforts, reduce maintenance expense and improve software flexibility [PRE94]. This re-engineering process consists of two stages, namely: a reverse-engineering and a forward-engineering process. In database migration the reverse-engineering process may be applied to help migrate databases between different vendor implementations of a particular database paradigm (e.g. from INGRES to Oracle), between different versions of a particular DBMS (e.g. Oracle version 3 to Oracle version 7) and between database types (e.g. hierarchical to modern relational database systems). The forward-engineering process, which is the second stage of re-engineering, is performed on the conceptual model derived from the original reverse-engineering process. At this stage, the objective is to redesign and / or enhance an existing database system with missing and / or new information. The application of reverse-engineering to relational databases has been widely described and applied [DUM81, NAV87, DAV87, JOH89, MAR90, CHI94, PRE94, WIK95b]. The latest approaches have been extended to construct a higher level of abstraction than the original E-R model. This includes the representation of object-oriented concepts such as generalisation / specialisation hierarchies in a reverse-engineered conceptual model. Due to parallel work in this area in recent years, there are some similarities and differences between our reverse-engineering approach [WIK95b] and other recent approaches [CHI94, PRE94]. We shall refer to them in the next sub-sections. The techniques used in the reverse-engineering process consist of the common steps identified below (the first two are illustrated by the dictionary queries sketched after the list):

• Identify the database's contents, such as relations and attributes of relations.
• Determine keys, e.g. primary keys, candidate keys and foreign keys.
• Determine entity and relationship types.
• Construct suitable data abstractions, such as generalisation and aggregation structures.
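As a minimal sketch of the first two steps, assuming an Oracle legacy database (other relational DBMSs keep equivalent catalogues under different names), the relations, attributes and any declared constraints can be read from the standard data dictionary views:

    -- Relations and their attributes
    SELECT table_name, column_name, data_type, nullable
      FROM user_tab_columns
     ORDER BY table_name, column_id;

    -- Declared primary key (P), foreign key (R), unique (U) and check (C)
    -- constraints, where the legacy DBMS records them at all
    SELECT c.table_name, c.constraint_name, c.constraint_type, cc.column_name
      FROM user_constraints c, user_cons_columns cc
     WHERE c.constraint_name = cc.constraint_name
     ORDER BY c.table_name, c.constraint_name, cc.position;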


5.1.1 Contents of a relational database Diverse sources provide information that leads to the identification of a database’s contents. These include the database’s schema, observed patterns of data, semantic understanding of application and user manuals. Among these the most informative source is the database’s schema, which can be extracted from the data dictionary of a DBMS. The observed patterns of data usually provide information such as possible key fields, domain ranges and the related data elements. This source of information is usually not reliable as invalid, inconsistent, and incomplete data exists in most legacy applications. The reliability can be increased by using the semantics of an application. The availability of user manuals for a legacy IS is rare and they are usually out of date, which means they provide little or no useful information to this search. Data dictionaries of relational databases store information about relations, attributes of relations, and rapid data access paths of an application. Modern relational databases record additional information, such as primary and foreign keys (e.g. Oracle), rules / constraints on relations (e.g. INGRES, POSTGRES, Oracle) and generalisation hierarchies (e.g. POSTGRES). Hence, analysis of the data dictionaries of relational databases provides the basic elements of a database schema, i.e. entities, their attributes, and sometimes the keys and constraints, which are then used to discover the entity and relationship types that represent the basic components of a conceptual model for the application. The trend is for each new product release to support more sophisticated facilities for representing knowledge about the data. 5.1.2 Keys of a relational data model Theoretically, three types of key are specified in a relational data model. They are primary, candidate and foreign keys. Early relational DBMSs were not capable of implicitly representing these. However, sometimes indexes which are used for rapid data access can be used as a clue to determine some keys of an application database. For instance, the analysis of the unique index keys of a relational database provides sufficient information to determine possible primary or candidate keys of an application. The observed attribute names and data patterns may also be used to assist this process. This includes attribute names ending with ‘#’ or ‘no’ as possible candidate keys, and attributes in different relations having the same name for possible foreign key attributes. In the latter case, we need to consider homonyms to eliminate incorrect detections and synonyms to prevent any omissions due to the use of different names for the same purpose. Such attributes may need to be further verified using the data elements of the database. This includes explicit checks on data for validity of uniqueness and referential integrity properties. However the reverse of this process, i.e. determining a uniqueness property from the data values in the extensional database is not a reliable source of information, as the data itself is usually not complete (i.e. it may not contain all possible values) and may not be fully accurate. Hence we do not use this process although it has been used in [CHI94, PRE94]. The lack of information on keys in some existing database specifications has led to the use of data instances to derive possible keys. However it is not practicable to automate this process as some entities have keys consisting of multiple attributes. 
This means many permutations would have to be considered to test for all possibilities. This is an expensive operation when the volume of data and / or the number of attributes is large.
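Such explicit checks are straightforward to issue even though, as noted above, they provide evidence rather than proof. A minimal sketch over a hypothetical Student table: the first query returns no rows only if collegeNo currently satisfies the uniqueness property (and so is a possible candidate key), while the second applies the same test to a two-attribute combination, each added attribute multiplying the number of combinations that would have to be tried.

    -- Evidence that collegeNo could be a candidate key (no rows = no duplicates)
    SELECT collegeNo, COUNT(*)
      FROM Student
     GROUP BY collegeNo
    HAVING COUNT(*) > 1;

    -- The same test for a composite key candidate
    SELECT name, birthDate, COUNT(*)
      FROM Student
     GROUP BY name, birthDate
    HAVING COUNT(*) > 1;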


In [CHI94], a consistent naming convention is applied to key attributes. Here attributes used to represent the same information must have the same name, and as a result referencing and referenced attributes of a binary relationship between two entities will have the same attribute names in the entities involved. This naming convention was used in [CHI94] to determine relationship types, as foreign key specifications are not supported by all databases. An important contribution of our work is to support the identification of foreign key specifications for any database and hence the detection of relationships, without performing any name conversions. We note that some reverse-engineering methods rely on candidate keys (e.g. [NAV87, JOH89]), while others rely on primary keys (e.g. [CHI94]). These approaches insist on their users meeting their pre-requisites (e.g. specification of missing keys) to enable the user to successfully apply their reverse-engineering process. This means it is not possible to produce a suitable conceptual model until the pre-requisites are supplied. For a large legacy database application the number of these could exceed a hundred and hence, it is not appropriate to rely on such pre-requisites being met to derive an initial conceptual model. Therefore, we concentrate on providing an initial conceptual model using only the available information. This will ensure that the reverse-engineering process will not fail due to the absence of any vital information (e.g. the key specification for an entity). 5.1.3 Entity and Relationship Types of a data model In the context of an E-R model an entity is classified as strong15 or weak depending on an existence-dependent property of the entity. A weak entity cannot exist without the entity it is dependent on. The enhanced E-R model (EER) [ELM94] identifies more entity types, namely: composite, generalised and specialised entities. In section 3.3.1 we described these entity types and the relationships formed among them. Different classifications of entities are due to their associative properties with other entities. The identification of an appropriate entity type for each entity will assist in constructing a graphically informative conceptual model for its users. The extraction of information from legacy systems to classify the appropriate entity type is a difficult task as such information is usually lost during an implementation. This is because implementations take different forms even within a particular data model [ELM94]. Hence, an information extraction process may need to interact with a user to determine some of the entity and relationship types. The type of interaction required depends on the information available for processing and will take different forms. For this reason we focus only on our approach, i.e. determining entity and relationship types using enhanced knowledge such as primary and foreign key information. This is described in section 5.4. 5.1.4 Suitable Data Abstractions for a data model Entities and relationships form the basic components of a conceptual data model. These components describe specific structures of a data model. A collection of entities may be used to represent more than one data structure. For example, entities Person and Student may be represented as a 1:1 relationship or as a is-a relationship. Each representation has its own view and hence the user understanding of the data model will differ with the choice of data structure. 
Hence it is important to be able to introduce any data structure for a conceptual model and view using the most

15 In some literature this type of entity is referred to as regular entity, e.g. [DAT95].


suitable data abstraction. Data structures such as generalisation and aggregation have inherent behavioural properties which give additional information about their participating entities (e.g. an instance of a specialised entity of a generalisation hierarchy is made up from an instance of its generalised entity). These structures are specialised relationships and representation of them in a conceptual model provides a higher level of data abstraction and a better user understanding than the basic E-R data model gives. These data abstractions originated in the object-oriented data model and they are not implicitly represented in existing relational DBMSs. Extended-relational DBMSs support the O-O paradigm (e.g. POSTGRES) with generalisation structures being created using inheritance definitions on entities. However in the context of legacy DBMSs such information is not normally available, and as a result such data abstractions can only be introduced either by introducing them without affecting the existing data structures or by transforming existing entities and relationships to support their representation. For example, entities Staff and Student may be transformed to represent a generalisation structure by introducing a Person entity. Other forms of transformation can also be performed. These include decomposing all n-ary relationships for n > 3 into their constituent relationships of order 2 to remove such relationships and hence simplify the association among their entities. At this stage double buried relationships are identified and merged and relationships formed with subclasses are eliminated. Transitive closure relationships are also identified and changed to form simplified hierarchies. We use constraints to determine relationships and hierarchies. By controlling these constraints (i.e. modifying or deleting them) it is possible to transform or eliminate necessary relationships and hierarchies. 5.2 Our Re-engineering Process Our re-engineering process has two stages. Firstly, the relational legacy database is accessed to extract the meta-data of the application. This extracted meta-data is translated into an internal representation which is independent of the vendor database language. This information is next analysed to determine the entity and relationship types, their attributes, generalisation / specialisation hierarchies and application constraints. The conceptual model of the database is then derived using this information and is presented graphically for the user. This completes the first stage which is a reverse-engineering process for a relational database. To complete the re-engineering process, any changes to the existing design and any new enhancements are done at the second stage. This is a forward-engineering process that is applied to the reverse-engineered model of the previous stage. We call this process constraint enhancement as we use constraints to enhance the stored knowledge of a database and hence perform our forward-engineering process. These constraint enhancements are done with the assistance of the DBA. 5.2.1 Our Reverse-Engineering Process Our reverse-engineering process concentrates on producing an initial conceptual model without any user intervention. This is a step towards automating the reverse-engineering process. However the resultant conceptual model is usually incomplete due to the limited meta-knowledge available in most legacy databases. Also, as a result of incomplete information and unseen inclusion


dependencies we may represent redundant relationships as well as fail to identify some of the entity and / or relationship types. We depend on constraint enhancement (i.e. the forward-engineering process) to supply this missing knowledge so that subsequent conceptual models will be more complete. The DBA can investigate the reversed-engineered model to detect and resolve such cases with the assistance of the initial display of that model. The system will need to guide the DBA by identifying missing keys and assisting in specifying keys and other relevant information. It also assists in examining the extent to which the new specifications conform to the existing data. Our reverse-engineering process does not depend on information about specialised constraints. When no information about these is available, we treat all entities of a database to be of the same type (i.e. strong entities) and any links present in the database will not be identified. In such a situation the conceptual model will display only the entities and attributes of the database schema without any links. For example, a relational database schema for a university college database system with no constraint-specific information will initially be viewed as shown in figure 5.1. This is the usual case for most legacy databases as they lack constraint-specific information. However, the DBA will be able to provide any missing information at the next stage so that any intended data structures can be reconstructed. Obviously if some constraints are available our reverse-engineering process will try to derive possible entity types and links during its initial application.

Figure 5.1 : A relational legacy database schema for a university college database
(Diagram: the entities University, College, Faculty, Department, Student, Employee and EmpStudent, each shown only with its attributes - e.g. Student with name, address, birthDate, gender, collegeNo, course, department, tutor and regYear, and EmpStudent with collegeNo, empNo and remarks - and with no keys, relationships or constraints displayed.)
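To make this starting point concrete, a table such as Student in figure 5.1 may have been created with nothing more than column definitions - no keys, no references and no checks - so the data dictionary alone gives our reverse-engineering process no links to draw. The data types below are, of course, assumptions:

    CREATE TABLE Student (
      name       VARCHAR2(40),
      address    VARCHAR2(80),
      birthDate  DATE,
      gender     CHAR(1),
      collegeNo  NUMBER(6),
      course     CHAR(6),
      department CHAR(4),
      tutor      NUMBER(6),
      regYear    NUMBER(4)
    );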

Our reverse-engineering process first identifies all the available information by accessing the legacy database service (cf. section 5.3). The accessed information is processed to derive the relationship and entity types for our database schema (cf. section 5.4). These are then presented to the user using our graphical display function.

5.2.2 Our Forward-Engineering Process

The forward-engineering process is provided to allow the designer (i.e. DBA) to interact with a conceptual model. The designer is responsible for verifying the displayed model and can supply any additional information to the reverse-engineered model at this stage. The aim of this process is to allow the designer to define and add any of the constraint types we identified in section 3.5 (i.e. primary key constraints, foreign key constraints, uniqueness constraints, check constraints, generalisation / specialisation structures, cardinality constraints and other constraints) which are not


present. Such additions will enhance the knowledge held about a legacy database. As a result, new links and data abstractions that should have been in the conceptual model can be derived using our reverse-engineering techniques and presented in the graphical display. This means that the legacy database schema originally viewed as in figure 5.1 can be enhanced with constraints and presented as in figure 5.2, which is a vast improvement on the original display. Such an enhanced display demonstrates the extent to which a user’s understanding of a legacy database schema can be improved by providing some additional knowledge about the database. In sections 6.3.4 and 6.4 we introduce the enhanced constraints of our test databases including those used to improve the legacy database schema of figure 5.1 to figure 5.2.

Figure 5.2 : The enhanced university college database schema
(Diagram: the same schema after constraint enhancement. A Person entity (name, address, birthDate, gender) generalises Employee and Student, with EmpStudent linking the two; an Office entity (code, siteName, unitName, address, phone) generalises College-Office, Faculty-Office and Dept-Office, whose attributes are shown as renamings of the originals (e.g. inCharge AS principal, secretary and head respectively); relationships such as office, dean, faculty, inCharge, worksFor and tutor now connect the entities; each entity carries its own constraints, and cardinality annotations such as 2-12 and 4+ appear on the relationships.)
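In SQL terms, the enhancements reflected in figure 5.2 could be thought of as declarations of the following kind; the constraint names and data types are invented, and in our approach such definitions are in fact held as enhanced constraints in the legacy database (cf. section 5.7) rather than applied directly like this:

    -- Keys that were missing from the legacy definitions
    ALTER TABLE Employee ADD CONSTRAINT emp_pk PRIMARY KEY (empNo);
    ALTER TABLE Student  ADD CONSTRAINT stu_pk PRIMARY KEY (collegeNo);

    -- A Person supertype carrying the attributes shared by Employee and
    -- Student, one way of realising the generalisation hierarchy of figure 5.2
    CREATE TABLE Person (
      personId  NUMBER(6)    PRIMARY KEY,
      name      VARCHAR2(40),
      address   VARCHAR2(80),
      birthDate DATE,
      gender    CHAR(1)
    );
    ALTER TABLE Employee ADD (personId NUMBER(6) REFERENCES Person);
    ALTER TABLE Student  ADD (personId NUMBER(6) REFERENCES Person);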

We support the examination of existing specifications and identification of possible new specifications (cf. section 5.5) for legacy databases. Once these are identified, the designer defines new constraints using a graphical interface (cf. section 5.6). The new constraint specifications are stored in the legacy database using a knowledge augmentation process (cf. section 5.7). We also supply a constraint verification module to give users the facility to verify and ensure that the data conforms to all the enhanced constraints (cf. section 5.8) being introduced.

5.3 Identifying Information of a Legacy Database Service

Schema information about a database (i.e. meta-data) is stored in the data dictionaries of that database. The representation of information in these data dictionaries is dependent on the type of the DBMS. Hence initially the relational DBMS and the databases used by the application are identified. The name and the version of the DBMS (e.g. INGRES version 6), the names of all the databases in

use (e.g. faculty / department), and the name of the host machine (e.g. llyr.athro.cf.ac.uk) are identified at this stage. These are the input data that allows us to access the required meta-data. As the access process is dependent on the type of the DBMS, we describe this process in section 6.5 after specifying our test DBMSs. This process is responsible for identifying all existing entities, keys and other available information in a legacy database schema. 5.4 Identification of Relationship and Entity Types Once the entities and their attributes along with primary and candidate keys have been provided, we are ready to classify relationships and entity types. Three types of binary relationships (i.e. 1:1, 1:N and N:M) and five types of entities (i.e. strong, weak, composite, generalised and specialised) are identified at this stage. Initially we assume that all entities are strong and look for certain properties associated with them (mainly primary and foreign key), so that they can be reclassified into any of the other four types. Weak and composite entities are identified using relationship properties and generalised / specialised entities are determined using generalisation hierarchies. 5.4.1 Relationship Types (a) A M:N relationship If the primary key of an entity is formed from two foreign keys then their referenced entities participate in an M:N relationship. This is a special case of n-ary relationship involving two referenced entities (see section ‘a’ of figure 5.3). This entity becomes a composite entity having a composite key. For example, entity Option with primary key (course,subject) participates in an M:N relationship as the primary key attributes are foreign keys - see tables 6.2, 6.4 and 6.6 (later). In a n-ary relationship (e.g. 3-ary or ternary if the number of foreign keys is 3, see section ‘b’ of figure 5.3) the primary key of an entity is formed from a set of n foreign keys. As stated in section 5.1.4, n-ary relationships for n > 3 are usually decomposed into their constituent relationships of order 2 to simplify their association. Hence we do not specifically describe this case. For example, entity Teach with primary key (lecturer, course, subject) participates in a 3-ary relationship when lecturer, course and subject are foreign keys referencing entities Employee, Course and Subject, respectively. However, as Option is made up using Course and Subject entities we could decompose this 3-ary relationship into a binary relationship by defining course and subject of Teach to be a foreign key referencing entity Option - see tables 6.2, 6.4 and 6.6.

The content of figure 5.3 is summarised below (the original figure also shows each case in ER graphical notation).

    Relational Model                                            ER Model Concept
(a) PK = FK1 + FK2 (i.e. n = 2)                                 M:N relationship
(b) PK = FK1 + ... + FKn, n > 2                                 n-ary relationship (e.g. 3-ary / ternary)
(c) FK attribute is part of PK and the other part does not
    contain a key of any other relation                         1:N relationship (weak entity)
(d) A non-key FK and non-unique attribute                       1:N relationship
(e) A non-key FK and unique attribute                           1:1 relationship

PK - Primary Key   FK - Foreign Key   e - referencing entity   re - referenced entity

Figure 5.3: Mapping foreign key references to an ER relationship type
Sometimes a foreign key refers to the same entity, forming a unary relationship, as in the case where some courses may have pre-requisites. In this case the attribute pre-requisites of entity Course is a foreign key referencing the same entity.

(b) A 1:N relationship
There are two types of 1:N relationship: one is formed with a weak entity and the other with a strong entity. If part of the primary key of an entity is a foreign key and the other part does not contain a key of any other relation, then the entity concerned is a weak entity and will participate in a weak 1:N relationship (see section ‘c’ of figure 5.3) with its referenced entity. For example, entity Committee with primary key (name, faculty) is a weak entity as only a part of its primary key attributes (i.e. faculty) is a foreign key. A non-key foreign key attribute (i.e. an attribute that is not part of a primary key) that may have multiple values will participate in a strong 1:N relationship (see section ‘d’ of figure 5.3) if it does not satisfy the uniqueness property. For example, attribute tutor of entity Student is a non-key, non-unique foreign key referencing the entity Employee (cf. tables 6.2 to 6.4). Here tutor participates in a 1:N relationship with Employee - see table 6.6.

(c) A 1:1 relationship
A non-key foreign key attribute will participate in a 1:1 relationship if a uniqueness constraint is defined for that attribute (see section ‘e’ of figure 5.3). For example, attribute head of entity Department participates in a 1:1 relationship with entity Employee as it is a non-key foreign key with the uniqueness property, referencing Employee - see tables 6.2 to 6.4 and 6.6.
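To make these classification rules concrete, the DDL sketch below shows the shape of the key definitions that drive them. It is illustrative only: the actual definitions of our test databases appear in tables 6.2 to 6.6, and the column types (and any column names not quoted in the text above) are assumptions.

CREATE TABLE Option (          -- primary key built from two foreign keys => composite entity, M:N
    course   char(8) NOT NULL REFERENCES Course (course),
    subject  char(8) NOT NULL REFERENCES Subject (subject),
    PRIMARY KEY (course, subject) );

CREATE TABLE Student (         -- tutor: non-key, non-unique foreign key => strong 1:N
    collegeNo  integer NOT NULL PRIMARY KEY,
    tutor      integer REFERENCES Employee (empNo) );

CREATE TABLE Department (      -- head: non-key foreign key with uniqueness => 1:1
    deptCode  char(4) NOT NULL PRIMARY KEY,
    head      integer UNIQUE REFERENCES Employee (empNo) );

A legacy DBMS that cannot record the REFERENCES and UNIQUE clauses declaratively is precisely the case our enhancement process addresses: the same facts would then be supplied by the designer and held as enhanced constraints.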

The specialised and generalised entity pair of a generalisation hierarchy has a 1:1 is-a relationship. Hence it is possible to define a binary relationship in place of a generalisation hierarchy. For example, it is possible to define a foreign key (empNo) on entity EmpStudent, referencing entity Employee to form a 1:1 relationship instead of representing it as a generalisation hierarchy. Such cases must be detected and corrected by the database designer. We introduce inheritance constraints involving these entities to resolve such cases. 5.4.2 Entity Types (a) A strong entity This is the default entity type, as any entity that cannot be classified as one of the other types will be a strong (regular) entity. (b) A composite entity An entity that is used to represent an M:N relationship is referred to as a composite (link) entity (cf. section 5.4.1 (a)). The identification of M:N relationships will result in the identification of composite entities. (c) A weak entity An entity that participates in a weak 1:N relationship is referred to as a weak entity (cf. section 5.4.1 (b)). The identification of weak 1:N relationships will result in the identification of weak entities. (d) A generalised / specialised entity An entity defined to contain an inheritance structure (i.e. inheriting properties from others) is a specialised entity. Entities whose properties are used for inheritance are generalised entities. The identification of inheritance structures will result in the identification of specialised and generalised entities. An inheritance structure defines a single inheritance structure (e.g. entities X1 to Xj inherit

from entity A in figure 5.4). However, a set of inheritance structures may form a multiple inheritance structure (e.g. entity Xj inherits from entities A and B in figure 5.4). To determine the existence of

multiple inheritance structures we analyse all subtype entities of the database (e.g. entities X1 to Xn in figure 5.4) and derive their supertypes (e.g. entity A or B or both in figure 5.4). For example, entity EmpStudent inherits from Employee and Student entities forming a multiple inheritance, while entity Employee inherits from Person to form a single inheritance.

Figure 5.4: Single and multiple inheritance structures using EER notations

[Figure 5.4 is a graphical EER diagram showing supertype entities A and B with subtype entities X1, ..., Xi, ..., Xj, ..., Xn connected beneath them.]

Entities X1, .. ,Xi, .. ,Xj inherit from entity A and entities Xj, .. ,Xn inherit from entity B.
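Anticipating the checks described in sections 5.5.2 and 5.8, a suspected subtype / supertype pair such as EmpStudent under Employee can be confirmed against the data with an inclusion-dependency query of the following form. This is a sketch rather than the exact statement generated by our system, and empNo is taken as the shared key attribute as in the example above.

-- Any rows returned are EmpStudent keys with no matching Employee row,
-- i.e. evidence against the inclusion dependency needed for EmpStudent
-- to be a specialisation of Employee.
SELECT es.empNo
FROM   EmpStudent es
WHERE  es.empNo NOT IN (SELECT e.empNo FROM Employee e);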

5.5 Examining and Identifying Information Our forward-engineering process allows the designer to specify new information. To successfully perform this task the designer needs to be able to examine the current contents of the database and identify possible missing information from it. 5.5.1 Examining the contents of a database At this stage the user needs to be able to browse through all features of the database. Overall, this includes viewing existing primary keys, foreign keys, uniqueness constraints and other constraint types defined for the database. When inheritance is involved the user may need to investigate the participating entities at each level of inheritance. As specific viewing the user may want to investigate the behaviour of individual entities. This includes identifying constraints associated with a particular entity (i.e. intra-object properties) and those involving other entities (i.e. inter-object properties). Our system provides for this via its graphical interface. We describe viewing of these properties in section 7.5.1, as it is directly associated with this interface. Here, global information is tabulated and presented for each constraint type, while specific information (i.e. inter- and intra- object) presents constraints associated with an individual entity. 5.5.2 Identifying possible missing, hidden and redundant information This process allows the designer to search for specific types of information, including information about the type of entities that do not contain primary keys, possible attributes for such keys, buried foreign key definitions and buried inheritance structures. In this section we describe how we identify this type of information. i) Possible primary key constraints Entities that do not contain primary keys are identified by comparing the list of entities having primary keys with the list of all entities of the database. When such entities are identified the user can view the attributes of these and decide on a possible primary key constraint. Sometimes, an entity may have several attributes and hence the user may find it difficult to decide on suitable primary key attributes. In such a situation the user may need to examine existing properties of that entity (cf. section 5.5.1) to identify attributes with uniqueness properties and not null values.

Sometimes, attribute names such as those ending with ‘no’ or ‘#’ may give a clue in selecting the appropriate attributes. Once the primary key attributes have been decided the user may want to verify this choice against the data of the database (cf. section 5.8). ii) Possible foreign key constraints Existence of either an inclusion dependency between a non-key attribute of one table and a key attribute of another (e.g. deptno of Employee and deptno of Department), or a weak or n-ary relationship between a key attribute and part of a key attribute (e.g. cno of strong entity Course and cno of link entity Teach) implies the possible existence of a foreign key definition. Such possibilities are detected by matching attribute names satisfying the required condition. Otherwise, the user needs to inspect the attributes and detect their possible occurrence (e.g. if attribute name worksfor instead of deptno was used in Employee). iii) Possible uniqueness constraints Detection of a uniqueness index gives a clue to a possible uniqueness constraint. All other indications of this type of constraint have to be identified by the user. iv) Possible inheritance structures Existence of an inclusion dependency between two strong entities having the same key implies a subtype / supertype relationship between the two entities. Such possible relationships are detected by matching identical key attribute names of strong entities (e.g. empno of Person and empno of Employee). Otherwise, the user needs to inspect the table and 1:1 relationships to detect these structures (e.g. if personid instead of empno was used in Person then the link between empno and personid would have to be identified by the user). In distributed database design some entities are partitioned using either horizontal or vertical fragmentation. In this situation strong entities having the same key will exist with a possible inclusion dependency between vertically fragmented tables. Such cases need to be identified by the designer to avoid incorrect classifications occurring. For example, employee records can be horizontally fragmented and distributed over each department as opposed to storing at one site (e.g. College). Also, employee records in a department may be vertically fragmented at the College site as the college is interested in a subset of information recorded for a department. v) Possible unnormalised structures All entities of a relational model are at least in 1NF, as this model does not allow multivalued attributes. When entities are not in 3NF (i.e. a non-key attribute is dependent on part of a key or another non-key attribute: violating 2NF or 3NF, respectively), there are hidden functional dependencies. These entities need to be identified and transformed into 3NF to show their dependencies. New entities in the form of views are used to construct this transformation. For example, entity Teach can be defined to contain attributes lecturer, course, subNo, subName and room. Here we see that subName is fully dependent on subNo and hence Teach is in 2NF. Using a view we separate subName from Teach and use it as a separate entity with primary key subNo. This

allows us to transform the original Teach to 3NF and view Subject and Teach as a binary, instead of an unary relationship. This will assist in improving conceptual model readability. vi) Possible redundant constraints Redundant inclusion dependencies representing projection or transitivity must be removed, otherwise incorrect entity or relationship types may be represented. For instance, if there is an inclusion dependency between entities A, B and B, C then the transitivity inclusion dependency between A, C is redundant. Such relationships should be detected and removed. For example, EmpStudent is an Employee and Employee is a Person, thus EmpStudent is a Person is a redundant relationship. Redundant constraints are often most obvious when viewing the graphical display of a conceptual model with its inter- and intra- object constraints. 5.6 Specifying New Information We can specify new information using constraints. In a modern DBMS which supports constraints we can use its query language to specify these. However this approach will fail for legacy databases as they do not normally support the specification of constraints. To deal with both cases we have designed our system to externally accept constraints of any type, but represent them internally by adopting the appropriate approach depending on the capabilities of the DBMS in use. Thus if constraint specification is supported by the DBMS in use we will issue a DDL statement (cf. figure 5.5 which is based on SQL-3 syntax) to create the constraint. If constraint specification is not supported by the DBMS in use we will store the constraint in the database using techniques described in section 5.7. These constraints are not enforced by the system but they may be used to verify the extent to which the data conforms with the constraints (cf. section 5.8). In both cases this enhanced knowledge is used by our conceptual model wherever it is applicable. The following sub-sections describe the specification process for each constraint type. We cover all types of constraints that may not be supported by a legacy system, including primary key. We use the SQL syntax to introduce them. In SQL, constraints are specified as column/table constraint definitions and can optionally contain a constraint name definition and constraint attributes (see sections A.3 and A.4) which are not included here. i) Specifying Primary Key Constraints Only one primary key is allowed for an entity. Hence our system will not allow any input that may violate this status. Once an entity is specified the primary key attributes are chosen. Modern SQL DBMSs will use the DDL statement ‘a’ of figure 5.5 to create a new primary key constraint, older systems do not have this capability in their syntax. ii) Foreign Key Constraints A foreign key establishes a relationship between two entities. Hence, when the enhancing constraint type is chosen as a foreign key, our system requests two entity names. The first is the referencing entity and the second the referenced entity. Once the entity names are identified the system automatically shows the referenced attributes. These attributes are those having the uniqueness property. When these attributes are chosen a new foreign key is established. This

constraint will only be valid if there is an inclusion dependency between the referencing and referenced attributes. Modern SQL DBMSs will use the DDL statement ‘b’ of figure 5.5 to create a new foreign key constraint in this situation. This statement can optionally contain a match type and referential triggered action (see section A.8) which are not shown here.

iii) Uniqueness Constraints
A uniqueness constraint may be defined on any combination of attributes. However, such constraints should be meaningful (e.g. there is no point in defining a uniqueness constraint for a set of attributes when a subset of it already holds the uniqueness status), and should not violate any existing data. Modern SQL DBMSs will use the DDL statement ‘c’ of figure 5.5 to create a new uniqueness constraint.

(a) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        PRIMARY KEY (Primary_Key_Attributes)

(b) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        FOREIGN KEY (Foreign_Key_Attributes)
        REFERENCES Referenced_Entity_Name (Referenced_Attributes)

(c) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        UNIQUE (Uniqueness_Attributes)

(d) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        CHECK (Check_Constraint_Expression)

(e) ALTER TABLE Entity_Name ADD UNDER Inherited_Entities [ WITH (Renamed_Attributes) ]

(f) ALTER TABLE Entity_Name ADD CONSTRAINT Constraint_Name
        FOREIGN KEY (Foreign_Key_Attributes)
        [ CARDINALITY (Referencing_Cardinality_Value) ]
        REFERENCES Referenced_Entity_Name (Reference_Attributes)

Our optional extensions to the SQL-3 syntax are highlighted using bold letters here.

Figure 5.5 : Constraints expressed in extended SQL-3 syntax iv) Check Constraints A check constraint may be defined to represent a complex expression involving any combination of attributes and system functions. However such constraints should not be redundant (i.e. not a subset of an existing check constraint) and should not violate any existing data. Modern SQL DBMSs will use the DDL statement ‘d’ of figure 5.5 to create a new check constraint. v) Generalisation / Specialisation Structures An inheritance hierarchy may be defined without performing any structural changes if its existence can be detected by our process described in part ‘d’ of section 5.4.2. In this case we need to specify the entities being inherited (cf. statement ‘e’ of figure 5.5). If an inherited attribute’s name differs from the target attribute name it is necessary to rename them. For example, attributes siteName, unitName and inCharge of Office are renamed to building, name and principal when it is inherited by College - see figures 6.2 and 6.3 (later). It is also possible to make some structural changes in order to introduce new generalisation / specialisation structures. In such situations new entities are created to represent the specialisation/generalisation. Appropriate data for these entities are copied to them during this process. For instance, in our university college example of figure 5.1, the entities College, Faculty and

Department can be restructured to represent a generalisation hierarchy, by introducing a generalised entity called Office and transforming the entities College, Faculty and Department to College-Office, Faculty-Office and Dept-Office, respectively (cf. figure 5.2). Once this transformation is done the entities Office, College-Office, Faculty-Office and Dept-Office will represent a generalisation hierarchy as shown in figure 5.2. Any change to existing structures and table names will affect the application programs which use them. To overcome this we introduce view tables in the legacy database to represent new structures. These tables are defined using the syntax of figure 5.6. For example, the generalised entity will be Office and the specialised entities will be College-Office, Faculty-Office and Dept-Office. The introduction of view tables means that legacy application code using the original tables will not be affected by the change. However, appropriate changes must be introduced in the target application code and database if we are going to introduce these features permanently. We introduced the concept of defining view tables in the legacy database to assist the gateway service in managing these structural changes.

Figure 5.6 : Creation of view table to represent a hierarchy

CREATE VIEW GeneralisedEntity (GeneralisedAttributes) AS SELECT Attributes FROM SpecialisedEntity [ [UNION SELECT Attributes FROM SpecialisedEntity] ..]

CREATE VIEW SpecialisedEntity (SpecialisedAttributes) AS SELECT g1.Attributes [ [, g2.Attributes] ..] FROM GeneralisedEntity g1 [ [, GeneralisedEntity g2] ..] [ WHERE specialised-conditions ]
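As an illustration, the first template of figure 5.6 could be instantiated for the university college example roughly as follows. This is a sketch only: the column names of the legacy College, Faculty and Department tables are assumptions inferred from the attribute renamings shown in figure 5.2 (building, name, head / secretary / principal).

-- Generalised view built over the three legacy tables (assumed column names).
CREATE VIEW Office (siteName, unitName, inCharge) AS
    SELECT building, name, head      FROM Department
    UNION
    SELECT building, name, secretary FROM Faculty
    UNION
    SELECT building, name, principal FROM College;

The specialised entities (e.g. Dept-Office) would then be defined with the second template, renaming the generalised attributes back to the names the legacy applications already use.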

vi) Cardinality Constraints Cardinality constraints specify the minimum / maximum number of instances associated with a relationship. In a 1:1 relationship type the number of instances associated with a relationship will be 0 or 1, and in a M:N relationship type it can take any value from 0 upwards. The ability to define more specific limits allows users not only to gain a better understanding about a relationship, but also to be able to verify its conformance by using its instances. We suggest creating such specifications using an extended syntax of the current SQL foreign key definition (cf. statement ‘f’ of figure 5.5) as this is the key which initially establishes this relationship. The minimum / maximum instance occurrences for a particular relationship of a referential value (i.e. cardinality values) can be specified using a keyword CARDINALITY as shown in figure 5.5. Here the Referencing_Cardinality_Value corresponds to the many side of a relationship. Hence the value of this indicates the minimum instances. When the referencing attribute is not null then the minimum cardinal value is 1, else it is 0. In our examples introduced in part ‘b’ of section 6.2.3, we have used ‘+’ to represent the many symbol (e.g. 0+ represents zero or many) and ‘-’ to represent up to (e.g. -1 represents 0 to 1). vii) Other Constraints In the example in figure 5.2, we have also shown an aggregation relationship between the entities University and Office. Here we have assumed that a reference to a set of instances can be defined. In such a situation, as with the other constraint types, an appropriate SQL statement should be used to describe the constraint and an appropriate augmented table such as those used in figure 5.7 must be used to record this information in the database itself. We discuss this case here for the purpose of highlighting that other constraint types can be introduced and incorporated into our

system using the same general approach. However our implementations have concentrated only on the constraints discussed above. The enhanced constraints, once they are absorbed into the system, will be stored internally in the same way as any existing constraints. Hence the reconstruction process to produce an enhanced conceptual model can utilise this information directly as it is fully automated. To hold the enhancements in the database itself we need to issue appropriate query statements. The enhancements can be effected using the SQL statements shown in figure 5.5 if the database is SQL based and such changes are implicitly supported by it. In section 5.7 we describe how this is done when the database supports such specifications (e.g. Oracle version 7) and when it does not (e.g. INGRES version 6). When the DBMS does not support SQL, the query statement to be issued is translated using QMTS [HOW87] to a form appropriate to the target DBMS. As there are variants of SQL16 we send all queries via QMTS so that the necessary query statements will automatically get translated to the target language before entering the target DBMS environment. 5.7 The Knowledge Augmentation Approach In this section we describe how the enhanced constraints are retained in a database. Our aim has been to make these enhancements compatible with the newer versions of commercial DBMSs, so that the migration process is facilitated as fully as possible. Many types of constraint are defined in a conceptual model during database design. These include relationship, generalisation, existence condition, identity and dependency constraints. In most current DBMSs these constraints are not represented as part of the database meta-data. Therefore, to represent and enforce such constraints in these systems, one needs to adopt a procedural approach which makes use of some embedded programming language code to perform the task. Our system uses a declarative approach (cf. section 3.6) for constraint manipulation, as it is easier to process constraints in this form than when they are represented in the traditional form of procedural code.

16 The date functions of most SQL databases (e.g. INGRES and Oracle) are different.

CREATE TABLE Table_Constraints (
    Constraint_Id       char(32) NOT NULL,
    Constraint_Name     char(32) NOT NULL,
    Table_Id            char(32) NOT NULL,
    Table_Name          char(32) NOT NULL,
    Constraint_Type     char(32) NOT NULL,
    Is_Deferrable       char(3)  NOT NULL,
    Initially_Deferred  char(3)  NOT NULL );

CREATE TABLE Key_Column_Usage (
    Constraint_Id     char(32) NOT NULL,
    Constraint_Name   char(32) NOT NULL,
    Table_Id          char(32) NOT NULL,
    Table_Name        char(32) NOT NULL,
    Column_Name       char(32) NOT NULL,
    Ordinal_Position  integer(2) );

CREATE TABLE Referential_Constraints (
    Constraint_Id           char(32) NOT NULL,
    Constraint_Name         char(32) NOT NULL,
    Unique_Constraint_Id    char(32) NOT NULL,
    Unique_Constraint_Name  char(32) NOT NULL,
    Match_Option            char(32) NOT NULL,
    Update_Rule             char(32) NOT NULL,
    Delete_Rule             char(32) NOT NULL );

CREATE TABLE Check_Constraints (
    Constraint_Id    char(32)  NOT NULL,
    Constraint_Name  char(32)  NOT NULL,
    Check_Clause     char(240) NOT NULL );

CREATE TABLE Sub_tables (
    Table_Id            char(32) NOT NULL,
    Sub_Table_Name      char(32) NOT NULL,
    Super_Table_Name    char(32) NOT NULL,
    Super_Table_Column  integer(4) NOT NULL );

CREATE TABLE Altered_Sub_Table_Columns (
    Table_Id            char(32) NOT NULL,
    Sub_Table_Name      char(32) NOT NULL,
    Sub_Table_Column    char(32) NOT NULL,
    Super_Table_Name    char(32) NOT NULL,
    Super_Table_Column  char(32) NOT NULL );

CREATE TABLE Cardinality_Constraints (
    Constraint_Id         char(32) NOT NULL,
    Constraint_Name       char(32) NOT NULL,
    Referencing_Cardinal  char(32) );

Figure 5.7: Knowledge-based table descriptions The constraint enhancement module of our system (CCVES) accepts new constraints (cf. figure 5.5) irrespective of whether they are supported by the selected DBMS. These new constraints are the enhanced knowledge which is stored in the current database, using a set of user defined knowledge-based tables, each of which represents a particular type of constraint. These tables provide general structures for all constraint types of interest. In figure 5.7 we introduce our table structures which are used to hold constraint-based information in a database. We have followed the current SQL-3 approach to representing constraint types supported by the standards. In areas which the current standards have yet to address (e.g. representation of cardinality constraints) we have proposed our own table structures. Thus, all general constraints associated with a table (i.e. an entity) are recorded in Table_Constraints. The constraint description for each type is recorded elsewhere in other tables, namely, Key_Column_Usage for attribute identifications, Referential_Constraints for foreign key definitions, Check_Constraints to hold constraint expressions, Sub_Tables to hold generalisation / specialisation structures (i.e. inherited tables), Altered_Sub_Table_Columns to hold any attributes renamed during inheritance, and Cardinality_Constraints to hold cardinal values associated with relationship structures. The use of these table structures to represent constraint-based information in a database depends on the type of DBMS in use and the features it supports. The features supported by a DBMS may differ from the standards to which it claims to conform, as database vendors do not always follow the standards fully when they develop their systems. However, DBMSs supporting the representation of constraints need not have identical table structures to our approach as they may have used an alternative way of dealing with constraints. In such situations it is not necessary to insist on the use of our table structures for constraint representation, as the database is capable of managing them itself if we follow its approach. Therefore we need to identify which SQL standards have been used and in which DBMSs we should introduce our own tables to hold enhanced constraints. In figure 5.8 we identify the tables required for the three SQL standards and for three selected DBMSs. The selected DBMSs were used as our test DBMSs, as we shall see in section 6.1. The CCVES determines for the DBMS being used the required knowledge-based tables and

creates and maintains them automatically. The creation of these tables and the input of data to them are ideally done at the database application implementation stage, by extracting data from the conceptual model used originally to design a database. However, as current tools do not offer this type of facility, one may have to externally define and manage these tables in order to hold this knowledge in a database. Our system has been primed with the knowledge of the tables required for each DBMS it supports, and so it automatically creates these tables and stores information in them if the database is enhanced with new constraints. Here, Table_Constraints, Referential_Constraints, Key_Column_Usage, Check_Constraints and Sub_Tables are those used by SQL-3 to represent constraint specifications. SQL-2 has the same tables, except for Sub_Tables, Hence, as shown in figure 5.8, these tables are not required as augmented tables when a DBMS conforms to SQL-3 or SQL-2 standards, respectively. Adopting the same names and structures as used in the SQL standards makes our approach compatible with most database products. We have introduced two more tables (namely: Cardinality_Constraints and Altered_Sub_Table_Columns) to enable us to represent cardinality constraints and to record any synonyms involved in generalisation / specialisation structures. The representation of this type of information is not yet addressed by the SQL standards. CCVES utilises the above mentioned user defined knowledge-based tables not only to automatically reproduce a conceptual model, but also to enhance existing databases by detecting and cleaning inconsistent data. To determine the presence of these tables, CCVES looks for user defined tables such as Table_Constraints, Referential_Constraints, etc., which can appear in known existing legacy databases only if the DBMS maintains our proposed knowledge-base. For example, in INGRES version 6 we know that such tables are not maintained as part of its system provision, hence the presence of tables with these names in this context confirms the existence of our knowledge-base. Use of our knowledge-based tables is database system specific as they are used only to represent knowledge that is not supported by that DBMS's meta-data facility. Hence, the components of two distinct knowledge-bases, e.g. for INGRES version 6 and Oracle version 7, are different from each other (see figure 5.8).

Figure 5.8 : Requirement of augmented tables for SQL standards and some current DBMSs

Table Name                     S1    S2    S3    I     O     P
                               -     -     D     V6    V7    V4
Table_Constraints              Y     N     N     Y     N     Y
Referential_Constraints        Y     N     N     Y     N     Y
Key_Column_Usage               Y     N     N     Y     N     Y
Check_Constraints              Y     N     N     N     N     N
Sub_Tables                     Y     Y     N     Y     Y     N
Altered_Sub_Table_Columns      Y     Y     Y     Y     Y     Y
Cardinality_Constraints        Y     Y     Y     Y     Y     Y

S1 - SQL/86, S2 - SQL-2, S3 - SQL-3        Y - Yes, required
I - INGRES, O - Oracle, P - POSTGRES       N - No, not required
D - Draft, V - Version

The different types of constraints are identified using the attribute Constraint_Type of Table_Constraints, which must have one of the values PRIMARY KEY, UNIQUE, FOREIGN KEY or CHECK. A set of example instances is given in figure 5.9 to show the types of information held in our knowledge-based tables. The constraint type NOT NULL may also appear in Table_Constraints when dealing with a DBMS that does not support NULL value specifications. We have not included it in our sample data as it is supported by our test DBMSs and all the SQL standards. The constraint

and table identifications in our knowledge-based tables (i.e. Constraint_Id and Table_Id of figure 5.9), may be of composite type as they need to identify not only the name, but also the schema, catalog and location of the database. Foreign key constraints are associated with their referenced table through a unique constraint. Hence, the ‘Const_Name_Key’ instance of attribute Unique_Constraint_Name of table Referential_Constraints should also appear in Key_Column_Usage as a unique constraint. This means that each of the knowledge-based tables has its own set of properties to ensure the accuracy and consistency of the information retained in these tables. For instance Constraint_Type of Table_Constraints must be one of {‘PRIMARY KEY’, ‘UNIQUE’, ‘FOREIGN KEY’, ‘CHECK’} if these are the only type of constraints that are to be represented. Also, within a particular schema a constraint name is unique. Hence Constraint_Name of Table_Constraints must be unique for a particular type of Constraint_Id. In figure 5.10 we present the set of constraints associated with our knowledge-based tables. Besides these there are a few others which are associated with other system tables, such as Tables and Columns which are used to represent all entity and attribute names respectively. Such constraints are used in systems supporting the above constraint types. This allows us to maintain consistency and accuracy within the constraint definitions.

Table_Constraints { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type, Is_Deferrable, Initially_Deferred }
    ('dbId', 'Const_Name_PK',  'TableId', 'Entity_Name_PK',  'PRIMARY KEY', 'NO', 'NO')
    ('dbId', 'Const_Name_UNI', 'TableId', 'Entity_Name_UNI', 'UNIQUE',      'NO', 'NO')
    ('dbId', 'Const_Name_FK',  'TableId', 'Entity_Name_FK',  'FOREIGN KEY', 'NO', 'NO')
    ('dbId', 'Const_Name_CHK', 'TableId', 'Entity_Name_CHK', 'CHECK',       'NO', 'NO')

Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name, Ordinal_Position }
    ('dbId', 'Const_Name_PK',  'TableId', 'Entity_Name_PK',  'Attribute_Name_PK',  i)
    ('dbId', 'Const_Name_UNI', 'TableId', 'Entity_Name_UNI', 'Attribute_Name_UNI', i)
    ('dbId', 'Const_Name_FK',  'TableId', 'Entity_Name_FK',  'Attribute_Name_FK',  i)

Referential_Constraints { Constraint_Id, Constraint_Name, Unique_Constraint_Id, Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
    ('dbId', 'Entity_Name_FK', 'TableId', 'Const_Name_Key', 'NONE', 'NO ACTION', 'NO ACTION')

Check_Constraints { Constraint_Id, Constraint_Name, Check_Clause }
    ('dbId', 'Const_Name_CHK', 'Const_Expression')

Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
    ('dbId', 'Entity_Name_Sub', 'Entity_Name_Super')

Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name, Super_Table_Column }
    ('dbId', 'Entity_Name_Sub', 'newAttribute_Name', 'Entity_Name_Super', 'oldAttribute_Name')

Cardinality_Constraints { Constraint_Id, Constraint_Name, Referencing_Cardinal }
    ('dbId', 'Entity_Name_FK', 'Const_Value_Ref')

Figure 5.9 : Augmented tables with different instance occurrences Some attributes of these knowledge-based tables are used to indicate when to execute a constraint and what action is to be taken. The actions are application dependent or have no effect on the approach proposed here, and hence we have used a default value as proposed in the standards. However, it is possible to specify trigger actions like ON DELETE CASCADE so that when a value of the referenced table is deleted the corresponding values in the referencing table will automatically get deleted. These features were initially introduced in the form of rule based constraints to allow triggers and alerters to be specified in databases and make them active [ESW76, STO88]. Such actions may also have been implemented in legacy ISs as in the case of general constraints. The

constraints used in our constraint enforcement process (cf. section 5.8) are alerters as they draw attention to constraints that do not conform to the existing legacy data.
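When the DBMS in use cannot hold an enhanced constraint in its own meta-data (e.g. INGRES version 6), the knowledge augmentation process records it in the tables of figure 5.7 with ordinary INSERT statements that follow the instance patterns of figure 5.9. The sketch below shows a foreign key with a cardinality value being recorded; the identifier and name values are invented for illustration, and in practice CCVES generates the equivalent insertions automatically.

-- Foreign key Student.tutor -> Employee, with a 'zero or many' referencing cardinality.
INSERT INTO Table_Constraints
    VALUES ('dbId', 'fk_student_tutor', 'studentTabId', 'Student', 'FOREIGN KEY', 'NO', 'NO');
INSERT INTO Key_Column_Usage
    VALUES ('dbId', 'fk_student_tutor', 'studentTabId', 'Student', 'tutor', 1);
INSERT INTO Referential_Constraints
    VALUES ('dbId', 'fk_student_tutor', 'dbId', 'pk_employee', 'NONE', 'NO ACTION', 'NO ACTION');
INSERT INTO Cardinality_Constraints
    VALUES ('dbId', 'fk_student_tutor', '0+');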

Table_Constraints
    PRIMARY KEY (Constraint_Id, Constraint_Name)
    CHECK (Constraint_Type IN ('UNIQUE','PRIMARY KEY','FOREIGN KEY','CHECK') )
    CHECK ( (Is_Deferrable, Initially_Deferred) IN ( values ('NO','NO'), ('YES','NO'), ('YES','YES') ) )
    CHECK ( UNIQUE ( SELECT Table_Id, Table_Name FROM Table_Constraints WHERE Constraint_Type = 'PRIMARY KEY' ) )

Key_Column_Usage
    PRIMARY KEY (Constraint_Id, Constraint_Name, Column_Name)
    UNIQUE (Constraint_Id, Constraint_Name, Ordinal_Position)
    CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type IN ('UNIQUE', 'PRIMARY KEY','FOREIGN KEY' ) ) )

Referential_Constraints
    PRIMARY KEY (Constraint_Id, Constraint_Name)
    CHECK ( Match_Option IN ('NONE','PARTIAL','FULL') )
    CHECK ( Update_Rule IN ('CASCADE','SET NULL','SET DEFAULT','RESTRICT','NO ACTION') )
    CHECK ( (Constraint_Id, Constraint_Name) IN (SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type = 'FOREIGN KEY' ) )
    CHECK ( (Unique_Constraint_Id, Unique_Constraint_Name) IN ( SELECT Constraint_Id, Constraint_Name FROM Table_Constraints WHERE Constraint_Type IN ('UNIQUE','PRIMARY KEY') ) )

Check_Constraints
    PRIMARY KEY (Constraint_Id, Constraint_Name)

Sub_Tables
    PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name)

Altered_Sub_Table_Columns
    PRIMARY KEY (Table_Id, Sub_Table_Name, Super_Table_Name, Column_Name)
    FOREIGN KEY (Table_Id, Sub_Table_Name, Super_Table_Name) REFERENCES Sub_Tables

Cardinality_Constraints
    PRIMARY KEY (Constraint_Id, Constraint_Name)
    FOREIGN KEY (Constraint_Id, Constraint_Name) REFERENCES Referential_Constraints

Figure 5.10: Consistency constraints of our knowledge-based tables Many other types of constraint are possible in theory [GRE93]. We shall not deal with all of them as our work is concerned only with constraints applicable at the conceptual modelling stage. These applicable constraints take the form of logical expressions and are stored in the database using the knowledge-based table Check_Constraints. They are identified by the keyword 'CHECK' in Table_Constraints in figure 5.9. Similarly, other constraint types (e.g. rules and procedures) are represented by means of distinct keywords and tables. Figure 5.9 also includes generalisation and cardinality constraints. A generalisation hierarchy is defined using the SQL-3 syntax (i.e. UNDER, see figure 5.5), while a cardinality constraint is defined using an extended foreign key definition (see figure 5.5). These specifications are also held in the database, using the tables Sub_Tables, Altered_Sub_Table_Columns and Cardinality_Constraints, respectively (see figure 5.9). 5.8 The Constraint Enforcement Process This is an optional process provided by our system, as the third stage of its application to a database. The objective is to give users the facility to verify / ensure that the data conforms to all the enhanced constraints. This process is optional so that the user can decide whether these constraints should be enforced to improve the quality of the legacy database prior to its migration or whether it is best left as it stands.

During the constraint enforcement process any violations of the enhanced constraints are identified. In some cases this may result in removing the violated constraint as it may be an incorrect constraint specification. However, the DBA may decide to keep such constraints as the constraint violation may be as a result of incorrect data instances or due to a change in a business rule that has occurred during the lifetime of the database. Such a rule may be redefined with a temporal component to reflect this change. Such data are manageable using versions of data entities as in object-oriented DBMSs [KIM90]. We use the enhanced constraint definitions to identify constraints that do not conform to the existing legacy data. Here each constraint is used to produce a query statement. This query statement depends on the type of constraint, as shown in figure 5.11. CCVES uses constraint definitions to produce data manipulation language statements suitable for the host DBMS. Once such statements are produced, CCVES will execute them against the current database to identify any violated data for each of these constraints. When such violated data are found for an enhanced constraint it is up to the user to take appropriate action. Enforcement of such constraints can prevent data rejection by the target DBMS, possible losses of data and/or delays in the migration process, as the migrating data’s quality will have been ensured by prior enforcement of the constraints. However as the enforcement process is optional, the user need not take immediate action. He can take his own time to determine the exact reasons for each violation and take action at his convenience prior to migration. 5.9 The Migration Process The migration process is the fourth and final stage in the application of our approach. This is incrementally performed by initially creating the meta-data in the target DBMS, using the schema meta-translation technique of Ramfos [RAM91], and then copying the legacy data to the target system, using the import/export tools of source and target DBMSs. During this activity, legacy applications must continue to function until they too are migrated. To support this process we need to use an interface (i.e. a forward gateway) that can capture and process all database queries of the legacy application and then re-direct those related to the target system via CCVES. The functionality that is required here is a distributed query processing facility which is supported by current distributed DBMSs. However, in our case the source and target databases are not necessarily of the same type as in the case of distributed DBMSs, so we need to perform a query translation in preparation for the target environment. Such a facility can be provided using the query meta-translation technique of Howells [HOW87]. This approach will facilitate transparent migration for legacy databases as it will allow the legacy IS users to continue working while the legacy data is being migrated incrementally.

Constraint     Queries to detect Constraint Violation Instances

Primary Key    SELECT Attribute_Names, COUNT(*) FROM Entity_Name
               GROUP BY Attribute_Names HAVING COUNT(*) > 1
               UNION
               SELECT Attribute_Names, 1 FROM Entity_Name WHERE Attribute_Names IS NULL

Unique         SELECT Attribute_Names, COUNT(*) FROM Entity_Name
               GROUP BY Attribute_Names HAVING COUNT(*) > 1

Referential    SELECT * FROM Referencing_Entity_Name
               WHERE NOT (Referencing_Attributes IS NULL
                          OR Referencing_Attributes IN
                             (SELECT Referenced_Attributes FROM Referenced_Entity_Name))

Check          SELECT * FROM Entity_Name WHERE NOT (Check_Constraint_Expression)

Cardinality    SELECT Attribute_Names, COUNT(*) FROM Entity_Name
               GROUP BY Attribute_Names HAVING COUNT(*) < Min_Cardinality_Value
               UNION
               SELECT Attribute_Names, COUNT(*) FROM Entity_Name
               GROUP BY Attribute_Names HAVING COUNT(*) > Max_Cardinality_Value

Figure 5.11: Detection of violated constraints in SQL
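As a worked example, two of the figure 5.11 templates are instantiated below for constraints of our university example: the foreign key from Student.tutor to Employee.empNo, and a cardinality rule assumed, purely for illustration, to require each tutor to have between 2 and 12 tutees. Rows returned by either query are violation instances.

-- Referential template: Student tuples whose tutor has no matching Employee.
SELECT *
FROM   Student
WHERE  NOT (tutor IS NULL
            OR tutor IN (SELECT empNo FROM Employee));

-- Cardinality template: tutors appearing fewer than 2 or more than 12 times.
SELECT tutor, COUNT(*) FROM Student GROUP BY tutor HAVING COUNT(*) < 2
UNION
SELECT tutor, COUNT(*) FROM Student GROUP BY tutor HAVING COUNT(*) > 12;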

CHAPTER 6

Test Databases and their Access Process In this chapter we introduce our example databases, by describing their physical and logical components. The selection criteria for these test databases, and the associated constraints in accessing and using them are discussed here. We investigate the tools available for our test DBMSs. We then apply our re-engineering process to our test databases to show its applicability. Lastly, we refer to the organisation of system information in a relational DBMS and describe how we identify and access information about entities, attributes, relationships and constraints in our test DBMSs. 6.1 Introduction to our Test Databases In the following sub-sections we introduce our test databases. We first identify the main requirements for these databases. This is followed by a description of associated constraints and their role in database access and use. Finally, we identify how we established our test databases and the DBMSs we have used for this purpose. 6.1.1 Main Requirements The design of our test databases was based on two important requirements. Firstly, to establish a suitable legacy test database environment to enable us to demonstrate the practicability of our re-engineering and migration techniques. Secondly, to establish a heterogeneous database environment for the test databases to enable us to test the generality of our approach. As described in section 2.1.2.1, the problems of legacy databases apply mostly to long existing database systems. Most of these systems use traditional file-based methods or an old version of the hierarchical, network or relational database models for their database management. Due to complexity and availability of resources, which are discussed in section 6.1.2, we decided to focus on a particular type of database model to apply our legacy database enhancement and migration techniques. Test databases were developed for the chosen database model, while establishing the required levels of heterogeneity in the context of that model. 6.1.2 Availability and Choice of DBMSs At University of Wales College of Cardiff, where our research was conducted, there were only a few application databases. These included systems used to process student and staff information for personnel and payroll applications. This information was centrally processed and managed using third party software. Due to the licence conditions on this software, the university did not have the authority to modify and improve it on their own. Also, most of this software was developed with 3GL technology using files to manipulate information. There were recent enhancements which had been developed using 4GL tools. However, no proper DBMS had been used to build any of these applications, although future plans included using Oracle for new

database applications. These databases were therefore not well suited to our work. Other than the personnel and payroll applications there were a few departmental and specific project based applications. Some of these were based on DBMSs (such as Oracle), although their application details were not readily available. Information gathered from these sources revealed that not many database applications existed in our university environment and gaining permission to access them for research purposes was practically impossible. Also, until we obtained access and investigated each application we would not be able to fully justify its usefulness as a test database, as it might not fulfil all our requirements. Therefore, it was decided to initially design and develop our own database applications to suit our requirements and then if possible to test our system on any other available real world databases. Access to DBMSs was restricted to products running on Personal Computers (PCs) and some Unix systems. Most of these products were based on the relational data model and some on the object-oriented data model. The older database models - hierarchical and network - were no longer being used or available as DBMSs. Also, the available DBMSs were in their latest versions, making the task of building a proper legacy database environment more difficult. The relational model has been in use for database applications over the last 20 years and currently is the most widely used data model. During this time many database products and versions have been used to manage these database applications. As a result, many of them are now legacy systems and their users need assistance to enhance and migrate them to modern environments. Thus the choice of the relational data model for our tests is reasonable, although one may argue that similar requirements exist for database applications which have been in use prior to this data model gaining its pre-eminent position. Due to the superior power of workstations as compared to PC’s it was decided to work on these Unix platforms and to build test databases using the available relational DBMSs, as our main aim was simply to demonstrate the applicability of our approach. Two popular commercial relational DBMSs, namely: INGRES and Oracle, were accessible via the local campus network. We selected these two products to implement our test databases as they are leading, commercially-established products which have been in existence since the early days of relational databases. The differences between these two database products made them ideal for representing heterogeneity within our test environment. Both products supported the standard database query language, SQL. However, only one of them (Oracle) conforms to the current SQL-2 standard. Oracle is also a leading relational database product, along with SYBASE and INFORMIX, on Unix platforms [ROS94]. As described in section 3.8, SQL standards have been regularly reviewed and hence it is also important to choose a database environment that will support at least some of the modern concepts, such as object-oriented features. In recent database products these features have been introduced either via extended relational or via object-oriented database technology. Obviously the choice of an extended relational data model is the most suitable for our purposes as it incorporates natural extensions to the relational data model. 
Hence we selected POSTGRES, which is a research DBMS providing modern object-oriented features in an extended relational model, as our third test DBMS. 6.1.3 Choice of database applications and levels of heterogeneity

Designing a single large database application as our test database would result in one very complex database application. To overcome the need to devise and manage a single complex application to demonstrate all of our tasks, we decided to build a series of simple applications and later to provide a single integrated application derived from these simple database applications. Our own university environment was chosen to construct these test database systems as we were able to perform a detailed system study in this context and collect sufficient information to create appropriate test applications. Typical text book examples [MCF91, ROB93, ELM94, DAT95] were also used to verify the contents chosen for our test databases. Three databases representing college, faculty and department information were chosen for our simple test databases. To ensure simplicity, no more than ten entities were included for each of these databases. However, each was carefully designed to enable us to thoroughly test our ideas, as well as to represent three levels of heterogeneity within our test systems. These systems were implemented on different computer systems using different DBMSs so that they represented heterogeneity at the physical level. INGRES, POSTGRES and Oracle running on DEC station, SUN Sparc and DEC Alpha, respectively, were chosen. The differences in characteristics among these three DBMSs introduced heterogeneity at the logical level. Here, Oracle conforms to the current SQL/92 standard and supports most modern relational data model requirements. INGRES and POSTGRES, although they are based on the same data model, have some basic differences in handling certain database functions such as integrity constraints. These two DBMSs use a rule subsystem to handle constraints, which is a different approach from that proposed by the SQL standards. POSTGRES, which is regarded as an extended relational DBMS having many object-oriented features, is also regarded as an object-oriented DBMS. These inherent differences ensure the initial heterogeneity of our environment at the logical level. Our test databases were designed to highlight these logical differences, as we shall see. 6.2 Migration Support Tools for our Test DBMSs Prior to creating and applying our approach it was useful to investigate the availability of tools for our test DBMSs to assist the migration of databases. As indicated in the following sub-sections, only a few tools are provided to assist this process and they have limited functionality that is inadequate to assist all the stages of enhancing and migrating a legacy database service. 6.2.1 INGRES INGRES permits manipulation of data in non-INGRES databases [RTI92] and the development of applications that are portable across all INGRES servers. This type of data manipulation is done through an INGRES gateway. INGRES/Open SQL, a subset of INGRES SQL, is used for this purpose. The type of services provided by this gateway include [RTI92]:

• Translation between Open SQL and non-INGRES SQL DBMS query interfaces such as Rdb/VMS (for DEC) or DB2 (for IBM).

• Conversion between INGRES data types and non-INGRES data types.

• Translation of non-INGRES DBMS error messages to INGRES generic error types.

This functionality is useful in creating a target database service. However, as the target databases supported by INGRES/Open SQL do not include Oracle and POSTGRES, this tool was not helpful to us. The PRODBI interface for INGRES [LUC93] allows access to INGRES databases from Prolog code. This tool is useful in our work as our main processing is done using Prolog, and we have therefore used it to implement our constraint enforcement process. Meta-data access from INGRES databases could also have been done using PRODBI; however, due to its unavailability at the start of our project we implemented this using C programs with embedded SQL code. INGRES does not provide any CASE tools that assist in reverse-engineering or analysing INGRES applications. Its only support was in the form of a 4GL environment [RTI90b], which is useful for INGRES application development, but not for any INGRES based legacy ISs and their reverse engineering.

6.2.2 Oracle

The latest version of Oracle (i.e. version 7) is an RDBMS that conforms to the SQL-2 standard. Hence, this DBMS supports most modern database functions, including the specification, representation and enforcement of integrity constraints. Oracle provides migration tools to convert databases from either of its two preceding versions (i.e. versions 5 and 6) to version 7. Oracle, a leading database product on the Unix platform [ROS94], has its own tool set to assist in developing Oracle based application systems [KRO93]. This includes the screen-based application development tools SQL*Forms and SQL*Menu, and the report-writing product SQL*Report. These tools assist in implementing Oracle applications but do not provide any form of support to analyse the system being developed. To overcome this, a series of CASE products is provided by Oracle (i.e. CASE*Bridge, CASE*Designer, CASE*Dictionary, CASE*Generator, CASE*Method and CASE*Project) [BAR90]. The objective of these tools is to assist users by supporting a structured approach to the design, analysis and implementation of an Oracle application. CASE*Designer provides different views of the application using Entity Relationship Diagrams, Function Hierarchies, Dataflow Diagrams and matrix handlers to show the inter-relationships between different objects held in an Oracle dictionary. CASE*Dictionary maintains complete definitions of the requirements and the detailed design of the application. CASE*Generator uses these definitions to generate the code for the target environment, and CASE*Bridge is used to exchange information with other Oracle CASE tools. However, such functions can be performed only on applications developed using these tools and not on an Oracle legacy database developed in any other way, which means they are of no help with the current legacy problem. Hence, Oracle CASE tools are useful when developing new applications but cannot be used to re-engineer a pre-existing Oracle application, unless that original application was developed in an Oracle CASE environment. This limitation is shared by most CASE tools [COMP90, SHA93]. Currently, Oracle and other vendors are working on overcoming this limitation, and
Oracle's open systems architecture for heterogeneous data access [HOL93] is a step towards this. ANSI standard embedded SQL [ANSI89b] is used for application portability, along with a set of function calls. In Oracle's open systems architecture, standard call level interfaces are used to dynamically link and run applications on different vendor engines without having to recompile the application programs. This functionality is a subset of Microsoft's ODBC [RIC94, GEI95] and the aim is to provide a transparent gateway to access non-Oracle SQL database products (e.g. IMS, DB2, SQL/DS and VSAM for IBM machines, or RMS and Rdb for DEC) via Oracle's SQL*Connect. Transparent gateway products are machine and DBMS dependent in that they need to be recompiled or modified to run on different computers and to support access to a variety of DBMSs. In the past, developers had to create special code for each type of database their users wanted to access. This limitation can now be overcome using a tool like ODBC to permit access to multiple heterogeneous databases. Most database vendors have development strategies which include plans to interoperate with open systems vendors as well as proprietary database vendors. This facility is being implemented using the SQL Access Group's17 RDA (Remote Database Access) standard. As a result, products such as Microsoft's Open Database Connectivity (ODBC), INFORMIX-Gateway [PUR93] and Oracle Transparent Gateway [HOL93] support some form of connectivity between their own and other products. For our work with Oracle, we developed our own C programs with embedded SQL to access and update our prototype Oracle database. There is a version of PRODBI for Oracle that allows access to Oracle databases from Prolog code, which was used in this project.

6.2.3 POSTGRES

POSTGRES was developed at the University of California at Berkeley as a research-oriented relational database extended with object-oriented features. Since 1994 a commercial version called ILLUSTRA [JAE95] has been available. However, POSTGRES has yet to address the inter-operability and other issues associated with our migration approach.

6.3 The Design of our Test University Databases

6.3.1 Background

In our university system, we assume that departments and faculties have common user requirements and ideally could share a common database. Based on this assumption we have developed our test database schema to contain shared information. Hence, our three simple test databases, known as College, Faculty and Department, can be easily integrated. A complete integration of these three databases will result in the generation of a global University database schema.

17 SQL Access Group (SAG) is a non-profit corporation open to vendors and users that develops technical specifications to enable multiple SQL-based RDBMSs and application tools to interoperate. The specifications defined by the SAG consist of a combination of current and evolving standards that include ANSI SQL, ISO RDA and X/Open SQL.


However, in practice, schemas used by different departments and faculties may differ, making the task of integration more difficult and bringing up more issues of heterogeneity. As our work is concerned with legacy database issues in a heterogeneous environment, and not with integrating or resolving conflicts that arise in such environments, the differences that exist within this type of environment were not considered. Hence, we shall be looking at each of these databases independently. The main advantage of being able to integrate our test databases easily was that we could thereby readily generate a complex database schema which could also be used to test our ideas.

Each test database was designed to represent a specific kind of information. For example, the Faculty and Department databases represent all kinds of structural relationships (e.g. 1:1, 1:M and M:N; strong and weak relationships and entity types). The College database represents specialisation / generalisation structures, while the University database acts as a global system consisting of all the sub-database systems. This allows all sub-database systems, i.e. College, Faculty and Department, to act as a distributed system - the University database system. This is illustrated in figure 6.1 and is further described in section 6.3.2. We also need to be able to specify and represent all the constraint types discussed in section 3.5, as our re-engineering techniques are based on constraints. These were chosen to reflect actual database systems as closely as possible. We introduce these constraints in section 6.4 after identifying the entities of each of our test databases.

Figure 6.1: The UWCC Database

[Diagram: the College Database at the top level, with a Faculty Database (the FPS Database) below it and the Departmental Databases (COMMA Database and MATHS Database) at the lowest level.]

6.3.2 The UWCC Database

We shall use the term UWCC database to refer to our example university database, as the data of our system is based on that used at the University of Wales College of Cardiff (UWCC). The UWCC database consists of many distributed database sites, each used to perform the functions either of a particular department or school, or of a faculty, or of the college. The functions of the college are performed using the database located at the main college, which we shall refer to as the College database. The College consists of five faculties, each having its own local database located at the administrative section of the faculty. Our test database has been populated for one faculty, namely the Faculty of Physical Science (FPS), and we shall refer to
this database as the Faculty database. The College has 28 departments or schools, with five of them belonging to FPS [UWC94a, UWC94b]. Our test databases were populated using data from two departments of FPS, namely: The Department of Computing Mathematics (COMMA), which is now called the Department of Computer Science, and The Department of Mathematics (MATHS). These are referred to as Department databases. The component databases of our UWCC database form a hierarchy as shown in figure 6.1. This will let us demonstrate how the global University database formed by integrating these components incorporates all the functions present in the individual databases. In the next section we identify our test databases by summarising their entities and specific features.

Table 6.1: Entities used in our test databases

Entity (Meaning)                         College  Faculty  Department  University
University (university data)                x        -         -           x
Employee (university employees)             x        x         x           x
Student (university students)               x        -         x           x
EmpStudent (employees as students)          x        -         -           x
College (college data)                      x        -         -           x
Faculty (faculty data)                      x        x         -           x
Department (department data)                x        x         x           x
Committee (faculty committees)              -        x         -           x
ComMember (committee members)               -        x         -           x
Teach (subjects taught by staff)            -        -         x           x
Course (offered by the department)          -        -         x           x
Subject (subject details)                   -        -         x           x
Option (subjects for each course)           -        -         x           x
Take (subjects taken by each student)       -        -         x           x

6.3.3 The Test Database Schemas

The fourteen entities shown in table 6.1 were represented in our test database schemas. As we are not concerned with the heterogeneity issues associated with schema integration, we have simplified our local schemas by using the same attribute definitions in schemas having the same entity name. The attribute definitions of all our entities are given in figure 6.2. Each test database schema is defined using the data definition language (DDL) of the chosen DBMS, and is governed by a set of rules to establish integrity within the database. In the context of a legacy system these rules may not appear as part of the database schema. In this situation our approach is to supply them externally via our constraint enhancement process. Therefore we present the set of constraints defined on our test databases separately, so that the initial state of these databases conforms to the database structure of a typical legacy system.

6.3.4 Features of our Test Database Schemas

Among the specific features represented in our test databases are relationship types which form weak and link entities, cardinality constraints which highlight the behaviour of entities, and inheritance and aggregation which form specialised relationships among entities. Where not already present, these features are introduced into our test database schemas by enhancing them with new constraints.


a) Relationship types

Our reverse-engineering process uses knowledge of constraint definitions to construct a conceptual model for a legacy database system. The foreign key definitions of table 6.4, along with their associated primary key (cf. table 6.2) and uniqueness constraints (cf. table 6.3), are used to determine the relationship structures of a conceptual model. In this section we look at our foreign key constraint definitions to identify the types of relationship formed in our test database schemas. The check constraints of table 6.5 are used purely to restrict the domain values of our test databases. The foreign keys of table 6.4 are processed to find relationships according to our approach described in section 5.4.1. Here we identify keys defined on primary key attributes to determine M:N and 1:N weak relationships. The remaining keys will form 1:N or 1:1 relationships depending on the uniqueness property of the attributes of these keys. Table 6.6 shows all the relationships found in our test databases, together with the criteria used to determine each relationship type according to section 5.4.1.
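This classification can be summarised as a small set of rules. The following Prolog fragment is a minimal, self-contained sketch of the idea; the predicate names and the two sample constraints are hypothetical illustrations, it is not the actual CCVES code, and it does not cover every special case of section 5.4.1.

:- use_module(library(lists)).

% Sample constraint meta-data (hypothetical predicate names; two entities only).
primary_key(committee, [name, faculty]).
primary_key(faculty,   [code]).
foreign_key(committee, [faculty],     faculty).
foreign_key(committee, [chairPerson], employee).
foreign_key(faculty,   [dean],        employee).
unique_key(faculty,    [dean]).

% An attribute of Entity that belongs to some foreign key of that entity.
fk_attribute(Entity, Attr) :-
    foreign_key(Entity, FkAttrs, _),
    member(Attr, FkAttrs).

% A link entity's primary key is made up entirely of foreign-key attributes.
link_entity(Entity) :-
    primary_key(Entity, PkAttrs),
    forall(member(A, PkAttrs), fk_attribute(Entity, A)).

% Classify the relationship formed by each foreign key.
fk_relationship(Entity, FkAttrs, Ref, Type) :-
    foreign_key(Entity, FkAttrs, Ref),
    classify(Entity, FkAttrs, Type).

classify(Entity, FkAttrs, 'M:N') :-          % foreign key within the key of a link entity
    link_entity(Entity),
    primary_key(Entity, Pk), subset(FkAttrs, Pk), !.
classify(Entity, FkAttrs, '1:N weak') :-     % foreign key on part of the primary key
    primary_key(Entity, Pk), subset(FkAttrs, Pk), !.
classify(Entity, FkAttrs, '1:1') :-          % foreign key with a uniqueness constraint
    unique_key(Entity, FkAttrs), !.
classify(_, _, '1:N').                       % any other foreign key

For the sample facts, fk_relationship(committee, Fk, Ref, Type) classifies the faculty foreign key of Committee as a 1:N weak relationship and its chairPerson foreign key as an ordinary 1:N relationship, while the dean key of Faculty is classified as 1:1, in line with table 6.6.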


CREATE TABLE University ( Office char(50) NOT NULL );
CREATE TABLE Employee ( Name char(25) NOT NULL, Address char(30) NOT NULL, BirthDate date(7) NOT NULL, Gender char(1) NOT NULL, EmpNo char(9) NOT NULL, Designation char(30) NOT NULL, WorksFor char(5) NOT NULL, YearJoined integer(2) NOT NULL, Room char(9), Phone char(13), Salary decimal(8,2) );
CREATE TABLE Student ( Name char(20) NOT NULL, Address char(30) NOT NULL, BirthDate date(7) NOT NULL, Gender char(1) NOT NULL, CollegeNo char(9) NOT NULL, Course char(5) NOT NULL, Department char(5) NOT NULL, Tutor char(9), RegYear integer(2) NOT NULL );
CREATE TABLE EmpStudent ( CollegeNo char(9) NOT NULL, EmpNo char(9) NOT NULL, Remark char(10) );
CREATE TABLE College ( Code char(5) NOT NULL, Building char(20) NOT NULL, Name char(40) NOT NULL, Address char(30), Principal char(9), Phone char(13) );
CREATE TABLE Faculty ( Code char(5) NOT NULL, Building char(20) NOT NULL, Name char(40) NOT NULL, Address char(30), Secretary char(9), Phone char(13), Dean char(9) );

CREATE TABLE Department ( DeptCode char(5) NOT NULL, Building char(20) NOT NULL, Name char(50) NOT NULL, Address char(30), Head char(9), Phone char(13), Faculty char(5) NOT NULL );
CREATE TABLE Committee ( Name char(15) NOT NULL, Faculty char(5) NOT NULL, Chairperson char(9) );
CREATE TABLE ComMember ( ComName char(15) NOT NULL, MemName char(9) NOT NULL, Faculty char(5) NOT NULL, YearJoined integer(2) NOT NULL );
CREATE TABLE Teach ( Lecturer char(9) NOT NULL, Course char(5) NOT NULL, Subject char(5) NOT NULL, Room char(9) );
CREATE TABLE Course ( CourseNo char(5) NOT NULL, Name char(35) NOT NULL, Coordinator char(9), Offeredby char(5) NOT NULL, Type char(1) NOT NULL, Length char(10), Options integer(2) );
CREATE TABLE Subject ( SubNo char(5) NOT NULL, Name char(40) NOT NULL );
CREATE TABLE Option ( Course char(5) NOT NULL, Subject char(5) NOT NULL, Year integer(2) NOT NULL );
CREATE TABLE Take ( CollegeNo char(9) NOT NULL, Subject char(5) NOT NULL, Year integer(2) NOT NULL, Grade char(1) );

Figure 6.2: Test database schema entities and their attribute descriptions

We can see that the selected constraints cover four of the five relationship identification categories of figure 5.3. The remaining category (i.e. ‘b’) is a special case of category ‘a’ which could be represented in the entity Take by introducing two separate foreign keys to link entities Course and Subject, instead of linking with the entity Option. However, as stated in section 5.4.1, n-ary relationships are simplified whenever possible. Hence, in the test examples presented here we do not show this type, to reduce the complexity of our diagrams. In appendix C we present the graphical view of all our test databases. The figures there show the graphical representation of all the relationships identified in table 6.6.

b) Inheritance

We have introduced two inheritance structures, one representing single inheritance and the other multiple inheritance (see figure 5.2 and table 6.7).
To do so, two generalised entities, namely Office and Person, have been introduced (see figure 6.3). Entities College, Faculty and Department now inherit from Office, while entities Employee and Student inherit from Person. Entity EmpStudent has been modified to become a specialised combination of Student and Employee. Figure 6.3 also contains all the constraints associated with these entities.

Constraint                                 Entity(s)
PRIMARY KEY (office)                       University
PRIMARY KEY (empNo)                        Employee
PRIMARY KEY (collegeNo)                    Student, EmpStudent
PRIMARY KEY (code)                         College, Faculty
PRIMARY KEY (deptCode)                     Department
PRIMARY KEY (name, faculty)                Committee
PRIMARY KEY (comName, memName, faculty)    ComMember
PRIMARY KEY (lecturer, course, subject)    Teach
PRIMARY KEY (courseNo)                     Course
PRIMARY KEY (subNo)                        Subject
PRIMARY KEY (course, subject)              Option
PRIMARY KEY (collegeNo, subject, year)     Take

Table 6.2: Primary Key constraints of our test databases

Constraint                  Entity(s)
UNIQUE (empNo)              EmpStudent
UNIQUE (name)               College, Department, Faculty
UNIQUE (principal)          College
UNIQUE (dean)               Faculty
UNIQUE (head)               Department
UNIQUE (name, offeredBy)    Course

Table 6.3: Uniqueness Key constraints of our test databases

c) Cardinality constraints

We have introduced some cardinality constraints on our test databases to show how these can be specified for a legacy database. In table 6.8 we show those used in the College database. Here the cardinality constraints for worksFor and faculty have been explicitly specified (see figure 6.3), while the others (inCharge, tutor and dean) have been derived using their relationship types. For example, inCharge and tutor are 1:N relationships while dean is a 1:1 relationship. Our conceptual diagrams incorporate these constraint values (cf. appendix C and figure 5.2).
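Before such specifications can be enforced (cf. the cardinality query of figure 6.5, which tests COUNT(*) against the lower bound of worksFor), textual forms such as 4+ and 2-12 have to be interpreted as bounds. The following is a minimal Prolog sketch of such a parse; the predicate name is hypothetical, it is not the CCVES representation, and other notations appearing in table 6.8 are not covered.

% cardinality_bounds(+Spec, -Min, -Max): interpret a cardinality
% specification, e.g. '4+' -> 4..inf, '2-12' -> 2..12, '3' -> 3..3.
cardinality_bounds(Spec, Min, inf) :-
    sub_atom(Spec, B, 1, 0, '+'), !,          % trailing '+' means no upper bound
    sub_atom(Spec, 0, B, 1, MinAtom),
    atom_number(MinAtom, Min).
cardinality_bounds(Spec, Min, Max) :-
    sub_atom(Spec, B, 1, A, '-'),
    B > 0, !,                                 % 'Min-Max' range
    sub_atom(Spec, 0, B, _, MinAtom),
    sub_atom(Spec, _, A, 0, MaxAtom),
    atom_number(MinAtom, Min),
    atom_number(MaxAtom, Max).
cardinality_bounds(Spec, N, N) :-             % an exact value
    atom_number(Spec, N).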

Constraint                                             Entity(s)
FOREIGN KEY (course) REFERENCES Course                 Student, Option
FOREIGN KEY (department) REFERENCES Department         Student
FOREIGN KEY (tutor) REFERENCES Employee                Student
FOREIGN KEY (dean) REFERENCES Employee                 Faculty
FOREIGN KEY (faculty) REFERENCES Faculty               Committee
FOREIGN KEY (chairPerson) REFERENCES Employee          Committee
FOREIGN KEY (comName, faculty) REFERENCES Committee    ComMember
FOREIGN KEY (memName) REFERENCES Employee              ComMember
FOREIGN KEY (lecturer) REFERENCES Employee             Teach
FOREIGN KEY (course, subject) REFERENCES Option        Teach
FOREIGN KEY (coordinator) REFERENCES Employee          Course
FOREIGN KEY (offeredBy) REFERENCES Department          Course
FOREIGN KEY (subject) REFERENCES Subject               Option, Take
FOREIGN KEY (collegeNo) REFERENCES Student             Take

Table 6.4: Foreign Key constraints of our test databases

Page 101: Assisting Migration and Evolution of Relational Legacy Databases

Page 100

Constraint                                               Entity(s)
CHECK (yearJoined >= 21 + birthDate INTERVAL YEAR)       Employee
CHECK (salary BETWEEN 200 AND 3000 OR salary IS NULL)    Employee
CHECK (regYear >= 18 + birthDate INTERVAL YEAR)          Student
CHECK (phone IS NOT NULL)                                College, Department, Faculty
CHECK (type IN ('U','P','E','O'))                        Course
CHECK (options >= 0 OR options IS NULL)                  Course
CHECK (year BETWEEN 1 AND 7)                             Option

Table 6.5: Check constraints of our test databases

d) Aggregation

A university has many offices (e.g. faculties, departments, etc.) and an office belongs to a university. Also, the attribute office is the key of entity University. Hence, entities University and Office participate in a 1:1 relationship. However, it is natural to represent this as a specialised relationship by considering office of University to be of type set. Then University and Office participate in an aggregation relationship, which is a special form of binary relationship. We introduce this type to show how specialised constraints could be introduced into a legacy database system. As shown in figure 6.3, we have used the keyword REF SET to specify this type of relationship. In this case, as office is the key of University, a foreign key definition on office (see figure 6.3) will treat University as a link entity and hence it can be classified as a special relationship.

Attribute(s)                  Entity       Relationship   Entity(s)              Criteria
course                        Student         1 : N       Course                   (d)
department                    Student         1 : N       Department               (d)
tutor                         Student         1 : N       Employee                 (d)
dean                          Faculty         1 : 1       Employee                 (e)
faculty                       Committee       1 : N       Faculty                  (c)
chairPerson                   Committee       1 : N       Employee                 (d)
comName, faculty, memName     ComMember       M : N       Committee, Employee      (a)
lecturer, course, subject     Teach           M : N       Employee, Option         (a)
coordinator                   Course          1 : N       Employee                 (d)
offeredBy                     Course          1 : N       Department               (d)
course, subject               Option          M : N       Course, Subject          (a)
collegeNo, subject            Take            M : N       Student, Subject         (a)

Table 6.6: Relationship types of our test databases

Entity        Inherited Entities
Employee      Person
Student       Person
EmpStudent    Student, Employee
College       Office
Faculty       Office
Department    Office

Table 6.7: Inherited Entities

Participating   Referencing   Referenced   Referencing   Referenced
Attribute       Entity        Entity       Cardinality   Cardinality
inCharge        Office        Employee     0+            -1
worksFor        Employee      Office       4+            1
tutor           Student       Employee     0+            -1
dean            Faculty       Employee     -1            -1
faculty         Department    Faculty      2-12          1

Table 6.8: Cardinality constraints of College database


6.4 Constraints Specification, Enhancement and Enforcement

In the context of legacy systems, our test database schemas (cf. figure 6.2) will not explicitly contain most of the constraints introduced in tables 6.2 to 6.5, 6.7 and 6.8. Thus we need to specify them using the approach described in section 5.6. In figure 6.3 we present these constraints for the College database.

CREATE TABLE Office (code, siteName, unitName, address, inCharge, phone) AS SELECT code, building, name, address, principal, phone FROM College UNION SELECT code, building, name, address, secretary, phone FROM Faculty UNION SELECT deptCode, building, name, address, head, phone FROM Department;

ALTER TABLE Office ADD CONSTRAINT Office_PK PRIMARY KEY (code) ADD CONSTRAINT Office_Unique_name UNIQUE (siteName, unitName) ADD CONSTRAINT Office_FK_Staff FOREIGN KEY (inCharge) REFERENCES Employee ADD UNIQUE (phone);

ALTER TABLE College ADD UNDER Office WITH (siteName AS building, unitName AS name, inCharge AS principal);

ALTER TABLE Faculty ADD UNDER Office WITH (siteName AS building, unitName AS name, inCharge AS secretary) ADD FOREIGN KEY (faculty) CARDINALITY (2-12) REFERENCES Faculty ;

ALTER TABLE Department ADD UNDER Office WITH (code AS deptCode, siteName AS building, unitName AS name, inCharge AS head);

CREATE VIEW College_Office AS SELECT * FROM Office WHERE code in (SELECT code FROM College);

CREATE VIEW Faculty_Office AS SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, f.dean FROM Office o, Faculty f WHERE o.code = f.code;

CREATE VIEW Dept_Office AS SELECT o.code, o.siteName, o.unitName, o.address, o.inCharge, o.phone, d.faculty FROM Office o, Department d WHERE o.code = d.deptCode;

ALTER TABLE University ALTER COLUMN office REF SET(Office) NOT NULL | ADD FOREIGN KEY (office) REFERENCES Office ;

CREATE TABLE Person AS SELECT name, address, birthDate, gender FROM Employee UNION SELECT name, address, birthDate, gender FROM Student;

ALTER TABLE Person ADD PRIMARY KEY (name, address, birthDate) ADD CHECK (gender IN ('M', 'F'));

ALTER TABLE Employee ADD UNDER Person ADD CONSTRAINT Employee_FK_Office FOREIGN KEY (worksFor) CARDINALITY (4) REFERENCES Office;

ALTER TABLE Student ADD UNDER Person;

ALTER TABLE EmpStudent ADD UNDER Student, Employee ADD CHECK (tutor <> empNo OR tutor IS NULL);

Figure 6.3: Enhanced constraints of the College database in extended SQL-3 syntax

When the above constraints are not supported by a legacy database management system, we need to be able to store them in the database using our knowledge augmentation techniques (cf. section 5.7). In figure 6.4 we present selected instances used in our knowledge-based tables to represent the enhanced constraints for the College database. The selected instances cover all the possible constraint types, so we have not reproduced every enhanced constraint of figure 6.3.


Our constraint enforcement process (cf. section 5.8) allows users to verify the extent to which the data in a database conforms to its enhanced constraints. The different types of queries used for this process in the College database are given in figure 6.5.

Table_Constraint { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Constraint_Type, Is_Deferrable, Initially_Deferred }
('Uni_db', 'Office_PK', 'Col', 'Office', 'PRIMARY KEY', 'NO', 'NO')
('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'UNIQUE', 'NO', 'NO')
('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'FOREIGN KEY', 'NO', 'NO')
('Uni_db', 'Employee_PK', 'Col', 'Employee', 'PRIMARY KEY', 'NO', 'NO')
('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'FOREIGN KEY', 'NO', 'NO')
('Uni_db', 'College_phone', 'Col', 'College', 'CHECK', 'NO', 'NO')

Referential_Constraint { Constraint_Id, Constraint_Name, Unique_Constraint_Id, Unique_Constraint_Name, Match_Option, Update_Rule, Delete_Rule }
('Uni_db', 'Office_FK_Employee', 'Col', 'Employee_PK', 'NONE', 'NO ACTION', 'NO ACTION')
('Uni_db', 'Employee_FK_Office', 'Uni', 'Office_PK', 'NONE', 'NO ACTION', 'NO ACTION')

Key_Column_Usage { Constraint_Id, Constraint_Name, Table_Id, Table_Name, Column_Name, Ordinal_Position }
('Uni_db', 'Office_PK', 'Col', 'Office', 'Code', 1)
('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'siteName', 1)
('Uni_db', 'Office_Unique_name', 'Col', 'Office', 'unitName', 2)
('Uni_db', 'Office_FK_Staff', 'Col', 'Office', 'inCharge', 1)
('Uni_db', 'Employee_PK', 'Col', 'Employee', 'empNo', 1)
('Uni_db', 'Employee_FK_Office', 'Col', 'Employee', 'worksFor', 1)

Check_Constraint { Constraint_Id, Constraint_Name, Check_Clause }
('Uni_db', 'College_phone', 'phone is NOT NULL')

Sub_Tables { Table_Id, Sub_Table_Name, Super_Table_Name }
('Uni_db', 'College', 'Office')

Altered_Sub_Table_Columns { Table_Id, Sub_Table_Name, Sub_Table_Column, Super_Table_Name, Super_Table_Column }
('Uni_db', 'College', 'building', 'Office', 'siteName')
('Uni_db', 'College', 'name', 'Office', 'unitName')
('Uni_db', 'College', 'principal', 'Office', 'inCharge')

Cardinality_Constraint { Constraint_Id, Constraint_Name, Referencing_Cardinal }
('Uni_db', 'Office_FK_Employee', '0+')
('Uni_db', 'Employee_FK_Office', '4+')

Figure 6.4: Augmented tables with selected sample data for the College database

6.5 Database Access Process

Having described the application of our re-engineering processes using our test databases, we identify the tools developed and used to access those databases. The database access process is the initial stage of our application. This process extracts meta-data from legacy databases and represents it internally so that it can be used by other stages of our application. During re-engineering we need to access a database at three different stages: to extract meta-data and any existing constraint knowledge specifications to commence our reverse-engineering process; to add enhanced knowledge to the database; and to verify the extent to which the data conforms to the existing and enhanced constraints. We also need to access the database during the migration process. In all these cases, the information we require is held in either system or user-defined tables. Extraction of information from these tables can be done using the query language of the database; thus what we need for this stage is a mechanism that will allow us to issue queries and capture their responses.


Figure 6.5: Selected constraints to be enforced for the college database in SQL

Constraint Type   Constraint Violation Instances

Primary Key       SELECT code, COUNT(*) FROM Office GROUP BY code HAVING COUNT(*) > 1
                  UNION SELECT code, 1 FROM Office WHERE code IS NULL

Unique            SELECT dean, COUNT(*) FROM Faculty GROUP BY dean HAVING COUNT(*) > 1

Referential       SELECT * FROM Office WHERE NOT (inCharge IS NULL OR inCharge IN (SELECT empNo FROM Employee))

Check             SELECT * FROM College WHERE NOT (phone IS NOT NULL)

Cardinality       SELECT worksFor, COUNT(*) FROM Employee GROUP BY worksFor HAVING COUNT(*) < 4
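These verification queries are generated from Prolog rules, as explained below. As a minimal illustration of the idea, the following sketch assembles the primary-key check of figure 6.5 from constraint facts corresponding to those of figure 6.4; the lower-case predicate names are hypothetical, only single-column keys are handled, and this is not the actual CCVES code.

% Constraint meta-data in the style of figure 6.4 (illustrative facts).
table_constraint('Office_PK', 'Office', 'PRIMARY KEY').
key_column('Office_PK', 'Office', code).

% Assemble the SQL that lists primary-key violations for one constraint.
pk_violation_query(Constraint, Query) :-
    table_constraint(Constraint, Table, 'PRIMARY KEY'),
    key_column(Constraint, Table, Col),
    atomic_list_concat(['SELECT ', Col, ', COUNT(*) FROM ', Table,
                        ' GROUP BY ', Col, ' HAVING COUNT(*) > 1',
                        ' UNION SELECT ', Col, ', 1 FROM ', Table,
                        ' WHERE ', Col, ' IS NULL'], Query).

Calling pk_violation_query('Office_PK', Q) yields the Primary Key query shown in figure 6.5; similar rules can be written for the other constraint types.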

As our system implementation is in Prolog, the necessary query statements are generated from Prolog rules. The PRODBI interface allows access to several relational DBMSs, namely Oracle, INGRES, INFORMIX and SYBASE [LUC93], from Prolog as if their relational tables were in the Prolog environment. The availability of INGRES PRODBI enabled us to use this tool to communicate with our INGRES test databases in the latter stages of our project. This interface has a performance as good as that of INGRES/SQL and hence, to the user, database interaction is fully transparent. Such Prolog database interface tools are currently commercially available only for relational database products. This means that we were not in a position to use this approach to perform database interactions for our POSTGRES test databases. Tools such as ODBC allow access to heterogeneous databases. This option would have been ideal for our application, but was not considered due to its unavailability within our development time scale.

As far as our work is concerned, we needed a facility to issue specific types of query and obtain the response in such a way that Prolog could process the responses without having to download the entire database. The PRODBI interfaces for relational databases perform this task efficiently, and also have many other useful data manipulation features. Due to the absence of any PRODBI-equivalent tools to access non-relational or extended-relational DBMSs, we decided to develop our own version for POSTGRES. The functionality of our POSTGRES tool is to accept a POSTGRES DML statement (i.e. a POSTQUEL query statement) and produce the results for that query in a form that is usable by our (Prolog based) system. For Oracle, a PRODBI interface is available commercially, and to use it with our system the only change we would have to make is to load the Oracle library. As far as our code is concerned there is no change in any other commands, since they support the same rules as in INGRES. However, at Cardiff only the PRODBI interface for INGRES was available, and even this only in the latter stages of our project. Therefore we developed our own tool to perform this functionality for INGRES and Oracle databases. The implementation of this tool was not fully generalised, given that such tools were commercially available. When developing this tool we were not too concerned by performance degradation, as our aim was to test functionality, not performance. Also, in the case of INGRES we have since confirmed performance by using a commercially developed PRODBI tool with an SQL equivalent query facility.

6.5.1 Connecting to a database
To establish a connection with a database, the user needs to specify the site name (i.e. the location of the database), the DBMS name (e.g. Oracle v7) and the database name (e.g. department) to ensure a unique identification of a database located over a network. The site name is the address of the host machine (e.g. thor.cf.ac.uk) and is used to gain access to that machine via the network. The type of the named DBMS identifies the kind of data to be accessed, and the name18 of the database tells us which database is to be used in the extraction process. In our system (CCVES), we provide a pop-up window (cf. left part of figure 6.6) to select and specify these requirements. Here, a set of commonly used site names and the DBMSs currently supported at a site are embedded in the menu to make this task easy. The specification of new site and database names can also be done via this pop-up menu (cf. right part of figure 6.6).

6.5.2 Meta-data extraction

Once a physical connection to a database is achieved it is possible to commence the meta-data extraction process. This process is DBMS dependent, as the kind of meta-data represented in a database and the methods of retrieving it vary between DBMSs. The information to be extracted is recorded in the system catalogues (i.e. data dictionaries) of the respective databases. The most basic type of information is entity and attribute names, which are common to all DBMSs. However, information about different types of constraints is specific to DBMSs and may not be present in legacy database system catalogues.

Figure 6.6: Database connection process of CCVES

The organisation of meta-data in databases differs between DBMSs, although all relational database systems use some table structure to represent this information. For example, the table structure for Oracle user tables is straightforward, as they are kept separate from the system tables, while it is more complex in INGRES, as all tables are held in a single form using attribute values to differentiate user-defined tables from system and view tables. Hence the extraction query statements to retrieve the entity names of a database schema differ for each system, as shown in table 6.9. These query statements indicate that the meta-data extraction process is done using the query language of that DBMS (e.g. SQL for Oracle and POSTQUEL for POSTGRES) and that the query table names and conditions vary with the type of the DBMS. This clearly demonstrates the DBMS dependency of the extraction process.

18 For simplicity, identification details like the owner of the database are not included here.


Once the meta-data is obtained from the system catalogues, we can process it to produce the database schema in the DDL formalism of the source database and represent it in our internal representation (see section 7.2). The extraction process for entity names (cf. table 6.9) covers only one type of information. A similar process is used to extract all the other types of information, including our enhanced knowledge-based tables. Here, the main difference is in the queries used to extract the meta-data and in any processing required to map the extracted information into our internal structures, which are introduced in section 7.2 (see also appendix D).

6.5.3 Meta-data storage

The generated internal structures are stored in text files for further use as input data for our system. These text files are stored locally using distinct directories for each database. The system uses the database connection specifications to construct a unique directory name for each database (e.g. department-Oracle7-thor.cf.ac.uk). We have given public access to these files so that the stored data and knowledge is not only reusable locally, but also usable from other sites. This directory structure concept provides a logically coherent database environment for users. It means that any future re-engineering processes may be done without physically connecting to the database (i.e. by selecting a database logically from one of the public directories instead).
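A minimal sketch of how such a directory name could be assembled from the connection details is given below; the predicate name is hypothetical and the real CCVES naming code may differ.

% Build the per-database directory name from the connection details, e.g.
% db_directory(department, 'Oracle7', 'thor.cf.ac.uk', D) gives
% D = 'department-Oracle7-thor.cf.ac.uk'.
db_directory(Database, Dbms, Host, Dir) :-
    atomic_list_concat([Database, Dbms, Host], '-', Dir).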

DBMS          Query
Oracle V7     SELECT table_name FROM user_tables;
INGRES V6     SELECT table_name FROM iitables WHERE table_type='T' AND system_use='U';
POSTGRES V4   RETRIEVE pg_class.relname WHERE pg_class.relowner!='6';
SQL-3         SELECT table_name FROM tables WHERE table_type='BASE TABLE';

Table 6.9: Query statements to extract the entity names of a database schema

The process of connecting to a database and accessing its meta-data usually does not take much time (e.g. at most 2 minutes). However, trying to access an active database whenever a user wants to view its structure slows down the regular activities of that database. Also, local working is more cost effective than regularly performing remote accesses. This alternative also guarantees access to the database service, as it is not affected by network traffic and breakdowns. We experienced such breakdowns during our system development, especially when accessing INGRES databases. A database schema can be considered to be static, whereas its instances are not. Hence, the decision to simulate a logical database environment after the first physical remote database access is justifiable, because it allows us to work on meta-data held locally.

6.5.4 Schema viewing

As meta-data is stored in text files, subsequent database access sessions can skip the stages described in sections 6.5.1 to 6.5.3 when viewing a database schema which has been accessed recently. During a database connection session, our system will only extract and store the meta-data of a database. Once the database connection process is completed the user needs to invoke a schema viewing session. Here, the user is prompted with a list of the currently logically connected databases, as shown on the left of figure 6.7.
When a database is selected from this list, its name descriptions (i.e. the database name and associated schema names) are placed in the main window of CCVES (cf. right of figure 6.7). The user selects schemas to view them, and our reverse-engineering process is applied at this point. Here, meta-data extracted from the database schema is processed further to derive the necessary constructs to produce the conceptual model as an E-R or OMT diagram. CCVES allows multiple selections of the same database schema (i.e. by selecting the same schema from the main window; cf. right of figure 6.7). As a result, multiple schema visualisation windows can be produced for the same database. The advantage of this is that it allows a user to simultaneously view and operate on different sections of the same schema, which otherwise would not be visible simultaneously due to the size of the overall schema (i.e. we would have to scroll the window to make other parts of the schema visible). Also, the facility to visualise schemas using a user-preferred display model means that the user can now view the same schema simultaneously using different display models.

Figure 6.7: Database selection and selected databases of CCVES

To produce a graphical view of a schema, we apply our reverse-engineering process. This process uses the meta-data which we extracted and represented internally. In chapter 7 we introduce the representation of our internal structures and describe the internal and external architecture and operation of our system, CCVES.


CHAPTER 7

Architecture and Operation of CCVES

The Conceptualised Constraint Visualisation and Enhancement System (CCVES) is defined by describing its internal architecture and operation - i.e. the way in which different legacy database schemas are processed within CCVES in the course of enhancing and migrating them into a target DBMS's schema - and its external architecture and operation - i.e. CCVES as seen and operated by its users. Finally, we look into the possible migrations that can be performed using CCVES.

7.1 Internal Architecture of CCVES

In previous chapters, we discussed the overall information flow (section 2.2), our re-engineering process (section 5.2) and the database access process (section 6.5). Here we describe how the meta-data accessed from a database is stored and manipulated by CCVES in order to successfully perform its various tasks. There are two sources of input information available to CCVES (cf. figure 7.1): initially, by accessing a legacy database service via the database connectivity (DBC) process, and later by using the database enhancement (DBE) process. This information is converted into our internal representation (see section 7.2) and held in this form for use by other modules of CCVES. For example, the Schema Meta-Visualisation System (SMVS) uses it to display a conceptual model of a legacy database, the Query Meta-Translation System (QMTS) uses it to construct queries that verify the extent to which the data conforms to existing and enhanced constraints, and the Schema Meta-Translation System (SMTS) uses it to generate and create target databases for migration.

7.2 Internal Representation

To address heterogeneity issues, meta-representation and translation techniques have been successfully used in several recent research projects at Cardiff [HOW87, RAM91, QUT93, IDR94]. A key to this approach is the transformation of the source meta-data or query into a common internal representation, which is then separately transformed into a chosen target representation. Thus components of a schema, referred to as meta-data, are classified as entity (class) and attribute (property) on input, and are stored in a database-language-independent fashion in the internal representation. This meta-data is then processed to derive the appropriate schema information of a particular DBMS. In this way it is possible to use a single representation and yet deal with issues related to most types of DBMSs. A similar approach is used for query transformation between source and target representations.


[Figure 7.1 diagram: the DBC and DBE processes feed the Internal Representation, which is used by the SMVS, QMTS and SMTS modules.]

Figure 7.1: Internal Architecture of CCVES

The meta-data we deal with has been classified into two types. The first category represents essential meta-data and the other represents derived meta-data. Information that describes an entity and its attributes, and constraints that identify relationships and hierarchies among entities, are the essential meta-data (see section 7.2.1), as they can be used to build a conceptual model. Information that is derived from the essential meta-data for use in the conceptual model constitutes the other type of meta-data. When performing our reverse-engineering process we look only at the essential meta-data. This information is extracted from the respective databases during the initial database access process (i.e. DBC in figure 7.1).

7.2.1 Essential Meta-data

Our essential (basic) meta-data internal representation captures sufficient information to allow us to reproduce a database schema using the DDL syntax of any DBMS. This representation covers entity and view definitions and their associated constraints. The five Prolog-style constructs shown in figure 7.2 were chosen to represent this basic meta-data. The first two constructs, namely class and class_property, are fundamental to any database schema as they describe the schema entities and their attributes, respectively. The third construct represents constraints associated with entities; this information is only partially represented by most DBMSs. The last two constructs are relevant only to some recent object-oriented DBMSs and are not supported by most DBMSs. We have included them mainly to demonstrate how modern abstraction mechanisms such as inheritance hierarchies could be incorporated into legacy DBMSs. By a similar approach, it is possible to add any other appropriate essential meta-data constructs. For conceptual modelling, and for the type of testing we perform for the chosen DBMSs, namely Oracle, INGRES and POSTGRES, we found that the five constructs described here are sufficient. However, some additional environmental data (see section 7.2.2), which allows identification of the name and the type of the current database, is also essential.


1. class(SchemaId, CLASS_NAME).
2. class_property(SchemaId, CLASS_NAME, PROPERTY_NAME, PROPERTY_TYPE).
3. constraint(SchemaId, CLASS_NAME, PROPERTY_list, CONST_TYPE, CONST_NAME, CONST_EXPR).
4. class_inherit(SchemaId, CLASS_NAME, SUPER_list).
5. renamed_attr(SchemaId, SUPER_NAME, SUPER_PROP_NAME, CLASS_NAME, PROPERTY_NAME).

Figure 7.2: Our Essential Meta-data Representation Constructs

We now provide a detailed description of our meta-representation constructs. This representation is based on the concepts of the Object Abstract Conceptual Schema (OACS) of Ramfos [RAM91], used in his SMTS and other meta-processing systems; hence we have used the same name to refer to our own internal representation. Ramfos’s OACS internal representation provides a natural abstraction of a particular structure based on the notion of objects. For example, when an object is described, its attributes, constraints and other related properties are treated as a single construct, although only part of it may be used at a time. Our OACS representation, in contrast, directly resembles the internal representation structure of most relational DBMSs (e.g. class represents an entity, and class_property represents the list of attributes of an entity). This is the main difference between the two representations. However, it is possible to map the OACS constructs of Ramfos to our internal representation and vice versa, hence our decision to use a variation of the original OACS does not affect the meta-representation and processing principles in general.

• Meta-data Representation of class

The names of all the entities for a particular schema are recorded using class. This information is processed to identify all the entities of a database schema.

• Meta-data Representation of class_property

The names of all attributes and their data types for a particular schema are recorded using class_property. This information is processed to identify all attributes of an entity.

• Meta-data Representation of constraint

All types of constraints associated with entities are recorded using constraint. This information has been organised to represent constraints as logical expressions, along with an identification name and participating attributes. Different types of constraint, i.e. primary key, foreign key, unique, not null, check constraints, etc., are each processed and stored in this form. Usually a certain amount of preprocessing is required for the construction of our generalised representation of a constraint. For example, some check constraints extracted from the INGRES DBMS need to be preprocessed to allow them to be classified as check constraints by our system.

• Meta-data Representation of class_inherit


Entities that participate in inheritance hierarchies are recorded using class_inherit. The names of all super-entities for a particular entity are recorded here. This information is processed to identify all sub-entities of an entity and the inheritance hierarchies of a database schema.

• Meta-data Representation of renamed_attr

During an inheritance process, some attribute names may be changed to give more meaningful names to the inherited attributes. Once the inherited names have been changed, it becomes impossible to automatically reverse-engineer these entities, as their attribute names no longer match. To overcome this problem we have introduced an additional meta-data representation construct, namely renamed_attr, which keeps track of all attributes whose names have changed due to inheritance. This is a representation of synonyms for the attribute names of an inheritance hierarchy.
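To make the constructs of figure 7.2 concrete, the fragment below instantiates them for a cut-down Employee entity and sketches how such facts could be turned back into DDL text. The schema identifier, the attribute subset and the generation rule are illustrative assumptions only; the real SMTS translation is considerably more elaborate.

% Illustrative instances of the essential meta-data constructs (figure 7.2).
class(col, employee).
class_property(col, employee, empNo,    'char(9)').
class_property(col, employee, worksFor, 'char(5)').
constraint(col, employee, [empNo], 'PRIMARY KEY', 'Employee_PK', 'PRIMARY KEY (empNo)').
class_inherit(col, employee, [person]).

% A minimal sketch of regenerating a DDL statement from these facts.
attribute_clause(Schema, Class, Clause) :-
    class_property(Schema, Class, Name, Type),
    atomic_list_concat([Name, ' ', Type], Clause).

create_table(Schema, Class, DDL) :-
    class(Schema, Class),
    findall(C, attribute_clause(Schema, Class, C), Clauses),
    atomic_list_concat(Clauses, ', ', Body),
    atomic_list_concat(['CREATE TABLE ', Class, ' ( ', Body, ' );'], DDL).

For example, create_table(col, employee, DDL) binds DDL to 'CREATE TABLE employee ( empNo char(9), worksFor char(5) );'.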

7.2.2 Environmental Data

This is recorded using ccves_data, which is used to represent three types of information, namely the database name, the DBMS name and the name of the host machine (see figure 7.3). These are captured at the database connection stage.

7.2.3 Processed Meta-data

The essential meta-data described in section 7.2.1 is processed to derive additional information required for conceptual modelling. This additional information is schema_data, class_data and relationship. Here, schema_data (cf. figure 7.4 section 1) identifies all entities (all_classes, using class of figure 7.2 section 1) and entity types (link_classes and weak_classes, by the process described in section 5.4, using constraint types such as primary and foreign key which are recorded in constraint of figure 7.2 section 3). Class_data (cf. figure 7.4 section 2) identifies all class properties (property_list, using class_property of figure 7.2 section 2), inherited properties (using class_property, class_inherit and renamed_attr of figure 7.2 sections 2, 4 and 5, respectively), sub- and super-classes (subclass_list and superclass_list, using class_inherit of figure 7.2) and referencing and referenced classes (ref and refed, using the foreign key constraints recorded in constraint of figure 7.2). Relationship (cf. figure 7.4 section 3) records the relationship types (derived using the process described in section 5.4) and cardinality information (using the derived relationship types and the available cardinality values).

ccves_data(dbname, DATABASE_NAME).
ccves_data(dbms, DBMS_NAME).
ccves_data(host, HOST_MACHINE_NAME).

Figure 7.3: OACS Constructs used as environmental data


1. schema_data(SchemaId, [ all_classes(ALL_CLASS_list),
                           link_classes(LINK_CLASS_list),
                           weak_classes(WEAK_CLASS_list) ]).
2. class_data(SchemaId, CLASS_NAME, [ property_list(OWN_PROPERTY_list, INHERIT_PROPERTY_list),
                                      subclass_list(SUBCLASS_list),
                                      superclass_list(SUPERCLASS_list),
                                      ref(REFERENCING_CLASS_list),
                                      refed(REFERENCED_CLASS_list) ]).
3. relationship(SchemaId, REFERENCING_CLASS_NAME, RELATIONSHIP_TYPE, CARDINALITY, REFERENCED_CLASS_NAME).

Figure 7.4: Derived OACS Constructs

7.2.4 Graphical Constructs

Besides the above OACS representations, it is necessary to support additional constructs to produce a graphical display of a conceptual model. For this we produce graphical constructs using our derived OACS constructs (cf. figure 7.4) and apply a display layout algorithm (see section 7.3). We call these graphical object abstract conceptual schema (GOACS) constructs, as they are graphical extensions of our OACS constructs. The graphical display represents entities, their attributes (optional), relationships, etc., using graphical symbols which consist of strings, lines and widgets (basic toolkit objects which, unlike strings and lines, retain their data after being written to the screen [NYE93]). To produce this display, the coordinates of the positions of all entities, relationships, etc., are derived and recorded in our graphical constructs. The coordinates of each entity are recorded using class_info, as shown in section 1 of figure 7.5. This information identifies the top left coordinates of an entity.

1. class_info(SchemaId, CLASS_NAME, [ x(X0), y(Y0) ] ).
2. box(SchemaId, X0, Y0, W, H, REGULAR_CLASS_NAME).
   box_box(SchemaId, X0, Y0, W, H, Gap, WEAK_CLASS_NAME).
   diamond_box(SchemaId, X0, Y0, W, H, LINK_CLASS_NAME).
3. ref_info(Schema_Id, REFERENCING_CLASS_NAME, REFERENCING_CLASS_CONNECTING_SIDE,
            REFERENCING_CLASS_CONNECTING_SIDE_COORDINATE, REFERENCED_CLASS_NAME,
            REFERENCED_CLASS_CONNECTING_SIDE, REFERENCED_CLASS_CONNECTING_SIDE_COORDINATE).
4. line(SchemaId, X1, Y1, X2, Y2).
   string(SchemaId, X0, Y0, STRING_NAME).
   diamond(SchemaId, X0, Y0, W, H, ASSOCIATION_NAME).
5. property_line(SchemaId, CLASS_NAME, X1, Y1, X2, Y2).
   property_string(SchemaId, CLASS_NAME, PROPERTY_NAME, DISPLAY_COLOUR, X0, Y0).

Figure 7.5: Graphical Constructs (GOACS)

The graphical symbol for an entity depends on the entity type. Thus, further processing is required to graphically categorise entity types. For the EER model, we categorise entities as follows: regular as box, weak as box_box and link as diamond_box (cf. section 2 of figure 7.5, and figure 7.6). We use an intermediate representation construct, namely ref_info, to assist in the derivation of appropriate coordinates for all associations (cf. section 3 of figure 7.5).
With the assistance of ref_info, the coordinates used to represent relationships are derived and recorded using line, string and diamond (cf. section 4 of figure 7.5, and figure 7.7). Users of our schema displays are allowed to interact with schema entities. During this process, optional information such as the properties (i.e. attributes) of selected entities can be added to the display. This feature is the result of providing the entities and their attributes at different levels of abstraction. The added information is recorded separately using property_line and property_string (cf. section 5 of figure 7.5, and figure 7.8).

[Figures 7.6 to 7.8 show the EER drawing symbols used for these constructs, together with the (X0,Y0) origin, width W and height H recorded for each.]

Figure 7.6: Graphical representation of entity types in EER notations (box, box_box and diamond_box)

Figure 7.7: Graphical representation of connections, labels and associations in EER notations (line and diamond)

Figure 7.8: Graphical representation of selected attributes of a class in EER notations (string, property_line and property_string)

7.3 Display Layout Algorithm

To produce a suitable display of a database schema it was necessary to adopt an intelligent algorithm which determines the positioning of objects in the display. Such algorithms have been used by many researchers and also commercially for similar purposes [CHU89]. We studied these ideas and implemented our own layout algorithm, which proved to be effective for small, manageable database schemas. However, to allow displays to be altered to a user-preferred style, and for our method to be effective for large schemas, we decided to incorporate an editing facility. This feature allows users to move entities and change their original positions in a conceptual schema. Internally, this is done by changing the coordinates recorded in class_info for a repositioned entity and recomputing all its associated graphical constructs.


When the location of an entity is changed, the connection side of that entity may also need to be changed. To deal with this case, appropriate sides for all entities can be derived at any stage of our editing process. When the appropriate sides are derived, the ref_info construct (cf. section 3 of figure 7.5) is updated accordingly, to enable us to reproduce the revised coordinates of the line, string and diamond constructs (cf. section 4 of figure 7.5). Our layout algorithm performs the following steps (a sketch of the ordering step is given after the list):

1. Entities connected to each other are identified (i.e. grouped) using their referenced entity information. This process highlights unconnected entities as well as entities forming hierarchies or graph structures.

2. Within a group, entities are rearranged according to the number of connections associated with them. This arrangement puts entities with most connections in the centre of the display structure and entities with the least connections at the periphery.

3. A tree representation is then constructed, starting from the entity having the most connections. During the construction of subsequent trees, entities which have already been used are not considered, to prevent their original positions being changed. This makes it easy to visualise the relationships/aggregations present in a conceptual model. The identification of such properties allows us to gain a better understanding of the application being modelled. Similarly, attempts are made to highlight inheritance hierarchies whenever they are present. However, when too many inter-related entities are involved, it is sometimes necessary to use the move editing facility to relocate some entities so that their properties (e.g. relationships) are highlighted in the diagram. The existence of such hidden structures is due to the cross connection of some entities. To prevent overlapping of entities, relationships, etc., initial placement is done using an internal layout grid. However, the user is permitted to overlap or place entities close to each other during schema editing.
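The following Prolog fragment sketches the ordering used in step 2: entities are ranked by the number of relationships in which they participate, using relationship/5 facts in the form of figure 7.4. The sample facts and the cardinality placeholder are illustrative assumptions, not the actual CCVES layout code.

:- use_module(library(pairs)).

% Illustrative relationship/5 facts (cf. figure 7.4, section 3).
relationship(col, student, '1:N', default, course).
relationship(col, student, '1:N', default, department).
relationship(col, student, '1:N', default, employee).
relationship(col, course,  '1:N', default, department).

% Number of relationships in which Entity participates, at either end.
connection_count(Schema, Entity, Count) :-
    findall(Other, ( relationship(Schema, Entity, _, _, Other)
                   ; relationship(Schema, Other, _, _, Entity) ), Others),
    length(Others, Count).

% Entities of a schema ordered by decreasing connection count
% (the most connected entity is placed at the centre of the display).
entities_by_connections(Schema, Ordered) :-
    setof(E, T^C^R^( relationship(Schema, E, T, C, R)
                   ; relationship(Schema, R, T, C, E) ), Entities),
    map_list_to_pairs(connection_count(Schema), Entities, Pairs),
    sort(1, @>=, Pairs, Sorted),
    pairs_values(Sorted, Ordered).

For the sample facts, entities_by_connections(col, L) places student (three connections) first.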

The coordinate information of a final diagram is saved in disk files, so that these coordinates are automatically available to all subsequent re-engineering processes. Hence our system first checks for the existence of a file containing these coordinates, and only in its absence would it use the above layout algorithm.

7.4 External Architecture and Operation of CCVES

We characterise CCVES by first considering the type of people who may use this system. This is followed by an overview of the external system components. Finally, the external operations performed by the system are described.

7.4.1 CCVES Operation

The three main operations of CCVES, i.e. analysing, enhancing and incremental migration, need to be performed by a suitably qualified person. This person must have a good knowledge of the current database application to ensure that only appropriate enhancements are made to it. Also, this person must be able to interpret and understand conceptual modelling and the data manipulation language SQL, as we have used the SQL syntax to specify the contents of databases.


This person must have the skills to design and operate a database application; thus they are a more specialised user than the traditional ad hoc user. We shall therefore refer to this person as a DataBase Administrator (DBA), although they need not be a professional DBA. It is this person who will be in charge of migrating the current database application. To this DBA, the process of accessing meta-data from a legacy database service in a heterogeneous distributed database environment is fully automated once the connection to the database of interest is made. The production of a graphical display representation for the relevant database schema is also fully automated. This representation shows all available meta-data, links and constraints in the existing database schema. Links and constraints defined by hand coding in the legacy application (i.e. not in the database schema but appearing in the application in the form of 3GL or equivalent code) will not be shown until they are supplied to CCVES during the process of enhancing the legacy database. Such enhancements are represented in the database itself to allow automatic reuse of these additions, not only by our system users but also by others (i.e. users of other database support tools). The enhancement process will assist the DBA in incrementally building the database structure for the target database service. Possible decomposable modules for the legacy system are identified during this stage. Finally, when the incremental migration process has been performed, the DBA may need to review its success by viewing both the source and the target database schemas. This is achieved using the facility to visualise multiple heterogeneous databases.

We have sought to meet our objectives by developing an interactive schema visualisation and knowledge acquisition tool which is directed by an inference engine using a real world data modelling framework based on the EER and OMT conceptual models and extended relational database modelling concepts. This tool has been implemented in prototype form mostly in Prolog, supported by some C language routines with embedded SQL to access and use databases built with the INGRES DBMS (version 6), the Oracle DBMS (versions 6 and 7) or the POSTGRES O-O data model (version 4). The Prolog code, which does the main processing and uses X window and Motif widgets, exceeds 13,000 lines, while the C code with embedded SQL ranges from 100 to 1,000 lines depending on the DBMS.

7.4.2 System Overview

This section defines the external architecture and operation of CCVES. It covers the design and structure of its main interfaces, namely database connection (access), database selection (processing) and user interaction (see figure 7.9). The heart of the system consists of a meta-management module (MMM) (see figure 7.10), which processes and manages meta-data using a common internal intermediate schema representation (cf. section 7.2). A presentation layer which offers display and dialog windows has been provided for user interaction. The schema visualisation, schema enhancement, constraint visualisation and database migration modules (cf. figure 7.9) communicate with the user.


[Figure 7.9: Principal processes and control flow of CCVES. The diagram shows the GUI path from Start through Connect Database (database access) and Select Database (database processing) to the user-interaction modules: Query Tool, Database Tools, Schema Visualisation, Schema Enhancement, Constraint Visualisation and Database Migration.]

The meta-data and knowledge for this system are extracted from the respective database system tables and stored using a common internal representation (OACS). This knowledge is further processed to derive the graphical constructs (GOACS) of a visualised conceptual model. Information is represented in Prolog as dynamic predicates describing facts, and the semantic relationships that hold between facts, about graphical and textual schema components. The meta-management module has access to the selected database to store any changes (e.g. schema enhancements) made by the user. The input/output interfaces of MMM manage the presentation layer of CCVES. This consists of X window and Motif widgets used to create an interactive graphical environment for users.

In section 2.2 we introduced the functionality of CCVES in terms of information flow, with special emphasis on its external components (cf. figure 2.1). Later, in sections 2.3 and 7.1, we described the main internal processes of CCVES (cf. figures 2.2 and 7.1). Here, in figure 7.10, we show both the internal and external components of CCVES together, with special emphasis on the external aspect.

7.4.3 System Operation

The system has three distinct operational phases: meta-data access, meta-data processing and user interaction. In the first phase, the system communicates with the source legacy database service to extract meta-data [19] and knowledge [20] specifications from the database concerned. This is achieved when connection to the database (connect database of figure 7.10) is made by the system, and is the meta-data access phase. In the second phase, the source specifications extracted from the database system tables are analysed, along with any graphical constructs we may have subsequently derived, to form the meta-data and meta-knowledge base of MMM.

[19] Meta-data represents the original database schema specifications of the database.
[20] Knowledge represents subsequent knowledge we may have already added to augment this database schema.


This information is used to produce a visual representation in the form of a conceptual model. This phase is known as meta-data processing and is activated when select database (cf. figure 7.10) is chosen by the user.

The final phase is interaction with the user. Here, the user may supply the system with semantic information to enrich the schema; visualise the schema using a preferred modelling technique (EER and OMT are currently available); select graphical objects (i.e. classes) and visualise their properties and intra- and inter-object constraints using the constraint window; and modify the graphical view of the displayed conceptual model. The user may also incrementally migrate selected schema constructs; transfer selected meta-data to other tools (e.g. MITRA, a Query Tool [MAD95]); accept meta-data from other tools (e.g. REVEERD, a reverse-engineering tool [ASH95]); and examine the same database using another window of CCVES or other database design tools (e.g. Oracle*Design). The objective of providing the user with a wide range of design tools is to optimise the process of analysing the source legacy database. The enhancement of the legacy database with constraints is an attempt to collect, in the legacy database, information that is managed by modern DBMSs, without affecting its operation and in preparation for its migration.
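The common internal representation mentioned in the system overview (meta-data held as Prolog dynamic predicates) can be pictured with a minimal, hypothetical sketch. The predicate names and argument structures below are illustrative assumptions, not the actual OACS/GOACS definitions of CCVES (those are given in section 7.2); the final rule merely hints at how a graphical construct might be derived from stored meta-data.

    %% Hypothetical sketch of an internal meta-data store (not the actual
    %% CCVES predicates).
    :- dynamic entity/2, attribute/4, primary_key/2, foreign_key/4.

    % entity(Database, EntityName).
    entity(dept_oracle, employee).
    entity(dept_oracle, department).

    % attribute(Database, Entity, AttributeName, Type).
    attribute(dept_oracle, employee, emp_no, integer).
    attribute(dept_oracle, employee, dept_no, integer).
    attribute(dept_oracle, department, dept_no, integer).

    % primary_key(Entity, KeyAttributes) and
    % foreign_key(Entity, Attributes, ReferencedEntity, ReferencedAttributes).
    primary_key(employee, [emp_no]).
    primary_key(department, [dept_no]).
    foreign_key(employee, [dept_no], department, [dept_no]).

    % A relationship link for the conceptual display can be derived from a
    % foreign key, in the spirit of the OACS-to-GOACS derivation.
    relationship(From, To) :-
        foreign_key(From, _Attrs, To, _RefAttrs).

Under these assumptions, a query such as relationship(employee, D) would bind D to department, i.e. the link that would appear as a relationship in the displayed model.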

[Figure 7.10: External Architecture of CCVES. The designer interacts through display and dialog windows (an OMT/EER schema display supporting select, move and transfer operations; a constraint window/visualiser; and dialogs for defining external constraints), all managed by the input/output interface of the Meta-Management Module. Within the module, a meta-processor, meta-knowledge base and meta-storage system hold the OACS and GOACS representations, which are produced and consumed by the meta-translation (input), meta-transformation and meta-translation (output) steps, the latter writing database schema and SQL-3 constraint text files for external constraint storage and enforcement. Connect Database and Select Database link the module to the heterogeneous distributed databases (e.g. Dept-Oracle, Faculty-INGRES, College-POSTGRES, identified by host, DBMS and database name) and to external database tools (e.g. GUI, GQL, Oracle*Design).]

For successful system operation, users need not be aware of the internal schema representation or any other non-SQL, database-specific syntax of the source or target database. This is because all source schemas are mapped into our internal representation and are always presented to the user using the standard SQL language syntax (unless specifically requested otherwise). This enables the user to deal with the problem of heterogeneity, since at the global level local databases are viewed as if they come from the same DBMS. The SQL syntax is used by default to express the associated constraints of a database. If specifically requested, the SQL syntax can be translated and viewed using the DDL of the legacy DBMS; as far as CCVES is concerned this is just another meta-translation process. A textual version of the original legacy database definition is also created by CCVES when connection to the legacy database is established. This definition may be viewed by the user for a better understanding of the database being modelled.

The ultimate migration process allows the user to employ a single target database environment for all legacy databases. This will assist in removing the physical heterogeneity between those databases. The complete migration process may take days for large information systems, as they already hold a large volume of data. Hence the ability to enhance and migrate while legacy databases continue to function is an important feature. Our enhancement process does not affect existing operations, as it involves adding new knowledge and validating existing data. Whenever structural changes are introduced (e.g. an inheritance hierarchy), we have proposed the use of view tables (cf. section 5.6) to ensure that normal database operations will not be affected until the actual migration is commenced. This is because some data items may continue to change while the migration is in preparation, and indeed during migration itself. We have proposed an incremental migration process to minimise this effect, and the use of a forward gateway to deal with such situations.

7.5 External Interfaces of CCVES

CCVES is seen by users as consisting of four processes, namely: a database access process, a schema and constraint visualisation process, a schema enhancement process, and a schema migration process. The database access process was described in section 6.5. In the next subsections we describe the other three processes of CCVES to complete the picture.

7.5.1 Schema and Constraint Visualisation

The input/output interfaces of MMM manage the presentation layers of CCVES. These layers consist of display and dialog windows used to provide an interactive graphical environment for users. The user is presented with a visual display of the conceptual model for a selected database, and may perform many operations on this schema display window (SDW) to analyse, enhance, evolve, visualise and migrate any portion of that database. Most of these operations are done via the SDW as they make use of the conceptual model. The traditional conceptual model is an E-R diagram which displays only entities, their attributes and relationships. This level of abstraction gives the user a basic idea of the structure of a database. However, this information is not sufficient to gain a more detailed understanding of the components of a conceptual model, including identification of intra- and/or inter-object constraints.


Intra-object constraints for an entity provide extra information that allows the user to identify behavioural properties of the entity. For instance, the attributes of an entity do not provide sufficient information to determine the type of data that may be held by an attribute and any restrictions that may apply to it. Hence, providing a higher level of abstraction by displaying constraints along with their associated entities and attributes gives the user a better understanding of the conceptual model. The result is much more than a static entity and attribute description of a data model, as it describes how the model behaves for dynamic data (i.e. a constraint implies that any data item which violates it cannot be held by the database).

The schema visualisation module allows users to view the conceptual schema and constraints defined for a database. This visualisation process is done using three levels of abstraction. The top level describes all the entity types along with any hierarchies and relationships. The properties of each entity are viewed at the next level of abstraction to increase the readability of the schema. Finally, all constraints associated with the selected entities and their properties are viewed to gain a deeper and better understanding of the actual behaviour of selected database components. The conceptual diagrams of our test databases are given in appendix C. These diagrams are at the top level of abstraction. Figures 7.11 and 7.12 show the other two levels of abstraction.

The graphical schema displayed in the SDW for a selected database uses the OMT notation by default, which can be changed to EER from a menu. Users can produce any number of schema displays for the same schema, and thus can visualise a database schema using both OMT and EER diagrams at the same time (a picture of our system while viewing the same schema using both forms is provided in figure 7.11). The display size of the schema may also be changed from a menu. A description that identifies the source of each display is provided, as we are dealing with many databases in a heterogeneous environment. The diagrams produced by CCVES can be edited to alter the location of their displayed entities and hence to permit visualisation of a schema in the personally preferred style and format of an individual user. This is done by selecting and moving a set of entities within the scrolling window, thus altering the relative positions of entities within the diagram produced. These changes can be saved by users and are automatically restored for the next session.

The system allows interactive selection of entities and attributes from the SDW. We initially do not include any attributes as part of the displayed diagram, because we provide them as a separate level of abstraction. A list of attributes associated with an entity can be viewed by first selecting the entity from the display window (abstraction at level 2), and then browsing through its attributes in the attribute window, which is placed just below the display window (see figure 7.12). Any attribute selected from this window will automatically be transferred to the main display window, so that only attributes of interest are displayed there. This technique increases the readability of our display window. At each stage, appropriate messages produced by the system display unit are shown at the bottom of this window. For successful interactions, graphical responses are used whenever applicable. For example, when an entity is selected by clicking on it, the selected entity will be highlighted. In this thesis we do not provide interaction details, as these are provided at the user manual level. The 'browser' menu option of the SDW will invoke the browser window.
This is done only when a user wishes to visit the constraint visualisation abstraction level, our third level of abstraction. Here we see all entities and their properties of interest as the default option, but we can expand this to display other entities and properties by choosing appropriate menu options from the browsing window. We can also filter the displayed constraints to include only those of interest (e.g. show only domain constraints). In cases where inherited attributes are involved, the system will initially show only those attributes owned by an entity (the default option); others can be viewed by selecting the required level of abstraction (available in the menu) for the entity concerned.

The ability to select an entity from the display window and display its properties in the browsing window satisfies our requirement of visualising intra-object constraints. The reverse of this process, i.e. selecting a constraint from the browsing window and displaying all its associated entities in the display window, satisfies our inter-object constraint visualisation requirement. Both of these facilities are provided by our system (see figures 7.11 and 7.12, respectively).

All operations with the mouse device are done using the left button, except when altering the location of an entity of a displayed conceptual model. We have allowed the use of the middle button of the mouse to select, drag and drop [21] such an entity. This process alters the position of the entity and redraws the conceptual model. By this means, object hierarchies, relationships, etc., can be made prominent by placing the objects concerned in hierarchies close to each other. This feature was introduced firstly to allow users to visualise a conceptual model in a preferred manner, and secondly because our display algorithm was not capable of automatically providing such features when constructing complex schemas having many entities, hierarchies and relationships (cf. section 7.3).

[21] CCVES changes the cursor symbol to confirm the activation of this mode.
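The constraint filtering described in this subsection (e.g. "show only domain constraints") can be pictured with a small, hypothetical Prolog sketch. The constraint_of facts, term shapes and filter names below are illustrative assumptions, not CCVES's actual browser implementation.

    %% Hypothetical sketch of the constraint filter used at the third level
    %% of abstraction.
    constraint_of(employee, domain(age, 'age >= 16')).
    constraint_of(employee, key(primary, [emp_no])).
    constraint_of(employee, referential(dept_no, department, dept_no)).

    % displayed_constraints(+Filter, +Entity, -Constraints)
    displayed_constraints(all, Entity, Cs) :-
        findall(C, constraint_of(Entity, C), Cs).
    displayed_constraints(domain_only, Entity, Cs) :-
        findall(domain(A, Cond), constraint_of(Entity, domain(A, Cond)), Cs).

With these example facts, displayed_constraints(domain_only, employee, Cs) would return only the domain constraint on age, which is the kind of narrowed view the browsing window offers.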


7.5.2 Schema Enhancement

Schema enhancements are also done via the schema display window. This module is mainly used to specify dynamic constraints. These constraints are usually extracted from the legacy code of a database application, as in older systems they were specified in the application itself. Constraint extraction from legacy applications is not supported by CCVES, so this information must be obtained by other means. We assume that such constraints can be extracted by examining the legacy code, using any existing documentation or using user knowledge of the application, and have introduced this module to capture them. We have also provided an option to detect possible missing, hidden and redundant information (cf. section 5.5.2) to assist users in formulating new constraints.

The user specifies constraints via the constraint enhancement interface by choosing the constraint type and associated attributes. In the case of a check constraint the user needs to specify it using SQL syntax. The constraint enhancement process allows further constraints to appear in the graphical model. This development is presented via the graphical display, so that the user is aware of the links and constraints present for the schema. For instance, when a foreign key is specified, CCVES will try to derive a relationship using this information. If this process is successful a new relationship will appear in the conceptual model.

A graphical user interface in the form of a pop-up sub-menu is used to specify constraints, which take the form of integrity constraints (e.g. primary key, foreign key and check constraints) and structural components (e.g. inheritance hierarchies and entity modifications). In figure 7.13 we present some pop-up menus of CCVES which assist users in specifying various types of constraints. Here, names of entities and their attributes are automatically supplied via pull-down menus to ensure the validity of certain input components of user-specified constraints. For all constraints, information about the type of constraint and the class involved is specified first. When the type of constraint is known, the prior existence of such a constraint is checked in the case of primary key and foreign key constraints. For primary keys, the process will not proceed if a key specification already exists; for foreign keys, if the attributes already participate in such a relationship they will not appear in the referencing attribute specification list. In the case of referenced attributes, only attributes with at least the uniqueness property will appear, in order to prevent specification of any invalid foreign keys.

All enhanced constraints are stored internally until they are added to the database using another menu option. Prior to this augmentation process the user should verify the validity of the constraints. In the case of recent DBMSs like Oracle, invalid constraints will be rejected automatically and the user will be requested to amend or discard them. In such situations the incorrect constraints are reported to the user and stored on disk as a log file. Also, no changes made during a session are saved until specifically instructed by the user. This gives the opportunity to roll back (in the event of an incorrect addition) and resume from the previous state.
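As a rough illustration of the checks just described, the sketch below shows how a user-supplied primary or foreign key might be validated before being stored internally, reusing the hypothetical meta-data predicates from the earlier sketch in section 7.4. It is an assumption-laden outline of the stated rules (no existing primary key; referenced attributes at least unique), not the actual CCVES implementation.

    %% Illustrative validation of enhanced constraints (hypothetical predicates).
    :- dynamic enhanced_constraint/2, primary_key/2, unique/2.

    % A new primary key is accepted only if the entity has none already.
    accept_primary_key(Entity, Attrs) :-
        \+ primary_key(Entity, _Existing),
        assertz(enhanced_constraint(Entity, primary_key(Attrs))).

    % A new foreign key is accepted only if the referenced attributes have
    % at least the uniqueness property in the referenced entity.
    accept_foreign_key(Entity, Attrs, RefEntity, RefAttrs) :-
        ( primary_key(RefEntity, RefAttrs)
        ; unique(RefEntity, RefAttrs)
        ),
        assertz(enhanced_constraint(Entity,
                                    foreign_key(Attrs, RefEntity, RefAttrs))).

The asserted enhanced_constraint/2 terms stand in for the internal store that, in CCVES, holds enhanced constraints until the user chooses to add them to the database.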


Figure 7.13: Two stages of a Foreign Key constraint specification

Input data in the form of constraints to enhance the schema provides new knowledge about a database. It is essential to retain this knowledge within the database itself if it is to be used for any future processing. CCVES achieves this task using its knowledge augmentation process, as described in section 5.7. From a user's point of view this process is fully automated and hence no intermediate interactions are involved. The enhanced knowledge is augmented only if the database is unable to represent the new knowledge naturally. Such knowledge cannot be enforced automatically except by migrating the database to a newer version of the DBMS, or to another DBMS that supports it. However, the data already held by the database may not conform to the new constraints, and hence existing data may be rejected by the target DBMS. This would result in loss of data and/or migration delays. To address this problem, we provide an optional constraint enforcement process which checks the conformance of the data to the new constraints prior to migration.

7.5.3 Constraint Enforcement and Verification

Constraint enforcement is automatically managed only by relatively recent DBMSs. If CCVES is used to enhance a recent DBMS such as Oracle, then verification and enforcement will be handled by the DBMS, as CCVES will just create constraints using the DDL commands of that DBMS. However, when such features are not supported by the underlying DBMS, CCVES has to provide such a service itself. Our objective in this process is to give users the facility to ensure that the database conforms to all the enhanced constraints.

Constraints are chosen from the browser window to verify their validity. Once selected, the constraint verification process will issue each constraint to the database using the technique described in section 5.8 and report any violations to the user. When a violated constraint is detected, the user can decide whether to keep or discard it. The user could decide to retain the constraint in the knowledge-base for various reasons, including: ensuring that future data conforms to the constraint; providing users with a guideline to the system's data contents irrespective of violations that may occur occasionally; and assisting the user in improving the data or the constraint. To enable the enforcement of such constraints for future data instances, it is necessary either to use a trigger (e.g. an on-append check constraint) or to add a temporal component to the constraint (e.g. system date > constraint input date). This ensures that the constraint will not be enforced on existing data.
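The augmentation and verification steps can be pictured with a hedged sketch: an enhanced check constraint is emitted either as native DDL, when the target DBMS can represent it, or as a row in a knowledge table held in the database itself, and a plausible violation query is generated for verification. The predicate names, the ccves_constraints table and the capability facts are illustrative assumptions, not CCVES's actual output or table structures.

    %% Illustrative augmentation and verification of a check constraint.
    %% check(Name, ConditionSQL) is a hypothetical constraint term.
    supports_check_constraints(oracle7).        % example capability fact

    % Emit native DDL when the DBMS supports check constraints ...
    emit_augmentation(Dbms, Table, check(Name, Cond)) :-
        supports_check_constraints(Dbms), !,
        format("ALTER TABLE ~w ADD CONSTRAINT ~w CHECK (~w);~n",
               [Table, Name, Cond]).
    % ... otherwise record the constraint as data in the legacy database,
    % ready for reuse by other tools and for the eventual migration.
    emit_augmentation(_Dbms, Table, check(Name, Cond)) :-
        format("INSERT INTO ccves_constraints VALUES ('~w', '~w', '~w');~n",
               [Table, Name, Cond]).

    % Verification: a query whose answers are the rows violating the constraint.
    violation_query(Table, check(_Name, Cond)) :-
        format("SELECT * FROM ~w WHERE NOT (~w);~n", [Table, Cond]).

For example, emit_augmentation(ingres6, employee, check(emp_age_ck, 'age >= 16')) would fall through to the second clause and record the constraint as data, whereas the Oracle 7 case would produce an ALTER TABLE statement.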


When queries are used to verify enhanced constraints, the retrieved data are the instances that violate a constraint. In such a situation, retrieving a large number of instances for a given query does not make much sense, as the violations may be due to an incorrect constraint specification rather than the data itself. Therefore, if the output exceeds 20 instances we terminate the query and instruct the user to inspect this constraint as a separate action.

7.5.4 Schema Migration

The migration process is provided to allow an enhanced legacy system to be ported to a new environment. This is performed incrementally, by first creating the schema in the target DBMS and then copying the legacy data to the target system. To create the schema in the target system, DDL statements are generated by CCVES. An appropriate schema meta-translation process is performed if required (e.g. if the target DBMS has a non-SQL query language). The legacy data is migrated using the import/export tools of the source and target DBMSs.

The migration process is not fully automated, as certain conflicts cannot be resolved without user intervention. For example, if the target database only accepts names of length 16 (as in Oracle) instead of the 32 (as in INGRES) allowed in the source database, then a name resolution process must be performed by the user. Also, names used in one DBMS may be keywords in another. Our system resolves these problems by adding a tag to such names or by truncating a name. This approach is not generic, as the uniqueness property of an attribute cannot be guaranteed when its original name is truncated. In these situations user involvement is unavoidable.

Although CCVES has been tested with only three types of DBMS, namely INGRES, POSTGRES and Oracle, it could easily be adapted for other relational DBMSs, as they represent their meta-data similarly - i.e. in the form of system tables, with minor differences such as table and attribute names and some table structures. Non-relational database models accessible via ODBC or other tools (e.g. Data Extract for DB2, which permits movement of data from IMS/VS, DL/1, VSAM or SAM to SQL/DS or DB2) could also be accommodated, as the meta-data required by CCVES could be extracted from them. Previous work related to meta-translation [HOW87] has investigated the translation of dBase code to INGRES/QUEL, demonstrating the applicability of this technique in general, not only to the relational data model but also to others such as CODASYL and hierarchical data models. This means CCVES is capable in principle of being extended to cope with other data models.
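The name-resolution step mentioned above can be sketched as a small Prolog predicate that tags names clashing with target keywords and truncates names exceeding the target's length limit. The length limits follow the example in the text; the reserved-word facts and the '_c' tag are illustrative assumptions rather than CCVES's actual rules.

    %% Illustrative name resolution for schema migration.
    max_name_length(oracle, 16).      % limits as used in the example above
    max_name_length(ingres, 32).

    reserved_word(oracle, date).      % example entries only
    reserved_word(oracle, level).

    resolve_name(TargetDbms, Name, Resolved) :-
        ( reserved_word(TargetDbms, Name)
        -> atom_concat(Name, '_c', Tagged)          % tag a clashing name
        ;  Tagged = Name
        ),
        max_name_length(TargetDbms, Max),
        atom_length(Tagged, Len),
        ( Len =< Max
        -> Resolved = Tagged
        ;  sub_atom(Tagged, 0, Max, _, Resolved)    % truncate: uniqueness may be lost
        ).

Under these assumptions, resolve_name(oracle, date, R) would yield R = date_c, while a 32-character INGRES name would be cut to its first 16 characters for Oracle - exactly the situation in which the user must check that uniqueness has not been lost.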


CHAPTER 8

Evaluation, Future Work and Conclusion

In this chapter the Conceptualised Constraint Visualisation and Enhancement System (CCVES) described in Chapters 5, 6 and 7 is evaluated with respect to our hypotheses and objectives listed in Chapter 1. We describe the functionality of the different components of CCVES to identify their strengths and summarise their limitations. Potential extensions and improvements are considered as part of future work. Finally, conclusions about the work are drawn by reviewing the objectives and the evaluation.

8.1 Evaluation

8.1.1 System Objectives and Achievements

The major technical challenge in designing CCVES was to provide an interactive graphical environment to access and manipulate legacy databases within an evolving heterogeneous distributed database environment, for the purpose of analysing, enhancing and incrementally migrating legacy database schemas to modern representations. The objective of this exercise was to enrich a legacy database with valuable additional knowledge that has many uses, without being restricted by the existing legacy database service and without affecting the operation of the legacy IS. This knowledge is in the form of constraints that can be used to understand and improve the data of the legacy IS. Here, we assess the main external and internal aspects of our system, CCVES, based on the objectives laid out in sections 1.2 and 2.4.

Externally, CCVES performs three important tasks: initially, a reverse-engineering process; then, a knowledge augmentation process, which is a re-engineering process on the original system; and finally, an incremental migration process. The reverse-engineering process is fully automated and is generalised to deal with the problems caused by heterogeneity.

a) A framework to address the problem of heterogeneity

The problems of heterogeneity have been addressed by many researchers, and at Cardiff the meta-translation technique has been successfully used to demonstrate its wide-ranging applicability to heterogeneous systems. This previous work, which includes query meta-translation [HOW87], schema meta-translation [RAM91] and schema meta-integration [QUT93], was developed using Prolog - emphasising its suitability for meta-data representation and processing. Hence Prolog was chosen as the main programming language for the development of our system. Among the many Prolog versions around, we found that Quintus Prolog was well suited to supporting an interactive graphical environment as it provided access to X window and Motif widget routines. Also, the PRODBI tools [LUC93] were available with Quintus Prolog, and these enabled us to directly access a number of relational DBMSs, like INGRES, Oracle and SYBASE.


Our framework for meta-data representation and manipulation has been described in section 7.2. The meta-programming approach enabled us to implement many other features, such as the ability to easily customise our system for different data models, e.g. relational and object-oriented (cf. section 7.2.1); the ability to easily enhance or customise it for different display models, e.g. E-R, EER and OMT (cf. section 7.2.4); and the ability to deal with heterogeneity due to differences in local databases (e.g. at the global level the user views all local databases as if they come from the same DBMS, and is also able to view databases using a preferred DDL syntax).

b) An interactive graphical environment

An interactive graphical environment which makes extensive use of modern graphical user interface facilities was required to provide graphical displays of conceptual models and allow subsequent interaction with them. To fulfil these requirements the CCVES software development environment had to be based on a GUI sub-system consisting of pop-up windows, pull-down menus, push buttons, icons, etc. We selected X window and Motif widgets to build such an environment on a UNIX platform. Sun SPARC workstations were used for this purpose. Provision of interactive graphical responses when working via this interface was also included to ensure user friendliness (cf. section 7.5).

c) Ability to access and work on legacy database systems

An initial, basic facility of our system was the ability to access legacy database systems. This process, which is described in section 6.5, enables users to specify and access any database system over a network. Here, as the schema information is usually static, CCVES has been designed to provide the user with the option of by-passing the physical database access process and using instead an existing (already accessed) logical schema. This saves time once the initial access to a schema has been made. Also, it guarantees access to meta-data of previously accessed databases during server and network breakdowns, which were not uncommon during the development of our system.

d) A framework to perform the reverse-engineering process

A framework to perform the reverse-engineering process for legacy database systems has been provided. This process is based on applying a set procedure which produces an appropriate conceptual model (cf. section 5.2). It is performed automatically even if there is very limited meta-knowledge. In such a situation, links that should be present in the conceptual model will not appear in the corresponding graphical display. Hence, the full success of this process depends on the availability of adequate meta-knowledge. This means that a real world data modelling framework that facilitates the enhancement of legacy systems must be provided, as described next.

e) A framework to enhance existing systems

A comprehensive data modelling framework that facilitates the enhancement of established database systems has been provided (cf. section 5.6).


A method of retaining the enhanced knowledge for future use, which is in line with current international standards, is employed. Techniques that are used in recent versions of commercial DBMSs are supported to enable legacy databases to logically incorporate modern data modelling techniques, irrespective of whether these are supported by their legacy DBMSs or not (cf. section 5.7). This enhancement facility gives users the ability to exploit existing databases in new ways (i.e. restructuring and viewing them using modern features even when these are not supported by the existing system). The enhanced knowledge is retained in the database itself so that it is readily available for future exploitation by CCVES or other tools, or by the target system in a migration.

f) Ability to view a schema using preferred display models

The original objective of producing a conceptual model as a result of our reverse-engineering process was to display the structure of databases in a graphical form (i.e. a conceptual model) and so make it easier for users to comprehend their contents. As all users are not necessarily familiar with the same display model, the facility to visualise using a user-preferred display model (e.g. EER or OMT) has been provided. This is more flexible than our original aim.

g) High level of data abstraction for better understanding

A high level of data abstraction for most components of a conceptual model (i.e. visualising the contents, relationships and behavioural properties of entities and constraints, including identification of intra- and inter-object constraints) has been provided (cf. section 7.5.1). Such features are not usually incorporated in visualisation tools. These features and various other forms of interaction with conceptual models are provided via the user interface of CCVES.

h) Ability to enhance schema and to verify the database

The schema enhancement process was provided originally to enrich a legacy database schema and its resultant conceptual model. A facility to determine the constraints on the information held, and the extent to which the legacy data conforms to these constraints, is also provided to enable the user to verify their applicability (section 5.7). The graphical user interface components used for this purpose are described in section 7.5.2.

i) Ability to migrate while the system continues to function

The ability to enhance and migrate while a legacy database continues to function normally was considered necessary, as it ensures that this process will not affect the ongoing operation of the legacy system (section 5.8). The ability to migrate to a single target database environment for all legacy databases assists in removing the physical heterogeneity between these databases. Finally, the ability to integrate CCVES with other tools to maximise the benefits to the user community was also provided (section 7.4.3).

8.1.2 System Development and Performance


A working prototype CCVES system that enabled us to test all the contributions of this research was implemented using Quintus Prolog with X window and Motif libraries; the INGRES, Oracle and POSTGRES DBMSs; the C programming language embedded with SQL and POSTQUEL; and the PRODBI interface to INGRES. This system can be split into four parts, namely: the database access process to capture meta-data from legacy databases; the mapping of the meta-data of a legacy database to a conceptual model to present the semantics of the database using a graphical environment; the enhancement of a legacy database schema with constraint-based knowledge to improve its semantics and functionality; and the incremental migration of the legacy database to a target database environment.

Our initial development commenced using POPLOG, which was at that time the only Prolog version with any graphical capabilities available on UNIX workstations at Cardiff. Our initial exposure to X window library routines occurred at this stage. Later, with the availability of Quintus Prolog, which had a more powerful graphical capability due to its support of X windows and Motif widgets, it was decided to transfer our work to this superior environment. To achieve this we had to make two significant changes, namely: converting all POPLOG graphic routines to Quintus equivalents, and modifying a particular implementation approach we had adopted when working with POPLOG. The latter took advantage of POPLOG's support for passing unevaluated expressions as arguments of Prolog clauses; in Quintus Prolog we had to evaluate all expressions before passing them as arguments.

Due to the use of slow workstations (i.e. SPARC1s) and running Prolog interactively, there was a delay in most interactions with our original system. This delay was significant (e.g. nearly a minute) when a conceptual model had to be redrawn. It was necessary to redraw this model when we moved an object of the display in order to change its location, and whenever the drawing window was exposed. This exposure occurred when the window's position changed, when it was overlapped by another window or a menu, or when someone clicked on the window. In such situations it was necessary to refresh the drawing window by redrawing the model. Redrawing was required because our initial attempt at producing a conceptual model was based solely on drawing routines. This method was inefficient, as such drawings had to be redone every time the drawing window became exposed.

Our second attempt was to draw conceptual models in the background using a pixmap. This process allocates part of the memory of the computer to enable us to draw an image directly and retain it. A pixmap can be copied to any drawing window without having to reconstruct its graphical components. This means that when the drawing window becomes exposed it is possible to copy this pixmap to that window without redrawing the conceptual model. The process of copying a pixmap to the drawing window took only a few seconds, and so there was a significant improvement over our original method. However, with this new approach, whenever a move operation is performed it is still necessary to recompute all graphical settings and redraw; this took a similarly long time to the original method. The use of a pixmap also took up a significant part of the computer's memory, and as a result Quintus was unable to cope if there was a need to view multiple conceptual models simultaneously. We also experienced several instances of unusual system behaviour, such as failure to execute routines that had been tested previously. This was due to the full utilisation of Prolog's run-time memory because of the existence of this pixmap.
We noticed that Quintus Prolog had a bug whereby it was unable to release the memory used by a pixmap.


In order to regain this memory we had to log out (exit) from the workstation, as the xnew process which was collecting garbage was unable to deal with this case. Hence, we decided to use widgets instead of drawing rectangles for entities, as widgets are managed automatically by X windows and Motif routines. This allowed us to reduce the drawing components in our conceptual model and hence to minimise redrawing time when the drawing window became exposed. We discarded the pixmap approach as it gave us many problems. However, as widgets themselves take up memory, their behaviour under some complex conceptual models is questionable. We decided not to test this in depth as we had already spent too much time on this module, and its feasibility had been demonstrated satisfactorily.

During the course of CCVES development, Quintus Prolog was upgraded from release 3.1 to 3.1.4. Due to incompatibilities between the two versions, certain routines of our system had to be modified to suit the new version. This meant that a full test of the entire system was required. Also, since three versions of INGRES, two versions of Oracle and one version of POSTGRES were used during our project, still more system testing was required. Thus, we experienced several changes to our system due to technological changes in its development environment. Comparing the lifespan and scale of our project with those of a legacy IS, we can more clearly appreciate the amount of change that is required for such systems to keep up with technological progress and business needs. Hence, the migration of any IS is usually a complex process. However, the ability to enhance and evolve such a system without affecting its normal operation is a significant step towards assisting this process.

Our final task was to produce a compiled version of our system. This is still being undertaken: although we have been able to produce executable code, some user interface options are not activated, for unknown reasons (we think this may be due to insufficient memory), even though the individual modules work correctly.

8.1.3 System Appraisal

The approach presented in this thesis for mapping a relational database schema to a conceptual schema is in many ways simpler and easier to apply than previous attempts, as it has eliminated the need for any initial user interaction to provide constraint-based knowledge for this process. Constraint information such as primary and foreign keys is used to automatically derive the entity and relationship types. Use of foreign key information was not considered in previous approaches, as most database systems did not support such facilities at that time.

One major contribution of our work is providing the facility for specifying and using constraint-based information in any type of DBMS. This means that once a database is enhanced with constraints, it is semantically richer. If the source DBMS does not support constraints then the conceptual model will still be enhanced, and our tool will augment the database with these constraints in an appropriate form. Another innovative feature of our system is the automated use of the DML of a database to determine the extent to which its data conforms to the enhanced constraints. This enables users to take appropriate compensatory actions prior to migrating legacy databases. We provided an additional level of schema abstraction for our conceptual models.


This is in the form of viewing the constraints associated with a schema. This feature allows users to gain a better understanding of databases. The facility to view multiple schemas allows users to compare different components of a global system if it comprises several databases. This feature is very useful when dealing with heterogeneous databases. We also deal with heterogeneity at the conceptual viewing stage by providing users with the facility to view a schema using their preferred modelling notation. For example, in our system the user can choose either an EER or an OMT display to view a schema. This ensures greater accuracy in understanding, as the user can select a familiar modelling notation to view database schemas. The display of the same schema in multiple windows using different scales allows the user to focus on a small section of the schema in one window while retaining a larger view in another. The ability to view multiple schemas also means that it is possible to jointly monitor the progress or status of the source and target databases during an incremental migration process. The introduction of both EER and OMT as modelling options means that recent advances which were not present in the original E-R model and some subsequent variants can be represented using our system.

Our approach of augmenting the database itself with new semantic knowledge, rather than using separate specialised knowledge-bases, means that our enhanced knowledge is accessible by any user or tool using the DML of the database. This knowledge is represented in the database using an extended version of the SQL-3 standards for constraint representation. Thus this knowledge will be compatible with future database products, which should conform to the new SQL-3 standards. Also, no semantics are lost due to the mapping from a conceptual model to a database schema. Oracle version 6 provided similar functionality by allowing constraints to be specified even though they could not be applied until the introduction of version 7.

8.1.4 Useful Real-life Applications

We were able to successfully reverse-engineer a leading telecommunication database extract consisting of over 50 entities. This enabled us to test our tool on a scale greater than that of our test databases. The successful use of all or parts of our system for other research work, namely: accessing POSTGRES databases for semantic object-oriented multi-database access [ALZ96], viewing heterogeneous conceptual schemas when dealing with graphical query interfaces [MAD95], and viewing heterogeneous conceptual schemas via the world wide web (WWW) [KAR96], indicates its general usefulness and applicability. The display of conceptual models can be of use in many areas such as database design, database integration and database migration. We can identify similar areas of use for CCVES. These include training new users by allowing them to understand an existing system, and enabling users to experiment with possible enhancements to existing systems.

8.2 Limitations and Possible Future Extensions

There are a number of possible extensions that could be incorporated to improve the current functionality of our system. Some of these are associated with improving run-time efficiency, accommodating a wider range of users and extending graphical user interaction capabilities.


Such extensions would not have great significance with respect to demonstrating the applicability of our fundamental ideas. Examples of improvements are: engineering the system to the level of a commercial product so that it could be used by a wide range of users with minimal user training; improving the run-time efficiency of the system by producing a compiled version; testing it in a proper distributed database environment, as our test databases did not emphasise distribution; extending the graphical display options to offer other conceptual models, such as ECR; extending the system to enable us to test migrations to a proper object-oriented DBMS (i.e. not only to an extended relational DBMS with O-O features, like POSTGRES); and improving the display layout algorithm (cf. section 7.3) to efficiently manage large database schemas. The time scale for such improvements would vary from a few weeks to many months each, depending on the work involved.

Our system is designed to cope with two important extensions: extending the graphical display option to offer other forms of conceptual models, and extending the number of DBMSs and their versions it can support. Of these two extensions, the least work is involved in supporting a new graphical display. Here, the user needs to identify the notations used by the new display and write the necessary Prolog rules to generate the strings and lines used for the drawings (a sketch of such display rules is given at the end of this section). This process will take at most one week, as we do not change graphical constructs such as class_info and ref_info (cf. section 7.2.4) to support different display models. On the other hand, inclusion of a new relational DBMS or version can take a few months, as it affects three stages of our system: meta-data access, constraint enforcement and database migration. All three stages use the query language (SQL) of the DBMS and hence, if its dialect differs, we will need to expand our QMTS. The time required for such an extension will depend on its similarity to standard SQL and may take 2-4 person-weeks. Next, we need to assess the constraint handling features supported by the new DBMS so that we can use our knowledge-based tables to overcome any constraint representation limitations. This process may take 1-2 person-weeks. To access the meta-data from a database it is necessary to know the structures of its system tables. Also, we need a mechanism to externally access this information (i.e. use an ODBC driver or write our own). This stage can take 1-6 person-weeks, as in many cases system documentation will be inadequate. Inclusion of a different data model would be a major extension, as it affects all stages of our system. It would require provision of new abstraction mechanisms such as parent-child relationships for a hierarchical model and owner-member relationships for a network model.

Other possible extensions are concerned with incorporating software modules that would expand our approach. These include a forward gateway for use at the incremental migration stage; an integration module for merging related database applications; and analysers for extracting constraint-based information from legacy IS code. These are important and major areas of research, hence the development of such modules could take from many months to years in each case.
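As a hedged illustration of the display-model extension just described, the sketch below adds drawing clauses for a hypothetical ECR notation alongside OMT and EER ones, while leaving the graphical constructs untouched. The argument structures of class_info and ref_info are assumed here purely for illustration; they are not the actual CCVES definitions of section 7.2.4, and the output is text rather than real drawing calls.

    %% Hypothetical sketch: a new display model is added by supplying drawing
    %% rules; the class_info/ref_info constructs themselves are unchanged.
    %% (Assumed argument structures: class name with x,y position; linked names.)

    draw_class(omt, class_info(Name, X, Y)) :-
        format("OMT class box ~w at (~w,~w)~n", [Name, X, Y]).
    draw_class(eer, class_info(Name, X, Y)) :-
        format("EER entity box ~w at (~w,~w)~n", [Name, X, Y]).
    draw_class(ecr, class_info(Name, X, Y)) :-        % new notation: new clauses only
        format("ECR entity box ~w at (~w,~w)~n", [Name, X, Y]).

    draw_ref(omt, ref_info(From, To)) :-
        format("OMT association line ~w - ~w~n", [From, To]).
    draw_ref(eer, ref_info(From, To)) :-
        format("EER relationship diamond between ~w and ~w~n", [From, To]).
    draw_ref(ecr, ref_info(From, To)) :-
        format("ECR relationship link ~w - ~w~n", [From, To]).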


8.3 Conclusion

8.3.1 Overall Summary

This thesis has reported the results of a research investigation aimed at the design and implementation of a tool for enhancing and migrating heterogeneous legacy databases. In the first two chapters we introduced our research and its aims and objectives. Then in chapter 3, we presented some preliminary database concepts and standards relevant to our work. In chapters 4 and 5, we introduced wider aspects of our problem and studied alternative ways proposed to solve major parts of it. Many important points emerged from this study. These include: the application of meta-translation techniques to deal with legacy database system heterogeneity; the application of migration techniques to specific components of a database application (i.e. the database service) as opposed to an IS as a whole; extending the application of database migration beyond the traditional COBOL-oriented and IBM database products; the application of a migration approach to distributed database systems; enhancing previous re-engineering approaches to incorporate modern object-oriented concepts and multi-database capabilities; and introducing semantic integrity constraints into legacy database systems and hence exploring them beyond their structural semantics.

In chapter 5, we described our re-engineering approach and explained how we accomplished our goals in enhancing and preparing legacy databases for migration, while chapter 6 was concerned with testing our ideas using carefully designed test databases. Also in chapter 6, we provided illustrative examples of our system working. In chapter 7, we described the overall architecture and operation of the system together with related implementation considerations. Here we also gave some examples of our system interfaces. In chapter 8, we carried out a detailed evaluation which included research achievements, limitations and suggestions for possible future extensions. We also looked at some real-life areas of application in which our prototype system has been tested and/or could be used. Finally, some major conclusions that can be drawn from this research are presented below.

8.3.2 Conclusions

The important conclusions that can be drawn from the work described in this thesis are as follows:

• Although many approaches have been proposed for mapping relational schemas to a form where their semantics can be more easily understood by users, they either lack the application of modern modelling concepts or have been applied to logically centralised or decentralised database schemas, not physically heterogeneous databases.

• Previously proposed approaches for mapping relational schemas to conceptual models involve user interaction and pre-requisites. This is confusing for first-time users of a system, as they do not have any prior direct experience or knowledge of the underlying schema. We produce an initial conceptual model automatically, prior to any user interaction, to overcome this problem. Our user interaction commences only after the production of the initial conceptual model. This gives users the opportunity to gain some vital basic understanding of a system prior to any serious interaction with it.

• Most previous reverse-engineering tools have ignored an important source of database semantics, namely semantic integrity constraints such as foreign key definitions. One obvious reason for this is that many existing database systems do not support the representation of such semantics. We have identified and shown the important contribution that semantic integrity constraints can make by presenting them and applying them to the conceptual and physical models. We have also successfully incorporated them into legacy database systems which do not directly support such semantics.


• The problem of legacy IS migration has not been studied for multi-database systems in general. This appears to present many difficulties to users. We have tested and demonstrated the use of our tools with a wide range of relational and extended relational database systems.

• The problem of legacy IS migration has not been studied for more recent and modern systems; as a result, ways of eliminating the need for migration have not yet been addressed. Our approach of enhancing legacy ISs irrespective of their DBMS type will assist in redesigning modern database applications and hence overcome the need to migrate such applications in many cases.

• Our evaluation has concluded that most of the goals and objectives of our system, presented in sections 1.2 and 2.4, have been successfully met or exceeded.