

Data Transformation Interactive Design

Jiří Šlefr

December 10, 2001


Abstract

The popularity of data warehouses, which provide many possibilities for displaying data, is growing rapidly. This phenomenon is forcing software companies to invest in the development of tools that provide easy manipulation of data stored in different types of standard data stores, such as databases and data warehouses. From the data point of view we can treat both of these, and many others, as equal and call them data sources. Some standards exist in this area of development, especially concerning data formats. Unfortunately, these standards cannot cover all users' wishes and needs, so in many cases the user is forced either to work with data in an unsuitable form or to write a program that transforms the unsuitable data format into a suitable one. Such data manipulation is called a Transformation.

The goal of this work was to design and implement a tool with a standard GUI, providing interactive manipulation of data sources using transformations. The output of this tool is a special template written as METADATA. This template is passed to the SumatraTT product, which processes the commands inside the template and performs the physical data Transformation.

The final tool is implemented in the JAVA(TM) programming language in order to be platform independent, and it allows the user to add support for new transformations and data sources. The tool also follows the latest trends in information storing techniques, using XML technology for storing transformation and data source METADATA. Because the main functionality of the discussed tool is to allow the user to "design", it can be treated as an IDE.


Contents

1 Introduction 3

2 Business Intelligence, Data Warehousing 6
  2.1 Growth of Data Warehousing . . . . . . . . . . . . . . . . 6
  2.2 Requirements to Data Warehouses . . . . . . . . . . . . . . 7
  2.3 Data Warehouse System Architecture . . . . . . . . . . . . 8
  2.4 Multidimensional model . . . . . . . . . . . . . . . . . . 11
  2.5 Aggregations . . . . . . . . . . . . . . . . . . . . . . . 12
  2.6 Design of a Data Warehouse . . . . . . . . . . . . . . . . 12

3 Metadata 15
  3.1 Abstraction . . . . . . . . . . . . . . . . . . . . . . . . 15
  3.2 Historical approaches . . . . . . . . . . . . . . . . . . . 17
    3.2.1 Data Administration: Data Dictionaries . . . . . . . 17
    3.2.2 Problems of Distributed Systems . . . . . . . . . . . 18
    3.2.3 Early Metadata . . . . . . . . . . . . . . . . . . . 18
  3.3 Using Metadata . . . . . . . . . . . . . . . . . . . . . . 19
  3.4 Metadata Requirements . . . . . . . . . . . . . . . . . . . 21
  3.5 Metadata Components . . . . . . . . . . . . . . . . . . . . 23
  3.6 Four Layer Meta-data Architectures . . . . . . . . . . . . 25
  3.7 Metadata management . . . . . . . . . . . . . . . . . . . . 27
    3.7.1 Metadata and the growth of data warehousing . . . . . 27
    3.7.2 Enterprise view of Metadata Management . . . . . . . 27
  3.8 Tools for Metadata . . . . . . . . . . . . . . . . . . . . 28

4 Metadata Repository 30
  4.1 Using a Repository as a Metadata Integration Platform . . . 31
  4.2 Repository Lifecycle . . . . . . . . . . . . . . . . . . . 31
  4.3 Database Management System . . . . . . . . . . . . . . . . 33
  4.4 Fully Extensible Meta Model . . . . . . . . . . . . . . . . 33
  4.5 Central Point of Metadata Control . . . . . . . . . . . . . 34
  4.6 Impact Analysis Capability . . . . . . . . . . . . . . . . 34
  4.7 Naming Standards Flexibility . . . . . . . . . . . . . . . 34
  4.8 Versioning Capabilities . . . . . . . . . . . . . . . . . . 35
  4.9 Query and Reporting . . . . . . . . . . . . . . . . . . . . 35
  4.10 Data Warehousing Support . . . . . . . . . . . . . . . . . 35

5 SumatraTT project 36
  5.1 SumatraTT Overview . . . . . . . . . . . . . . . . . . . . 37
  5.2 SumatraTT GUI . . . . . . . . . . . . . . . . . . . . . . . 37
    5.2.1 The Java(TM) programming language . . . . . . . . . . 39
    5.2.2 IDE's objects interactivity model . . . . . . . . . . 39
    5.2.3 IDE's UI model . . . . . . . . . . . . . . . . . . . 46
    5.2.4 IDE's Use Case/Data Flow model . . . . . . . . . . . 49
    5.2.5 IDE's Extensibility model . . . . . . . . . . . . . . 52
    5.2.6 Modules . . . . . . . . . . . . . . . . . . . . . . . 54
    5.2.7 Transformation extension API . . . . . . . . . . . . 55
    5.2.8 Graphical Transformation . . . . . . . . . . . . . . 61
    5.2.9 Wizard extension API . . . . . . . . . . . . . . . . 65
    5.2.10 IDE configurability . . . . . . . . . . . . . . . . 68
    5.2.11 DataSource editor . . . . . . . . . . . . . . . . . 71

6 Summary 73

A Methods of the MainClass in the Transformation API 74

B Methods of the MainClass in the Wizard API 77


Chapter 1

Introduction

Data is becoming an abundant commodity because it can be found anywhere and everywhere. However, data in itself is losing value precisely because there is so much of it. Not too long ago, the situation was exactly the opposite: raw information was extremely difficult to acquire and therefore highly prized. Nowadays, it is possible to get high quality raw data quickly and cheaply.

The availability of data, and the enormous quantities in which it is available, makes it difficult to work with. When data was rare, the amount available could be consumed by users with very primitive tools. Now that plenty of data is available, we have the problem of separating the valuable facts from the rest. The problem is critical because the ability to use data to answer business questions is the key to making money.

We know that in order to make good business decisions we need good data. The process which has been accepted as good business practice can be described as follows:

• Acquiring quality raw data.

• Combining and integrating the data to get useful information.

• Analyzing the information to make high quality decisions.

The big issue is knowing which data to use to create useful information. The torrent of raw data has added more complexity to the process. What we need is to put the data into context, give the data meaning, relevance and purpose, and make it complete and accurate. Data which is viewed in this light is called information, because it is usable for deductive and inductive insights which can lead us to quality decisions.

Data is flowing into the company from suppliers and customers. The internal systems of the corporation add their share. Corporations are


coupling together in webs of suppliers, partners and customers, exchanging a myriad of information through a spectrum of technologies such as Electronic Document Interchange (EDI), Electronic Funds Transfer (EFT) systems, email, etc.

At the same time, existing systems in enterprises continue to generate data on orders, sales, revenues, employee information, inventory and every other parameter imaginable. As computers become more and more affordable, as storage costs continue to fall, and as sophistication in the use of information technology increases, the growth of generated data becomes exponential.

What do we know about the data being generated by these systems? First, we know that it is by and large dispersed across the enterprise. Each department, division, group, section, subsection or any other subdivision is today capable of generating its own unique sets of data. Management's interest in achieving operating efficiencies by centralizing, decentralizing and reengineering has created the opportunity for the data in one group to have a different meaning in another group in the same organization. This data disparity is exacerbated by readily available CASE tools, rapid application development tools, application and code generators, underutilized data models and definitions, database products, spreadsheets and other client-friendly products, and by a lack of leadership in management.

We know that along with this dispersion there exists a view that the data generated by each group belongs only to that group and is intended only for its own uses.

Finally, we know that because of the observations mentioned above, the potential for integrating these disparate data elements across various departments is poor without some significant work.

Recently, management has begun to recognize the value of using data as a corporate asset. The idea of using all of the organization's data to get a complete picture of the enterprise is today's ideal. Management is recognizing the need to view data from multiple departments to get some kind of combined view of operations. The concept of a data warehouse has appeared as a technology by which management can get a single comprehensive view of the state of the organization. Data is extracted at regular intervals from existing systems and placed in the warehouse, summarized to allow management to look at trends, but also available in detail for drill-down data access and analysis.

Management has realized that data is a resource, but only if all its important attributes are known and understood. Data must be set in context, have meaning to its users, be relevant and have purpose. It is not enough to know that inventory levels have ranged between two values over time. One


has also to know what the definition of inventory levels is, and so on. Data must also be complete and accurate. If there are multiple sources for a particular data element, which one is used in the data warehouse, and why? What are the business rules which impact how we view the data? If we have calculated data elements such as profitability, what formulas have been used to get these results? Only when these are known, understood and applied can data be fully utilized, and only then can we begin the reliable building of information from the data which ultimately leads to quality decision making.

The need to understand the data leads to the need to manage the data. This need is particularly acute in systems such as data warehouses, whose primary purpose is to provide answers and supply fertile ground for exploration and insight. When we talk about managing data, we are talking about a store of attributes about data, or data about data. This concept is termed "metadata", from the Greek "meta", meaning beyond, transcending, or situated behind. So we are talking about data that sits behind the operational data and that describes its origin, meaning, derivation, etc.

A data resource becomes useless without readily available, high quality metadata, whose primary objective is to provide a comprehensive guide to the data resource.


Chapter 2

Business Intelligence, Data Warehousing

There are many books, reports and master's theses concerning Data Warehousing and Business Intelligence. This chapter describes the basic parts and properties of Data Warehouses from a high-level view.

First, it is necessary to define what a data warehouse is: The data warehouse is a decision support knowledge base, oriented to a concrete problem or business area, integrated from different data sources, and periodically updated. The data warehouse often contains a huge amount of data, and its structure uses the multidimensional model described in section 2.4 on page 11.

2.1 Growth of Data Warehousing

There is no doubt that data warehousing has grown as a technology and that it has firmly established itself as a mainstream tool in competitive businesses today. However, along with the success of the concept came growth in a number of other areas.

• Growth across platforms. Initially, global data warehouses were the domain of a few platforms and a few vendors. As their popularity grew, data warehousing technologies such as parallel computing, parallel databases, and OLAP/ROLAP were extended onto smaller and more pervasive platforms, so that now the range of platforms which claim data warehousing capabilities spans from Microsoft NT, through the UNIX domain, and on up to giants such as NCR, DEC, and IBM.


• Growth across tools. Vendors sprang from all directions to jump on, each touting special features and functions. From legacy data extraction tools to maintenance and scheduling tools, and tools that purported to handle metadata, vendors and products proliferated.

• Growth across departments. Success also breeds demand, which is exactly what happens when department A gets a new data set and starts showing colleagues in department B how easily they can now access consistent data. Soon, department B has its own data set, and departments C, D, and E are not far behind.

2.2 Requirements to Data Warehouses

Basic requirements for a data warehouse were specified in [22]:

• The data warehouse provides access to enterprise data.

Managers and analysts of the enterprise have to be able to connect to the data warehouse from their computers. This connection must be immediate, on demand and with high performance. High performance access means that the tiniest queries run in less than one second. Access also means that the tools available to managers and analysts are "easy to use", meaning that a useful report can be run with one button click and the report can be changed with two clicks.

• The data in a data warehouse is consistent.

Consistency means that if two people issue the same request to the data warehouse, they receive the same information, even if they ask at different times. Consistency also means that if those people request a definition of any data element, they will receive a useful answer that lets them know what they are fetching from the data warehouse. Consistency also means that if yesterday's data was not loaded properly, the analyst will be warned that the data load was not complete.

• The data in a data warehouse can be separated and combined by meansof every possible measure in the business.

The slicing and dicing requirement speaks directly to the dimensional approach. A more operational definition of slicing and dicing is row headers and constraints. Row headers and constraints will turn out to be fundamental building blocks of every data warehouse application, and they will come directly from the dimensions in our data model.


• The data warehouse is not just data, but also a set of tools to query,analyze and present information.

The central data warehouse hardware, the relational database software and the data itself are only about 60 percent of what is needed for a successful data warehouse. The remaining 40 percent is a set of front-end tools that query, analyze and present the data.

• The data warehouse is a place where we publish used data.

The responsibility to publish is at the very core of the data warehouse. Data is not simply accumulated at a central point and let loose. Rather, it is carefully assembled from a variety of information sources around the enterprise, cleaned up, quality assured, and then released only if it is suitable for use. If the data is unreliable or incomplete, the responsible data quality manager does not allow it to be published to the user community. The data quality manager is responsible for the content and quality of the publication and is identified with the deliverable.

• The quality of the data in a data warehouse is a driver of business reengineering.

The best data in any company is the record of how much money someone else owes the company. Data quality goes downhill after that. Frequently a data element would be very interesting if it were of high quality, but it either is not collected at all or is optional.

The data warehouse can't fix poor data quality. The only way to fix data quality is for both affected parties, data entry and management, to return to the source of the data with better systems, better management and better visibility of the value of good data.
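The "row headers and constraints" building blocks mentioned in the slicing and dicing requirement above can be sketched as query construction over a star schema. The following Java fragment is illustrative only (the table names `fact`, `product`, `time` and their columns are invented, not taken from this thesis): row headers become the SELECT and GROUP BY columns, and constraints become the WHERE clause.

```java
// Sketch: "row headers and constraints" as the building blocks of a
// dimensional query. All table and column names are hypothetical.
public class SliceQuery {
    public static String build(String[] rowHeaders, String constraint) {
        String cols = String.join(", ", rowHeaders);
        return "SELECT " + cols + ", SUM(f.sales) "
             + "FROM fact f JOIN product p ON f.product_id = p.id "
             + "JOIN time t ON f.time_id = t.id "
             + "WHERE " + constraint + " "
             + "GROUP BY " + cols;   // the row headers drive the grouping
    }

    public static void main(String[] args) {
        // Row headers taken from the product and time dimensions,
        // a constraint restricting the time dimension.
        System.out.println(build(
            new String[] {"p.group_name", "t.year"},
            "t.year = 1998"));
    }
}
```

Changing the slice is then only a matter of swapping the dimension columns and the constraint; the query skeleton stays the same.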

2.3 Data Warehouse System Architecture

The architecture of the data warehouse system is shown in figure 2.1. There are several distinct levels in the architecture, and it is necessary to describe them:

OLTP — On-Line Transaction Processing. This part of the system is oriented to transactions. A serious OLTP system processes thousands or even millions of transactions per day. Each transaction contains one small piece of data. The point of transaction processing is to process a very large number of tiny, atomic transactions without losing any


Figure 2.1: Data Warehouse System Architecture
[Diagram: several OLTP sources feed, via the Data Pump, into the Data Warehouse; the Application and Presentation levels issue OLAP queries against it and receive results.]


of them. The essence of a transaction is that both the sender and the receiver agree at all times as to whether the transaction has taken place. The consistency of OLTP is very important, and it is microscopic. All that we care about in OLTP consistency is whether we agree that all the transactions presented to the system have been accounted for.

Data Pump — a special part of the system which loads data from all available OLTP sources into the data warehouse structure. During the loading process, data can be transformed or modified.

Data Warehouse — this level is the core of the data warehouse system. It is the storage place where all the data loaded, transformed and modified from the OLTP sources is stored. All data here is read-only. A serious data warehouse often processes only one transaction per day, but this transaction contains thousands or even millions of records. The transaction is called the "production data load". The consistency of the data warehouse is also very important, but in contrast to OLTP, consistency here is measured globally. In general, we don't care about an individual transaction, but we care enormously that the current load of new data is a full and consistent set of data. Instead of a microscopic perspective, we have a quality assurance perspective. Instead of technical calculations of data consistency, we have a manager's judgement of data consistency.

OLAP — On-Line Analytical Processing. This part is very important and complex, because it allows an analyst or a manager to query the data warehouse for analytically useful information, enabling intelligent business decisions. It works directly with the data warehouse storage and, as already mentioned, it can only read data, not write it, in order to preserve the global consistency of the data warehouse system. There are several OLAP technologies:

MOLAP — used for data warehouses with a multidimensional structure (hyper cube).

ROLAP — used for data warehouses with a relational structure (relational databases).

HOLAP — used for data warehouses with a hybrid structure (a combination of both).

Application and Presentation level — we can treat these two levels together because, in fact, a user of the data warehouse system


Figure 2.2: Hyper cube

sees them as one piece. This level is responsible for communication between the system and the user. It receives requests and queries, and then presents results to the user. It is also responsible for warning or informing the user about the state of the whole system.

2.4 Multidimensional model

The core of the multidimensional model of a data warehouse is the special architecture and organization of the place where data is stored. Because of this special architecture, the place is called a Hyper cube. The schema of a hyper cube is shown in figure 2.2.

The cube in the figure has three dimensions, but that is not a rule; in fact there is almost no limit to the number of dimensions of the cube. The figure shows that the whole hyper cube can be divided into smaller hyper cubes. Each small cube, or group of cubes, has its own concrete meaning. The operations which divide the hyper cube into smaller logical parts are called "slicing" and "dicing". These operations and their results are shown in figure 2.3.
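As an illustration of slicing, a sparse hyper cube with three dimensions (product, place, day, mirroring the detail level used later in the text) can be modeled as a map from coordinate tuples to values; a slice keeps only the cells in which one dimension is fixed to a single value. This is a minimal sketch with invented data, not code from the tool described in this thesis:

```java
import java.util.*;

// Sketch: a tiny three-dimensional hyper cube (product x place x day)
// stored sparsely, with "slice" fixing one dimension to a single value.
// All names and figures are illustrative only.
public class HyperCube {
    private final Map<List<String>, Double> cells = new HashMap<>();

    public void put(String product, String place, String day, double v) {
        cells.put(List.of(product, place, day), v);
    }

    // Slice: keep only the cells whose given dimension equals the value.
    public Map<List<String>, Double> slice(int dim, String value) {
        Map<List<String>, Double> out = new HashMap<>();
        for (var e : cells.entrySet())
            if (e.getKey().get(dim).equals(value))
                out.put(e.getKey(), e.getValue());
        return out;
    }

    public static void main(String[] args) {
        HyperCube cube = new HyperCube();
        cube.put("gadget", "Prague", "Mon", 10.0);
        cube.put("gadget", "Berlin", "Mon", 7.0);
        cube.put("widget", "Prague", "Tue", 3.0);
        // Slice on the place dimension (index 1): everything sold in Prague.
        System.out.println(cube.slice(1, "Prague").size()); // 2 cells
    }
}
```

A dice would apply the same filtering idea to a range of values on one or more dimensions instead of a single fixed value.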


Figure 2.3: Sliced/Diced hyper cube

2.5 Aggregations

Aggregations are combinations of existing data; in most cases the combination means summarization. These aggregations are stored in a data warehouse in aggregation tables. We distinguish the following basic types of aggregations:

• Pre-summarized results of queries on higher aggregation levels.

• Detail data, e.g. sales at the product, place, day level.

• Summarized tables, e.g. sales at the product group, year level.

Examples of aggregations are shown in figure 2.4. They were published in [23].
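The step from detail data to a summarized table can be sketched in a few lines of Java: detail rows (here already carrying their product group and year, for brevity) are rolled up to the product group, year level by summation. The data and names are invented for illustration:

```java
import java.util.*;
import java.util.stream.*;

// Sketch: building a summarized aggregation table (sales per product
// group and year) from detail sales rows. Illustrative names and data.
public class Aggregations {
    record Sale(String productGroup, String place, int year, double amount) {}

    // Pre-summarize detail data to the (product group, year) level.
    public static Map<String, Double> summarize(List<Sale> detail) {
        return detail.stream().collect(Collectors.groupingBy(
            s -> s.productGroup() + "/" + s.year(),
            Collectors.summingDouble(Sale::amount)));
    }

    public static void main(String[] args) {
        List<Sale> detail = List.of(
            new Sale("gadgets", "Prague", 1998, 120.0),
            new Sale("gadgets", "Berlin", 1998, 80.0),
            new Sale("widgets", "Prague", 1998, 50.0));
        // One summarized row per (product group, year) combination.
        System.out.println(summarize(detail));
    }
}
```

In a real warehouse this roll-up would of course be stored as its own aggregation table rather than computed on the fly.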

2.6 Design of a Data Warehouse

There exists a procedure describing how to create a data warehouse. This procedure covers problem analysis, data structure design, user interface design, etc. It can be described as the following steps:


Figure 2.4: Aggregations examples

• Definition and knowledge analysis of the problem.

This part can be marked as critical, because without a correct problem and knowledge analysis the products of all other steps are valueless.

• Data structures design.

This step can be critical for system performance and for future modifications.

• Platform choice.

This concerns storage platforms and presentation platforms. Nowadays there are many different types of platforms on the market.

• Data transformations.

The loading process is critical for the user, because without correct data loaded into the data warehouse one cannot obtain valuable information.

• Data warehouse filling.

The data warehouse has to be filled with data loaded from data sources.

• User interface design.


As already mentioned, the data warehouse and its tools have to have a user-friendly, effective and easy-to-use user interface, otherwise the value of the whole system drops quickly.

• Data security.

This is not so important for the system itself, but it is critical for most enterprises. Each user has a role in the system, and the role should define access to specific data.

• Testing.

The system of course has to be tested. Testing should run in two phases — the first phase is testing during implementation, the second is testing during deployment. Many comments can arise during the deployment phase, and these also have to be resolved.


Chapter 3

Metadata

This chapter describes what metadata is. How metadata can be derived using abstraction is discussed in section 3.1. Why a specification for metadata was created and how it was used in the past is covered in section 3.2 on page 17. The use of metadata in data warehouses is described in section 3.3 on page 19. The requirements and components of metadata were described in [15] and are discussed in section 3.4 on page 21 and in section 3.5 on page 23. The Four Layer Meta-data Architecture was introduced in [29] and is described in section 3.6 on page 25. The need for, and process of, metadata management is covered in section 3.7 on page 27, drawing on [19], where the problem of metadata management is described in detail. Tools used for metadata manipulation are mentioned in section 3.8 on page 28.

3.1 Abstraction

The founder of the General Semantics discipline, Alfred Korzybski, introduced the idea of a "chain of abstraction". For example: an actual chair is a totally concrete item. Any chair has a lot of characteristics: it is made of wood, it has a specific shape, it has two arms, etc. We can abstract from that chair, and others we encounter, to the word "chair". That is one step in the chain of abstraction, but of course we can take further steps. For example, we can describe all the words we know, including the word "chair", by the word "word". There is no limit: "word", "ideal", "hatred", etc. are all instances of "abstraction", and so on.

Data and metadata. Software is abstraction piled on abstraction. Metadata is an abstraction from data: it is high-level data that describes lower-level data. Software is full of metadata, for example:


• Record descriptions in a program.

• Logical view descriptions in a data server’s catalog.

• SQL statements.

• Entity-relationship diagrams in a CASE tool’s repository.

Metadata is the medium for transforming raw data into knowledge. For example, it is metadata in the shape of a field definition that tells us that a given stream of bits is a customer's address, part of a photographic image, or a code fragment in a given computer's machine language. If the metadata gets mixed up or out of alignment, the data no longer makes sense: the address will contain nonsense characters, the code will give unpredictable results and probably crash any process that runs it, and the photograph will not look quite the way it should.

Suppose that a large corporation is building a data warehouse and loading it with data and metadata from its databases. How likely is it that all the different sets of data and metadata will agree? In one such case, it turned out that there were many different ways of encoding the unique card-ID number allocated to each employee. If the corporation's office in France uses, for example, a 32-bit integer for the purpose, while the Australian office uses binary coded decimal, what are the chances that applications written to use these two representations can interpret each other's data? There is still more scope for confusion and misunderstanding in the interpretation of external data. Of course, information from one source may totally contradict another.
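The card-ID example can be made concrete: the byte patterns produced by the two encodings differ, so an application expecting one encoding misreads data written in the other. A minimal sketch (the ID value and both encoding choices are invented for illustration):

```java
// Sketch: the same employee card-ID encoded two ways -- as a 32-bit
// big-endian integer and as binary coded decimal (BCD, one decimal
// digit per nibble). The resulting byte patterns differ, so without
// metadata describing the encoding, the bytes are uninterpretable.
public class Encodings {
    public static byte[] asInt32(int id) {
        return new byte[] {
            (byte) (id >>> 24), (byte) (id >>> 16),
            (byte) (id >>> 8),  (byte) id };
    }

    public static byte[] asBcd(int id) {       // packs digits, two per byte
        String s = String.format("%08d", id);  // pad to eight digits
        byte[] out = new byte[4];
        for (int i = 0; i < 4; i++) {
            int hi = s.charAt(2 * i) - '0';
            int lo = s.charAt(2 * i + 1) - '0';
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        int id = 19980712;
        // int32: 0x01 0x30 0xE1 0xA8; BCD: 0x19 0x98 0x07 0x12
        System.out.println(java.util.Arrays.equals(asInt32(id), asBcd(id))); // false
    }
}
```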

Without metadata, the data is meaningless. We do not even know where it is, or how much of it to take. How could we query a database if we took away the data definitions? All we could hope to recover would be a string of ones and zeroes. In the early days of commercial computing, each application created and handled its own data files. Because the metadata was embedded in the application's data definitions, no application could make sense of another application's files. When databases were introduced, one of the greatest advantages they brought was that the metadata was stored in the database catalog, not in the individual programs. There was only one version of the "truth", and it was stored in the database.

When database manuals talk about "structured data", they are implying the presence of metadata. Relational databases can store information for which they have no corresponding predefined data types, but it is not a pretty sight. "Binary large objects" (BLOBs) are a case in point. These are used to store large, complex data items, such as maps, medical scans,


photographs, etc. Because relational databases lack the metadata to describe these items, the relational vendors call them "unstructured" data. Object-oriented databases can store these items and many others in a structured format, even including the code routines needed to make sense of them.

3.2 Historical approaches

When Information Technology was born, one of its main goals was to effectively manage the data assets of an organization. As systems have become more diverse, distributed, and complex, the management of data assets has become increasingly difficult, but nevertheless critical, to the corporate entity.

3.2.1 Data Administration: Data Dictionaries

In the early days of Information Technology, all data was defined and maintained inside the computer program itself. Programs were complete packages containing both logic and purpose (64K Assembler programs). There was no need to share data and information between systems and programs: transferring data between different physical computer systems was not possible without great effort, and when a given job needed to be done by multiple computer programs working together sequentially, the state of system development simply did not allow it.

At the end of the 1960s and the beginning of the 1970s, the technology improved to a level where it was possible to run multiple programs sequentially over a given data set in order to solve a business problem. For example, a batch run could use a data set such as a collection of checking account transactions and calculate a new balance. This required some coordination among the set of programs: how to use the system which was holding the transactions, and then how to access the information and data inside. At this stage of technology evolution, where each computer program was its own little environment, such ad hoc solutions could not keep up, and so the first data coordinators were developed: simple data "dictionaries". These were shared between computer programs looking for data to use in their logic processes. In this way, computer programs were able to get a data definition, together with its location, from a common data dictionary. These dictionaries were most likely managed by an IS organization that was centralized and strictly controlled.

As developers became more sophisticated over time, data dictionaries evolved to provide more than just data attribute descriptions. They were


also able to track which applications accessed which pieces of the database. This meant that managers who took advantage of the capabilities of the data dictionary, and did a good job of designing and populating it, found themselves in the enviable position of being able to maintain their systems more easily than their counterparts who did not. Such a centrally designed and maintained system, which holds the data definitions as well as CASE information about which applications use which pieces of the database, is sometimes called a repository.

3.2.2 Problems of Distributed Systems

As demand for more IS technology blossomed and the technology advanced with lower-cost mid-range and distributed systems, these systems found themselves as islands of automation ensconced in individual departments, working on specific business problems relevant only to the specific department. Data was defined in a decentralized manner, by the business unit, with no central arbiter, if it was defined at all. Worse yet, as new and different CASE tools came on the market, and as new and different architectures came into vogue, such as object-oriented databases and client/server architectures, different tools were used to define the data for different applications. In some cases, the same element was defined multiple times with slight variations, as new systems were used to create applications to help users solve various business problems.

Exchanging data between systems became risky, highly structured, and infrequent. Importing data from external systems and environments became a labor-intensive endeavor and was avoided if possible. However, in many cases it became necessary. For example, a call management system for customer support needs to get all of the customer information from the legacy databases. In order to do this successfully, elaborate programs needed to be written to "scrub" the data clean. With inadequate metadata, data incompatibilities could cause programs to fail, keeping programmers up nights debugging operational systems, as well as taking down the call management system itself.

3.2.3 Early Metadata

As in the case of any other warehouse that needs to keep an inventory of its holdings, early implementors of data warehouses found that they needed to keep track of what data the data warehouse was currently holding, together with its history. To do this, they conceived the idea of a metadata repository similar to data dictionaries. The idea was to provide users and technicians with information about the data: where the data come from, which rules and processes were used to obtain them, what each element signifies, how old the data are, and so on.

In early implementations, and even up to the recent past, many systems divided data into two groups: business (in other words, end-user-related) information, and technical (in other words, development-directory) information. These two groups were separated not only in their data definitions but also physically, so as not to confuse the user, because the technical data are mostly useless to the end user and would make the task of choosing usable, important data more difficult.

Nowadays it is generally accepted that a metadata component has to be designed in such a way that each person understands which data are in the data warehouse and what they mean.

3.3 Using Metadata

Let us now consider a small example, in which a business analyst searching a data warehouse finds three sets of data:

1. 739516 13238350 426615 800441 4912313

2. "An agency report dated 7/9/99 states that exports from Europe grew by 33 percent in '98."

3. "Leading gadget vendors: Peters Group 48 percent, Starford Goods 34 percent, Zimurgy Inc. 18 percent"

What can the analyst say about these data?

In the first case we can fairly say "absolutely nothing". Those numbers could be the revenues of a company, the population of a country, or the number of hairs on an individual's head; they could even be a sequence of machine code for some computer. There is no limit to what the numbers could signify. The only ways to assign any meaning to them are:

• From the context — if these data are the result of a concrete query against a concrete database and a concrete table, we usually know what the data mean.

• From metadata — if we are able to associate metadata with the data, we can get the name of the table the data come from, and possibly further information.
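The second approach can be pictured as pairing a raw value with descriptive metadata. The following is a minimal, hypothetical sketch (the class, field and table names are invented for illustration and are not part of SumatraTT):

```java
// Hypothetical sketch: attaching descriptive metadata to an otherwise
// meaningless raw value, so that a consumer can interpret it.
public class AnnotatedValue {
    private final long value;          // the raw number, e.g. 739516
    private final String sourceTable;  // which table the value comes from
    private final String meaning;      // what the value signifies
    private final String unit;         // unit of measurement, if any

    public AnnotatedValue(long value, String sourceTable,
                          String meaning, String unit) {
        this.value = value;
        this.sourceTable = sourceTable;
        this.meaning = meaning;
        this.unit = unit;
    }

    @Override
    public String toString() {
        return value + " (" + meaning + " in " + unit
                + ", from table " + sourceTable + ")";
    }

    public static void main(String[] args) {
        AnnotatedValue v = new AnnotatedValue(
                739516, "SALES_1999", "quarterly revenue", "USD");
        System.out.println(v);
    }
}
```

With such an annotation, the first data set in the example above would carry its own interpretation instead of relying on the context of the query that produced it.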


In the second case we have plain text, which is more or less self-describing, but there is still at least one small problem: the date. We do not know where the data come from, and hence whether the date is written using the British convention, meaning 7th September, or the American convention, meaning 9th July.

The third case already contains some metadata, but not enough. The specified gadgets have a precise meaning for somebody who works in the gadget industry, but we do not know whether we are talking about the American, European or world market, what time period was used to collect the data, or even how the data were collected. Under the circumstances, this example should be filed in the data warehouse with a metadata notice that says: source unknown, reliability unknown, no further details.

In the operational world, metadata are valuable mainly to developers and data administrators. Operational databases are accessed only by transaction processing applications, which contain the data definitions embedded inside them. End users, people like bank clerks, travel agents and hospital staff, do not need to know how data are stored and held in the database. They simply interact with the forms, dialogs and other screens provided by the applications they use in their everyday work.

The decision support environment is very different. Here, data analysts and executives are trying to find useful facts and correlations; they are able to recognize them when they find them, and often not before. Routine applications are of no use to them: they need to get in among the data, and to do so successfully, they need to understand its structure and meaning. Passengers on a train do not need a map, although a timetable may prove useful. A driver setting out by road across a foreign country to an unknown village, though, would be very unwise to set out without a complete set of large-scale and small-scale maps.

Generally speaking, there are three main layers of metadata in a data warehouse:

Application-level metadata. This defines the structure of the data held in operational databases and used by operational applications. It is often quite complex, and tends to be application- or department-oriented.

Core warehouse metadata. This is held in the catalogue of the warehouse's database system. It is distinguished by being "subject oriented": in other words, it is based on abstractions of real-world entities like 'customer', 'project' or 'organization'. It defines the way in which the transformed data is to be interpreted, as well as any additional views that may have been created. This includes decision support aggregates and computed fields, as well as cross-subject warehouse views and definitions.

User-level metadata. This maps the core warehouse metadata to business concepts that are familiar and useful to end users.
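The three layers above can be pictured as successive name mappings, from a business concept down to the operational structure that holds the data. The sketch below is hypothetical; the concept and table names are invented for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: resolving a business concept through the three
// metadata layers down to the operational table that holds the data.
public class MetadataLayers {
    // User-level metadata: business concept -> core warehouse entity.
    static final Map<String, String> USER_LEVEL = new HashMap<>();
    // Core warehouse metadata: subject-oriented entity -> application-level name.
    static final Map<String, String> CORE_LEVEL = new HashMap<>();

    static {
        USER_LEVEL.put("Customer revenue", "CUSTOMER_REVENUE_FACT");
        CORE_LEVEL.put("CUSTOMER_REVENUE_FACT", "REV04_TBL");
    }

    static String resolve(String businessConcept) {
        String coreEntity = USER_LEVEL.get(businessConcept);
        return coreEntity == null ? null : CORE_LEVEL.get(coreEntity);
    }

    public static void main(String[] args) {
        System.out.println(resolve("Customer revenue")); // prints REV04_TBL
    }
}
```

An end user asks for "Customer revenue"; the user-level layer maps it to a subject-oriented warehouse entity, and the core layer maps that to the cryptic application-level name.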

3.4 Metadata Requirements

Unreliable or missing metadata lead to familiar situations in which one department reports to management that the corporation's revenues have grown by 10 percent, while another department of the same corporation reports that the corporation's revenues are down by 15 percent. Both departments use their own figures, collected using their own methods and presented using their own applications. The discrepancy produced in this way is sometimes called an 'algorithmic differential'. When hard facts are missing, decision making descends into a morass of politics and personalities.

In a data warehouse, not only the data but also the metadata reflect change. Business users are continually looking for interesting new patterns, which can lead them to compare information from all corners of the whole enterprise. It sometimes also leads them to keep asking for new sets of data, which have to be replicated from operational systems or imported from outside sources. The warehouse's metadata map has to be extended to embrace each new addition. Besides, many business entities, such as product lines, organizational structures, markets, and plans, change regularly. Consider the example of a business analyst asked to prepare a report for management. The analyst works hard and delivers the report, complete with impressive graphics, within 24 hours. Management congratulates him, and asks for the same information for the period five years ago. After a lot more work, this too is forthcoming, but management is displeased. After a brief glance at the two reports, the manager complains that the IS department can never get anything right: these figures are obviously wrong. How can this be? Factors to evaluate are:

• Different sources of data.

• Different sales territories (perhaps with the same names).

• A different definition of the product line.

• Tax has been added in, or removed, from the figures.


• Mergers leading to sudden jumps in revenue, profit, and many other metrics.

Once these factors have been brought to his notice, a reasonable manager would admit that the analyst had done another good job. He would never have thought otherwise had the various changes in the basis of measurement been noted on the report.

There are at least two contrasting kinds of metadata:

• Classical. This consists of formal definitions, such as a COBOL layout or a database schema.

• "Mushy" or "business". This consists of information in the enterprise that is not in classical form, for example the organization chart or historical pricing information. (Most businesses do not formally retain pricing information from previous years.) To look at it another way, there is no meta2data (short for "metametadata") for it: no formally prescribed process for recording the metadata.

A new analyst approaching a data warehouse wants to know:

1. What tables, attributes and keys are stored in the data warehouse?

2. Where do the data in the data warehouse come from?

3. What kind of transformation logic was used to load the data into the data warehouse?

4. How has the data changed over time?

5. What aliases are used inside the data warehouse and what are their relationships?

6. What cross-references exist between business and technical data?

7. How often does the data get reloaded?

8. How much data is there? This helps the end user avoid submitting unrealistic queries. Given some means of determining the size of tables, staff can tell the end users: "You can do anything you like with 15,000 rows, but if it turns out to be 15 million rows — back off and ask for help!"


3.5 Metadata Components

Warehouse metadata is not very different from ordinary database metadata, although it is versioned in order to support historical analysis. The following is a breakdown of warehouse metadata:

MAPPING — The mapping information records how the data is transformed from operational sources to the data warehouse. The typical content is:

• Identification of source fields.

• Simple attribute-to-attribute mapping.

• Attribute conversions.

• Physical characteristic conversions.

• Encoding/reference table conversions.

• Naming changes.

• Key changes.

• Logic to choose between multiple sources.

• Algorithmic changes.
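A simple attribute-to-attribute mapping with a conversion could be recorded and applied as follows. This is a hypothetical sketch; the class, attribute and code names are invented for illustration and do not come from any product:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of mapping metadata: each target attribute records
// its source attribute and the conversion applied during the transfer.
public class AttributeMapping {
    private final Map<String, String> sourceOf = new HashMap<>();
    private final Map<String, Function<String, String>> conversionOf = new HashMap<>();

    public void map(String target, String source,
                    Function<String, String> conversion) {
        sourceOf.put(target, source);
        conversionOf.put(target, conversion);
    }

    // Apply the recorded mapping to one source record, producing a
    // warehouse record.
    public Map<String, String> transform(Map<String, String> sourceRecord) {
        Map<String, String> result = new HashMap<>();
        for (String target : sourceOf.keySet()) {
            String raw = sourceRecord.get(sourceOf.get(target));
            result.put(target, conversionOf.get(target).apply(raw));
        }
        return result;
    }

    public static void main(String[] args) {
        AttributeMapping m = new AttributeMapping();
        // Naming change only: CUST_NM -> customer_name.
        m.map("customer_name", "CUST_NM", s -> s);
        // Encoding conversion: "M"/"F" codes -> full words.
        m.map("gender", "SEX_CD", s -> s.equals("M") ? "male" : "female");

        Map<String, String> src = new HashMap<>();
        src.put("CUST_NM", "Peters Group");
        src.put("SEX_CD", "M");
        System.out.println(m.transform(src));
    }
}
```

Because the mapping itself is data, it can be stored as metadata and consulted later to answer the question "where did this warehouse attribute come from, and how was it derived?"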

EXTRACT HISTORY — Whenever historical information is analyzed, precise update records have to be kept. The metadata history is a good place to start whenever a time-based report has to be created, because an analyst has to know when rules changed in order to apply the correct rules to the relevant data. For example, if sales territories were remapped in 1998, it is necessary to note that results from before this date may not be directly comparable with more recent results.

SUMMARIZATION ALGORITHMS — A typical data warehouse contains lightly and heavily summarized data as well as fully detailed data. Summarization algorithms are of course of interest to anyone who is responsible for interpreting the meaning of summaries. These metadata can also save time by making it easier to decide which level of summarization is most suitable for a given purpose.

OWNERSHIP/STEWARDSHIP — Operational databases are usually owned by a single department or business group. The nature of a corporate data warehouse is that all data are stored in a common format and accessible to all. That is why it is necessary to identify the originator of each set of data, so that changes and corrections can be made for a concrete group. It is necessary to distinguish between "ownership" of data in the operational environment and "stewardship" in the data warehouse environment.

ACCESS PATTERNS — It is desirable to record patterns of access to the data warehouse, in order to optimize and tune performance. Less frequently used data can be moved to cheaper storage, while different methods can be used to accelerate access to the data that is most requested. Many databases do a good job of hiding such physical details, but specialized performance analysis tools are usually available.

MISCELLANEOUS — • Aliases can make the data warehouse more user-friendly, by allowing a table to be queried as "Widgets produced by each factory" rather than "MF-STATS". They also become useful when different departments want to use their own names to refer to the same underlying data. Of course, when aliases are not chosen carefully, they can also cause considerable confusion.

• It sometimes happens that different parts of a data warehouse are in different stages of development. Status flags can be used to keep information about the state of a concrete data warehouse part. For example, tables can be marked as "in design", "in test", "inactive", "active", etc.

• Volumetric information lets the user know how much data he is dealing with. From this information the price of a concrete query, in terms of time and resources, can be deduced. Volumetrics can include useful information such as the number of rows, growth rate, indexing, etc.

• It is also useful to publish the criteria and timescales for purging old data.

REFERENCE TABLES/ENCODED DATA — Reference data is special data that is stored in an external table and contains commonly used translations of values. The contents of these tables must be stored, together with "effective from" and "effective to" dates, in order to guarantee the ability to recover the original unencoded data.
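A reference table entry with effective dates might look like the following minimal sketch. The class, codes and date ranges are hypothetical, invented purely to illustrate how a translation can be recovered as it was understood on a given date:

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a reference table that keeps each encoded value,
// its translation, and the dates between which the translation was valid,
// so the original unencoded data can always be recovered.
public class ReferenceTable {
    private static class Entry {
        final String code, translation;
        final LocalDate from, to;
        Entry(String code, String translation, LocalDate from, LocalDate to) {
            this.code = code;
            this.translation = translation;
            this.from = from;
            this.to = to;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void add(String code, String translation,
                    LocalDate from, LocalDate to) {
        entries.add(new Entry(code, translation, from, to));
    }

    // Translate a code as it was understood on a given date.
    public String translate(String code, LocalDate asOf) {
        for (Entry e : entries) {
            if (e.code.equals(code)
                    && !asOf.isBefore(e.from)
                    && !asOf.isAfter(e.to)) {
                return e.translation;
            }
        }
        return null; // no valid translation on that date
    }

    public static void main(String[] args) {
        ReferenceTable t = new ReferenceTable();
        t.add("EU", "Europe region (old territories)",
              LocalDate.of(1990, 1, 1), LocalDate.of(1997, 12, 31));
        t.add("EU", "Europe region (remapped territories)",
              LocalDate.of(1998, 1, 1), LocalDate.of(2099, 12, 31));
        System.out.println(t.translate("EU", LocalDate.of(1996, 6, 1)));
    }
}
```

The effective dates are what make historical queries safe: the same code can be decoded differently depending on when the underlying data was recorded.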

ELEMENT AND HISTORY RELATIONSHIP — Data warehouses implement relationships in a different way from production databases. Metadata pertaining to related tables, constraints, and cardinality will be maintained, together with text descriptions and ownership records.


This information and the history of changes to it can be useful to analysts.

DATA MODEL — DESIGN REFERENCE — Building a data warehouse without first constructing a data model is very difficult. When a data model is used, metadata describing the mapping between the data model and the physical design should be stored. This allows any ambiguities or uncertainties to be resolved.

3.6 Four Layer Meta-data Architectures

The traditional framework for meta-modeling is based on an architecture with four layers. These layers are conventionally described as follows:

• The user object layer is comprised of the information that we wish to describe. This information is typically referred to as "data".

• The model layer is comprised of the meta-data that describes information. Meta-data is informally aggregated as models.

• The meta-model layer is comprised of the descriptions (i.e., meta-meta-data) that define the structure and semantics of meta-data. Meta-meta-data is informally aggregated as meta-models. A meta-model can also be thought of as a "language" for describing different kinds of data.

• The meta-meta-model layer is comprised of the description of the structure and semantics of meta-meta-data. In other words, it is the "language" for defining different kinds of meta-data. The traditional framework is illustrated in Figure 3.1.

This particular example shows how the meta-data for simple records (i.e., "StockQuote" instances) might be represented. The layers are populated as follows:

• The information layer includes some illustrative StockQuote instances.

• The model level includes the meta-data that represents the record type for "StockQuote" instances. The record type has a name ("StockQuote") and two fields, each of which also has a name and a type. This type will typically be part of some larger-scale data schema (not shown here).


meta-meta-model:  Hard-wired meta-meta-model

meta-model:       MetaClass ( "Record", [ MetaAttr ( "name", String ),
                                          MetaAttr ( "fields", List<"Field"> ) ] )
                  MetaClass ( "Field", ... )

model:            Record ( "StockQuote", [ Field ( "company", String ),
                                           Field ( "price", FixedPoint ) ] )

information:      StockQuote ( "Sunbeam Harvesters", 98.77 )
                  StockQuote ( "Ace Taxi Cab Ltd", 12.32 )

Figure 3.1: Four Layer Meta-data Architectures

• The meta-model level defines what it means to be a record type. The meta-Class for Record is shown as having two meta-Attributes, the first defining the Record's name, and the second defining its fields. The meta-Class for a Field (not shown in full) would similarly define the meta-Attributes for the field name and type.

• The meta-meta-model level is typically hard-wired, and defines the machinery that supports the meta-data framework's meta-modeling constructs, e.g. meta-Classes and meta-Attributes.

While the diagram above shows only one model and one meta-model, the primary aim of having four meta-layers is to support multiple models and meta-models. Just as the model that defines the "StockQuote" type describes many StockQuote instances at the information level, the meta-model that defines "Record" and "Field" can describe many record types at the model level. Similarly, the meta-meta-model level can describe many other meta-models that in turn represent other kinds of meta-data. The four-layer meta-data architecture has a number of advantages:

• Assuming that the meta-meta-model is rich enough, it can support most if not all kinds of meta-information imaginable.

• It potentially allows different kinds of meta-data to be related. (This depends on the design of the framework's meta-meta-model.)

• It potentially allows interchange of both meta-data (models) and meta-meta-data (meta-models). (This presupposes that the parties to the exchange are using the same meta-meta-model.)
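The StockQuote example of Figure 3.1 can be sketched in code: a hard-wired meta-meta-model supplies MetaClass and MetaAttr, a meta-model instance defines what a record type is, a model instance defines the "StockQuote" type, and the information layer holds the StockQuote instances themselves. This is a minimal, hypothetical sketch, not an implementation of any particular framework:

```java
import java.util.Arrays;
import java.util.List;

public class FourLayers {
    // Meta-meta-model layer (hard-wired): machinery for describing meta-data.
    static class MetaAttr {
        final String name, type;
        MetaAttr(String name, String type) { this.name = name; this.type = type; }
    }
    static class MetaClass {
        final String name;
        final List<MetaAttr> attrs;
        MetaClass(String name, List<MetaAttr> attrs) { this.name = name; this.attrs = attrs; }
    }

    // Constructs used at the model layer: fields and record types.
    static class Field {
        final String name, type;
        Field(String name, String type) { this.name = name; this.type = type; }
    }
    static class RecordType {
        final String name;
        final List<Field> fields;
        RecordType(String name, List<Field> fields) { this.name = name; this.fields = fields; }
    }

    public static void main(String[] args) {
        // Meta-model layer: what it means to be a record type.
        MetaClass record = new MetaClass("Record", Arrays.asList(
                new MetaAttr("name", "String"),
                new MetaAttr("fields", "List<Field>")));

        // Model layer: the "StockQuote" type itself.
        RecordType stockQuote = new RecordType("StockQuote", Arrays.asList(
                new Field("company", "String"),
                new Field("price", "FixedPoint")));

        // Information (user object) layer: the StockQuote instances.
        Object[][] quotes = {
                {"Sunbeam Harvesters", 98.77},
                {"Ace Taxi Cab Ltd", 12.32}};

        System.out.println(record.name + " describes type " + stockQuote.name
                + " with " + quotes.length + " instances");
    }
}
```

The same MetaClass machinery could describe many other record types, just as the "StockQuote" type describes many instances, which is exactly the multiplicity the four layers are designed to support.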


3.7 Metadata Management

The awareness that metadata has to be managed has grown hand in hand with the growth of data warehousing. As the number and heterogeneity of data warehousing implementations began to grow, IT managers and end users began to perceive that a data warehouse is only as useful as the quality, accuracy, and ease of use of the data inside it.

3.7.1 Metadata and the growth of data warehousing

Despite all the growth mentioned in section 2.1, the problem of metadata management remained dispersed. Extraction tools, loading tools, cleansing tools and analysis tools all claimed to have a piece of the metadata problem solved. In fact, until recently, there has been little progress in terms of a solution to the integrated metadata issue.

However, corporations today are still calling for such a solution. Users need to know what they are looking at if they are to make intelligent decisions based on the information they have received from the data warehouse. The system cannot leave it up to the user to guess the business rules embedded in a calculated data element, because different assumptions will lead to different actions that may cause dangerous conflicts. Today, in most cases, metadata is spread across the different components of a data warehouse: from the scheduler, to the data extraction/cleansing tools, which claim to build metadata as they extract and cleanse, to the loading tools, to the OLAP tools, which need to present metadata to users in order to navigate. Business rules are separated from technical metadata, as they should be, but are kept by different systems in different formats with different user interfaces. In the case of multiple data sets spread across the enterprise, this situation is multiplied by the number of marts. And if a user in department A needs to use a data set created by department B, then in most cases that user has to relearn the metadata navigation for that system. Clearly, users would like to go to one place and be able to see either the business metadata or the technical metadata in a single system with a single user interface and a single screen metaphor, for any and all data residing in the enterprise.

3.7.2 Enterprise view of Metadata Management

Metadata is a key resource of the warehouse during all phases of its life cycle, from warehouse construction, through user access, to the maintenance and update of the data inside.


Metadata is collected and/or generated in a variety of places in the data warehousing architecture: from data extraction, from data manipulation and application specifics, and from query engines such as OLAP. Today, each of these areas has a number of vendors who offer products, and each vendor has a slightly different approach or paradigm to their metadata solution. There are several possible approaches, and some vendors are propagating the concept of a single enterprise-level metadata repository to integrate the enterprise's disparate metadata Tower of Babel. The simplest approach is to have all vendors utilize the same semantics, paradigm, etc., and collect all of the metadata in a single format in a single location.

3.8 Tools for Metadata

Many well-known software tools have long been used for generating metadata in other contexts. These include CASE tools, and to a limited extent compilers and 4GLs too. More recently, vendors like Prism, Carleton and ETI have designed products specifically to create data warehouse metadata. Specialists have been declaring that their repositories can play a valuable role in the warehouse environment. These repositories have ample 'power reserves' to store metadata from the underlying operational systems, the business rules used for transformation, and end-user views. Repository browsers such as the Reltech Data Shopper, IBM's Data Guide, and IBI's EDA/SAF help administrators and users to browse the repository's structure and explore its resources.

Turning to consumers of metadata, the first obvious category consists of data transfer tools that load data from operational systems into the warehouse.

There are many of these; some of the best known are Prism, Carleton, ETI, Infopump, and IBI's Copy Manager. Different tools store metadata in their own ways. The simplest is to keep it in the data warehouse's own catalog, but some transformation products keep metadata in a separate database while they are working on it.

Tools such as Focus and Software AG's Esperant provide a very useful extra layer of metadata between the end user and the raw database catalogues. The underlying truth is that a data warehouse is only as good as its facilities for handling metadata. Every time data moves from one place to another, it is vital that it is correctly and appropriately labelled.

Such essential activities as filtering, cleaning, summarizing and consolidation all involve mapping from one metadata representation to another. If there is to be any hope of setting up a fast and reliable system of bulk replication from the operational systems to the data warehouse, it follows that metadata checking and mapping must be carried out automatically.


Chapter 4

Metadata Repository

The scenarios mentioned above in the metadata chapter are the primary drivers behind the concept of a single metadata repository. A repository is the vehicle of metadata: a place where information (metadata) about an organization's information system components (objects, table definitions, fields, business rules and so on) is held. A repository also contains tools for the manipulation and query of the metadata.

A repository has a number of potential applications within an enterprise schema that deliver value beyond the domain of a data warehouse alone. For example, a repository can:

• help in the integration of the views of different systems by helping to understand how the data used by those systems are related;

• support rapid change and assist in building systems quickly through impact analysis and the provision of standardized data;

• make reuse easier by using object concepts and central accessibility;

• assist in the implementation of data warehousing (a central repository can be built in advance of the warehouse, purely for data and application integration purposes, and then be ready to support a warehouse implementation; alternatively, if the repository is built in support of the initial warehousing effort, it can be of enormous value in deploying subsequent efforts);

• support software development teams.

One of the primary benefits of a repository is that it provides consistency of key data structures and business rules, which makes it easier to put warehousing efforts together across the enterprise.


The repository also leverages an organization's investment in existing legacy systems by documenting program information for future application development.

The problem of using a metadata repository in data warehouses is described in [19] and [20].

4.1 Using a Repository as a Metadata Integration Platform

Ideally, a corporation should adopt a repository as a metadata integration platform, making metadata available across the organization. This would serve to manage key metadata across all of the data warehouse and data set implementations in an organization, and would allow everyone to share common data structures, business rule definitions, and data definitions from system to system across the enterprise.

The platform would be able to accept and manage information from multiple sources. These would include systems from major vendors' technology databases (e.g. IBM, Informix, Oracle, Microsoft, Sybase) and tools, from extraction tools to analysis tools. On the output side, the system should provide open access by multiple tools as well as APIs for custom needs.

The metadata repository also facilitates consistency and maintainability. It provides a common understanding across warehouse efforts, promoting sharing and reuse. If a new data element definition is required, the platform should permit versioning to support the need. With a shared metadata repository, the exchange of key information between business decision makers becomes more feasible. And when multiple data sets and data warehouses are involved, a central metadata platform will simplify and reduce the effort required to maintain them.

4.2 Repository Lifecycle

Repository systems need to contribute to and integrate with the existing legacy system environment and play an active role throughout the life cycle of data warehousing systems to be truly considered enterprise metadata repositories.

Documenting database and legacy information is an important capability of metadata repositories. Legacy models provide the information sourcing, data inventorying, and design that are key to developing an effective data warehouse. The metadata surrounding the acquisition, access, and distribution of warehouse data is the key to providing the business user with a complete map of the data warehouse.

The repository should play an active role in the entire life cycle of the data warehouse and all the output attributes of system and business value. This includes existing legacy systems as sources, third-party tools, etc. This then leverages the repository's role so that it contributes in the development phases as well as to the bulk cost of all IS systems (the downstream support and maintenance costs). These include the systems management, database management, business intelligence, and application development tools and components listed below.

• Systems management tools that can be used to manage jobs, improve performance, and automate operations, not only in operational systems but also in data warehouse systems.

• Database management tools that can help create and maintain the database management systems for data warehouses, data sets and operational systems.

• Data movement tools that transform and integrate disparate data types and move data reliably to the warehouse.

• Business intelligence tools that provide end-user access and analysis for making business decisions.

• Business applications that provide packaged warehouse solutions for specific markets.

• Data warehouse consulting that uses a methodology based on the experiences of hundreds of other companies, thereby reducing the risk associated with making uninformed business decisions.

• Application development solutions that help build, test, deploy, and manage operational and warehouse applications throughout the enterprise.

• CASE tools support that provides consistency and maintainability immediately, by developing consistent terminology and structures.

• Repository-to-CASE interfaces that enable an organization to manage multiple CASE workstations from the repository. These tools are designed to allow an organization to better utilize the data maintained in its CASE workstations by providing a central point of control and storage.


• Sophisticated version control, collision management, and bi-directional interfaces, enabling the sharing and reuse of metadata among programmers and analysts working independently.

4.3 Database Management System

A repository should ideally use a standard relational Database Management System (DBMS), which provides significant advantages over vendor-developed DBMSs. These advantages include advanced tools and utilities for database management (such as backups and performance tuning) as well as dramatically enhanced reporting capabilities. Furthermore, maintainability and accessibility are enhanced by an "open" system.

Using a standard database also allows the repository vendor to focus on the quality of the repository, not the features of the database management system. In addition, it allows the vendor to take advantage of new features made available by the DBMS vendor.

4.4 Fully Extensible Meta Model

A repository should be a complete, self-defining, extensible repository based on a common entity/relationship diagram. By using a model that reflects standards, it can enable users to easily customize the meta model to meet their specific needs. The repository should support the following meta model extensions:

• adding custom command macros

• adding or modifying commands or user exits

• adding or modifying help and informational messages

• modifying the list of allowable values for an attribute type

• adding or modifying an entity type

• adding user views to entities or relationships

• adding, deleting, or modifying attributes of relationships or entities

• adding or modifying associations or relationships between entity types


4.5 Central Point of Metadata Control

The repository serves as a central point of control for data, providing a single place of record about information assets across the enterprise. It documents where the data is located, who created and maintains the data, what application processes it drives, what relationships it has with other data, and how it should be translated and transformed. This provides users with the ability to locate and utilize data that was previously inaccessible. Furthermore, a central location for the control of metadata ensures consistency and accuracy of information, providing users with repeatable, reliable results and organizations with a competitive advantage.

4.6 Impact Analysis Capability

If the repository has an impact analysis facility, it can provide virtually unlimited navigation of the repository definitions to reveal the total impact of any change. Using impact analysis views, users can easily determine where any entity is used or what it relates to.

An impact analysis facility answers the real questions in the analysis phases without forcing a user to sift through large quantities of unfocused information. Furthermore, sophisticated impact analysis capabilities allow better time estimates for system maintenance tasks. They also reduce the amount of rework resulting from faulty impact analysis (e.g., a program not being changed as a result of a change to a table that it queries).

4.7 Naming Standards Flexibility

A repository should provide a detailed map of data definitions and elements, thereby making it possible to evaluate redundant definitions and elements and decide which ones should be eliminated, translated, or converted. By enforcing naming standards, the repository helps reduce data redundancy and increase data sharing, making the application development process more efficient and cheaper. In addition, an easily enforceable standard encourages organizations to define and use consistent data definitions, thereby increasing the reuse of standard definitions across disparate tools.


Page 37: Data Transformation Interactive Design

4.8 Versioning Capabilities

In repository discussions, “versioning” can have many different definitions. For example, some version control capabilities are:

• version control as in test vs. production (lifecycle phasing);

• versions as unique occurrences;

• versioning by department or business unit;

• version by aggregate or workstation ID.

The repository’s versioning capabilities facilitate the application lifecycle development process by allowing developers to work with the same object concurrently. Developers should be able to modify or change objects to meet their requirements without affecting other developers.

4.9 Query and Reporting

The repository should provide business users with a vehicle for robust query and report generation. The end-user tool should seamlessly pass queries to its own engine or to third-party products for automatic query generation and execution. Furthermore, business users should be able to create detailed reports with these tools, increasing the amount of valuable decision support information they are able to obtain from the repository.

4.10 Data Warehousing Support

The repository provides information about the location and nature of operational data, which is critical in the construction of a data warehouse. It acts as a guide to the warehouse data, storing the information necessary to define the migration environment, mappings of sources to targets, translation requirements, business rules, and selection criteria to build the warehouse.


Chapter 5

SumatraTT project

This chapter is the main chapter of the whole diploma work. It presents a small part of the SumatraTT project, which is in progress at the Czech Technical University, Department of Cybernetics. There are many technical reports about the project itself: [4], [5], [6], [3], [1], [34], so I will mention here only the core of the SumatraTT project that is necessary for understanding the rest of the work.

What is SumatraTT good for? In chapter 2 on page 6 we talked about the data warehouse and its parts. One of the parts mentioned in section 2.3 on page 8 is the data pump. As already mentioned, the data pump is a special part of the system which loads data from available OLTP sources into the data warehouse structure. During the loading process, the data can be transformed or modified. Even this simple description shows that the data pump is a very important part of the system. There are many tools on the commercial market, e.g. Microsoft DTS or Oracle Warehouse Builder, but their cost is very high, and not only because they are often part of a complex set of tools. SumatraTT is an experimental data pump where data transformation and loading are directed by metadata.

SumatraTT is based on specialized technical reports, books and specifications: [25], [28], [6], [32], [3]. Because SumatraTT adopts many of the ideas mentioned in these specifications and technical reports, the system allows transformations and data modifications which are not possible in other products. Because the transformation and loading process is directed by metadata, the system is very permissive in how transformations can be defined.

The next section describes briefly how SumatraTT works, and section 5.2 on the next page describes the new graphical user interface, which should simplify working with SumatraTT.


5.1 SumatraTT Overview

“The SumatraTT system is a complex data transformation tool. It is meant to be simple to use, effective, reusable and as extensible as possible. Some other features are added by accompanying tools, like wizards and graphical user interface.

The system is easily applicable in system integration: for data transfor-mation on input of data warehouse and load into it, data pre-processing fordata mining.”[4]

The SumatraTT system has some important features. The most important feature is that the system uses metadata; not only does it use them, but the metadata control everything done in the system. The tool works with two basic elements:

DataSource – a set of records (rows), consisting of fields (columns).

Transformation – a task in SumatraTT to be processed.

Metadata is used for both of these elements.

The second very important feature is the SumatraScript language. The script is used in all transformation templates, which determine the global behavior of a concrete transformation. A template can be modified by the user in order to provide suitable transformation functionality.

Metadata are stored in a metadata repository in an XML [30] format. SumatraScript works directly with the repository, using methods for accessing it.
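The concrete element and attribute names of the metadata format are defined by the SumatraTT project itself and are not reproduced in this work; purely as an illustration (every name below is hypothetical), a DataSource description in the repository might look like:

```xml
<!-- Hypothetical illustration only: the real element and attribute names
     are given by the SumatraTT metadata format, not shown here. -->
<datasource name="customers" type="DelimitedText">
  <attribute name="separator" value=";"/>
  <field name="id"   type="integer"/>
  <field name="name" type="string"/>
</datasource>
```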

A schema of SumatraTT, including the GUI, is shown in figure 5.1 on the following page.

The tool was developed as a command-line application, but the pressure from its users for a graphical user interface was so strong that it became necessary to develop one. The following pages and sections of this chapter describe the development of the graphical user interface for SumatraTT. The graphical user interface uses the SumatraTT core to run the physical transformation, but the design of transformation templates is done directly in the graphical user interface.

5.2 SumatraTT GUI

This section describes how the IDE was implemented, what kinds of objects the IDE deals with, what techniques and technologies were used for the implementation, etc.


Figure 5.1: Schema of SumatraTT (GUI, wizards, code generator, templates, SumatraScript code, interpreter with declarative and processing parts, metadata repository holding data source and transformation metadata, and drivers: JDBC, delimited text, SQL, external WEKA, Prolog)


5.2.1 The Java(TM) programming language

“Like any human language, Java provides a way to express concepts. If successful, this medium of expression will be significantly easier and more flexible than the alternatives as problems grow larger and more complex.” [10]

The Java(TM) [18][12] programming language is an object-oriented programming language with many features that make it popular nowadays. The main reason why I have chosen the Java(TM) programming language for the IDE implementation is its platform independence. Other languages provide the same feature, e.g. ANSI C or Smalltalk, but a closer look reveals, for each of them, a reason not to choose it.

ANSI C was created as a standard. Unfortunately, many companies have added libraries which are not in compliance with the standard defined when the language was born. If we take all non-standard libraries out of the distribution in order to comply with the standard, we get a truly platform-independent language, but with very limited functionality.

Smalltalk was also created as a standard and remains one. It is a truly platform-independent programming language and provides many usable features. Unfortunately, Smalltalk was created for large commercial programming, e.g. bank information systems, and so it does not respect limited resources such as memory capacity.

Java(TM) was also created as a standard, and the standard is kept strictly. It has higher resource requirements than ANSI C but lower than Smalltalk, and it also provides many usable features for application development. Platform independence in Java(TM) is achieved through the virtual machine, sometimes also called an interpreter, whose implementation differs for each platform. When Java(TM) code is compiled, byte code is created, which can then be executed by the interpreter.

Platform independence was the main but not the only requirement. Another requirement was that the language has to provide a standard, comfortable graphical user interface and accessibility. In the Java(TM) language this requirement is met by its graphical library known as Swing. Finally, the language has to provide a way to extend the application after it is finished. In the Java(TM) language this requirement is met by its packaging facility known as the Java ARchive (JAR) [17].

5.2.2 IDE’s objects interactivity model

This section describes which classes and objects are in the IDE and what their responsibilities are; at the end of the section there is a graph of object interactivity.

There is a number of objects in the IDE; some of them are in a separate class, others are implemented as an inner class of another existing class. I will describe here all the important classes in the part of the IDE implemented by myself. All classes are separated into packages in order to keep the logical structure of the application source code. Knowing what each object does and what its responsibility is, is very important for understanding the next sections of this chapter.

Package SumatraGUI\GUI

frmMain – This object represents the main frame of the application. It is not only a frame: all its components communicate through this object. When this object is initialized, the available transformations and wizards are loaded and all configuration files are loaded into memory. Each object can ask frmMain for configuration information, for information about another component, etc.

treePanel – frmMain can be divided into individual sections1. The treePanel can be found on the left side of frmMain. This panel provides functions connected with the DataSources loaded into the project. The panel allows the user to add a DataSource from a file, create a new DataSource, edit a DataSource’s properties and add a DataSource to a Workspace, and it provides methods for saving and loading information about the DataSource status of the opened project.

AboutBox – This object is a simple dialog displaying static information about the SumatraTT GUI.

DataSourceNode – The treePanel displays the loaded DataSources as icons symbolizing a concrete DataSource type, but behind the graphical representation is a DataSourceNode object. The DataSourceNode object represents a DataSource in the treePanel in order to separate the graphical function from the DataSource’s properties and methods. The DataSourceNode has direct access to the DataSource’s properties, so the treePanel can address a concrete DataSource through its graphical delegate – the DataSourceNode.

FieldNode – The treePanel also provides functionality to display the Fields inside a DataSource. These fields are likewise represented as a small picture symbolizing a concrete Field type but, as in the case of the DataSourceNode, behind the picture exists an object delegating all methods and properties of a real Field – the FieldNode.

1frmMain sections and other visual components are described in section 5.2.3 on page 46

TypesReader – As described in section 5.2.10 on page 68, it is possible to change the basic configuration of the IDE: the visual representation of all DataSources and Fields, and also their properties and attributes. The configuration is stored in a file and is loaded into memory during the IDE’s start by the TypesReader.

TypesWriter – When the IDE’s configuration tools, discussed in detail in section 5.2.10 on page 68, are used, it is necessary to rewrite the configuration files when configuring finishes. The creation of the new configuration file is done by the TypesWriter object.

Package SumatraGUI\Classes

DataSource – This is a very important object which contains all the necessary information about the physical DataSource. This information is read from an XML file which contains Metadata about the physical DataSource. The DataSource object contains information about the name, type, fields and attributes. It also provides some special functions, e.g. for obtaining an AST2 description.

DataSourceType – This object is in fact a part of the DataSource object, because it contains information about the DataSource type and the attributes which are unique for the concrete DataSource type. The available DataSource types and their unique Attributes are loaded as a part of the IDE’s configuration.

Field – As mentioned, a DataSource can contain fields. These fields have a name and a type. A physical field is represented in the DataSource object as a Field object. As in the case of the DataSource types, all Field types are loaded as a part of the IDE’s configuration.

Attribute – Another DataSource property is an Attribute. The Attribute has a name and a value. There are two kinds of Attributes. The first kind of Attribute is optional and can be attached by the user to the DataSource definition; in this case the user can set both Attribute properties – name and value. The second kind of Attribute is obligatory and is attached to the DataSource type. This Attribute is loaded as a part of the IDE’s configuration, and the user can change only its value property. In most cases these Attributes are necessary for the template to be used correctly.

2AST is another standard Metadata description
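The difference between the two kinds of Attributes can be sketched as a small Java class. The names and the exact behavior below are assumptions for illustration, not taken from the actual IDE source:

```java
// Hypothetical sketch of the Attribute object: obligatory attributes
// (attached to the DataSource type) allow only their value to be changed,
// while optional user-defined attributes allow both name and value changes.
public class Attribute {
    private String name;
    private String value;
    private final boolean obligatory;

    public Attribute(String name, String value, boolean obligatory) {
        this.name = name;
        this.value = value;
        this.obligatory = obligatory;
    }

    public String getName()  { return name; }
    public String getValue() { return value; }

    // The value may always be edited by the user.
    public void setValue(String value) { this.value = value; }

    // Renaming is only allowed for optional (user-defined) attributes.
    public void setName(String name) {
        if (obligatory)
            throw new IllegalStateException("obligatory attribute cannot be renamed");
        this.name = name;
    }
}
```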

RelativePath – A simple object providing conversion of absolute paths to relative paths and vice versa. It is used during project loading and saving.
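The job RelativePath performs can be approximated with the standard java.nio.file API. This is a sketch of the same idea, not the actual implementation (java.nio.file did not exist at the time, so the original object implemented the conversions by hand):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the RelativePath conversions using java.nio.file.
public class RelativePath {
    // Separators are normalized to '/' so results are platform independent.
    public static String toRelative(String baseDir, String absolute) {
        Path rel = Paths.get(baseDir).relativize(Paths.get(absolute));
        return rel.toString().replace(java.io.File.separatorChar, '/');
    }

    public static String toAbsolute(String baseDir, String relative) {
        Path abs = Paths.get(baseDir).resolve(relative).normalize();
        return abs.toString().replace(java.io.File.separatorChar, '/');
    }
}
```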

JarClassLoader – The IDE’s extensibility, described in section 5.2.5 on page 52, is generally based on the Java ARchive (JAR) technology. In the Java(TM) programming language, every resource available on the classpath can be used. The classpath can be set statically before the application is started or dynamically once the application is already running. The JarClassLoader object has two main responsibilities: first, it is used to obtain information about the content of a JAR file; second, it is used to dynamically create a classpath entry pointing to the JAR file. More information can be found in sections 5.2.7 on page 55 and 5.2.9 on page 65, where the concrete implementation of the extensibility model is described.
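The second responsibility, extending the classpath at run time, is what the standard URLClassLoader provides. A minimal sketch of the idea follows; the real JarClassLoader’s interface is not reproduced here, and the URL list is left empty in the test so the class is found through the parent (application) class loader:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Minimal sketch of dynamic class loading: a URLClassLoader is created over
// a list of JAR URLs and asked for a class by name, the way a module class
// inside a JAR can be made reachable after the application has started.
public class DynamicLoad {
    public static Object instantiate(String className, URL[] jarUrls) {
        try (URLClassLoader loader = new URLClassLoader(jarUrls)) {
            return loader.loadClass(className).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new RuntimeException("cannot load module class " + className, e);
        }
    }
}
```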

TransformationLoader – This object is used for loading modules of the Transformation type. It is described in section 5.2.7 on page 55.

TransformationHash – This object stores the necessary information about a Transformation module, obtained using the JarClassLoader and the TransformationLoader. It is described in section 5.2.7 on page 55.

TransformationVector – This object is just a simple accumulator storing TransformationHash objects. It also provides some special functions which are not in the standard Vector definition. It is described in section 5.2.7 on page 55.

WizardLoader – This object is used for loading modules of the Wizard type. It is described in section 5.2.9 on page 65.

WizardHash – This object stores the necessary information about a Wizard module, obtained using the WizardLoader and the JarClassLoader. It is described in section 5.2.9 on page 65.

WizardVector – This object is just a simple accumulator storing WizardHash objects. It also provides some special functions which are not in the standard Vector definition. It is described in section 5.2.9 on page 65.


Package SumatraGUI\XML

XmlTool – This object provides many functions for reading and writing XML files of a special format. The supported XML format was defined by the diploma work supervisor. The XmlTool object is used for loading a DataSource from its Metadata XML file, writing the DataSource object into the Metadata XML file, etc. It is primarily used by the treePanel mentioned above.

XmlDSParentMaint – This object can also read and write XML files. It is used not for reading and writing DataSource definitions, but for reading and writing the information that has to be stored when a user wants to save the project. It provides functions for loading and saving the state of the treePanel.

Package SumatraGUI\TypesCustomizer

dlgDataTypeCustomizer – As already mentioned, the IDE contains a customization tool, described in section 5.2.10 on page 68. The dlgDataTypeCustomizer object represents a graphical tool for DataSource customization. The functions of this tool are described in section 6 on page 69.

dlgFieldTypeCustomizer – Like the dlgDataTypeCustomizer object, the dlgFieldTypeCustomizer also represents a graphical tool. This one is used for Field customization. The functions of this tool are described in section 6 on page 70.

Package SumatraGUI\DataSourceEditor

frmMainDSE – This object can be regarded as a standalone application allowing the user to modify the properties of an existing DataSource. It is a part of the IDE in order to extend its functionality. The frmMainDSE is described in section 5.2.11 on page 71.

Package SumatraGUI\AST

ASTDSGenerator – This object is used to generate an AST description of a DataSource object.

There is one more package in SumatraGUI — goal. This package contains objects implementing the panel which can be found on the right side of frmMain. This panel and the other classes in this package are very important, because they allow a Transformation to be designed, allow the Transformation process to be run, etc. The classes inside this package were not implemented by myself; they are the main topic of another diploma work, so I will not describe them here.


Figure 5.2: DataSource object interactivity diagram (functional objects: DataSource, DataSourceType, Field, Attribute, ASTDSGenerator; graphical delegates – nodes: DataSourceNode, FieldNode)

DataSource object interactivity model

In order to make the main interactivity model smaller and easier to understand, I will describe in this section one of its sub-models. The sub-model contains the objects which are used when the IDE works with DataSource definitions; they can be treated as a unit.

The sub-model contains these objects:

• DataSource

• DataSourceType

• Field

• Attribute

• DataSourceNode

• FieldNode

• ASTDSGenerator

The interactivity sub-model can be seen in figure 5.2.


Figure 5.3: IDE object interactivity diagram (frmMain, treePanel, MyJPanel, dlgDataTypeCustomizer, dlgFieldTypeCustomizer, JarClassLoader, the transformation and wizard loaders, hashes and vectors, XmlTool, XmlDSParentMaint, TypesReader, TypesWriter, RelativePath, frmMainDSE, AboutBox, and the Xerces SAX parser)


Objects interactivity model of the whole IDE

The object interactivity diagram can be found in figure 5.3 on the preceding page. It contains the DataSource object interactivity diagram inside it, in a simplified form – a square divided into a functional objects part and a graphical interpretation (nodes) part.

5.2.3 IDE’s UI model

When designing the user interface of the application, I wanted to follow the conventions common to applications of the IDE type. In order to find those conventions, I compared several IDEs developed by different companies, e.g. Borland Inc. or NetBeans, and then used the core of their user interfaces. The model of the UI is shown in figure 5.4 on the next page.

There are six basic parts which are common to almost all types of IDEs:

Menu – The menu is now a standard for almost all kinds of applications, and IDEs are no different. The menu allows the user to run commands for e.g. saving, loading and creating projects, configuring the IDE, running other applications associated with the IDE, etc.

Control buttons – These buttons are placed right under the menu and often offer a subset of the menu functionality. The panels where these buttons are placed are called toolbars. Each IDE has its own unique toolbars.

Component palette – This is often a special toolbar which contains the components available for use in the IDE. It is often separated into groups according to component type.

File system – This panel is sometimes also called a repository. This panel displays the available source files of different kinds. Each IDE has its own source file types. The panel is mostly on the left side of the IDE.

Workspace – The workspace is the place where the main designing is done. It is possible to add components from the component palette and use them. The workspace is mostly placed on the right side of the IDE.

Status bar – The status bar is a place where information about the currently running processes is displayed.

All of the mentioned IDE parts except the status bar were used when designing the IDE’s user interface. The final user interface of the IDE is shown in figure 5.5 on page 48.


Figure 5.4: User Interface model (menu, control buttons, component palette with categories, repository or file system, workspace, status bar)


Figure 5.5: User Interface of the IDE


Figure 5.6: Add DataSource from file

5.2.4 IDE’s Use Case/Data Flow model

This section is devoted to the basic actions which can be performed using the IDE. These actions will be described, and each will have its own data flow model, in order to explain how the IDE handles them and which objects are used. The data flow diagrams contain objects which have already been described in section 5.2.2 on page 39. Because there are many actions in the IDE, I will describe here only the actions implemented by myself, concerning DataSources and the appending of a proper transformation to the workspace.

Add DataSource from file – This action is available from a popup menu displayed from the root node of the treePanel. The action allows the user to choose an XML or STTDS3 file from a standard file chooser. This file is posted to the parser and the result is displayed on the treePanel as a DataSourceNode picture. It is possible to have more than one DataSource definition in the file; in this case the parser creates a separate STTDS file for each DataSource definition, and from then on the IDE works with the STTDS file. See figure 5.6.

Add new DataSource – This action creates an empty DataSource of unknown type and creates an STTDS file for it. The user is able to modify it using the DataSource editor. This action is available from a popup menu displayed from the root node of the treePanel. See figure 5.7 on the next page.

Remove all DataSources – This action is available from a popup menu displayed from the root node of the treePanel. It is used to remove all DataSources from the project and from the treePanel. It is a little complicated, because each DataSource has to be tested for whether it is present on the workspace or not: a DataSource existing on the workspace cannot be removed from the project. See figure 5.8.

3The extension STTDS marks a file containing DataSource definitions readable by the IDE.

Figure 5.7: Add new DataSource

Figure 5.8: Remove all DataSources

Remove – This action is available from a popup menu displayed from the DataSource node of the treePanel. It is used to remove the selected DataSource from the project and from the treePanel. A DataSource existing on the workspace cannot be removed from the project, so it is necessary to check whether the selected DataSource is present on the workspace. See figure 5.9.

Figure 5.9: Remove


Figure 5.10: Add to the WorkSpace

Figure 5.11: Modify

Add to the WorkSpace – This action is available from a popup menu displayed from the DataSource node of the treePanel. It checks whether the DataSource already exists on the workspace and, according to the result, appends the DataSource to the workspace. See figure 5.10.

Modify – This action is available from a popup menu displayed from the DataSource node of the treePanel. It runs the DataSource editor and allows the user to modify the properties of the DataSource. This action consists of two steps: the user runs the Modify command and then presses the Save changes button in the DataSource editor frame. See figure 5.11.

Clone – This action is available from a popup menu displayed from the DataSource node of the treePanel. This action is a special one: it creates a new DataSource as a clone of an existing one. The new DataSource created as a clone has the same definition, saved to another STTDS file. See figure 5.12 on the following page.

Refresh – This action is available from a popup menu displayed from the DataSource node of the treePanel. It is possible to modify the STTDS file outside of the IDE, even while the IDE is running. This action synchronizes the DataSource loaded into the IDE with the STTDS file where the definition of the DataSource is stored. See figure 5.13 on the next page.

Figure 5.12: Clone

Figure 5.13: Refresh

Adding a transformation to the workspace – This action contains two user actions: selecting a transformation in the component palette and then clicking on the workspace. When the IDE gets the click event on the workspace, it checks whether a transformation is selected and, according to this, creates a transformation instance on the workspace. See figure 5.14.

Figure 5.14: Adding transformation to the workspace

5.2.5 IDE’s Extensibility model

In order to provide application extensibility, it is necessary to define what kinds of objects we will allow to be added into the IDE structure to extend its functionality. There are some special requirements both on the object which will extend the application and on the application to be extended.

Object requirements

If an object is able to extend an application’s functionality, it means that it is also able to “live” alone: the object can be developed without the target application and its functions can be called without the target application. Such an object can be called a JavaBean, and its obligatory properties are documented in the JavaBean specification [16].

A JavaBean is an object, or a set of objects, containing hidden (private) properties and visible (public) methods providing access to these properties. It can also provide any other kind of functionality. There are no limitations on a JavaBean’s functionality, but there are some rules which have to be kept during the JavaBean’s implementation. A proper implementation, as defined in the JavaBean specification, allows the application to introspect the JavaBean to get the information necessary to use it properly.

In order not to have many files (objects) representing one extension module (JavaBean), the Java ARchive technology is used to pack the module into a single file. The Java ARchive technology is a standard packaging technology based on the ZIP packaging technology. It encapsulates the whole module into one JAR file and adds an information file called the Manifest. The Manifest file is a text file containing information about the module. It can be edited in order to allow the application to obtain some information about the module without unpacking the JAR file, because the standard Java(TM) library supporting JAR file usage contains methods for reading the Manifest content directly.
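Reading the Manifest without unpacking the archive is done with the standard java.util.jar classes. The sketch below writes a minimal module JAR carrying a custom Manifest entry and then reads it back; the entry name Module-Class is invented for illustration and is not the IDE’s actual key:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ManifestDemo {
    // Create a minimal JAR whose Manifest carries a custom entry.
    public static Path writeModuleJar() {
        try {
            Manifest mf = new Manifest();
            // Manifest-Version must be present, otherwise no attributes are written.
            mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
            mf.getMainAttributes().putValue("Module-Class", "demo.MyTransformation");
            Path jar = Files.createTempFile("module", ".jar");
            try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar), mf)) {
                // no entries needed for this demonstration
            }
            return jar;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the entry back directly from the Manifest, without unpacking the JAR.
    public static String readModuleClass(Path jar) {
        try (JarFile jf = new JarFile(jar.toFile())) {
            return jf.getManifest().getMainAttributes().getValue("Module-Class");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```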

To summarize the requirements on an extension object: the object has to be in compliance with the JavaBean specification and has to be encapsulated in a JAR file containing a proper Manifest file.

Application requirements

The requirements on the application follow from the object requirements. The application has to provide an API4 which strictly defines the communication between the module and the application, e.g. which methods will be called on the JavaBean to initialize it and run its functions. It also has to be defined which entries have to be present in the Manifest file. The application itself then has to contain a loader capable of reading the JAR file and the module objects. The loader has to be able to distinguish between different kinds of modules (if necessary) and to load the proper objects into memory so that they can be used.

4Application Programming Interface

Figure 5.15: IDE’s Extensibility process model

5.2.6 Modules

This section describes what types of modules can be added to the IDE to extend its functionality.

Transformation module

Generally, a transformation is a process in which something becomes something else, so there always exists an input and an output. The course of the process is given by the transformation definition.

The physical transformation is not done in the IDE; SumatraTT does it. A transformation always has an input DataSource and often also an output DataSource, but this is not a rule. For example, the GUI transformation discussed in section 5.2.8 on page 61 does not have a DataSource as its output; its output is a set of graphical outputs displayed on the screen.

The definition of a transformation in the IDE is done by a template. The template contains a static metadata description and parameters which are appended to the template by the IDE during the transformation design, thereby customizing the transformation.

Each transformation can contain different kinds of parameters, so it is necessary to allow their values to be edited. The transformation module consists of two basic parts:

• template

• tools allowing parameter editing for the concrete template

Some parameters are common to many types of transformations. In fact, some transformation parameters can be completely covered by the same parameter-editing tool. The IDE already implements a simple tool allowing some kinds of parameters to be edited, so it is not necessary to create a special tool for each new template.

In order to allow all kinds of parameters to be edited and the transformation to be configured, I have constructed an API that handles transformation implementation, loading, initializing, etc. The API is discussed in section 5.2.7.

Wizard module

As already mentioned, the physical DataSource is described by an XML file containing metadata. Such a file has to be created for each DataSource. The IDE provides functionality to create and modify the file in the graphical interface. The problem is that it is always necessary to start with a blank file, which can be tedious when the DataSource is complicated.

The Wizard module provides functionality for creating this file. The file is not written by the user; instead, the Wizard analyzes the DataSource and writes the results to the file. Typically, one Wizard is needed for each DataSource type. A Wizard has to be implemented and added to the IDE using the Wizard API discussed in section 5.2.9 on page 65.

5.2.7 Transformation extension API

This section is intended for extension module programmers.


This API allows the implementor of a transformation to add it to the IDE as a module.

A module can be defined as a set of components and logic packed into a JAR file. This file has to be identifiable by the IDE module loader and has to contain class files extending and implementing the necessary interfaces and base classes defined in the IDE. The module extends the IDE's functionality and usability.

In fact, if the implementor of a new module respects all rules and recommendations defined here, the module's functionality is not limited in any way.

The transformation module definition

This section explains what a transformation module is from the IDE's point of view.

Each transformation module consists of two basic parts:

• Transformation template – a .sum file with the transformation definition.

• Transformation logic – a .jar file containing classes with the transformation logic, methods for user customization, etc.

The rule for these two files is that the template file must have the same name as its corresponding .jar file.

The number of available transformations is determined by the number of transformation templates – the .sum files, which are stored in the "templates" directory separately from the .jar files with the transformation logic. The numbers of templates and .jar files need not be the same. If the loader finds a template but no corresponding .jar file, the .jar file with the Default Transformation is used. Figure 5.16 on the following page shows the encapsulation of the transformation module.
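The naming rule and the default fallback described above can be sketched as follows. This is a minimal illustration; the class and method names are hypothetical, not part of the IDE:

```java
import java.util.*;

// Hypothetical sketch of the template-to-logic matching rule: a template
// "X.sum" is paired with "X.jar" when such a file exists, otherwise with
// DefaultTransformation.jar.
public class TemplateMatcher {
    public static Map<String, String> match(List<String> templates,
                                            Set<String> jarFiles) {
        Map<String, String> pairing = new LinkedHashMap<>();
        for (String template : templates) {
            // strip the ".sum" extension to get the base name
            String base = template.endsWith(".sum")
                    ? template.substring(0, template.length() - 4)
                    : template;
            String jar = base + ".jar";
            // fall back to the default transformation logic if needed
            pairing.put(template,
                    jarFiles.contains(jar) ? jar : "DefaultTransformation.jar");
        }
        return pairing;
    }
}
```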

Creating a custom transformation

The transformation extension API is quite permissive towards the transformation implementor; there are only a few rules and recommendations to keep. Here we will deal only with the transformation logic stored in the .jar file, because that is what the API is about.

The .jar file can contain many different classes and other files, such as pictures and HTML documentation files. One of those classes has to follow two basic rules:


Figure 5.16: Transformation module diagram

1. The class has to implement the UniversalTransformation interface.

2. The class has to extend the Transformation class.

The class which follows these rules will be called the transformation MainClass. Once the transformation module has been successfully loaded, the IDE communicates with the transformation by means of methods defined in the UniversalTransformation interface and the Transformation class. Many of the methods can be overridden, so the transformation implementor can call his own methods and classes, which are almost unlimited in functionality.
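A minimal MainClass satisfying both rules might look as follows. The Transformation base class and the UniversalTransformation interface shown here are simplified stand-ins (assumptions) for the real types shipped with the IDE, which offer many more methods:

```java
// Simplified stand-ins for the IDE's base types (assumptions; the real
// definitions live in the IDE).
interface UniversalTransformation {
    void customize();
}

class Transformation {
    private String name = "";
    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
}

// A transformation MainClass: extends Transformation and implements
// UniversalTransformation, as the two rules require.
public class RowFilterTransformation extends Transformation
        implements UniversalTransformation {

    public RowFilterTransformation() {
        super();                      // rule: the constructor must call super()
        setName("Row filter");
    }

    @Override
    public void customize() {
        // in the real IDE this would open a customization dialog
    }
}
```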

The functions available from the MainClass are described in Appendix A on page 74.

The MainClass is not the only necessary part of the .jar file. Because this API uses Java(TM) standards, the .jar file contains a manifest file. This is a special file with information about the content of the .jar file; the file and its entries can be read directly (without decompressing the .jar) through the Java JAR API. All the entries described in Table 5.1 on the following page are obligatory.


Table 5.1: Entries in the manifest necessary for proper transformation module loading

Display-Name – The name of the transformation.

Main-Class – The full path (with packages, in the transformation .jar file scope) to the MainClass.

Icon-Name – The full path (with packages, in the transformation .jar file scope) to a .gif file of size 30x30 pixels. This icon will be placed on the button in the transformation toolbar.

Module-Type – The module type must be specified here; in this case, the module type is "Transformation".

Short-Description – A short description of the transformation, used as part of the tooltip for the relevant button in the transformation toolbar.

Transformation loading

This section explains what the IDE does during each start and how transformation loading works.

The transformation loading is done by two classes:

• TransformationLoader

• JarClassLoader

The JarClassLoader is used for module loading in general, because it contains the logic for JAR file decompressing, manifest reading, etc. In contrast, the TransformationLoader searches for the transformation files and uses the JarClassLoader to obtain information.

The searching and loading process is very easy to understand. First the TransformationLoader obtains a list of available templates. For each template found, the TransformationLoader tries to find the relevant .jar file in the transformations directory. Note that the relevant .jar file is the one whose name matches the name of the template. If the relevant .jar file is found, the JarClassLoader is used to obtain the necessary information; if it is not found, DefaultTransformation.jar is used instead. The information is stored in a special class called TransformationHash. This class provides setters and getters for storing the information and one special method for


creating a transformation instance. Here is the list of information entries stored in the TransformationHash class:

Type – Entry name
java.lang.Class – MainClass
javax.swing.ImageIcon – Icon
java.lang.String – DisplayName
java.lang.String – ShortDescription
java.lang.String – Template

So now we have, for each template, one instance of TransformationHash containing the necessary information. These instances are stored in a special vector called TransformationVector, a class that provides the standard java.util.Vector methods plus some special finding methods. At the end of the loading process, one instance of the TransformationVector is passed to the IDE. The loading process is also described by the process diagram in figure 5.17 on the next page.
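The container can be sketched as follows. This is an illustration, not the IDE's actual code: only two entries are shown with their setters and getters, and the reflective call is my assumption about how the stored MainClass is instantiated:

```java
import javax.swing.ImageIcon;

// Sketch of the TransformationHash container holding the entries listed
// above (assumed field names follow the entry names).
public class TransformationHash {
    private Class<?> mainClass;
    private ImageIcon icon;
    private String displayName;
    private String shortDescription;
    private String template;

    public void setMainClass(Class<?> c) { this.mainClass = c; }
    public Class<?> getMainClass() { return mainClass; }
    public void setDisplayName(String n) { this.displayName = n; }
    public String getDisplayName() { return displayName; }
    // ... setters and getters for icon, shortDescription and template
    //     would follow the same pattern ...

    /** Creates a fresh instance of the transformation's MainClass. */
    public Object getTransformationInstance() throws Exception {
        return mainClass.getDeclaredConstructor().newInstance();
    }
}
```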

Transformation initialization

This section describes how the IDE initializes a loaded transformation. It will be very short, because all the important information was given in the previous sections.

As mentioned, all available transformations are stored in a special vector called TransformationVector. This vector is set in the main frame as a global variable named "transformations". When the main frame is initialized, a toggle button is created for each entry in the transformations vector, using the information stored in the entry itself: the icon is placed on the toggle button, the tooltip is set, etc. Each toggle button is placed into a toolbar panel. Every time the user presses one of the toggle buttons, a global variable named "selectedTransformation" receives the TransformationHash relevant to the selected toggle button.

When the user clicks on the right panel (the work space), the panel simply calls:

selectedTransformation.getTransformationInstance()

This method returns a transformation instance, and the transformation is placed on the work space. Now the user can show and set up its properties, run the transformation, etc., using the standard transformation graphical user interface.


Figure 5.17: Loading process diagram.


Figure 5.18: Graphical Transformation customization dialog

Transformation API summary

As you can see, creating a new transformation is relatively simple. It is necessary to follow all the rules and recommendations so that the loader can initialize the transformation properly.

5.2.8 Graphical Transformation

The Transformation API discussed in the previous section was used to create the Graphical Transformation, a special transformation which displays data from a concrete DataSource in a table or as a graph5.

The input of this transformation is a DataSource; the output is a set of graphical views displayed on the screen. Before the transformation is used, it must be configured. The configuration screen is shown in figure 5.18. The user has to choose which fields and rows from the DataSource will be displayed, and then which graphical outputs will be used. There are four possibilities:

Table – All selected data will be displayed in a table. It is possible to sort

5The data is displayed using an open-source project called JFreeChart, a graphical library that makes it easy to draw different kinds of graphs. It is available at www.jFreeChart.com.


Figure 5.19: Graphical Transformation — Table screen

the table in ascending order and to change the order of the displayed columns. The table is shown in figure 5.19.

Simple graph – The simple graph is for users who want to see their data displayed in a graph but do not want to spend time on graph configuration. In this case, the Graphical Transformation displays a tabbed dialog with a separate tab for each field. Each tab shows an XY graph, where the X value is the row number and the Y value is the actual field value. The user can fully configure the final view, e.g. the color of the graph's background, the width of the lines, etc. The graph is shown in figure 5.20 on the following page.

Graph – The user can display data in different types of graphs. To choose which data to display in which graph type, some configuration is necessary. The graph configuration dialog is shown in figure 5.21 on the next page.

In this dialog, the user chooses which type of graph to see, which fields will be used to construct the graph, etc. The user can create more than one graph configuration; when the transformation is started, each configuration is used to create a separate graph view. There are five types of graph for displaying data of the Number type; others are prepared but not implemented. The implemented graphs are shown in figure 5.22 on page 64.

Statistics – This choice serves for displaying some statistical information about the data. It was not developed by myself, so I will not describe


Figure 5.20: Graphical Transformation — Simple graph

Figure 5.21: Graphical Transformation — Graph configuration


Figure 5.22: Graphical Transformation — Available graphs


it here in detail.

5.2.9 Wizard extension API

This section describes an API for adding a Wizard module to the IDE.

What a Wizard is was described in section 5.2.6 on page 54. The Wizard extension API complies with the general extension API proposed in section 5.2.5 on page 52. This means that the Wizard module should be an independent JavaBean encapsulated in a JAR file with a proper manifest file inside, and that the IDE should implement this API together with a loader able to load the Wizard module into the IDE.

The principle of the Wizard extension API is very similar to that of the Transformation API.

Wizard module definition

As already mentioned, the Wizard is a JavaBean that allows the user to create a metadata description of an existing DataSource. There are several ways to do this, and the method used depends on the concrete Wizard and of course on its implementor. This need not be the only functionality the Wizard provides; in fact, as in the case of the Transformation API, there are almost no limits.

Creating a custom Wizard

The Wizard itself can be a very complicated module, so it is not a good idea to implement all its methods and properties in one single class file, if that is even possible. Many classes can together form one Wizard. To use the Wizard in the IDE, one of these classes must extend a base class called UniversalWizard. This class provides methods which are necessary and useful for communicating and exchanging information with the IDE. The Wizard class extending the UniversalWizard class will from now on be called the MainClass. A Wizard without a MainClass cannot be used as an IDE module.

The methods available from the MainClass are discussed in Appendix B on page 77.


Figure 5.23: Wizard loading process

Wizard loading

All classes in the Wizard module have to be packed into a JAR file, so that the JAR file loader can be used, the granularity is preserved and the extension specification is kept. Two loaders are used to load the Wizard into the IDE:

• WizardLoader

• JarClassLoader

The loading process uses the WizardLoader to get the Wizards, and the WizardLoader uses the JarClassLoader to obtain the necessary information and references from the physical JAR file.

The process is shown in figure 5.23 and can be described as follows:

1. The WizardLoader goes into the modules\Wizards directory and collects all .jar files it can find there.

2. Each .jar file found in the directory is passed to the JarClassLoader, which tests whether all necessary information and references can be obtained.

3. The JarClassLoader posts all information to the WizardLoader, which creates a special class named WizardHash, where all the obtained information is stored.


Table 5.2: Entries in the manifest necessary for proper wizard module loading

Display-Name – The name of the wizard.

Main-Class – The full path (with packages, in the wizard's .jar file scope) to the MainClass.

Module-Type – The module type must be specified here; in this case, the module type is "Wizards".

Short-Description – A short description of the wizard, used as part of a tooltip in the relevant places (menus, combo boxes, etc.).

4. The WizardLoader puts the created WizardHash into a WizardVector, an accumulator where objects of the WizardHash type are stored.

5. When all .jar files have been processed, the WizardVector instance is posted to the frmMain, where it is stored and used to create the proper menu entry for each Wizard, etc. For this to work, the Wizard JAR file has to contain a proper manifest file. The obligatory manifest entries are given in table 5.2.
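Reading the obligatory entries can be sketched with the standard java.util.jar API. The helper class and method below are hypothetical; the entry names follow table 5.2:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper: reads the obligatory Wizard entries from a
// manifest stream using the standard java.util.jar.Manifest parser.
public class WizardManifestReader {
    public static String[] readEntries(InputStream manifestStream)
            throws IOException {
        Manifest manifest = new Manifest(manifestStream);
        Attributes main = manifest.getMainAttributes();
        return new String[] {
            main.getValue("Display-Name"),
            main.getValue("Main-Class"),
            main.getValue("Module-Type"),
            main.getValue("Short-Description")
        };
    }
}
```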

Wizard initialization

Note that even when the Wizard has been loaded into the IDE and is offered in the menu, there is no real instance of the Wizard yet. All the IDE holds are references and information obtained from the manifest file. The real instance is not created until the user wants to see the Wizard, that is, until the user chooses the proper menu command. When this happens, a simple process is invoked.

The process initializes the Wizard and can be described as follows:

1. An instance of the Wizard is created, using a special method in the WizardVector.

2. The parent frame is set.

3. The runWizard() method is called on the created instance.


5.2.10 IDE configurability

This section is intended for users who want to add, remove or configure the supported DataSources and Fields, their properties and their graphical representation.

I will not repeat the Transformation API and the Wizard API in this section, even though they can also be treated as configuration possibilities for the IDE; instead, I will describe two basic configurable elements of the IDE.

Both can be configured either inside or outside the IDE, because both configurations are stored in separate text files and it is easy to understand what each parameter in the file means. Users are nevertheless recommended to use the configuration tools which are part of the IDE.

DataSources configuration

In the IDE, DataSources are represented by an object with these properties:

Type – a text value which indicates what kind of DataSource is used, e.g. DelimitedText, JavaDB, etc. It is used to recognize the DataSource while the DataSource file is being loaded and to inform the user; for the latter purpose the DataSource type is represented by a small picture, the icon. If the user tries to load a DataSource which is not defined in the configuration file, the Unknown DataSource is displayed instead.

Icon – as mentioned, the icon graphically represents the DataSource type in the IDE. It is displayed in the IDE's left panel and in the DataSource editor, which is also part of the IDE. The picture has only two requirements: it must be a .gif file and its size should be 20x20 pixels.

Attributes – because each DataSource can have some special properties, e.g. the DelimitedText has a Delimiter property and a File property, it was necessary to store these special properties, which can differ for each DataSource type, somewhere. The Attributes parameter represents a set of optional DataSource properties.

These parameters are stored in a special file, dataSources, located in the Gui\Config directory. All changes made to this file take effect as soon as the IDE is restarted.


Manual DataSource configuration

To change the file manually, it is necessary to understand its structure. The structure is very similar to that of a standard property file, which means that each line of the file contains a statement of the form:

property=value

In the definition of a DataSource, the type property has to be set first, then the icon and finally the attributes, each property on a separate line.6 The definition of another DataSource starts on the line right after the previous definition. Here is an example of a dataSources file where two different DataSources are defined; each has a type and an icon, the first has two attributes and the second has none.

type=DelimitedText
icon=/Support/Icons/delText.gif
attribute=delimiter
attribute=filename
type=ADOSQL
icon=/Support/Icons/adosql.gif
etc...

The user can configure the properties of DataSources by modifying this file.
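Parsing this format is straightforward. The following sketch (a hypothetical class, not part of the IDE) shows one way to read the definitions, assuming that every type line starts a new DataSource definition:

```java
import java.util.*;

// Hypothetical parser for the dataSources file format described above.
public class DataSourceConfigParser {
    /** One parsed DataSource type definition. */
    public static class DataSourceType {
        public String type;
        public String icon;
        public final List<String> attributes = new ArrayList<>();
    }

    public static List<DataSourceType> parse(List<String> lines) {
        List<DataSourceType> result = new ArrayList<>();
        DataSourceType current = null;
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 0) continue;                 // skip malformed lines
            String key = line.substring(0, eq);
            String value = line.substring(eq + 1);
            switch (key) {
                case "type":                      // "type" starts a new definition
                    current = new DataSourceType();
                    current.type = value;
                    result.add(current);
                    break;
                case "icon":
                    if (current != null) current.icon = value;
                    break;
                case "attribute":
                    if (current != null) current.attributes.add(value);
                    break;
            }
        }
        return result;
    }
}
```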

DataSources configuration using IDE’s tool

The IDE contains a tool which allows the user to change the configuration file in GUI mode. This way is much safer than the one described in the previous section. The tool can be found in the Options menu of the running IDE. The Edit DataSource types command shows a dialog where the user can add or remove DataSources and edit their properties: change the type name, choose an icon, modify the attributes, etc. The tool is shown in figure 5.24 on the following page.

As soon as the user presses the Save changes button, a new dataSources file is written.

Fields configuration

The Fields configuration is very similar to the DataSource one. The Fields configuration file is stored in the same directory and is named fields. The only difference between DataSource properties and Field properties is that a Field does not have Attributes. All changes made to the fields file take effect as soon as the IDE is restarted.

6The icon value has to contain the full path to the .gif file, starting from the GUI directory and using the '/' file separator.


Figure 5.24: DataSource configuration using IDE’s tool

Manual Fields configuration

As with manual DataSource configuration, it is necessary to understand the structure of the fields file. The structure is the same as in the dataSources file, but as already mentioned, a Field does not have Attributes. Here is an example of a fields file where two different Fields are defined, each with a type and an icon.

type=String
icon=/Support/Icons/string.gif
type=Number
icon=/Support/Icons/Number.gif
etc...

Fields configuration using IDE’s tool

As in the case of DataSources, the IDE provides a GUI tool for changing the configuration file. The tool can also be found in the Options menu of the running IDE. The Edit Field types command shows a dialog where the user can add or remove Fields and edit their properties: change the type name and choose an icon. When the user clicks the Save changes button, a new fields file is created. The tool is shown in figure 5.25 on the next page.


Figure 5.25: Field configuration using IDE’s tool

5.2.11 DataSource editor

This section is devoted to the DataSource editor tool, which is part of the IDE.

As already mentioned, the IDE works with DataSources stored in XML files as metadata describing the real DataSources. The user can write or modify these XML files manually, or use a tool providing this functionality. I have integrated such a tool into the IDE to allow the user to create a new DataSource or to modify an existing one. The DataSource editor can be started from the popup menu of a DataSource object. Figure 5.26 on the following page is a screenshot of the DataSource editor.


Figure 5.26: DataSource editor


Chapter 6

Summary

Organizations are becoming increasingly aware of the limitations of their own systems and internal data. Attempts to liberate and leverage data across the organization's stovepipes have been replete with frustration and too many examples of failure. These experiences, coupled with drivers demanding flexibility in business processes, are hastening the day when businesses will implement enterprise systems that support intelligent business decisions using data warehouses with a metadata background.

The system might consist of many different parts and components. Metadata adds a new dimension to data usage, preparation, transformation, etc. SumatraTT is a project which should uncover possibilities and new ways in the data transformation area of development. The graphical user interface is necessary in order to “get closer” to the end user. It seems that the graphical user interface has increased the presentability of the project, and with it the interest of third parties whose goal is to accommodate the “Business Information Demand”: an organization’s continuously increasing, constantly changing need for current, accurate information, often on short notice, to support its business activities.


Appendix A

Methods of the MainClass in the Transformation API

Description of methods available from the MainClass:

Constructor – this is not a real method, of course, but there is a rule that the transformation constructor has to call super().

void customize() – this method allows you to handle a double-click on a transformation placed on the right panel. You can display your own dialogs for transformation customization or use the standard dialog call:

createFrame(getParentFrame(),actionCommand);

where “actionCommand” can be:

[A_C_RENAME] opens a rename dialog.

[A_C_TRANSACTION_PROPERTIES] displays a dialog with the transformation properties read from the template.

[A_C_CHANGE_TEMPLATE] opens a dialog for choosing a template.

void writeObjectToXML(PrintStream f) – this method is called when a project is saved. The transformation implementor can decide what will be stored. First, this method has to call:

super.writeObjectToXML(f);

and then the transformation class name and its attributes should be sent to the PrintStream f in XML format:


<Class name="className">
  <Attribute name="classVersionID">VersionID</Attribute>
  ...
</Class>

void readObjectFromXML(int state, String params) – this method allows you to handle the loading of the class when the project is loaded.
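A typical override of writeObjectToXML might look like this. The base class here is a simplified stand-in (an assumption), and the class name and version attribute are illustrative:

```java
import java.io.PrintStream;

// Simplified stand-in for the IDE's Transformation base class (assumption).
class Transformation {
    public void writeObjectToXML(PrintStream f) {
        // the real base class writes its own part of the project file here
    }
}

public class MyTransformation extends Transformation {
    private static final int CLASS_VERSION_ID = 1;

    @Override
    public void writeObjectToXML(PrintStream f) {
        super.writeObjectToXML(f);      // obligatory first call
        f.println("<Class name=\"MyTransformation\">");
        f.println("  <Attribute name=\"classVersionID\">"
                + CLASS_VERSION_ID + "</Attribute>");
        f.println("</Class>");
    }
}
```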

These were the obligatory methods. The following methods are already implemented and can be overridden:

void drawObject(Graphics2D g, JPanel io) – this method allows you to change the visual representation of the transformation.

boolean connectedToDS(SumatraGUI.Classes.DataSource d) – the method is called when the user tries to connect a DataSource to the transformation. This method calls the accept method.

boolean accept(SumatraGUI.Classes.DataSource d) – in the accept method you can allow or disallow a connection from a DataSource to the transformation. The original method always returns true; you can return true or false according to the DataSource type or any other criteria.

void conectDS(SumatraGUI.Classes.DataSource d) – this method is called when the accept method returns true.

MouseListener methods – the standard set of methods of a MouseListener.

void setName(String name) – sets the name of a transformation instance.

Methods for changing colors:

void setDefaultBarvuPisma(Color c);
void setDefaultBarvuOkraje(Color c);
void setDefaultBarvuVyplne(Color c);
void setBarvuVyplneOznac(Color c);
void setBarvuPisma(Color c);
void setBarvuOkraje(Color c);
void setBarvuVyplne(Color c);

With these methods you can change the colors of the background, text, etc., and change them back to the default values.


void setTemplate(String template) – sets the template of the transformation and thereby its properties.

void setDescription(String template) – sets the description of the template and of the transformation itself.

void setParentFrame(JFrame parent) – sets the parent frame of the transformation. This is necessary for displaying modal dialogs.

All mentioned methods are public.


Appendix B

Methods of the MainClass in the Wizard API

Here is the basic set of methods in the UniversalWizard class with their short descriptions:

public void setName(String name) – this method sets the name of the Wizard. The name is displayed in the Wizards menu of the main frame and in other places, such as combo boxes.

public String getName() – returns the name of the Wizard set by the previous method.

public void setShortDescription(String desc) – sets the Wizard's tooltip. It is displayed whenever the user places the mouse cursor over the relevant Wizard menu item, etc.

public String getShortDescription() – returns the tooltip of the Wizard set by the previous method.

public void setParent(JFrame parent) – sets the parent frame of the Wizard, so that it can be displayed as a modal dialog or obtain some information.

public void runWizard() – this can be regarded as the main method, because it is called to run the Wizard's functionality. If the Wizard has a GUI, it is displayed; if the Wizard is automatic, it is run, etc.

public static final String supportedDS() – this method is very important, because without it the main application cannot decide whether this Wizard


is appropriate for a given DataSource. Using this method, the IDE can recognize and offer the relevant Wizards for different DataSources. This method should be redefined, because the one implemented in UniversalWizard returns an empty array, which means that the Wizard is suitable for all DataSource types.

It is important that the constructor of the Wizard class be a default constructor, i.e. one without any parameters. This is a requirement of the JavaBean standard and it also allows the module to be loaded into the IDE safely.
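A minimal custom Wizard might look as follows. The UniversalWizard shown here is a simplified stand-in (an assumption) for the real base class, the wizard itself is hypothetical, and since the description above speaks of an array, the sketch assumes supportedDS() returns a String[]:

```java
import javax.swing.JFrame;

// Simplified stand-in for the IDE's UniversalWizard base class (assumption).
class UniversalWizard {
    protected JFrame parent;
    private String name;
    public void setName(String name) { this.name = name; }
    public String getName() { return name; }
    public void setParent(JFrame parent) { this.parent = parent; }
    public void runWizard() { }
    // empty array = suitable for all DataSource types
    public static String[] supportedDS() { return new String[0]; }
}

// Hypothetical Wizard for DelimitedText DataSources.
public class DelimitedTextWizard extends UniversalWizard {
    public DelimitedTextWizard() {      // default constructor, no parameters
        setName("Delimited text wizard");
    }

    @Override
    public void runWizard() {
        // the real module would analyze the file and write the metadata XML
    }

    public static String[] supportedDS() {
        return new String[] { "DelimitedText" };
    }
}
```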

