
E-ETL: Framework For Managing Evolving ETL Processes

Artur Wojciechowski
Poznań University of Technology, Institute of Computing Science
Poznań, Poland
[email protected]

ABSTRACT

External data sources (EDSs) being integrated in a data warehouse (DW) frequently change their data structures (schemas). As a consequence, in many cases, an already deployed ETL workflow executes with errors. Since structural changes of EDSs are frequent, an automatic reparation of an ETL workflow after such changes is of high importance. In this paper we present a framework for handling the evolution of an ETL layer. To this end, structural changes are monitored and stored in a Metabase. An erroneous execution of an ETL workflow causes a reparation of the ETL activities that interact with the changed EDS, so that the repaired activities can work on the changed EDS schema. The reparation of the ETL activities is guided by several customizable reparation algorithms. The proposed framework was developed as a module external to an ETL engine, accessing the engine by means of an API. The innovation of this framework is its algorithms for the semi-automatic reparation of an ETL workflow.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems

General Terms
Management

Keywords
ETL, evolution, evolving ETL, E-ETL

1. INTRODUCTION

The data warehouse (DW) architecture has been developed for the purpose of: (1) providing a framework for the integration of multiple heterogeneous, distributed, and autonomous external data sources (EDSs) spread across a company and (2) providing means for advanced data analysis, called On-Line Analytical Processing (OLAP). The DW architecture is typically composed of four layers, i.e., (1) an external data source layer that represents integrated production systems, (2) an Extraction-Transformation-Loading (ETL) layer that is responsible, among others, for extracting data from EDSs, transforming data into a common data model, cleaning data, removing missing, inconsistent, and redundant values, integrating data, and loading them into a DW, (3) a repository layer (a data warehouse) that stores the integrated and summarized data, and (4) an OLAP layer responsible for various types of data analysis and visualizations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PIKM'11, October 28, 2011, Glasgow, Scotland, UK. Copyright 2011 ACM 978-1-4503-0953-0/11/10 ...$10.00.

An inherent feature of EDSs is their evolution in time with respect not only to their contents (data) but also to their structures (schemas). As reported in [25, 8], the structures of data sources change frequently. For example, the Wikipedia schema changed on average every 9-10 days during the last 4 years.

Structural changes must be propagated to the DW architecture. They are difficult to handle and manage since they have an impact on multiple layers of the DW architecture. Firstly, structural changes have an impact on the ETL layer, which must be redesigned and redeployed. Secondly, they have an impact on a data warehouse schema, which must be modified in order to follow the changes in EDSs. The DW schema changes result, in turn, in changes that have to be made to analytical applications. For these reasons, developing a technology for handling structural changes of EDSs and managing the evolution of the DW architecture is of high practical importance.

The research and technological developments in the area of handling structural changes of EDSs in the DW architecture have mainly focused on managing changes in a DW schema. In this field, the five following approaches can be distinguished: (1) materialized view adaptation, (2) schema and data evolution, (3) temporal schema and data extensions, (4) partial versioning of a schema and data, and (5) the Multiversion Data Warehouse approach (all of them are outlined in Section 4). Handling and incorporating structural changes into the ETL layer has so far received little attention from the research community [10, 11].

Paper Contribution. This paper contributes a framework, called E-ETL, for: (1) detecting structural changes of EDSs and (2) handling the changes at the ETL layer. To this end, structural changes are monitored and stored in a Metabase. Changes are detected either by means of an Event-Condition-Action (trigger) mechanism (if it is possible to apply such a mechanism) or by means of comparing two consecutive EDS metadata snapshots. An erroneous execution of an ETL workflow registers in the Metabase an error code and a description of the error. An error registration in the Metabase causes a reparation of the ETL activities that interact with the changed EDS, so that the repaired activities can work on the changed EDS schema. The reparation of the ETL activities is guided by several customizable reparation algorithms. The proposed framework was developed as a module external to an ETL engine. The framework communicates with the ETL engine by means of the ETL engine API. The framework is customizable and it allows one to: (1) work with different ETL engines that provide API communication, (2) define the set of detected structural changes, (3) modify and extend the set of algorithms for managing the changes, (4) define rules for the evolution of ETL processes, (5) present to the user What-If analyses of the ETL workflow, and (6) store versions of the ETL process and the history of EDS changes. The framework has a graphical user interface for visualizing ETL processes. The framework can work either with a traditional data warehouse or with a multiversion data warehouse.

The main concept of the presented solution is based on ideas introduced in [10, 11]. The applicability of E-ETL is extended by co-operation with external ETL tools. E-ETL defines a system for detecting structural changes in EDSs and extends the algorithms for managing the changes.

Paper Organization. The paper is organized as follows. Section 2 presents the concept of the E-ETL framework and its technical architecture. Section 3 overviews the schema of the Metabase. Section 4 outlines research related to the topic of this paper. Section 5 summarizes the paper and outlines issues for future development.

2. E-ETL FRAMEWORK: CONCEPT AND ARCHITECTURE

E-ETL is a project that aims at developing a framework that will be able to support the semi-automatic evolution of ETL processes. The current project is based on our previous developments. In [37] we proposed a prototype system that can automatically detect changes in EDSs and propagate them into a DW. The prototype allows one to define the changes that are to be detected and associates with the changes actions executed in a DW. The main limitation of the prototype is that it does not allow ETL processes to evolve; instead, it focuses on propagating EDSs' changes into a DW.

The E-ETL project focuses on developing a method and a framework for the evolution of ETL processes. In particular, the research and development focus on:

• the development of a prototype architecture, called E-ETL, that will be able to co-operate with a few commercial ETL development environments;

• a graphical interface for visualizing ETL processes;

• tools for detecting structural changes and propagating them into an ETL layer;

• a language for defining rules for the evolution of ETL processes;

• a method for checking the validity of an evolved ETL process;

• a metamodel for storing versions of ETL processes.

E-ETL is designed to co-operate with a few commercial or open source ETL development environments (currently Microsoft SQL Server Integration Services is supported). To this end, E-ETL is a system external to these environments. E-ETL connects to the development environments by means of their APIs.

E-ETL analyses the design of an ETL process, which is defined in an ETL development environment, and on the basis of this design an internal model of the ETL process is created. Next, an ETL designer defines which changes are supposed to be detected and defines a set of rules that specify how the ETL process should evolve in response to the changes. Then, when E-ETL detects structural changes in an EDS, it proposes semi-automatically (in some cases automatically) modifications of the ETL process. After a user's acceptance of the changes, E-ETL applies them to the ETL process in the ETL development environment.
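The detect-propose-accept-apply cycle described above can be sketched as follows. This is a minimal illustration only; none of the function or parameter names below come from the paper, and the real framework operates on its internal graph model rather than plain dictionaries.

```python
# Hypothetical sketch of one E-ETL evolution cycle: detect EDS changes,
# propose repairs of the ETL design, apply only the repairs the user accepts.
# All names here are invented for illustration.

def run_evolution_cycle(design, detect_changes, propose_repairs, accept):
    """One pass of the detect -> propose -> accept -> apply cycle."""
    changes = detect_changes()
    if not changes:
        return design, []                              # nothing to repair
    proposals = propose_repairs(design, changes)
    applied = [p for p in proposals if accept(p)]      # user confirms each repair
    for repair in applied:
        design = repair(design)                        # apply accepted modification
    return design, applied
```

A caller would plug in a connector-backed `detect_changes` and a GUI-backed `accept`; here both would be simple callables.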

2.1 Overall system architecture

An overall technical architecture of E-ETL is shown in Figure 1.

[Figure 1 depicts several EDSs (a database, a DW, record files, Excel files, and XML files) monitored by dedicated Schema analysers and triggers; ETL engines accessed through ETL Connectors and Model Translators; the central Evolution Manager with its Metabase and Visualiser; and a Rule repository holding the Defined rules, Standard rules, Alternate scenarios, and EVE modules.]

Figure 1: The technical architecture of the E-ETL prototype system

The central module of the prototype system is the Evolution Manager. Its tasks include: (1) reading data from an external ETL system, (2) analysis of the content and structure of EDSs, (3) modification of an ETL process, and (4) export of the modified process to an ETL development environment.

The connection to ETL development environments is realised by the ETL Connectors. A dedicated connector is used to connect to a specific ETL development environment by means of its API.
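One plausible shape for such a connector is an abstract interface with one concrete subclass per supported environment. The paper only states that connectors communicate with the engine through its API; the class and method names below are invented for illustration.

```python
# Hypothetical sketch of a dedicated ETL connector interface. The paper does
# not prescribe this API; one subclass per supported environment is assumed.
import copy
from abc import ABC, abstractmethod

class EtlConnector(ABC):
    """One subclass per supported ETL development environment."""

    @abstractmethod
    def read_process_definition(self):
        """Fetch the ETL process design from the engine via its API."""

    @abstractmethod
    def apply_modifications(self, modified_design):
        """Write the evolved process back to the engine via its API."""

class InMemoryConnector(EtlConnector):
    """Toy stand-in for a real engine connector, for illustration only."""

    def __init__(self, design):
        self._design = design

    def read_process_definition(self):
        return copy.deepcopy(self._design)       # the caller edits a copy

    def apply_modifications(self, modified_design):
        self._design = copy.deepcopy(modified_design)
```

Returning a deep copy keeps the engine-side design untouched until the user explicitly accepts the modifications.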


Different ETL development environments may use different data models. Therefore, for every ETL development environment, a dedicated Model Translator needs to be provided. It supports the conversion of the data model used by the ETL development environment to the data model used by the Evolution Manager, and back.

A Schema Analyser allows one to analyze changes in EDSs.

Depending on the EDS type (database, record files, XML file, Excel file), the proper Schema Analyser is used. It checks whether the structure of the EDS has changed. Changes are detected either by comparing two successive snapshots of an EDS's metadata or by the mechanism of schema triggers (if such triggers are supported and allowed to be installed in an EDS). The second approach was described in [37]. Snapshots of the EDS metadata are stored in the Metabase.

After the detection of changes in an EDS, the algorithms

that adapt the ETL process to the detected changes are executed. These algorithms have been categorised as follows: Defined rules, Standard rules, Alternative scenarios, and the EVE module, and are stored in the appropriate modules. An ETL designer can specify the categories that are supposed to be used to modify the ETL process, and their priorities.

The Defined rules module applies user-defined evolution

rules to particular elements of an ETL process. For each element (table attribute, table view, etc.) of an ETL process, a user can define whether this element is supposed to propagate the changes, to block them, or to ask a user. In contrast to this module, the Standard rules module allows one to apply default evolution rules to the entire ETL process.

When the structure of one of the EDSs has changed, then

the Alternative scenarios module tries to find another EDS with a similar structure. After finding a similar EDS, the sequences of operations related to both EDSs (the changed EDS and the similar one) are analysed. Based on the detected differences between these sequences, the Alternative scenarios module proposes modifications of the ETL process. Since the history of the ETL process evolution and the history of the EDS changes are stored in the system, the Alternative scenarios module can search not only in the current EDSs, but also in their previously used versions. Such functionality may be useful when some changes are undone in an EDS.

The EVE module implements rules based on the

solution presented in [17]. For each SQL query, a designer can define the level of automatic modifications. The EVE module works on the basis of these definitions.

The rules used by these four modules are stored in the

Rule repository. The Visualiser module is part of the Graphical User Interface. It visualizes an ETL process as a graph. The graphical presentation makes the process of defining rules for the Defined rules module simple. Another function of the Visualiser is highlighting the parts of an ETL process that have to evolve as the result of changes in EDSs.
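The snapshot-based detection performed by the Schema Analysers can be sketched as a diff of two consecutive metadata snapshots. The snapshot layout below (a mapping from table name to its list of column names) is an assumption of this illustration, not the actual Metabase format, and the change-type labels are ours.

```python
# Minimal sketch of detecting structural changes by comparing two consecutive
# EDS metadata snapshots. Snapshot format (table -> list of column names) and
# event labels are illustrative assumptions, not the paper's notation.

def diff_snapshots(old, new):
    """Return a sorted list of (change_type, table, detail) events."""
    events = []
    for table in new.keys() - old.keys():
        events.append(("table_added", table, ""))
    for table in old.keys() - new.keys():
        events.append(("table_deleted", table, ""))
    for table in old.keys() & new.keys():
        old_cols, new_cols = set(old[table]), set(new[table])
        for col in new_cols - old_cols:
            events.append(("column_added", table, col))
        for col in old_cols - new_cols:
            events.append(("column_deleted", table, col))
    return sorted(events)
```

A renamed column would surface here as a delete/add pair; recognizing it as a rename would need extra heuristics or the trigger-based mechanism.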

2.2 Internal metamodel

The Evolution Manager uses its own internal data model that allows it to unify work with external ETL systems. In this model, an ETL process is represented as a directed graph. Each activity in the ETL process is represented as a SuperNode. A SuperNode consists of Nodes.

There is one Node for each input or output parameter of an ETL activity. An input parameter can be a table attribute that the activity reads, a node in an XML structure, a column in a spreadsheet, or some constant value defined in the activity. SuperNodes without inputs define data sources. SuperNodes without outputs define targets of an ETL process.

[Figure 2 shows an example SuperNode with input Nodes FirstName, SurName, Address, Age, and the constant 30, output Nodes Name, Address, and Class, and the activity: SELECT FirstName + ' ' + SurName as Name, Address, 'med' as Class FROM Clients WHERE Age>30.]

Figure 2: Example of SuperNode

Edges between nodes (inside and outside a SuperNode) determine dependencies between these nodes. So, if there is a directed edge from node A to node B, then node B depends on node A. Such a model allows impact analyses to be performed. The impact analyses mark the parts of an ETL process that have to evolve as the result of structural changes in EDSs. These analyses are done by selecting all nodes succeeding the nodes that have been changed (nodes that describe EDS attributes that have been changed).

Figure 2 presents an example of a SuperNode. The exemplary SuperNode defines an activity that is described as an SQL query. This activity concatenates FirstName and SurName into Name, passes Address through, and creates the attribute Class with the value "med". All these operations are done for tuples read from the Clients table that have a value of Age higher than 30. FirstName, SurName, Address, Age, and the constant value '30' are the input parameters. Name, Address, and Class are the output parameters. The Name output is created from FirstName and SurName; it means that Name depends on FirstName and SurName. Therefore, there are edges between FirstName and Name as well as between SurName and Name. The Address output depends on the Address input. The Age input parameter and the constant value '30' originate from the 'where' clause; therefore, all outputs depend on them.
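The impact analysis on this graph is a plain successor traversal: the parts that must evolve are all nodes reachable from the changed ones. The sketch below encodes the Figure 2 dependencies; the adjacency-list representation and helper names are ours (the output Address node is written `Address_out` only to distinguish it from the input node of the same name).

```python
# Sketch of impact analysis on the internal graph model: an edge A -> B means
# B depends on A, so everything reachable from a changed node must evolve.
# The edge list encodes the Figure 2 SuperNode; names are illustrative.
from collections import deque

EDGES = {
    "FirstName": ["Name"],
    "SurName":   ["Name"],
    "Address":   ["Address_out"],
    # Age and the constant 30 come from the WHERE clause,
    # so all outputs depend on them.
    "Age":       ["Name", "Address_out", "Class"],
    "Const30":   ["Name", "Address_out", "Class"],
}

def impacted(changed_nodes, edges=EDGES):
    """All nodes succeeding the changed ones, i.e. the parts that must evolve."""
    seen, queue = set(), deque(changed_nodes)
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, []):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return seen
```

For example, a change to SurName impacts only Name, while a change to Age impacts every output of the activity.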

2.3 Monitored Structural Changes

As mentioned before, the most common structural changes of EDSs include increasing the length or changing the data type of a column, or adding a new column. Other changes that may occur include: renaming a column, deleting a column, renaming a table, deleting a table, splitting a table, and adding a new table. All of these changes are handled by our framework at the level of a SuperNode. We adopt a solution similar to the one presented in [10]. On each SuperNode, and even on all Nodes, for every type of change a user can define one of three evolution rules: propagate, block, or ask, which respectively propagate the change, block the change, or ask the user to decide at the moment of the change occurrence. The block rule is simple; it just ignores the change and does not modify the SuperNode. The propagate rule instructs the SuperNode that it should be modified according to the change. Every ETL activity represented by a SuperNode can work in a different way. For example, it can be a simple SQL query, or it can just count duplicated elements. For the simple SQL query, a change like adding an attribute may modify both input and output Nodes. However, for an activity that counts elements, a similar change may modify only the input Nodes; the output remains just one numeric value. Activities based on SQL queries are similar and can be handled by rewriting the query. Contrary to this, activities like "remove duplicates" are more complex and each of them has its own parameter set that can be modified. Therefore, for every type of activity there must be a method for handling all types of changes. Since every ETL development environment can have a different set of available ETL activities and they can work in a different way, the handling methods are specific to every ETL development environment.
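The three evolution rules above can be sketched as a small dispatcher. The constants, the handler signature, and the callbacks are illustrative assumptions; the paper does not prescribe this API, and a real `modify` would be the environment-specific handling method for the given activity type.

```python
# Sketch of the per-SuperNode evolution rules named above: PROPAGATE applies
# the change, BLOCK ignores it, ASK defers to the user. All names invented.
PROPAGATE, BLOCK, ASK = "propagate", "block", "ask"

def handle_change(super_node, change, rule, modify, ask_user):
    """Apply one detected change to one SuperNode according to its rule."""
    if rule == BLOCK:
        return super_node                       # ignore the change entirely
    if rule == ASK:
        # Ask at the moment of occurrence, then reuse the other two branches.
        rule = PROPAGATE if ask_user(super_node, change) else BLOCK
        return handle_change(super_node, change, rule, modify, ask_user)
    return modify(super_node, change)           # PROPAGATE: adapt the activity
```

`modify` is where activity-specific behaviour lives: for an SQL-based activity it might add the new attribute to both input and output Nodes, while for a counting activity it would touch only the inputs.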

3. METASCHEMA

As already mentioned, the Rule repository stores the defined rules and the Metabase stores snapshots of EDS structures. Moreover, the Metabase stores the internal definition of ETL processes, the description of EDSs, data describing all detected changes, and data describing all applied modifications. To simplify the structure of the Metabase, it can be divided into four parts: (1) the internal definition of ETL processes, (2) the EDS description, (3) the detected changes, and (4) the rule definitions. All parts have one common Projects table which links data from all parts.

The first part (internal definition of ETL processes) is responsible for storing the internal data model (Section 2.2). This part is composed of 10 tables (including the Projects table), shown in Figure 3. The SuperNodes table stores information about SuperNodes. The EtlObjects table contains optional information that was read from an external ETL system. This information may be helpful when applying modifications in an external ETL system. These objects are specific to an external ETL system and may vary for different systems. The Nodes table stores information about Nodes. The Edges table stores information about Edges. The ProjectFiles table stores various files that can be read from an external ETL system. Those files may, for example, contain the definition of an ETL process (e.g., project files from an external ETL system) or definitions of data source structures (e.g., XML Schema files). The tables SuperNodeTypes, NodeTypes, EdgeTypes, and FileTypes are dictionaries of the types of SuperNodes, Nodes, Edges, and ProjectFiles, respectively. Each modification in an ETL process generates a new version of the internal data model; therefore, the tables SuperNodes, Nodes, Edges, and ProjectFiles contain the attribute Version, which indicates the version of the element.

[Figure 3 shows the following tables: Projects(Id, Name, Description, LastVersion); SuperNodes(ProjectId, Id, Version, Type, EtlObjectId); SuperNodeTypes(Type, Name); Nodes(Id, Version, Type, SuperNodeId); NodeTypes(Type, Name); Edges(Id, SourceNodeId, DestNodeId, Version, Type); EdgeTypes(Type, Name); EtlObjects(ObjectId, SuperNodeId, SuperNodeVers..., Description, Name, Sequence, Object); ProjectFiles(Id, FileName, Type, Version, CheckSum, FileContent, ProjectId); FileTypes(Type, Name).]

Figure 3: Metaschema - the internal definition of ETL processes

The second part (EDS description) contains data about the structure of an EDS. This part is composed of 7 tables (including the Projects table), shown in Figure 4. General information about each data source is stored in the Sources table. If the data source is a relational database or a type of tabular data, then metadata about tables are stored in the Tables table. Otherwise, if the data source is a structured file (for example, XML), then metadata about that structure are stored in the Structures table. The Columns table contains metadata about attributes (columns) of the tables stored in the Tables table and metadata about attributes of the structures stored in the Structures table. The tables SourceTypes and StructureTypes are dictionaries of the types of Sources and Structures, respectively.

[Figure 4 shows the following tables: Projects(Id, Name, Description, LastVersion); Sources(Id, Type, Name, Description, ProjectId, Version); SourceTypes(Type, Name); Tables(Id, SourceId, Name, Description, Version); Structures(Id, SourceId, Name, Type, Definition, Description, Version); StructureTypes(Type, Name); Columns(Id, TableId, StructureId, Name, Description, Virtual, Version).]

Figure 4: Metaschema - EDS description

The third part (detected changes) contains metadata concerning all changes detected in the EDSs. This part is composed of 7 tables (including the Projects table), which are presented in Figure 5. The Changes table stores basic metadata about every change, e.g., the element that is associated with this change, the old and new values of the changed element, or the command that caused this change. If there are any errors caused by the change, they are stored in the EtlErrors table. If any change was handled by the system or manually by a user, metadata about this fact are stored in the Resolved table. The Solutions table stores the rules that were applied to handle the detected changes, as well as the rules' parameters. The ElementTypes and ChangeTypes tables store dictionary data: ElementTypes stores the types of elements that can evolve, whereas ChangeTypes stores the predefined types of changes.
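The relationship between these tables can be illustrated with a toy in-memory database: a change is pending until a Resolved row links it to the Solution that handled it. The DDL below is an abbreviated sketch of ours, not the actual Metabase schema.

```python
# Illustrative sketch (not the actual Metabase DDL) of the detected-changes
# part: a Changes row counts as handled once Resolved links it to a Solutions
# row. Column lists are abbreviated; identifiers are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Changes  (Id INTEGER PRIMARY KEY, ElementType TEXT,
                           ChangeType TEXT, OldVars TEXT, NewVars TEXT);
    CREATE TABLE Solutions(Id INTEGER PRIMARY KEY, UsedRuleId INTEGER,
                           RuleParameters TEXT);
    CREATE TABLE Resolved (ChangeId INTEGER REFERENCES Changes(Id),
                           SolutionId INTEGER REFERENCES Solutions(Id));
""")
conn.execute("INSERT INTO Changes VALUES (1, 'column', 'rename', 'SurName', 'LastName')")
conn.execute("INSERT INTO Changes VALUES (2, 'table', 'delete', 'Clients', NULL)")
conn.execute("INSERT INTO Solutions VALUES (1, 42, 'propagate')")
conn.execute("INSERT INTO Resolved VALUES (1, 1)")    # change 1 is handled

def unresolved_changes(conn):
    """Ids of changes with no Resolved entry, i.e. still awaiting handling."""
    rows = conn.execute("""
        SELECT c.Id FROM Changes c
        LEFT JOIN Resolved r ON r.ChangeId = c.Id
        WHERE r.ChangeId IS NULL
    """)
    return [row[0] for row in rows]
```

Here only change 2 (the deleted table) remains unresolved and would be picked up by the reparation algorithms.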

[Figure 5 shows the following tables: Projects(Id, Name, Description, LastVersion); Changes(Id, ElementId, ElementType, ChangeType, Date, Processed, OldVars, NewVars, ChangeCommand, Error, ProjectId); ChangeTypes(Type, Name); ElementTypes(Type, Name); EtlErrors(Id, Description, Processed, Code, Message, Date); Resolved(ChangeId, SolutionId); Solutions(Id, UsedRuleId, RuleParameters, Version).]

Figure 5: Metaschema - detected changes

The fourth part (rule definitions) contains metadata concerning the rules for handling detected changes. This part is composed of 6 tables (including the Projects table), shown in Figure 6. This part of the metaschema stores both default rules for the whole ETL process and rules for particular parts of the process. The RuleTypes table is a dictionary of rule types. Every rule type can have its own set of attributes. These attributes are stored in the RuleTypeAttributes table. Instances of rule types that were created in a particular project are stored in the Rules table. Attributes for these rule instances are stored in the RuleAttributes table.

[Figure 6 shows the following tables: Projects(Id, Name, Description, LastVersion); RuleTypes(Type, Name); RuleTypeAttributes(Id, RuleType, Name, TextValue, IntValue, RealValue, BoolValue); Rules(Id, Active, Name, Description, Type, ProjectId); RuleAttributes(Id, RuleId, Name, TextValue, IntValue, RealValue, BoolValue, ColumnId); Solutions(Id, UsedRuleId, RuleParameters).]

Figure 6: Metaschema - rule definitions

4. RELATED WORK

The research and technological developments in the area of handling structural changes of EDSs in the DW architecture have mainly focused on managing changes in a DW. In this field, the five following approaches can be distinguished: (1) materialized view adaptation, (2) schema and data evolution, (3) temporal schema and data extensions, (4) partial versioning of schema and data, and (5) the Multiversion Data Warehouse approach. Since they are not directly related to the topic of this paper, they will not be described here. An overview of these approaches can be found in [36].

As mentioned in Section 1, structural changes in EDSs have a strong impact on the ETL layer. Research development in the ETL area has so far focused mainly on: (1) data cleaning techniques [4, 5, 13, 14] and assuring a high quality of data [6, 7, 15], (2) designing and modeling ETL development environments and architectures [1, 9, 19, 23, 20, 28, 29, 34, 35, 33, 24], (3) optimizing ETL executions [2, 18, 21, 22, 30, 27, 32], and (4) designing ETL processes for real-time/near-real-time/active DWs [26, 31].

Detecting structural changes in EDSs and propagating them into the ETL layer has received less attention from the research community. One of the first solutions to this problem was the Evolvable View Environment (EVE) presented in [17]. EVE is an environment that allows the evolution of an ETL process implemented by means of views. For every view it is possible to specify which elements of the view may change. It is possible to determine whether a particular attribute, both in the select and where clauses, can be omitted or replaced by another attribute. Another possibility is that, for every table referred to by a given view, a user can define whether this table can be omitted or replaced by another table.

Recent developments in the field of evolving ETL processes include a framework called Hecataeus [10, 11, 12]. In Hecataeus, all ETL activities and EDSs are modeled as a graph whose nodes are relations, attributes, queries, conditions, views, functions, and ETL steps. Nodes are connected with edges that represent relationships between different nodes. The graph is annotated with rules that define the behavior of the graph in response to a certain EDS change event. In response to an event, Hecataeus can either propagate the event, i.e., modify the graph according to a predefined policy or prompt an administrator, or block the event propagation.

E-ETL versus Hecataeus. The E-ETL framework, presented in this paper, is related to Hecataeus. However, E-ETL differs from Hecataeus in the following respects:

• E-ETL detects structural changes in EDSs either by means of schema triggers (if such triggers are available and allowed to be installed in EDSs) or by comparing two consecutive snapshots of EDS metadata (no information is provided on how Hecataeus detects structural changes);

• E-ETL can be connected to any ETL engine and development environment that offers an API, whereas Hecataeus needs a specific ETL engine that models ETL tasks by means of graphs;

• E-ETL supports ETL workflows built of several complex operations (e.g., the operation of removing duplicates, which may be available only in the external ETL tool), whereas Hecataeus works with ETL workflows developed as sequences of SQL queries;

• E-ETL can work with different types of EDSs (i.e., databases, XML files, spreadsheets, record files), whereas Hecataeus supports only databases as EDSs.

5. SUMMARY

In this paper we discussed the E-ETL framework for handling structural changes in EDSs and for propagating the changes to the ETL layer. In the framework, predefined structural changes are automatically detected and reported to the E-ETL Evolution Manager, which repairs an ETL workflow. This reparation is guided by the set of actions executed in response to certain EDS changes. Currently we are implementing the presented framework. We are also preparing tests in an environment including structural changes that appeared in real production DW systems, outlined in Section 1. Furthermore, we focus on developing a language for defining the structural changes that are to be detected and propagated, and the repairing algorithms. The E-ETL API is currently under development for accessing the Microsoft ETL engine, i.e., SQL Server Integration Services.

As stressed in [3, 16], even ordinary content (data) changes of an EDS may cause structural changes in a DW or changes to the structure of dimension data in a DW. Neither Hecataeus nor E-ETL supports handling such content changes appropriately. In the future, we will work on handling such kinds of content changes at the ETL layer and on correctly propagating them into a DW.

6. REFERENCES
[1] Zineb El Akkaoui and Esteban Zimanyi. Defining ETL workflows using BPMN and BPEL. In DOLAP '09: Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, pages 41–48, New York, NY, USA, 2009. ACM.

[2] J. Andzic, V. Fiore, and L. Sisto. Extraction, transformation, and loading processes. In R. Wrembel and C. Koncilia, editors, Data Warehouses and OLAP: Concepts, Architectures and Solutions, pages 88–110. Idea Group Inc., 2007. ISBN 1-59904-364-5.

[3] Johann Eder, Christian Koncilia, and Tadeusz Morzy. The COMET metamodel for temporal data warehouses. In Proceedings of the 14th International Conference on Advanced Information Systems Engineering, CAiSE '02, pages 83–99, London, UK, 2002. Springer-Verlag.

[4] Helena Galhardas, Daniela Florescu, Dennis Shasha, and Eric Simon. AJAX: An extensible data cleaning tool. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, page 590, New York, NY, USA, 2000. ACM.

[5] Helena Galhardas, Daniela Florescu, Dennis Shasha, and Eric Simon. An extensible framework for data cleaning. In ICDE '00: Proceedings of the 16th International Conference on Data Engineering. IEEE Computer Society, 2000.

[6] Matthias Jarke, Manfred Jeusfeld, Christoph Quix, and Panos Vassiliadis. Architecture and quality in data warehouses. In Barbara Pernici and Costantino Thanos, editors, Advanced Information Systems Engineering, volume 1413 of Lecture Notes in Computer Science, pages 93–113. Springer Berlin/Heidelberg, 1998. doi:10.1007/BFb0054221.

[7] Matthias Jarke, Christoph Quix, Guido Blees, Dirk Lehmann, Gunter Michalk, and Stefan Stierl. Improving OLTP data quality using data warehouse mechanisms. ACM SIGMOD Record, 28(2):536–537, 1999.

[8] Hyun J. Moon, Carlo A. Curino, Alin Deutsch, Chien-Yi Hou, and Carlo Zaniolo. Managing and querying transaction-time databases under schema evolution. Proc. VLDB Endow., 1:882–895, 2008.

[9] Lilia Munoz, Jose-Norberto Mazon, and Juan Trujillo. Automatic generation of ETL processes from conceptual models. In Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, DOLAP '09, page 33, New York, NY, USA, 2009. ACM Press.

[10] George Papastefanatos, Panos Vassiliadis, Alkis Simitsis, T. Sellis, and Y. Vassiliou. Rule-based management of schema changes at ETL sources. In Advances in Databases and Information Systems: Associated Workshops and Doctoral Consortium of the 13th East European Conference, ADBIS 2009, page 55. Springer, 2010.

[11] George Papastefanatos, Panos Vassiliadis, Alkis Simitsis, and Yannis Vassiliou. What-if analysis for data warehouse evolution. In Proc. of the Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK '07), pages 23–33. Springer, 2007.

[12] George Papastefanatos, Panos Vassiliadis, Alkis Simitsis, and Yannis Vassiliou. Policy-regulated management of ETL evolution. J. Data Semantics, pages 147–177, 2009.

[13] Erhard Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4):3, 2000.

[14] Vijayshankar Raman and J. M. Hellerstein. Potter's Wheel: An interactive data cleaning system. In Proceedings of the International Conference on Very Large Data Bases, pages 381–390. Citeseer, 2001.

[15] Jasna Rodic and Mirta Baranovic. Generating data quality rules and integration into ETL process. In Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, DOLAP '09, page 65, New York, NY, USA, 2009. ACM Press.

[16] E. Rundensteiner, A. Koeller, and X. Zhang. Maintaining data warehouses over changing information sources. Communications of the ACM, 43(6):57–62, 2000.

[17] E. A. Rundensteiner, A. Koeller, X. Zhang, A. J. Lee, A. Nica, A. Van Wyk, and Y. Lee. Evolvable View Environment (EVE): Non-equivalent view maintenance under schema changes. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD '99, pages 553–555, New York, NY, USA, 1999. ACM Press.

[18] Timos Sellis and Alkis Simitsis. ETL workflows: From formal specification to optimization. In Yannis Ioannidis, Boris Novikov, and Boris Rachev, editors, Advances in Databases and Information Systems, volume 4690 of Lecture Notes in Computer Science, pages 1–11. Springer Berlin/Heidelberg, 2007.


[19] A. Simitsis, P. Vassiliadis, S. Skiadopoulos, and T. Sellis. Data warehouse refreshment. In R. Wrembel and C. Koncilia, editors, Data Warehouses and OLAP: Concepts, Architectures and Solutions, pages 111–134. Idea Group Inc., 2007. ISBN 1-59904-364-5.

[20] Alkis Simitsis, Dimitrios Skoutas, and Malu Castellanos. Natural language reporting for ETL processes. In Proceedings of the ACM Eleventh International Workshop on Data Warehousing and OLAP, DOLAP '08, page 65, 2008.

[21] Alkis Simitsis, Panos Vassiliadis, and T. Sellis. Optimizing ETL processes in data warehouses. In 21st International Conference on Data Engineering, ICDE '05, pages 564–575. IEEE, 2005.

[22] Alkis Simitsis, Panos Vassiliadis, and Timos Sellis. State-space optimization of ETL workflows. IEEE Transactions on Knowledge and Data Engineering, 17(10):1404–1419, 2005.

[23] Alkis Simitsis, Panos Vassiliadis, Manolis Terrovitis, and Spiros Skiadopoulos. Graph-based modeling of ETL activities with multi-level transformations and updates. Volume 3589 of Lecture Notes in Computer Science, pages 43–52. Springer-Verlag, Berlin/Heidelberg, 2005.

[24] Alkis Simitsis, Kevin Wilkinson, Malu Castellanos, and Umeshwar Dayal. QoX-driven ETL design: Reducing the cost of ETL consulting engagements. In Proceedings of the 35th SIGMOD International Conference on Management of Data, SIGMOD '09, pages 953–960, New York, NY, USA, 2009. ACM.

[25] D. Sjøberg. Quantifying schema evolution. Information and Software Technology, 35(1):35–54, 1993.

[26] Maik Thiele, Ulrike Fischer, and Wolfgang Lehner. Partition-based workload scheduling in living data warehouse environments. In Proceedings of the ACM Tenth International Workshop on Data Warehousing and OLAP, DOLAP '07, pages 57–64, New York, NY, USA, 2007. ACM.

[27] Maik Thiele, Tim Kiefer, and Wolfgang Lehner. Cardinality estimation in ETL processes. In Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, DOLAP '09, page 57, New York, NY, USA, 2009. ACM Press.

[28] Christian Thomsen and Torben Bach Pedersen. pygrametl: A powerful programming framework for extract-transform-load programmers. In DOLAP, pages 49–56, 2009.

[29] Juan Trujillo and Sergio Lujan-Mora. A UML based approach for modeling ETL processes in data warehouses. In Conceptual Modeling – ER 2003, pages 307–320, 2003.

[30] Vasiliki Tziovara, Panos Vassiliadis, and Alkis Simitsis. Deciding the physical implementation of ETL workflows. In Proceedings of the ACM Tenth International Workshop on Data Warehousing and OLAP, DOLAP '07, page 49, New York, NY, USA, 2007. ACM Press.

[31] Panos Vassiliadis and Alkis Simitsis. Near real time ETL. In S. Kozielski and R. Wrembel, editors, New Trends in Data Warehousing and Data Analysis, Annals of Information Systems, pages 19–49. Springer, 2008.

[32] Panos Vassiliadis, Alkis Simitsis, and Eftychia Baikousi. A taxonomy of ETL activities. In Proceedings of the ACM Twelfth International Workshop on Data Warehousing and OLAP, DOLAP '09, page 25, New York, NY, USA, 2009. ACM Press.

[33] Panos Vassiliadis, Alkis Simitsis, Panos Georgantas, Manolis Terrovitis, and Spiros Skiadopoulos. A generic and customizable framework for the design of ETL scenarios. Inf. Syst., 30:492–525, November 2005.

[34] Panos Vassiliadis, Alkis Simitsis, and Spiros Skiadopoulos. Conceptual modeling for ETL processes. In Proceedings of the 5th ACM International Workshop on Data Warehousing and OLAP, DOLAP '02, pages 14–21, 2002.

[35] Panos Vassiliadis, Alkis Simitsis, and Spiros Skiadopoulos. Modeling ETL activities as graphs. In Proc. 4th Intl. Workshop on Design and Management of Data Warehouses (DMDW), pages 52–61. CEUR-WS.org, 2002.

[36] R. Wrembel. On handling the evolution of external data sources in a data warehouse architecture. In D. Taniar and L. Chen, editors, Data Mining and Database Technologies: Innovative Approaches. IGI Group, 2011. ISBN-13: 9781609605377.

[37] Robert Wrembel and Bartosz Bębel. The framework for detecting and propagating changes from data sources structure into a data warehouse. Foundations of Computing & Decision Sciences, 30(4):361–372, 2005.
