
    Data Warehouse Best Practices

    White Paper

    April 9, 2009

    Copyright 2009 Intrasphere. Confidential.

    Table of Contents

    Introduction

    Planning
        Readiness Assessment
            Corporate Sponsorship
            Appropriate Budget, Time and Resource Commitments
            User Demand
            IT Readiness
        Project Planning
            Realistic Timelines and Scope
            Phased Approach
            Communication Channels and Issue Tracking
            Data Governance Committee

    Analysis
        Data Analysis
            Data Profiling
        System Analysis
            Capacity Planning
            Tool Assessment

    Design
        Data Modeling
            Dimensional versus Relational
            Data Mapping
            Data Marts
        System Design
            Modular Design
            A/B Data Environments

    Development
        Development Environment
            Multiple Dev Databases
            Architecture Team
            Golden Copy
            Full Sets of Source Data

    Test
        Test Environment
            Concurrent Testing
            Regression Testing
            Automated Tools

    Deployment
        Production Build
            Deployment Checklist
            Initial System Burn-in

    Summary

    About Intrasphere

    For More Information


    Introduction

    Data warehouses are large, expensive systems. They typically take years to fully implement, require the
    efforts of large teams, and often fail to deliver on the initial promise due to technical, procedural and other
    reasons. However, successful data warehouses that result in true business value and competitive
    advantage can be built if the correct approach and best practices are followed.

    This white paper serves as a practical best practices guide for Data Warehouse initiatives. It draws
    upon the authors' years of experience designing and building Data Warehouses in the Pharmaceutical and
    Life Sciences industry.

    This document is organized around the end-to-end process that includes the following key phases:

    Planning

    Analysis

    Design

    Development

    Test

    Deployment


    Planning

    Data warehouses require extensive planning in order to achieve success. The following are some best
    practice activities that help ensure the appropriate level of planning is done before the system is built.

    Readiness Assessment

    In the planning phase, it is important to honestly assess the readiness of the organization for the

    implementation of a data warehouse. A data warehouse readiness assessment is important to identify

    areas of potential failure. There are several published methods for executing this assessment, such as

    those found in Ralph Kimball's The Data Warehouse Lifecycle Toolkit. The important factors to consider

    usually cover the following basic areas:

    Corporate Sponsorship

    Appropriate Budget, Time and Resource Commitments

    User Demand

    IT Readiness

    Corporate Sponsorship

    Corporate Sponsorship is a key factor to consider in assessing the readiness of the organization.

    Successful data warehouses have strong senior sponsorship from the leadership of the organization.

    Ideally, the senior sponsor will be a respected visionary with the clout to influence budgets and convince

    others of the importance of the data warehouse. Strong sponsorship will help:

    Get corporate buy-in, generating acceptance of the data warehouse within the organization

    Allocate resources and budget, ensuring the data warehouse has the funding and support it

    needs to be built

    Assist with removing any roadblocks that may come up during the building and deployment of the

    data warehouse

    Bridge departments to garner cooperation across departmental lines

    Create and share a vision or mission statement that will convince the company as a whole of the

    importance of the data warehouse

    If the senior sponsor is not fully committed to the expense, time, and effort of the data warehouse, it may

    be difficult to get others to fully support the effort, especially if timelines or budgets run over. While it is

    possible to build a data warehouse without strong sponsorship, it is much more difficult and risky. This


    step in the readiness assessment serves to identify or recruit the person(s) that will act as the sponsor(s),

    and to gauge their commitment to championing the data warehouse effort.

    Appropriate Budget, Time and Resource Commitments

    Data warehouses require large investments in time, resources and money. They are usually implemented
    via multi-year projects, and often have large teams both building and supporting them. A key to the

    success of building a data warehouse is to make sure that the budget, time and resource needs are met.

    It is critical to set realistic expectations and to gain commitments before starting. It is not atypical for a

    data warehousing project to cost 50% to 100% more than originally estimated, or to take twice as long to

    complete. In addition to initial commitments on budgets, time and resources, it is wise to set aside

    contingency amounts.

    Many times, it is easier to break up the effort into several phases, and procure the budget and resource

    commitments based on smaller efforts. Care should be taken to ensure that the areas of the data
    warehouse that will produce the highest ROI receive the highest priority. This will help to prove the value

    of the data warehouse and to acquire future funds and resources. This step in the assessment is to
    understand how difficult it will be to raise the funds and get resources committed to the effort of
    building the data warehouse.

    User Demand

    A data warehouse needs to meet the demands of its users, or it will not be adopted by the user community;
    the proposed value of the data warehouse will then not be realized and the project will be deemed a failure.

    It is very important to make sure that the user community is open to changing its business operations to

    include using the data warehouse. The key here is to get the user community eager to be involved and
    excited about the potential of the data warehouse. Not only will the data warehouse be built to better answer

    the types of questions the users want to ask, but it will be better accepted by the user community when

    deployed. This step in the assessment is to interview a few users and ensure that there is a need, and

    that if you build it, they will come.

    IT Readiness

    A data warehouse will usually be built and supported by a company's Information Technology

    department. It is critical to evaluate the technical abilities of the IT department in hosting the data

    warehouse. Several factors should be investigated to assess whether or not the IT department is ready to

    support the effort:

    Ability to acquire, deploy and host the necessary hardware

    Ability to acquire the software licenses necessary

    Necessary resources and budgets to host the system


    Technical skills and experience with the hardware and software platforms chosen to implement

    the data warehouse (database, ETL, business intelligence tools, etc.)

    Technical skills to restart the system in case of failure

    Technical skills to back up and restore the system as needed

    Technical skills to rebuild the system in case of disaster

    If the IT department is already supporting data warehouses, it is a good idea to plan to use similar

    platforms if possible. This assessment step is critical in identifying any gaps in the support the system will

    need once built. These gaps, if any, should be addressed in the initial data warehouse project.

    Project Planning

    There are several activities conducted during the project planning phase of the data warehouse project
    that can help ensure success. While most are typical in any project, the size and complexity of a data

    warehouse project can make them especially critical. Creating a realistic project timeline, ensuring clear

    communication channels, setting up rigorous scope and change controls, issue tracking and escalation,

    and frequent status checks are all important in any project, but critical to the success of a data

    warehouse. In addition, a data governance board should be established to quickly resolve any data

    issues found, or escalate them to proper resolution as fast as possible.

    Realistic Timelines and Scope

    It is highly recommended that the project timeline be driven by a bottom-up approach. The scope for the

    initial release (in a multi-phased approach) should be clearly identified. Each task needed to accomplish

    the scoped result should be assigned an estimated effort, named resources (if possible), and identified
    constraints. It is very risky to time-box a data warehouse project to meet a specific deadline. Usually,

    attempting delivery with a top-down timeline will result in a very limited project scope.

    Phased Approach

    A phased approach is highly recommended. Data warehouses are easily chunked into work efforts

    based on either sets of data source systems or sets of data marts addressing specific related business

    function needs. A well designed data warehouse needs to be able to add new source systems and new

    business intelligence tools throughout its lifetime. This property lends itself to phased development and

    deployment. Phased efforts usually take longer and can cost more in the long run than all-or-nothing
    efforts, but are inherently much less risky, and users can gain access to and use the system much earlier
    (which means the return on the investment is realized earlier as well.)

    Another major benefit of a phased approach is that it is more flexible to users' needs. Users typically

    develop new ideas and requirements once they start accessing a data warehouse. A phased approach

    can respond quickly and increase the value of the system to these users.


    Finally, a phased approach allows teams to include fixes for bugs or data issues that were not
    caught during testing. This will increase the quality of the data warehouse, and increase its value to

    the users.

    Communication Channels and Issue Tracking

    A clear organization chart of the project and clear, articulated communication channels should be

    established to ensure that issues are raised and dealt with in a timely manner. Data warehouses typically

    uncover many unexpected problems, and problem resolution is key to ensuring the project runs smoothly

    and is not excessively delayed by unresolved issues. Issues should be tracked, and the status of the

    issues should be reviewed frequently to ensure progress is being made. Utilizing an issue tracking tool

    can assist in making sure issues are visible to the entire team, but should not be the sole channel for

    communicating issues. When issues are raised, they should be acknowledged before being assumed to

    be assigned.

    A weekly status meeting is effective for tracking major issues. However, due to the number of teams and

    the complexity of the data warehousing development and testing effort, it is recommended that each team
    have frequent team meetings in addition to project status meetings.

    Test team members should also be paired with development team members to report bugs directly. This

    will usually allow for better communication between tester and developer, and will help speed the overall

    resolution of issues found during testing.

    Data Governance Committee

    Data warehouse projects uncover many hidden data issues in operational databases. Due to the high
    number of data issues found during initial development of a data warehouse, it is highly recommended
    that a Data Governance Committee be assembled. The role of the committee will be to:

    Resolve or seek resolution on data issues

    Publish a set of master data by identifying official sources for data lists (such as product lists)

    Identify methods for fixing data errors

    Describe the definitions of data elements across multiple source systems

    Enforce data standards

    The Data Governance Committee should be staffed by knowledgeable data stewards from the business

    users, source systems and the data warehouse. Its members may also perform some tasks in the data

    warehouse development effort.


    Analysis

    The analysis phase of a data warehousing project typically takes longer than that of other system development

    projects. This is because of the effort required to profile and analyze the data from multiple source

    systems. There are several activities that are best practices during this phase. They include:

    Data profiling

    Capacity planning

    Tool assessment

    Data Analysis

    Data Profiling

    Data profiling is a critical component to building a data warehouse. It consists of investigating the raw

    data in the source systems, looking for data patterns, limits, string lengths, distinct values, constraint

    violations, etc. It is the first pass at identifying potential data issues, and is also the analysis step that will

    generate requirements for the designs of the data models and the data mappings. This step is typically

    done by a data analyst using SQL editors or data profiling tools.
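    As an illustration, the queries below sketch the kinds of checks a data analyst might run with a SQL editor
    against a hypothetical source table (src_customer and its columns are assumptions, not from any
    particular system):

        -- Row count and distinct values for a candidate key
        SELECT COUNT(*) AS total_rows,
               COUNT(DISTINCT customer_id) AS distinct_ids
        FROM src_customer;

        -- String lengths and null rate for a text column
        SELECT MIN(LENGTH(customer_name)) AS min_len,
               MAX(LENGTH(customer_name)) AS max_len,
               SUM(CASE WHEN customer_name IS NULL THEN 1 ELSE 0 END) AS null_count
        FROM src_customer;

        -- Frequency distribution to spot unexpected or invalid codes
        SELECT status_code, COUNT(*) AS occurrences
        FROM src_customer
        GROUP BY status_code
        ORDER BY occurrences DESC;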

    Most of the time this step is done in parallel with the data modeling and data mapping efforts, usually by the
    same team. This approach allows the team to focus on specific areas: modeling, mapping, and assessing the

    data quality all at once. Even though the data modeling and data mapping activities reside in the Design

    Phase, the reality is that these three activities lend themselves to iterative execution very well, and are

    usually done together. In fact, it is not uncommon for these iterative activities to continue into the

    development phase, since data issues and anomalies are sometimes not encountered until then.

    System Analysis

    Capacity Planning

    A capacity plan is a critical component for any data warehouse. It is the guide to growing the system over

    time. Capacity plans for data warehouses consider the following:

    Initial data storage size requirements of the data warehouse

    Incremental data growth due to ETL migrations

    Number of users (usually identified as named users, active users and concurrent users)

    Estimated processing requirements based on concurrent user queries and other processes

    ETL schedule requirements


    User access time (up-time) requirements

    Archiving and Partitioning

    The inputs into the initial capacity plan should be gathered during the analysis phase, since data profiling

    can give a clear understanding of data size and growth expectations. These will be used to ensure

    appropriate data storage capacity is available, and that the ETL servers can process the required data

    within the ETL batch window.

    A capacity plan should be a living document, periodically measuring the initial growth and user-pattern
    estimates against actual values. This will ensure that performance and functionality are not limited by
    data warehouse growth.
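    As one way to keep the plan honest, actual storage consumption can be sampled periodically and
    compared against the plan's estimates. The query below is a minimal sketch assuming an Oracle platform
    and hypothetical schema names:

        -- Measure actual space used per warehouse schema so growth can be
        -- tracked against the capacity plan's estimates
        SELECT owner,
               ROUND(SUM(bytes) / 1024 / 1024 / 1024, 2) AS size_gb
        FROM dba_segments
        WHERE owner IN ('DW_STAGE', 'DW_MART')   -- hypothetical schemas
        GROUP BY owner
        ORDER BY size_gb DESC;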

    Tool Assessment

    Data warehouses require a number of specialized software tools. The simplest data warehouses will
    have database software, ETL software, and some business intelligence or reporting software. During the

    analysis phase, it is critical to identify the tools to be used to implement the data warehouse, especially if

    they will require the assigned development resources to be trained in their use. It is essential to choose

    the tools that will not only fit the current vision, but the future expectations of the data warehouse. Scaling,

    support, and product upgrades should be considered. Time should be set aside for vendor demos and
    bake-offs, and vendors should be given time to build demos based on specific requirements, as this can
    help identify the fit of the tools.

    The foundation software components of a data warehouse are typically very expensive, and require a

    good deal of technical experience and competence to ensure a successful implementation. This causes

    most companies to standardize on specific tool vendors. If that is the case, a tool assessment is still
    recommended to ensure that gaps in functionality are identified so that alternate solutions can be

    designed.


    Design

    The design phase of a data warehouse is typically longer than that of other system projects. This is due to
    several factors, including the iterative nature of data modeling and data mapping, the design of complex
    technical infrastructure, and the interactions with several source systems. This section identifies
    some key best practice considerations for the design phase of a data warehouse.

    Data Modeling

    The most critical task for any data warehouse is the data model. Deciding the appropriate model is key to

    the success and performance of the data warehouse, as well as to the types and diversity of queries

    supported. There are also several best practices based on or related to the data model. Data mapping

    describes the movement of data from the source systems into the target data models. Data marts are

    views or sub-components of the data warehouse built to support a specific business process or group of

    related processes for a specific business functional group. A/B data switches are a mechanism used to

    maintain separation between ETL and end users, allowing both access to the data concurrently.

    Dimensional versus Relational

    There are two approaches to data modeling for a data warehouse, each with strengths in various

    scenarios. They are the Dimensional and the Relational models. Dimensional models (also called star

    schemas) are especially effective for data that is aggregated at different levels, and for building cube

    databases for drilling on various data attributes. Dimensional models retrieve large amounts of data more

    efficiently and faster than Relational models. Relational models are effective for pinpointing individual

    records quickly. Most data warehouses use dimensional models for the data marts (where users interact

    with the data.) The back office staging areas usually contain both dimensional and relational tables, each

    depending on the needs of the ETL and other back office processing.
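    As a minimal sketch of a dimensional model, the star schema below shows one fact table surrounded by
    dimension tables (all table and column names here are hypothetical):

        -- Dimension tables hold descriptive attributes for drilling and grouping
        CREATE TABLE dim_product (
            product_key    INTEGER PRIMARY KEY,  -- surrogate key assigned in the warehouse
            product_code   VARCHAR(20),          -- natural key carried from the source
            product_name   VARCHAR(100)
        );

        CREATE TABLE dim_date (
            date_key       INTEGER PRIMARY KEY,  -- e.g. 20090409
            calendar_date  DATE,
            fiscal_quarter VARCHAR(6)
        );

        -- The fact table holds the measures, keyed by its dimensions
        CREATE TABLE fact_sales (
            product_key    INTEGER REFERENCES dim_product (product_key),
            date_key       INTEGER REFERENCES dim_date (date_key),
            units_sold     INTEGER,
            sales_amount   DECIMAL(12, 2)
        );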

    The data modeling effort in a data warehouse takes roughly four times as long as that in an operational or
    transactional system. It is an iterative effort, and becomes increasingly (some say exponentially) more
    complicated with each additional source system conformed. It is critical to have an experienced data
    warehouse data modeler work with the project team and data governance board to accomplish this critical task.

    Data Mapping

    Data mapping is an essential part in designing the ETL of the data warehouse. It identifies each source

    data element, any transformations or processing routines applied, and the target data element into which

    the data is loaded. Two main tips for data mapping are to use a single spreadsheet worksheet for each

    table, and map target to source. When mapping target to source, it is easier to ensure that all required

    data fields in the target data model are addressed. Mapping source to target can include many

    unnecessary data elements (in the source systems) that will confuse and distract.

    The data mapping is also used by the development team for implementing the ETL. All transformations,

    data quality checks, and cross references must be included in the data mapping document. It is also not


    uncommon for the final revisions of the data mapping to be made during or shortly after the ETL

    development, to reflect any modifications made during this phase.
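    When the ETL is implemented, each target table's mapping worksheet translates naturally into a load
    statement. The sketch below reuses the hypothetical dim_product table and adds a hypothetical source
    table src_item; the transformation comments are exactly the kind of detail the mapping document must
    capture (an Oracle-style sequence is assumed for the surrogate key):

        -- Target-to-source mapping for dim_product (illustrative only)
        INSERT INTO dim_product (product_key, product_code, product_name)
        SELECT product_seq.NEXTVAL,   -- surrogate key generated in the warehouse
               TRIM(s.item_cd),       -- transformation: strip padding from the source code
               UPPER(s.item_desc)     -- transformation: standardize case
        FROM src_item s;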

    Data Marts

    Data marts are the views or tables of data that the users interact with. Data marts are tailored to answer
    specific sets of user queries in order to ensure the best possible performance. This means that data

    marts are typically limited to a single business process or a group of related business processes for a

    single business functional group. This allows some variations on data modeling approaches and physical

    implementation approaches for each data mart with performance tuning in mind. As mentioned above,

    data marts can be either views or tables, or even files, depending on query performance requirements.

    Often, data marts are materialized views that are refreshed with each ETL batch execution. However,

    some data marts are frozen points in time (for example, quarterly data) in order to ensure best possible

    performance. The bottom line is that a data mart is an individually tailored set of data to best serve the

    users accessing it for the types of questions they need answered. A good data warehouse will have many

    data marts specific to identified needs, rather than just piling all the data together to let the users do what
    they want.
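    For example, a data mart can be built as a materialized view that is refreshed at the end of each ETL
    batch. This is a minimal sketch assuming Oracle syntax and the hypothetical star schema above:

        -- A data mart serving one business function, refreshed on demand by the ETL
        CREATE MATERIALIZED VIEW mart_quarterly_sales
        REFRESH COMPLETE ON DEMAND
        AS
        SELECT d.fiscal_quarter,
               p.product_name,
               SUM(f.sales_amount) AS total_sales
        FROM fact_sales f
        JOIN dim_date d    ON d.date_key = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.fiscal_quarter, p.product_name;

        -- Called by the ETL at the end of each batch (from SQL*Plus or a script):
        -- EXEC DBMS_MVIEW.REFRESH('MART_QUARTERLY_SALES');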

    System Design

    Modular Design

    One critical design best practice in data warehousing is to ensure that the design of all the system

    components is as modular as possible. Data warehouses are complicated systems that interact with

    many other transactional and operational systems. Over the lifespan of a typical data warehouse, source

    systems will be retired and replaced by new sources. In addition, new data marts will be required to ask

    and answer new business questions. New business intelligence tools will be leveraged to analyze and

    model the results. This means that several times during the lifespan of a data warehouse, certain parts of
    the warehouse will be redesigned, replaced, or retired. In order to minimize the impact to the system

    during these changes, a modular design is critical.

    The design should balance the desire for reuse with the expectation of replacement. For example,

    designs should abstract the ETL staging area to allow for a data source to be replaced and minimally

    impact the reports leveraging the data. There are several data modeling techniques that can help
    minimize the impact of change, such as slowly changing dimensions (a sketch follows the list below.)
    Other heuristics include:

    ETL should be done in several legs (acquire from source, process and transform data, integrate

    with other data, load into data marts)

    Data marts should be used to group specific related business functional areas

    Data models should support changes with minimum impact

    One business function per code module
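    The sketch below illustrates one such technique, a Type 2 slowly changing dimension, which preserves
    history by closing the current row and inserting a new one when a tracked attribute changes. It assumes
    the hypothetical dim_product and src_item tables used earlier, extended with effective_date, end_date
    and is_current columns:

        -- Step 1: expire the current dimension row when the tracked attribute changed
        UPDATE dim_product d
           SET end_date = CURRENT_DATE,
               is_current = 'N'
         WHERE is_current = 'Y'
           AND EXISTS (SELECT 1
                         FROM src_item s
                        WHERE TRIM(s.item_cd) = d.product_code
                          AND UPPER(s.item_desc) <> d.product_name);

        -- Step 2: insert a fresh current row for each product expired in step 1
        INSERT INTO dim_product (product_key, product_code, product_name,
                                 effective_date, end_date, is_current)
        SELECT product_seq.NEXTVAL, TRIM(s.item_cd), UPPER(s.item_desc),
               CURRENT_DATE, DATE '9999-12-31', 'Y'
          FROM src_item s
         WHERE EXISTS (SELECT 1
                         FROM dim_product d
                        WHERE d.product_code = TRIM(s.item_cd)
                          AND d.end_date = CURRENT_DATE
                          AND d.is_current = 'N');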


    A/B Data Environments

    There are a few strategies for minimizing the impact of the ETL on end users. Typically, the ETL

    processes run during off hours, and users are either discouraged from or even denied access to the data

    during these times. An A/B switch is a set of identical tables, one available for user access while the other

    is being updated by the ETL. This can be an especially effective strategy for ETL processes that run more
    than once a day. The concept can be implemented in a number of ways, but the idea is fairly simple. The

    ETL updates one set of tables while the users access the other. Then, when the ETL is finished, they

    switch (usually by dropping and recreating synonyms.) This strategy also has the benefit of allowing the

    users to continue to have access to the data in the event of an ETL failure.
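    A minimal version of the switch, assuming Oracle-style synonyms and hypothetical object names, looks
    like this:

        -- Users always query through the synonym; the ETL loads the offline copy.
        -- Suppose users currently read SALES_A while the ETL reloads SALES_B;
        -- when the load completes, repoint the synonym:
        CREATE OR REPLACE SYNONYM sales FOR dw.sales_b;

        -- On the next cycle the ETL reloads SALES_A, then the synonym switches back:
        -- CREATE OR REPLACE SYNONYM sales FOR dw.sales_a;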


    Development

    The development phase of a data warehouse typically involves multiple sub-teams building various

    components of the system. There may be one or more ETL teams, a database team, and one or more BI

    system development teams (for the end user reports and query tools.) Careful planning is required to
    prevent these resources from stepping on each other's toes during component development. Typically,

    data warehouses will have multiple development environments to allow each team to focus on its

    components, without negatively impacting other teams. This adds complexity, however, when the

    underlying architecture components change, since these changes must be tracked and published to the

    many development environments. These complications can be minimized by maintaining clear

    communication channels and architecture oversight, usually in the form of the architecture team.

    Development Environment

    Multiple Dev Databases

    As mentioned above, having several development environments will prevent teams from impacting each

    other. An example would be the ETL team's component tests changing the data while the report

    development team is writing SQL queries. The goal of this best practice activity is to give each team a

    development environment that will minimize the impact of such situations. Depending on the design

    structure of the data models, this can be a single database schema within a development instance, or

    may be a copy of the entire database itself in separate database instances. The design of the

    development environments will depend on balancing available hardware, software, and support resources

    with the needs of the development teams.
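    On a shared development instance, the simplest form of this is one schema per team. The statements
    below are a sketch assuming Oracle syntax and hypothetical account names:

        -- One schema per team within a shared development instance
        CREATE USER etl_dev IDENTIFIED BY etl_dev_pwd;        -- hypothetical credentials
        CREATE USER report_dev IDENTIFIED BY report_dev_pwd;
        GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO etl_dev;
        GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO report_dev;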

    Architecture Team

    An architecture team is required to mitigate the potential dangers of variations in the architecture and data

    models of multiple development environments. The role of the architecture team is to:

    Control changes to the underlying architecture of the data warehouse

    Brainstorm to resolve architectural issues

    Represent each development team with regard to impacts

    Communicate across development teams

    Plan inter-team activities

    Changes to the logical and physical database structures are especially important, since they will impact

    all teams. The architecture team should consist of members of each development team to review and

    assess the impact of proposed modifications with regard to the team they are representing. They also

    represent the communication channel back to the team.


    In addition to controlling changes, the architecture team should be a conduit for each development team

    to communicate to the other teams. Sharing of issues and resolutions, enhancement ideas, status, and

    coordination of cross team activities (such as the periodic refresh of the dev environments) should be part

    of the discussions of the architecture team.

    Golden Copy

    One of the key activities in minimizing the potential impact of multiple development environments is the

    periodic refresh of those environments with the latest version of the official development environment

    footprint. This official version is often called the golden copy. It is also the version of the development

    environment that will be promoted to the test environments when the time comes to do so. The golden

    copy should be maintained by one or more members of the architecture team. Whenever changes to the

    underlying architecture are made, and these changes pass component (unit) testing, they should be

    integrated into the golden copy. Then, the golden copy should be distributed to the various development

    environments to allow team members to test for impacts to their components and to adjust their code if

    necessary. Each team will then submit tested versions of its components for inclusion into the golden

    copy, from which they can be distributed to the other teams in the next periodic refresh. In addition, the
    golden copy should maintain a full set of source data.

    Full Sets of Source Data

    Acquiring full, up-to-date copies of the data sources is an important key to the success of building and

    testing data warehouse components. It will allow realistic component testing, is a key for performance

    testing, and is also vital for preventing data issues that cause errors in the ETL and Business Intelligence

    queries. In addition, it avoids the need to create test data, which can be extremely time-consuming. It may

    be necessary to update the copies with newer copies if a source system is altered (upgraded) or if the

    overall data changes significantly.

    Any manufactured data created by the development or test teams should not be stored in the golden copy

    of the source data, unless all teams agree.


    Test

    The testing phase of a data warehouse usually takes significantly longer than the testing of an operational

    system. This is due in large part to the need to run multiple ETL cycles in order to fully test a data

    warehouse release. If the initial data load will be implemented through the ETL processes, testing the first
    full run of the ETL can take several days. Similar to development, having a few test environments can

    accelerate the process by allowing some tests to be run concurrently. In addition, the utilization of

    automated testing tools can significantly decrease the time it takes to implement regression tests.

    Test Environment

    Concurrent Testing

    Similar to the best practice of having several development environments, several test environments will

    allow for concurrent testing. The following list is an example of the types of concurrent tests that can be

    run if there are multiple test environments (database instances and schemas):

    Initial ETL Load/Performance Test (tests the performance of the first run of ETL)

    Incremental ETL Tests (both performance and functional tests)

    Business Intelligence Data Mart Tests (tests the business intelligence components)

    User Acceptance Tests

    Typically, the initial load of an ETL process can take from several hours to several days, depending on

    the amount of data processed. For example, an ETL run loading 10 years of data will handle roughly 520
    times the data a weekly incremental ETL run will process (10 years is about 520 weeks). Performance
    tuning changes require that these tests be

    run a few times. This alone is why Performance and Load Testing is usually done in a separate

    environment.

    The incremental ETL testing will require several cycles to ensure data conditions are thoroughly tested

    (e.g. testing inserts, updates, and deletes to source data.) Since these tests are focused on testing

    changes to the data, combining these tests with Business Intelligence report and query tests can extend

    the timeline of both. Typically, a specific ETL process is run to create the data mart environments for the

    Business Intelligence testing processes/environment; these data marts are then left untouched by the

    ETL until the next cycle.
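    A common functional check in these cycles is a reconciliation query run after each incremental ETL
    cycle, comparing source and target. This is a minimal sketch using the hypothetical tables from earlier
    (FROM dual is Oracle syntax):

        -- After an incremental run, current dimension rows should reconcile
        -- with the rows present in the source
        SELECT (SELECT COUNT(*) FROM src_item) AS source_rows,
               (SELECT COUNT(*) FROM dim_product
                 WHERE is_current = 'Y') AS current_dim_rows
        FROM dual;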

    While multiple environments can speed up the testing cycles, care must be taken to ensure that all

    environments are rebuilt with the latest golden copy for each cycle. The golden copy will include the fixes

    to all bugs found in the previous cycle, and each environment will need regression testing to ensure the

    fixes did not break something already tested.


    Finally, it should be noted that the challenges, complexities, and costs of maintaining several test
    environments have encouraged some projects to forgo the time savings and adopt a single-threaded,
    serial approach to testing.

    Regression Testing

    Regression testing is a key component of data warehouse development. In the first release, regression

    testing is used to ensure code stability by testing bug fixes for negative impacts to components that have

    already passed testing. While this is important for the first release, it is critical for the ongoing lifecycle of

    the data warehouse. Subsequent releases will include new source systems, new business intelligence

    tools, and/or new data marts with new ETL-driven data processing. While regression testing is not unique

    to data warehousing, developing a good regression test plan is essential to ensuring that multiple future

    releases are deployed smoothly with little impact to the users. A best practice with regard to regression
    testing is the use of automated testing tools.
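    One simple automated regression check is to snapshot a report query's results into a baseline table
    before a release and diff the two afterwards. The sketch below assumes the hypothetical data mart from
    earlier and a baseline copy of it (MINUS is Oracle's set-difference operator):

        -- Both queries should return no rows for a report that the release
        -- is not expected to affect
        SELECT * FROM mart_quarterly_sales
        MINUS
        SELECT * FROM baseline_quarterly_sales;

        SELECT * FROM baseline_quarterly_sales
        MINUS
        SELECT * FROM mart_quarterly_sales;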

    Automated Tools

    There are several factors in choosing and using automated testing tools for a data warehouse. It should

    be noted that in most scenarios, the initial implementation of automated testing is time consuming. The

    activities involved in setting up the tools, creating the test scripts, and debugging test cycles, as well as

    training testers, can add significant time to the test planning phase of the project. However, over the life of

    a data warehouse, this initial investment will pay off. There are some key features that these tools should

    possess:

    Test script repository

    Test script version control

    Ability to organize and combine test scripts

    Thorough result reporting

    Ability to continue processing after errors are encountered

    Ability to initiate ETL processes

    Ability to query the database

    Ability to interact with Business Intelligence tools

    Intuitive user interface

    Issue tracking


    It may be necessary to use several automated test tools. For example, performance and load testing may

    require a different tool than Business Intelligence report testing. Using a suite of vendor related tools is

    common, and may have the benefit of leveraging a single repository for test scripts and issue tracking.


    Deployment

    Deploying a data warehouse also usually takes longer than deploying an operational system. This is
    especially true for the first release. The initial data load of the system can take several days. In addition,
    the system is usually given a few days to cycle through the ETL to ensure all system integration points are
    correct and no issues arise. The deployment package should include the golden copy and checklists of
    steps to deploy the system, set up scheduled batch jobs, modify permissions and create user accounts on
    both source and target databases, and implement other necessary preparatory actions.

    Production Build

    Deployment Checklist

    The deployment checklist is an essential tool to ensure a smooth deployment. The checklist should be

    developed by cataloguing all activities required to build the test environments. During the final test cycle,

    the test environment should be built from scratch using the deployment checklist in order to test the

    checklist. The checklist is usually a spreadsheet or series of spreadsheets that contain the following

    information:

    Activity name

    Assigned resource

    Step-by-step detailed activity instructions

    Golden copy filenames or component names

    Hardware and software component identifiers (such as server ids and schema names)

    Required parameters (including login ids, etc.)

    Communication directions (notification of completion, errors, etc.)

    Execution or scheduling of initial processes (such as kicking off the ETL)

    Including passwords in the deployment checklist can be a security risk. It is recommended that any
    passwords needed to execute a deployment checklist step be supplied outside of the checklist in a
    secure manner. Another alternative is to utilize temporary passwords during the deployment, with a final
    checklist step to change the temporary passwords.

    The checklist should be owned by the architecture team. While it is recommended that one person

    coordinate the deployment effort through monitoring the checklist activities, it is vital to ensure that more

    than one person be fully knowledgeable on the details of the checklist activities.


    Initial System Burn-in

    A system burn-in period is recommended after the initial deployment of the data warehouse. During this

    burn-in period, the ETL is run for a few days to ensure there are no unexpected problems, such as a

    scheduled backup interfering with the ETL. A few users should be granted access to ensure there are no

    problems with the user-facing business intelligence applications and report data. This is a highly
    recommended best practice activity, because during the first release, something will go wrong. A week of
    burn-in is typical.


    Summary

    Data warehouse projects are similar to many other system development projects, and many of the

    standard project methodology best practices apply. However, data warehouses are typically much larger

    and much more complex than other types of systems, and this complexity brings challenges and risks.
    The underlying theme of many of the best practices listed in this document is to reduce the complexity

    and size of the activities into smaller, more manageable chunks. Using a modular approach to component

    design, a phased approach to scoping, and an iterative approach to analysis and testing can help

    accomplish this goal. A methodical, piece-by-piece approach to building a data warehouse is much more

    manageable than an all-or-nothing approach.


    About Intrasphere

    Intrasphere Technologies, Inc. (www.intrasphere.com) is a consulting firm focused on the Life Sciences

    industry. We provide comprehensive, business-focused services that help companies achieve

    meaningful results. Our professionals leverage strategic acumen, deep industry knowledge and proven
    project execution abilities to deliver superior service that builds true business value.

    Our strategy, business process and technology services are developed to specifically address the areas
    that are most important to our clients, including Drug Safety, Business Intelligence, Enterprise Content
    Management, Compliance and IT Management, to name a few.

    We understand the unique nature of the Life Sciences working environment and clients' need to reduce

    costs, drive business processes and speed-to-market, while satisfying regulatory mandates.

    Some of the world's leading global companies, including Pfizer Inc. (NYSE: PFE), Johnson & Johnson
    (NYSE: JNJ), Novartis (NYSE: NVS), Eli Lilly (NYSE: LLY), Vertex Pharmaceuticals (Nasdaq: VRTX) and
    HarperCollins Publishers (NWS), among others, look to Intrasphere as their trusted solutions partner.

    Founded in 1996, Intrasphere is headquartered in New York City with operations in Europe and Asia.

    Intrasphere has been recognized nationally for performance by industry-leading organizations such as
    Deloitte & Touche, Crain's New York Business and Inc. Magazine.


    For More Information

    Jim Brown

    Intrasphere Technologies

    (212) [email protected]

    Locations

    North America:

    Corporate Headquarters

    New York City

    Intrasphere Technologies, Inc.

    100 Broadway, 10th Floor

    New York, NY 10005

    ph: +1 (212) 937-8200

    fax: +1 (212) 937-8298

    Europe:

    United Kingdom

    4th Floor

    Brook House

    229-243 Shepherd's Bush Road

    Hammersmith, London W6 7NL

    ph: +44 (0) 208 834 3700

    fax: +44 (0) 208 834 3701

    Asia:

    India

    Block 2-A, DLF Corporate Park

    DLF City, Phase III

    Gurgaon, Haryana 122002

    ph: +91 (0124) 4168200

    fax: +91 (0124) 4168201