
  • Khalid Ahmad

  • Chapter 11

    Definition of terms
    Reasons for the information gap between information needs and availability
    Reasons for the need for data warehousing
    Describe three levels of data warehouse architectures
    List four steps of data reconciliation
    Describe two components of a star schema
    Estimate fact table size
    Design a data mart

  • Data warehouse: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

    Subject-oriented: e.g., customers, patients, students, products
    Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
    Time-variant: can study trends and changes
    Non-updatable: read-only, periodically refreshed

    Data mart: a data warehouse that is limited in scope

  • Integrated, company-wide view of high-quality information (from disparate databases)

    Separation of operational and informational systems and data (for improved performance)

  • Source: adapted from Strange (1997).

  • Generic two-level architecture

    Independent data mart
    Dependent data mart and operational data store
    Logical data mart and @ctive warehouse
    Three-layer architecture

    All involve some form of extraction, transformation and loading (ETL)

  • Figure 11-2: Generic two-level architecture

    ETL feeds one, company-wide warehouse
    Periodic extraction: data is not completely current in the warehouse

  • Figure 11-3: Independent data mart

    Data marts: mini-warehouses, limited in scope
    Separate ETL for each independent data mart
    Data access complexity due to multiple data marts

  • Figure 11-4: Dependent data mart with operational data store

    Single ETL for the enterprise data warehouse (EDW)
    Simpler data access
    ODS provides option for obtaining current data
    Dependent data marts loaded from the EDW

  • Near real-time ETL for @ctive data warehouse

    ODS and data warehouse are one and the same
    Data marts are NOT separate databases, but logical views of the data warehouse
    Easier to create new data marts


  • Status → Event → Status

    Event = a database action (create/update/delete) that results from a transaction

  • Data characteristics: transient vs. periodic data (see the sketch below)

    Transient data: changes to existing records are written over previous records, thus destroying the previous data content
    Periodic data: data are never physically altered or deleted once they have been added to the store
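    A minimal sketch of the contrast in Python; the account record and values are hypothetical, not from the chapter:

    ```python
    # Transient store: an update overwrites the old value, destroying it.
    transient = {"acct_001": 500}
    transient["acct_001"] = 450            # the previous balance (500) is lost

    # Periodic store: every change is appended with a timestamp;
    # nothing is altered or deleted once added.
    periodic = [
        ("acct_001", "2019-08-01", 500),
        ("acct_001", "2019-08-07", 450),   # the 500 entry is preserved
    ]
    ```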

  • Typical operational data is:

    Transient, not historical
    Not normalized (perhaps due to denormalization for performance)
    Restricted in scope, not comprehensive
    Sometimes poor quality: inconsistencies and errors

    After ETL, data should be:

    Detailed: not summarized yet
    Historical: periodic
    Normalized: third normal form or higher
    Comprehensive: enterprise-wide perspective
    Timely: data should be current enough to assist decision-making
    Quality controlled: accurate with full integrity

  • ETL = extract, transform, and load

    Capture/Extract
    Scrub or data cleansing
    Transform
    Load and Index

  • Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse (see the sketch below)

    Static extract = capturing a snapshot of the source data at a point in time
    Incremental extract = capturing changes that have occurred since the last static extract
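    A runnable sketch of both extract styles, assuming a hypothetical orders source table with a last_updated column (all names are illustrative):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 9.99, "2019-08-01"), (2, 24.50, "2019-08-06")])

    def static_extract(conn):
        # Snapshot of the chosen subset at a point in time.
        return conn.execute("SELECT * FROM orders").fetchall()

    def incremental_extract(conn, since):
        # Only changes that have occurred since the last extract.
        return conn.execute(
            "SELECT * FROM orders WHERE last_updated > ?", (since,)).fetchall()

    print(static_extract(conn))                     # all rows
    print(incremental_extract(conn, "2019-08-05"))  # only row 2
    ```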

  • Scrub/Cleanse: uses pattern recognition and AI techniques to upgrade data quality (see the sketch below)

    Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
    Also: decoding, reformatting, time-stamping, conversion, key generation, merging, error detection/logging, locating missing data
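    A small cleansing sketch covering three of the listed tasks: reformatting dates, removing duplicates, and logging rows with missing data (field names are hypothetical):

    ```python
    from datetime import datetime

    raw = [
        {"cust": "Smith", "dob": "07/08/1970"},
        {"cust": "Smith", "dob": "07/08/1970"},   # duplicate record
        {"cust": "Jones", "dob": None},           # missing data
    ]

    clean, errors, seen = [], [], set()
    for row in raw:
        key = (row["cust"], row["dob"])
        if key in seen:
            continue                              # drop exact duplicate
        seen.add(key)
        if row["dob"] is None:
            errors.append(row)                    # error detection/logging
            continue
        # Reformatting: MM/DD/YYYY -> ISO 8601
        row["dob"] = datetime.strptime(row["dob"], "%m/%d/%Y").date().isoformat()
        clean.append(row)

    print(clean)   # [{'cust': 'Smith', 'dob': '1970-07-08'}]
    print(errors)  # [{'cust': 'Jones', 'dob': None}]
    ```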

  • Transform = convert data from the format of the operational system to the format of the data warehouse (see the sketch below)

    Record-level:
    Selection: data partitioning
    Joining: data combining
    Aggregation: data summarization

    Field-level:
    Single-field: from one field to one field
    Multi-field: from many fields to one, or one field to many
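    A sketch of two record-level transformations, selection and aggregation, on hypothetical sales rows:

    ```python
    from collections import defaultdict

    sales = [
        {"store": "S1", "product": "P1", "amount": 10.0},
        {"store": "S1", "product": "P2", "amount": 5.0},
        {"store": "S2", "product": "P1", "amount": 7.5},
    ]

    # Selection (partitioning): keep only one store's rows.
    s1_rows = [r for r in sales if r["store"] == "S1"]

    # Aggregation (summarization): total amount per product.
    totals = defaultdict(float)
    for r in sales:
        totals[r["product"]] += r["amount"]

    print(s1_rows)       # rows for store S1 only
    print(dict(totals))  # {'P1': 17.5, 'P2': 5.0}
    ```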

  • Load/Index = place transformed data into the warehouse and create indexes (see the sketch below)

    Refresh mode: bulk rewriting of target data at periodic intervals
    Update mode: only changes in source data are written to the data warehouse
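    A sketch contrasting the two load modes against a hypothetical warehouse table (table and column names are illustrative):

    ```python
    import sqlite3

    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE sales_fact (product TEXT PRIMARY KEY, total REAL)")

    def refresh_load(rows):
        # Refresh mode: bulk rewrite of the target data.
        wh.execute("DELETE FROM sales_fact")
        wh.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)

    def update_load(rows):
        # Update mode: write only the rows that changed (upsert).
        wh.executemany("INSERT OR REPLACE INTO sales_fact VALUES (?, ?)", rows)

    refresh_load([("P1", 17.5), ("P2", 5.0)])
    update_load([("P2", 6.0)])                 # only P2 changed
    print(wh.execute("SELECT * FROM sales_fact ORDER BY product").fetchall())
    # [('P1', 17.5), ('P2', 6.0)]
    ```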

  • Figure 11-11: Single-field transformation

    In general, some transformation function translates data from the old form to the new form
    Algorithmic transformation uses a formula or logical expression
    Table lookup is another approach: it uses a separate table keyed by the source record code
    (Both approaches are sketched below.)
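    A minimal sketch of both single-field approaches; the formula and lookup values are illustrative:

    ```python
    # Algorithmic: a formula translates the old form to the new form,
    # e.g. Fahrenheit to Celsius.
    def to_celsius(f):
        return (f - 32) * 5 / 9

    # Table lookup: a separate table keyed by the source record code.
    state_lookup = {"CA": "California", "NY": "New York"}

    print(to_celsius(212))     # 100.0
    print(state_lookup["CA"])  # California
    ```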

  • Figure 11-12: Multifield transformation

    M:1: from many source fields to one target field
    1:M: from one source field to many target fields
    (Both directions are sketched below.)
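    A sketch of both directions on hypothetical source fields:

    ```python
    # M:1: many source fields combined into one target field.
    src = {"street": "12 Elm St", "city": "Dallas", "state": "TX"}
    address = ", ".join([src["street"], src["city"], src["state"]])

    # 1:M: one source field split into many target fields.
    full_name = "Ada Lovelace"
    first, last = full_name.split(" ", 1)

    print(address)      # 12 Elm St, Dallas, TX
    print(first, last)  # Ada Lovelace
    ```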

  • Objectives:

    Ease of use for decision support applications
    Fast response to predefined user queries
    Customized data for particular target audiences
    Ad-hoc query support
    Data mining capabilities

    Characteristics:

    Detailed (mostly periodic) data
    Aggregate (for summary)
    Distributed (to departmental servers)

    Most common data model = star schema (also called dimensional model)

  • Figure 11-13: Components of a star schema

    Fact tables contain factual or quantitative data
    Dimension tables contain descriptions about the subjects of the business
    1:N relationship between dimension tables and fact tables
    Excellent for ad-hoc queries, but bad for online transaction processing
    Dimension tables are denormalized to maximize performance
    (A minimal schema sketch follows.)
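    A minimal star schema sketch for a hypothetical sales subject area: one fact table holding the quantitative data, dimension tables holding descriptions, and a 1:N relationship from each dimension to the fact table. All table and column names are illustrative.

    ```python
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT, region TEXT);
    CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim(product_key),
        store_key    INTEGER REFERENCES store_dim(store_key),
        period_key   INTEGER REFERENCES period_dim(period_key),
        units_sold   INTEGER,   -- quantitative facts
        dollars_sold REAL
    );
    """)
    ```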

  • Figure 11-14: Star schema example

    Fact table provides statistics for sales broken down by product, period and store dimensions (a worked size estimate follows)
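    The chapter's objectives include estimating fact table size. A worked estimate with assumed figures (all numbers hypothetical):

    ```python
    # Assume 1,000 products sold in 200 stores, tracked monthly for 5 years.
    products = 1_000
    stores = 200
    periods = 5 * 12                         # 60 monthly periods

    max_rows = products * stores * periods   # 12,000,000 rows if every product
                                             # sells in every store every month

    row_bytes = 5 * 4                        # e.g. three 4-byte foreign keys
                                             # plus two 4-byte measures
    table_bytes = max_rows * row_bytes       # 240,000,000 bytes, about 240 MB
    print(max_rows, table_bytes)
    ```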


  • Figure 11-16: Modeling dates

    Fact tables contain time-period data
    Date dimensions are important (a generation sketch follows)
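    A sketch of generating a date dimension, one row per calendar day; the column names are illustrative, not from the chapter:

    ```python
    from datetime import date, timedelta

    def date_dim_rows(start, end):
        d = start
        while d <= end:
            yield {
                "date_key": d.toordinal(),
                "full_date": d.isoformat(),
                "year": d.year,
                "quarter": (d.month - 1) // 3 + 1,
                "month": d.month,
                "day_of_week": d.strftime("%A"),
            }
            d += timedelta(days=1)

    rows = list(date_dim_rows(date(2019, 1, 1), date(2019, 1, 3)))
    print(rows[0])  # {'date_key': 737060, 'full_date': '2019-01-01', ...}
    ```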

  • Helper table simplifies representation of hierarchies in data warehouses

  • Identify subjects of the data mart

    Identify dimensions and facts
    Indicate how data is derived from enterprise data warehouses, including derivation rules
    Indicate how data is derived from the operational data store, including derivation rules
    Identify available reports and predefined queries
    Identify data analysis techniques (e.g., drill-down)
    Identify responsible people

  • The use of a set of graphical tools that provides users with multidimensional views of their data and allows them to analyze the data using simple windowing techniques

    Relational OLAP (ROLAP): traditional relational representation
    Multidimensional OLAP (MOLAP): cube structure

    OLAP operations (sketched below):
    Cube slicing: come up with a 2-D view of the data
    Drill-down: going from summary to more detailed views
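    A sketch of both operations over a small in-memory cube of (product, store, month) cells; the data is illustrative:

    ```python
    from collections import defaultdict

    cube = {
        ("P1", "S1", "Jan"): 10, ("P1", "S2", "Jan"): 7,
        ("P2", "S1", "Jan"): 5,  ("P1", "S1", "Feb"): 12,
    }

    # Cube slicing: fix one dimension (month = Jan) to get a 2-D view.
    jan_slice = {(p, s): v for (p, s, m), v in cube.items() if m == "Jan"}

    # Drill-down: from a product summary to product-by-month detail.
    summary = defaultdict(int)
    detail = defaultdict(int)
    for (p, s, m), v in cube.items():
        summary[p] += v
        detail[(p, m)] += v

    print(jan_slice)      # 2-D product x store view for January
    print(dict(summary))  # {'P1': 29, 'P2': 5}
    print(dict(detail))   # {('P1', 'Jan'): 17, ('P2', 'Jan'): 5, ('P1', 'Feb'): 12}
    ```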

  • Figure 11-22: Slicing a data cube

  • Figure 11-23: Example of drill-down

    Summary report
    Drill-down with color added
    Starting with summary data, users can obtain details for particular cells

  • Knowledge discovery using a blend of statistical, AI, and computer graphics techniques

    Goals:
    Explain observed events or conditions
    Confirm hypotheses
    Explore data for new or unexpected relationships

    Techniques:
    Statistical regression
    Decision tree induction
    Clustering and signal processing
    Affinity
    Sequence association
    Case-based reasoning
    Rule discovery
    Neural nets
    Fractals

    Data visualization: representing data in graphical/multimedia formats for analysis

  • OLTP (On-Line Transaction Processing): describes processing at operational sites

    OLAP (On-Line Analytical Processing): describes processing at the warehouse

  • Standard DB (OLTP):

    Mostly updates
    Many small transactions
    MB - GB of data
    Current snapshot
    Index/hash on primary key
    Raw data
    Thousands of users (e.g., clerical users)

    Warehouse (OLAP):

    Mostly reads
    Queries are long and complex
    GB - TB of data
    History
    Lots of scans
    Summarized, reconciled data
    Hundreds of users (e.g., decision makers, analysts)