best dwh basics

Upload: ramasundari-vadali

Post on 02-Apr-2018


  • 7/27/2019 Best Dwh Basics

    1/15

    Dimension: A category of information. For example, the Time dimension.

    Attribute: A unique level within a dimension. For example, Month is an attribute in the Time dimension.

    or

    An attribute represents a single type of information in a dimension. For example, Year is an attribute in the Time dimension.

    Hierarchy: The specification of levels that represents the relationships between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year --> Quarter --> Month --> Day.

    A dimensional data model contains two types of tables:

    1) Fact table: The fact table in a dimensional data model contains the measures of interest, that is, the measurements, metrics or facts of a business process. Take the sales amount of a business as an example. The amount can be a monthly sales number or a sales number for a day. This measure is stored in the fact table at the appropriate granularity.

    For sales measures, a simple fact table might contain three columns: a date column, a store column and a sales amount column. Besides the measurements, the table also contains foreign keys to the dimension tables.

    or

    It contains numeric values and a composite key (i.e. a collection of foreign keys).

    E.g. sales and profit.

    2) Dimension table: The dimension table in a dimensional model represents the context of the measurements. The context can be understood as the characteristics of a measurement (subject), such as who, what, where, when and how.

    For example, in the business process Sales, the characteristics of the 'monthly sales number' measurement would be Location (where), Time (when) and Product Sold (what). A dimension table contains a number of dimension attributes, or columns. In the Location dimension the attributes can be Location Code, State, Country and Zip Code. Further, dimension attributes contain one or more hierarchical relationships.

    or

    It contains character (descriptive) values.

    E.g. Customer_name, Customer_city.

    What is dimensional modeling: A data model that keeps each dimension in its own table and the facts in a separate table (with the necessary relationships to all dimensions) is called a dimensional model. It is a de-normalized model because it is used for report generation. Data is fed into it only through a scheduled and structured process (ETL), which in turn fetches data from relational / transactional data source(s).
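    The fact and dimension tables described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names (fact_sales, dim_date, dim_store), the columns and the sample rows are assumptions made for the example, not part of the original notes.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT, state TEXT);
    -- The fact table holds the measure plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        date_key     INTEGER REFERENCES dim_date(date_key),
        store_key    INTEGER REFERENCES dim_store(store_key),
        sales_amount REAL
    );
    INSERT INTO dim_date  VALUES (1, 2018, 1), (2, 2018, 2);
    INSERT INTO dim_store VALUES (10, 'Hyderabad', 'Telangana'), (11, 'Bangalore', 'Karnataka');
    INSERT INTO fact_sales VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 10, 70.0);
""")

# A typical report: join the fact table to a dimension and aggregate the measure.
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.sales_amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(sales_by_city)
```

    The dimension tables supply the descriptive context (city, state, year, month), while the fact table carries only keys and the numeric measure.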

    Ex: Here's a different way to look at dimensional modeling:

    There are three basic styles of data models:

    1)Conceptual data model: The conceptual data model is sometimes called

    the domain model and it is typically used for exploring domain concepts in

    an enterprise with stakeholders of the project.

    2)Logical data model: The logical model is used for exploring the domain

    concepts as well as their relationships. This model depicts the logical entity

    types, typically referred to simply as entity types, the data attributes

    describing those entities, and the relationships between the entities.

    3) Physical data model: The physical data model is used in the design of the database's internal schema. As such, it depicts the data tables, the data columns of those tables, and the relationships between the tables. This model represents the data design taking into account the facilities and constraints of a given database management system. The physical data model is often derived from the logical data model, although it can also be reverse engineered from an existing database implementation.

    Data/dimension modeling tools:

    1.) Oracle Designer

    2.) ERwin (Entity Relationship for Windows)

    3.) Informatica (Cubes/Dimensions)

    4.) Embarcadero

    5.) PowerDesigner (Sybase)

    Factless Fact Table: A fact table that contains only keys (i.e. foreign keys) and no measures (numeric values) is known as a factless fact table.

    or

    A fact table having no facts is known as a factless fact table.

    NOTE: Generally we use a factless fact table when we want to record that events happened, for information only, without including them in any calculations. It simply captures information about events that happen over a period.
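    A factless fact table can be pictured as rows of dimension keys only. The sketch below assumes a student attendance event with hypothetical keys (date_key, student_key, class_key); since there is no measure to sum, analysis is done by counting rows.

```python
from collections import Counter

# Each row holds only dimension keys (date_key, student_key, class_key); no measure.
attendance_fact = [
    (20180402, 1, 101),
    (20180402, 2, 101),
    (20180403, 1, 101),
]

# Analysis is done by counting rows, since there is nothing to sum.
attendance_per_day = Counter(date_key for date_key, _, _ in attendance_fact)
print(attendance_per_day[20180402])
```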

    APPROACHES:

    Bottom-up is good at the software integration stage, but top-down is good at implementation time.

    1) Top-down: First we build the data warehouse, then we build the data marts. This needs more cross-functional skills, takes more time and is also costly.

    ODS --> ETL --> Data warehouse --> Data mart --> OLAP

    2) Bottom-up: First we build the data marts, then the data warehouse. The data mart that is built first serves as a proof of concept for the others. This takes less time and costs less than the top-down approach.

    ODS --> ETL --> Data mart --> Data warehouse --> OLAP

    How do we maintain the primary key in a fact table?

    In data warehousing we use surrogate keys, so that the business (natural) key value can change without breaking the relationships between tables. Suppose you have two tables, emp and dept, where deptno is the primary key of the dept table and is also used in the emp table as a foreign key. In this case we cannot modify the primary key, because it is referenced as a foreign key in the emp table. That is why we need an extra column which has no business meaning. Here we add an extra ID column as a surrogate key in both tables. It carries no meaning of its own, but it can be used to perform the joins between the two tables.

    What is the difference between an aggregate table and a fact table?

    A fact table contains millions of records, and retrieving records from it takes time. An aggregate table contains a limited amount of summarized data from the required tables, so retrieving data from it takes less time.

    What is the difference between aggregate table and materialized view?

    Aggregate tables are pre-computed totals in the form of a hierarchical multidimensional structure, whereas a materialized view is a database object which caches a query result in a concrete table and refreshes it from the original database tables from time to time. Aggregate tables are used to speed up query computation, whereas materialized views speed up data retrieval.

    What is an aggregate table and an aggregate fact table?

    An aggregate table contains summarized data. A materialized view can serve as an aggregate table.

    For example, in sales we have only date-level transactions. If we want to create a report such as sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we use the @aggregate function (for example, @Aggregate_Aware in Business Objects).
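    The idea of a yearly aggregate table can be sketched as follows, again with sqlite3. The detail table fact_sales, its columns and the sample rows are assumptions for illustration; only the year_agg name comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (sale_date TEXT, product TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('2017-12-30', 'pen', 10.0),
        ('2018-01-05', 'pen', 20.0),
        ('2018-02-11', 'pen', 5.0),
        ('2018-02-12', 'ink', 7.5);
    -- Pre-compute yearly totals once so reports do not rescan the detail rows.
    CREATE TABLE year_agg AS
        SELECT substr(sale_date, 1, 4) AS year, product, SUM(amount) AS total_amount
        FROM fact_sales
        GROUP BY substr(sale_date, 1, 4), product;
""")

# A "sales by product per year" report now reads the small aggregate table.
yearly = conn.execute(
    "SELECT year, product, total_amount FROM year_agg ORDER BY year, product"
).fetchall()
print(yearly)
```

    In a real warehouse the aggregate table would be refreshed by the ETL process whenever new detail rows are loaded.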

    Schemas: We choose the schema depending on the requirement. In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

    Whether one uses a star or a snowflake largely depends on personal preference and

    business needs.

    Some points on star schema:

    1) A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

    2) In a star schema, the fact table is in normalized format and the dimension tables are in de-normalized format.

    3) If performance is the priority then go for a star schema, since its dimension tables are de-normalized.

    4) We use a star schema when queries should involve few joins and perform better. Here data is de-normalized.

    Some points on snowflake schema:

    1) In a snowflake schema, both the dimension and fact tables are in normalized format. It is also known as an extended star schema.

    2) A snowflake schema requires more dimension tables and more foreign keys, which reduces query performance, but it normalizes the records.

    3) If storage space is the priority then go for a snowflake schema, since its dimension tables are normalized.

    4) For complex joins we go for a snowflake schema. Performance is a little slower due to the number of joins. Here data is normalized.

    http://www.1keydata.com/datawarehousing/star-schema.html
    http://www.1keydata.com/datawarehousing/snowflake-schema.html

    Difference between Snowflake and Star Schema:

    Star schema:

    1) A star schema is a centralized fact table surrounded by different dimensions.

    2) A star schema contains highly de-normalized data.

    3) Dimension tables in a star schema cannot have parent tables.

    4) Why go for a star schema:

    a) It involves fewer joins.

    b) It gives a simpler database.

    c) It supports drill-up options.

    Snowflake schema:

    1) A snowflake schema is a star schema in which dimensions are split into further dimensions.

    2) A snowflake schema contains partially normalized data.

    3) Dimension tables in a snowflake schema can have parent tables.

    4) Why go for a snowflake schema: sometimes we need to derive separate dimensions from existing dimensions; in that case we go for a snowflake.

    Disadvantage of snowflake:

    Query performance is lower because more joins are involved.
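    The extra join that a snowflake schema introduces can be seen in a small sqlite3 sketch. All table names, columns and rows here are hypothetical; the point is only that the same report needs one more join once the category attribute is moved into its own parent table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star: category is stored redundantly inside the product dimension.
    CREATE TABLE star_dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- Snowflake: category is split out into its own (parent) table.
    CREATE TABLE snow_dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE snow_dim_product (
        product_key  INTEGER PRIMARY KEY,
        name         TEXT,
        category_key INTEGER REFERENCES snow_dim_category(category_key)
    );
    CREATE TABLE fact_sales (product_key INTEGER, amount REAL);

    INSERT INTO star_dim_product VALUES (1, 'pen', 'stationery'), (2, 'ink', 'stationery');
    INSERT INTO snow_dim_category VALUES (100, 'stationery');
    INSERT INTO snow_dim_product VALUES (1, 'pen', 100), (2, 'ink', 100);
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 7.5);
""")

star_query = """
    SELECT p.category, SUM(f.amount) FROM fact_sales f
    JOIN star_dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
"""
snowflake_query = """
    SELECT c.category, SUM(f.amount) FROM fact_sales f
    JOIN snow_dim_product p ON p.product_key = f.product_key
    JOIN snow_dim_category c ON c.category_key = p.category_key
    GROUP BY c.category
"""
star_result = conn.execute(star_query).fetchall()
snowflake_result = conn.execute(snowflake_query).fetchall()
print(star_result, snowflake_result)
```

    Both queries return the same totals; the snowflake version stores the category name only once but pays for it with an additional join.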

    Star Schema Definition: The star schema is the simplest data warehouse schema.

    It is called a star schema because the diagram resembles a star with points radiating

    from a center.

    Advantages:

    Simplest DW schema

    Easy to understand

    Easy to navigate between the tables because of the smaller number of joins.

    Most suitable for Query processing

    Disadvantages:

    Occupies more space

    Highly De-normalized

    Snowflake schema Definition: A snowflake schema is a data warehouse schema which consists of a single fact table and multiple dimension tables. These dimension tables are normalized. It is a variant of the star schema in which each dimension can have its own dimensions.

    Advantages:

    These tables are easier to maintain

    Saves the storage space.

    Disadvantages:

    Due to large number of joins it is complex to navigate

    Types of schemas:

    1) Star Schema: A schema in which a central fact table connects to a number of individual dimension tables is called a star schema.

    It involves fewer joins, so performance is better.

    A star schema contains de-normalized data.

    2) Snowflake Schema: A schema in which one dimension table is split into more than one dimension table is known as a snowflake schema.

    It contains normalized data.

    There are more joins in a snowflake schema, so performance degrades.

    3) Galaxy Schema: A galaxy schema is also known as a fact constellation schema. It contains a number of fact tables that share dimension tables.

    4) Starflake schema: A hybrid structure that contains a mixture of (de-normalized) star and (normalized) snowflake schemas.

    NOTE: In real-world projects, when we want to use an existing data warehouse as the source, we mainly go for a snowflake schema.

    Types of Facts:

    1) Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

    2) Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.

    Eg: Bank balances. An account balance is semi-additive: the current balance of an account cannot be summed across time periods, but if you want to see the current balance of the whole bank you can sum the current balances of all accounts.

    3) Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

    Eg: Ratios, averages and variance.
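    The bank balance example can be worked through in a few lines of Python. The days, accounts and amounts below are made up for illustration; the sketch shows that summing across accounts is meaningful, while summing one account across days is not.

```python
# Hypothetical daily balance snapshot: (day, account, balance).
balances = [
    ("2018-04-01", "A", 100),
    ("2018-04-01", "B", 50),
    ("2018-04-02", "A", 120),
    ("2018-04-02", "B", 50),
]

# Summing across accounts for one day is meaningful: the bank's total holdings.
total_on_day_2 = sum(b for day, _, b in balances if day == "2018-04-02")

# Summing one account across days is not meaningful; use the latest value instead.
account_a = [(day, b) for day, acct, b in balances if acct == "A"]
misleading_sum = sum(b for _, b in account_a)  # adds two days of the same money
latest_balance = max(account_a)[1]             # balance on the most recent day

print(total_on_day_2, misleading_sum, latest_balance)
```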

    Types of Fact Tables:

    1)Cumulative: This type of fact table describes what has happened over a period

    of time.

    For example, this fact table may describe the total sales by product by store by

    day. The facts for this type of fact tables are mostly additive facts. The first

    example presented here is a cumulative fact table.

    2) Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts.

    The second example presented here is a snapshot fact table.

    Types of dimension tables:

    There are many types of dimension tables. The commonly used ones are:

    1) Conformed dimension

    2) Junk dimension

    3) Degenerate dimension

    4) Slowly changing dimension

    5) Rapidly changing dimension

    The others are:

    6) Virtual dimension

    7) Regular dimension

    8) Causal dimension

    9) Shared dimension

    10) Monster dimension

    11) Inferred dimension

    12) Role-playing dimension

    13) Shrunken dimension

    14) Outriggers

    15) Static dimension

    Slowly Changing Dimension: Attributes of a dimension that undergo changes only rarely over time.

    Ex: Customer Name, Sex

    or

    A slowly changing dimension (SCD) is a dimension whose attribute values change slowly with respect to time.

    Ex: The employee with employee id e23321 is presently in Hyderabad; after a month he is relocated to Bangalore. We can then say the address dimension is an SCD with respect to time.

    Rapidly Changing Dimension: Attributes of a dimension that change frequently.

    or

    A rapidly changing dimension is one whose attribute values change quickly.

    Ex: ATM transactions (banks). The data changes continuously and concurrently, every second, so it is very difficult to capture these dimensions.

    Conformed Dimension: A dimension table used by two or more fact tables.

    Ex: Date dimension

    or

    A conformed dimension is a dimension which is connected to or shared by more than one fact table.

    Eg: For a business which handles both sales and orders of products, the product dimension becomes a conformed dimension for both the sales fact and the order fact.

    Degenerate Dimension: A dimension value that is stored in the fact table instead of in a dimension table.

    or

    Data items that are not facts and that do not fit into the existing dimensions are termed degenerate dimensions. Degenerate dimensions are used when fact tables represent transactional data. They can be used as the primary key of the fact table, but they cannot act as foreign keys.

    For example, in a sales fact table the invoice number is a degenerate dimension. Since the invoice number is not tied to an order header table, there is no need for it to join to a dimension table; hence it is referred to as a degenerate dimension.

    Junk Dimension: A table that combines different, unrelated attributes in order to reduce the number of primary key / foreign key relationships.

    Ex: student attendance tracking

    or

    Data which is not directly useful for report generation is placed in a separate table; that table is known as a junk dimension. Generally it is used to provide extra information.

    Ex: Any yes/no style status flag is an example of a junk dimension attribute.
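    One common way to build a junk dimension is to enumerate every combination of the low-cardinality flags and give each combination a surrogate key. The sketch below assumes three hypothetical flags (payment type, gift wrap, return indicator) that are not in the original notes.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise each need their own dimension.
payment_types = ["cash", "card"]
gift_wrap = ["Y", "N"]
is_return = ["Y", "N"]

# One junk-dimension row per combination, each with its own surrogate key.
junk_dim = {
    key: combo
    for key, combo in enumerate(product(payment_types, gift_wrap, is_return), start=1)
}

# Reverse lookup used while loading facts: flag combination -> junk key.
junk_key = {combo: key for key, combo in junk_dim.items()}

print(len(junk_dim))
print(junk_key[("card", "N", "N")])
```

    The fact table then stores a single junk key instead of three separate flag columns or three separate foreign keys.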

    Differences between OLTP and OLAP are:

    OLTP: Online Transaction Processing, which deals with transactions, e.g. withdrawals at ATMs. It involves many transactions. The databases have to be updated frequently, after the successful completion of each transaction.

    1) It is customer-oriented and is used for transaction and query processing by clerks, clients and IT professionals.

    2) It manages current data and is very detail-oriented.

    3) It adopts an entity-relationship (ER) model and an application-oriented database design.

    4) It focuses on the current data within an enterprise or department.

    5) It uses E-R modeling, and there are more concurrent users.

    6) It contains normalized tables, so there is little or no redundancy.

    7) More tables and joins, fewer indexes.

    8) It stores daily transactional data.

    9) It stores comparatively little data.

    10) It contains mainly current data.

    11) INSERT, UPDATE and DELETE can be applied on OLTP.

    12) Performance is high.

    13) Users: clerks, DBAs.

    14) OLTP is a transactional process.

    15) Number of users: around 1000.

    OLAP: Online Analytical Processing, which deals with the analysis of data. It has to deal with historical data too (for analysis purposes). It is not updated frequently; bulk updates are allowed if required.

    1) It is market-oriented and is used for data analysis by knowledge workers (managers, executives, analysts).

    2) It manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.

    3) It adopts a star, snowflake or fact constellation model and a subject-oriented database design.

    4) It spans multiple versions of a database schema due to the evolutionary process of an organization, and integrates information from many organizational locations and data stores.

    5) It uses dimensional modeling.

    6) It contains de-normalized tables, so there is redundancy.

    7) Fewer tables and joins, more indexes.

    8) It stores consolidated data derived from operational systems.

    9) It contains historical and present data.

    10) Mostly only SELECT is applied on OLAP.

    11) It stores a very large amount of data.

    12) Performance is low compared with OLTP.

    13) OLAP is an analytical process.

    14) Users: knowledge workers (managers, analysts).

    15) Number of users: around 100.

    Types of OLAP: OLAP (Online Analytical Processing) is a set of specifications which allows client applications to retrieve data from the data warehouse for analytical processing. There are 4 types of OLAP:

    1.) DOLAP (Desktop OLAP): OLAP which communicates with desktop databases to retrieve the data is called DOLAP.

    Ex: Cognos, Business Objects tools.

    2.) ROLAP (Relational OLAP): OLAP which communicates with relational databases to retrieve the data is called ROLAP.

    Ex: Cognos ReportNet, Business Objects, MicroStrategy, Hyperion

    3.) MOLAP (Multidimensional OLAP): OLAP which communicates with multidimensional databases to retrieve the data is called MOLAP.

    Ex: Cognos, Hyperion

    4.) HOLAP (Hybrid OLAP): OLAP which uses the combined features of ROLAP and MOLAP is called HOLAP.

    Ex: Cognos

    OLAP Query:

    Roll-up: displays data at an increasing level of aggregation.

    Drill-down: displays details by querying down the dimension table hierarchy.

    Pivot: produces a cross tabulation.

    Slice and dice: makes a range selection on one or more dimensions.
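    These operations can be illustrated on a tiny in-memory data set. The rows, the column order and the aggregate helper below are assumptions made for this sketch; a real OLAP tool would run equivalent queries against the warehouse.

```python
from collections import defaultdict

# Hypothetical detail rows: (year, quarter, region, sales).
rows = [
    (2018, "Q1", "North", 10),
    (2018, "Q1", "South", 20),
    (2018, "Q2", "North", 30),
    (2019, "Q1", "North", 40),
]

def aggregate(data, key):
    """Sum sales grouped by the tuple that key(row) returns."""
    totals = defaultdict(int)
    for row in data:
        totals[key(row)] += row[3]
    return dict(totals)

by_quarter = aggregate(rows, lambda r: (r[0], r[1]))  # drill-down: year and quarter
by_year = aggregate(rows, lambda r: (r[0],))          # roll-up: year only
north_only = [r for r in rows if r[2] == "North"]     # slice: fix one dimension
north_2018 = [r for r in north_only if r[0] == 2018]  # dice: restrict two dimensions
pivot = aggregate(rows, lambda r: (r[2], r[0]))       # cells of a region x year cross-tab

print(by_year)
print(pivot)
```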

    Snapshot: A snapshot is a copy of data. When we create a snapshot, it copies the exact data related to a particular report. We use snapshots whenever we want to compare reports (e.g. this month's report with the previous month's).

    Differences between a Data Warehouse and a Data Mart:

    Category               Data Warehouse     Data Mart
    Scope                  Corporate          Line of Business (LOB)
    Subject                Multiple           Single subject
    Data Sources           Many               Few
    Size (typical)         100 GB-TB+         < 100 GB
    Implementation Time    Months to years    Months

    Slowly changing dimension: If the data in a dimension table changes only rarely, it is called a slowly changing dimension.

    Ex: A change in the name or address of a person, which happens rarely.

    The price of a product, the address of a person and the name of a city are a few examples of SCD attributes.

    This change can be implemented in three ways:

    Type I: Replace the old record with a new record containing the updated data. In this way we lose the history.

    Type II: Create an additional dimension table record with the new value. In this way we keep the history. We can determine which record is current by adding a current record flag or a time stamp to the dimension row.

    Type III: In this type of implementation we create a new field in the dimension table which stores the old value of the attribute. When an attribute of the dimension changes, we push the updated value to the current field and the old value to the old field.

    In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

    In our example, recall we originally have the following table:

    Customer Key    Name         State
    1001            Christina    Illinois

    After Christina moved from Illinois to California, the new information replaces the original record, and we have the following table:

    Customer Key    Name         State
    1001            Christina    California

    Advantages:

    This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

    Disadvantages:

    All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Christina lived in Illinois before.

    Usage:

    About 50% of the time.

    When to use Type 1: Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.
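    A Type 1 change amounts to an in-place overwrite. The sketch below models the customer dimension from the example as a Python dictionary; the apply_type1 helper is a hypothetical name introduced only for illustration.

```python
# Customer dimension keyed by customer key (values from the example above).
customer_dim = {1001: {"name": "Christina", "state": "Illinois"}}

def apply_type1(dim, key, **changes):
    """Overwrite attributes in place; the previous values are not kept."""
    dim[key].update(changes)

apply_type1(customer_dim, 1001, state="California")
print(customer_dim[1001])
```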

    In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

    In our example, recall we originally have the following table:

    Customer Key    Name         State
    1001            Christina    Illinois

    After Christina moved from Illinois to California, we add the new information as a new row into the table:

    Customer Key    Name         State
    1001            Christina    Illinois
    1005            Christina    California

    Advantages:

    This allows us to accurately keep all historical information.

    Disadvantages:

    This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.

    This necessarily complicates the ETL process.

    Usage:

    About 50% of the time.

    When to use Type 2: Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.
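    A Type 2 change can be sketched as "expire the current row, then append a new row with a new surrogate key". The customer_id business key, the valid_from / valid_to dates, the is_current flag, the dates themselves and the apply_type2 helper are all assumptions added for this illustration; only the keys 1001 and 1005 and the two states come from the example above.

```python
from datetime import date

# Each row: surrogate key, natural (business) id, attributes, validity range, current flag.
# customer_id, valid_from, valid_to and is_current are assumed columns, not from the text.
customer_dim = [
    {"customer_key": 1001, "customer_id": "C1", "name": "Christina", "state": "Illinois",
     "valid_from": date(2000, 1, 1), "valid_to": None, "is_current": True},
]

def apply_type2(dim, customer_id, new_state, change_date, new_key):
    """Expire the current row and append a new row with a new surrogate key."""
    current = next(r for r in dim if r["customer_id"] == customer_id and r["is_current"])
    current["valid_to"] = change_date
    current["is_current"] = False
    dim.append({**current, "customer_key": new_key, "state": new_state,
                "valid_from": change_date, "valid_to": None, "is_current": True})

apply_type2(customer_dim, "C1", "California", date(2003, 1, 15), new_key=1005)
for row in customer_dim:
    print(row["customer_key"], row["state"], row["is_current"])
```

    Fact rows loaded before the change keep pointing at key 1001, and later fact rows point at key 1005, which is how the full history is preserved.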

    In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.

    In our example, recall we originally have the following table:

    Customer Key    Name         State
    1001            Christina    Illinois

    To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:

    Customer Key

    Name

    Original State

    Current State

    Effective Date

    After Christina moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):

    Customer Key    Name         Original State    Current State    Effective Date
    1001            Christina    Illinois          California       15-JAN-2003

    Advantages:

    This does not increase the size of the table, since new information is updated.

    This allows us to keep some part of history.

    Disadvantages:

    Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Christina later moves to Texas on December 15, 2003, the California information will be lost.

    Usage:

    Type 3 is rarely used in actual practice.

    When to use Type 3: Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.
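    The Type 3 behaviour, including the loss of intermediate values, can be sketched with a single dictionary row. The field names mirror the columns listed above; the apply_type3 helper is a hypothetical name used only for this illustration.

```python
customer = {"customer_key": 1001, "name": "Christina",
            "original_state": "Illinois", "current_state": "Illinois",
            "effective_date": None}

def apply_type3(row, new_state, effective_date):
    """Keep the original value; overwrite only the current value and its date."""
    row["current_state"] = new_state
    row["effective_date"] = effective_date

apply_type3(customer, "California", "15-JAN-2003")
apply_type3(customer, "Texas", "15-DEC-2003")  # the California value is overwritten
print(customer)
```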