Data Warehousing
Introduction
Definition:
Simplest perception: a data warehouse is no more than a collection of the key pieces of information used to manage and direct the business toward the most profitable outcome.
Precise definition: it concentrates on the data. The data should be subject-oriented, consistent across sources, and so on.
Pearson’s definition: it is more than a vast store of data; it is also the process involved in getting that data from source to table, and from table to analyst.
** In other words **
“A DWH is the data (meta/fact/dimension/aggregate) and the process managers (load/warehouse/query) that make information available, enabling people to make informed decisions.”
Data Warehousing Architecture:
A DWH must be architected to support three major driving factors:
1) Populating the DWH.
2) Day-to-day management of the DWH.
3) The ability to cope with requirement evolution.
Typical process flow within the DWH
[Diagram: Source → Extract & load → Warehouse (data transformation and movement) → Query → User, with an Archive data step.]
Processes:
1. Extract and load the data.
2. Clean and transform the data into a form that can cope with large data volumes and provide good query performance.
3. Back up and archive the data.
4. Manage queries and direct them to the appropriate data sources.
Extract & load process:
Operational data: suitable for the operational system; may have been modified and extended over the years to support performance.
DWH: reconstructed.
1) Extract & load process:
a. Controlling the process: determine when to start extracting the data, run the transformations, run the consistency checks, and so on. Eg: retail sales analysis.
b. When to initiate the extract: the data should be in a consistent state, with all sources captured at the same instant in time. Eg: telecom.
c. Loading the data: load into a temporary data store, then run clean-up and consistency checks. Eg: current subscriber DB and current event DB.
d. Copy management tools and data clean-up: coding.
2) Clean & transform process:
a. Clean and transform the data into a structure that speeds up queries:
• Make sure the data is consistent within itself. Eg: within a row.
• Make sure the data is consistent with other data within the same source.
• Make sure the data is consistent with data in the other source systems.
• Make sure the data is consistent with the information already in the warehouse.
b. Partition the data in order to speed up queries, optimize hardware performance, and simplify the management of the DWH.
3) Back-up & archive process:
Back up regularly, so that the DWH can be recovered after data loss or failure.
In archiving, older data is removed from the system.
4) Query management process:
Direct each query to the most effective data source.
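The four consistency checks listed under the clean and transform process can be sketched as below. This is only an illustration: the row layout, the `known_stores` set (standing in for another source system) and the `loaded_ids` set (standing in for the warehouse) are all hypothetical names, not part of the original material.

```python
# Hypothetical staged sales rows; the column names are illustrative only.
staged = [
    {"sale_id": 1, "store_id": 10, "qty": 2, "unit_price": 5.0, "total": 10.0},
    {"sale_id": 2, "store_id": 10, "qty": 1, "unit_price": 4.0, "total": 9.0},
    {"sale_id": 3, "store_id": 99, "qty": 1, "unit_price": 3.0, "total": 3.0},
    {"sale_id": 3, "store_id": 10, "qty": 1, "unit_price": 3.0, "total": 3.0},
    {"sale_id": 7, "store_id": 11, "qty": 1, "unit_price": 2.0, "total": 2.0},
]
known_stores = {10, 11}   # assumed: store ids taken from another source system
loaded_ids = {7}          # assumed: sale ids already present in the warehouse


def check_rows(rows, known_stores, loaded_ids):
    """Split rows into (clean, rejects); rejects holds (sale_id, reason) pairs."""
    clean, rejects, seen = [], [], set()
    for row in rows:
        if abs(row["qty"] * row["unit_price"] - row["total"]) > 0.005:
            reason = "inconsistent within the row"
        elif row["sale_id"] in seen:
            reason = "duplicate within the same source"
        elif row["store_id"] not in known_stores:
            reason = "not found in the other source system"
        elif row["sale_id"] in loaded_ids:
            reason = "already in the warehouse"
        else:
            reason = None
        seen.add(row["sale_id"])
        if reason is None:
            clean.append(row)
        else:
            rejects.append((row["sale_id"], reason))
    return clean, rejects


clean, rejects = check_rows(staged, known_stores, loaded_ids)
```

In a real DWH these checks would more likely run as set-based SQL against the temporary data store; the loop form is used here only to keep each rule visible.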
Process Architecture
Process | Function | System manager
Extract & load | Extracts and loads the data, performing simple transformations before and during load. | Load manager
Clean & transform data | Transforms and manages the data. | Warehouse manager
Backup & archive | Backs up and archives the data warehouse. | Warehouse manager
Query management | Directs and manages queries. | Query manager
Architecture of a data warehouse
[Diagram: operational data sources, load manager, warehouse manager (detailed information, summary information, metadata), query manager, and front-end tools (Data dipper, OLAP tools); the flow runs Data → Information → Decision.]
Load Manager
The system component that performs all the operations necessary to support the extract and load process.
Built from a combination of off-the-shelf tools, bespoke coding, C programs and shell scripts.
Size and complexity will vary between specific solutions, from DWH to DWH: the larger the degree of overlap between source systems, the larger the load manager will be.
Third-party tools cover at most 20 to 25% of the total system functionality.
Load Manager Architecture
1) Extract the data from the source systems.
2) Fast load the extracted data into a temporary data store.
3) Perform simple transformations into a structure similar to the one in the data warehouse.
Each of these functions has to operate automatically and recover from any errors it encounters, to a very large extent with no human intervention.
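The three load manager functions, and the requirement that each recover from errors without human intervention, can be sketched as follows. Everything here is an assumption made for illustration: the `run_with_retry` helper, the simulated source failure, and the in-memory list standing in for the temporary data store.

```python
import time


def run_with_retry(step, *args, attempts=3, delay_seconds=0.0):
    """Run one load manager step, retrying on failure with no human intervention."""
    for attempt in range(1, attempts + 1):
        try:
            return step(*args)
        except Exception:
            if attempt == attempts:
                raise  # give up only after the last attempt
            time.sleep(delay_seconds)


calls = {"extract": 0}


def extract():
    """Hypothetical extract: fails once to simulate an unavailable source system."""
    calls["extract"] += 1
    if calls["extract"] == 1:
        raise ConnectionError("source system unavailable")
    return ["1,widget,2", "2,gadget,5"]


def fast_load(lines):
    """Load raw lines into a temporary store (here just a list of string fields)."""
    return [line.split(",") for line in lines]


def simple_transform(rows):
    """Convert fields to the types used in the warehouse structure."""
    return [(int(r[0]), r[1], int(r[2])) for r in rows]


def load_manager():
    lines = run_with_retry(extract)
    temp_store = run_with_retry(fast_load, lines)
    return run_with_retry(simple_transform, temp_store)


result = load_manager()
```

The first extract attempt fails and is retried automatically, so the run still completes end to end.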
Extract data from source system
To get hold of the source data, it has to be transferred from the source systems and made available to the DWH.
ASCII files are transferred by FTP across the LAN.
Current gateway technology operates too slowly to compete with FTP.
Fast Load
Data should be loaded into the warehouse in the fastest possible time, in order to minimize the total load window.
This becomes critical as the number of data sources increases and the time window shrinks.
In practice it is more effective to load the data (ASCII files) into a relational database before applying transformations and checks.
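As a rough illustration of loading first and checking afterwards, the sketch below bulk-inserts an ASCII extract into an untyped temporary table and only then runs a check as a single SQL statement. SQLite, the `temp_sales` table and the sample data are assumptions chosen to keep the example self-contained; a production DWH would use its database's own bulk loader.

```python
import csv
import io
import sqlite3

# Hypothetical ASCII extract as it might arrive from a source system.
ascii_extract = io.StringIO(
    "sale_id,store,amount\n"
    "1,north,10.50\n"
    "2,south,abc\n"
    "3,north,7.25\n"
)

conn = sqlite3.connect(":memory:")
# All columns are TEXT so that the load itself never rejects a row.
conn.execute("CREATE TABLE temp_sales (sale_id TEXT, store TEXT, amount TEXT)")

reader = csv.reader(ascii_extract)
next(reader)  # skip the header line
with conn:
    conn.executemany("INSERT INTO temp_sales VALUES (?, ?, ?)", reader)

loaded = conn.execute("SELECT COUNT(*) FROM temp_sales").fetchone()[0]

# The check runs after the load, as one set-based query:
# flag any amount containing a character other than a digit or a dot.
bad_rows = conn.execute(
    "SELECT sale_id FROM temp_sales WHERE amount GLOB '*[^0-9.]*'"
).fetchall()
```

Every row is loaded, including the malformed one, which is then found by the query rather than by per-row logic during the load.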
Simple Transformation
Before or during the load there will be an opportunity to perform simple transformations on the data.
Here we perform those transformations that do not require complex logic or the use of relational set operators.
Eg: retail management system:
1) Strip out all the columns that are not required in the DWH.
2) Convert all the values to the required data types.
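The two simple transformations above might look like this for a single record. The column names and the `REQUIRED` mapping are hypothetical; they stand for whatever columns and data types the DWH actually needs.

```python
# Assumed mapping: column required in the DWH -> type conversion to apply.
REQUIRED = {"sale_id": int, "store_id": int, "amount": float}


def simple_transform(record):
    """Keep only the required columns and convert each to its required type."""
    return {name: convert(record[name]) for name, convert in REQUIRED.items()}


# Hypothetical record from a retail source system, all values as text.
raw = {
    "sale_id": "17",
    "store_id": "4",
    "amount": "19.99",
    "cashier_note": "paid in cash",  # not required in the DWH
    "till_no": "3",                  # not required in the DWH
}

row = simple_transform(raw)
```

Neither step needs complex logic or set operators, which is what makes them suitable to run before or during the load.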
Load Manager Architecture
[Diagram: file structure → temporary data store → warehouse structure. Load manager components: controlling process, stored procedures, copy management tools, fast loader.]
Warehouse Manager
The system component that performs all the operations necessary to support the warehouse management process.
Built from a combination of third-party system management tools, bespoke coding, C programs and shell scripts.
As with the load manager, the size and complexity of the warehouse manager will vary between specific solutions. Unlike the load manager, the complexity of the warehouse manager is driven by the extent to which the operational management of the DWH has been automated.
Third-party tools cover at most 40% of the total system functionality.
Warehouse Manager Architecture
1) Analyze the data to perform consistency and referential integrity checks.
2) Transform and merge the source data in the temporary data store into the published DWH.
3) Create indexes, business views, partition views, and so on.
4) Generate denormalizations if appropriate.
Warehouse Manager Architecture
[Diagram: temporary data store → starflake schema → summary tables. Warehouse manager components: controlling process, stored procedures, backup/recovery tool, SQL scripts.]
Using temporary destination tables:
Once the data is in the temporary store, the next step is to create a set of tables identical to the destination tables in the DWH.
Ex: if the data in the DWH is highly partitioned….
As we are about to execute substantial consistency checks, data should not be loaded until it has been cleaned up.
If a consistency check fails:
Although relational databases provide some form of rollback, in practice it is easier to load the data into a temporary area, clean it up, and then publish it to the DWH.
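A minimal sketch of the temporary destination table approach, assuming SQLite and hypothetical table names (`sales_fact`, `temp_sales_fact`) and a single illustrative check (no negative amounts). Data is published to the DWH table only once the temporary table passes the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_fact (
    sale_id INTEGER PRIMARY KEY, store_id INTEGER NOT NULL, amount REAL NOT NULL);
-- Temporary destination table, identical in structure to the DWH table.
CREATE TABLE temp_sales_fact (
    sale_id INTEGER PRIMARY KEY, store_id INTEGER NOT NULL, amount REAL NOT NULL);
INSERT INTO sales_fact VALUES (1, 10, 5.0);
INSERT INTO temp_sales_fact VALUES (2, 10, 7.5), (3, 11, -4.0);
""")


def publish(conn):
    """Publish the temporary table to the DWH only if the check passes."""
    failures = conn.execute(
        "SELECT COUNT(*) FROM temp_sales_fact WHERE amount < 0"
    ).fetchone()[0]
    if failures:
        return False  # nothing has touched the published table
    with conn:  # publish and empty the temporary table in one transaction
        conn.execute("INSERT INTO sales_fact SELECT * FROM temp_sales_fact")
        conn.execute("DELETE FROM temp_sales_fact")
    return True


first_attempt = publish(conn)

# Clean up the temporary area (here simply dropping the bad row), then retry.
conn.execute("DELETE FROM temp_sales_fact WHERE amount < 0")
conn.commit()
second_attempt = publish(conn)

published = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

Because the failed check is caught in the temporary area, no rollback of the published table is ever needed.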
Complex Transformation
Reconcile the data.
Transform into a starflake schema:
Transform the data into a form suitable for decision support queries.
Transform it into a form in which the bulk of the factual data lies in the center.
Options: star schema, snowflake schema, starflake schema.
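To make "the bulk of the factual data lies in the center" concrete, here is a small star schema sketch in SQLite. The table and column names (`fact_sales`, `dim_store`, `dim_product`) are illustrative assumptions, not taken from the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables surround the fact table and describe its keys.
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);

-- The central fact table holds the bulk of the data: keys plus measures.
CREATE TABLE fact_sales (
    store_id INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    sale_date TEXT,
    amount REAL
);

INSERT INTO dim_store VALUES (1, 'north'), (2, 'south');
INSERT INTO dim_product VALUES (1, 'food'), (2, 'toys');
INSERT INTO fact_sales VALUES
    (1, 1, '2024-01-01', 10.0),
    (1, 2, '2024-01-01', 5.0),
    (2, 1, '2024-01-02', 7.0);
""")

# A typical decision support query: join the fact table to one dimension.
by_region = conn.execute("""
    SELECT s.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    GROUP BY s.region
    ORDER BY s.region
""").fetchall()
```

A snowflake schema would further normalize the dimension tables (for example, splitting `region` into its own table); a starflake schema mixes both styles.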
Create indexes & views:
One would expect the index creation time to be significant, even if we only need to create indexes against a fact table partition.
Because of this, most relational technologies have facilities to create indexes in parallel, distributing the load across the hardware and significantly reducing the elapsed time.
Each index also adds to the overhead of inserting a row into a table.
Generate the summaries:
The warehouse manager has to create a set of aggregations to speed up query performance.
These summaries are generated automatically.
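One way to picture automatic summary generation is a routine that rebuilds an aggregate table from the fact table after each load. The sketch below assumes SQLite and hypothetical names (`fact_sales`, `sales_by_store_month`, `refresh_summary`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (store TEXT, sale_month TEXT, amount REAL);
INSERT INTO fact_sales VALUES
    ('north', '2024-01', 10.0),
    ('north', '2024-01', 5.0),
    ('south', '2024-01', 7.0),
    ('north', '2024-02', 2.0);
""")


def refresh_summary(conn):
    """Rebuild the summary table from the detailed fact data."""
    conn.execute("DROP TABLE IF EXISTS sales_by_store_month")
    conn.execute("""
        CREATE TABLE sales_by_store_month AS
        SELECT store, sale_month, SUM(amount) AS total, COUNT(*) AS sales
        FROM fact_sales
        GROUP BY store, sale_month
    """)
    conn.commit()


refresh_summary(conn)
summary = conn.execute(
    "SELECT store, sale_month, total, sales FROM sales_by_store_month "
    "ORDER BY store, sale_month"
).fetchall()
```

Queries that only need monthly totals per store can then read the small summary table instead of scanning the detailed facts.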
Query Manager:
The system component that performs all the operations necessary to support the query management process.
Built from a combination of user access tools, specialist data warehousing monitoring tools, native database facilities, bespoke coding, C programs and shell scripts.
Size and complexity will vary between specific solutions.
Unlike the load manager, the complexity of the query manager is driven by the extent to which the facilities are provided by user access tools or native database facilities.
Query Manager Architecture
1. Direct queries to the appropriate tables.
2. Schedule the execution of the user queries.
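The two query manager functions can be sketched together: a routing rule that sends a query to a summary table when one covers the requested grouping, and a simple priority queue that decides the execution order. The table names, the `SUMMARY_COLUMNS` catalogue and the `QueryScheduler` class are hypothetical, and real user access tools or native database facilities may already provide this.

```python
import heapq

# Assumed catalogue: summary table -> the columns it is grouped by.
SUMMARY_COLUMNS = {"sales_by_store_month": {"store", "sale_month"}}
DETAIL_TABLE = "fact_sales"


def choose_table(group_by_columns):
    """Direct a query to a summary table if it covers the requested grouping."""
    wanted = set(group_by_columns)
    for table, columns in SUMMARY_COLUMNS.items():
        if wanted <= columns:
            return table
    return DETAIL_TABLE


class QueryScheduler:
    """Runs queries in priority order (lower number first), then by arrival."""

    def __init__(self):
        self._queue = []
        self._arrival = 0

    def submit(self, priority, name, group_by_columns):
        table = choose_table(group_by_columns)
        heapq.heappush(self._queue, (priority, self._arrival, name, table))
        self._arrival += 1

    def run_order(self):
        order = []
        while self._queue:
            _, _, name, table = heapq.heappop(self._queue)
            order.append((name, table))
        return order


scheduler = QueryScheduler()
scheduler.submit(2, "monthly report", ["store", "sale_month"])
scheduler.submit(1, "ad hoc drill-down", ["store", "product"])
scheduler.submit(2, "regional totals", ["store"])
order = scheduler.run_order()
```

The drill-down needs a column the summary does not have, so it falls back to the detailed fact table, while the other two queries are served from the summary.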