data warehousing

27
Data Warehousing Hu Yan [email protected]

Upload: samiisthere

Post on 19-Nov-2014

306 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Data Warehousing

Data Warehousing

Hu [email protected]

Page 2: Data Warehousing

Outline• What is data warehousing• The benefit of data warehousing• Differences between OLTP and data warehousing• The architecture of data warehouse• The main components• Data flows• Tools and technologies• Integration• The importance of managing meta-data• Data marts

Page 3: Data Warehousing

What is data warehousing?

• data warehousing is subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process.

• a data warehouse is data management and data analysis• data webhouse is a distributed data warehouse that is

implement over the web with no central data repository• goal: is to integrate enterprise wide corporate data into a

single reository from which users can easily run queries

Page 4: Data Warehousing

What is data warehousing?

• Subject-orientedWH is organized around the major subjects of the enterprise..rather than the major application areas.. This is reflected in the need to store decision-support data rather than application-oriented data

• Integratedbecause the source data come together from different enterprise-wide applications systems. The source data is often inconsistent using..The integrated data source must be made consistent to present a unified view of the data to the users

• Time-variantthe source data in the WH is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots

• Non-volatiledata is not update in real time but is refresh from OS on a regular basis. New data is always added as a supplement to DB, rather than replacement. The DB continually absorbs this new data, incrementally integrating it with previous data

Page 5: Data Warehousing

The benefits of data warehousing

• The potential benefits of data warehousing are high returns on investment..

• substantial competitive advantage..

• increased productivity of corporate decision-makers..

Page 6: Data Warehousing

The difference bewteen OLTP and data warehousing

• A DBMS built for online transaction processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind

• example: OLTP systems are design to maximize the transaction processing capacity, while data warehouses are designed to support ad hoc query processing

Page 7: Data Warehousing

comparision of OLTP systems and data warehousing system

OLTP systems Data warehousing systems

Hold current data

Stores detailed data

Data is dynamic

Repetitive processing

High level of transaction throughput

Predictable pattern of usage

Transaction-driven

Application-orented

Supports day-to-day decisions

Serves large number of clerical/operation users

Holds historical data

Stores detailed, lightly, and highly summarized data

Data is largely static

Ad hoc, unstructured, and heuristic processing

Medium to how level of transaction throughput

Unpredictable pattern of usage

Analysis driven

Subject-oriented

supports strategic decisions

Serves relatively how number of managerial users

Page 8: Data Warehousing

Problems• Underestimation of resources for data loading• Hidden problems with source systems• Required data not captured• Increased end-user demands• Data homogenization• High demand for resources• Data ownership• High maintenance• Long-duration projects• Complexity of integration

Page 9: Data Warehousing

Operational

data source1

The architecture

Query Manage

Warehouse Manager

DBMS

Operational

data source 2

Meta-dataHigh

summarized data

Detailed data

Lightly

summarized

data

Operational

data store (ods)

Operational

data source n

Archive/backup

data

Load Manager

Data mining

OLAP(online analytical processing) tools

Reporting, query,

application development,

and EIS(executive information system) tools

End-user

access toolsTypical architecture of a data warehouse

Operational data store (ODS)

Page 10: Data Warehousing

The main components

• Operational data sourcesfor the DW is supplied from mainframe operational data held in first generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstaions and private serves and external systems such as the Internet, commercially available DB, or DB assoicated with and organization’s suppliers or customers

• Operational datastore(ODS)is a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse

Page 11: Data Warehousing

The main components

• load manageralso called the frontend component, it performance all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse

• warehouse managerperforms all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing-up data

Page 12: Data Warehousing

The main components

• query manageralso called backend component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries

• detailed, lightly and lightly summarized data,archive/backup data

• meta-data• end-user access toolscan be categorized into five main groups:

data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools

Page 13: Data Warehousing

Data flows

• Inflow- The processes associated with the extraction, cleansing, and loading of

the data from the source systems into the data warehouse.

• upflow- The process associated with adding value to the data in the warehouse

through summarizing, packaging , packaging, and distribution of the data

• downflow- The processes associated with archiving and backing-up of data in

the warehouse

• outflow- The process associated with making the data availabe to the end-users

• Meta-flow- The processes associated with the management of the meta-data

Page 14: Data Warehousing

Operational

data source1

Warehouse Manager

DBMS

Meta-data High

summarized data

Detailed data

Lightly

summarized

data

Operational

data store (ods)

Operational

data source n

Archive/backup

data

Load

Manager

Data mining tools

OLAP (online analytical processing) tools

End-user

access tools

Information flows of a data warehouse

Reporting, query,application

development, and EIS (executive information system) tools

Downflow

Inflow

Meta-flow

Upflow Query Manage

Outflow

Warehouse Manager

Page 15: Data Warehousing

Tools and Technologies

• The critical steps in the construction of a data warehouse:

a. Extractionb. Cleansing

c. Transformation• after the critical steps, loading the results into target

system can be carried out either by separate products, or by a single, categories:

• code generators• database data replication tools• dynamic transformation engines

Page 16: Data Warehousing

Data Warehouse DBSM(integration)

• due to the maturity of such products, most relational databases will integrate predictably with other types of software

• The reqirements for data warehose RDBMS• Load performance • Load processing • Data quality management• Query perfomance• Terabyte scalability• Mass user scalability• Networked data warehouse• Warehouse administration• Integrated dimensional analysis• Advanced query funtionlity

Page 17: Data Warehousing

The importance of managing meta-data(integration)

• The integration of meta-data, that is ”data about data”

• Meta-data is used for a variety of purposes and the management of it is a critical issue in achieving a fully integrated data warehouse

• The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse

• The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data

• The meta-data associated with data management describes the data as it is stored in the warehouse

• The meta-data is required by the query manager to generate appropriate queries, also is associated with the user of queries

Page 18: Data Warehousing

• The major integration issue is how to synchronize the various types of meta-data use throughout the data warehouse. The challenge is to synchronize meta-data between different products from different vendors using different meta-data stores

• Two major standards for meta-data and modeling in the areas of data warehousing and component-based development-MDC(Meta Data Coalition) and OMG(Object Management Group)

Page 19: Data Warehousing

Administration and Management Tools

• a data warehouse requires tools to support the administration and management of such complex enviroment.

• for the various types of meta-data and the day-to-day operations of the data warehouse, the administration and management tools must be capable of supporting those tasks:

• monitoring data loading from multiple sources• data quality and integrity checks• managing and updating meta-data• monitoring database performance to ensure efficient query response

times and resource utilization

Page 20: Data Warehousing

• auditing data warehouse usage to provide user chargeback information

• replicating, subsetting, and distributing data

• maintaining effient data storage management

• purging data;

• archiving and backing-up data

• implementing recovery following failure

• security management

Page 21: Data Warehousing

Data mart

• data mart a subset of a data warehouse that supports the requirements of particular department or business function

• The characteristics that differentiate data marts and data warehouses include:

• a data mart focuses on only the requirements of users associated with one department or business function

Page 22: Data Warehousing

• data marts do not normally contain detailed operational data, unlike data warehouses

• as data marts contain less data compared with data warehouses, data marts are more easily understood and navigated

Page 23: Data Warehousing

Operational

data source1

Warehouse Manager

DBMS

Operational

data source 2

Meta-dataHigh

summarized data

Detailed data

Lightly

summarized

data

Operational

data store (ods)

Operational

data source n

Archive/backup

data

Load

Manager

Data mining

OLAP(online analytical processing) tools

Reporting, query,application development,

and EIS(executive information system) tools

End-user

access tools

Typical data warehouse adn data mart architecture

Operational data store (ODS)

Query

Manage

summarized

data(Relational database)

Summarized data

(Multi-dimension database)

Data Mart

(First Tier) (Third Tier)

(Second Tier)

Warehouse Manager

Page 24: Data Warehousing

Reasons for creating a data mart

• To give users access to the data they need to analyze most often

• To provide data in a form that matches the collective view of the data by a group of users in a department or business function

• To improve end-user response time due to the reduction in the volume of data to be accessed

• To provide appropriately structured data as ditated by the requirements of end-user access tools

• Normally use less data so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse

Page 25: Data Warehousing

• The cost of implementing data marts is normally less than that required to establish a data warehouse

• The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project

Page 26: Data Warehousing

data marts issues

• data mart functionalitythe capabilities of data marts have increased with the growth in their popularity

• data mart sizethe performance deteriorates as data marts grow in size, so need to reduce the size of data marts to gain improvements in performance

• data mart load performancetwo critical components: end-user response time and data loading performanceto increment DB updating so that only cells affected by the change are updated and not the entire MDDB structure

Page 27: Data Warehousing

• users’ access to data in multiple martsone approach is to replicate data between different data marts or, alternatively, build virtual data martit is views of several physical data marts or the corporate data warehouse tailored to meet the requirements of specific groups of users

• data mart internet/intranet accessit’s products sit between a web server and the data analysis product.Internet/intranet offers users low-cost access to data marts and the data WH using web browsers.

• data mart administrationorganization can not easily perform administration of multiple data marts, giving rise to issues such as data mart versioning, data and meta-data consistency and integrity, enterprise-wide security, and performance tuning . Data mart administrative tools are commerciallly available

• data mart installationdata marts are becoming increasingly complex to build. Vendors are offering products referred to as ”data mart in a box” that provide a low-cost source of data mart tools