Data Warehouse

Upload: ashish-kargwal

Post on 14-Jun-2015

Category:

Technology


TRANSCRIPT

1. DATABASE
i. Databases are developed on the idea that data is one of the critical materials of the Information Age.
ii. Information, which is created from data, becomes the basis for decision making.
iii. A database is basically a collection of information organized in such a way that a computer program can quickly select desired pieces of data.

2. DATA WAREHOUSE
i. A data warehouse is a collection of integrated databases designed to support a DSS (Decision Support System).
ii. A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
iii. In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

3. DECISION SUPPORT SYSTEMS
i. Created to facilitate the decision-making process
ii. So much information that it is difficult to extract it all from a traditional database
iii. Need for a more comprehensive data storage facility
iv. Extract information from data to use as the basis for decision making
v. Used at all levels of the organization
vi. Tailored to specific business areas
vii. Interactive
viii. Ad hoc queries to retrieve and display information
ix. Combines historical operational data with business activities

4. DATA WAREHOUSE

5. DATA WAREHOUSE ENVIRONMENT
The data warehouse environment consists of:
i. Data store
ii. Data mart
iii. Metadata
For its data to be effective, a DW must be:
i. Consistent
ii. Well integrated
iii. Well defined
iv. Time stamped

6. DATA STORE
i. Data come from internal and external nonintegrated operational systems.
ii. An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.
iii. It is the most common component of the DW environment.
iv. A data store is generally subject oriented, volatile, and current, and is commonly focused on customers, products, orders, policies, claims, etc.
v. Its day-to-day function is to store the data for a single, specific set of operational applications.
vi. Its function is to feed the data warehouse data for the purpose of analysis.

7. DATA STORE & DATA WAREHOUSE

8. DATA MART
i. Small data stores
ii. More manageable data sets
iii. Targeted to meet the needs of small groups within the organization
iv. It is a lower-cost, scaled-down version of the DW.
v. A small, single-subject data warehouse subset that provides decision support to a small group of people
vi. Data marts offer a targeted and less costly method of gaining the advantages associated with data warehousing and can be scaled up to a full DW environment over time.

9. META DATA
i. Last component of DW environments.
ii. It is information that is kept about the warehouse rather than information kept within the warehouse.
iii. Legacy systems generally don't keep a record of the characteristics of the data (such as what pieces of data exist and where they are located).
iv. Metadata is simply data about data.
v. For example, a line in a sales database may contain: 4056 KJ596 223.45
vi. This is mostly meaningless until we consult the metadata, which tells us it was store number 4056, product KJ596, and sales of $223.45.
vii. Metadata are essential ingredients in the transformation of raw data into knowledge. They are the keys that allow us to handle the raw data.

10. GENERAL METADATA ISSUES
General metadata issues associated with data warehouse use:
i. What tables, attributes, and keys does the DW contain?
ii. Where did each set of data come from?
iii. What transformations were applied during cleansing?
iv. How have the metadata changed over time?
v. How often do the data get reloaded?
vi. Are there so many data elements that you need to be careful what you ask for?

11. CHARACTERISTICS OF DATA WAREHOUSE
A common way of introducing data warehousing is to refer to the characteristics of a data warehouse:
i. Subject Oriented
ii. Integrated
iii. Nonvolatile
iv. Time Variant

12.
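As a small illustration of the metadata example above (the sales line 4056 KJ596 223.45), the following Python sketch shows how a raw line only becomes meaningful once metadata is applied to it. The `FIELD_METADATA` structure and the `decode_record` helper are hypothetical names chosen for this sketch; real warehouses keep metadata in their own catalogs and formats.

```python
from decimal import Decimal

# Hypothetical metadata: one entry per field, in positional order.
# It describes the raw line; it is not part of the sales data itself.
FIELD_METADATA = [
    {"name": "store_number", "type": int, "description": "Store number"},
    {"name": "product_code", "type": str, "description": "Product code"},
    {"name": "sales_amount", "type": Decimal, "description": "Sales in dollars"},
]


def decode_record(raw_line, metadata=FIELD_METADATA):
    """Split a whitespace-delimited raw line and label each value via the metadata."""
    values = raw_line.split()
    if len(values) != len(metadata):
        raise ValueError(f"expected {len(metadata)} fields, got {len(values)}")
    # Convert each raw string with the type recorded in the metadata.
    return {field["name"]: field["type"](value)
            for field, value in zip(metadata, values)}


record = decode_record("4056 KJ596 223.45")
print(record)
```

Without `FIELD_METADATA` the three tokens are just strings; with it, the same line reads as store number 4056, product KJ596, and a sales amount of 223.45.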
SUBJECT ORIENTED
i. Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
ii. Organized around major subjects, such as customer, product, and sales.
iii. Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
iv. Provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.

13. INTEGRATED
i. Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
ii. Data cleaning and data integration techniques are applied.
iii. The data warehouse is a centralized, consolidated database that integrates data derived from the entire organization.

14. NONVOLATILE
i. Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
ii. Once data is entered, it is NEVER removed.
iii. Read-only database for data analysis and query processing.
iv. Data are stored in read-only format.
v. Represents the company's entire history.

15. TIME VARIANT
i. In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
ii. In an operational application system, the expectation is that all data within the database are accurate as of the moment of access. In the DW, data are simply assumed to be accurate as of some moment in time, and not necessarily right now.
iii. One of the places where DW data display time variance is in the structure of the record key. Every primary key contained within the DW must contain, either implicitly or explicitly, an element of time (day, week, month, etc.).

16. TIME VARIANT
i. Every piece of data contained within the warehouse must be associated with a particular point in time if any useful analysis is to be conducted with it.
ii. Another aspect of time variance in DW data is that, once recorded, data within the warehouse cannot be updated or changed.

17. ONLINE TRANSACTION PROCESSING (OLTP)
Online transaction processing (OLTP) systems are optimized for fast and reliable transaction handling. Compared to data warehouse systems, most OLTP interactions will involve a relatively small number of rows, but a larger group of tables.
OLAP functionality is characterized by dynamic, multidimensional analysis of historical data, which supports activities such as the following:
i. Calculating across dimensions and through hierarchies
ii. Analyzing trends
iii. Drilling up and down through hierarchies
iv. Rotating to change the dimensional orientation

18. DATA WAREHOUSE BASIC ARCHITECTURE

19. DATA WAREHOUSE BASIC ARCHITECTURE
This illustrates three things:
i. Data Sources (operational systems and flat files)
ii. Warehouse (metadata, summary data, and raw data)
iii. Users (analysis, reporting, and mining)
The metadata and raw data of a traditional OLTP system are present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales. A summary in Oracle is called a materialized view.

20.
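The idea of summary data (pre-computing a long operation so that a query such as "August sales" becomes a cheap lookup) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module, which has no materialized views, so an ordinary table named `monthly_sales_summary` stands in for one. The table names, the `monthly_total` helper, and the sample rows are all made up for the sketch, and amounts are stored as integer cents to avoid floating-point rounding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw, time-stamped detail data (made-up sample rows; amounts in cents).
conn.execute("CREATE TABLE sales (sale_date TEXT, store INTEGER, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2014-07-30", 4056, 10000),
        ("2014-08-02", 4056, 22345),
        ("2014-08-15", 4057, 7655),
        ("2014-09-01", 4056, 5000),
    ],
)

# Pre-compute the monthly totals once; this table plays the role of a summary.
conn.execute(
    """
    CREATE TABLE monthly_sales_summary AS
    SELECT strftime('%Y-%m', sale_date) AS month,
           SUM(amount_cents)           AS total_cents
    FROM sales
    GROUP BY strftime('%Y-%m', sale_date)
    """
)


def monthly_total(connection, month):
    """Answer a monthly sales query from the summary instead of scanning raw rows."""
    row = connection.execute(
        "SELECT total_cents FROM monthly_sales_summary WHERE month = ?", (month,)
    ).fetchone()
    return row[0] if row else 0


august_cents = monthly_total(conn, "2014-08")
print(f"August 2014 sales: {august_cents / 100:.2f}")
```

A real materialized view would also be refreshed when new data is loaded; here the summary is built once, which fits the nonvolatile, load-then-read pattern described earlier.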
DATA WAREHOUSE ARCHITECTURE (WITH A STAGING AREA)

21. DATA WAREHOUSE ARCHITECTURE (WITH A STAGING AREA)
i. Data Sources (operational systems and flat files)
ii. Staging Area (where data sources go before the warehouse)
iii. Warehouse (metadata, summary data, and raw data)
iv. Users (analysis, reporting, and mining)
We need to clean and process our operational data before putting it into the warehouse. We can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management.

22. DATA WAREHOUSE ARCHITECTURE (WITH A STAGING AREA AND DATA MARTS)

23. DATA WAREHOUSE ARCHITECTURE (WITH A STAGING AREA AND DATA MARTS)
i. Data Sources (operational systems and flat files)
ii. Staging Area (where data sources go before the warehouse)
iii. Warehouse (metadata, summary data, and raw data)
iv. Data Marts (purchasing, sales, and inventory)
v. Users (analysis, reporting, and mining)

24. DATA WAREHOUSING TYPOLOGY
i. The virtual data warehouse: the end users have direct access to the data stores, using tools enabled at the data access layer.
ii. The central data warehouse: a single physical database contains all of the data for a specific functional area.
iii. The distributed data warehouse: the components are distributed across several physical databases.

25. DATA WAREHOUSE ETL TOOLS
ETL is short for Extract, Transform, Load: three database functions that are combined into one tool to pull data out of one database and place it into another database. ETL is used to migrate data from one database to another, to form data marts and data warehouses, and also to convert databases from one format or type to another.
i. Extract is the process of reading data from a database.
ii. Transform is the process of converting the extracted data from its previous form into the form it needs to be in so that it can be placed into another database. Transformation occurs by using rules or lookup tables or by combining the data with other data.
iii. Load is the process of writing the data into the target database.

26. ETL TOOLS
Tool                                Version   ETL Vendor
Oracle Warehouse Builder            11gR1     Oracle
Data Services                       XI 4.0    SAP Business Objects
IBM InfoSphere Information Server   9.1       IBM
SAS Data Integration Studio         9.4M1     SAS Institute
PowerCenter Informatica             9.5       Informatica
Elixir Repertoire                   7.2.2     Elixir
Data Migrator                       7.7       Information Builders
SQL Server Integration Services     10        Microsoft
Talend Studio for Data Integration  5.2       Talend
Data Flow Manager                   6.5       Pitney Bowes Business Insight
Pervasive Data Integrator           10.0      Actian (Pervasive Software)

27. ETL TOOLS
Tool                                Version   ETL Vendor
Open Text Integration Center        7.1       Open Text
Oracle Data Integrator (ODI)        11.1.1.5  Oracle
Data Manager/Decision Stream        8.2       IBM (Cognos)
Clover ETL                          3.4.1     Javlin
Centerprise                         6.0       Astera
DB2 InfoSphere Warehouse Edition    9.1       IBM
Pentaho Data Integration            4.1       Pentaho
Adeptia Integration Suite           5.1       Adeptia
DMExpress                           5.5       Syncsort
Expressor Data Integration          3.7       QlikTech

28. DATA WAREHOUSE TECHNOLOGIES
i. No one currently offers an end-to-end DW solution. Organizations buy bits and pieces from a number of vendors and hopefully make them work together.
ii. SAS, IBM, Software AG, Information Builders, and Platinum offer solutions that are at least fairly comprehensive.
iii. The market is very competitive. Table 10-6 in the text lists 90 firms that produce DW products.

29. IMPLEMENTING THE DATA WAREHOUSE
Kozar's list of seven deadly sins of data warehouse implementation:
i. "If you build it, they will come": the DW needs to be designed to meet people's needs.
ii. Omission of an architectural framework: you need to consider the number of users, volume of data, update cycle, etc.
iii. Underestimating the importance of documenting assumptions: the assumptions and potential conflicts must be included in the framework.
iv. Failure to use the right tool: a DW project needs different tools than those used to develop an application.
v. Life cycle abuse: in a DW, the life cycle really never ends.
vi. Ignorance about data conflicts: resolving these takes a lot more effort than most people realize.
vii. Failure to learn from mistakes: since one DW project tends to beget another, learning from the early mistakes will yield higher quality later.

30. THE FUTURE OF DATA WAREHOUSING
As the DW becomes a standard part of an organization, there will be efforts to find new ways to use the data. This will likely bring with it several new challenges:
i. Regulatory constraints may limit the ability to combine sources of disparate data.
ii. These disparate sources are likely to contain unstructured data, which is hard to store.
iii. The Internet makes it possible to access data from virtually anywhere. Of course, this just increases the disparity.

31. REFERENCES
i. Google.com
ii. Oracle.com
iii. Webopedia.com
iv. Etltool.com
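As a closing illustration, the Extract, Transform, and Load steps described under DATA WAREHOUSE ETL TOOLS can be sketched in a few lines of Python. This is a minimal sketch, not how any of the listed ETL products work: it assumes two in-memory SQLite databases standing in for a source system and a target warehouse, and the table names (`orders`, `fact_orders`), the `REGION_LOOKUP` table, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical lookup table used by the transform step.
REGION_LOOKUP = {"N": "North", "S": "South"}


def extract(source):
    """Extract: read raw rows from the source database."""
    return source.execute(
        "SELECT store, region_code, amount_cents FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: clean each code with a rule, then map it through the lookup table."""
    result = []
    for store, region_code, amount_cents in rows:
        cleaned = region_code.strip().upper()          # rule: normalize the code
        region = REGION_LOOKUP.get(cleaned, "Unknown")  # lookup table
        result.append((store, region, amount_cents))
    return result


def load(target, rows):
    """Load: write the transformed rows into the target database."""
    target.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    target.commit()


# Source system with inconsistent region codes (made-up sample rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (store INTEGER, region_code TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(4056, "n", 22345), (4057, " S", 7655), (4058, "X", 100)],
)

# Target warehouse table in a consistent format.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE fact_orders (store INTEGER, region TEXT, amount_cents INTEGER)")

load(target, transform(extract(source)))
loaded = target.execute(
    "SELECT store, region, amount_cents FROM fact_orders ORDER BY store"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the definition above: each step can be replaced independently, for example by pointing `extract` at a flat file or by adding further rules to `transform`.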