Chapter 3: Database Support in Data Mining
Types of database systems
How they relate to data mining
結束
3-2

Contents
Describes data warehousing and related database systems
Discusses features of data found in data warehouses
Describes how data warehouses are typically implemented and operated
Defines metadata in the context of data warehouses
Shows how different data systems are typically used in data mining
Provides real examples of database systems used in data mining
Discusses the concept of data quality
Reviews the database software market

Data management
Retail organizations generate masses of data that require very advanced data storage systems.
Wal-Mart relied on modern data management to support its supply chain management (SCM).
The manipulation of data is a key element in the data mining process.
Data mining and other analysis can draw upon data collected in internal systems and external sources.

Data access
Data warehouses are not required for data mining, but they store massive amounts of data that can be used for data mining.
Data mining analyses also use smaller sets of data that can be organized in online analytic processing (OLAP) systems or in data marts.
OLAP provides access to report generators and graphical support.

Contemporary Database
Gain competitive advantage
    customer information systems
    data mining
Develop and market new products
    micromarketing

Systems
Database
    Personal, small business level
On-Line Analytic Processing (OLAP)
    Ability to use many dimensions, reports & graphics
Data Mart
    Usually temporary analysis
Data Warehouse
    Usually permanent repository

Data Warehousing
Price Waterhouse definition:
    A data warehouse is an orderly and accessible repository of known facts and related data that is used as a basis for making better management decisions. The data warehouse provides a unified repository of consistent data for decision making that is subject oriented, integrated, time variant, and nonvolatile.

Data Warehousing
Data warehouses are used to store massive quantities of data that can be updated and allow quick retrieval of specific types of data.
Not just a technology; an architecture and process designed to support decision making
Special-purpose database systems that improve query performance significantly
Three general data warehouse processes:
    1. Warehouse generation is the process of designing the warehouse and loading the data.
    2. Data management is the process of storing the data.
    3. Information analysis is the process of using the data to support organizational decision making.

Benefits from Data Warehousing
Provide business users with views of data appropriate to their mission
Consolidate & reconcile (consistent) data
Give macro views of critical aspects
Timely & detailed access to information
Provide specific information to particular groups
Ability to identify trends

Data warehousing
Within data warehouses, data is classified and organized around subjects meaningful to the company.
The data is gathered from operational systems:
    Barcode readers at cash registers
    Information from e-commerce
    Daily reports…
    Industry volumes
    Economic data
Data from different sources (shipping, marketing, billing) are integrated into a common format.

Data Transformation
Consolidate data from multiple sources
Filter to eliminate unnecessary details
Clean data
    eliminate incorrect entries
    eliminate duplications
Convert & translate data into proper format
Aggregate data as designed
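A minimal sketch of the transformation steps above, written in Python for illustration only. The two source lists, the field names (`cust`, `amount`, `date`, `clerk`), and the rule that a non-positive amount is an incorrect entry are all assumptions, not part of the original material. Conversion is done before cleaning here so that the validity check can work on numbers.

```python
from collections import defaultdict

# Hypothetical records from two source systems; field names are assumptions.
billing = [
    {"cust": "C1", "amount": "120.50", "date": "2024-01-05", "clerk": "ann"},
    {"cust": "C2", "amount": "-1", "date": "2024-01-06", "clerk": "bob"},      # incorrect entry
    {"cust": "C1", "amount": "120.50", "date": "2024-01-05", "clerk": "ann"},  # duplicate
]
shipping = [
    {"cust": "C2", "amount": "75", "date": "2024-01-09", "clerk": "cy"},
]


def transform(*sources):
    # Consolidate data from multiple sources into one list.
    rows = [row for source in sources for row in source]
    # Filter: keep only the fields the warehouse needs (drops "clerk").
    rows = [{k: r[k] for k in ("cust", "amount", "date")} for r in rows]
    # Convert: text amounts become numbers.
    for r in rows:
        r["amount"] = float(r["amount"])
    # Clean: drop incorrect entries (assumed: amount <= 0) and duplicates.
    seen, clean = set(), []
    for r in rows:
        key = (r["cust"], r["amount"], r["date"])
        if r["amount"] <= 0 or key in seen:
            continue
        seen.add(key)
        clean.append(r)
    # Aggregate: total amount per customer.
    totals = defaultdict(float)
    for r in clean:
        totals[r["cust"]] += r["amount"]
    return dict(totals)


totals = transform(billing, shipping)
print(totals)
```

In a real warehouse each step would be driven by rules recorded in the metadata rather than hard-coded as it is here.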

Data warehousing
A data warehouse is a central aggregation of data, intended as a permanent storage facility holding normalized, formatted data.
Normalized implies the use of small, stable data structures within the database. Normalized data groups data elements by category, making it possible to apply relational principles when updating data.
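To illustrate what grouping data elements by category can look like, here is a small Python sketch that splits flat order records into a customer table and an order table linked by a key. The record layout and all names and values are hypothetical.

```python
# Hypothetical flat (denormalized) order records; every name and value is illustrative.
flat_orders = [
    {"order_id": 1, "customer": "Acme", "city": "Tulsa", "item": "hose", "qty": 4},
    {"order_id": 2, "customer": "Acme", "city": "Tulsa", "item": "belt", "qty": 1},
    {"order_id": 3, "customer": "Bolt Co", "city": "Omaha", "item": "hose", "qty": 2},
]


def normalize(rows):
    """Split flat rows into a customer table and an order table joined by customer_id."""
    customer_ids = {}  # customer name -> customer_id
    customer_table, order_table = [], []
    for row in rows:
        name = row["customer"]
        if name not in customer_ids:
            customer_ids[name] = len(customer_ids) + 1
            customer_table.append(
                {"customer_id": customer_ids[name], "name": name, "city": row["city"]}
            )
        order_table.append(
            {
                "order_id": row["order_id"],
                "customer_id": customer_ids[name],
                "item": row["item"],
                "qty": row["qty"],
            }
        )
    return customer_table, order_table


customer_table, order_table = normalize(flat_orders)

# Updating a customer's city now touches one row instead of every order row.
customer_table[0]["city"] = "Dallas"
```

The point of the design is that each fact (such as a customer's city) is stored once, so an update cannot leave inconsistent copies behind.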

Key Concepts
Scalability
    Ability to accurately cope with changing conditions (especially magnitude of computing)
Granularity
    Level of detail
    Data warehouse – tends to be fine granularity
    OLAP – tends to aggregate to coarse granularity
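The difference between fine and coarse granularity can be sketched in a few lines of Python. The sales rows below are made up for illustration: the detailed list stands in for warehouse-style data (one row per day, store, and item), and the roll-up produces the kind of monthly summary an OLAP report might show.

```python
from collections import Counter

# Fine granularity: one hypothetical row per day, store, and item.
sales = [
    ("2024-01-03", "store1", "itemA", 5),
    ("2024-01-17", "store1", "itemA", 3),
    ("2024-01-17", "store2", "itemB", 2),
    ("2024-02-01", "store1", "itemA", 4),
]


def roll_up_by_month(rows):
    """Aggregate detailed rows to coarse granularity: total units per month."""
    monthly = Counter()
    for day, _store, _item, units in rows:
        monthly[day[:7]] += units  # "YYYY-MM-DD" -> "YYYY-MM"
    return dict(monthly)


monthly_units = roll_up_by_month(sales)
print(monthly_units)
```

Going from fine to coarse is always possible; the reverse is not, since the summary no longer records which store or item produced each unit.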

Data Warehousing
OLAP                          On-Line Transactional Processing (OLTP)
summary data                  detailed operational data
few users                     many concurrent users
data driven                   transaction driven
effectiveness                 efficiency
use spreadsheets to access

Data Marts
Intermediate-level database system
Originally, many data marts were marketed as preliminary data warehouses. Currently, many data marts are used in conjunction with data warehouses rather than as competitive products.
Data marts are usually used as repositories of data gathered to serve a particular set of users, providing data extracted from data warehouses and/or other sources.
Often used as temporary storage
    Gather data for study from data warehouse, other sources (including external)
    Clean & transform for data mining

OLAP
Multidimensional spreadsheet approach to shared data storage, designed to allow users to extract data and generate reports on the dimensions important to them.
Data is segregated into different dimensions and organized in a hierarchical manner.
Hypercube – term to reflect ability to sort on many dimensions
Many forms
    MOLAP – multidimensional
    ROLAP – relational (uses SQL)
    DOLAP – desktop
    WOLAP – web enabled
    HOLAP – hybrid
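As a rough illustration of the hypercube idea, the Python sketch below stores one measure (sales) described by three dimensions and lets the user total it along any subset of them. The dimension names and figures are assumptions chosen for the example; this is not how any particular OLAP product is implemented.

```python
from collections import defaultdict

DIMENSIONS = ("region", "product", "quarter")

# Hypothetical facts: one measure (sales) described by three dimensions.
facts = [
    {"region": "East", "product": "P1", "quarter": "Q1", "sales": 100},
    {"region": "East", "product": "P2", "quarter": "Q1", "sales": 50},
    {"region": "West", "product": "P1", "quarter": "Q1", "sales": 70},
    {"region": "West", "product": "P1", "quarter": "Q2", "sales": 30},
]


def report(rows, by):
    """Total sales grouped by any subset of the cube's dimensions."""
    unknown = set(by) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in by)] += row["sales"]
    return dict(totals)


by_region = report(facts, by=("region",))
by_product_quarter = report(facts, by=("product", "quarter"))
print(by_region)
print(by_product_quarter)
```

The same facts answer both questions; only the choice of dimensions changes, which is the flexibility the slide attributes to OLAP.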

OLAP
One function of OLAP is standard report generation, including financial performance analysis on selected dimensions (such as by department, geographical region, product, salesperson, time…).
Another is support for planning and forecasting projects using spreadsheet analytic tools.
An OLAP product includes a data warehouse, an OLAP server, and a client server on a local area network (LAN).
OLAP functions – see page 37

Relationship between database systems and data mining (DM)
Data warehouses are not required for data mining, nor are OLAP systems.
However, the existence of either presents many opportunities for data mining.

Data Warehouse Implementation
Data warehouses create the opportunity to provide much better information than what was available in the past. DW can produce consistent views of events and reports.
DW provides
    Reliable, comprehensive source of clean data
    Accurate, complete, in correct format
Processes
    System development
    Data acquisition
    Data extraction for use

Data Warehouse Implementation
Implementation processes involve a degree of continuity, since data warehousing is a dynamic environment.
Implementation requires a suite of software tools to extract data from sources, move it to the data warehouse, and provide user access to this information.
Data acquisition is supported by data warehouse generation.

Data Warehouse Generation
Extract data from sources
Transform
Clean
Load into data warehouse
60-80% of effort in operating data warehouse
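The four generation steps can be chained as a small pipeline. The sketch below is an assumption-laden illustration: it uses Python's standard `sqlite3` module with an in-memory database as a stand-in for the warehouse, and the `shipments` table, its columns, and the source rows are all hypothetical.

```python
import sqlite3

# Hypothetical source rows (product code, quantity as text); values are illustrative.
source_rows = [("P1", " 12 "), ("P1", "8"), ("P2", "n/a")]


def extract():
    """Extract: read rows from the source system (here, an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Transform: strip stray whitespace from the quantity field."""
    return [(product, qty.strip()) for product, qty in rows]


def clean(rows):
    """Clean: drop rows whose quantity is not a whole number, convert the rest."""
    return [(product, int(qty)) for product, qty in rows if qty.isdigit()]


def load(conn, rows):
    """Load: insert the prepared rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS shipments (product TEXT, qty INTEGER)")
    conn.executemany("INSERT INTO shipments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
load(conn, clean(transform(extract())))
loaded = conn.execute(
    "SELECT product, SUM(qty) FROM shipments GROUP BY product"
).fetchall()
print(loaded)
```

Keeping each step as a separate function makes it easier to rerun or test one stage on its own, which matters when most of the operating effort goes into this part of the system.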

Data Extraction Routines
Extraction programs are executed periodically to obtain records, and copy the information to an intermediate file.
Data extraction routines:
    Interpret data formats
    Identify changed records
    Copy information to intermediate file
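One simple way to identify changed records is to compare a last-modified timestamp against the time of the previous extraction run. The Python sketch below assumes such a timestamp exists on every record, which the original does not state; the record fields are hypothetical, and an in-memory `io.StringIO` buffer stands in for the intermediate file.

```python
import csv
import io

# Hypothetical operational records with a last-modified timestamp (ISO 8601 text).
operational = [
    {"id": "1", "name": "hose", "modified": "2024-03-01T10:00:00"},
    {"id": "2", "name": "belt", "modified": "2024-03-05T09:30:00"},
    {"id": "3", "name": "gasket", "modified": "2024-03-07T14:15:00"},
]


def extract_changed(records, last_run, out):
    """Copy records modified after `last_run` to the intermediate file `out`."""
    # ISO 8601 timestamps in the same format compare correctly as plain strings.
    changed = [r for r in records if r["modified"] > last_run]
    writer = csv.DictWriter(out, fieldnames=["id", "name", "modified"])
    writer.writeheader()
    writer.writerows(changed)
    return len(changed)


intermediate = io.StringIO()  # stands in for an intermediate file on disk
copied = extract_changed(operational, "2024-03-02T00:00:00", intermediate)
print(copied)
print(intermediate.getvalue())
```

Source systems without a reliable timestamp would need a different change-detection method, such as comparing against a snapshot from the previous run.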

Data Transformation
Transformation programs accomplish final data preparation, including:
    The consolidation of data from multiple sources
    Filtering data to eliminate unnecessary details
    Cleaning data to eliminate incorrect entries or duplications
    Converting and translating data into the format established for the data warehouse
    The aggregation of data

Data Management
Data management involves:
    Retrieving information from the data warehouse
    Running extraction programs to generate repetitive reports and serve specific needs
Implementation problems:
    Required data not available
    Initial data warehouse scope too broad
    Not enough time to do prototyping or needs analysis
    Insufficient senior direction

Meta Data
Data warehouse management vs. data management:
    Data management concerns the management of all of the enterprise’s data.
    Data warehouse management refers to the design and operation of the data warehouse through all phases of its life cycle.
        Manage meta data
        Design data warehouse
        Ensure data quality
        Manage system during operations

Meta Data
Metadata is the set of reference data used to keep track of data; it describes the organization of the warehouse.
A data catalog provides users with the ability to see specifically what the data warehouse contains.
The content of the data warehouse is defined by metadata, which provides business views of data (information access tools) and technical views (warehouse generation tools).

Business Metadata
What data are available
Source of each data element
Frequency of data updates
Location of specific data
Predefined reports & queries
Methods of data access

Technical Meta Data
Data source (internal or external)
Data preparation features (transformation & aggregation rules)
Logical structure of data
Physical structure & content
Data ownership
Security aspects (access rights, restrictions)
System information (date of last update, retention policy, data usage)
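A data catalog entry can be pictured as a record that carries both the business and the technical view of one data element. The Python sketch below is only one possible layout: the class name `ElementMetadata`, its field names, and the sample entry are hypothetical, chosen to mirror a few of the items listed on the two metadata slides.

```python
from dataclasses import dataclass, field


@dataclass
class ElementMetadata:
    """One hypothetical catalog entry combining business and technical views."""
    name: str
    # Business view
    source: str
    update_frequency: str
    location: str
    # Technical view
    is_external: bool
    transformation_rule: str
    last_update: str
    access_rights: list = field(default_factory=list)


# Illustrative catalog with a single made-up entry.
catalog = {
    "weekly_sales": ElementMetadata(
        name="weekly_sales",
        source="point-of-sale system",
        update_frequency="weekly",
        location="sales_fact table",
        is_external=False,
        transformation_rule="sum of daily sales by store",
        last_update="2024-03-04",
        access_rights=["buyers", "forecasters"],
    ),
}


def describe(catalog, name):
    """Return the business-facing description of one data element."""
    m = catalog[name]
    return f"{m.name}: from {m.source}, updated {m.update_frequency}, stored in {m.location}"


print(describe(catalog, "weekly_sales"))
```

Information access tools would read the business fields, while warehouse generation tools would rely on the technical ones.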

Wal-Mart’s Data Warehouse
Heavy user of IT
Core competency – supply chain distribution
2900 outlets
Data warehouse of 101 terabytes ($4 billion)
65 million transactions per week
Subject-oriented, integrated, time-variant, nonvolatile data
65 weeks of data by item, store, day

Wal-Mart
Use data warehouse to:
    Support decision making
    Buyers, merchandisers, logistics, forecasters
    3,500 vendor partners can query
    Can handle 35 thousand queries per week
    Benefit $12,000 per query
    Some users about 1 thousand queries per day

Summers Rubber Company
Distribution firm
    7 operating locations
    10,000 items
    3,000 customers
Old system:
    OLAP
    Databases transactional & summarized, distributed

Summers Data Storage System
Built in-house, PCs, Access database
Visual Basic & Excel
Distributed system
    Data warehouse server controlled queries, managed resources
Security
    Passwords gave some protection
    To protect from leaving employees, used data marts with small versions of central database

Summers – Negative features
Too much disk space on user local drives
Often difficult to understand & use
Updating multiple data sites slow, limited access
Summary data often wrong
Couldn’t use data mining tools
    Problem was aggregated data stored

Comparison
Product     Use                 Duration     Granularity
Warehouse   Repository          Permanent    Finest
Mart        Specific study      Temporary    Aggregate
OLAP        Report & analysis   Repetitive   Summary

Examples of Data Uses
Customer information systems
Fingerhut

Customer Information Systems
Massive databases
Detailed information about individuals and households
Use automated analysis
    identify focused market target

Micromarketing
Target small groups of highly responsive customers
Own niches like smaller competitors
EXAMPLES:
    Great Atlantic & Pacific Tea Company (A&P)
        target customers, centralize buying
    Fingerhut
        sell on credit to households <$25,000 income

System demonstrations
A dealer wholesaler.
A small portion of the data, for the first 10 shipments, is shown in Table 3.1.
Data warehouses are normalized into relational form. The data is organized into a series of tables connected by keys.

Data mart
Examining the characteristics of customers who buy the products (advertising by mail, internet, …).
Data marts could extract the data and aggregate it in a form useful for data mining.
Table 3.2 shows entries that might be found in a data mart (for product D428 over a two-year interval).

OLAP
An OLAP application focuses more on analyzing trends or other aspects of organizational operations. It may obtain much of its information from the data warehouse, but extracts granular information.
This information could be accessed to make a report by product category (Table 3.3).

OLAP
Evaluating the value of each client to the firm.
Data can be aggregated within a data mart, or on an OLAP system.

OLAP
Organizing volume according to the shipper.
Table 3.5 displays the results of cases for each shipper.

Data Quality
Data warehouse projects can fail. One of the most common reasons is users' refusal to accept the validity of data obtained from a data warehouse. Causes include:
    Corruption of data or missing data from the original sources
    Failure of the software transferring data into or out of the data warehouse
    Failure of the data-cleansing process to resolve data inconsistencies
The responsible staff must verify the integrity of the data and of the data loading and storing processes.
Data Integrity: Do not allow any meaningless, corrupt, or redundant data into the data warehouse.
Controls can be implemented prior to loading data, in the data migration, cleansing, transforming, and loading processes.
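A pre-load control of this kind might be sketched as follows in Python. The specific rules are assumptions made for the example: a row is treated as meaningless if a required field is empty, corrupt if its amount is not a non-negative number, and redundant if its `record_id` has already been accepted. The field names are hypothetical.

```python
REQUIRED = ("record_id", "customer_id", "amount")  # assumed required fields


def check_integrity(rows):
    """Split rows into (accepted, rejected) before they are loaded."""
    accepted, rejected, seen_ids = [], [], set()
    for row in rows:
        if any(row.get(f) in (None, "") for f in REQUIRED):
            rejected.append((row, "missing value"))
        elif not isinstance(row["amount"], (int, float)) or row["amount"] < 0:
            rejected.append((row, "corrupt amount"))
        elif row["record_id"] in seen_ids:
            rejected.append((row, "redundant"))
        else:
            seen_ids.add(row["record_id"])
            accepted.append(row)
    return accepted, rejected


# Hypothetical incoming rows, one valid and three that should be stopped.
incoming = [
    {"record_id": 1, "customer_id": "C1", "amount": 40.0},
    {"record_id": 2, "customer_id": "", "amount": 15.0},    # missing customer
    {"record_id": 3, "customer_id": "C2", "amount": -5.0},  # negative amount
    {"record_id": 1, "customer_id": "C1", "amount": 40.0},  # repeat of record 1
]

accepted, rejected = check_integrity(incoming)
reasons = [reason for _row, reason in rejected]
print(len(accepted), reasons)
```

Recording the reason for each rejection, rather than silently dropping rows, gives staff something concrete to trace back to the source system.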

Data Quality
Table 3.6 illustrates an example of multiple variations of the same data.
What are the variations?
    1. Variations of the same customer
    2. Misspellings
    3. Correct spelling but with a more complete definition

Data Quality
Matching involves associating variables.
Software used to introduce new data into the data warehouse needs to check that the appropriate spelling and entry values are used. It also needs to match companies with addresses… and perform some maintenance.
Software tools to ensure data quality include:
    The analysis of data for type
    The construction of standardization schemes
    The identification of redundant data
    The adjustment of matching criteria to achieve selected levels of discrimination
    The transformation of data into designed format
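As an illustration of matching with an adjustable criterion, the Python sketch below standardizes names and then compares them with `difflib.SequenceMatcher` from the standard library. This is a swapped-in, simple technique chosen for the example, not the method used by any tool the slides refer to; the standardization rule, the company names, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def standardize(name):
    """Assumed standardization scheme: lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a, b):
    """Similarity between 0 and 1 of two names after standardization."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def find_matches(new_name, known_names, threshold=0.8):
    """Return known names whose similarity to new_name meets the threshold."""
    return [k for k in known_names if similarity(new_name, k) >= threshold]


# Hypothetical customer names already in the warehouse.
known = ["Acme Rubber Company", "Baker Supply"]

misspelled = find_matches("Acme Ruber Company", known)
unrelated = find_matches("Zenith Tools", known)
print(misspelled, unrelated)
```

Raising the threshold makes matching stricter (fewer false matches, more missed variants); lowering it does the opposite, which is the trade-off behind adjusting matching criteria to a selected level of discrimination.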

Software products