Download - DatawarehouseDataMiningNotes
![Page 1: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/1.jpg)
![Page 2: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/2.jpg)
Database and data warehousing◦ Chapter 4 pages 167-177
Data mining and OLAP◦ Chapter 8 pages 321-325
![Page 3: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/3.jpg)
In managing information, physical deals with the structure of information as it resides on various storage media.
Logical deals with how knowledge workers view their information needs, and includes such terms as:◦CHARACTER - our smallest unit of
information.◦FIELD - group of related characters.◦RECORD - group of related fields.◦FILE - group of related records.◦DATABASE - group of logically
associated files.◦DATA WAREHOUSE - information from
many databases.
![Page 4: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/4.jpg)
DATA DICTIONARY - contains the logical structure of information in a database.◦Definitions of all fields, records, and tables◦Relationships between tables◦Who is responsible for maintaining data in
the database◦Descriptions of who is authorized to access
different parts of the database Data dictionary contains meta data (data
about the data)
![Page 5: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/5.jpg)
Sample Data Dictionary Report
![Page 6: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/6.jpg)
Definition- a database that stores current and historical data designed to support business analysis activities and decision-making tasks of managers; typically a relational database model is used. The data warehouse uses special software (tools) to assist managers extract information.
Benefits improved access improved information isolation from operational systems tools permit advanced data analysis
Users and data marts
![Page 7: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/7.jpg)
Extraction phase – create files on the computer that will store the data warehouse and move transaction data to this machine; data may come from many sources or parts of the organization
Transformation phase – cleanse and standardize the data. Why is this necessary?
Load phase – transfer the data from the transformation phase into the data warehouse
The ETL process becomes automated to make regular transfers of transaction data into the data warehouse
![Page 8: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/8.jpg)
Operational Data Data is on many
systems Current operational
data Inconsistent data
definitions Functionally
organized data Data are constantly
changing Support OLTP
Warehouse Data Integrated in one
enterprise-wide system
Recent and historical data
Consistent data definitions
Data are organized around business entities
Data are stabilized Support OLAP
![Page 9: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/9.jpg)
• Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, previously
unknown and potentially useful) information or patterns from data in large databases
• Similar terms Exploratory data analysis Data driven discovery Deductive learning Knowledge extraction
![Page 10: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/10.jpg)
A computer-based data analysis process
• Utilizes historical organizational data, typically in a data warehouse
• Uses a variety of data analysis, modeling, and visualization techniques
• One use is to discover previously unknown patterns or potential relationships in the data: undirected
• Also used to make predictions, verify assumptions, or otherwise provide useful information: directed
• Allows businesses to make proactive, knowledge-driven decisions
![Page 11: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/11.jpg)
• Prediction Use some variables to predict unknown or
future values of other variables• Description
Find human-interpretable patterns that describe the data
![Page 12: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/12.jpg)
![Page 13: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/13.jpg)
![Page 14: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/14.jpg)
• Given a collection of data Each record contains a set of attributes, one of the
attributes is the class variable Usually, the given data set is divided into training and
test sets, with training set used to build the model and test set used to validate it
• Find a model for the class attribute as a function of the values of other attributes based on the training set Previously unseen data (test set) are used to
determine the accuracy of the model
![Page 15: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/15.jpg)
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
Approach:• Use the data for a similar product introduced before• We know which customers decided to buy and which decided
otherwise; this {buy, don’t buy} decision is the class variable• Collect various demographic, lifestyle, and company-
interaction related information about all such customers• Use this information as input attributes to create a classifier
model
![Page 16: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/16.jpg)
Goal: Predict fraudulent cases in credit card transactions
Approach:• Use past credit card transactions and the information on its
account-holder as attributes When does a customer buy, what does he buy,
how often he pays on time, etc• Determine whether past transactions were fraud or fair
transactions; this is the class variable• Create a model for the class of the transactions
![Page 17: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/17.jpg)
Goal: To predict whether a customer is likely to be lost to a competitor
Approach:• Use detailed record of transactions with each of the past
and present customers, to find attributes How often the customer calls, where he calls,
what time-of-the day he calls most, his financial status, marital status, etc
• Label the customers as still with the company or left the company (churned)
• Create a model for churn
![Page 18: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/18.jpg)
Clustering concerns segmenting a diverse population into several homogeneous subgroups or clusters. Clustering differs from classification in that there are no predefined classifications.
• Data points in one cluster are more similar to one another
• Data points in separate clusters are less similar to one another
Similarity Measures:• Euclidean distance if attributes are continuous• Other problem-specific measures
![Page 19: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/19.jpg)
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mixApproach:
• Collect different attributes of customers based on their geographical and lifestyle related information
• Find clusters of similar customers• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from different clusters
![Page 20: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/20.jpg)
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them
Approach: To identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster
Gain: Retrieval can utilize the clusters to relate a new document or search term to clustered documents
![Page 21: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/21.jpg)
Given a set of records, each of which contains some number of items from a specific collection, produce dependency rules which will predict occurrence of an item based on occurrences of other items:
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
![Page 22: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/22.jpg)
Goal: To identify items that are bought together by sufficiently many customers
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items
A classic case: If a customer buys diapers on Friday evening, then he is very likely to buy beer
![Page 23: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/23.jpg)
So, don’t be surprised if you find six-packs stacked next to diapers!
![Page 24: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/24.jpg)
• Privacy • • Profiling
• Unauthorized Use
• Big Brother
![Page 25: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/25.jpg)
Customer loyalty cards have multiple uses, but one use is to collect data for the data warehouse
Examples◦Grocery stores◦Web sites◦Harrah’s◦Store related credit cards
Assurance of a steady flow of data
![Page 26: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/26.jpg)
Multidimensional data analysis (or OLAP) enables users to view data using various dimensions, measures and time frames (i. e., OLAP)◦dimensions: products, business units,
country, industry (e.g., categories)◦measures: money, unit sales, head
count, variances◦ time: daily, weekly, monthly, quarterly,
yearly) This type of analysis also provides the
ability to view data in different ways (tables, charts, 3-D, geographically)
OLAP tools provide for this Pivot tables in Excel or Access
![Page 27: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/27.jpg)
Pro
pert
y
City
Time
Three dimensional revenue model
![Page 28: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/28.jpg)
14670
15056
15888
14555
16004
14578
15500
15890Fla
tH
ouse
Glasgow
London
Aberdeen
Q1 Q2 Q3 Q4
![Page 29: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/29.jpg)
Property Type City Time
Total Revenue
Flat Glasgow Q1 15056
House Glasgow Q1 14670
Flat Glasgow Q2 14555
House Glasgow Q2 15888
Flat Glasgow Q3 14578
House Glasgow Q3 16004
Flat Glasgow Q4 15890
House Glasgow Q4 15500
Flat London Q1 19678
House London Q1 23877
Flat London Q2 19567
House London Q2 28677
![Page 30: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/30.jpg)
• A common operation is to aggregate a measure over one or more dimensionsFind the total revenueFind the total revenue for each cityFind the top property-type for the 3rd
quarter based on total revenue across all cities
![Page 31: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/31.jpg)
• Roll-up: Aggregating data across different dimension levelsExample: given revenue by city, we can roll-up to get
total revenue by state
• Drill-down: The inverse of roll-up: disaggregating dataExample: given total revenue by state, we can drill-
down to get revenue by city
![Page 32: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/32.jpg)
Glasgow London Aberdeen
Q1 29726 43555 53210
Q2 30443 48244 34567
Q3 30582 56222 45677
Q4 31390 45632 50056Glasgow
London Aberdeen
Jan 9035 21005 5216
Feb 10788 14799 14944
Mar 9903 7751 33050
Apr 11273 10573 21884
May
9005 16896 8573
Jun 10265 20775 4110
Rollup
Drill Down
![Page 33: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/33.jpg)
• Slicing & Dicing: Selecting data within dimension categoriesExample: given revenue by city for the entire year, we
can extract the revenue for a given city for a given quarter
• Rotating: Reorienting the presentation of the data cubeExample: given a cube that presents revenue by city &
property type for each quarter, we can change the presentation to present the revenue by property type & quarter for each city
![Page 34: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/34.jpg)
Example: OLAP Usage at an Automobile Dealership
The StoryAn automobile dealership manager wants to improve business activity. Therefore she wants to view sales figures from different perspectives.
A QuestionWhat is the sales volume for a specific model and colors, for a specific salesperson ?
The Data NeedsSales by model
Sales by salesperson
Sales by color
![Page 35: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/35.jpg)
Example: The Multi-dimensional Data Model Used
Sales Volume
Blue Red White
Van
Coupe
Sedan
Miller
Clyde
Smith
COLOR
SALESPERSON
MOD E L
![Page 36: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/36.jpg)
Example: OLAP “Slicing & Dicing“ – Selecting Categories
Sales Volumes
Blue
Red White
Van
Coupe
Sedan
Miller
Clyde
Smith
COLOR
SALESPERSON
MOD E L
• Color: Blue and White
Clyde
Blue
White
Coupe
“Sliced & Diced“ Data
Choose a range out of each dimension: • Model: Coupe only
• Salesperson: Clyde only
![Page 37: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/37.jpg)
Example: OLAP “Rotation“ – Changing the Presentation of the Cube
Sales Volume
Blue Red White
Van
Coupe
Sedan
COLOR
MOD E L
View of the Account Manager
Rotate the data cube by 90°
SALESPERSON
Sales Volume
Miller
Smith Clyde
Van
Coupe
Sedan
MOD E L
View of the Product Manager
Different Users will require different views of the multidimensional cube
![Page 38: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/38.jpg)
Example: OLAP Drill-Down and Roll-UpData can be disaggregated and aggregated along a dimension according to the natural hierarchy
Drill-Down
Roll-Up
State
Region
Salesperson Miller Smith
Clyde Lucas Gleason
Atlanta Athens
Georgia
Sales Volume by Organization Dimension- three level hierarchy -
![Page 39: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/39.jpg)
Primarily used to exploit data warehouses Provides extremely fast response View combinations of two dimensions Enable drilling down (start with broad info
and get more specific) Produces results as counts or percentages Conversion of tables to charts/graphs Usually requires a tailored-made relational
database OLAP applications are widely used by mid-
level and upper level managers A form of business intelligence software
![Page 40: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/40.jpg)
Go to www.fedscope.opm.gov ◦Under data cubes on entry page click on
employment◦Demonstrate drill down and adding charts◦Data for this example comes from the
Central Personnel Data File (CPDF) of the federal government
◦The OLAP tool used to build this site is from a company named Cognos (PowerPlay)
OLAP tools based on Excel◦ http://www.cubularity.com
![Page 41: DatawarehouseDataMiningNotes](https://reader035.vdocument.in/reader035/viewer/2022081518/54933585b47959ca038b45ae/html5/thumbnails/41.jpg)
Linkage between elements◦ spreadsheet - between cells in same table◦ DBMS - between elements in different tables
Orientation◦ spreadsheet is toward calculations◦ DBMS is tilted toward organization and
linkage of data elements in different tables Capabilities
◦ DBMS has extensive querying and reporting power
◦ spreadsheet is limited Memory requirements
◦ entire spreadsheet table must be in memory◦ not true for the database table