
DATA MINING AND DATA WAREHOUSING

What motivated data mining? Why is it important?
* Huge volumes of data
* Major sources of abundant data:
  - Business – Web, e-commerce, transactions, stocks
  - Science – remote sensing, bioinformatics, scientific simulation
  - Society and everyone – news, digital cameras, YouTube
* Need for turning data into knowledge – "Drowning in data, but starving for knowledge"
* Applications that use data mining:
  - Market analysis
  - Fraud detection
  - Customer retention
  - Production control
  - Scientific exploration

What is Data Mining?
• Extracting and 'mining' knowledge from large amounts of data.
• "Knowledge mining from data" is analogous to "gold mining from rock or sand".
• Other terms for data mining:
  o Knowledge mining
  o Knowledge extraction
  o Pattern analysis
  o Data archeology
  o Data dredging
• Data mining is not the same as KDD (Knowledge Discovery from Data)
• Data mining is a step in KDD

Steps of the KDD process:
Data Cleaning – remove noisy and inconsistent data
Data Integration – combine multiple data sources
Data Selection – retrieve data relevant to the analysis
Data Transformation – transform data into a form suitable for mining (summarized / aggregated)
Data Mining – extract data patterns using intelligent methods
Pattern Evaluation – identify interesting patterns
Knowledge Presentation – visualization / knowledge representation – presenting mined knowledge to the user


Relation to Statistics:

• Statistics – "learning from data" or "turning data into information".
• Data – crude information – does not make sense by itself – what we capture and store
  e.g. customer data, store data, demographic data, geographic data
• Information – relates items of data – relevant to the decision problem
  e.g. X lives in Z; S is Y years old; X and S moved; W has money in Z
• Facts – information becomes fact when the data can support it
• Knowledge – what we know or infer – relates items of information
  e.g. a quantity Q of product A is used in region Z; customers of class L use N% of C in period D
• Data mining lies at the interface between statistics, computer science, artificial intelligence, machine learning, database management, data visualization, ...
• "Data mining is the application of statistics to reveal patterns and trends in very large data sets."
• Data mining can learn from statistics; statistics is fundamental to data mining.
• Data mining will not become knowledge discovery without statistical thinking.
• Statistics will not be able to succeed on massive and complex datasets without data mining.

1.2 Databases

Database System – Evolutionary Path:
• Progress in hardware technology -> led to powerful, affordable computers.
• New data repository architecture -> led to data warehouses (multiple heterogeneous data sources under a single schema).
  o Warehouses facilitate management decision making.
  o Warehousing includes data cleaning, data integration and OLAP (Online Analytical Processing).
  o OLAP provides summarization, consolidation, aggregation and different-angle / multidimensional analysis for decision making.
• However, in-depth analysis requires additional data analysis tools.
• "Data rich but information poor" situation.
• Expert systems – rely on domain experts for decision making, using their knowledge and intuition.
  o Time consuming, costly, error prone, biased.
• So the solution is to use data mining tools – they perform data analysis and find data patterns.


o Contributes to business strategies, knowledge bases, scientific & medical research

Data Mining – Confluence of Multiple Disciplines:
1. Databases  2. Data Warehousing  3. Statistics  4. Machine Learning  5. Information Retrieval  6. Image and Signal Processing  7. Pattern Recognition  8. Neural Networks  9. Data Visualization  10. Spatial / Temporal Data Analysis

Data Mining – On What Kinds of Data?

Database-oriented data sets and applications:
  o Relational databases, data warehouses, transactional databases
Advanced data sets and advanced applications:
  o Data streams and sensor data
  o Time-series data, temporal data, sequence data (incl. bio-sequences)
  o Structured data, graphs, social networks and multi-linked data
  o Object-relational databases
  o Heterogeneous databases and legacy databases
  o Spatial data and spatiotemporal data
  o Multimedia databases
  o Text databases
  o The World Wide Web

Relational Databases:

• Consists of a database (a collection of inter-related data) and a set of software programs to manage and access the data.

• Collection of tables


• Each table has a set of attributes (columns / fields) and large set of tuples (records or rows)

• Unique key – identifies a tuple.
• Data model used is the ER (Entity-Relationship) data model – a set of entities and their relationships.
• Accessed by database queries – written in SQL or built using a GUI.
• A query is transformed into join, selection and projection operations, and optimized for efficient processing.
• SQL includes aggregate functions (with GROUP BY): sum, avg, count, max, min.
• Applying data mining on relational databases:
  o Searches for trends or data patterns – e.g. predict the credit risk of new customers based on income, age and previous credit information.
  o Detects deviations – e.g. sales of particular items lower than expected compared with the previous year.
  o Deviations are further investigated for a reason – e.g. increase in price, change in packaging.

Data Warehouses:
• Data is spread in several databases physically located at numerous sites.
• A data warehouse is a repository of multiple databases under a single schema; it resides at a single site.
• Data warehousing processes:
  1. Data Cleaning  2. Data Integration  3. Data Transformation  4. Data Loading  5. Periodic data refreshing

• Data in a data warehouse are organized around major subjects • Data provide information on historical perspective – summarized on periodic dimension • Eg. Sales of an item for a region in a period • Data warehouse model – multidimensional database structure / data cube • Dimensions – Attributes / set of attributes • Facts – Aggregated measures (Count / Sales amount)

Data Mart & Data Warehouse – Difference

• Data Mart – a department-level subset of the data warehouse
• Data Warehouse – enterprise-wide scope, suited for OLAP
• DWH – presents data at different levels of abstraction; accommodates different user views


• OLAP Operations – Drill Down (Data at Month Level) & Roll Up (Data at Country Level)

• If in-depth analysis is required – use data mining tools and techniques.

Transactional Databases

• Consists of a file of records where each record is a transaction.
• Each transaction has a unique transaction ID and a list of items that make up the transaction.
• A transactional database may have additional tables associated with it; these tables contain other reference information such as the sale, date of transaction, customer ID, salesperson ID and branch of sale.
• Eg. "Show all items purchased by John"; "How many transactions include item number I3?"
• These queries require a full scan of the transactional database.
• But deeper analysis is often required in real time. Eg. "Which items sold well together?" – market basket data analysis groups / bundles items together as a strategy for maximizing sales.
• Eg. "Printers and computers sold together" – printers can then be offered at a discount rate.
• Such queries are not answered by the transactional database alone, so data mining techniques are applied to identify frequent itemset patterns.

Advanced Data and Information Systems and Advanced Applications

• Spatial data (maps)
• WWW (Internet)
• Engineering data (building design, circuit design, system components)
• Hypertext and multimedia data (text, image, video, audio)
• Time-related data (historical records or stock exchange data)
• Stream data (video, surveillance, sensors – data flows in and out)
• All these advanced data types require complex DB schema structures with dynamic changes.
• Hence we have advanced and application-oriented DB systems:
  o Object-relational DB systems
  o Temporal and time-series DB systems
  o Heterogeneous and legacy DB systems
  o Data stream management systems
  o Web-based information systems
• These raise many challenging research and implementation issues for data mining.

Object-Relational Databases

• Extend the relational model to handle complex objects – popular in industry applications.
• Each object includes:
  o Attributes / variables
  o Messages – communicate with other objects
  o Methods – code that implements a message and returns a value
  Eg. the message get_photo(employee) returns the photo of the employee object.
• Object class – objects with common properties; each object is an instance of a class.
• Class -> Subclass; Employee class -> Salesperson subclass.
• Inheritance – a subclass inherits the properties of its class plus additional properties specific to the subclass (Eg. commission – a property specific to the subclass).
• Data mining techniques need to be developed for handling complex object structures, classes, subclasses, inheritance etc.


Temporal Databases, Sequence Databases and Time-Series Databases Temporal Databases

• Stores time related attributes & has time stamps with different semantics. Sequence Databases

• Sequences of ordered events with or without time. • Eg. Customer Buying Sequence; Web Logs; Biological sequences

Time-Series Databases
• Sequences of ordered events over time.
• Eg. stock exchange, inventory control, temperature, wind, ...
• Data mining techniques are used to find trends of change of objects in such databases.
• Used in decision making and strategy planning, based on multiple granularities of time.
• Eg. mining banking data -> customer traffic prediction.
• Eg. mining stock exchange data -> investment planning strategy.

Spatial Databases and Spatiotemporal Databases: Spatial Databases:

• Spatially related information.
• Eg. geographic (map) databases, medical image databases, satellite image databases.
• Spatial data is represented in raster format as n-dimensional bit maps. Eg. a 2D satellite image represented as raster data, where each pixel represents the rainfall in a given area.
• Another representation is vector format, where roads, bridges, buildings and lakes are represented as points, lines or polygons.
• Eg. applications: forestry, ecology planning.
• What kind of data mining can be applied on spatial databases?

o "Houses located near a park"
o "Climate of hill areas at different altitudes"
o "Poverty rate based on city distance from major highways"
o "Spatial data cubes" can be constructed with multidimensional hierarchies – drill-down & roll-up.

Spatio-temporal Databases:

• Database with Spatial objects that change with time. • Eg. Outbreak of flu based on geographic location with respect to time.

Text Databases and Multimedia Databases: Text Databases:

• The database consists of word descriptions, keywords, sentences, paragraphs.
• Eg. product descriptions, bug reports, warning messages, summary reports, notes, documents.
• Text databases can be unstructured / semi-structured (e-mail messages, HTML/XML web pages) / well structured (library databases – can use relational databases).
• What can data mining on text databases uncover?
  o Keyword and content associations
  o Clusters within text objects
  o Data mining and information retrieval techniques can be integrated for better results
  o Uses hierarchies such as dictionaries and thesauri

Multimedia Databases:
• Stores image, audio and video data.
• Also called continuous media data.


• Applications – picture retrieval system, voice-mail system, video on demand systems, WWW, Speech based user interfaces

• Storage and search techniques can be integrated with data mining methods for efficiency • Can construct multimedia data cubes for similarity based pattern matching

Heterogeneous Databases and Legacy Databases: Heterogeneous Databases:

• Database has autonomous components that are interconnected where components communicate.

• Objects in different components differ – hence it is difficult to understand their semantics.

Legacy Databases:
• A legacy database has a long history of information, possibly stored on different hardware and operating systems.
• It is a group of heterogeneous databases connected by intra- or inter-computer networks.
• Information exchange across such databases is difficult because of diverse semantics.
• Data mining techniques provide a solution by performing statistical data distribution and correlation analysis.

Data Streams:

• Data flows in and out of the platform (or window) dynamically.
• Unique features:
  o Huge volume
  o Dynamically changing
  o Flows in and out in a fixed order
  o Allows only a small number of scans
  o Demands fast response time
• Eg. power supply, network traffic, telecommunications, web logs, video surveillance.
• There is ongoing research on data stream management systems; these systems use a continuous query model with pre-defined queries.
• Multidimensional, on-line analysis and mining can be performed on stream data.

World Wide Web:
• Online information services. Eg. Yahoo, Google, Wikipedia.
• Data objects are linked together to facilitate interactive access; users traverse from one object to another via links.
• Such data offers challenging opportunities for data mining.
• Understanding user access patterns allows better web design and better marketing strategy. This is called web usage mining or web log mining.
• Web pages are highly unstructured.
• Web page analysis – ranks web pages – helps in easy retrieval of relevant content pages when a keyword search is made.
• Web page clustering and classification.
• Web community analysis – uncovers the hidden behavior of groups of web page users.

1.3 Data Mining Functionalities

Broad view of Data Mining Functionality: Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses or other information repositories.


Data Mining Functionalities – What kinds of Patterns can be mined? • Two Categories – (i) Descriptive (ii) Predictive • Descriptive – describes the general properties of the data in the database • Predictive – Makes Predictions from the data • Data Mining Functionalities should allow:

o Mining of multiple kinds of patterns that accommodates different user expectations and applications

o Discover patterns at different levels of granularity o Hints / Specifications / Queries to focus the search for interesting patterns

• Each discovered pattern is measured for its “trustworthiness” based on data in the database.

• 1) Characterization:

o Concept description / Class description o Data is associated with concept or class o Eg. Classes of items for sale – (i) Computers (ii) Printers o Eg. Concepts of customers – (i) Big Spenders (ii) Budget Spenders o Class / Concept descriptions can be delivered via:

(1) Data Characterization (2) Data Discrimination (3) Data Characterization and Data Discrimination

o Data Characterization Summarization of general characteristics or features of a target class of

data Data specific to a class are collected by query Types of data summarization:

Summarization based on simple statistical measures Data summarization along a dimension – user controlled – OLAP

rollup Attribute oriented induction – without user interaction.

Output of data characterization can be represented in different forms: Pie Charts, Bar Charts, Curves Multidimensional Data cubes, Multidimensional tables Generalized relations – in rule form – called as “Characteristics Rules”

Eg. “Find Summarization of characteristics of customers who spend more than Rs. 50000 in shop S1 in a year”

Result = “Customers are 40–50 years old, employed and have high credit rating”

Users can drill down on any dimension – Eg. “Occupation of customers” o Data Discrimination

Comparison of general features of target class data objects with the general features of objects from one or a set of contrasting classes.

Target and Contrasting classes are specified by the users Data objects retrieved through database queries Eg. “Users wants to compare general features of S/W products whose

sales increased by 10% in the last year with those whose sales decreased by 30% during the same period.”

The output of data discrimination takes the same forms as the output of data characterization. The rule form is called "Discriminant Rules". Eg. Compare two groups of customers.

Group1 – Shops frequently – at least 2 times a month


Vs Group 2 – Shops rarely – less than 3 times a year

Result = “80% of frequent shopping customers are between 20-40 years old & have university education.” & “60% of infrequent shopping customers are seniors or youths with no university degree.”

Users can drill down on income level dimension for better discriminative features between the two classes of customers.

• 2) Mining Frequent Patterns, Associations and Correlation:

o Frequent Patterns – Patterns that occur frequently in data. o Many kinds of Frequent Patterns exists:

(1) Itemsets (2) Subsequences (3) Substructures
Frequent Itemsets: (Simple)
  Set of items that frequently appear together in a database. Eg. bread & jam.
Frequent Subsequences: (Advanced)
  Frequent sequential patterns. Eg. purchase PC -> purchase digital camera -> purchase memory card.

Frequent Structured Patterns: (Advanced) Structural forms that occur frequently Structural forms – Graphs, Trees, Lattices Result = Discovery of interesting associations and correlations within

data. o Eg. Association Analysis:

Example 1: “Find which items are frequently purchased together in the same transactions”.

Buys(X, "Computer") => Buys(X, "Software") [Support = 1%, Confidence = 50%]
X is a variable representing customers.
Confidence = the chance (%) that a customer buying a computer also buys software.
Support = the % of transactions in the whole database in which computers and software were purchased together.
This association rule has a single repeated predicate "Buys"; such rules are called "Single-Dimensional Association Rules".
Example 2: Age(X, "20…29") ^ Income(X, "20K…29K") => Buys(X, "CD

Player”) [Support = 2%, Confidence = 60%] Association Rule = “2% of total customers in the database are between 20-

29 years of age and with income Rs.20000 to Rs.29000 and have purchased CD player.” & “There is 60% probability that a customer in this age group and income group will purchase a CD player”

This is an association between more than one predicate (ie. Age, Income and Buys)

This is called as “Multidimensional Association Rule”. Association rules that do not satisfy minimum support threshold and

minimum confidence threshold are discarded. • 3) Classification and Prediction:

o Classification: Process of finding a model that describes data classes or concepts Based on a set of training data


This model can be represented in different forms Classification Rules Decision Trees Mathematical Formulae Neural Networks

Decision Trees Flowchart like tree structure Each Node = Test on the attribute value Each Branch = Outcome of the test Tree Leaves = Classes or class distributions Decision trees can be converted into classification rules

Neural Networks Collection of neuron-like processing units + weighted connections

between the units. Other methods of Classification

Naïve Bayesian Classification Support Vector Machines K-nearest neighbor Classification

Prediction is used to predict missing or unavailable numeric data values (whereas classification predicts categorical class labels).
Regression Analysis – a statistical methodology often used for numeric prediction.
Prediction also includes the identification of distribution trends based on the available data.
Classification and prediction may need to be preceded by Relevance Analysis.
Relevance Analysis – attempts to identify attributes that do not contribute to the classification or prediction process, so that they can be excluded.
Example – Classification and Prediction:

1) IF-THEN rules – Classification Model:

2) A Decision Tree – Classification Model:

3) A Neural Network Classification Model:
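As a small illustration of the first two model forms, the hypothetical sketch below trains a decision tree with scikit-learn and prints it as IF-THEN style rules; the training data and feature names are invented for this example and are not from the original notes.

```python
# Hypothetical illustration: a decision-tree classification model and its IF-THEN rules.
# Requires scikit-learn; the data and column names are made up for this sketch.
from sklearn.tree import DecisionTreeClassifier, export_text

# toy training data: [age, income] -> buys_computer (1 = yes, 0 = no)
X = [[25, 30000], [45, 60000], [35, 80000], [22, 20000], [50, 90000], [28, 45000]]
y = [0, 1, 1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# the tree can be read off as IF-THEN classification rules
print(export_text(clf, feature_names=["age", "income"]))

# classify a new customer
print(clf.predict([[40, 70000]]))   # predicted class label
```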


• 4) Cluster Analysis: o Analyzes data objects without consulting a known class label. o The objects are clustered or grouped based on the principles of “Maximizing the

intra-class similarity” and “Minimizing the inter-class similarity”. o Objects within a cluster have high similarity compared to objects in other clusters. o Each cluster formed is a class of objects. o From this class of objects rules can be derived. o Clustering allows “Taxonomy Formation” Hierarchy of classes that groups

similar events together. o Eg. Customers with respect to customer locations in a city.

o 3 Data Clusters; Cluster center marked with a ‘+’

• 5) Outlier Analysis:

o Data that do not comply with the general behavior of the data are called outliers.
o Most data mining methods discard outliers as noise or exceptions.
o In some applications, like fraud detection, rare events can be more interesting than regular ones.
o The analysis of such outliers is called Outlier Analysis / Outlier Mining.
o Outliers are detected using:

Statistical Methods Distance Measures Deviation Based Methods

Difference in characteristics of an object in a group o Example – Outlier Analysis:

Fraudulent usage of credit cards by detecting purchase of extremely large amount for a given credit card account compared to its general charges incurred.

Same applies for Type of purchase, Place of purchase, Frequency of purchase.

• 6) Evolution Analysis:

o Describes the trends of data whose behavior change over time. o This step includes:

Characterization & Discrimination Association & Correlation Analysis Classification & Prediction Clustering of time-related data Time-series data analysis Sequence or periodicity pattern matching Similarity based data analysis

o Example – Evolution Analysis: Stock exchange data for past several years available. You want to invest in TATA Steel Corp.


Data mining study / Evolution analysis on previous stock exchange data can help prediction of future trends in stock exchange prices.

This will help in decision making in stock investment.

1.4 Steps in Data Mining Process

Data Mining Task Primitives: • Data mining task can be specified in the form of a data mining query, which is input to

the data mining system. • Data mining query is defined in terms of data mining task primitives. • Data Mining task primitives are:

o Set of task relevant data to be mined. (relevant attributes / dimensions) o Kind of knowledge to be mined (kind of data mining functionality) o Background knowledge to be used in the discovery process. (knowledge base –

concept hierarchy, user beliefs) o Interestingness measures and thresholds for pattern evaluation (Interestingness

measure for association rules are ‘support’ and ‘confidence’) o Expected representation for visualizing the discovered patterns. (Tables, charts,

graphs, decision trees, cubes)


Are all of the Patterns Interesting?

• A Data Mining system can generate thousands or even millions of patterns or rules, but only few are interesting.

• What makes a pattern interesting? o A pattern is interesting if it is:

Easily understood by humans Valid on a test data (with some degree of certainty) Useful & Novel Validates a user defined hypothesis

• Interesting pattern = Knowledge
• Objective measures of Pattern Interestingness:
  o An objective measure for association rules of the form X => Y is support.
  o Support = "% of transactions from the database that satisfy the given rule"
  o Support(X => Y) = P(X U Y)
  o Confidence = "assesses the degree of certainty of the association rule"
  o Confidence(X => Y) = P(Y | X) [i.e. the probability that a transaction containing X also contains Y]
  o Rules that do not satisfy a confidence threshold, say 50%, are uninteresting.
  o Rules below the threshold are noise / exceptions.
• Subjective Interestingness Measures: (based on the user's belief in the data)
  Unexpected (contradicts the user's belief)
  Expected (confirms the user's belief)
  Actionable (the user can act on it)
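To make the two objective measures concrete, the short sketch below (illustrative only, with made-up transactions) counts support and confidence for a rule X => Y exactly as defined above: support is the fraction of all transactions containing both X and Y, and confidence is the fraction of transactions containing X that also contain Y.

```python
# Illustrative support/confidence computation for an association rule X => Y.
# The transactions are invented for this example.
transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer", "software", "paper"},
]

def support_confidence(X, Y, transactions):
    n = len(transactions)
    contains_x = sum(1 for t in transactions if X <= t)           # transactions with X
    contains_both = sum(1 for t in transactions if (X | Y) <= t)  # transactions with X and Y
    support = contains_both / n
    confidence = contains_both / contains_x if contains_x else 0.0
    return support, confidence

sup, conf = support_confidence({"computer"}, {"software"}, transactions)
print(f"support = {sup:.0%}, confidence = {conf:.0%}")   # support = 60%, confidence = 75%
```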

• Completeness of Data Mining Algorithms: o Can a data mining algorithm generate all of the interesting patterns?

If so it is unrealistic and inefficient. o Instead user provided specification can confine the search on interesting patterns

• If a data mining algorithm produces only interesting patterns it is highly desirable & efficient, but it is a challenge in Data Mining Domain.

1.5 Architecture of Typical Data Mining Systems

Architecture of a typical Data Mining System – Major Components:


Knowledge Base:
• Domain knowledge used to guide the search and to evaluate the interestingness of patterns.
• Includes concept hierarchies, user beliefs, thresholds, metadata etc.

Database / Data warehouse Server: • Responsible for fetching relevant data based on data mining request.

Data Mining Engine: • Consists of modules for characterization, association, correlation analysis, classification,

cluster analysis, prediction, outlier analysis and evolution analysis. Pattern Evaluation Module:

• Interacts with the data mining modules and focuses the search towards interesting patterns.
• The pattern evaluation module may be integrated with the mining module to confine the search.

User Interface: • Communicates between users and data mining system • Specifies data mining query – to focus search • Uses intermediate data mining results to perform exploratory data mining. • Browse database / data warehouse • Evaluate mined patterns o Visualize patterns in different forms

1.6 Classification of Data Mining Systems

Classification of Data Mining Systems • Data Mining is an inter-disciplinary field. • It is a confluence of a set of disciplines:

1. Database Systems 2. Statistics 3. Machine Learning 4. Visualization 5. Neural Networks 6. Fuzzy Logic 7. Rough Set Theory 8. Knowledge Representation 9. Spatial Data Analysis 10. Information Retrieval 11. Pattern Recognition 12. Image Processing 13. Signal Processing 14. Computer Graphics 15. Web Technology 16. Economics 17. Business 18. Bio-informatics 19. Psychology

• Generates large variety of data mining systems. • Need for classification of Data Mining Systems => users can choose based on their

needs. • Data Mining Systems Classification based on:


1) Classification according to the kinds of databases mined: This classification is based on different criteria:

Data Models Relational mining systems Transactional mining systems Object-relational mining systems Data Warehouse mining systems

Type of Data Spatial Data Time-Series Data Text Data Stream Data

Type of Application Multimedia Data Mining Systems World Wide Web mining Systems

2) Classification according to the kinds of knowledge mined: Based on the data mining functionalities:

Characterization & Discrimination Association & Correlation Analysis Classification & Prediction Clustering, Outlier Analysis, Evolution Analysis

Based on the granularity levels of abstraction of knowledge mined Generalized knowledge (High Level of Abstraction) Primitive Level Knowledge (Raw data level) Knowledge at multiple levels

Data mining systems that mine data regularities (Common data patterns) Vs Data mining systems that mine data irregularities (Outliers / Exceptions).

3) Classification according to the kinds of techniques utilized: Based on the degree of user interactions involved:

Autonomous Systems Interactive Exploratory Systems Query Driven Systems

Based on the methods of data analysis involved:
  Database-oriented methods of data analysis
  Data warehouse-oriented methods of data analysis
  Machine learning, statistics, visualization, pattern recognition, neural networks
  A sophisticated data mining system integrates such techniques effectively.

4) Classification according to the applications adapted: Finance Telecommunication DNA Stock Market E-mail Integrated Application specific systems

1.7 Overview of Data Mining Techniques

Major Issues in Data Mining: • Mining Methodology Issues:

o Mining different kinds of knowledge in databases. o Incorporation of background knowledge


o Handling noisy or incomplete data o Pattern Evaluation – Interestingness Problem

• User Interaction Issues: o Interactive mining of knowledge at multiple levels of abstraction o Data mining query languages and ad-hoc data mining. o Presentation and visualization of data mining results.

• Performance Issues: o Efficiency and Scalability of Data Mining Algorithms. o Parallel, distributed and incremental mining algorithms.

• Issues related to diversity of data types: o Handling of relational and complex types of data. o Mining information from heterogeneous databases and global information

systems.

Data Mining Query Language: • DMQL – adopts SQL like syntax o DMQL can be easily integrated with SQL • Example: “Want to classify customers whose salary >= Rs.40,000

And who have purchased for >= Rs.1,000 of which the price of each item >= Rs.100. We are interested in customer’s age, income, types of items purchased, Purchase location, where items were made.” Resulting classification is in rule form.

Use database store_db
Use hierarchy location_hierarchy for T.branch, age_hierarchy for C.age
Mine classification as promising_customers
In relevance to C.age, C.income, I.type, I.place_made, T.branch
From customer C, item I, transaction T
Where I.item_ID = T.item_ID
  And C.cust_ID = T.cust_ID
  And C.income >= 40000
  And I.price >= 100
Group By T.cust_ID Having sum(I.price) >= 1000
Display as rules

• We want to generate a classification model for “promising_customers” Vs “non_promising_customers”

• Attribute may be specified as Class Label Attribute whose values explicitly represent the classes.

• Specified data are retrieved and assigned as “Promising_customers” and the remaining data in the database are assigned as “non-promising_customers”

• Other data mining language and standardization efforts are:
  o Microsoft's OLE DB for Data Mining, which includes DMX (Data Mining Extensions), an SQL-like data mining language.
  o PMML – Predictive Model Markup Language (XML-based)
  o CRISP-DM – CRoss-Industry Standard Process for Data Mining

Integration of a Data Mining System with a Database or Data Warehouse System:


Review Questions

Two Marks:


1. What motivated Data Mining? Why is it important? 2. What is Data Mining? 3. List the kinds of data upon which Data Mining can be done. 4. What is the difference between Data Warehouse and Data Mart? 5. Write about the relation of Statistics with Data Mining. 6. What makes a pattern interesting? 7. Write about the objective measures of pattern interestingness?

Sixteen Marks:

1. (i) Describe the Database System Evolutionary Path. (8) (ii) Explain the steps in the Knowledge Discovery Process. (8)

2. (i) Detail on the Architecture of Data Mining Systems with a suitable diagram. (8) (ii) Explain about the data warehousing process. (8)

3. Detail on the various kinds of data upon which data mining can be done. (16) 4. Explain about various Data Mining functionalities. (16) 5. (i) List down and explain the classifications of data mining systems. (6)

(ii) Discuss about the major issues in data mining. (5) (iii) Write about the integration of data mining system with the database

or data warehouse system. (8)

Assignment Topic:

1. Explain on the Data Mining Query Language.


2.1 Data Pre-processing

Why Data Pre-processing? Data in the real world is dirty. That is it is incomplete or noisy or inconsistent.

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = " "
Noisy: containing errors or outliers
  e.g., Salary = "-10"
Inconsistent: containing discrepancies in codes or names
  e.g., Age = "42" but Birthday = "03/07/1997"
  e.g., was rating "1, 2, 3", now rating "A, B, C"
  e.g., discrepancy between duplicate records

Why Is Data Dirty?

Data is dirty because of the below reasons.

Incomplete data may come from:
  "Not applicable" data values when collected
  Different considerations between the time the data was collected and the time it is analyzed
  Human / hardware / software problems
Noisy data (incorrect values) may come from:
  Faulty data collection instruments
  Human or computer errors at data entry
  Errors in data transmission
Inconsistent data may come from:
  Different data sources
  Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning.

Why Is Data Pre-processing Important?

Data Pre-processing is important because:

No quality data, no quality mining results!
Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
A data warehouse needs consistent integration of quality data.
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.

Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view has the following properties:

Accuracy Completeness Consistency Timeliness


Believability Value added Interpretability Accessibility

Broad categories:

The above properties are broadly categorized into: Intrinsic Contextual Representational Accessibility

Major Tasks in Data Pre-processing
The major tasks in data pre-processing are data cleaning, data integration, data transformation, data reduction and data discretization.
Data cleaning – filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration – integration of multiple databases, data cubes, or files.
Data transformation – normalization and aggregation.
Data reduction – obtaining a reduced representation of the data in volume that produces the same or similar analytical results.
Data discretization – part of data reduction, but of particular importance for numerical data.

Forms of Data Pre-processing


2.2 Data Cleaning

Data Cleaning

Importance of Data Cleaning “Data cleaning is one of the three biggest problems in data warehousing”—Ralph

Kimball “Data cleaning is the number one problem in data warehousing”—DCI survey

Data cleaning tasks are:

Filling in missing values Identifying outliers and smoothing out noisy data Correcting inconsistent data Resolving redundancy caused by data integration

Missing Data

Eg. Missing customer income attribute in the sales data

Methods of handling missing values: a) Ignore the tuple

1) When the attribute with missing values does not contribute to any of the classes or has missing class label.

2) Effective only when more number of missing values are there for many attributes in the tuple.

3) Not effective when only few of the attribute values are missing in a tuple.

b) Fill in the missing value manually


1) This method is time consuming 2) It is not efficient 3) The method is not feasible

c) Use of a Global constant to fill in the missing value 1) This means filling with “Unknown” or “Infinity” 2) This method is simple

3) This is not recommended generally d) Use the attribute mean to fill in the missing value

That is, take the average of all existing income values and fill in the missing income value.

e) Use the attribute mean of all samples belonging to the same class as that of the given tuple.

Say, there is a class “Average income” and the tuple with the missing value belongs to this class and then the missing value is the mean of all the values in this class.

f) Use the most probable value to fill in the missing value This method uses inference based tools like Bayesian Formula, Decision tree etc.

Noisy Data
Apply data smoothing techniques like the ones given below:
a) Binning Methods

Simple Discretization Methods: Binning
Equal-width (distance) partitioning
  Divides the range into N intervals of equal size (uniform grid).
  If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
  The most straightforward method, but outliers may dominate the presentation; skewed data is not handled well.
Equal-depth (frequency) partitioning
  Divides the range into N intervals, each containing approximately the same number of samples.
  Good data scaling; managing categorical attributes can be tricky.

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
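A short sketch reproducing the equal-frequency binning and the two smoothing variants above (plain Python, no extra libraries):

```python
# Equal-frequency binning with smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # already sorted
depth = 4                                                  # values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# smoothing by bin means (rounded to match the worked example)
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# smoothing by bin boundaries: replace each value by the closer of min/max of its bin
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```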


b) Regression – fitting the data to a linear mathematical equation (converging them to a single line) and identifying the outlying data.
c) Clustering – forming clusters and identifying the outlying data.

Worked example:
Unsorted data: 21, 25, 34, 8, 4, 15, 21, 24, 28, 29, 9, 26
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
a) Equal-width partitioning / distance partitioning
   Lowest value of the attribute = 4; highest value of the attribute = 34
   Width of the interval (taking N = 12) = (34 − 4) / 12 = 30 / 12 = 2.5
   Advantage: straightforward. Disadvantage: skewed data not handled.
b) Equal-depth partitioning / frequency partitioning
   Fix N intervals; divide the sample into 3 intervals.
   Advantage: good data scaling. Disadvantage: managing categorical data is difficult.

Inconsistent Data
a) Corrected manually using external references
b) Develop code functions that can correct data inconsistencies
c) Use available tools to correct data inconsistencies – known functional dependencies between attributes are used here

Data Cleaning as a Process
Data cleaning is not only about handling missing data and noisy data; it is a larger process.
1) The first step in data cleaning is Discrepancy Detection
  a) Using metadata – data about data: attribute value domain, acceptable values for an attribute, range of attribute values, dependencies between attributes. To detect discrepancies, use code or a tool.
  b) Using field overloading – Eg. 20004 entered instead of 2004
  c) Using the unique rule
  d) Using the consecutive rule
  e) Using the null rule – how to fill in field values that have blanks, '?', special characters, ...
Commercial tools to aid in discrepancy detection:

(i) Data Scrubbing Tool - Uses simple domain knowledge to detect & correct errors

(ii) Data Auditing Tool - Analyzes data to see if it satisfies rules &

relationships - Uses Statistical analysis to find correlations - Uses clustering to find outliers

2) Second Step in Data Cleaning is Data Transformation


- Method of correcting the identified data discrepancies. Commercial tools to aid in Data Transformation:

(i) Data Migration Tool: - Simple Transformations - Eg. Replace “Gender” by “Sex”

(ii) Extraction Transformation and Loading Tools (ETL): - Users specify transformations through GUI

These two steps of data cleaning iterate until the user is satisfied. This process is error prone and time consuming. Potter's Wheel is a publicly available data cleaning tool.

2.3 Integration

Data Integration
- Combines data from multiple sources into a single store.
- Includes multiple databases, data cubes or flat files.
Schema integration
- Integrates metadata from different sources. Eg. A.cust_id = B.cust_no
Entity identification problem
- Identify real-world entities from different data sources. Eg. the pay_type field in one data source can take the values 'H' or 'S', while in another data source it can take the values 1 or 2.
Detecting and resolving data value conflicts:
- For the same real-world entity, the attribute value can be different in different data sources.
- Possible reasons: different interpretations, different representations and different scaling.
- Eg. sales amount represented in Dollars (USD) in one data source and in Pounds (£) in another data source.

Handling Redundancy in data integration: - When we integrate multiple databases data redundancy occurs - Object Identification – Same attributes / objects in different data sources

may have different names. - Derivable Data – Attribute in one data source may be derived from

Attribute(s) in another data source Eg. Monthly_revenue in one data source and Annual revenue in another data source.

- Such redundant attributes can be detected using Correlation Analysis - So, Careful integration of data from multiple sources can help in

reducing or avoiding data redundancy and inconsistency which will in turn improve mining speed and quality.

Correlation Analysis – Numerical Data:
- Formula for the correlation coefficient (Pearson's product moment coefficient):

  r(A,B) = Σ (a_i − mean(A)) (b_i − mean(B)) / ((n − 1) σA σB)
         = (Σ (a_i b_i) − n · mean(A) · mean(B)) / ((n − 1) σA σB)


- Where n = number of tuples; mean(A) and mean(B) are the respective means of A & B; σA and σB are the respective standard deviations of A & B; Σ(AB) is the sum of the cross-product of A & B.

- If the correlation coefficient between attributes A & B is positive, then they are positively correlated; that is, if A's value increases, B's value also increases.
- The larger the correlation coefficient, the stronger the correlation.
- If the correlation coefficient between attributes A & B is zero, then they are independent attributes.
- If the correlation coefficient is negative, then they are negatively correlated.

(Figures: scatter plots of two attributes A and B illustrating positive correlation, negative correlation, and no correlation.)
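A small numerical sketch of the correlation coefficient (illustrative attribute values, using numpy only):

```python
# Pearson correlation coefficient between two attributes A and B (illustrative data).
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.9, 6.1, 7.8, 10.2])

n = len(A)
# direct use of the formula: r = (sum(AB) - n*mean(A)*mean(B)) / ((n-1)*stdA*stdB)
r = (np.sum(A * B) - n * A.mean() * B.mean()) / ((n - 1) * A.std(ddof=1) * B.std(ddof=1))
print(round(r, 4))

# cross-check with numpy's built-in estimate
print(round(np.corrcoef(A, B)[0, 1], 4))
```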


Correlation Analysis – Categorical Data: - Applicable for data where values of each attribute are divided into different categories.

- Use the Chi-square test:

  Χ² = Σ [ (Observed − Expected)² / Expected ]

- The higher the Χ² value, the more likely the attributes are related.
- The cells that contribute the most to the Χ² value are the ones whose observed frequency is very different from their expected frequency.
- The expected frequency is calculated from the data distribution in the two categories of the attributes.
- Consider two attributes A & B; the values of A are divided into categories Ai, and the values of B into categories Bj.
- The expected frequency of (Ai, Bj) is:

Eij = (Count(Ai) * Count(Bj)) / N

                          | Play chess | Not play chess | Sum (row)
 Like science fiction     | 250 (90)   | 200 (360)      | 450
 Not like science fiction | 50 (210)   | 1000 (840)     | 1050
 Sum (col.)               | 300        | 1200           | 1500

- Eg. Consider a sample population of 1500 people who are surveyed to see whether they play chess and whether they like science fiction books.
- The counts given within parentheses are the expected frequencies; the other counts are the observed frequencies.
- For example, the expected frequency for the cell (Play chess, Like science fiction) is:
  = (Count(Play chess) * Count(Like science fiction)) / Total sample population
  = (300 * 450) / 1500 = 90

Χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

- This shows that the categories Play Chess and Like Science Fiction are strongly correlated.
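The same Χ² computation as a small sketch (plain Python, using the observed counts from the table above):

```python
# Chi-square test of independence for the Play chess / Like science fiction table above.
observed = [[250, 200],    # like science fiction: play chess, not play chess
            [50, 1000]]    # not like science fiction

row_totals = [sum(row) for row in observed]            # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]      # [300, 1200]
total = sum(row_totals)                                # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total   # e.g. 450*300/1500 = 90
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))   # 507.93
```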

2.4 Transformation


Data Transformation
Smoothing – removes noise from the data
Aggregation – summarization, data cube construction
Generalization – concept hierarchy climbing
Attribute / feature construction – new attributes constructed from the given ones
Normalization – data scaled to fall within a specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Data Transformation – Normalization:

Min-Max Normalization:
- Uses the formula: v' = ((v − min_A) / (max_A − min_A)) * (new_max_A − new_min_A) + new_min_A
- Eg: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to:
  ((73,600 − 12,000) / (98,000 − 12,000)) * (1.0 − 0) + 0 = 0.716

Z-Score Normalization:
- Uses the formula: v' = (v − µ_A) / σ_A
- Where µ_A is the mean and σ_A is the standard deviation of A.
- Eg. Let µ = 54,000 and σ = 16,000. Then 73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225

Normalization by Decimal Scaling:
- Moves the decimal point of the values of attribute A by a number of places that depends upon the maximum absolute value of A.
- That is, it uses the formula: v' = v / 10^j
- Where j is the smallest integer such that Max(|v'|) < 1
- Eg. A ranges from −986 to 917;
  - the maximum absolute value of A is 986
  - so divide each value of A by 1,000 (j = 3)
  - the normalized values of A now range from −0.986 to 0.917
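The three normalization methods as a small sketch (plain Python, reusing the worked numbers above):

```python
# Min-max, z-score and decimal-scaling normalization (values from the examples above).
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    j = len(str(int(max_abs)))      # gives the smallest suitable j for this example
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))                  # 0.716
print(round(z_score(73600, 54000, 16000), 3))                  # 1.225
print(decimal_scaling(-986, 986), decimal_scaling(917, 986))   # -0.986 0.917
```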

2.5 Reduction

Data Reduction Why Data Reduction?

- A database of data warehouse may store terabytes of data - Complex data analysis or mining will take long time to run on the complete

data set What is Data Reduction?

- Obtaining a reduced representation of the complete dataset


- Produces same result or almost same mining / analytical results as that of original.

Data Reduction Strategies: 1. Data cube Aggregation 2. Dimensionality reduction – remove unwanted attributes 3. Data Compression 4. Numerosity reduction – Fit data into mathematical models 5. Discretization and Concept Hierarchy Generation

1. Data Cube Aggregation:
- The lowest level of the data cube is called the base cuboid.
- Single-level aggregation – select a particular entity or attribute and aggregate based on that attribute. Eg. aggregate along 'Year' in sales data.
- Multiple levels of aggregation – aggregate along multiple attributes – further reduces the size of the data to analyze.
- When a query is posed by the user, use the appropriate level of aggregation or data cube to solve the task.
- Queries regarding aggregated information should be answered using the data cube whenever possible.
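A minimal sketch of single- and multiple-level aggregation with pandas; the sales table and column names are invented for illustration, standing in for a roll-up over a data cube.

```python
# Illustrative aggregation of a sales table, as a stand-in for data cube roll-up.
import pandas as pd

sales = pd.DataFrame({
    "year":   [2008, 2008, 2009, 2009, 2009],
    "branch": ["A", "B", "A", "B", "B"],
    "amount": [400, 350, 500, 420, 380],
})

# single-level aggregation: total sales per year
per_year = sales.groupby("year")["amount"].sum()

# multiple-level aggregation: total sales per (year, branch) -- a smaller cuboid
per_year_branch = sales.groupby(["year", "branch"])["amount"].sum()

print(per_year)
print(per_year_branch)
```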

2. Attribute Subset Selection Feature Selection: (attribute subset selection)

- The goal of attribute subset selection is to find the minimum set of Attributes such that the resulting probability distribution of data classes is as close as possible to the original distribution obtained using all Attributes.


- This will help to reduce the number of patterns produced and those patterns will be easy to understand

Heuristic Methods: (Due to exponential number of attribute choices) - Step wise forward selection - Step wise backward elimination - Combining forward selection and backward elimination - Decision Tree induction - Class 1 - A1, A5, A6; Class 2 - A2, A3, A4

3. Data Compression

- Compressed representation of the original data. - This data reduction is called as Lossless if the original data can be reconstructed from the compressed data without any loss of information. - The data reduction is called as Lossy if only an approximation of the original data can be reconstructed. - Two Lossy Data Compression methods available are:

o Wavelet Transforms o Principal Components Analysis 3.1 Discrete Wavelet Transform (DWT):

- Is a linear Signal processing technique - It transforms the data vector X into a numerically different vector X’. - These two vectors are of the same length. - Here each tuple is an n-dimensional data vector. - X = {x1,x2,…xn} n attributes - This wavelet transform data can be truncated. - Compressed Approximation: Stores only small fraction of strongest of

wavelet coefficients. - Apply inverse of DWT to obtain the original data approximation. - Similar to discrete Fourier transforms (Signal processing technique involving

Sines and Cosines) - DWT uses Hierarchical Pyramid Algorithm

o Fast Computational speed o Halves the data in each iteration 1. The length of the data vector should be an integer power of two.

(Padding with zeros can be done if required) 2. Each transform applies two functions:

a. Smoothing – sum / weighted average b. Difference – weighted difference


3. These functions are applied to pairs of data so that two sets of data of length L/2 is obtained.

4. Applies these two transforms iteratively until a user desired data length is obtained.
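A minimal sketch of the pyramid idea using the simple Haar smoothing/difference pair (pairwise averages and differences). This illustrates the iterative halving described above and is not a full DWT implementation; the input values are invented.

```python
# One-dimensional Haar-style wavelet transform via the pyramid algorithm.
# Input length must be a power of two (pad with zeros otherwise).
def haar_transform(data):
    data = list(data)
    coeffs = []
    while len(data) > 1:
        smooth = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]   # smoothing
        detail = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]   # difference
        coeffs = detail + coeffs        # keep the detail (difference) coefficients
        data = smooth                   # halve the data and iterate
    return data + coeffs                # overall average followed by details

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0, 0.0, -1.0, -1.0, 0.0] -- truncating the smallest
# coefficients gives a compressed approximation of the original vector
```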

3.2 Principal Components Analysis (PCA):

- Say, data to be compressed consists of N tuples and k attributes. - Tuples can be called as Data vectors and attributes can be called as dimensions. - So, data to be compressed consists of N data vectors each having k-dimensions. - Consider a number c which is very very less than N. That is c << N. - PCA searches for c orthogonal vectors that have k dimensions and that can best be used to represent the data. - Thus data is projected to a smaller space and hence compressed. - In this process PCA also combines the essence of existing attributes and produces a smaller set of attributes. - Initial data is then projected on to this smaller attribute set. Basic Procedure:

1. The input data is normalized: all attribute values are mapped to the same range.
2. Compute k orthonormal vectors called principal components. These are unit vectors perpendicular to each other. The input data is then a linear combination of the principal components.

3. Principal Components are ordered in the decreasing order of “Significance” or strength.

4. Size of the data can be reduced by eliminating the components with less “Significance” or the weaker components are removed. Thus the Strongest Principal Component can be used to reconstruct a good approximation of the original data.

- PCA can be applied to ordered & unordered attributes, sparse and skewed data. - It can also be applied on multi dimensional data by reducing the same into 2 dimensional data. - Works only for numeric data.
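A minimal PCA sketch with scikit-learn (random data generated for illustration); it normalizes the input and projects the k-dimensional tuples onto the c strongest principal components, as described above.

```python
# Illustrative PCA-based dimensionality compression with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # N = 100 tuples, k = 5 attributes

X_norm = StandardScaler().fit_transform(X)    # step 1: normalize the input data

pca = PCA(n_components=2)                     # keep only c = 2 strongest components
X_reduced = pca.fit_transform(X_norm)         # projected (compressed) data

print(X_reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_)          # "significance" of each component
```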

4. Numerosity Reduction

- Reduces the data volume by choosing smaller forms of data representations. - Two types – Parametric, Non-Parametric. - Parametric – Data estimated into a model

– only the data parameters stored and not the actual data.

(Figure: Principal Component Analysis – principal components Y1 and Y2 of data plotted against the original axes X1 and X2.)


- Stored data includes outliers also - Eg. Log-Linear Models

- Non-Parametric – Do not fits data into models - Eg. Histograms, Clustering and Sampling

4.1 Regression and Log-Linear Models:
- Linear regression – data are modeled to fit a straight line.
- That is, the data can be modeled with the equation y = α + βx.
- Here y is called the "response variable" and x the "predictor variable".
- α (alpha) and β (beta) are called the regression coefficients: α is the Y-intercept and β is the slope of the line.
- These regression coefficients can be solved for using the method of least squares.
- Multiple regression – an extension of linear regression in which the response variable y is modeled as a linear function of a multidimensional (predictor) vector.
- Log-Linear Models: estimate the probability of each cell in a base cuboid for a set of discretized attributes.
- In this, higher-order data cubes are constructed from lower-order data cubes.
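A short sketch of the method of least squares for the line y = α + βx; the closed-form slope/intercept formulas and the toy data are illustrative.

import numpy as np

def fit_line(x, y):
    """Method of least squares for y = alpha + beta * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    beta = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope
    alpha = y.mean() - beta * x.mean()                  # y-intercept
    return alpha, beta

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
alpha, beta = fit_line(x, y)
print(alpha, beta)   # only these two parameters need to be stored, not the raw data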

4.2 Histograms: - Uses binning to distribute the data. - A histogram for an attribute A partitions the data of A into disjoint subsets, or buckets. - Buckets are shown along the horizontal axis of the histogram; the vertical axis represents the frequency of values in each bucket. - Singleton bucket – has only one attribute value / frequency pair. - Eg. Consider the list of prices (in $) of the sold items. 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15,

18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

- consider a bucket of uniform width, say $10.

- Methods of determining the bucket / partitioning the attribute values:

o Equi-Width: Width of each bucket is a constant o Equi-Depth: Frequency of each bucket is a constant o V-Optimal: Histogram with the least variance

Histogram Variance = Weighted sum of values in each bucket Bucket Weight = Number of values in the bucket.

o MaxDiff: Consider the difference between each pair of adjacent values. Bucket boundaries are placed between the pairs having the β−1 largest differences, where β (the number of buckets) is user specified.

o V-Optimal & MaxDiff are most accurate and practical. - Histograms can be extended for multiple attributes – Multidimensional histograms –

can capture dependency between attributes.


- Histograms of up to five attributes are found to be effective so far. - Singleton buckets are useful for storing outliers with high frequency.
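A small equi-width sketch over the price list given above, assuming NumPy; the $10 bucket width follows the example, while np.histogram's half-open binning is a detail of that library rather than of the notes.

import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# equi-width buckets of width $10 (NumPy bins are half-open [lo, hi) except the last)
counts, edges = np.histogram(prices, bins=[0, 10, 20, 30])
for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo}, {hi}): {n} values")   # frequency per bucket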

4.3 Clustering: - Considers data tuples as objects. - Partition objects into clusters. - Objects within a cluster are similar to one another and the objects in different

clusters are dissimilar. - Quality of a cluster is represented by its ‘diameter’

– maximum distance between any two objects in a cluster. - Another measure of cluster quality = Centroid Distance = Average distance of each

cluster object from the cluster centroid.
- The cluster representation of the data can be used to replace the actual data.
- Effectiveness depends on the nature of the data: effective for data that can be organized into distinct clusters, not effective if the data is 'smeared'.
- Hierarchical clustering of the data is also possible; for faster data access in such cases multidimensional index trees are used.
- There are many choices of clustering definitions and algorithms available.

[Figure: Clustering]

4.4 Sampling: - Can be used as a data reduction technique. - Selects random sample or subset of data. - Say large dataset D contains N tuples. 1. Simple Random Sample WithOut Replacement (SRSWOR) of size n: - Draw n tuples from the original N tuples in D, where n<N. - The probability of drawing any tuple in D is 1/N. That is all tuples have equal chance 2. Simple Random Sample With Replacement (SRSWR) of size n: - Similar to SRSWOR, except that each time when a tuple is drawn from D it is recorded and replaced. - After a tuple is drawn it is placed back in D so that it can be drawn again.


3. Cluster Sample: - Tuples in D are grouped into M mutually disjoint clusters. - Apply SRS (SRSWOR / SRSWR) to each cluster of tuples. - Each page from which tuples are fetched can be considered as a cluster.

4. Stratified Sample: - D is divided into mutually disjoint strata.

- Apply SRS (SRSWOR / SRSWR) to each Stratum of tuples. - In this way the group having the smallest number of tuples

is also represented.
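A brief sketch of SRSWOR, SRSWR and a stratified sample using NumPy's random generator; the dataset of 1000 integer "tuples", the sample sizes and the four equal strata are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
D = np.arange(1000)                     # stand-in for N = 1000 tuples

# SRSWOR: draw n tuples without replacement (each tuple can be drawn at most once)
srswor = rng.choice(D, size=50, replace=False)

# SRSWR: each drawn tuple is recorded and placed back, so it may be drawn again
srswr = rng.choice(D, size=50, replace=True)

# Stratified sample: split D into mutually disjoint strata, then apply SRS per stratum
strata = np.array_split(D, 4)           # 4 strata, illustrative only
stratified = np.concatenate([rng.choice(s, size=10, replace=False) for s in strata])
print(len(srswor), len(srswr), len(stratified))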


2.6 Discretization and Concept Hierarchies

Data Discretization and Concept Hierarchy Generation: Data Discretization Technique:

- Reduces the number of attribute values - Divides the attribute values into intervals - Interval labels are used to replace the attribute values - Result – Data easy to use, concise, Knowledge level representation of data

Types of Data Discretization Techniques: 1. Supervised Discretization

a. Uses Class information of the data 2. Unsupervised Discretization

a. Does not use class information of the data 3. Top-down Discretization (splitting)

a. Identifies ‘Split-Points’ or ‘Cut-Points’ in data values b. Splits attribute values into intervals at split-points c. Repeats recursively on resulting intervals d. Stops when specified number of intervals reached or some stop criteria is reached.

4. Bottom-up Discretization (merging) a. Divide the attribute values into intervals where each interval has a distinct

attribute value. b. Merge two intervals based on some merging criteria c. Repeats recursively on resulting intervals d. Stops when specified number of intervals reached or some stop criteria is reached.

• Discretization results in – Hierarchical Partitioning of Attributes = Called as Concept Hierarchy

• Concept Hierarchy used for Data Mining at multiple levels of abstraction. • Eg. For Concept Hierarchy – Numeric values for the attribute Age can be replaced with

the class labels ‘Youth’, ‘Middle Aged’ and ‘Senior’


• Discretization and Concept Hierarchy are pre-processing steps for Data Mining • For a Single Attribute multiple Concept Hierarchies can be produced to meet various user

needs. • Manual Definition of concept Hierarchies by Domain experts is a tedious and time

consuming task. • Automated discretization methods are available. • Some Concept hierarchies are implicit at the schema definition level and are defined

when the schema is being defined by the domain experts. • Eg of Concept Hierarchy using attribute ‘Age’ • Interval denoted by (Y,X] Value Y (exclusive) and Value X (inclusive)

Discretization and Concept Hierarchy Generation for Numeric Data:
- Concept hierarchy generation for numeric data is a difficult and tedious task, as numeric attributes have a wide range of data values and undergo frequent updates in any database.
- Automated Discretization Methods:

o Binning o Histogram analysis o Entropy-based Discretization o X2-Merging (Chi-Merging) o Cluster Analysis o Discretization by Intuitive Partitioning

- These methods assume the data is in sorted order.
Binning:
- Top-down, unsupervised discretization technique – no class information used.
- A user-specified number of bins is used.
- Same technique as used for smoothing and numerosity reduction.
- Data discretized using the equi-width or equi-depth method.
- Replace each bin value by the bin mean or bin median.
- The same technique is applied recursively on the resulting bins or partitions to generate the concept hierarchy.
- Outliers are also fitted into separate bins / partitions / intervals.
Histogram Analysis:
- Unsupervised, top-down discretization technique.
- Data values split into buckets – equi-width or equi-frequency.
- Repeats recursively on the resulting buckets to generate multi-level concept hierarchies.
- Stops when the user-specified number of concept hierarchy levels is generated.
Entropy-Based Discretization:
- Supervised, top-down discretization method.
- Calculates and determines split-points; the value of attribute A that has minimum entropy is the split-point, and the data is divided into partitions at the split-points.
- Repeats recursively on the resulting partitions to produce the concept hierarchy of A.
- Basic Method:

o Consider a database D with many tuples, where A is the attribute to be discretized; a separate class-label attribute decides the class of each tuple.
o A value of A is chosen as the split-point – binary discretization:
Tuples with values of A <= split-point form D1; tuples with values of A > split-point form D2.

o Uses Class information. Consider there are two classes of tuples C1 and C2. Then the ideal partitioning should be that the first partition should have the class C1 tuples and the second partition should have the class C2 tuples. But this is unlikely.


o First partition may have many tuples of class C1 and few tuples of class C2 and Second partition may have many tuples of class C2 and few tuples of class C1.

o To obtain a perfect partitioning, the amount of Expected Information Requirement is given by the formula:
  InfoA(D) = (|D1| / |D|) · Entropy(D1) + (|D2| / |D|) · Entropy(D2)
o where, if there are m classes and pi is the probability of class i in D1,
  Entropy(D1) = − Σ (i = 1 to m) pi log2(pi)
o Select the split-point that has the minimum amount of Expected Information

Requirement. o Repeat this recursively on the resulting partitions to obtain the Concept

Hierarchy and stop when the number of intervals exceed the max-intervals (user specified)
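A minimal sketch of choosing one binary split-point by the minimum expected information requirement, following the Info/Entropy formulas above; the toy age/class data and the function names are illustrative.

import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_i p_i * log2(p_i) over the class labels in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Pick the attribute value whose binary split needs the least expected information."""
    best, best_info = None, float("inf")
    for split in sorted(set(values)):
        d1 = [l for v, l in zip(values, labels) if v <= split]
        d2 = [l for v, l in zip(values, labels) if v > split]
        if not d1 or not d2:
            continue
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(values)
        if info < best_info:
            best, best_info = split, info
    return best

ages = [23, 25, 30, 35, 40, 45, 52, 60]
buys = ['no', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'no']
print(best_split_point(ages, buys))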

X2 Merging (Chi-Merging): - Bottom-up Discretization; Supervised Discretization - Best neighboring intervals identified and merged recursively. - Basic concept is that adjacent intervals should have the same class distribution. If so

they are merged; otherwise they remain separate. o Initially, each distinct value of the numeric attribute is one interval. o Perform the X2 (chi-square) test on each pair of adjacent intervals. o Adjacent intervals with the least X2 value are merged to form a larger interval. o A low X2 value indicates a similar class distribution. o Merging is done recursively until a pre-specified stopping criterion is reached.

- Stop criteria determined by 3 conditions: o Stops when X2 value of every pair of adjacent intervals exceeds a pre-

specified significance level – set between 0.10 and 0.01 o Stops when number of intervals exceeds a pre-specified max interval (say 10

to 15) o Relative Class frequencies should be consistent within an interval. Allowed

level of inconsistency within an interval should be within a pre-specified threshold say 3%.

Cluster Analysis:
- Uses top-down or bottom-up discretization.
- Data values of an attribute are partitioned into clusters.
- Uses the closeness of data values and hence produces high-quality discretization results.
- Each cluster is a node in the concept hierarchy.
- In the top-down approach each cluster is further subdivided into sub-clusters to create lower-level clusters or concepts.
- In the bottom-up approach clusters are merged to create higher-level clusters or concepts.
Discretization by Intuitive Partitioning:
- Users like numerical value intervals to be uniform, easy to use, 'intuitive', natural.
- Cluster analysis may produce intervals such as ($53,245.78, $62,311.78].
- But intervals such as ($50,000, $60,000] are better than the above.
- 3-4-5 Rule:

o Partitions the given data range into 3 or 4 or 5 equi-width intervals o Partitions recursively, level-by-level, based on value range at most significant

digit.



o Real world data can be extremely high or low values which need to be considered as outliers. Eg. Assets of some people may be several magnitudes higher than the others

o Such outliers are handled separately in a different interval o So, majority of the data lies between 5% and 95% of the given data range. o Eg. Profit of an ABC Ltd in the year 2004. o Majority data between 5% and 95% - (-$159,876,$1,838,761] o MIN = -$351,976; MAX = $4,700,896; LOW = -$159,876; HIGH =

$1,838,761; o Most Significant digit – msd = $1,000,000; o Hence LOW’ = -$1,000,000 & HIGH’ = $2,000,000 o Number of Intervals = ($2,000,000 – (-$1,000,000))/$1,000,000 = 3.

Example of 3-4-5 Rule

o Hence intervals are: (-$1,000,000,$0], ($0,$1,000,000],

($1,000,000,$2,000,000] o LOW’ < MIN => Adjust the left boundary to make the interval smaller. o Most significant digit of MIN is $100,000 => MIN’ = -$400,000 o Hence first interval reduced to (-$400,000,$0] o HIGH’ < MAX => Add new interval ($2,000,000,$5,000,000] o Hence the Top tier Hierarchy intervals are: o (-

$400,000,$0],($0,$1,000,000],($1,000,000,$2,000,000],($2,000,000,$5,000,000]

o These are further subdivided as per 3-4-5 rule to obtain the lower level hierarchies.

o Interval (-$400,000,$0] is divided into 4 equi-width intervals o Interval ($0,$1,000,000] is divided into 5 equi-width intervals o Interval ($1,000,000,$2,000,000] is divided into 5 equi-width intervals o Interval ($2,000,000,$5,000,000] is divided into 3 equi-width intervals.


Concept Hierarchy Generation for Categorical Data: Categorical Data = Discrete data; Eg. Geographic Location, Job type, Product Item type Methods Used:

1. Specification of partial ordering of attributes explicitly at the schema level by users or Experts.

2. Specification of a portion of a hierarchy by explicit data grouping. 3. Specification of the set of attributes that form the concept hierarchy, but not their partial

ordering. 4. Specification of only a partial set of attributes.

1. Specification of a partial ordering of attributes explicitly at the schema level by the users or domain experts:
- Eg. The dimension 'Location' in a data warehouse has the attributes 'Street', 'City', 'State' and 'Country'.
- A hierarchical definition of these attributes is obtained by ordering them as:
- Street < City < State < Country at the schema level itself, by the user or expert.

2. Specification of a portion of the hierarchy by explicit data grouping: - Manual definition of concept hierarchy. - In real time large databases it is unrealistic to define the concept hierarchy for the entire database manually by value enumeration.

- But we can easily specify intermediate-level grouping of data - a small portion of hierarchy.

- For Eg. Consider the attribute State, where we can specify as below: - {Chennai, Madurai, Trichy} ⊂ (belongs to) Tamilnadu - {Bangalore, Mysore, Mangalore} ⊂ (belongs to) Karnataka

3. Specification of a set of attributes but not their partial ordering: - User specifies set of attributes of the concept hierarchy; but omits to specify their ordering - Automatic concept hierarchy generation or attribute ordering can be done in such cases. - This is done using the rule that counts and uses the distinct values of each attribute.

- The attribute that has the most distinct values is placed at the bottom of the hierarchy, and the attribute that has the fewest distinct values is placed at the top.
- This heuristic rule applies in most cases, but it fails for some.
- Users or experts can examine the generated concept hierarchy and perform manual adjustments.
- Eg. Concept hierarchy for the 'Location' dimension:
- Country (10); State (508); City (10,804); Street (1,234,567)
- Street < City < State < Country
- In this case the user need not modify the generated order / concept hierarchy.
- But this heuristic rule may fail for the 'Time' dimension.
- Distinct Years (100); Distinct Months (12); Distinct Days-of-week (7)
- So in this case the generated attribute ordering / concept hierarchy is:
- Year < Month < Days-of-week
- This is not correct, since it places Days-of-week at the top of the hierarchy. A sketch of this distinct-value heuristic follows.
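A small sketch of the distinct-value heuristic referenced above (most distinct values at the bottom, fewest at the top); the toy location table is an illustrative assumption, and ties or wrong orderings would need the manual adjustment the notes describe.

def order_by_distinct_values(table, attributes):
    """Heuristic: the attribute with the most distinct values goes to the bottom
    of the concept hierarchy, the one with the fewest to the top."""
    counts = {a: len({row[a] for row in table}) for a in attributes}
    return sorted(attributes, key=lambda a: counts[a])   # top of hierarchy first

table = [
    {"country": "India", "state": "Tamilnadu", "city": "Chennai",   "street": "Anna Salai"},
    {"country": "India", "state": "Tamilnadu", "city": "Chennai",   "street": "Mount Road"},
    {"country": "India", "state": "Tamilnadu", "city": "Madurai",   "street": "East Veli St"},
    {"country": "India", "state": "Karnataka", "city": "Bangalore", "street": "MG Road"},
]
print(order_by_distinct_values(table, ["street", "city", "state", "country"]))
# ['country', 'state', 'city', 'street']  -> Street < City < State < Country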

4. Specification of only partial set of attributes: - User may have vague idea of the concept hierarchy - So they just specify only few attributes that form the concept hierarchy.

- Eg. User specifies just the Attributes Street and City. - To get the complete concept hierarchy in this case we have to link these user specified attributes

with the data semantics specified by the domain experts. - Users have the authority to modify this generated hierarchy.


- The domain expert may have defined that the attributes given below are semantically linked

- Number, Street, City, State, Country. - Now the newly generated concept hierarchy by linking the domain expert

specification and the users specification will be that: - Number < Street < City < State < Country - Here the user can inspect this concept hierarchy and can remove the unwanted

attribute ‘Number’ to generate the new Concept Hierarchy as below: - Street < City < State < Country

2.8 Concept Description: Data Generalization and Summarization Based Characterization

Data generalization and summarization-based characterization Data and objects in databases often contain detailed information at primitive concept levels. For example, the item relation in a sales database may contain attributes describing low level item information such as item ID, name, brand, category, supplier, place made, and price. It is useful to be able to summarize a large set of data and present it at a high conceptual level. For example, summarizing a large set of items relating to Christmas season sales provides a general description of such data, which can be very helpful for sales and marketing managers. This requires an important functionality in data mining: data generalization.

Data generalization is a process which abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels. Methods for the efficient and flexible generalization of large data sets can be categorized according to two approaches: (1) the data cube approach, and (2) the attribute-oriented induction approach.

Data cube approach for data generalization In the data cube approach (or OLAP approach) to data generalization, the data for analysis are stored in a multidimensional database, or data cube. Data cubes and their use in OLAP for data generalization were described in detail. In general, the data cube approach "materializes data cubes" by first identifying expensive computations required for frequently-processed queries. These operations typically involve aggregate functions, such as count(), sum(), average(), and max(). The computations are performed, and their results are stored in data cubes. Such computations may be performed for various levels of data abstraction. These materialized views can then be used for decision support, knowledge discovery, and many other applications.

A set of attributes may form a hierarchy or a lattice structure, defining a data cube dimension. For example, date may consist of the attributes day, week, month, quarter, and year which form a lattice structure, and a data cube dimension for time. A data cube can store pre-computed aggregate functions for all or some of its dimensions. The precomputed aggregates correspond to specified group-by's of different sets or subsets of attributes.

Generalization and specialization can be performed on a multidimensional data cube by roll-up or drill-down operations. A roll-up operation reduces the number of dimensions in a data cube, or generalizes attribute values to higher level concepts. A drill-down operation does the reverse. Since many aggregate functions need to be computed repeatedly in data analysis, the


storage of precomputed results in a multidimensional data cube may ensure fast response time and offer flexible views of data from different angles and at different levels of abstraction.

The data cube approach provides an efficient implementation of data generalization, which in turn forms an important function in descriptive data mining. However, as pointed out earlier, most commercial data cube implementations confine the data types of dimensions to simple, nonnumeric data and of measures to simple, aggregated numeric values, whereas many applications may require the analysis of more complex data types. Moreover, the data cube approach cannot answer some important questions which concept description can, such as which dimensions should be used in the description, and at what levels the generalization process should reach. Instead, it leaves the responsibility of these decisions to the users.

In the next subsection, we introduce an alternative approach to data generalization called attribute-oriented induction, and examine how it can be applied to concept description. Moreover, we discuss how to integrate the two approaches, data cube and attribute-oriented induction, for concept description. Attribute-oriented induction The attribute-oriented induction approach to data generalization and summarization-based characterization was first proposed in 1989, a few years prior to the introduction of the data cube approach. The data cube approach can be considered as a data warehouse-based, precomputation-oriented, materialized view approach. It performs off-line aggregation before an OLAP or data mining query is submitted for processing. On the other hand, the attribute-oriented approach, at least in its initial proposal, is a relational database query-oriented, generalization-based, on-line data analysis technique. However, there is no inherent barrier distinguishing the two approaches based on on-line aggregation versus off-line precomputation. Some aggregations in the data cube can be computed on-line, while off-line precomputation of multidimensional space can speed up attribute-oriented induction as well. In fact, data mining systems based on attribute-oriented induction, such as DBMiner, have been optimized to include such off-line precomputation.

Let's first introduce the attribute-oriented induction approach. We will then perform a detailed analysis of the approach and its variations and extensions.

The general idea of attribute-oriented induction is to first collect the task-relevant data using a relational database query and then perform generalization based on the examination of the number of distinct values of each attribute in the relevant set of data. The generalization is performed by either attribute removal or attribute generalization (also known as concept hierarchy ascension). Aggregation is performed by merging identical, generalized tuples, and accumulating their respective counts. This reduces the size of the generalized data set. The resulting generalized relation can be mapped into different forms for presentation to the user, such as charts or rules.

1. Attribute removal is based on the following rule: If there is a large set of distinct values for an attribute of the initial working relation, but either (1) there is no generalization operator on the attribute (e.g., there is no concept hierarchy defined for the attribute), or (2) its


higher level concepts are expressed in terms of other attributes, then the attribute should be removed from the working relation.

What is the reasoning behind this rule? An attribute-value pair represents a conjunct in a generalized tuple, or rule. The removal of a conjunct eliminates a constraint and thus generalizes the rule. If, as in case 1, there is a large set of distinct values for an attribute but there is no generalization operator for it, the attribute should be removed because it cannot be generalized, and preserving it would imply keeping a large number of disjuncts which contradicts the goal of generating concise rules. On the other hand, consider case 2, where the higher level concepts of the attribute are expressed in terms of other attributes. For example, suppose that the attribute in question is street, whose higher level concepts are represented by the attributes (city, province_or_state, country). The removal of street is equivalent to the application of a generalization operator. This rule corresponds to the generalization rule known as dropping conditions in the machine learning literature on learning-from-examples.

2. Attribute generalization is based on the following rule: If there is a large set of distinct

values for an attribute in the initial working relation, and there exists a set of generalization operators on the attribute, then a generalization operator should be selected and applied to the attribute.

This rule is based on the following reasoning. Use of a generalization operator to generalize an attribute value within a tuple, or rule, in the working relation will make the rule cover more of the original data tuples, thus generalizing the concept it represents. This corresponds to the generalization rule known as climbing generalization trees in learning-from-examples.

Both rules, attribute removal and attribute generalization, claim that if there is a large set of

distinct values for an attribute, further generalization should be applied. This raises the question: how large is "a large set of distinct values for an attribute" considered to be?

Depending on the attributes or application involved, a user may prefer some attributes to remain at a rather low abstraction level while others are generalized to higher levels. The control of how high an attribute should be generalized is typically quite subjective. The control of this process is called attribute generalization control. If the attribute is generalized "too high", it may lead to over-generalization, and the resulting rules may not be very informative. On the other hand, if the attribute is not generalized to a "sufficiently high level", then under-generalization may result, where the rules obtained may not be informative either. Thus, a balance should be attained in attribute-oriented generalization.

There are many possible ways to control a generalization process. Two common approaches are described below.

The first technique, called attribute generalization threshold control, either sets one generalization threshold for all of the attributes, or sets one threshold for each attribute. If the number of distinct values in an attribute is greater than the attribute threshold, further attribute removal or attribute generalization should be performed. Data mining systems typically have a default attribute threshold value (typically ranging from 2 to 8), and should allow experts and users to modify the threshold values as well. If a user feels that the generalization reaches too high a level for a particular attribute, she can increase the threshold. This corresponds to drilling down along the attribute. Also, to further generalize


a relation, she can reduce the threshold of a particular attribute, which corresponds to rolling up along the attribute.

The second technique, called generalized relation threshold control, sets a threshold for the generalized relation. If the number of (distinct) tuples in the generalized relation is greater than the threshold, further generalization should be performed. Otherwise, no further generalization should be performed. Such a threshold may also be preset in the data mining system (usually within a range of 10 to 30), or set by an expert or user, and should be adjustable. For example, if a user feels that the generalized relation is too small, she can increase the threshold, which implies drilling down. Otherwise, to further generalize a relation, she can reduce the threshold, which implies rolling up.

These two techniques can be applied in sequence: first apply the attribute threshold control

technique to generalize each attribute, and then apply relation threshold control to further reduce the size of the generalized relation.

Notice that no matter which generalization control technique is applied, the user should be allowed to adjust the generalization thresholds in order to obtain interesting concept descriptions. This adjustment, as we saw above, is similar to drilling down and rolling up, as discussed under OLAP operations. However, there is a methodological distinction between these OLAP operations and attribute-oriented induction. In OLAP, each step of drilling down or rolling up is directed and controlled by the user; whereas in attribute-oriented induction, most of the work is performed automatically by the induction process and controlled by generalization thresholds, and only minor adjustments are made by the user after the automated induction.

In many database-oriented induction processes, users are interested in obtaining quantitative or statistical information about the data at different levels of abstraction. Thus, it is important to accumulate count and other aggregate values in the induction process. Conceptually, this is performed as follows. A special measure, or numerical attribute, that is associated with each database tuple is the aggregate function, count. Its value for each tuple in the initial working relation is initialized to 1. Through attribute removal and attribute generalization, tuples within the initial working relation may be generalized, resulting in groups of identical tuples. In this case, all of the identical tuples forming a group should be merged into one tuple. The count of this new, generalized tuple is set to the total number of tuples from the initial working relation that are represented by (i.e., were merged into) the new generalized tuple. For example, suppose that by attribute-oriented induction, 52 data tuples from the initial working relation are all generalized to the same tuple, T. That is, the generalization of these 52 tuples resulted in 52 identical instances of tuple T. These 52 identical tuples are merged to form one instance of T, whose count is set to 52. Other popular aggregate functions include sum and avg. For a given generalized tuple, sum contains the sum of the values of a given numeric attribute for the initial working relation tuples making up the generalized tuple. Suppose that tuple T contained sum(units sold) as an aggregate function. The sum value for tuple T would then be set to the total number of units sold for each of the 52 tuples. The aggregate avg (average) is computed according to the formula, avg = sum / count.
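A minimal sketch of merging identical generalized tuples while accumulating count, sum and avg, as described above; the generalized tuples and the column meanings are illustrative assumptions.

from collections import defaultdict

# initial working relation after attribute removal / generalization (illustrative)
generalized = [
    ("youth",  "undergraduate", 3),   # (age_group, status, units_sold)
    ("youth",  "undergraduate", 5),
    ("senior", "graduate",      2),
    ("youth",  "undergraduate", 4),
]

# merge identical generalized tuples, accumulating count and sum
agg = defaultdict(lambda: {"count": 0, "sum": 0})
for age_group, status, units in generalized:
    key = (age_group, status)
    agg[key]["count"] += 1
    agg[key]["sum"] += units

for key, m in agg.items():
    m["avg"] = m["sum"] / m["count"]          # avg = sum / count
    print(key, m)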

Generalized descriptions resulting from attribute-oriented induction are most commonly

displayed in the form of a generalized relation.

2.9 Mining Association Rules in Large Databases


Mining Association Rules from large databases: Basic Concepts and a Road Map:

Frequent Patterns: - Patterns that appear in a dataset frequently. - Eg. Milk and Bread are purchased together mostly Frequent Sequential Patterns: - Sequence of patterns that appear frequently in a dataset. - Eg. Purchase of Computer followed by purchase of Digital Camera followed by

purchase of Memory Card – this sequence of patterns is found frequently. Frequent Structural Patterns: - Structural forms such as graphs, trees and lattices that appear frequently in a dataset.

Frequent Pattern Mining: - Discovery of interesting associations & correlations between item sets in transactional

and relational databases. Market Basket Analysis – A Motivating Example:

- Analyzes customer buying habits by finding associations between different items that customers place in their “Shopping Baskets”.

- Discovery of such associations helps retailers to: o Develop Marketing Strategy o Increase Sales o Plan Shelf Space o Decide on which items to put on sale at reduced prices.

Eg. Sale on Printers allowed when Computers are purchased. - These patterns can be represented in the form of association rules.

o For Eg. The customers who purchase computers also tend to buy printers at the same time is represented in association rule as below:

Computer => Printers [Support = 2%; Confidence = 60%] - Support and Confidence are two measures of rule interestingness. - Support – reflects usefulness of discovered rule. - Confidence – reflects certainty of discovered rule. - Support = 2% => 2% of transactions satisfies the rule “Printers are purchased when

computers are sold” - Confidence = 60% => 60% of customers who purchased a computer also purchased

printers. - Association rules are interesting if they satisfy:

o Minimum Support Threshold Set by users or domain experts o Minimum Confidence Threshold Set by users or domain experts

Basic Concepts:

- Let I = {I1, I2, … In} = set of items - D = {T1, T2, … Tn} = set of transactions - Each Tj is a set of items such that Tj ⊆ I


- Each transaction has a unique identifier TID - Let A = set of items - Then T is said to contain A if and only if A is a subset of T - Association Rule is of the form:

o A => B, where A ⊂ I, B ⊂ I and A ∩ B = ∅
- The association rule A => B has support s in the transaction set D, where s is the % of transactions in D that contain A ∪ B (both A and B).
- The association rule A => B has confidence c in the transaction set D, where c is the % of transactions in D containing A that also contain B.
- That is: support(A => B) = P(A ∪ B) and confidence(A => B) = P(B | A).

- Association rules that satisfy both a minimum support threshold (min-sup) and

minimum confidence threshold (min-conf) are called Strong Association rules. - A set of items is referred to as an item set. - An item set that contains k items is called a k-item set. - The set {Computer, Printer} is a 2-itemset - Frequency / Support Count of an Item set:

o Number of transactions that contain the item set. - Frequent Item set:

o If an item set satisfies minimum support (min-support) then it is a frequent item set.

How are association rule mined from large databases? - Step1: Find all frequent item sets - Step 2: Generate strong association rules from the frequent item sets.

Association Rule Mining – A road map: - One form of association rule mining is “Market Basket Analysis” - Many kinds of association rules exist. - Association rules classified based on the following criteria:

o 1) Based on the type of values handled in the rule: a) Boolean Association Rule:

• Association between presence or absence of items. • Eg. Computer => Printer [Support = 2%; Confidence = 60%]

b) Quantitative Association Rule: • Association between quantitative attributes • Eg. In the below Age and Income are quantitative attributes.

o 2) Based on the dimensions of data involved in the rule:

a) Single-Dimensional Association Rule: • If attributes in an association rule refer to only one dimension • Eg.

b) Multi-Dimensional Association Rule: • Attributes in the association rule refer to two or more

dimensions. • Eg. In the below it involves 3 dimensions Age, Income &

Buys.

o 3) Based on the levels of abstractions involved in the rule: a) Multi-Level Association Rule

• Association rule refers to items at different levels of abstraction.


• Eg. An association involving 'Computer' and 'Laptop Computer': here Computer is at the higher level of abstraction and

Laptop Computer at the lower level of abstraction. b) Single-Level Association Rule

• Association rule refers to items at same level of abstraction o 4) Based on the nature of association involved in the rule:

Association mining extended to Correlation Analysis where the absence or presence of correlated items can be identified.

Apriori Algorithm: - This is a basic algorithm for:

o Finding Frequent Item sets o Generate Strong Association Rules from Frequent Item sets.

Finding Frequent Item sets: - This algorithm uses prior knowledge of frequency item set properties. - Uses Level-wise search – Iterative approach - Here k-item sets are used to explore k+1 item sets - 1) First set of frequent 1-Item sets are found. This set is denoted by L1. - 2) L1 is used to find L2 = Frequent 2-Item sets - 3) L2 is used to find L3 and so on. - 4) Stop when no more frequent k-item sets can be found. - Finding of each Lk requires one full scan of the database. - At every level the algorithm prunes the sets that are unlikely to be frequent sets. - Each pass (Except the first pass) has 2 phases:

o (i) Candidate Generation Process o (ii) Pruning Process

- This Algorithm uses the Apriori Property: o “All non-empty subsets of a frequent itemset must also be frequent”

(i)Candidate Generation Process: - To find Lk, the candidate set Ck has to be generated by joining Lk-1 with itself. - The members of Lk-1 are joinable if they have (k-2) items in common. - - Eg. L3 = { {1,2,3}, {1,2,5}, {1,3,5}, {2,3,5}, {2,3,4} } - Then C4 = { {1,2,3,5}, {2,3,4,5} } - {1,2,3,5} is generated from {1,2,3} & {1,2,5} - {2,3,4,5} is generated from {2,3,4} & {2,3,5} (ii)Pruning Process: - This process removes the items from Ck which does not satisfy the Apriori Property. - First find all k-1subsets of each item of Ck - Then check if these subsets are in Lk-1. If for any item of Ck any of its subsets are

not in Lk-1 then remove that item from Ck. - Prune (Ck)

For all c in Ck
  For all (k-1)-subsets d of c do
    If d not in Lk-1 then
      Ck = Ck \ {c}
    End If
  End do
End do
- Eg: C4 = { {1,2,3,5}, {2,3,4,5} }
- The 3-subsets of {1,2,3,5} are {1,2,3}, {1,2,5}, {1,3,5}, {2,3,5}


- All these subsets are in L3, hence do not remove {1,2,3,5} from C4.
- The 3-subsets of {2,3,4,5} are {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}.
- The subsets {2,4,5} and {3,4,5} are not in L3. Hence remove {2,3,4,5} from C4.
- Thus this algorithm uses the candidate generation and pruning process at every iteration.
- Move upwards from level 1 to level k, where no candidate set remains after pruning.
Algorithm:
Initialize k = 1; C1 = all the 1-itemsets;
Read the database to count the support of C1 to determine L1;
L1 = {frequent 1-itemsets};
k = 2; // k is the pass number //
While (Lk-1 <> Empty) do
Begin
  Ck = Generate candidate itemsets from Lk-1;
  Prune(Ck);
  For all transactions t in T do
    Increment the count of all candidates in Ck that are contained in t;
  End do;
  Lk = All candidates in Ck with minimum support;
  k = k + 1;
End;
L = Uk Lk = L1 U L2 U … U Lk;
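A compact sketch of the level-wise algorithm above in plain Python, with candidate generation by self-join and pruning by the Apriori property. The four transactions are illustrative (they are chosen so that the result matches the L1, L2 and L3 of the example that follows), not the transaction table of the notes, which is not reproduced here.

from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: use frequent k-itemsets (Lk) to explore (k+1)-itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= min_support}
    L = []
    while Lk:
        L.append(Lk)
        k = len(next(iter(Lk))) + 1
        # candidate generation: join Lk with itself (members sharing k-1 items)
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # pruning: every (k-1)-subset of a candidate must itself be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # one scan of the database to keep candidates meeting minimum support
        Lk = {c for c in Ck if support_count(c) >= min_support}
    return L

T = [{"I1", "I2"}, {"I2", "I3", "I4"}, {"I1", "I2", "I3", "I4"}, {"I2", "I3", "I4"}]
for level in apriori(T, min_support=2):
    print(sorted(sorted(s) for s in level))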

Example: A Transactional database where |D| = 4

k = 1; Let minimum support = 2 = 50% User specified

k = 2; After pruning C2 is same.

k = 3 Generate C3 candidates from L2 C3 = { {I1, I2, I3}, {I1, I2, I4}, {I2, I3, I4} } Pruning Step: {I1, I2, I3} Subsets are {I1, I2}, {I1, I3}, {I2, I3} Not all subsets in L2 {I1, I2, I4} Subsets are {I1, I2}, {I1, I4}, {I2, I4} Not all subsets in L2 {I2, I3, I4} Subsets are {I2, I3}, {I2, I4}, {I3, I4} All subsets are in L2


After Pruning: C3 = { {I2, I3, I4} }

k = 4; L3 has 1 element => C4 is empty => Algorithm stops L = L1 U L2 U L3 = { {I1}, {I2}, {I3}, {I4}, {I1, I2}, {I2, I3}, {I2, I4}, {I3, I4}, {I2, I3, I4} } Generating Association Rules from Frequent Item sets: - Generate strong association rules from the generated frequent item sets (Previous step). - Strong Association rules satisfy both minimum support and minimum confidence. - Confidence Equation is : Confidence (A => B) = Prob (B/A) = Support (A U B) / Support (A)

- Support (A U B) = Number of transactions containing the itemsets AUB - Support (A) = Number of transactions containing the item set A - Based on this equation, Association rules can be generated as follows:

o 1) For each frequent item set, l generate all non-empty subsets of l o 2) For every non-empty subset s of l, output the rule

“s => (l-s)” If Support (l) / Support (s) >= min_conf min-conf = Minimum Confidence threshold

- Eg: Consider the frequent item set l = {I2, I3, I4} o 1) Non-empty subsets of l are:

{I2, I3}, {I2, I4}, {I3, I4}, {I2}, {I3} and {I4} o The resulting Association rules are:

o Let the minimum confidence threshold = 70%; then the strong Association

rules are the first and second rules.
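A short sketch of generating strong rules from one frequent itemset using confidence = support(l) / support(s); the support counts passed in are illustrative (the notes' transaction table is not reproduced), so the particular rules printed only demonstrate the mechanics.

from itertools import combinations

def strong_rules(itemset, support_count, min_conf):
    """For every non-empty proper subset s of l, output the rule s => (l - s)
    when support(l) / support(s) >= min_conf."""
    l = frozenset(itemset)
    rules = []
    for r in range(1, len(l)):
        for s in combinations(l, r):
            s = frozenset(s)
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# illustrative support counts for l = {I2, I3, I4} and all of its subsets
support_count = {
    frozenset(["I2", "I3", "I4"]): 2,
    frozenset(["I2", "I3"]): 2, frozenset(["I2", "I4"]): 3, frozenset(["I3", "I4"]): 2,
    frozenset(["I2"]): 4, frozenset(["I3"]): 3, frozenset(["I4"]): 3,
}
for a, b, conf in strong_rules(["I2", "I3", "I4"], support_count, min_conf=0.7):
    print(a, "=>", b, f"confidence = {conf:.0%}")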

Review Questions

Two Marks:

1. Why do we go for data preprocessing? 2. Why is data dirty? 3. Why is data preprocessing important? 4. What are the multi-dimensional measures of data quality? 5. List the various tasks in data preprocessing.


6. List down the data transformation processes. Write about normalization by decimal scaling.

7. What are Min-Max Normalization and Z-score Normalization? 8. Why data reduction? What is data reduction? 9. What is data discretization technique? List the types of data discretization techniques. 10. What are the various concept hierarchy generation methods for categorical data? 11. What are frequent patterns? What are frequent sequential patterns? What are frequent

structural patterns? What is frequent pattern mining? 12. Write about Support and Confidence for Association rule mining. 13. List the criteria upon which association rules are classified.

Sixteen Marks:

1. (i) What are the various data cleaning tasks and explain on the methods of handling those? Describe about the data cleaning process steps. (8) (ii) What are the various data integration tasks and explain on the methods of handling those? (8)

2. (i) Explain about the various data reduction strategies. (8) (ii) Detail on the different data compression techniques. (8)

3. Write brief notes on numerosity reduction. (16) (i) Regression and log-linear models (ii) Histograms (iii) Clustering (iv) Sampling

4. Explain about the various discretization and concept hierarchy generation techniques for numeric data. (16)

5. Explain the Apriori algorithm with a suitable example for finding frequent itemsets and how to generate strong association rules from frequent itemsets. (16)

Assignment Topic:

1. Write about “Data Generalization and Summarization Based Characterization”.


3.1 Classification and Prediction

Classification and Prediction: - Used to extract models describing important data classes or to predict future data

trends. - Eg. Classification Model Built to categorize bank loan applications as either safe

or risky. - Eg. Prediction Model Built to predict the expenditures for potential customers on

computer equipment given their income and occupation. What is classification? What is Prediction? Classification—A Two-Step Process

1) First Step: (Model Construction) - Model built describing a pre-determined set of data classes or concepts. - A class label attribute determines which tuple belongs to which pre-determined class. - This model is built by analyzing a training data set. - Each tuple in a training data set is called a training sample. - This model is represented in the form of:

o Classification Rules o Decision Trees o Mathematical Formulae

- Eg: Database having customer credit info o Classification Rule – identifies customers with ‘Excellent credit rating’ or

‘Fair credit rating’. 2) Second Step: (Using the model in prediction) - Model is used for classification

o Predictive accuracy of the model is estimated. - There are several methods for this.

The Holdout Method is a simple technique. • Uses a test set of class-labeled samples. • These are randomly selected and are independent of the training

samples. • The Accuracy of the Model on a given test set is the % of test

set samples that are correctly classified by the model. • If the accuracy of the model is acceptable the model can be

used to classify future data tuples for which the class label is not known.
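A tiny sketch of the holdout idea: hold out a random test set of class-labeled samples and report the percentage classified correctly. The stand-in classifier and the toy samples are illustrative; in practice the model would first be built from the remaining (training) samples.

import random

def holdout_accuracy(samples, labels, classify, test_fraction=0.3, seed=42):
    """Randomly hold out a test set and report the fraction of test samples
    that the given model classifies correctly."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test_idx = idx[:n_test]                  # held-out, independent of training
    correct = sum(1 for i in test_idx if classify(samples[i]) == labels[i])
    return correct / n_test

# trivial stand-in classifier: predicts 'yes' whenever income is 'high'
classify = lambda s: "yes" if s["income"] == "high" else "no"
samples = [{"income": "high"}, {"income": "low"}, {"income": "high"}, {"income": "low"},
           {"income": "high"}, {"income": "low"}, {"income": "high"}, {"income": "low"},
           {"income": "high"}, {"income": "low"}]
labels = ["yes", "no", "yes", "no", "no", "no", "yes", "no", "yes", "no"]
print(holdout_accuracy(samples, labels, classify))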

Supervised vs. Unsupervised Learning

Supervised learning (classification) o Supervision: The training data (observations, measurements, etc.) are accompanied by

labels indicating the class of the observations o New data is classified based on the training set

Unsupervised learning (clustering) o The class labels of the training data are unknown o Given a set of measurements, observations, etc. with the aim of establishing the

existence of classes or clusters in the data

3.2 Issues Regarding Classification and Prediction


Issues regarding Classification and Prediction: 1) Prepare the data for classification and Prediction: - Steps Applied are:

a. Data Cleaning: - To reduce confusion - a.1 Remove / reduce noise by smoothing - a.2 Treat missing values by replacing them with the most common or most probable values

b. Relevance Analysis: - Removing redundant & irrelevant attributes – called as Feature Selection - Otherwise they slow down / mislead the classification process. - Improves classification efficiency & scalability

c. Data Transformation:
- c.1 Data generalized to higher-level concepts – concept hierarchies are used for this purpose – compresses the original training data.
- c.2 Data normalized – scaling all values of a given attribute so that they fall within a small range, say (-1.0 to 1.0) or (0 to 1.0).
2) Comparing Classification Methods:

- Compared and Evaluated according to the following criteria: a. Predictive Accuracy: - Ability of the model to correctly predict the class label. b. Speed: - Computational cost in generating & using the model. c. Robustness: - Ability of the model to make correct predictions even with noisy /

missing data. d. Scalability: - Ability of the model to perform efficiently on large volumes of data. e. Interpretability: - Level of understanding and insight provided by the model.


3.3 Classification By Decision Tree Induction

Decision Tree Induction: - A decision tree is a flow chart like tree structure.

o Each internal node denotes test on an attribute o Each branch = outcome of the test; Leaf Nodes = Classes

- Internal nodes denoted by rectangles & leaf nodes denoted by ovals. - To classify the new tuples the attribute values of the tuples are tested against the

decision tree. - Path is traced from root to leaf node to find the class of the tuple. - Decision trees can be easily converted to classification rules. - Decision trees are used in many application areas ranging from medicine to business. - Decision trees are the basis of several commercial rule induction systems.


Algorithm for generating a decision tree from training samples: (Generate decision tree) Input: Set of training samples, attribute list (set of candidate attributes) Output: A decision tree Method:

This algorithm constructs decision trees in top-down recursive divide-and-conquer manner. Basic Strategy:

1) Tree starts as a single node having all training samples. 2) If samples are all of same class then

Node = leaf, labeled with that class.
3) Otherwise, the algorithm uses an entropy-based heuristic (information gain) for selecting the attribute that best separates the samples into individual classes. This attribute is the 'test' or 'decision' attribute at the node. In this version of the algorithm all attributes are categorical.

4) A branch created for known value of test attribute – Samples partitioned accordingly. 5) Uses the above process recursively to form a decision tree. 6) Recursive partitioning stops when any one condition is true:

a. All samples belong to same Class. b. No remaining attributes to partition the samples. c. Majority Voting: Convert node in to leaf, label it with class in majority of

samples. d. No sample for test-attribute = ai

Leaf is created with majority class in samples. Attribute Selection Measure / Information Gain Measure / Measure of the goodness of split:

- Used to select the test attribute at each node in the tree


- The attribute with the highest information gain is chosen as the test attribute for the current node.

- Let S = set of data samples; |S| = s - Let the class label attribute have m distinct values and these distinct values define

distinct classes Ci (for i = 1…m)

- Let si = number of samples in Ci.
- The expected information needed to classify a given sample is
  I(s1, s2, …, sm) = − Σ (i = 1 to m) pi log2(pi)
  where pi = probability that an arbitrary sample belongs to Ci = si/s.
- Choose an attribute A that has v distinct values {a1, a2, …, av}.
- Then attribute A can partition S into v subsets {S1, S2, …, Sv}, where Sj contains the samples with

value aj of A.
- Let sij = number of samples of class Ci in subset Sj.
- The expected information needed if the samples are partitioned based on attribute A is
  E(A) = Σ (j = 1 to v) [(s1j + … + smj) / s] · I(s1j, …, smj)
- The information gained by branching on A is:
  Gain(A) = I(s1, s2, …, sm) − E(A)
- Similarly, the information gain is computed for each attribute.
- The attribute with the highest information gain is chosen as the test attribute.
- A node is created and labeled with the attribute; branches are created for each value

of the attribute - Samples are partitioned based on the attributes values.
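A small sketch that evaluates I(s1, …, sm), E(A) and Gain(A) from class counts, using the formulas above; the counts in the final call correspond to the buys_computer / age example worked out below.

import math

def info(counts):
    """I(s1, ..., sm) = -sum_i (si/s) * log2(si/s)."""
    s = sum(counts)
    return -sum((si / s) * math.log2(si / s) for si in counts if si > 0)

def gain(class_counts, partition_counts):
    """Gain(A) = I(s1, ..., sm) - E(A), where E(A) is the weighted info of the
    subsets {S1, ..., Sv} induced by the v values of attribute A."""
    s = sum(class_counts)
    e_a = sum(sum(subset) / s * info(subset) for subset in partition_counts)
    return info(class_counts) - e_a

# class counts for buys_computer: 9 'yes', 5 'no'; partitions induced by 'age'
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # about 0.246 bits for 'age'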

Example:

- Table below presents a training set of data tuples; |S| = s = 14; - Class label attribute = buys_computer; - Distinct values of class label = {yes, no} - Distinct classes C1 = Class of ‘yes’; C2 = Class of ‘no’;

- s1 = Number of samples in class C1 = 9; s2 = Number of samples in class C2 = 5 - Expected Information need to classify the sample using the class-label attribute:

- First choose attribute ‘age’ that has 3 distinct values {<30, 30-40, >40} - Now S is partitioned into 3 sets of samples

o Samples for which age <30 = S1 = 5


o Samples for which age = 30-40 = S2 = 4 o Samples for which age >40 = S3 = 5

- s11 = Samples of class C1 in S1 = Samples with age <30 and buys_computer = ‘yes’ = 2

- s21 = Samples of class C2 in S1 = Samples with age <30 and buys_computer = ‘no’ = 3

- s12 = Samples of class C1 in S2 = Samples with age =30-40 and buys_computer = ‘yes’ = 4

- s22 = Samples of class C2 in S2 = Samples with age =30-40 and buys_computer = ‘no’ = 0

- s13 = Samples of class C1 in S3 = Samples with age >40 and buys_computer = ‘yes’ = 3

- s23 = Samples of class C2 in S3 = Samples with age >40 and buys_computer = ‘no’ = 2

- I(s11, s21) = I(2, 3) = 0.971; I(s12, s22) = I(4, 0) = 0; I(s13, s23) = I(3, 2) = 0.971
- Expected information needed to classify the given sample using the attribute 'age':
  E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694, so Gain(age) = 0.940 − 0.694 = 0.246

- Similarly, compute the information gain for the remaining attributes.
- Age has the highest information gain among the attributes.
- Create a node and label it with age; branches are created for age < 30, age = 30-40, age >

40 - Samples partitioned based on the value of age; Samples in the branch age = 30-40

belong to C1 = yes - Create a leaf to the end of this branch and label it with ‘yes’

- Decision tree (i) used in wide range of applications (ii) Fast generation (iii) High

accuracy. Tree Pruning:

- Done to avoid over-fitting of the data and to handle noise / outliers.
- The least reliable branches are removed using statistical measures, giving faster and more efficient classification.
- Two approaches – (i) Pre-pruning approach (ii) Post-pruning approach

Pre-Pruning Approach: - Halt tree construction early; Upon halting node becomes leaf - Leaf = most frequent class (or) Probability distribution of samples in the class


- If partitioning the samples at a node results in a split that is below the pre-specified threshold

- Further partitioning is halted. - Choice of threshold is a challenge: high thresholds give over-simplified trees; low thresholds give complex trees.

Post-Pruning Approach: - Removes branches from a “fully grown” tree. - Tree node pruned by removing its branches. - Pruned node becomes a leaf & labeled by most frequent class among its former

branches. - Uses the cost complexity pruning algorithm. - For each node the algorithm calculates the expected error rate if the subtree at the
node is pruned, and the expected error rate if it is not pruned. - If pruning the node gives a greater expected error rate than not pruning, then the subtree

is kept. - Otherwise the node is pruned. - Likewise the tree is pruned progressively for each node. - The decision tree with minimum expected error rate is preferred. - Pre-pruning and post-pruning can be combined for best approach. - Post-pruning requires more computations than pre-pruning - But post-pruning is more reliable & accurate than pre-pruning.

Extracting classification rules from decision trees: - Decision trees can also be represented in the form of IF THEN Classification rules. - One rule is created for each path from root to leaf node. - IF part is called as Rule Antecedent; THEN part is called as Rule Consequent - Each (attribute=value) pair along a path forms a conjunction in the rule antecedent. - Class prediction of each leaf node forms the Rule Consequent. - In case of very large tree the classification rule representation in the form of IF THEN

rules is easy for humans to interpret. - Example: The rules extracted from the above decision tree are:

o IF age = ‘<30’ AND student = ‘no’ THEN buys_computer = ‘no’ o IF age = ‘<30’ AND student = ‘yes’ THEN buys_computer = ‘yes’ o IF age = ’30-40’ THEN buys_computer = ‘yes’ o IF age = ‘>40’ AND credit_rating = ‘excellent’ THEN buys_computer = ‘yes’ o IF age = ‘>40’ AND credit_rating = ‘fair’ THEN buys_computer = ‘no’

3.4 Bayesian Classification

Bayesian Classification: - “What are Bayesian Classifiers?” They are statistical classifiers - Predicts class membership probabilities = Probability that a given sample belongs to a

particular class - This is based on Bayes Theorem - Exhibits high accuracy and speed when applied to large databases. - One type of Bayesian Classification is “Naïve Bayesian Classifier” - This method has performance comparable to Decision Tree induction. - This method is based on the assumption that “The effect of an attribute value on a

given class is independent of the values of the other attributes” - This assumption is called “Class Conditional Independence”


- Another type of Bayesian Classification is 'Bayesian Belief Networks' - In theory, Bayesian classifiers have the minimum error rate compared to all other classifiers (in practice this does not always hold)

Bayes Theorem: - Let X be a data sample whose class label is unknown - Ex: Data Samples consists of fruits described by their colour and shape. - Say, X is red and round - Let H be some hypothesis such that the data sample X belongs to a class C - Ex: H is the hypothesis that X is an apple. - Then, we have to determine P(H/X) = Probability that the hypothesis H holds for the

data sample X - P(H/X) is called as the Posterior Probability of H on X - Ex: Probability that X is an apple given that X is red and round. - P(X/H) = Posterior Probability of X on H - Ex: Probability that X is red and round given that X is an apple. - P(H) = Prior Probability of H - Ex: Probability that any given data sample is an apple regardless of its colour and

shape. - P(X) = Prior Probability of X - Ex: Probability that a data sample from our set of fruits is red and round, regardless of which class it belongs to. - Bayes Theorem is => P(H/X) = P(X/H) P(H) / P(X)

Naïve Bayesian Classification: 1. Each data sample is represented by an n-dimensional vector X = (x1, x2, … xn)

[Where, we have n-attributes => A1, A2, … An]. 2. Say there are m classes, C1, C2, … Cm; Let X be an unknown data sample having no

class label. Then the Naïve Bayesian Classifier assigns X to the class Ci having the highest posterior probability conditioned on X, i.e. P(Ci/X) > P(Cj/X) for 1 ≤ j ≤ m, j ≠ i (the maximum posteriori hypothesis).

Here we calculate P(Ci/X) by Bayes Theorem: P(Ci/X) = P(X/Ci) P(Ci) / P(X). 3. P(X) is constant for all classes. Hence only P(X/Ci) P(Ci) needs to be maximized.

Here P(Ci) = si/s; where si = number of samples in class Ci and s = total number of samples in all classes.

4. To evaluate P(X/Ci) we use the naïve assumption of “Class Conditional Independence”:
P(X/Ci) = P(x1/Ci) × P(x2/Ci) × … × P(xn/Ci)

Hence the probabilities P(x1/Ci), P(x2/Ci), … P(xn/Ci) can be estimated from the training samples, where P(xk/Ci) = sik/si; sik = number of samples in class Ci having the value xk for attribute Ak; si = number of samples in class Ci.

5. Evaluate P(X/Ci) P(Ci) for each class Ci; [i = 1, … m] X is assigned to the class Ci for which P(X/Ci) P(Ci) is maximum.

Example- Predicting a class label using Naïve Bayesian Classification:


- Let C1 = Class buys_computer = yes; C2 = Class buys_computer = no - Let us try to classify an unknown sample: - X = (age = “<30”, income = medium, student = yes, credit-rating = fair) - We need to maximize P(X/Ci) P(Ci) for i = 1,2; So, Compute P(C1) & P(C2) - P(buys_computer = yes) = 9/14 = 0.643 P(buys_computer = no) = 5/14 =

0.357 - Next, Compute P(X/C1) & P(X/C2) - P(age = “<30” / buys_computer = yes) = 2/9 = 0.222 - P(income = medium / buys_computer = yes) = 4/9 = 0.444 - P(student = yes / buys_computer = yes) = 6/9 = 0.667 - P(credit-rating = fair / buys_computer = yes) = 6/9 = 0.667 - P(age = “<30” / buys_computer = no) = 3/5 = 0.600 - P(income = medium / buys_computer = no) = 2/5 = 0.400 - P(student = yes / buys_computer = no) = 1/5 = 0.200 - P(credit-rating = fair / buys_computer = no) = 2/5 = 0.400 - Hence P(X/buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044 - P(X/buys_computer = no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019 - Finally P(X/buys_computer = yes) P(buys_computer = yes) = 0.044 * 0.643 = 0.028 - P(X/buys_computer = no) P(buys_computer = no) = 0.019 * 0.357 = 0.007 - Hence Naïve Bayesian Classifier predicts “buys_computer = yes” for sample X.
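A minimal sketch of the same calculation in Python; the dictionaries below are illustrative helpers (not library code), and the probabilities are exactly the counts from the 14-sample buys_computer training set used above.

# Naive Bayesian classification of X = (age < 30, income = medium, student = yes, credit = fair)
prior = {'yes': 9/14, 'no': 5/14}                     # P(Ci)
cond = {                                              # P(xk/Ci) estimated from the training samples
    'yes': {'age=<30': 2/9, 'income=medium': 4/9, 'student=yes': 6/9, 'credit=fair': 6/9},
    'no':  {'age=<30': 3/5, 'income=medium': 2/5, 'student=yes': 1/5, 'credit=fair': 2/5},
}
X = ['age=<30', 'income=medium', 'student=yes', 'credit=fair']

scores = {}
for c in prior:
    p = prior[c]
    for attr in X:                                    # class-conditional independence assumption
        p *= cond[c][attr]
    scores[c] = p                                     # P(X/Ci) * P(Ci)

print(scores)                                         # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))                    # 'yes' -> buys_computer = yes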

3.5 Other Classification Methods

Other classification methods In this section, we give a brief description of a number of other classification methods. These methods include k-nearest neighbor classification, case-based reasoning, genetic algorithms, rough set and fuzzy set approaches. In general, these methods are less commonly used for classification in commercial data mining systems than the methods described earlier in this chapter. Nearest-neighbor classification, for example, stores all training samples, which may present difficulties when learning from very large data sets. Furthermore, many applications of case-based reasoning, genetic algorithms, and rough sets for classification are still in the prototype phase. These methods, however, are enjoying increasing popularity, and hence we include them here. k-nearest neighbor classifiers


Nearest neighbor classifiers are based on learning by analogy. The training samples are described by n-dimensional numeric attributes. Each sample represents a point in an n-dimensional space. In this way, all of the training samples are stored in an n-dimensional pattern space. When given an unknown sample, a k-nearest neighbor classifier searches the pattern space for the k training samples that are closest to the unknown sample.

The unknown sample is assigned the most common class among its k nearest neighbors. When k = 1, the

unknown sample is assigned the class of the training sample that is closest to it in pattern space.
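A minimal sketch of a k-nearest neighbor classifier in plain Python (Euclidean distance, majority vote); the function name and the toy samples are illustrative assumptions, not taken from the text above.

import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (feature_vector, class_label) pairs; query: an unknown sample
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]   # k closest training samples
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]                         # most common class wins

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((8.0, 9.0), 'B'), ((9.0, 8.5), 'B')]
print(knn_classify(train, (1.1, 0.9), k=3))                             # 'A'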

Nearest neighbor classifiers are instance-based since they store all of the training samples. They can incur expensive computational costs when the number of potential neighbors (i.e., stored training samples) with which to compare a given unlabeled sample is great. Therefore, efficient indexing techniques are required. Unlike decision tree induction and backpropagation, nearest neighbor classifiers assign equal weight to each attribute. This may cause confusion when there are many irrelevant attributes in the data.

Nearest neighbor classifiers can also be used for prediction, i.e., to return a real-valued prediction for a given unknown sample. In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown sample.

Case-based reasoning
Case-based reasoning (CBR) classifiers are instance-based. Unlike nearest neighbor classifiers, which store training samples as points in Euclidean space, the samples or “cases” stored by CBR are complex symbolic descriptions. Business applications of CBR include problem resolution for customer service help desks, for example, where cases describe product-related diagnostic problems. CBR has also been applied to areas such as engineering and law, where cases are either technical designs or legal rulings, respectively.

When given a new case to classify, a case-based reasoner will first check if an identical training case exists. If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the case-based reasoner will search for training cases having components that are similar to those of the new case. Conceptually, these training cases may be considered as neighbors of the new case. If cases are represented as graphs, this involves searching for subgraphs which are similar to subgraphs within the new case. The case-based reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new case. If incompatibilities arise with the individual solutions, then backtracking to search for other solutions may be necessary. The case-based reasoner may employ background knowledge and problem-solving strategies in order to propose a feasible combined solution.

Challenges in case-based reasoning include finding a good similarity metric (e.g., for matching subgraphs), developing efficient techniques for indexing training cases, and methods for combining solutions.


Genetic algorithms Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. As a simple example, suppose that samples in a given training set are described by two Boolean attributes, A1 and A2, and that there are two classes, C1 and C2. The rule “IF A1 AND NOT A2 THEN C2” can be encoded as the bit string “100”, where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001”. If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values. Classes can be encoded in a similar fashion.

Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted.

The process of generating new populations based on prior populations of rules continues until a population P “evolves” where each rule in P satisfies a prespecified fitness threshold.
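A minimal sketch of the encoding and of the two genetic operators described above (illustrative only; the population, cut point and mutation probability are arbitrary assumptions).

import random

# "IF A1 AND NOT A2 THEN C2" -> '100';  "IF NOT A1 AND NOT A2 THEN C1" -> '001'
population = ['100', '001']

def crossover(r1, r2, point=1):
    # swap substrings of a pair of rules at the cut point to form a new pair of rules
    return r1[:point] + r2[point:], r2[:point] + r1[point:]

def mutate(rule, p=0.1):
    # invert randomly selected bits in the rule's string
    return ''.join(('1' if b == '0' else '0') if random.random() < p else b for b in rule)

child1, child2 = crossover(*population)
print(child1, child2, mutate(child1))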

Genetic algorithms are easily parallelizable and have been used for classification as well as other optimization problems. In data mining, they may be used to evaluate the fitness of other algorithms.

Rough set theory
Rough set theory can be used for classification to discover structural relationships within imprecise or noisy data. It applies to discrete-valued attributes. Continuous-valued attributes must therefore be discretized prior to its use.

Rough set theory is based on the establishment of equivalence classes within the given training data. All of the data samples forming an equivalence class are indiscernible, that is, the samples are identical with respect to the attributes describing the data. Given real-world data, it is common that some classes cannot be distinguished in terms of the available attributes; rough sets can be used to approximately, or “roughly”, define such classes.

3.6 Prediction

Prediction: - Data values are of two types – continuous and categorical. - Prediction of continuous values is done using a statistical technique called regression. - Example: predict the salary of graduates with 10 years of work experience.


- Many problems can be solved using linear regression and for the remaining problems we can apply transformation of variables so that nonlinear problem can be converted to a linear one.

Linear and Multiple Regressions: Linear Regression:

- Data are modeled using a straight line. This is the simplest form of regression. - Expressed as Y = α + βX, where Y is the response variable, X is the predictor variable, and α, β are the regression coefficients corresponding to the Y-intercept and the slope of the line respectively.
- These regression coefficients can be solved by the method of least squares. - Let us consider s samples of data points (x1, y1), (x2, y2), … (xs, ys). - Then the regression coefficients can be estimated using the equations:
  β = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²  (sums over i = 1 … s)   and   α = ȳ − β x̄
- where x̄ and ȳ are the averages of x1, …, xs and y1, …, ys respectively.
- Example – Salary data (X = years of experience, Y = salary) plotted on a graph show a linear relationship between X and Y. From the data we compute x̄ and ȳ, obtain β and α from the above equations, substitute these values into the regression equation Y = α + βX, and then read off the predicted salary of a graduate with 10 years of experience at X = 10.
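A minimal sketch of the least-squares computation in plain Python. The experience/salary numbers below are made-up illustrative values (the original data table is not reproduced here).

# Fit Y = alpha + beta * X by the method of least squares
xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]            # X: years of experience (illustrative values)
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]       # Y: salary in $1000s (illustrative values)

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
beta = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
alpha = y_bar - beta * x_bar

print(round(alpha, 1), round(beta, 1))          # estimated intercept and slope
print(round(alpha + beta * 10, 1))              # predicted salary for 10 years of experience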

Multiple Regressions: - Extension of Linear Regression. Involves more than one predictor variable (X). - Response variable Y is expressed as a linear function of multi dimensional feature

vector. - Consider that we have two predictor variables X1 & X2. Then the equation is:

- Y = α + β1 X1 + β2 X2 - The same method of least squares is applied to solve for α, β1 and β2.

Non-Linear Regression (or) Polynomial Regression: - Add polynomial terms to the basic linear model. - 1. Apply transformations to the variables. - 2. Convert the non-linear model into a linear one. - 3. Then solve it by the method of least squares. - Example: Consider the cubic polynomial relationship given by the equation:
  Y = α + β1 X + β2 X² + β3 X³
- 1. Apply transformations to the variables: X1 = X; X2 = X²; X3 = X³
- 2. Convert the non-linear model into a linear one: Y = α + β1 X1 + β2 X2 + β3 X3
- 3. Solve the above equation by the method of least squares.
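A minimal sketch of this transformation trick using numpy's least-squares solver (numpy is assumed to be available; the data are synthetic and purely illustrative).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2 + 0.5 * x - 0.3 * x**2 + 0.1 * x**3       # synthetic cubic relationship (illustrative)

# Steps 1-2: transform X1 = X, X2 = X^2, X3 = X^3, giving a model linear in (X1, X2, X3)
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
# Step 3: solve for (alpha, beta1, beta2, beta3) by the method of least squares
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)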

Other Regression Models: Generalized Linear Models:

- Used for Prediction of Categorical data values. - Here the variance of the response variable Y is a function of the mean value of Y. - Types:

o Logistic Regression o Poisson Regression

Log-Linear Models: - Used for Prediction of Categorical data values. If the values are continuous those

values are discretized before applying log-linear models. - Estimates the probability value associated with data cube cells. - Example: Say, we have data for the attributes city, item, year and sales. - In this, sales is a continuous valued attribute. Hence its values are discretized to make

it categorical. - Now estimate the probability of each cell in a 4D base cuboid for the above attributes. - This is done based on the 2D cuboids for city & item, city & year, city & sales and

3D cuboid for item, year and sales. - Thus, such iterative techniques can be used to build higher order data cubes from

lower order ones.

3.7 Clusters Analysis: Types Of Data In Cluster Analysis

What is Cluster Analysis? The process of grouping a set of physical objects into classes of similar objects is called clustering. Cluster – collection of data objects

– Objects within a cluster are similar and objects in different clusters are dissimilar. Cluster applications – pattern recognition, image processing and market research.

- helps marketers to discover the characterization of customer groups based on purchasing patterns

- Categorize genes in plant and animal taxonomies - Identify groups of houses in a city according to house type, value and geographical location - Classify documents on the WWW for information discovery

Clustering is a preprocessing step for other data mining steps like classification, characterization. Clustering – Unsupervised learning – does not rely on predefined classes with class labels. Typical requirements of clustering in data mining:

1. Scalability – Clustering algorithms should work for huge databases 2. Ability to deal with different types of attributes – Clustering algorithms should work not

only for numeric data, but also for other data types. 3. Discovery of clusters with arbitrary shape – Clustering algorithms (based on distance

measures) should work for clusters of any shape.


4. Minimal requirements for domain knowledge to determine input parameters – Clustering results are sensitive to input parameters to a clustering algorithm (example – number of desired clusters). Determining the value of these parameters is difficult and requires some domain knowledge.

5. Ability to deal with noisy data – Some clustering algorithms are sensitive to outliers and to missing, unknown or erroneous data, which may lead to clusters of poor quality.

6. Insensitivity to the order of input records – Clustering algorithms should produce the same results even if the order of input records is changed.

7. High dimensionality – Data in high dimensional space can be sparse and highly skewed, hence it is challenging for a clustering algorithm to cluster data objects in high dimensional space.

8. Constraint-based clustering – In Real world scenario, clusters are performed based on various constraints. It is a challenging task to find groups of data with good clustering behavior and satisfying various constraints.

9. Interpretability and usability – Clustering results should be interpretable, comprehensible and usable. So we should study how an application goal may influence the selection of clustering methods.

Types of data in Clustering Analysis

1. Data Matrix: (object-by-variable structure) Represents n objects (such as persons) with p variables or attributes (such as age, height, weight, gender, race and so on). The structure is a relational table, or n x p matrix, whose rows are the objects and whose columns are the variables (entry xif is the value of variable f for object i).
This is called a “two mode” matrix.
2. Dissimilarity Matrix: (object-by-object structure)
This stores a collection of proximities (closeness or distance) available for all pairs of the n objects. It is represented by an n-by-n table, typically stored in lower-triangular form with entries d(2,1); d(3,1), d(3,2); …; d(n,1), …, d(n, n−1).
This is called a “one mode” matrix, where d(i, j) is the dissimilarity between objects i and j; d(i, j) = d(j, i) and d(i, i) = 0. Many clustering algorithms use the dissimilarity matrix, so data represented as a data matrix are converted into a dissimilarity matrix before applying such clustering algorithms (a small sketch of this conversion follows). Clustering of objects is done based on their similarities or dissimilarities; similarity or dissimilarity coefficients may be derived, for example, from correlation coefficients.
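A minimal sketch of converting a data matrix into a dissimilarity matrix using Euclidean distance (illustrative; any other suitable dissimilarity measure could be substituted).

import math

# Data matrix: n objects (rows) described by p numeric variables (columns)
data = [[1.0, 2.0],
        [2.0, 4.0],
        [8.0, 9.0]]

n = len(data)
# Dissimilarity matrix: n x n with d(i, j) = d(j, i) and d(i, i) = 0
d = [[math.dist(data[i], data[j]) for j in range(n)] for i in range(n)]
for row in d:
    print([round(v, 2) for v in row])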

3.8 Categorization of Major Clustering Methods


Categorization of Major Clustering Methods The choice of many available clustering algorithms depends on type of data available and the application used. Major Categories are:

1. Partitioning Methods:

- Construct k-partitions of the n data objects, where each partition is a cluster and k <= n.

- Each partition should contain at least one object & each object should belong to exactly one partition.

- Iterative Relocation Technique – attempts to improve partitioning by moving objects from one group to another.

- Good Partitioning – Objects in the same cluster are “close” / related and objects in the different clusters are “far apart” / very different.

- Uses the algorithms: o k-means Algorithm: Each cluster is represented by the mean value of the objects in the cluster. o k-medoids Algorithm: Each cluster is represented by one of the objects located near the center of the cluster. o These work well in small to medium sized databases.

2. Hierarchical Methods:

- Creates hierarchical decomposition of the given set of data objects. - Two types – Agglomerative and Divisive - Agglomerative Approach: (Bottom-Up Approach):

o Each object forms a separate group o Successively merges groups close to one another (based on distance between

clusters) o Done until all the groups are merged to one or until a termination condition

holds. (Termination condition can be desired number of clusters) - Divisive Approach: (Top-Down Approach):

o Starts with all the objects in the same cluster o Successively clusters are split into smaller clusters o Done until each object is in one cluster or until a termination condition holds

(Termination condition can be desired number of clusters) - Disadvantage – Once a merge or split is done it cannot be undone. - Advantage – Less computational cost. - If the hierarchical approach is integrated with other clustering techniques, it gives more advantage. - Clustering algorithms with this integrated approach are BIRCH and CURE.

3. Density Based Methods:

- Above methods produce Spherical shaped clusters. - To discover clusters of arbitrary shape, clustering done based on the notion of

density. - Used to filter out noise or outliers. - Continue growing a cluster so long as the density in the neighborhood exceeds some

threshold. - Density = number of objects or data points. - That is, for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points. - Uses the algorithms DBSCAN and OPTICS.


4. Grid-Based Methods:

- Divides the object space into a finite number of cells to form a grid structure. - Performs clustering operations on the grid structure. - Advantage – Fast processing time – independent of the number of data objects and dependent only on the number of cells in the data grid. - STING is a typical grid-based method; CLIQUE and Wave-Cluster are both grid-based and density-based clustering algorithms.

5. Model-Based Methods:

- Hypothesizes a model for each of the clusters and finds a best fit of the data to the model.

- Forms clusters by constructing a density function that reflects the spatial distribution of the data points.

- Robust clustering methods - Detects noise / outliers.

Many algorithms combine several clustering methods.

3.9 Partitioning Methods

Partitioning Methods Database has n objects and k partitions where k<=n; each partition is a cluster. Partitioning criterion = Similarity function:

Objects within a cluster are similar; objects of different clusters are dissimilar. Classical Partitioning Methods: k-means and k-mediods:

(A) Centroid-based technique: The k-means method: - Cluster similarity is measured using mean value of objects in the cluster (or clusters

center of gravity) - Randomly select k objects. Each object is a cluster mean or center. - Each of the remaining objects is assigned to the most similar cluster – based on the

distance between the object and the cluster mean. - Compute new mean for each cluster. - This process iterates until all the objects are assigned to a cluster and the partitioning

criterion is met. - This algorithm determines k partitions that minimize the squared error function.

- The square-error function is defined as: E = Σ (i = 1..k) Σ (x ∈ Ci) |x − mi|² - where x is the point representing an object and mi is the mean of cluster Ci. - Algorithm: (a minimal sketch is given below)
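A minimal sketch of the k-means procedure described above, in plain Python; the point set, the random initialisation and the iteration limit are illustrative assumptions.

import math, random

def kmeans(points, k, iters=100):
    # points: a list of numeric tuples; k: the desired number of clusters
    means = random.sample(points, k)                       # arbitrarily choose k initial means
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assign each object to the most similar mean
            i = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        # recompute the mean of each cluster (keep the old mean if a cluster is empty)
        new_means = [tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else means[i]
                     for i, cl in enumerate(clusters)]
        if new_means == means:                             # stop when the means no longer move
            break
        means = new_means
    return means, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0), (0.5, 1.2)]
print(kmeans(pts, k=2))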


- Advantages: Scalable; efficient in large databases - Computational Complexity of this algorithm:

o O(nkt); n = number of objects, k number of partitions, t = number of iterations o k << n and t << n

- Disadvantages: o Cannot be applied to categorical data, as a mean cannot be calculated. o Need to specify the number of partitions k in advance. o Not suitable for discovering clusters of very different sizes. o Sensitive to noise and outliers.

(B) Representative Point-based technique: The k-medoids method: - Medoid – the most centrally located point in a cluster – used as the reference point. - Partitioning is based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. - PAM – Partitioning Around Medoids – a k-medoids type clustering algorithm. - Finds k clusters in n objects by finding a medoid for each cluster. - An initial set of k medoids is arbitrarily selected. - Iteratively replaces one of the medoids with one of the non-medoids so that the total distance of the resulting clustering is improved. - After the initial selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids by analyzing all possible pairs of objects such that one object is a medoid and the other is not. - The measure of clustering quality is calculated for each such combination. - The best choice of points in one iteration is chosen as the medoids for the next iteration. - The cost of a single iteration is O(k(n−k)²).


- For large values of n and k, the cost of such computation could be high.

- Advantage: The k-medoids method is more robust than the k-means method. - Disadvantage: The k-medoids method is more costly than the k-means method. - The user needs to specify k, the number of clusters, in both these methods.
(C) Partitioning methods in large databases: from k-medoids to CLARANS:
- (i) CLARA – Clustering LARge Applications – a sampling-based method. - In this method, only a sample set of data is considered from the whole dataset, and the medoids are selected from this sample using PAM. The sample is selected randomly. - CLARA draws multiple samples of the data set, applies PAM on each sample and gives the best clustering as the output. It then classifies the entire dataset into the resulting clusters.
- The complexity of each iteration in this case is O(kS² + k(n−k)), where S = size of the sample, k = number of clusters and n = total number of objects.
- The effectiveness of CLARA depends on the sample size. - Good clustering of the samples does not imply good clustering of the whole dataset if the sample is biased.
- (ii) CLARANS – Clustering LARge Applications based on RANdomized Search – proposed to improve the quality and scalability of CLARA. - It is similar to PAM and CLARA. - Unlike CLARA, it does not confine itself to a fixed sample, and unlike PAM it does not examine the entire database at every step. - It begins like PAM, selecting k medoids, and applies a randomized iterative optimization. - It then randomly selects a “maxneighbour” number of (medoid, non-medoid) pairs as candidates for swapping.


- If a pair with a lower cost is found, the medoid set is updated and the search continues. - Else the current selection of medoids is considered a local optimum set. - The process then repeats by randomly selecting new medoids, searching for another local optimum set. - It stops after finding a “num local” number of local optimum sets. - Returns the best of the local optimum sets. - CLARANS enables detection of outliers and is regarded as the best medoid-based method. - Drawbacks – assumes all objects fit into main memory; the result depends on the input order.

3.10 Hierarchical Methods

Hierarchical Methods This works by grouping data objects into a tree of clusters. Two types – Agglomerative and Divisive. Clustering algorithms with integrated approach of these two types are BIRCH, CURE, ROCK and CHAMELEON. BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies:

- Integrated Hierarchical Clustering algorithm. - Introduces two concepts – Clustering Feature and CF tree (Clustering Feature Tree) - CF Trees – Summarized Cluster Representation – Helps to achieve good speed &

clustering scalability - Good for incremental and dynamic clustering of incoming data points. - The Clustering Feature CF is the summary statistic for a cluster, defined as the triple CF = ⟨N, LS, SS⟩; - where N is the number of points in the sub-cluster (each point represented as a vector Xi); - LS is the linear sum of the N points, LS = Σ Xi (i = 1..N); and SS is the square sum of the data points, SS = Σ Xi² (i = 1..N). (A small sketch of computing and merging CFs appears after this list.)

- CF Tree – Height balanced tree that stores the Clustering Features. - This has two parameters – Branching Factor B and threshold T - Branching Factor specifies the maximum number of children. - Threshold parameter T = maximum diameter of sub clusters stored at the leaf nodes. - Change the threshold value => Changes the size of the tree. - The non-leaf nodes store sums of their children’s CF’s – summarizes information

about their children. - BIRCH algorithm has the following two phases:

o Phase 1: Scan database to build an initial in-memory CF tree – Multi-level compression of the data – Preserves the inherent clustering structure of the data.

CF tree is built dynamically as data points are inserted to the closest

leaf entry. If the diameter of the subcluster in the leaf node after insertion

becomes larger than the threshold then the leaf node and possibly other nodes are split.


After a new point is inserted, the information about it is passed towards the root of the tree.

If the size of the memory needed to store the CF tree is larger than the size of the main memory, then a larger threshold value is specified and a smaller CF tree is rebuilt.

This rebuild process builds from the leaf nodes of the old tree. Thus for building a tree data has to be read from the database only once.

o Phase 2: Apply a clustering algorithm to cluster the leaf nodes of the CF-tree.

- Advantages: o Produces best clusters with available resources. o Minimizes the I/O time

- Computational complexity of this algorithm is – O(N) – N is the number of objects to be clustered.

- Disadvantage: o Not a natural way of clustering; o Does not work for non-spherical shaped clusters.
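As a small illustration of the Clustering Feature idea used by BIRCH, a CF can be computed per sub-cluster and two CFs can simply be added when sub-clusters are merged. This is an illustrative sketch for 1-dimensional points only, not the full BIRCH algorithm.

# Clustering Feature CF = (N, LS, SS) for a list of 1-D points (illustrative)
def make_cf(points):
    return (len(points), sum(points), sum(p * p for p in points))

def merge_cf(cf1, cf2):
    # CF additivity: the CF of two merged sub-clusters is the component-wise sum of their CFs
    return tuple(a + b for a, b in zip(cf1, cf2))

cf_a = make_cf([1.0, 2.0, 3.0])
cf_b = make_cf([4.0, 5.0])
n, ls, ss = merge_cf(cf_a, cf_b)
print(n, ls, ss, ls / n)        # N, linear sum, square sum, and the centroid of the merged cluster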

CURE – Clustering Using Representatives:

- Integrates hierarchical and partitioning algorithms. - Handles clusters of different shapes and sizes; Handles outliers separately. - Here a set of representative centroid points are used to represent a cluster. - These points are generated by first selecting well scattered points in a cluster and

shrinking them towards the center of the cluster by a specified fraction (shrinking factor)

- Closest pair of clusters are merged at each step of the algorithm. -


- Having more than one representative point in a cluster allows CURE to handle clusters of non-spherical shape.

- Shrinking helps to identify the outliers. - To handle large databases – CURE employs a combination of random sampling and

partitioning. - The resulting clusters from these samples are again merged to get the final cluster. - CURE Algorithm:

o Draw a random sample s. o Partition sample s into p partitions, each of size s/p. o Partially cluster the partitions into s/pq clusters, where q > 1. o Eliminate outliers by random sampling – if a cluster grows too slowly, eliminate it. o Cluster the partial clusters. o Label the data with the corresponding cluster labels.

- Advantage: o High quality clusters o Removes outliers o Produces clusters of different shapes & sizes o Scales for large database

- Disadvantage: o Needs parameters – Size of the random sample; Number of Clusters and

Shrinking factor o These parameter settings have significant effect on the results.

ROCK: - Agglomerative hierarchical clustering algorithm. - Suitable for clustering categorical attributes. - It measures the similarity of two clusters by comparing the aggregate inter-

connectivity of two clusters against a user specified static inter-connectivity model. - Inter-connectivity of two clusters C1 and C2 are defined by the number of cross links

between the two clusters. - link(pi, pj) = number of common neighbors between two points pi and pj. - Two steps:

o First construct a sparse graph from a given data similarity matrix using a similarity threshold and the concept of shared neighbors.

o Then performs a hierarchical clustering algorithm on the sparse graph. CHAMELEON – A hierarchical clustering algorithm using dynamic modeling:

- In this clustering process, two clusters are merged if the inter-connectivity and closeness (proximity) between two clusters are highly related to the internal interconnectivity and closeness of the objects within the clusters.

- This merge process produces natural and homogeneous clusters. - Applies to all types of data as long as the similarity function is specified. - This first uses a graph partitioning algorithm to cluster the data items into large

number of small sub clusters. - Then it uses an agglomerative hierarchical clustering algorithm to find the genuine

clusters by repeatedly combining the sub clusters created by the graph partitioning algorithm.

- To determine the pairs of most similar sub clusters, it considers the interconnectivity as well as the closeness of the clusters.


- In this method, objects are represented using a k-nearest neighbor graph. - Each vertex of this graph represents an object, and an edge exists between two vertices (objects) if one object is among the k most similar objects of the other. - Partition the graph by removing the edges in the sparse regions and keeping the edges

in the dense region. Each of these partitioned graph forms a cluster - Then form the final clusters by iteratively merging the clusters from the previous

cycle based on their interconnectivity and closeness. - CHAMELEON determines the similarity between each pair of clusters Ci and Cj

according to their relative inter-connectivity RI(Ci, Cj) and their relative closeness RC(Ci, Cj).

- Relative inter-connectivity: RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC{Ci}| + |EC{Cj}|) / 2 )
- where EC{Ci,Cj} = edge-cut of the cluster containing both Ci and Cj (the sum of the weights of the edges that connect Ci to Cj)
- and EC{Ci} (resp. EC{Cj}) = size of the min-cut bisector of Ci (resp. Cj).
- Relative closeness: RC(Ci, Cj) = S̄EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) · S̄EC{Ci} + (|Cj| / (|Ci| + |Cj|)) · S̄EC{Cj} )
- where S̄EC{Ci,Cj} = average weight of the edges that connect vertices in Ci to vertices in Cj
- and S̄EC{Ci} = average weight of the edges that belong to the min-cut bisector of cluster Ci.

- Advantages: o More powerful than BIRCH and CURE. o Produces arbitrary shaped clusters


- Processing cost: O(n²) in the worst case, where n = number of objects (particularly for high-dimensional data).

Review Questions

Two Marks:

1. Write about the two step process of classification. 2. Distinguish between Supervised and Unsupervised learning. 3. List down the issues regarding classification and prediction. 4. What is meant by decision tree rule induction? 5. Explain about the two tree pruning approaches. 6. Show how to extract classification rules from decision trees using suitable example. 7. Write notes on linear regression and explain how to solve the linear equation. 8. Detail on (i) Multiple regression (ii) Polynomial regression. 9. Write about (i) Generalized Linear Models (ii) Log-Linear Models 10. What is cluster Analysis? 11. What are the typical requirements of clustering in data mining? 12. Write about the types of data used in clustering analysis. 13. What are the major categories of clustering methods? 14. Write about the partitioning algorithm CLARA. 15. Write about the partitioning algorithm CLARANS.

Sixteen Marks:

1. Explain about the Decision tree induction algorithm with an example. 2. (i) Write notes on Bayes Classification. (2)

(ii) Define the Bayes Theorem with example. (4) (iii) Explain in detail about Naïve Bayesian Classifiers with suitable example. (10)

3. (i) Describe the k-means classical partitioning algorithm. (8) (ii) Describe the k-medoids / Partitioning Around Medoids (PAM) algorithm. (8)

4. (i) Describe the BIRCH hierarchical algorithm. (8) (ii) Describe the CURE hierarchical algorithm. (8)

5. (i) Describe the ROCK hierarchical algorithm. (8) (ii) Describe the CHAMELEON hierarchical algorithm. (8)


Assignment Topic:

1. Write in detail about “Other Classification Methods”.


4.1 Data Warehousing Components

What is Data Warehouse?

- Defined in many different ways but mainly it is: o A decision support database that is maintained separately from the

organization’s operational database. o Supports information processing by providing a solid platform of

consolidated, historical data for analysis. - In the broad sense, “A Data Warehouse is a subject-oriented, integrated, time-variant,

and Non-Volatile collection of data in support of management’s decision-making process”.

- Hence, Data Warehousing is a process of constructing and using data warehouses. - Data Warehouse is Subject-Oriented:

o Data organized around major subjects, such as customer, product and sales. o Focused on modeling and analysis of data for decision-makers, not on daily-

operations or transaction-processing. o Provides a simple and concise view around a particular subject by excluding

data that are not useful in the decision support process. - Data Warehouse is Integrated:

o Constructed by integrating multiple, heterogeneous data sources. Relational databases, flat files, on-line transaction records…

o Data cleaning and Data integration techniques are applied. Ensures consistency in naming conventions, encoding structures,

attribute measures etc. from different data sources. • Eg. Hotel_price: currency, tax, breakfast_covered etc.

When data is moved to the warehouse it is converted. - Data Warehouse is Time-Variant:

o The time-horizon for the data warehouse is significantly longer than that of operational systems.

Operational Database: Current value data Data Warehouse: Stores data with historical perspective (Eg. Past 5 to

10 years). o Every key structure in the data warehouse contains an element of time

explicitly or implicitly. - Data Warehouse is Non-Volatile:

o A physically separate store of data transformed from the operational environment.

o Operational update of data does not occur in the data warehouse environment. Does not require Transaction Processing, Recovery, Concurrency

control mechanisms (functions of DBMS). Requires only two operations in data accessing:

• Initial Loading of Data / Periodic Refresh of Data • Access of Data

- Data Warehouse Vs Heterogeneous DBMS: o Traditional Heterogeneous DB Integration:

Query-driven approach Build wrappers / mediators on top of heterogeneous databases When a query is passed from a client site, a meta-dictionary is used to

translate the query into queries appropriate for individual heterogeneous sites involved and the results are integrated into a global answer set.


Complex information filtering, compete for resources o Data Warehouse:

Update-driven approach Has high performance Information from heterogeneous sources are integrated in advance and

stored in warehouses for direct query and analysis. - Data Warehouse Vs Operational DBMS:

o OLTP – Online Transactional Processing: This includes the major tasks of traditional relational DBMS

(concurrency control, transaction processing, recovery etc.) Performs day-to-day operations such as purchasing, inventory, banking,

manufacturing, payroll, registration, accounting… o OLAP – Online Analytical Processing:

This is the major task of Data Warehousing System. Useful for complex data analysis and decision making.

o Distinct Features – OLTP Vs OLAP:

User and System Orientation: • OLTP – Customer-Oriented – Used by Clerks, Clients and IT

Professionals. • OLAP – Market-Oriented – Used by Managers, Executives and

Data Analysts Data Contents:

• OLTP – has current data – data that is too detailed to be used directly for decision making.

• OLAP – has historical data – data summarized and aggregated at different levels of granularity – data easy for decision making

Database Design: • OLTP – E-R Data Model - Application Oriented DB design. • OLAP – Star or Snowflake Model – Subject oriented DB

design. View:

• OLTP – Data pertaining to a department or an enterprise. • OLAP – Data pertaining to many departments of an

organization or data of many organizations stored in a single data warehouse.

Access Patterns: • OLTP – Short atomic transactions. • OLAP – Read only transactions (mostly complex queries).

Other distinguishing Features: • Database Size, Frequency of Operation and Performance

Metrics.

o Comparison between OLTP and OLAP systems:


- Why Separate Data Warehouse?:

o High performance for both systems: DBMS – Tuned for OLTP:

• access methods, indexing, concurrency control, recovery Warehouse – Tuned for OLAP:

• Complex OLAP Queries • Multi dimensional view • Consolidation

o Different Functions and Different Data: Missing Data:

• Decision support requires historical data • Historical data are not maintained by Operational DBs

Data Consolidation: • DS requires consolidation (aggregation and summarization) of

data from heterogeneous sources. Data Quality:

• Data from different sources are inconsistent • Different codes and formats has to be reconciled

4.2 Multi Dimensional Data Model

A Multi-Dimensional Data Model - From Tables and Spreadsheets to Data Cubes:

o A data warehouse is based on a multi-dimensional data model which views data in the form of a data cube.

o A data cube such as Sales allows data to be modeled and viewed in multiple dimensions.

Dimension tables such as item(item_name, brand, type) or time(day, week, month, quarter, year)

Fact tables consist of measures (such as dollars_sold) and keys to each of the related dimension tables.

o In data warehousing we have: Base Cuboid: The n-D cuboid holding the lowest level of summarization is called the base cuboid.


Apex Cuboid: Top most 0-D cuboid that has the highest-level of summarization is called the apex cuboid.

Data Cube: Lattice of cuboids forms a data cube.

- Conceptual Modeling of Data Warehouses:

o Data warehouse is modeled as dimensions and facts / measures. o There are three types of Modeling for Data Warehouses:

Star Schema: • A fact table in the middle connected to a set of dimension

tables. Snowflake Schema:

• A refinement of star schema where some dimensional table is normalized into a set of smaller dimension tables, forming a shape similar to snowflake.

Fact Constellation: • Multiple fact tables shares dimension tables. Viewed as a

collection of star schema. Hence called Galaxy schema or fact constellation.

- Cube Definition Syntax in DMQL (Data Mining Query Language): o Cube Definition (Includes Fact table definition also):

Define cube <cube_name> [<dimension_list>]: <measure_list> o Dimension Definition (Dimension Table):

Define dimension <dimension_name> as (<attribute_list>) o Special Case – Shared Dimension Tables:

Define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

- Defining Star Schema in DMQL: o Define cube sales_star [time, item, branch, location]: dollars_sold =

sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

o Define dimension time as (time_key, day, day_of_week, month, quarter, year) o Define dimension item as (item_key, item_name, brand, type, supplier_type) o Define dimension branch as (branch_key, branch_name, branch_type)


o Define dimension location as (location_key, street, city, province_or_state, country)

- Example of Star Schema:

- Example of Snowflake Schema:

- Example of Fact Constellation:

- Defining Snowflake Schema in DMQL:

o Define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

o Define dimension time as (time_key, day, day_of_week, month, quarter, year) o Define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)) o Define dimension branch as (branch_key, branch_name, branch_type) o Define dimension location as (location_key, street, city(city_key, province_or_state, country))
- Defining Fact Constellation in DMQL:

o Define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

o Define dimension time as (time_key, day, day_of_week, month, quarter, year) o Define dimension item as (item_key, item_name, brand, type, supplier) o Define dimension branch as (branch_key, branch_name, branch_type)


o Define dimension location as (location_key, street, city, province_or_state, country)

o Define cube shipping [time, item, shipper, from_location, to_location]: dollar_cost = sum(cost_in_dollars), units_shipped = count(*)

o Define dimension time as time in cube sales o Define dimension item as item in cube sales o Define dimension shipper as (shipper_key, shipper_name, location as location

in cube sales, shipper_type) o Define dimension from_location as location in cube sales o Define dimension to_location as location in cube sales

- Measures of Data Cube: Three Categories: (based on the kind of aggregate functions used)

o Distributive: If the result derived by applying the function to n aggregate values is

same as that derived by applying the function on all the data without partitioning.

Eg. Count(), Sum(), Min(), Max() o Algebraic:

If it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.

Eg. Avg(), min_N(), standard_deviation() o Holistic:

If there is no constant bound on the storage size needed to describe a subaggregate

Eg. Median(), Mode(), rank() - A Concept Hierarchy for the Dimension Location:

o Concept Hierarchy – Defines a sequence of mappings from a set of low level concepts to higher level more general concepts.

o There can be more than one concept hierarchy for a given attribute or dimension, based on different user viewpoints.

o These are automatically generated or pre-defined by domain-experts or users.



Multidimensional Data (figure): Sales volume as a function of product, month and region. Dimensions: Product, Location, Time. Hierarchical summarization paths: Industry → Category → Product; Region → Country → City → Office; Year → Quarter → Month / Week → Day.


A Sample Data Cube (figure): a 3-D sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr), Product (TV, VCR, PC) and Country (U.S.A., Canada, Mexico), together with sum cells along each dimension; one such aggregate cell holds, for example, the total annual sales of TV in the U.S.A.


- Browsing a Data Cube (figure): visualization, OLAP capabilities and interactive manipulation of the cube.

- Typical OLAP Operations:

o Roll-Up (drill-up): Summarize data By climbing up the hierarchy or by dimension reduction.

o Drill-Down (roll down): Reverse of roll-up From higher level summary to lower level summary or detailed data,

or introducing new dimensions. o Slice and Dice: Project and Select o Pivot (rotate):

Re-orient the cube, visualization, 3D to series of 2D planes. o Other operations:

Drill across: Querying more than one fact table (Using SQL). Drill through: Queries the back end relational tables through the

bottom level of the cube. (Using SQL).

- Roll-up on location (from cities to country)


- Drill-Down on time (from quarters to months)

- Slice for time = “Q2”

- Pivot
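A minimal sketch of roll-up, slice and pivot on a toy sales table using pandas (pandas is assumed to be available; the table and column names are illustrative, not taken from the figures above).

import pandas as pd

sales = pd.DataFrame({
    'city':    ['Chicago', 'Chicago', 'Toronto', 'Toronto'],
    'country': ['USA', 'USA', 'Canada', 'Canada'],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2'],
    'item':    ['TV', 'PC', 'TV', 'PC'],
    'dollars_sold': [100, 150, 80, 120],
})

# Roll-up on location: climb the hierarchy from city to country
print(sales.groupby(['country', 'quarter'])['dollars_sold'].sum())

# Slice: select the sub-cube for quarter = "Q2"
print(sales[sales['quarter'] == 'Q2'])

# Pivot: re-orient the view, items as rows and quarters as columns
print(sales.pivot_table(index='item', columns='quarter', values='dollars_sold', aggfunc='sum'))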


- A Star Net Query Model for querying multi dimensional databases: o A star net model consists of radial lines emanating from a central point where

each line represents a concept hierarchy for a dimension. o Each abstraction level in the hierarchy is called a foot print.

A Star-Net Query Model (figure): radial lines emanate from a central point, one per dimension – Customer Orders (CONTRACTS, ORDER), Shipping Method (AIR-EXPRESS, TRUCK), Time (ANNUALY, QTRLY, DAILY), Location (REGION, COUNTRY, CITY), Organization (DIVISION, DISTRICT, SALES PERSON), Product (PRODUCT GROUP, PRODUCT LINE, PRODUCT ITEM) and Promotion. Each circle (abstraction level) on a line is called a footprint.

4.3 Data Warehouse Architecture

Data Warehouse Architecture: - Design of Data Warehouse: A Business Analysis Framework:

o There are four views regarding the design of the data warehouse: Top-Down View:

• Allows the selection of relevant information necessary for the data warehouses that suits the business needs.

Data Source View: • Exposes the information being captured, stored and managed

by operational systems.


• This information may be at various levels of detail and accuracy, from individual data source table to integrated data source tables.

Data Warehouse View: • Includes Fact tables and dimension tables stored inside the data

warehouse. • This includes pre-calculated totals and counts as well as the

source information such as Date, Time to provide historical perspective.

Business Query View: • Sees the perspectives of data in the warehouse from the view of

end-user. o Designing a Data warehouse is a complex task and it requires:

Business Skills: (Business Analysts) • Understanding and translating the business requirements into

queries that the data warehouse can satisfy. Technology Skills: (Data Analysts)

• To understand how to derive facts and dimensions of the data warehouse.

• Ability to discover patterns & trends based on history and to detect anomaly

• Present relevant information as per managerial need based on such analysis

Program Management Skills: (Manager) • Interfacing with many technologies, vendors and end users so

as to deliver results in a timely and cost effective manner. o Extractors: Transfer data from operational system to the data warehouse. o Warehouse Refresh Software: Keeps the data warehouse up to date with

operational system’s data. o Building a Data warehouse requires understanding of how to store and

manage data, how to build extractors and how to build warehouse refresh software.

o Data warehouse Design / Build Process: Top-Down, Bottom-up and combination of both. Top-Down:

• Starts with overall design and planning (technology & business mature and well known)

• Minimizes integration, Expensive, takes long time, lacks flexibility

Bottom-Up: • Starts with experiments and prototypes (rapid & less

expensive) • Flexible, low cost, rapid return on investment, integration is a

problem o From software engineering point of view, design and construction of a data

warehouse consists of the steps: Planning, Requirements Study, Problem Analysis, Warehouse Design,

Data integration and testing and Deployment of a data warehouse

o Data warehouse design and construction follows two methodologies: Waterfall method:


• Structured and systematic analysis at each step before proceeding to the next.

Spiral method: • Rapid generation of increasingly functional systems, short turn

around time, modifications done quickly (well suited for data warehouses or data marts).

o Typical data warehouse design process includes the below steps: Choose a business process to model.

• Eg. Invoice, accounts, sales, inventory… • If the business process is Organizational – model data

warehouse • If the business process is departmental – model data mart

Choose the grain (atomic level of data in fact table) of the business process.

• Eg. Daily_sales or monthly_sales Choose the dimensions that will apply to each fact table record.

• Eg. Time, Item, Customer, Supplier… Choose the measure that will populate each fact table record.

• Measures are numeric additive quantities. Eg. Dollars_sold, units_sold

o Goals of Data Warehouse implementation: Specific, Achievable, Measurable Determine time, budget, organizations to be modeled, and departments

to be served. o Steps after data warehouse implementation:

Initial installation, rollout planning, training, platform upgrades and maintenance.

o Data warehouse administration includes: Data refreshment, data source synchronization, planning for disaster

recovery, access control and security management, managing data growth, managing database performance, data warehouse enhancement and extension.

- A three-tier data warehouse architecture: o Bottom Tier = Warehouse Database Server o Middle Tier = OLAP Server (Relational OLAP / Multidimensional OLAP) o Top Tier = Client – Query, analysis, reporting or data mining tools


o There are 3 data warehouse models based on the architecture:

Enterprise Warehouse: • Corporate wide data integration, spanning all subjects • One or more operational source systems; Takes years to design

and build • Cross functional in scope; Size of data – gigabytes to terabytes • Implemented on mainframes / UNIX / parallel architecture

platforms Data Mart:

• Subset of corporate wide data, spanning on selected subjects • Eg. Marketing data mart – subjects are customer, item and

sales. • Implementation takes few weeks • Implemented on low cost UNIX / Windows/NT server. • 2 categories – based on source of data:

o Independent data marts – Source from a department data

o Dependent data marts – Source is an enterprise data warehouse

Virtual Warehouse: • Set of views from the operational databases • Summary views are materialized for efficient query processing • Easy to build but requires excess capacity of operational db

servers - A recommended approach for Data Warehouse development:

o Implement a warehouse in an incremental and evolutionary manner. o First – Define a high level corporate data model (in 1 or 2 months)


o Second – independent data marts developed in parallel with enterprise data warehouse

Corporate data model refined as this development progresses. o Third – Multi-Tier-Data Warehouse constructed – Consists of enterprise data

warehouse which in turn communicates with departmental data marts.

- OLAP Server Architectures:

o Relational OLAP Servers (ROLAP): Intermediate server between Relational back-end server and Client

front-end tools. Relational / Extended relational DBMS to store and manage

warehouse data and OLAP middleware Include optimization of DBMS backend, implementation of

aggregation navigation logic, and additional tools and services. Greater scalability Eg. Metacube of Informix

o Multidimensional OLAP Servers (MOLAP): Supports multi dimensional views of data Sparse array-based multi-dimensional storage engine. Maps multidimensional views to data cube array structures Fast indexing to pre-compute summarized data Eg. Essbase of Arbor

o Hybrid OLAP Servers (HOLAP): Combines ROLAP and MOLAP architectures Benefits from high scalability of ROLAP and fast computation of

MOLAP Eg. Microsoft SQL Server

o Specialized SQL Servers: OLAP processing in relational databases (read only environment) Advanced query language and query processing support.


4.4 Data Warehouse Implementation

Data warehouse Implementation

- It is important for a data warehouse system to be implemented with: o Highly Efficient Computation of Data cubes o Access Methods; Query Processing Techniques

- Efficient Computation of Data cubes:

o That is, efficient computation of aggregations across many sets of dimensions.
o The compute cube operator and its implementations:
   Extends SQL to include a compute cube operator.
   Create a data cube for the dimensions item, city, year and the measure sales_in_dollars:
   • Example queries to analyze the data:
      o Compute the sum of sales, grouping by item and city
      o Compute the sum of sales, grouping by item
      o Compute the sum of sales, grouping by city
   • Here the dimensions are item, city and year; the measure / fact is sales_in_dollars.
   • Hence the total number of cuboids or group-bys for this data cube is 2^3 = 8.
   • The possible group-bys are {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}; these group-bys form the lattice of cuboids.
   • The 0-D (apex) cuboid is (); the 3-D (base) cuboid is (city, item, year).
o Hence, for a cube with n dimensions there are 2^n cuboids in total.
o The statement ‘compute cube sales’ computes the sales aggregate cuboids for all eight subsets.
o Pre-computation of cuboids leads to faster response time and avoids redundant computation.
o But the challenge in pre-computation is that the required storage space may explode.
o Number of cuboids in an n-dimensional data cube with no concept hierarchy attached to any dimension = 2^n.
o Now consider that the time dimension has a concept hierarchy (e.g. “day < month < quarter < year”).
o Then the total number of cuboids is T = ∏ (i = 1 to n) (Li + 1), where Li is the number of levels associated with dimension i.
o Eg. If a cube has 10 dimensions and each dimension has 4 levels, the total number of cuboids generated will be 5^10 ≈ 9.8 × 10^6 (see the sketch below).
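A minimal Python sketch of these cuboid counts; the dimension names and level counts simply restate the examples above:

from itertools import combinations
from math import prod

# Enumerate all group-bys (cuboids) of the 3-dimensional sales cube.
dims = ["city", "item", "year"]
cuboids = [combo for r in range(len(dims) + 1) for combo in combinations(dims, r)]
print(len(cuboids))                      # 2**3 = 8, from the apex () to the base cuboid

# With concept hierarchies, dimension i having Li levels gives prod(Li + 1) cuboids.
levels = [4] * 10                        # 10 dimensions, 4 levels each
print(prod(L + 1 for L in levels))       # 5**10 = 9,765,625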


o This shows it is unrealistic to pre-compute and materialize all cuboids for a data cube.

- Hence we go for partial materialization. There are three choices of materialization:
   o No Materialization: pre-compute only the base cuboid and no other cuboids; leads to slow computation.
   o Full Materialization: pre-compute all cuboids; requires huge storage space.
   o Partial Materialization: pre-compute a proper subset of the whole set of cuboids. It considers three factors:
      • Identify the cuboids to materialize – based on workload, query frequency, access cost, storage need, cost of update and index usage (or simply use a greedy algorithm, which has good performance; a minimal sketch follows this list).
      • Exploit the materialized cuboids during query processing.
      • Update the materialized cuboids during load and refresh (using parallelism and incremental updates).
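A minimal Python sketch of the greedy idea, assuming illustrative cuboid sizes (estimated row counts); a real selector would also weigh query frequency, update cost and index usage:

# Hypothetical row-count estimates for the cuboids of a 3-dimensional cube.
sizes = {("city", "item", "year"): 1_000_000, ("city", "item"): 300_000,
         ("city", "year"): 50_000, ("item", "year"): 200_000,
         ("city",): 500, ("item",): 10_000, ("year",): 10, (): 1}
base = ("city", "item", "year")

def cost(q, materialized):
    # A group-by q is answered from the smallest materialized cuboid containing q.
    return min(sizes[c] for c in materialized if set(q) <= set(c))

def greedy(k):
    chosen = [base]                      # the base cuboid is always kept
    for _ in range(k):
        def benefit(v):
            # Total query-cost saving if cuboid v is also materialized.
            return sum(cost(q, chosen) - cost(q, chosen + [v]) for q in sizes)
        best = max((c for c in sizes if c not in chosen), key=benefit)
        chosen.append(best)
    return chosen

print(greedy(2))   # picks the cuboids that cut the estimated total query cost the most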

- Multiway array aggregation in the computation of data cubes:

o To ensure fast online analytical processing we may need to go for full materialization.
o But we should consider the amount of main memory available and the time taken for computation.
o ROLAP and MOLAP use different cube computation techniques.
o Optimization techniques for ROLAP cube computation:

Sorting, hashing and grouping operations applied to dimension attributes – to reorder and cluster tuples.

Grouping performed on some sub aggregates – ‘Partial grouping step’ – to speed up computations

Aggregates computed from sub aggregates (rather than from base tables).

In ROLAP dimension values are accessed by using value-based / key-based addressing search strategies.

o Optimization techniques for MOLAP cube computation:
   MOLAP uses direct array addressing to access dimension values.
   Partition the array into chunks (sub-cubes small enough to fit into main memory).
   Compute aggregates by visiting cube cells; the number of times each cell is revisited is minimized, to reduce memory access and storage costs.
   This is called multiway array aggregation in data cube computation.
o MOLAP cube computation is therefore faster than ROLAP cube computation.

- Indexing OLAP data: bitmap indexing and join indexing.
- Bitmap Indexing:
   o An index on a particular column; each distinct value in the column has a bit vector.
   o The length of each bit vector = the number of records in the base table.
   o The i-th bit is set if the i-th row of the base table has that value for the indexed column.
   o This approach is not suitable for high-cardinality domains, as shown in the sketch below.
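A minimal Python sketch of a bitmap index on a single low-cardinality column; the column values are illustrative:

rows = ["Chennai", "Madurai", "Chennai", "Trichy", "Madurai"]   # city column, 5 records

bitmap = {}
for i, value in enumerate(rows):
    bits = bitmap.setdefault(value, [0] * len(rows))
    bits[i] = 1                     # i-th bit set when the i-th row holds this value

print(bitmap["Chennai"])            # [1, 0, 1, 0, 0]
# AND/OR of bit vectors answers selections with cheap bit operations, but one
# vector per distinct value makes this unsuitable for high-cardinality columns.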


- Join Indexing:

o Registers joinable rows of two relations

o Consider relations R and S. Let R(RID, A) and S(SID, B), where RID and SID are the record identifiers of R and S respectively.

o For joining the attributes A & B the join index record contains the pair (RID, SID).

o Hence, in traditional databases, the join index maps attribute values to a list of record ids.
o But in data warehouses the join index relates the values of the dimensions of a star schema to rows in the fact table.

o Eg. Fact table: Sales and two dimensions city and product

o A join index on city maintains for each distinct city a list of R-IDs of the tuples of the Fact table Sales

o Join indices can span multiple dimensions – composite join indices.
o To speed up query processing, join indexing and bitmap indexing can be integrated to form bitmapped join indices.
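A minimal Python sketch of a join index relating each city value to the RIDs of the joinable Sales fact rows; the record identifiers and rows are illustrative:

sales_fact = [                       # (RID, city, product, dollars)
    ("T57", "Chennai", "Sony-TV", 120.0),
    ("T238", "Chennai", "LG-TV", 95.0),
    ("T884", "Madurai", "Sony-TV", 110.0),
]

join_index_city = {}
for rid, city, product, dollars in sales_fact:
    join_index_city.setdefault(city, []).append(rid)

print(join_index_city["Chennai"])    # ['T57', 'T238'] – fact rows joinable with city = Chennai
# A composite (bitmapped) join index would keep one such structure per combination
# of dimensions, with bit vectors in place of RID lists.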


- Efficient processing of OLAP queries – steps:
   o 1. Determine which OLAP operations should be performed on the available cuboids:
      Transform the OLAP operations (drill-down, roll-up, …) into their corresponding SQL (relational algebra) operations.
      Eg. Dice = Selection + Projection.
   o 2. Determine to which materialized cuboids the relevant OLAP operations should be applied. This involves:
      (i) pruning cuboids using knowledge of “dominance”,
      (ii) estimating the cost of the remaining materialized cuboids, and
      (iii) selecting the cuboid with the least cost.
      Eg. Cube: “Sales [time, item, location]: sum(sales_in_dollars)”. The dimension hierarchies used are:
         • “day < month < quarter < year” for the time dimension
         • “item_name < brand < type” for the item dimension
         • “street < city < state < country” for the location dimension
      Say the query to be processed is on {brand, state} with the condition year = “1997”.
      Say there are four materialized cuboids available:
         • Cuboid 1: {item_name, city, year}
         • Cuboid 2: {brand, country, year}
         • Cuboid 3: {brand, state, year}
         • Cuboid 4: {item_name, state} where year = 1997
      Which cuboid should be selected for query processing?
      Step 1: Prune cuboids – prune Cuboid 2, because the higher-level concept “country” cannot answer a query at the lower granularity “state”.
      Step 2: Estimate cuboid cost – Cuboid 1 costs the most of the three remaining cuboids, because item_name and city are at a finer granularity than the brand and state mentioned in the query.
      Step 3: If there are few years and many item_names under each brand, then Cuboid 3 has the least cost. Otherwise, if there are efficient indexes on item_name, Cuboid 4 has the least cost. Hence select Cuboid 3 or Cuboid 4 accordingly. A minimal pruning/costing sketch follows.
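A minimal Python sketch of the pruning and costing steps in this example. The row-count estimates are illustrative assumptions, and the simplified derivability check ignores the pre-applied year = 1997 selection on Cuboid 4 (a fuller cost model would keep it as a candidate):

hierarchy = {                     # attribute -> level in its hierarchy (finer = lower number)
    "item_name": 1, "brand": 2, "type": 3,
    "street": 1, "city": 2, "state": 3, "country": 4,
    "day": 1, "month": 2, "quarter": 3, "year": 4,
}
dim_of = {"item_name": "item", "brand": "item", "type": "item",
          "street": "loc", "city": "loc", "state": "loc", "country": "loc",
          "day": "time", "month": "time", "quarter": "time", "year": "time"}

def can_answer(cuboid, query):
    # Every query attribute must be derivable from a same-dimension attribute
    # at the same or a finer level in the cuboid.
    for q in query:
        if not any(dim_of[a] == dim_of[q] and hierarchy[a] <= hierarchy[q] for a in cuboid):
            return False
    return True

cuboids = {                       # name -> (attributes, hypothetical row estimate)
    "cuboid1": (("item_name", "city", "year"), 5_000_000),
    "cuboid2": (("brand", "country", "year"), 30_000),
    "cuboid3": (("brand", "state", "year"), 200_000),
    "cuboid4": (("item_name", "state"), 1_200_000),   # pre-sliced on year = 1997 (ignored here)
}
query = ("brand", "state", "year")

usable = {name: rows for name, (attrs, rows) in cuboids.items() if can_answer(attrs, query)}
print(min(usable, key=usable.get))     # cheapest usable cuboid by row estimate, e.g. cuboid3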

- Metadata repository:
   o Metadata is the data defining warehouse objects. The repository stores:
      Description of the structure of the data warehouse:
         • Schema, views, dimensions, hierarchies, derived-data definitions, data mart locations and contents
      Operational metadata:
         • Data lineage: history of migrated data and its transformation path
         • Currency of data: active, archived or purged
         • Monitoring information: warehouse usage statistics, error reports, audit trails
      Algorithms used for summarization:
         • Measure and dimension definition algorithms


         • Granularity, subject and partition definitions
         • Aggregation, summarization and pre-defined queries and reports
      Mapping from the operational environment to the data warehouse:
         • Source database information; data refresh and purging rules
         • Data extraction, cleaning and transformation rules
         • Security rules (authorization and access control)
      Data related to system performance:
         • Data access and retrieval performance
         • Rules for timing and scheduling of refresh
      Business metadata:
         • Business terms and definitions
         • Data ownership information; data charging policies

- Data Warehouse Back-end Tools and Utilities:
   o Data Extraction: get data from multiple, heterogeneous and external sources
   o Data Cleaning: detect errors in the data and rectify them when possible
   o Data Transformation: convert data from legacy or host format to warehouse format
   o Load: sort, summarize, consolidate, compute views, check integrity, build indices and partitions
   o Refresh: propagate the updates from the data sources to the warehouse

4.5 Mapping the Data Warehouse to Multiprocessor Architecture

From Data Warehousing to Data Mining:
- Data Warehousing Usage:
   o Data warehouses and data marts are used in a wide range of applications.
   o Used in feedback systems for enterprise management – the “plan-execute-assess” loop.
   o Applied in banking, finance, retail, manufacturing, …
   o Data warehouses are used for knowledge discovery and strategic decision making with data mining tools.
   o There are three kinds of data warehouse applications:
      Information Processing:
         • Supports querying and basic statistical analysis
         • Reporting using cross-tabs, tables, charts and graphs
      Analytical Processing:
         • Multidimensional analysis of data warehouse data
         • Supports basic OLAP operations: slice-dice, drilling and pivoting
      Data Mining:
         • Knowledge discovery of hidden patterns
         • Supports association, classification & prediction, and clustering
         • Constructs analytical models
         • Presents mining results using visualization tools


- From Online Analytical Processing (OLAP) to Online Analytical Mining (OLAM):

o OLAM, also called OLAP mining, integrates OLAP with data mining techniques.
o Why OLAM?
   High quality of data in data warehouses:
      • A data warehouse holds cleaned, transformed and integrated (pre-processed) data.
      • Data mining tools need such costly preprocessing of data.
      • Thus the data warehouse serves as a valuable, high-quality data source for OLAP as well as for data mining.
   Available information processing infrastructure surrounding data warehouses:
      • Includes accessing, integration, consolidation and transformation of multiple heterogeneous databases; ODBC/OLEDB connections;
      • Web accessing and servicing facilities; reporting and OLAP analysis tools.
   OLAP-based exploratory data analysis:
      • OLAM provides facilities for data mining on different subsets of data and at different levels of abstraction.
      • Eg. drill-down, pivoting, roll-up, slicing and dicing on OLAP data and on intermediate data mining results.
      • Enhances the power of exploratory data mining by the use of visualization tools.
   On-line selection of data mining functions:
      • OLAM provides the flexibility to select desired data mining functions and to swap data mining tasks dynamically.

Architecture of Online Analytical Mining:

[Figure: An OLAM system architecture with four layers – Layer 1: data repository (databases and the data warehouse, populated through data cleaning, data integration and filtering, accessed via a Database API); Layer 2: multidimensional database (MDDB) with metadata; Layer 3: OLAP engine and OLAM engine, which work on the MDDB through a Data Cube API; Layer 4: user interface, where mining queries are issued and mining results returned through a user GUI API.]

- OLAP and OLAM engines accept on-line queries via User GUI API


- They work with the data cube in data analysis via a Data Cube API.
- A metadata directory is used to guide the access of the data cube.
- The MDDB is constructed by integrating multiple databases or by filtering a data warehouse via a Database API, which may support ODBC/OLEDB connections.
- The OLAM engine consists of multiple data mining modules, and hence is more sophisticated than the OLAP engine.
- Data mining should be a human-centered process – users should often interact with the system to perform exploratory data analysis.

4.6 OLAP Need

OLAP systems vary quite a lot, and they have generally been distinguished by a letter tagged onto the front of the word OLAP. ROLAP and MOLAP are the big players, and the other distinctions represent little more than the marketing programs on the part of the vendors to distinguish themselves, for example, SOLAP and DOLAP. Here, we aim to give you a hint as to what these distinctions mean.

4.7 Categorization of OLAP Tools

Major Types:

Relational OLAP (ROLAP) – Star Schema based
Considered the fastest-growing style of OLAP technology, ROLAP or “Relational” OLAP systems work primarily from the data that resides in a relational database, where the base data and dimension tables are stored as relational tables. This model permits multidimensional analysis of data, as it enables users to perform a function equivalent to the traditional OLAP slicing and dicing feature. This is achieved through the use of any SQL reporting tool to extract or ‘query’ data directly from the data warehouse, wherein specifying a ‘WHERE’ clause amounts to performing a certain slice-and-dice action. One advantage of ROLAP over the other styles of OLAP analytic tools is that it is deemed more scalable in handling huge amounts of data. ROLAP sits on top of relational databases, enabling it to leverage several functionalities that a relational database is capable of. Another gain of a ROLAP tool is that it is efficient in managing both numeric and textual data. It also permits users to “drill down” to the leaf details, or the lowest level of a hierarchy structure. However, ROLAP applications display slower performance compared to other styles of OLAP tools since, oftentimes, calculations are performed inside the server. Another demerit of a ROLAP tool is that, since it depends on SQL for data manipulation, it may not be ideal for calculations that do not translate easily into an SQL query.

Multidimensional OLAP (MOLAP) – Cube based
Multidimensional OLAP, with the popular acronym MOLAP, is widely regarded as the classic form of OLAP. One of the major distinctions of MOLAP against a ROLAP tool is that data are pre-summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In this type of model, data are structured into proprietary formats in accordance with a client’s reporting requirements, with the calculations pre-generated on the cubes.


This is probably, by far, the best OLAP tool for producing analysis reports, since it enables users to easily reorganize or rotate the cube structure to view different aspects of the data. This is done by way of slicing and dicing. MOLAP analytic tools are also capable of performing complex calculations. Since calculations are predefined upon cube creation, this results in faster return of computed data. MOLAP systems also give users the ability to quickly write data back into a data set. Moreover, in comparison to ROLAP, MOLAP is considerably less heavy on hardware due to compression techniques. In a nutshell, MOLAP is optimized for fast query performance and retrieval of summarized information. There are certain limitations to the implementation of a MOLAP system; one primary weakness is that a MOLAP tool is less scalable than a ROLAP tool, as it is capable of handling only a limited amount of data. The MOLAP approach also introduces data redundancy. Certain MOLAP products also encounter difficulty in updating models whose dimensions have very high cardinality.

Hybrid OLAP (HOLAP)
HOLAP is the product of the attempt to incorporate the best features of MOLAP and ROLAP into a single architecture. This kind of tool tries to bridge the technology gap of both products by enabling access to both multidimensional database (MDDB) and relational database management system (RDBMS) data stores. HOLAP systems store larger quantities of detailed data in the relational tables, while the aggregations are stored in the pre-calculated cubes. HOLAP also has the capacity to “drill through” from the cube down to the relational tables for delineated data. Some of the advantages of this system are better scalability, quick data processing and flexibility in accessing data sources.

Other Types:
There are also less popular types of OLAP styles upon which one could stumble every so often. Some of the less famous types existing in the OLAP industry are listed below.

Web OLAP (WOLAP)
Simply put, Web OLAP, which is likewise referred to as Web-enabled OLAP, pertains to OLAP applications that are accessible via a web browser. Unlike traditional client/server OLAP applications, WOLAP is considered to have a three-tiered architecture consisting of three components: a client, a middleware and a database server. Probably the most appealing features of this style of OLAP are the considerably lower investment involved, the enhanced accessibility (a user only needs an internet connection and a web browser to connect to the data), and the ease of installation, configuration and deployment. But despite all of its unique features, it still cannot compare to a conventional client/server setup; currently, it is inferior to OLAP applications deployed on client machines in terms of functionality, visual appeal and performance.


Desktop OLAP (DOLAP)
Desktop OLAP, or “DOLAP”, is based on the idea that a user can download a section of the data from the database or source and work with that dataset locally, on their desktop. DOLAP is easier to deploy and cheaper, but comes with very limited functionality in comparison with other OLAP applications.

Mobile OLAP (MOLAP)
Mobile OLAP merely refers to OLAP functionality on a wireless or mobile device. It enables users to access and work on OLAP data and applications remotely through their mobile devices.

Spatial OLAP (SOLAP)
With the aim of integrating the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user interface, “SOLAP” or Spatial OLAP emerged. SOLAP is designed to facilitate the management of both spatial and non-spatial data, as data can come not only in alphanumeric form, but also as images and vectors. This technology provides easy and quick exploration of data residing in a spatial database. Other blends of OLAP products, such as the less popular ‘DOLAP’ and ‘ROLAP’ (here standing for Database OLAP and Remote OLAP), ‘LOLAP’ (Local OLAP) and ‘RTOLAP’ (Real-Time OLAP), exist but have barely made a noise in the OLAP industry.


Review Questions

Two Marks:

1. What is a data warehouse?
2. Compare a data warehouse with a heterogeneous DBMS.
3. Distinguish between OLTP and OLAP systems.
4. Why do we need a data warehouse?
5. Write about the three types of modeling for data warehouses.
6. Define a star schema using DMQL.
7. Define a snowflake schema using DMQL.
8. Define a fact constellation schema using DMQL.
9. What are the three measures of a data cube?
10. What is a (i) base cuboid (ii) apex cuboid (iii) data cube? Give examples.
11. What is a concept hierarchy? Explain with an example.
12. What is a starnet query model? Depict using a diagram.
13. Detail the available OLAP server architectures.
14. Explain the three choices of materialization.

Sixteen Marks:

1. (i) Compare the features of OLTP and OLAP systems. (6)
   (ii) Explain the various OLAP operations with examples. (10)

2. Explain in detail the data warehousing architecture with a suitable architecture diagram. (16)

3. (i) Detail how OLAP data is indexed using bitmap indexing and join indexing. (4)
   (ii) Discuss the steps for efficient OLAP query processing. (4)
   (iii) Write about the metadata repository. (4)
   (iv) Write notes on data warehouse tools and utilities. (4)

4. (i) Write notes on data warehousing usage. (4)
   (ii) Why do we go from OLAP to OLAM? (6)
   (iii) Discuss the cube computation techniques used by ROLAP and MOLAP. (6)

Assignment Topic:

1. Write about OLAP Need and Categorization of OLAP Tools.


5.1 Applications of Data Mining

A wide range of companies have deployed successful applications of data mining. While early adopters of this technology have tended to be in information-intensive industries such as financial services and direct mail marketing, the technology is applicable to any company looking to leverage a large data warehouse to better manage their customer relationships. Two critical factors for success with data mining are: a large, well-integrated data warehouse and a well-defined understanding of the business process within which data mining is to be applied (such as customer prospecting, retention, campaign management, and so on). Some successful application areas include:

• A pharmaceutical company can analyze its recent sales force activity and their results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.

• A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.

• A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.

• A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach their target customer segments.

Each of these examples has a clear common ground. They leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them. Data mining has a number of applications. The first is called market segmentation. With market segmentation, you will be able to find behaviors that are common among your customers; you can look for patterns among customers that seem to purchase the same products at the same time. Another application of data mining is called customer churn. Customer churn analysis will allow you to estimate which customers are the most likely to stop purchasing your products or services and go to one of your competitors. In addition to this, a


company can use data mining to find out which purchases are the most likely to be fraudulent. For example, by using data mining a retail store may be able to determine which products are stolen the most. By finding out which products are stolen the most, steps can be taken to protect those products and detect those who are stealing them. While direct mail marketing is an older technique that has been used for many years, companies who combine it with data mining can experience fantastic results. For example, you can use data mining to find out which customers will respond favorably to a direct mail marketing strategy. You can also use data mining to determine the effectiveness of interactive marketing. Some of your customers will be more likely to purchase your products online than offline, and you must identify them. While many businesses use data mining to help increase their profits, many of them don't realize that it can be used to create new businesses and industries. One industry that can be created by data mining is the automatic prediction of both behaviors and trends. Imagine for a moment that you were the owner of a fashion company, and you were able to precisely predict the next big fashion trend based on the behavior and shopping patterns of your customers? It is easy to see that you could become very wealthy within a short period of time. You would have an advantage over your competitors. Instead of simply guessing what the next big trend will be, you will determine it based on statistics, patterns, and logic. Another example of automatic prediction is to use data mining to look at your past marketing strategies. Which one worked the best? Why did it work the best? Who were the customers that responded most favorably to it? Data mining will allow you to answer these questions, and once you have the answers, you will be able to avoid making any mistakes that you made in your previous marketing campaign. Data mining can allow you to become better at what you do. It is also a powerful tool for those who deal with finances. A financial institution such as a bank can predict the number of defaults that will occur among their customers within a given period of time, and they can also predict the amount of fraud that will occur as well. Another potential application of data mining is the automatic recognition of patterns that were not previously known. Imagine if you had a tool that could automatically search your database to look for patterns which are hidden. If you had access to this technology, you would be able to find relationships that could allow you to make strategic decisions. Data mining is becoming a pervasive technology in activities as diverse as using historical data to predict the success of a marketing campaign, looking for patterns in financial transactions to discover illegal activities or analyzing genome sequences. From this perspective, it was just a matter of time for the discipline to reach the important area of computer security. Applications of Data Mining in Computer Security presents a collection of research efforts on the use of data mining in computer security. Data mining has been loosely defined as the process of extracting information from large amounts of data. In the context of security, the information we are seeking is the knowledge of whether a security breach has been experienced, and if the answer is yes, who is the perpetrator. 
This information could be collected in the context of discovering intrusions that aim to breach the privacy of services, data in a computer system or alternatively, in the context of discovering evidence left in a computer system as part of criminal activity.


Applications of Data Mining in Computer Security concentrates heavily on the use of data mining in the area of intrusion detection. The reason for this is twofold. First, the volume of data dealing with both network and host activity is so large that it is an ideal candidate for data mining techniques. Second, intrusion detection is an extremely critical activity. This book also addresses the application of data mining to computer forensics, a crucial area that seeks to address the needs of law enforcement in analyzing digital evidence. Applications of Data Mining in Computer Security is designed to meet the needs of a professional audience composed of researchers and practitioners in industry and graduate-level students in computer science.

5.2 Social Impacts of Data Mining

Data mining can offer the individual many benefits by improving customer service and satisfaction, and lifestyle in general. However, it also has serious implications regarding one's right to privacy and data security.

Is Data Mining Hype or a Persistent, Steadily Growing Business?
Data mining has recently become a very popular area for research, development and business, as it is becoming an essential tool for deriving knowledge from data to help business people in the decision-making process. The phases of data mining technology adoption are as follows:

• Innovators
• Early Adopters
• Chasm
• Early Majority
• Late Majority
• Laggards

Is Data Mining Merely Managers' Business or Everyone's Business?
Data mining will surely help company executives a great deal in understanding the market and their business. However, one can expect that everyone will have a need for, and the means of, data mining, as more and more powerful, user-friendly, diversified and affordable data mining systems and components become available. Data mining can also have multiple personal uses, such as:
   - Identifying patterns in medical applications
   - Choosing the best companies based on customer service
   - Classifying email messages, etc.

Is Data Mining a Threat to Privacy and Data Security?


With more and more information accessible in electronic form and available on the web, and with increasingly powerful data mining tools being developed and put into use, there are increasing concerns that data mining may pose a threat to our privacy and data security.

Data Privacy: In 1980, the Organisation for Economic Co-operation and Development (OECD) established a set of international guidelines, referred to as fair information practices. These guidelines aim to protect privacy and data accuracy. They include the following principles:

• Purpose specification and use limitation
• Openness
• Security safeguards
• Individual participation

Data Security: Many data-security-enhancing techniques have been developed to help protect data. Databases can employ a multilevel security model to classify and restrict data according to various security levels, with users permitted access only to their authorized level. Some of the data security techniques are:
   - Encryption techniques
   - Intrusion detection
   - Secure multiparty computation
   - Data obscuration

5.3 Tools

Data Mining Tools: 1. Auto Class III:

Auto Class is an unsupervised Bayesian Classification System for independent data. 2. Business Miner:

Business Miner is a single strategy easy to use tool based on decision trees. 3. CART:

CART is a robust data mining tool that automatically searches for important patterns and relationships in large data sets.

4. Clementine: It finds sequence association and clustering for web data analysis.

5. Data Engine: Data Engine is a multiple strategy data mining tool for data modeling, combining conventional data analysis methods with fuzzy technology.

6. DB Miner: DB Miner is a publicly available tool for data mining. It is a multiple-strategy tool, and it supports clustering and association rules.

7. Delta Miner:


Delta Miner is a multiple-strategy tool supporting clustering, summarization, deviation detection and visualization.

8. IBM Intelligent Miner: Intelligent Miner is an integrated and comprehensive set of data mining tools. It uses decision trees, neural networks and clustering.

9. Mine Set: Mine Set is a comprehensive tool for data mining. Its features include extensive data manipulation and transformation.

10. SPIRIT: SPIRIT is a tool for exploration and modeling using Bayesian techniques.

11. WEKA: WEKA is a software environment that integrates several machine learning tools within a common framework and a uniform GUI.

5.4 An Introduction to DB Miner

A data mining system, DB Miner, has been developed for interactive mining of multiple-level knowledge in large relational databases. The system implements a wide spectrum of data mining functions, including generalization, characterization, association, classification and prediction.
Introduction: With the upsurge of research and development activities on knowledge discovery in databases, a data mining system, DB Miner, has been developed based on our studies of data mining techniques and our experience in the development of an early system prototype, DBLearn. The system has the following distinct features:

1. It incorporates several interesting data mining techniques, including attribute-oriented induction, statistical analysis, progressive deepening for mining multiple level rules and meta-rule guided knowledge mining.

2. It performs interactive data mining at multiple concept levels on any user-specified set of data in a database, using an SQL-like data mining query language, DMQL, or a GUI.

3. Efficient implementation techniques have been explored using different data structures, including generalized relations and multiple-dimensional data cubes.

4. The data mining process may utilize user or expert defined set-grouping or schema level concept hierarchies which can be specified flexibly, adjusted dynamically based on data distribution and generated automatically for numerical attributes.

5. Both UNIX and PC (Windows/NT) versions of the system adopt a client/server architecture. The latter may communicate with various commercial database systems for data mining using ODBC technology.

Architecture and Functionalities: The general architecture of DB Miner, shown in Figure A1, tightly integrates a relational database system, such as a Sybase SQL Server, with a concept hierarchy module and a set of knowledge discovery modules. Graphical User Interface:


The knowledge discovery modules of DB Miner include a characterizer, discriminator, classifier, association rule finder, meta-rule-guided miner, predictor, evolution evaluator, deviation evaluator and some planned future modules. The functionalities of the knowledge discovery modules are briefly described as follows: The characterizer generalizes a set of task-relevant data into a generalized relation, which can then be used for extraction of different kinds of rules to be viewed at multiple concept levels from different angles. The discriminator discovers a set of discriminant rules which summarize the features that distinguish the class being examined from other classes. The association rule finder discovers a set of association rules at multiple concept levels from the relevant sets of data in a database. The meta-rule-guided miner is a data mining mechanism which takes a user-specified meta-rule form as a pattern to confine the search for desired rules. The predictor predicts the possible values of some missing data or the value distribution of certain attributes in a set of objects. The data evolution evaluator evaluates the data evolution regularities for certain objects whose behavior changes over time. The deviation evaluator evaluates the deviation patterns for a set of task-relevant data in a database. Another important functional module of DB Miner is the concept hierarchy, which provides essential background knowledge for data generalization and multiple-level data mining.

5.5 Case Studies

Data mining is the process of discovering previously unknown, actionable and profitable information from large consolidated databases and using it to support tactical and strategic business decisions. The statistical techniques of data mining are familiar. They include linear and logistic regression, multivariate analysis, principal components analysis, decision trees and neural networks. Traditional approaches to statistical inference fail with large databases, however, because with thousands or millions of cases and hundreds or thousands of variables there will be a high level of redundancy among the variables, there will be spurious relationships, and even the weakest relationships will be highly significant by any statistical test. The objective is to build a model with significant predictive power. It is not enough just to find which relationships are statistically significant. Consider a campaign offering a product or service for sale, directed at a given customer base. Typically, about 1% of the customer base will be "responders," customers who will purchase the product or service if it is offered to them. A mailing to 100,000 randomly-chosen customers will therefore generate about 1000 sales. Data mining techniques enable customer


relationship marketing, by identifying which customers are most likely to respond to the campaign. If the response can be raised from 1% to, say, 1.5% of the customers contacted (the "lift value"), then 1000 sales could be achieved with only 66,666 mailings, reducing the cost of mailing by one-third.

Case Study: Data Mining the Northridge Earthquake
The data collected during the Northridge, California earthquake occupied several warehouses, and ranged from magnetic media to bound copies of printed reports. Nautilus Systems personnel sorted, organized, and cataloged the materials. Documents were scanned and converted to text. Data were organized chronologically and according to situation reports, raw data, agency data, and agency reports. For example, the Department of Transportation had information on highways, street structures, airport structures, and related damage assessments. Nautilus Systems applied its proprietary data mining techniques to extract and refine data. Geography was used to link related information, and text searches were used to group information tagged with specific names (e.g., Oakland Bay Bridge, San Mateo, Marina). The refined data were further analyzed to detect patterns, trends, associations and factors not readily apparent. At that time, there was not a seismographic timeline, but it was possible to map the disaster track to analyze the migration of damage based upon geographic location. Many types of analyses were done. For example, the severity of damage was analyzed according to type of physical structure, pre- versus post-1970 earthquake building codes, and off-track versus on-track damage. It was clear that the earthquake building codes limited the degree of damage. Nautilus Systems also looked at the data coming into the command and control center. The volume of data was so great that a lot was filtered out before it got to the decision support level. This demonstrated the need for a management system to build intermediate decision blocks and communicate the information where it was needed. Much of the information needed was also geographic in nature. There was no ability to generate accurate maps for response personnel, both route maps including blocked streets and maps defining disaster boundaries. There were no interoperable communications between local police, the fire department, utility companies, and the disaster field office. There were also no predefined rules of engagement between FEMA and local resources, resulting in delayed response (including such critical areas as firefighting).

Benefits
Nautilus Systems identified recurring data elements, data relationships and metadata, and assisted in the construction of the Emergency Information Management System (EIMS). The EIMS facilitates rapid building and maintenance of disaster operations plans, and provides consistent, integrated command (decision support), control (logistics management), and communication (information dissemination) throughout all phases of disaster management. Its remote GIS capability provides the ability to support multiple disasters with a central GIS team, conserving scarce resources.

5.6 Mining WWW


Mining the World Wide Web:
The WWW is a huge, widely distributed, global information source for:
   - Information services – news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
   - Hyperlink information
   - Access and usage information
   - Web-site contents and organization
The Web is growing and changing very rapidly, and serves a broad diversity of user communities. Only a small portion of the information on the web is truly relevant or useful to web users.
   - How do we find high-quality Web pages on a specified topic?
   - The WWW provides rich sources for data mining.
Challenges of WWW interactions:
   - Creating knowledge from the information available
   - Personalization of the information
   - Learning about customers / individual users
   - Finding relevant information
Searches are made for:
   - Web access patterns
   - Web structures
   - Regularity and dynamics of Web contents
Problems:
   - The “abundance” problem
   - Limited coverage of the Web – hidden Web sources; the majority of data is in DBMSs
   - Limited query interfaces, based on keyword-oriented search
   - Limited customization to individual users
   - The dynamic and semi-structured nature of the Web
Search Engines:

   - Index-based: search the web, index web pages, and build and store huge keyword-based indices.
   - Help locate sets of Web pages containing certain keywords.
   - Deficiencies:
      - A topic of any breadth may easily contain hundreds of thousands of documents.
      - Many documents that are highly relevant to a topic may not contain the keywords defining them.
Web Mining Subtasks:
   Resource Finding
      - The task of retrieving intended web documents.
   Information Selection and Pre-processing
      - Automatic selection and pre-processing of specific information from retrieved web resources.
   Generalization
      - Automatic discovery of patterns in web sites.


   Analysis
      - Validation and / or interpretation of the mined patterns.
Web Content Mining:
   Discovery of useful information from web contents / data / documents.
   - Web data contents: text, image, audio, video, metadata and hyperlinks.
   Information Retrieval View:
      - Assist / improve information finding.
      - Filter information to users based on user profiles.
   Database View:
      - Model the data on the web and integrate it for more sophisticated queries.
Web Structure Mining:
   Discover the link structure of the hyperlinks at the inter-document level, to generate a structural summary about the web site and its web pages.
   - Direction 1: based on the hyperlinks, categorize the web pages and generate related information.
   - Direction 2: discover the structure of the Web document itself.
   - Direction 3: discover the nature of the Web site.
   Finding authoritative web pages:
      - Retrieve pages that are not only relevant, but also of high quality, or authoritative on the topic.
      - Hyperlinks can infer the notion of authority:
         - The Web consists not only of pages, but also of hyperlinks pointing from one page to another.
         - These hyperlinks contain an enormous amount of latent human annotation.
         - A hyperlink pointing to another web page can be considered the author’s endorsement of that page.
Web Usage Mining:
   Web usage mining is also known as Web log mining.
   Mining techniques are used to discover interesting usage patterns from the secondary data derived from the interactions of users while surfing the web.
   Techniques for Web Usage Mining:
      - Construct a multidimensional view on the Web log database.
      - Perform data mining on the Web log records.
      - Conduct studies to analyze system performance.
   Design of a Web Log Miner:
      - The Web log is filtered to generate a relational database.
      - A data cube is generated from the database.
      - OLAP is used to drill down and roll up in the cube.
      - OLAM is used for mining interesting knowledge.
Mining the Web’s link structures to identify authoritative web pages:
   Identify Authoritative Web Pages:
      - Hub: a web page that links to a collection of prominent sites on a common topic.
      - Authority: a web page pointed to by a collection of good hubs; it provides a high-quality source of information on the topic.
      - Mutually Reinforcing Relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities.


Finding Authoritative Web Pages:
   Retrieve pages that are not only relevant, but also of high quality, or authoritative on the topic.
   Hyperlinks can infer the notion of authority:
      - The Web consists not only of pages, but also of hyperlinks pointing from one page to another.
      - These hyperlinks contain an enormous amount of latent human annotation.
      - A hyperlink pointing to another web page can be considered the author’s endorsement of that page.
   Problems with the web linkage structure:
      - Not every hyperlink represents an endorsement.
      - One authority will seldom have its web page point to its rival authorities in the same field.
      - Authoritative pages are seldom particularly descriptive.

HITS (Hyperlink-Induced Topic Search):
   Explores the interactions between hubs and authoritative pages.
   Uses an index-based search engine to form the root set.
   Expands the root set into a base set:
      - Include all of the pages that the root-set pages link to, and all of the pages that link to a page in the root set, up to a designated size cut-off.
   Applies weight propagation:
      - An iterative process that determines numerical estimates of hub and authority weights; a minimal sketch follows.
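A minimal Python sketch of the hub/authority weight propagation on a tiny, illustrative link graph:

links = {            # page -> pages it links to (an assumed, illustrative base set)
    "p1": ["p3", "p4"],
    "p2": ["p3", "p4"],
    "p3": ["p4"],
    "p4": [],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):                                   # iterate until (approximate) convergence
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print(max(auth, key=auth.get))   # p4 – the page most pointed to by good hubs
print(max(hub, key=hub.get))     # p1 (or p2) – a page pointing to good authorities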

Systems based on the HITS algorithm:
   - Clever, Google: achieve better-quality search results than those generated by term-index engines such as AltaVista.
   - Difficulties from ignoring textual contexts: drifting and topic hijacking.
Automatic Classification of Web Documents:
   - Assign a class label to each document from a set of predefined topic categories.
   - Based on a set of examples of pre-classified documents.
   - Keyword-based document classification methods; statistical models.
Multilayered Web Information Base:
   - Layer 0: the Web itself
   - Layer 1: the Web page descriptor layer
   - Layer 2 and up: various web directory services constructed on top of layer 1
Applications of Web Mining:
   - Target potential customers for e-commerce
   - Improve web server system performance
   - Identify potential prime advertisement locations
   - Facilitate adaptive / personalized sites
   - Improve site design
   - Fraud / intrusion detection
   - Predict users’ actions

5.7 Mining Text Database

Mining Text Databases:


Text Databases and Information Retrieval:
Text Databases (Document Databases):
   - Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, web pages, library databases, etc.
   - The data stored is usually semi-structured.
   - Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data.
Information Retrieval:
   - A field developed in parallel with database systems.
   - Information is organized into a large number of documents.
   - The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents.
Typical IR Systems:
   - Online library catalogs
   - Online document management systems
Information Retrieval vs. Database Systems:
   - Some DB problems are not present in IR, e.g., update, transaction management, complex objects.
   - Some IR problems are not addressed well in DBMSs, e.g., unstructured documents, approximate search using keywords, and relevance.
Precision: the percentage of retrieved documents that are in fact relevant to the query.
   Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the query and were in fact retrieved.
   Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
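A minimal worked example of these two measures in Python; the relevant and retrieved document sets are illustrative:

relevant = {"d1", "d2", "d3", "d4"}        # documents truly relevant to the query
retrieved = {"d2", "d3", "d5"}             # documents the IR system returned

hits = relevant & retrieved
precision = len(hits) / len(retrieved)     # 2/3 ≈ 0.67
recall = len(hits) / len(relevant)         # 2/4 = 0.50
print(precision, recall)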

Keyword-Based Retrieval:
   A document is represented by a string, which can be identified by a set of keywords.
   Queries may use expressions of keywords:
      - Eg. car and repair shop; tea, coffee; DBMS but not Oracle.
      - Queries and retrieval should consider synonyms, e.g. repair and maintenance.
   Major difficulties of the model:
      - Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g. data mining.
      - Polysemy: the same keyword may mean different things in different contexts, e.g. mining.

Similarity-Based Retrieval in Text Databases:
   Finds similar documents based on a set of common keywords.
   The answer should be based on the degree of relevance, determined by the nearness of the keywords, the relative frequency of the keywords, etc.
   Basic Techniques:
   Stop List:
      - A set of words that are deemed “irrelevant” even though they may appear frequently.


      - Eg. a, the, of, for, with, etc. Stop lists may vary when the document set varies.

Word Stem: Several words are small syntactic variants of each other since they share a common word stem.

A Term Frequency Table:
   - Each entry freq(i, j) of the frequency table records the number of occurrences of the word ti in document dj.
   - Usually, the ratio instead of the absolute number of occurrences is used.
Similarity Metrics:
   - Measure the closeness of a document to a query (a set of keywords).
   - Relative term occurrences; cosine distance (a minimal cosine example follows).
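A minimal Python sketch of cosine similarity between a query and a document, both represented as term-frequency vectors over an illustrative vocabulary:

from math import sqrt

vocab = ["data", "mining", "warehouse", "index"]
query = [1, 1, 0, 0]          # "data mining"
doc   = [3, 2, 1, 0]          # term counts in the document

dot = sum(q * d for q, d in zip(query, doc))
cosine = dot / (sqrt(sum(q * q for q in query)) * sqrt(sum(d * d for d in doc)))
print(round(cosine, 3))       # closer to 1.0 means the document is nearer the query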

Latent Semantic Indexing:
   Basic Idea:
      - Similar documents have similar word frequencies.
      - Difficulty: the size of the term frequency matrix is very large.
      - Use singular value decomposition (SVD) techniques to reduce the size of the frequency table.
      - Retain the K most significant rows of the frequency table.
   Method:
      - Create a term frequency matrix, freq-matrix.
      - SVD construction: compute the singular value decomposition of freq-matrix by splitting it into three matrices, U, S and V.
      - Vector identification: for each document d, replace its original document vector by a new one excluding the eliminated terms.
      - Index creation: store the set of all vectors, indexed by one of a number of techniques (such as a TV-tree).
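A minimal Python (NumPy) sketch of the LSI method above on an illustrative term frequency matrix:

import numpy as np

freq = np.array([[3., 0., 1.],      # rows = terms, columns = documents
                 [2., 0., 0.],
                 [0., 4., 1.],
                 [0., 3., 2.]])

U, S, Vt = np.linalg.svd(freq, full_matrices=False)
k = 2                                # retain the k most significant singular values
doc_vectors = (np.diag(S[:k]) @ Vt[:k]).T   # each row: a document in the reduced space
print(doc_vectors.shape)             # (3, 2) – three documents, two latent dimensions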

Other Text Retrieval Indexing Techniques:
Inverted Index:

   - Maintains two hash- or B+-tree-indexed tables:
      - Document table: a set of document records <doc_id, postings_list>
      - Term table: a set of term records <term, postings_list>
   - Answering a query: find all documents associated with one or a set of terms.
   - Advantage: easy to implement (see the sketch after these indexing notes).
   - Disadvantage: does not handle synonymy and polysemy well, and the posting lists could be too long (storage could be very large).
Signature File:

   - Associates a signature with each document.
   - A signature is a representation of an ordered list of terms that describe the document.
   - The order is obtained by frequency analysis, stemming and stop lists.
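A minimal Python sketch of the inverted index described above (term table only); the documents are illustrative:

docs = {
    "d1": "data mining extracts patterns from data",
    "d2": "a data warehouse stores integrated data",
    "d3": "text mining works on document collections",
}

inverted = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        inverted.setdefault(term, []).append(doc_id)

print(sorted(inverted["data"]))                                   # ['d1', 'd2']
print(sorted(set(inverted["data"]) & set(inverted["mining"])))    # ['d1']
# Real systems keep both a document table and a term table as B+-tree / hash
# indexed structures; this sketch shows only the term -> posting-list mapping.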

Types of Text Data Mining:
   - Keyword-based association analysis
   - Automatic document classification
   - Similarity detection:


      - Cluster documents by a common author
      - Cluster documents containing information from a common source
   - Link analysis: unusual correlation between entities
   - Sequence analysis: predicting a recurring event
   - Anomaly detection: find information that violates usual patterns
   - Hypertext analysis: patterns in anchors / links; anchor text correlations with linked objects
Keyword-Based Association Analysis:

   - Collect sets of keywords or terms that occur frequently together, and then find the association or correlation relationships among them.
   - First preprocess the text data by parsing, stemming, removing stop words, etc.; then invoke association mining algorithms.
   - Consider each document as a transaction, and view the set of keywords in the document as the set of items in the transaction.
   Term-level Association Mining:
      - No need for human effort in tagging documents.

      - The number of meaningless results and the execution time are greatly reduced.
Automatic Document Classification:
   Motivation: automatic classification of the tremendous number of on-line text documents.
   A classification problem:
      - Training set: human experts generate a training data set.
      - Classification: the computer system discovers the classification rules.
      - Application: the discovered rules can be applied to classify new / unknown documents.
   Text document classification differs from the classification of relational data – document databases are not structured according to attribute-value pairs.
Association-Based Document Classification:
   - Extract keywords and terms by information retrieval and simple association analysis techniques.
   - Obtain concept hierarchies of keywords and terms using available term classes (such as WordNet), expert knowledge, or keyword classification systems.
   - Classify the documents in the training set into class hierarchies.
   - Apply a term association mining method to discover sets of associated terms.
   - Use the terms to maximally distinguish one class of documents from the others.
   - Derive a set of association rules associated with each document class.
   - Order the classification rules based on their occurrence frequency and discriminative power.
   - Use the rules to classify new documents.

Document Clustering:
   - Automatically group related documents based on their contents.
   - Requires no training sets or predetermined taxonomies; generates a taxonomy at runtime.
   Major Steps:
      - Preprocessing: remove stop words, stem, feature extraction, lexical analysis, …
      - Hierarchical clustering:


  o Hierarchical clustering: compute similarities and apply clustering algorithms, ...
  o Slicing: fan-out controls; flatten the tree to a configurable number of levels, ...
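As a minimal sketch of these steps (assuming scikit-learn and SciPy are available; the sample documents and the number of flat clusters are assumptions), one can preprocess documents into TF-IDF vectors, build a hierarchical (agglomerative) clustering over their cosine distances, and then "slice" the resulting tree into a fixed number of clusters.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    docs = [
        "stock markets fell on interest rate fears",
        "the central bank raised interest rates",
        "the home team won the championship match",
        "the player scored twice in the final match",
    ]

    # Preprocessing: stop-word removal and TF-IDF feature extraction
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

    # Hierarchical clustering over cosine distances between documents
    tree = linkage(pdist(vectors, metric="cosine"), method="average")

    # Slicing: flatten the dendrogram into a configurable number of clusters
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)   # e.g. [1 1 2 2] -- finance documents vs. sports documents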

5.8 Mining Spatial Databases

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases. A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data. Statistical spatial data analysis has been a popular approach to analyzing spatial data and exploring geographic information. The term 'geostatistics' is often associated with continuous geographic space, whereas the term 'spatial statistics' is often associated with discrete space.

Spatial Data Mining Applications:
- Geographic information systems
- Geo-marketing
- Remote sensing
- Image database exploration
- Medical imaging
- Navigation
- Traffic control
- Environmental studies

Spatial Data Cube Construction and Spatial OLAP:

A spatial data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of both spatial and non-spatial data in support of spatial data mining and spatial-data-related decision-making processes.

There are three types of dimensions in a spatial data cube:
- A non-spatial dimension contains only non-spatial data; its generalizations are also non-spatial.
- A spatial-to-non-spatial dimension is a dimension whose primitive-level data are spatial but whose generalizations, starting at a certain high level, become non-spatial.
- A spatial-to-spatial dimension is a dimension whose primitive-level data and all of its higher-level generalized data are spatial.

Measures of a Spatial Data Cube:
- A numerical measure contains only numeric data.
- A spatial measure contains a collection of pointers to spatial objects.

Computation of Spatial Measures in Spatial Data Cube Construction:

- Collect and store the corresponding spatial object pointers, but do not perform precomputation of spatial measures in the spatial data cube.
- Precompute and store a rough approximation of the spatial measures in the spatial data cube.
- Selectively precompute some spatial measures in the spatial data cube.

Mining Spatial Association and Co-Location Patterns:
- Spatial association rules can be mined in spatial databases.

- A spatial association rule is of the form A ⇒ B [s%, c%], where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule, and c% is the confidence of the rule.
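For intuition (the objects and predicates below are assumptions, not from the source), the support and confidence of a rule such as is_a(X, school) ⇒ close_to(X, park) [s%, c%] can be computed by counting how many objects satisfy the antecedent and the consequent:

    # Hypothetical spatial objects, each described by a set of predicates
    objects = [
        {"is_a(X, school)", "close_to(X, park)"},
        {"is_a(X, school)", "close_to(X, highway)"},
        {"is_a(X, house)",  "close_to(X, park)"},
        {"is_a(X, school)", "close_to(X, park)", "close_to(X, river)"},
    ]

    antecedent = {"is_a(X, school)"}
    consequent = {"close_to(X, park)"}

    n = len(objects)
    both = sum(1 for o in objects if antecedent <= o and consequent <= o)
    ante = sum(1 for o in objects if antecedent <= o)

    support = both / n          # fraction of objects satisfying both A and B
    confidence = both / ante    # of those satisfying A, how many also satisfy B

    print(f"support = {support:.0%}, confidence = {confidence:.2%}")
    # -> support = 50%, confidence = 66.67%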


Because spatial association mining needs to evaluate multiple spatial relationships among a large number of spatial objects, the process could be quite costly. An interesting mining optimization called 'progressive refinement' can be adopted in spatial association analysis: the method first mines large data sets roughly using a fast algorithm and then improves the quality of mining in a pruned data set using a more expensive algorithm.

Superset Coverage Property: the rough (filtering) step should allow a false-positive test, which might include some data sets that do not belong to the answer sets, but it should not allow a false-negative test, which might exclude some potential answers.

For mining spatial associations related to the spatial predicate close_to, collect the candidates that pass the minimum support threshold by:
- Applying certain rough spatial evaluation algorithms.
- Evaluating the relaxed spatial predicate g_close_to (generalized close to), which covers a broader context that includes close_to, touch and intersect.
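As a rough sketch of the progressive-refinement idea (the coordinates, the distance threshold and the bounding-box helper are assumptions), a cheap bounding-rectangle test can first filter candidate pairs for the relaxed predicate, and only the survivors are checked with a more expensive distance computation:

    import math

    # Hypothetical spatial objects as (name, minimum bounding rectangle, centroid)
    objects = [
        ("school_1", (0, 0, 2, 2),     (1.0, 1.0)),
        ("park_1",   (3, 0, 5, 2),     (4.0, 1.0)),
        ("park_2",   (40, 40, 42, 42), (41.0, 41.0)),
    ]
    THRESHOLD = 5.0   # assumed 'close to' distance

    def mbr_gap(a, b):
        """Cheap lower bound on the distance between two bounding rectangles."""
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        dx = max(bx1 - ax2, ax1 - bx2, 0)
        dy = max(by1 - ay2, ay1 - by2, 0)
        return math.hypot(dx, dy)

    # Rough filter (superset coverage: may keep false positives, drops no answers)
    candidates = [
        (a, b) for i, a in enumerate(objects) for b in objects[i + 1:]
        if mbr_gap(a[1], b[1]) <= THRESHOLD
    ]

    # Refinement: a finer (here, centroid-based) distance check on the pruned set
    close_pairs = [
        (a[0], b[0]) for a, b in candidates
        if math.dist(a[2], b[2]) <= THRESHOLD
    ]
    print(close_pairs)   # -> [('school_1', 'park_1')]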

Spatial Clustering Methods:
- Spatial data clustering identifies clusters, or densely populated regions, according to some distance measurement in a large, multidimensional data set.

Spatial Classification and Spatial Trend Analysis:
- Spatial classification analyzes spatial objects to derive classification schemes with respect to certain spatial properties. Example: classify regions in a province into rich vs. poor according to the average family income.
- Trend analysis detects changes with time, such as changes of temporal patterns in time-series data. Spatial trend analysis replaces time with space and studies the trend of non-spatial or spatial data changing with space. Example: observe the trend of changes in climate or vegetation with increasing distance from an ocean.
- Regression and correlation analysis methods are often applied, making use of spatial data structures and spatial access methods.
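As an illustrative sketch of the regression idea (the sample regions, rainfall figures and coastal reference are assumptions), one can fit a simple linear trend of a non-spatial attribute against each region's distance from the ocean:

    import numpy as np

    # Hypothetical regions: distance from the ocean (km) and annual rainfall (mm)
    distance_km = np.array([5, 20, 50, 100, 200, 400])
    rainfall_mm = np.array([1800, 1600, 1300, 1000, 700, 400])

    # Fit a first-degree polynomial (linear trend) of rainfall vs. distance
    slope, intercept = np.polyfit(distance_km, rainfall_mm, deg=1)
    print(f"trend: {slope:.2f} mm per km from the ocean")   # negative: rainfall decreases inland

    # Strength of the (linear) spatial trend
    r = np.corrcoef(distance_km, rainfall_mm)[0, 1]
    print(f"correlation: {r:.2f}")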

Mining Raster Databases:
- Spatial database systems usually handle vector data that consist of points, lines, polygons (regions) and their compositions, such as networks or partitions.
- Huge amounts of space-related data are in digital raster forms, such as satellite images, remote sensing data and computed tomography images.

Review Questions

Two Marks:

1. List out some of the application areas of data mining systems.
2. Is data mining a hype or a persistent, steadily growing trend?
3. Write short notes on text mining.
4. What are the applications of spatial databases?
5. Define spatial data mining.


6. List out any five commercial data mining tools.
7. What are the different data security techniques used in data mining?
8. What is information retrieval?
9. What is keyword-based association analysis?
10. What is the HITS algorithm?
11. List out some of the challenges of the WWW.
12. What is web usage mining?
13. What are the three types of dimensions in a spatial data cube?

Sixteen Marks:

1. Discuss in detail the application of data mining for financial data analysis.
2. Discuss the application of data mining in business.
3. Discuss in detail the applications of data mining for biomedical and DNA data analysis and the telecommunication industry.
4. Discuss the social impacts of data mining systems.
5. Discuss the various data mining tools.
6. Explain the mining of spatial databases.
7. Discuss the mining of text databases.
8. What is web mining? Discuss the various web mining techniques.

Assignment Topic:

1. Explain in detail the data mining tool DB-Miner.