Chapter-1
Introduction to Data Mining
In recent years there has been explosive growth in the generation and storage of electronic information as more and more operations of enterprises are computerized. The cost of processing power and storage has been declining for many years. Major enterprises have been using database management systems for at least 30 years and have accumulated large amounts of information. Some companies, for example major banks and telecommunication companies, now have mountains of data and have started to realize that the information accumulated over the years is an important strategic asset. There is potential business intelligence hidden in this large volume of data. What these companies require are techniques that allow them to distil the most valuable information from the accumulated data. The field of data mining provides such techniques.
Data mining, or knowledge discovery in databases (KDD), deals with exploration techniques based on advanced analytical methods and tools for handling large amounts of information. It is a collection of techniques that finds novel patterns that can assist an enterprise in understanding its business better. Many data mining techniques are closely related to machine learning techniques.
The data mining process involves much hard work, including perhaps building a data warehouse if the enterprise does not have one. A typical data mining process is likely to include the following steps:
1. Requirements analysis: The enterprise decision makers need to formulate the goals that the data mining process is expected to achieve. The business problem must be clearly defined. One cannot use data mining without a good idea of what kind of outcomes the enterprise is looking for, since the technique to be used and the data that is required are likely to be different for different goals. Furthermore, if the objectives have been clearly defined, it is easier to evaluate the results of the project. Once the goals have been agreed upon, the following further steps are needed.
2. Data selection and collection: This step may include finding the best source databases for the data that is required. If the enterprise has implemented a data warehouse, then most of the data could be available there. If the data is not available in the warehouse or the enterprise does not have a warehouse, the source OLTP systems need to be identified and the required information extracted and stored in some temporary system. In some cases, only a sample of the data available may be required.
3. Cleaning and preparing data: This may not be an onerous task if a data warehouse containing the required data exists, since most of this must already have been done when data was loaded into the warehouse. Otherwise this task can be very resource intensive, and sometimes more than 50% of the effort in a data mining project is spent on this step. Essentially, a data store that integrates data from a number of databases may need to be created. When integrating data, one often encounters problems like identifying data, dealing with missing data, data conflicts, and ambiguity. An ETL (extraction, transformation and loading) tool may be used to overcome these problems.
4. Data mining exploration and validation: Once appropriate data has been collected and cleaned, it is possible to start data mining exploration. Assuming that the user has access to one or more data mining tools, a data mining model may be constructed based on the enterprise's needs. It may be possible to take a sample of data and apply a number of relevant techniques. For each technique the results should be evaluated and their significance interpreted. This is likely to be an iterative process which should lead to selection of one or more techniques that are suitable for further exploration, testing, and validation.
5. Implementing, evaluating, and monitoring: Once a model has been selected and validated, the model can be implemented for use by the decision makers. This may involve software development for generating reports, or for results visualization and explanation for managers. It may be that more than one technique is available for the given data mining task. It is then important to evaluate the results and choose the best technique. Evaluation may involve checking the accuracy and effectiveness of the technique. Furthermore, there is a need for regular monitoring of the performance of the techniques that have been implemented. It is essential that use of the tools by the managers be monitored and the results evaluated regularly. Every enterprise evolves with time and so must the data mining system. Therefore, monitoring is likely to lead from time to time to refinement of the tools and techniques that have been implemented.
6. Results visualization: Explaining the results of data mining to the decision makers is an important step of the data mining process. Most commercial data mining tools include data visualization modules. These tools are often vital in communicating the data mining results to the managers, although a problem arises when results dealing with a number of dimensions must be visualized on a two-dimensional computer screen or printout. Clever data visualization tools are being developed to display results that deal with more than two dimensions. The visualization tools available should be tried and used if found effective for the given problem.
1.1 Data Mining
DBMSs gave access to the stored data but provided no analysis of it. Analysis was required to unearth the hidden relationships within the data, i.e. for decision support. The increase in the size of databases, e.g. VLDBs, called for automated techniques for analysis, as databases had grown beyond manual extraction. The typical scientific user knew nothing of commercial business applications, and business database programmers knew nothing of massively parallel principles. The solution was for database software producers to create easy-to-use tools and form strategic relationships with hardware manufacturers. This resulted in the development of a separate specialized stream of Computer Science called Data Mining.
Data mining can be defined as, "Non-trivial extraction of implicit, previously unknown, and potentially useful information from data".
Data mining covers a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and to extract these in such a way that they can be put to use in areas such as decision support, prediction, forecasting and estimation. The data is often voluminous but, as it stands, of low value, since no direct use can be made of it; it is the hidden information in the data that is really useful. Data mining encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. Data mining is the analysis of data and the use of software techniques for finding patterns and regularities in sets of data. The computer is responsible for finding the patterns by identifying the underlying rules and features in the data. It is possible to 'strike gold' in unexpected places, as the data mining software extracts patterns not previously discernible, or so obvious that no one has noticed them before. In data mining, large volumes of data are sifted in an attempt to find something worthwhile.
Data mining plays a leading role in every facet of business. It is one of the ways by which a company can gain competitive advantage. Through the application of data mining, one can turn large volumes of data collected from various front-end systems, such as Transaction Processing Systems, ERP, and operational CRM, into meaningful knowledge.
At this stage it is pertinent to understand the difference between (i) data, (ii) information and (iii) knowledge.
For example, a number like 3000 is data. Adding context to data converts it into information: adding the context 'Rs' to the number 3000 makes it Rs. 3000, which conveys much more meaning than the bare number. A rule contains still more context (in a relative sense) than information. 'If the salary of a person is Rs 3000 p.m., he is poorly paid' is such a statement, and a generalized rule, for example 'if the salary of a person is between Rs 2000 and Rs 5000 p.m., he is poorly paid', represents knowledge.
So in essence, data mining technology converts the data contained in a database into rules, and generalized rules are knowledge. Put another way, it summarizes the information contained in thousands of records into a few rules, say 20 or 30, which are more actionable and meaningful.
Issues in data mining are noisy data, missing values, static data, sparse data, dynamic data, relevance, interestingness, heterogeneity, algorithm efficiency, and the size and complexity of data.
Businesses are looking for new ways to let end users find the data they need to
make decisions, serve customers and gain the competitive edge.
Some of the applications where Data Mining is used:
• Medicine - drug side effects, hospital cost analysis, genetic sequence analysis,
prediction etc.
• Finance- stock market prediction, credit assessment, fraud detection etc.
• Marketing/sales - product analysis, buying patterns, sales prediction, target
mailing, identifying 'unusual behavior' etc.
• Knowledge Acquisition
• Scientific discovery - superconductivity research, etc.
• Engineering - automotive diagnostic expert systems, fault detection etc.
1.2 Data Mining Techniques
Major techniques for converting Data into information/knowledge are:
(1) Classification/Prediction
(2) Affinity Grouping or Association Rules
(3) Clustering
(4) Visualization techniques
Classification / Prediction:
Classification/prediction involves the use of training examples where the value of the variable to be classified is already known. The model is presented with data containing the class (which is known) as well as other attributes describing the class. The underlying algorithm learns the mapping between the attribute values and the class. The attributes and the class are sometimes also known as the independent and dependent variables.
Once the model has learnt the mapping, any new data containing the attributes can be presented to the model, and the model classifies the sample into one of the learnt classes.
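The idea above can be sketched with a minimal 1-nearest-neighbour classifier: the "model" memorises training examples whose class is already known and assigns a new sample the class of the most similar example. The attribute names and figures below are invented for illustration only.

```python
# Minimal sketch of classification/prediction: training examples carry
# a known class label; an unseen sample is assigned the label of the
# nearest training example. (Data and attribute names are invented.)

def squared_distance(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(training, sample):
    """Predict the class of an unseen sample from labelled training data.

    training: list of (attributes, class_label) pairs.
    sample:   attribute vector whose class is unknown.
    """
    attrs, label = min(training, key=lambda row: squared_distance(row[0], sample))
    return label

# Training examples: (income_in_thousands, age) -> credit risk class
training = [
    ((20, 25), "high-risk"),
    ((90, 50), "low-risk"),
    ((25, 30), "high-risk"),
    ((80, 45), "low-risk"),
]

print(classify(training, (85, 48)))  # a new applicant resembling the low-risk rows
```

Real tools use richer models (decision trees, neural networks), but the input/output contract is the same: labelled attributes in, a predicted class out.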
Affinity Grouping / Association Rules:
The task is to determine which items go together. A retail store can use affinity grouping for store layout (identifying items that are purchased together), profitability analysis, promotion planning, etc. It can also be used to identify cross-selling opportunities.
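Determining which items go together can be sketched as counting how often item pairs co-occur in the same transaction. The baskets below are invented; real association rule miners (e.g. Apriori-style algorithms) add support/confidence thresholds on top of exactly this kind of count.

```python
# Minimal sketch of affinity grouping: count how often pairs of items
# appear together in the same transaction. (Basket data is invented.)
from collections import Counter
from itertools import combinations

def item_pairs(transactions):
    """Count co-occurring item pairs across all transactions."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pairs = item_pairs(baskets)
# ("bread", "butter") co-occurs in 3 of 4 baskets - a candidate for
# store-layout or cross-selling decisions.
print(pairs[("bread", "butter")])  # → 3
```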
Clustering:
Clustering is the task of segmenting a diverse group into a number of smaller groups. What distinguishes clustering from classification is that clustering does not rely on predefined classes.
Cluster Analysis
o Clustering and segmentation is basically partitioning the database so
that each partition or group is similar according to some criteria or
metric
o Clustering according to similarity is a concept which appears in many
disciplines e.g. in chemistry the clustering of molecules
o Data mining applications make use of clustering according to similarity
e.g. to segment a client/customer base
o It provides sub-groups of a population for further analysis or action -
very important when dealing with very large databases
o Can be used for profile generation for target marketing, i.e. where previous responses to mailing campaigns can be used to generate a profile of people who responded, and this can be used to predict response and filter mailing lists to achieve the best response
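Partitioning a database so that each group is similar under some metric can be sketched with a tiny k-means loop: assign each point to its nearest centre, then move each centre to the mean of its group. The one-dimensional data and starting centres are invented.

```python
# Minimal k-means sketch: partition points into k groups so that each
# group is similar under a distance metric. (Data and k are invented.)

def kmeans(points, centres, rounds=10):
    """Assign each point to its nearest centre, then move each centre
    to the mean of its assigned points; repeat for a fixed number of rounds."""
    for _ in range(rounds):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# Two obvious segments, e.g. small vs large purchase amounts.
points = [1, 2, 3, 100, 101, 102]
centres, clusters = kmeans(points, centres=[0.0, 50.0])
print(sorted(clusters[0]), sorted(clusters[1]))  # → [1, 2, 3] [100, 101, 102]
```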
Visualization:
There is an age-old saying that one picture is worth a hundred pages of written information. Visualization of data in a multi-dimensional space brings a lot of insight into the data. It also helps to identify outliers and unusual values before one starts the analysis.
1.3 Basic Steps In Data Mining
The basic steps in Data mining are:
1. Identify and obtain data
2. Validate, Explore, Clean data
3. Transform data to right level of Granularity
4. Add derived variables
5. Choose Modeling Techniques
6. Train the Model
7. Check the Model performance
8. Choose the best model
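Steps 5-8 above can be sketched as a simple model-selection loop: train each candidate technique, check its performance on held-out data, and keep the best. The two "models" below (a majority-class rule and a mean-threshold rule) and the data are invented stand-ins for real techniques.

```python
# Sketch of steps 5-8: train candidate models, score them on a test
# split, and choose the best. (Models and data are invented.)

def train_and_evaluate(models, train, test):
    """Fit each model on the training split, score it on the test split,
    and return the best model's name plus all the scores."""
    scores = {}
    for name, fit in models.items():
        predict = fit(train)                      # step 6: train the model
        correct = sum(predict(x) == y for x, y in test)
        scores[name] = correct / len(test)        # step 7: check performance
    return max(scores, key=scores.get), scores    # step 8: choose the best

def majority(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    common = max(set(labels), key=labels.count)
    return lambda x: common

def threshold(train):
    """Predict True when the input exceeds the training mean (an assumed heuristic)."""
    cut = sum(x for x, _ in train) / len(train)
    return lambda x: x > cut

train = [(1, False), (2, False), (3, False), (10, True), (11, True)]
test = [(0, False), (12, True)]
best, scores = train_and_evaluate({"majority": majority, "threshold": threshold},
                                  train, test)
print(best, scores)
```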
Broadly, the types of data used in data mining applications are given below:
- Demographic data such as age, income, profession etc.
- Transaction data: specific to the application
- Data shared within an industry: credit reports, catalogues etc.
- Data shared by business partners
Data collected for data mining can be grouped under four major categories:
(1) Categorical: a set of values which a field can take; there is no ordering or
hierarchy in the data. Examples: profession, qualification, city, PIN code etc.
(2) Rank: similar to categorical data but with a natural ordering. Examples
are income ranges like low, middle, high, or age: young, middle-aged, old, very
old etc.
(3) Numeric: amount purchased, quantity purchased, volume of transactions,
number of items in inventory etc.
(4) Date: information about the date. This contains a wealth of information
like
(i) Day of week
(ii) Day of month
(iii) Month in year
(iv) Quarter of the year
(v) Day of the year
When analyzing the data, date fields can be used in many different ways.
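Extracting the fields (i)-(v) from a single date value is straightforward with the standard library; a small sketch:

```python
# Sketch of deriving the analysis fields (i)-(v) listed above from one
# date field, using only the Python standard library.
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def date_features(d):
    """Derive day-of-week, day-of-month, month, quarter and day-of-year."""
    return {
        "day_of_week": WEEKDAYS[d.weekday()],      # (i)
        "day_of_month": d.day,                     # (ii)
        "month_in_year": d.month,                  # (iii)
        "quarter_of_year": (d.month - 1) // 3 + 1, # (iv)
        "day_of_year": d.timetuple().tm_yday,      # (v)
    }

print(date_features(date(1993, 7, 15)))
```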
Derived Variables:
Derived variables are calculated columns not present in the original data. One set of derived variables contains values produced by data aggregation, like weekly sales from daily sales; depending on the application, the data has to be aggregated. Another important class of derived variables are those that perform calculations on original data values. Examples are:
- Debt / earning
- Income / total sales
- Credit limit - balance, etc.
The purpose of derived variables is to find useful information that is not apparent in the original data.
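Both kinds of derived variable described above can be sketched in a few lines; the sales figures and ratio inputs below are invented.

```python
# Sketch of the two kinds of derived variables: aggregation (weekly
# sales from daily sales) and a calculated ratio (debt/earning).
# All figures are invented.

def weekly_sales(daily):
    """Aggregate a list of daily sales figures into weekly totals."""
    return [sum(daily[i:i + 7]) for i in range(0, len(daily), 7)]

def debt_to_earning(debt, earning):
    """Ratio variable calculated from two original fields."""
    return debt / earning

daily = [10, 12, 9, 11, 14, 20, 25,    # week 1
         8, 13, 10, 12, 15, 22, 24]    # week 2
print(weekly_sales(daily))             # → [101, 104]
print(debt_to_earning(3000, 12000))    # → 0.25
```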
Role of Fields in Data Mining:
Input field(s)/columns: used as inputs to the model; sometimes referred to as independent variables.
Target field/columns: the field we are trying to understand, normally linked to some form of behaviour. For example:
- Customer buys / does not buy a product
- Customer is profitable / not profitable etc.
Ignored fields/columns: columns that are not used.
Normal Data Format for Data Mining:
- All data should be in a single table or database view
- Each row should correspond to an instance that is relevant to the business
- Columns with a single value, or a unique value for every row, should be ignored
Data Comes From:
Operational systems
DBMS (database management systems)
ERP (enterprise resource planning) systems
Web servers and e-commerce databases
Billing system(s)
Telecom switches
Point of sale / ATMs
Data warehouses
1.4 Data Mining Process
Stages
• Data pre-processing
o heterogeneity resolution
o data cleansing
o data warehousing
• Data mining tools applied
o extraction of patterns from the pre-processed data
• Interpretation and evaluation
o user bias, i.e. the user can direct DM tools to areas of interest
• attributes of interest in databases
• goal of discovery
• domain knowledge
• prior knowledge or belief about the domain
[Figure: the iterative process - search for patterns (queries, rules, neural nets, ML, statistics etc.), analyst reviews output, revise and refine queries, interpret results and take action.]
1.5 Data Mining and Machine Learning
Differences
• Data Mining (DM) or Knowledge Discovery in Databases (KDD) is about
finding understandable knowledge
• Machine Learning (ML) is concerned with improving performance of an agent
o training a neural network to balance a pole is part of ML, but not of
KDD
• Efficiency of the algorithm and scalability is more important in DM or KDD
o DM is concerned with very large, real-world databases
o ML typically looks at smaller data sets
• ML has laboratory type examples for the training set
• DM deals with 'real world' data
o Real world data tend to have problems such as:
• missing values
• dynamic data
• pre-existing data
• noise
1.6 Issues in Data Mining
• Noisy data
• Missing values
• Static data
• Sparse data
• Dynamic data
• Relevance
• Interestingness
• Heterogeneity
• Algorithm efficiency
• Size and complexity of data
1.7 Knowledge Representation Methods
Neural Networks
o a trained neural network can be thought of as an "expert" in the category of information it has been given to analyze
o provides projections given new situations of interest and answers "what if" questions
o problems include:
• the resulting network is viewed as a black box
• no explanation of the results is given, i.e. it is difficult for the user to interpret the results
• difficult to incorporate user intervention
• slow to train due to their iterative nature
[Figure: a network mapping input factors to an output labelled "cancer risk".] A neural net can be trained to identify the risk of cancer from a number of factors.
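The iterative training mentioned above can be sketched with the smallest possible "network": a single perceptron unit whose weights are adjusted example by example. The two risk-factor scores and labels below are invented, not real medical data.

```python
# Minimal sketch of iterative network training: a single perceptron
# learns weights from labelled examples by repeatedly correcting its
# errors. (Risk-factor scores and labels are invented.)

def train_perceptron(examples, epochs=20, rate=0.1):
    """Learn weights w and bias b so that the unit's output matches the labels."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in examples:               # target is 0 or 1
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out                   # correct toward the label
            w = [wi + rate * err * xi for wi, xi in zip(w, x)]
            b += rate * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# (factor_a, factor_b) -> high risk? (invented, linearly separable)
data = [((0, 2), 0), ((1, 3), 0), ((8, 6), 1), ((9, 5), 1)]
w, b = train_perceptron(data)
print(predict(w, b, (9, 6)))  # → 1
```

The learned weights also illustrate the "black box" problem noted above: they fit the data but offer no human-readable explanation.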
• Decision trees
o used to represent knowledge
o built using a training set of data and can then be used to classify new objects
o problems are:
• opaque structure - difficult to understand
• missing data can cause performance problems
• they become cumbersome for large data sets
o Example: [figure of a decision tree rooted at "outlook"]
• Rules
o probably the most common form of representation
o tend to be simple and intuitive
o unstructured and less rigid
o problems are:
• difficult to maintain
• inadequate to represent many types of knowledge
o Example format:
• if X then Y
• Frames
o templates for holding clusters of related knowledge about a very
particular subject
o a natural way to represent knowledge
o has a taxonomy approach
o problem is
• more complex than rule representation
Related Technologies
• Data Warehousing
• On-line Analytical Processing, OLAP
1.8 Data Warehousing
Definition
• A data warehouse can be defined as any centralized data repository which can
be queried for business benefit
• warehousing makes it possible to
o extract archived operational data
o overcome inconsistencies between different legacy data formats
o integrate data throughout an enterprise, regardless of location, format, or
communication requirements
o incorporate additional or expert information
Characteristics of a Data warehouse:
• Subject-oriented - data organized by subject instead of application e.g.
o an insurance company would organize their data by customer, premium,
and claim, instead of by different products (auto, life, etc.)
o contains only the information necessary for decision support processing
• Integrated - encoding of data is often inconsistent e.g.
o gender might be coded as "m" and "f" or 0 and 1, but when data are moved from the operational environment into the data warehouse they assume a consistent coding convention
• Time-variant - the data warehouse contains a place for storing data that are five
to 10 years old or older e.g.
o this data is used for comparisons, trends, and forecasting
o these data are not updated
• Non-volatile - data are not updated or changed in any way once they enter the
data warehouse
o data are only loaded and accessed
Data Warehousing Processes:
• insulate data - i.e. the current operational information
o preserves the security and integrity of mission-critical OLTP applications
o gives access to the broadest possible base of data
• retrieve data - from a variety of heterogeneous operational databases
o data is transformed and delivered to the data warehouse/store based on a
selected model (or mapping definition)
o metadata - information describing the model and definition of the source
data elements
• data cleansing - removal of certain aspects of operational data, such as low-level transaction information, which slow down the query times.
• transfer - processed data transferred to the data warehouse, a large database on
a high performance box
Uses of a Data Warehouse
• A central store against which queries are run
o uses very simple data structures with few assumptions about the relationships between data
• A data mart is a small warehouse which provides subsets of the main store and summarized information
o depending on the requirements of a specific group/department
o marts often use multidimensional databases, which can speed up query processing as they can have data structures which reflect the most likely questions
Data Warehouse Model
[Figure: optimized OLTP databases (DB2, Oracle etc.) feeding the warehouse.]
Structure of Data inside the Data Warehouse
[Figure: levels of summarization surrounded by metadata - current detail (e.g. sales detail 1992-1993), older detail (sales detail 1982-1991), lightly summarized data (e.g. regional sales by week 1983-1993, weekly sales by sub-product 1985-1993), and highly summarized data (e.g. national sales by month 1985-1993, monthly sales by product line 1981-1993).]
Criteria for a Data Warehouse
• Load Performance
o require incremental loading of new data on a periodic basis
o must not artificially constrain the volume of data
• Load Processing
o data conversions, filtering, reformatting, integrity checks, physical
storage, indexing, and metadata update
• Data Quality Management
o ensure local consistency, global consistency, and referential integrity
despite "dirty" sources and massive database size
• Query Performance
o must not be slowed or inhibited by the performance of the data
warehouse RDBMS
• Terabyte Scalability
o Data warehouse sizes are growing at astonishing rates so RDBMS must
not have any architectural limitations. It must support modular and
parallel management.
• Mass User Scalability
o Access to warehouse data must not be limited to an elite few; it has to support hundreds, even thousands, of concurrent users while maintaining acceptable query performance.
• Networked Data Warehouse
o Data warehouses rarely exist in isolation; users must be able to look at and work with multiple warehouses from a single client workstation
• Warehouse Administration
o the large scale and time-cyclic nature of the data warehouse demands administrative ease and flexibility
• The RDBMS must Integrate Dimensional Analysis
o dimensional support must be inherent in the warehouse RDBMS to
provide the highest performance for relational OLAP tools
• Advanced Query Functionality
o End users require advanced analytic calculations, sequential and
comparative analysis, and consistent access to detailed and summarized
data
Problems with Data Warehousing
• The rush of companies to jump on the bandwagon: some companies have slapped 'data warehouse' labels on traditional transaction processing products and co-opted the lexicon of the industry in order to be considered players in this fast-growing category.
1.9 Data Warehousing & OLTP
Similarities and differences between OLTP and data warehouse systems:

                   OLTP                            Data Warehouse
Purpose            Run day-to-day operations       Information retrieval and analysis
Structure          RDBMS                           RDBMS
Data Model         Normalised                      Multi-dimensional
Access             SQL                             SQL plus data analysis extensions
Type of Data       Data that runs the business     Data that analyses the business
Condition of Data  Changing, incomplete            Historical, descriptive
• OLTP systems are designed to maximize transaction capacity but they:
o cannot be repositories of facts and historical data for business analysis
o cannot quickly answer ad hoc queries
o rapid retrieval is almost impossible
o data is inconsistent and changing, duplicate entries exist, entries can be missing
o OLTP offers large amounts of raw data which is not easily understood
• Typical OLTP query is a simple aggregation e.g.
o What is the current account balance for this customer?
Data Warehouse systems
• Data warehouses are interested in query processing as opposed to transaction processing
• Typical business analysis query e.g.
o Which product line sells best in middle-America and how does this
correlate to demographic data?
OLAP (On-line Analytical Processing)
• The problem is how to process larger and larger databases
• OLAP involves many data items (many thousands or even millions) which are involved in complex relationships
• Fast response is crucial in OLAP
• Difference between OLAP and OLTP:
o OLTP servers handle mission-critical production data accessed through simple queries
o OLAP servers handle management-critical data accessed through iterative analytical investigation
Common Analytical Operations
• Consolidation - involves the aggregation of data, i.e. simple roll-ups or complex expressions involving inter-related data
o e.g. sales offices can be rolled up to districts and districts rolled up to regions
• Drill-Down - can go in the reverse direction i.e. automatically display detail
data which comprises consolidated data
• "Slicing and Dicing" - the ability to look at the database from different viewpoints, e.g.
e.g.
o one slice of the sales database might show all sales of product type
within regions;
o another slice might show all sales by sales channel within each product
type
o often performed along a time axis in order to analyze trends and find
patterns
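Consolidation and slicing can be sketched over a tiny fact table held as plain tuples; all region, district, office names and sales figures below are invented.

```python
# Sketch of consolidation ("roll-up") and slicing over a tiny fact
# table: rows are (region, district, office, sales). Names and
# figures are invented.
from collections import defaultdict

facts = [
    ("north", "d1", "o1", 100),
    ("north", "d1", "o2", 150),
    ("north", "d2", "o3", 200),
    ("south", "d3", "o4", 300),
]

def roll_up(rows, level):
    """Consolidate office-level sales up to 'district' or 'region'."""
    key = {"region": 0, "district": 1}[level]
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[3]
    return dict(totals)

def slice_by_region(rows, region):
    """One 'slice' of the cube: only the rows for a given region."""
    return [r for r in rows if r[0] == region]

print(roll_up(facts, "district"))   # offices rolled up to districts
print(roll_up(facts, "region"))     # districts rolled up to regions
```

Drill-down is simply the reverse direction: starting from a regional total and displaying the district and office rows that make it up.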
Knowledge Acquisition using Data Mining
• Expert systems are models of real-world processes
• Much of the information is available straight from the process e.g.
o in production systems, data is collected for monitoring the system
o knowledge can be extracted using data mining tools
o experts can verify the knowledge
• Example
o TIGON project - detection and diagnosis of an industrial gas turbine engine
1.10 Data Mining Projects
UU - Jordanstown
• Data mining in the N Ireland Housing Executive
• Knee disorders classification
• Fault diagnosis in a telecommunication network
• A self-learning urology patient audit system
• Policy lapse/renewal prediction
• House price prediction
UU Example
Policy lapse/renewal prediction
• Problem - predicting whether a motor insurance policy will be lapsed or renewed
• 34 attributes stored for each policy
o 14 attributes were deemed relevant
o 2 attributes were derived from underlying attributes
• Predictive accuracy
o 71%
• In a period of 3 weeks achieved the same accuracy as statistical models developed by the insurance company, which had taken much longer to develop
MKS
The Mining Kernel System
• Based on the interdisciplinary approach of data mining
NetMap
Cluster Analysis
MKS
• Data pre-processing functionality i.e.
o statistical operations for removing outliers
o sampling
o data dimensionality reduction
o dealing with missing data
• Algorithms provided for
o classification
o association
• Facility to state what is interesting to the user and to present only interesting rules
• Facility to incorporate domain knowledge into the knowledge discovery
process
Classification
Classification is a well-known task in data mining that aims to predict the class of an unseen instance as accurately as possible. While single-label classification, which assigns each rule in the classifier the most obvious label, has been widely studied [9, 11, 13, 18], little work has been done on multi-label classification. Most of the work to date on multi-label classification is related to text categorisation [10, 15]. There are many approaches for building single-class classifiers from data, such as divide-and-conquer [14] and separate-and-conquer [8]. Most traditional learning techniques derived from these approaches, such as decision trees [7, 13] and statistical and covering algorithms [11], are unable to treat problems with multiple labels.
The most common multi-label classification approach is one-versus-the-rest (OvR) [17], which constructs a set of binary classifiers obtained by training on each possible class versus all the rest. The OvR approach performs a winner-take-all strategy that assigns a real value to each class to indicate the class membership.
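The OvR idea can be sketched as follows: for each class, build a binary scorer that separates that class from all the rest, then let the largest score win. The binary scorer below is a toy centroid comparison (nearer the class's centre than the rest's), not the classifiers used in the cited work, and the data is invented.

```python
# Sketch of one-versus-the-rest (OvR): one binary scorer per class,
# each trained on that class versus all the rest; prediction is
# winner-take-all over the real-valued scores. (Toy scorer, invented data.)

def train_ovr(examples):
    """examples: list of (value, class_label). For each class, build a
    binary scorer: positive means the sample is nearer this class's
    centre than the centre of everything else."""
    scorers = {}
    for label in {l for _, l in examples}:
        pos = [x for x, l in examples if l == label]
        neg = [x for x, l in examples if l != label]
        cp, cn = sum(pos) / len(pos), sum(neg) / len(neg)
        scorers[label] = lambda x, cp=cp, cn=cn: abs(x - cn) - abs(x - cp)
    return scorers

def predict_ovr(scorers, x):
    """Winner-take-all: the class whose scorer gives the largest value."""
    return max(scorers, key=lambda label: scorers[label](x))

examples = [(1, "a"), (2, "a"), (10, "b"), (11, "b"), (20, "c"), (21, "c")]
scorers = train_ovr(examples)
print(predict_ovr(scorers, 9))  # → b
```

Note that for K classes OvR builds K binary problems, versus the K(K-1)/2 pairwise problems of the one-versus-one approach discussed next.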
Another known approach in multi-label classification is one-versus-one (OvO) [15], which constructs a classifier trained on each possible pair of classes. For K classes, this results in K(K-1)/2 binary classifiers, which may be problematic if K is large. On the other hand, the OvR approach has been criticised: since it trains several separate classification problems so that each class can be separated from the rest, problems arise such as contradictory decisions, i.e. when two or more rules predict the test instance, and no decision, i.e. when none of the resulting rules can predict the test instance [6].
Another important task in data mining is the discovery of all association rules in data. Classification and association rule discovery are similar, except that there is only one target to predict in classification, i.e. the class, while association rules can predict any attribute in the data. In recent years, a new approach that integrates association rules with classification, named associative classification, has been proposed [9, 12]. A few accurate classifiers that use associative classification have been presented in the past few years, such as CBA [12], CMAR [9], and CPAR [18]. In existing associative classification techniques, only one class label is associated with each rule derived, and thus the rules are not suitable for the prediction of multiple labels. However, multi-label classification may often be useful in practice. Consider, for example, a document which has two class labels, "Health" and "Government", and assume that the document is associated 50 times with the "Health" label and 48 times with the "Government" label, and that the number of times the document appears in the training data is 98. A traditional associative technique like CBA generates the rule associated with the "Health" label simply because it has the larger representation, and discards the other rule. However, it is very useful to generate the other rule, since it brings up useful knowledge with a large representation in the training data, and thus could take a role in classification. In this paper, a novel approach for multi-class and multi-label classification, named multi-class, multi-label associative classification (MMAC), is proposed.