
CHAPTER 1

INTRODUCTION

1.1 DATA MINING

A promising area of contemporary research and development in Computer

Science is data mining (Usama et al 1996; Han and Kamber, 2006; Hand et al

2001; Tognazzini, 2003). Data mining in recent times has been attracting more

and more attention from an extensive range of diverse groups of people. Progress

in digital acquisition and storage technology has resulted in the growth of huge

databases. This has occurred in all areas of human endeavour, from the mundane

(such as supermarket transaction data, credit card usage records, telephone call

details and government statistics) to the more exotic (such as images of

astronomical bodies, molecular databases and medical records). The interest has

grown in the possibility of tapping these data and extracting from them

information that might be of value to the owner of the database. Information

retrieval alone is no longer enough for decision-making. Confronted with huge collections of data, we now have new needs to help us make better managerial decisions: automatic summarization of data, extraction of the “essence” of the information stored, and the discovery of patterns in raw data. The discipline concerned with these tasks is known as data mining. Data

mining is often defined as finding hidden information in a database. Alternatively

it has been called exploratory data analysis, data driven discovery and deductive


learning. Data mining techniques have been successfully applied in many

different fields including marketing, manufacturing, process control, fraud

detection and network management (Klaus, 2002).

Data mining techniques are heavily used to search for relationships and

information ‘hidden’ in transaction data. The impact of data mining on information science has been discussed by Chen and Liu (2004) by reviewing

personalized environments, electronic commerce and search engines. A web

crawler design for data mining was proposed by Thelwali (2001) in order to

extract information with less processing time. Furthermore, a mining method presented by Chang and Lee (2005) retrieves sequential patterns over online data streams, which is useful for extracting embedded knowledge. Owing to the

current explosion of information and the accessibility of cheap storage, collecting

enormous amounts of data has been achievable during the last decades. The ultimate intent of this massive data collection is the utilization of this information to achieve

competitive benefits, by determining formerly unidentified patterns in data with

the aid of Online Analytical Processing (OLAP) tools. It is a highly tedious

process which illustrates the necessity of an automated process to ascertain

interesting and concealed patterns in data. Data mining techniques have

increasingly been studied especially in their application in real world databases

(Bigus, 1996; Mitchell, 1997; Sousa et al 1998).

Data mining is a major step in the Knowledge Discovery in Databases

(KDD) process. It consists of applying computational techniques that, under

acceptable computational efficiency limitations, produce a particular enumeration

of patterns (or models) over the data (Mitchell, 1997). Data mining involves the

use of sophisticated data analysis tools to discover previously unknown, valid

patterns and relationships in large data sets. These tools can include statistical


models, mathematical algorithms and machine learning methods (Jeffrey, 2004).

The progress in data mining research has made it possible to implement several

data mining operations efficiently on large databases (Fayyad, 1997).

1.1.1 Kinds of Information

A myriad of data is collected, ranging from simple numerical measurements and text documents to more complex information such as spatial data, multimedia

channels and hypertext documents. Here is a non-exclusive list of a variety of

information collected in digital form in databases and in flat files.

· Business transactions

· Scientific data

· Medical and personal data

· Surveillance video and pictures

· Satellite sensing

· Games

· Digital media

· CAD and Software engineering data

· Virtual Worlds

· Text reports and memos (e-mail messages)

· The World Wide Web repositories

1.1.2 Kinds of Data to be Mined

In principle, data mining is not specific to one type of media or data. Data

mining should be applicable to any kind of information repository. However,

algorithms and approaches may differ when applied to different types of data

(Han and Kamber, 2006). Indeed, the challenges presented by different types of

data vary significantly. Data mining is being put into use and studied for

databases, including relational databases, object-relational databases and


object-oriented databases, data warehouses, transactional databases, unstructured

and semi-structured repositories such as the World Wide Web, and advanced databases such as spatial, multimedia, time-series and textual databases, and even

flat files. Here are some examples in more detail:

· Flat files: Flat files are actually the most common data source for data

mining algorithms, especially at the research level. Flat files are simple data files

in text or binary format with a structure known by the data mining algorithm to be

applied. The data in these files can be transactions, time-series data, scientific

measurements, etc.

· Relational Databases: Briefly, a relational database consists of a set of

tables containing either values of entity attributes or values of attributes from

entity relationships. Tables have columns and rows, where columns represent

attributes and rows represent tuples. A tuple in a relational table corresponds to

either an object or a relationship between objects and is identified by a set of

attribute values representing a unique key.

The most commonly used query language for relational database is SQL.

SQL allows retrieval and manipulation of the data stored in the tables, as well as

the calculation of aggregate functions such as average, sum, min, max and count.

For instance, an SQL query to count the videos in each category would be: SELECT category, COUNT(*) FROM Items WHERE type = 'video' GROUP BY category;

Data mining algorithms using relational databases can be more versatile than data

mining algorithms specifically written for flat files, since they can take advantage

of the structure inherent to relational databases. While data mining can benefit

from SQL for data selection, transformation and consolidation, it goes beyond

what SQL could provide, such as predicting, comparing, detecting deviations, etc.


· Data Warehouses: A data warehouse is a repository of data collected from multiple data sources (often heterogeneous) and is intended to

be used as a whole under the same unified schema. A data warehouse gives the

option to analyze data from different sources under the same roof. Suppose, for instance, that a video store becomes a franchise in North America; the many video stores belonging to the company may have different databases and different

structures. If the executive of the company wants to access the data from all

stores for strategic decision-making, future direction, marketing, etc., it would be

more appropriate to store all the data in one site with a homogenous structure that

allows interactive analysis. In other words, data from the different stores would

be loaded, cleaned, transformed and integrated together. To facilitate

decision-making and multi-dimensional views, data warehouses are usually

modelled by a multi-dimensional data structure.

· Transaction Databases: A transaction database is a set of records

representing transactions, each with a time stamp, an identifier and a set of items.

For example, in the case of the video store, the rentals table represents the

transaction database. Each record is a rental contract with a customer identifier, a

date and the list of items rented (i.e., video tapes, games, VCRs, etc.). Since relational databases do not allow nested tables (i.e., a set as an attribute value),

transactions are usually stored in flat files or stored in two normalized transaction

tables, one for the transactions and another for the transaction items. One typical

data mining analysis on such data is the so-called market basket analysis or

association rules in which associations between items occurring together or in

sequence are studied.
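
As an illustration, the following Python sketch flattens a nested rental record into the two normalized tables described above, one for transactions and one for transaction items. The field and item names are assumed for illustration only and do not come from a particular system.

    # Minimal sketch (assumed, illustrative schema): flatten nested rental
    # records into two normalized tables, since relational tables cannot
    # hold a set of items as a single attribute value.
    rentals = [
        {"transaction_id": 1, "customer_id": 101, "date": "2010-03-01",
         "items": ["video_tape", "game", "vcr"]},
        {"transaction_id": 2, "customer_id": 102, "date": "2010-03-02",
         "items": ["video_tape"]},
    ]

    # Table 1: one row per transaction.
    transactions = [(r["transaction_id"], r["customer_id"], r["date"]) for r in rentals]

    # Table 2: one row per (transaction, item) pair.
    transaction_items = [(r["transaction_id"], item)
                         for r in rentals for item in r["items"]]

    print(transactions)       # [(1, 101, '2010-03-01'), (2, 102, '2010-03-02')]
    print(transaction_items)  # [(1, 'video_tape'), (1, 'game'), (1, 'vcr'), (2, 'video_tape')]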

· Multimedia Databases: Multimedia databases include video, images,

audio and text media. They can be stored on extended object-relational or


object-oriented databases or simply on a file system. Multimedia is characterized

by its high dimensionality, which makes data mining even more challenging.

Data mining from multimedia repositories may require computer vision,

computer graphics, image interpretation and natural language processing

methodologies.

· Time-Series Databases: Time-series databases contain time related

data such as stock market data or logged activities. These databases usually have

a continuous flow of new data coming in, which sometimes creates the need for challenging real-time analysis. Data mining in such databases commonly

includes the study of trends and correlations between evolutions of different

variables, as well as the prediction of trends and movements of the variables in

time.

· Spatial Databases: Spatial databases are databases that, in addition to

usual data, store geographical information like maps and global or regional

positioning. Such spatial databases present new challenges to data mining

algorithms.

· World Wide Web: The World Wide Web (WWW) is the most

heterogeneous and dynamic repository available. A very large number of authors

and publishers are continuously contributing to its growth and metamorphosis,

and a massive number of users are accessing its resources daily. Data in the

World Wide Web is organised in inter-connected documents. These documents

can be text, audio, video, raw data and even applications. Conceptually, the

World Wide Web comprises three major components: the content of the

Web, which encompasses documents available; the structure of the Web, which

covers the hyperlinks and the relationships between documents; and the usage of


the web, describing how and when the resources are accessed. A fourth

dimension relates to the dynamic nature or evolution of the documents.

Data mining in the World Wide Web or web mining tries to address all these

issues and is often divided into web content mining, web structure mining and

web usage mining.

1.1.3 Data Mining and Knowledge Discovery in Databases

The terms Knowledge Discovery in Databases and data mining are often

used interchangeably. Knowledge Discovery in Databases (KDD) is the process

of finding useful information and patterns in data. Data Mining is the use of

algorithms to extract the information and patterns derived by the KDD process.

Data Mining is formally defined as “the non-trivial extraction of implicit,

formerly unidentified, and potentially useful information from data” (Frawley et

al, 1992; Jeffrey, 2004; Fayyad 1997). With the amount of data stored in files, databases and other repositories increasing enormously, it is necessary to develop powerful means of analysis and interpretation of such data, so that the extraction of interesting knowledge can help in decision making.

Fig 1.1 An overview of the steps comprising the KDD process (Bradley et al, 1998)


While data mining and knowledge discovery in the databases are

frequently treated as synonyms, data mining is actually one step of the knowledge

discovery process. Fig 1.1 shows data mining as a step in an iterative knowledge

discovery process.

The KDD process consists of the following five steps:

1.1.3.1 Selection: The data needed for the data mining process may be obtained

from many different and heterogeneous data sources. This first step obtains

the data from various databases, files and non-electronic resources.

1.1.3.2 Pre-processing: The data to be used by the process may have incorrect or

missing data. There may be anomalous data from multiple sources

involving different data types and metrics. There may be many different

activities performed at this time. Erroneous data must be corrected or

removed, whereas missing data must be supplied or predicted (often using

data mining tools).

1.1.3.3 Transformation: Data from different sources must be converted into a

common format for processing. Some data may be encoded or transformed to reduce the number of possible data values being considered.

1.1.3.4 Data mining: Based on the data mining task being performed, this step

applies algorithms to the transformed data to generate the desired results.

1.1.3.5 Interpretation/evaluation: How the data mining results are presented to

the users is extremely important because the usefulness of the results is

dependent on it. Various visualisation and GUI strategies are used at this

last step.
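
The following Python sketch strings the selection, pre-processing and transformation steps together for a toy set of records, leaving the mining and interpretation steps to a dedicated algorithm. The field names, cleaning rules and binning choices are assumptions made purely for illustration and are not part of the KDD definition.

    # Illustrative sketch of the first three KDD steps on toy records.
    raw_records = [
        {"customer": "A", "age": 34, "amount": 120.0},
        {"customer": "B", "age": None, "amount": 80.0},   # missing value
        {"customer": "C", "age": 29, "amount": -5.0},     # erroneous value
    ]

    # 1. Selection: keep only the attributes needed for mining.
    selected = [{"age": r["age"], "amount": r["amount"]} for r in raw_records]

    # 2. Pre-processing: drop erroneous rows, fill missing ages with the mean.
    valid = [r for r in selected if r["amount"] >= 0]
    known_ages = [r["age"] for r in valid if r["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)
    for r in valid:
        if r["age"] is None:
            r["age"] = mean_age

    # 3. Transformation: discretize the amount into coarse bins to reduce
    #    the number of distinct values considered by the mining step.
    for r in valid:
        r["amount_bin"] = "high" if r["amount"] >= 100 else "low"

    print(valid)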


It is common to combine some of these steps together. For instance, data

cleaning and data integration can be performed together as a pre-processing phase

to generate a data warehouse. Data selection and transformation can also be

combined where the consolidation of the data is the result of selection or as for

the case of data warehouses, the selection is done on transformed data. The KDD

is an iterative process. Once the discovered knowledge is presented to the user,

the evaluation measures can be enhanced and the mining can be further refined or

new data can be selected or further transformed or new data sources can be

integrated, in order to get different and more appropriate results.

1.1.4 Basic Data Mining Tasks

Data mining commonly encompasses a variety of algorithms. In general

data mining tasks can be classified into two categories: Descriptive mining and

Predictive mining. Descriptive mining is the process of drawing the essential

characteristics or general properties of the data in the database. Clustering,

Association and sequential mining are some of the descriptive mining techniques.

Predictive mining is the process of inferring patterns from data to make

predictions. The Predictive mining techniques involve tasks like Classification,

Regression and Deviation detection. The following are the basic data mining tasks

(Margaret, 2008). These individual tasks may be combined to obtain more

sophisticated data mining applications.

· Classification

Classification maps data into predefined groups or classes. It is often

referred to as supervised learning because the classes are determined before

examining the data. Classification algorithms require that the classes be defined

based on data attribute values. They often describe these classes by looking at the


characteristics of data already known to belong to the classes. Pattern recognition

is a type of classification where an input pattern is classified into one of several

classes based on its similarity to these predefined classes.
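
As a concrete illustration, the short Python sketch below classifies new values by assigning them to the predefined class whose training examples have the nearest mean. The class labels and numbers are an assumed toy example, not a specific algorithm from the literature.

    # Nearest-class-mean classifier: classes are defined before the data
    # are examined (supervised learning), here by labelled training values.
    training = {
        "low_spender":  [12.0, 18.5, 22.0],
        "high_spender": [140.0, 210.0, 95.0],
    }

    # Describe each predefined class by the mean of its known members.
    class_means = {label: sum(vals) / len(vals) for label, vals in training.items()}

    def classify(x):
        # Map a new data item to the class with the closest mean.
        return min(class_means, key=lambda label: abs(x - class_means[label]))

    print(classify(20.0))   # -> 'low_spender'
    print(classify(120.0))  # -> 'high_spender'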

· Regression

Regression is used to map a data item to a real valued prediction variable.

In actuality, regression involves the learning of the function that does this

mapping. Regression assumes that the target data fit into some known type of

function (e.g., linear, logistic, etc.) and then determines the best function of this

type that models the given data. Some type of error analysis is used to determine

which function is “best”.
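
For instance, the sketch below assumes the target data fit a linear function and uses the least-squares criterion as the error analysis that determines the “best” function of that type. The data values are assumed for illustration.

    # Ordinary least-squares fit of a linear function y = a*x + b.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 4.0, 6.2, 7.9, 10.1]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Slope and intercept minimizing the summed squared error.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x

    sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    print(f"y = {a:.2f} * x + {b:.2f}, squared error = {sse:.3f}")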

· Time Series Analysis

With time series analysis, the value of an attribute is examined as it varies

over time. The values usually are obtained as evenly spaced time points (daily,

weekly, hourly, etc.). A time series plot is used to visualise the time series. There

are three basic functions performed in time series analysis. In one case, distance

measures are used to determine the similarity between different time series. In the

second case, the structure of the line is examined to determine its behaviour. A

third application would be to use the historical time series plot to predict future

values.
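
As a small illustration of the first of these functions, the sketch below measures the similarity between two evenly spaced time series with a simple Euclidean distance; the series values are assumed example data.

    import math

    # Two evenly spaced time series (e.g., weekly readings of an attribute).
    series_a = [10.0, 12.0, 11.5, 13.0, 14.2]
    series_b = [9.5, 12.5, 11.0, 13.5, 14.0]

    def euclidean_distance(s, t):
        # Smaller distance means the two series behave more similarly.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(s, t)))

    print(euclidean_distance(series_a, series_b))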

· Prediction

Many real-world data mining applications can be seen as predicting future

data states based on past and current data. Prediction can be viewed as a type of

classification. The difference is that prediction is predicting a future state rather


than a current state. Prediction applications include flooding, speech recognition,

machine learning and pattern recognition.

· Clustering

Clustering is similar to classification except that the groups are not

predefined, but rather defined by data alone. Clustering is alternatively referred to

as unsupervised learning or segmentation. It can be thought of as partitioning or

segmenting the data into groups that might or might not be disjoint. The

clustering is usually accomplished by determining the similarity among the data

on predefined attributes. The most similar data are grouped into clusters. Since

the clusters are not predefined, a domain expert is often required to interpret the

meaning of the created clusters.
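
The following Python sketch illustrates the idea with a basic k-means style procedure on one-dimensional values: the groups are not predefined, but emerge from the similarity of the data themselves. The values and the choice of k are assumed for illustration only.

    # Minimal k-means-style clustering of one-dimensional values into k groups.
    values = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9, 15.0, 15.5]
    k = 3
    # Spread-out initial guesses for the cluster centres.
    centres = [min(values), sum(values) / len(values), max(values)]

    for _ in range(10):  # a few refinement iterations
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]

    print(clusters)  # [[1.0, 1.2, 0.8], [8.0, 8.5, 7.9], [15.0, 15.5]]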

A special type of clustering is called Segmentation. With segmentation a

database is partitioned into disjoint groupings of similar tuples called segments.

Segmentation is often viewed as being identical to clustering. In other circles

segmentation is viewed as a specific type of clustering applied to a database

itself.

· Summarization

Summarization maps data into subsets with associated simple descriptions.

Summarization is also called characterisation or generalisation. It extracts or

derives representative information about the database. This may be accomplished

by actually retrieving portions of the data. Alternatively, summary type

information (such as mean of some numeric attribute) can be derived from the

data. The summarization succinctly characterizes the contents of the databases.


· Association Rules

Link analysis, alternatively referred to as affinity analysis or association

refers to the data mining task of uncovering relationships among data. The best

example of this type of application is to determine association rules. An

association rule is a model that identifies specific types of data associations.

These associations are often used in the retail sales community to identify items that

are frequently purchased together. Associations are also used in many other

applications such as predicting the failure of telecommunication switches.
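
As a small worked example with assumed toy transactions, the sketch below computes the support and confidence of the rule {bread} → {butter}, the two measures normally used to judge such associations.

    # Toy market-basket transactions (assumed data for illustration).
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
    ]

    antecedent, consequent = {"bread"}, {"butter"}

    n = len(transactions)
    count_ante = sum(1 for t in transactions if antecedent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = count_both / n              # 2/4 = 0.50
    confidence = count_both / count_ante  # 2/3 ~ 0.67

    print(f"support = {support:.2f}, confidence = {confidence:.2f}")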

· Sequence Discovery

Sequential analysis or sequence discovery is used to determine sequential

patterns in data. These patterns are similar to associations in that data (or events)

are found to be related, but the relationship is based on time. Unlike market basket analysis, which requires the items to be purchased at the same time, in sequence

discovery the items are purchased over time in some order.

1.2 MOTIVATION FOR THE RESEARCH

As the volume of data in warehouses and on the Internet keeps growing rapidly, the scalability of mining algorithms is a major concern. Classical association rule mining algorithms that require multiple passes over the entire database can take hours or even days to execute, and in the future this problem will only become worse. A sampling approach can be used to solve this scalability problem

(Haas, 1999). In the context of “standard” association rule mining, use of samples

can make mining studies feasible that were formerly impractical due to enormous

time requirements. Indeed, a number of large companies routinely run mining

algorithms on a sample rather than on the entire warehouse (Bin et al, 2002).

Especially, if the data comes as a stream flowing at a faster rate than can be


processed, sampling seems to be the only choice (Yanrong and Raj, 2005). Two

facts support the applicability of sampling in data mining. One is that some data

mining algorithms may not use all data in a data set all the time during their

execution (Fayyad and Smith, 1995). The other is that mining on part of the data

may produce comparable results on the whole data (Toivonen, 1996; Zaki et al

1996).

Sampling is a powerful data reduction technique that has been applied to a

variety of problems in database systems. In general, sampling can significantly

reduce the cost of mining, since the mining algorithms need to deal with only a

small dataset compared to the original database (Yanrong and Raj, 2005).

Sampling-based algorithms also facilitate interactive mining i.e., when the goal is

to obtain one or more “interesting” association rules as quickly as possible, a user

might first mine a very small sample (Bin et al 2002). Sampling has often been

suggested as an effective tool to reduce the size of the dataset operated on, at some

cost to accuracy (Parthasarathy, 2002). For example, statistics collected from a

sample of the database are used to generate near-optimal query execution plans

(Venkatesan et al 2009). A fairly obvious way of reducing the database activity of

knowledge discovery is to use only a random sample of a relation and to find

approximate regularities (Toivonen, 1996). Sampling is one of the principal

means to optimize the computational cost incurred in association rule mining by

reducing the time incurred for frequent itemset mining.

The importance of sampling for association rule mining has been

recognized by several researchers (Bin et al 2002; Toivonen, 1996; Zaki et al

1996; Surong et al, 2005). The usual approach is to take a randomly selected portion of the database of a previously determined size and then calculate the frequency of itemsets over the sample, using a lowered minimum support threshold that is slightly smaller than the user-specified minimum support σ (see the sketch following the list below). Also the

computational cost of association rule mining can be reduced in four ways:

· By reducing the number of passes over the database (Gu et al, 2000;

Agarwal et al, 2000; Yuang and Huang, 2005).

· By sampling the database. (Bin et al, 2002; Toivonen, 1996; Zaki et al

1996; Surong et al 2005)

· By adding extra constraints on the structure of patterns (Wojciechowski

and Zakrzewiez 2002).

· Through parallelization. (Li, 2004; Manning and Keane, 2001).
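
The sketch below illustrates the usual sampling approach mentioned above: a fixed-size random sample of transactions is drawn, and item frequencies are counted on the sample against a slightly lowered support threshold. It is an illustrative Python fragment under assumed data, not an algorithm taken from the cited works.

    import random
    from collections import Counter

    def frequent_items_on_sample(transactions, sample_size, min_support, lowering=0.9):
        """Estimate frequent single items from a random sample of transactions.

        min_support is the user-specified relative support; the sample is
        judged against a slightly lowered threshold to reduce the chance of
        missing itemsets that are frequent in the full database.
        """
        sample = random.sample(transactions, sample_size)
        lowered = min_support * lowering

        counts = Counter(item for t in sample for item in t)
        return {item for item, c in counts.items() if c / sample_size >= lowered}

    # Assumed toy database of transactions (sets of items).
    db = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a"}, {"c"}] * 100
    print(frequent_items_on_sample(db, sample_size=60, min_support=0.4))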

Recent work in the area of approximate aggregation processing shows that

the benefits of sampling are most fully realized when the sampling technique is

tailored to the specific problem at hand as stated by Acharya et al (2000).

The motivation behind this research is that mining large databases with the aid of an efficient sampling technique can generate effective and relevant association rules. It has been found that factors like

sample size, sampling method, transaction length, transaction frequency and

more, affect the quality of the samples selected for association rule mining.

Most of the association rule mining algorithms presented in the literature are used for mining static databases. Nowadays, the research community has turned its attention to incremental databases, on account of real-world applications and the necessity of handling dynamically inserted new records. Motivated by this interest, researchers have designed innovative incremental algorithms for association rule mining, because the quantity of data available in real-life databases is increasing at a tremendous rate.


The mining of association rules on transactional databases is usually an

offline process since it is costly to find the association rules in large databases.

With usual market-basket applications, new transactions are generated and old

transactions may be obsolete as time advances. As a result, incremental updating

techniques should be developed for maintenance of the discovered association

rules to avoid redoing mining on the whole updated database. A database may

allow frequent or occasional updates and such updates may not only invalidate

existing association rules but also activate new rules. Thus it is nontrivial to

maintain such discovered rules in large databases. Since the underlying

transaction database has been changed as time advances, some algorithms, such

as Apriori, may have to resort to the regeneration of candidate itemsets for the

determination of new frequent itemsets. This is, however, very costly even if the

incremental data subset is small.

1.3 PROBLEM SPECIFICATION

The volume of electronically accessible data in warehouses and on the

internet is growing faster than the speedup in processing times predicted by

Moore’s law (Winter and Auer, 1998). Scalability of mining algorithms is

therefore a major concern. Classical mining algorithms that require one or more

passes over the entire database can take hours or even days to execute and in the

future this problem will only worsen. One approach to the scalability problem is

to exploit the fact that approximate answers often suffice and execute mining

algorithms over a ‘synopsis’ or ‘sketch’. Many synopses proposed in the literature (Charu et al, 1999) require one or more expensive

passes over all the data. The use of synopses may still fail to adequately address

the scalability problem unless the cost of producing the synopses is amortized

over many queries.


Association rule mining (ARM) is one of the important issues in data mining, and there is a high demand for it in industry as the extraction of association rules is directly related to sales. It is used to extract hidden information from large data sets. Market Basket Analysis (MBA) is one of the common applications; it uses ARM to extract customers’ purchase behaviours in order to improve sales, as discussed by Chen et al (2005) and Brin et al (1997). Association rules also find application in various other areas such as telecommunication networks, market and risk management, and inventory control.

The task of mining association rules is usually performed in transactional

or relational databases, to derive a set of strong association rules. This may

require repeated scans through the database. Therefore, it can result in a huge

amount of processing when working on a very large database. Many efficient

algorithms can significantly improve the performance in both efficiency and

accuracy of association rules (Gu et al 2000). However, sampling can be a direct

and easy approach to improve the efficiency when accuracy is not the sole

concern.

In general the mining process of association rules can be divided into two steps.

1. Frequent Itemset Generation: generate all sets of items that have support

greater than a certain threshold, called minsupport.

2. Association Rule Generation: generate all association rules that have

confidence greater than a certain threshold called minconfidence from the

frequent itemsets.
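
A compact Python sketch of these two steps is given below. It is an illustrative, unoptimised levelwise implementation in the spirit of Apriori on assumed toy data; it is not the algorithm proposed in this thesis.

    from itertools import combinations

    def mine_association_rules(transactions, minsupport, minconfidence):
        n = len(transactions)
        support = {}  # frozenset -> relative support

        # Step 1: frequent itemset generation, level by level.
        items = {frozenset([i]) for t in transactions for i in t}
        level = {s for s in items
                 if sum(1 for t in transactions if s <= t) / n >= minsupport}
        frequent = set()
        k = 1
        while level:
            for s in level:
                support[s] = sum(1 for t in transactions if s <= t) / n
            frequent |= level
            k += 1
            # Candidate k-itemsets from unions of frequent (k-1)-itemsets.
            candidates = {a | b for a in level for b in level if len(a | b) == k}
            level = {c for c in candidates
                     if sum(1 for t in transactions if c <= t) / n >= minsupport}

        # Step 2: rule generation from the frequent itemsets.
        rules = []
        for s in frequent:
            for r in range(1, len(s)):
                for ante in map(frozenset, combinations(s, r)):
                    conf = support[s] / support[ante]
                    if conf >= minconfidence:
                        rules.append((set(ante), set(s - ante), conf))
        return rules

    db = [{"bread", "butter"}, {"bread", "butter", "milk"}, {"bread"}, {"milk"}]
    for ante, cons, conf in mine_association_rules(db, 0.5, 0.7):
        print(ante, "->", cons, f"confidence={conf:.2f}")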

It has been proved by Venkatesan et al (2009) that the first step of ARM

dictates the computational and I/O requirements necessitating repeated passes


over the complete database. The literature presents numerous algorithms for

ARM, among which Apriori developed by Agarwal et al (1994) has been the

most renowned chiefly owing to its efficiency in Knowledge discovery. An

apparent approach for achieving a considerable reduction in the computational

and I/O requirements of frequent itemset generation is sampling as stated by

Venkatesan et al (2009). However, the Apriori algorithm suffers from two bottlenecks, namely: 1) complex frequent itemset generation that consumes most of the time, space and memory, and 2) the need for multiple scans of the database

(Sotiris and Dimitris, 2006).

The execution of traditional association rule mining algorithms that

necessitate multiple passes over the complete database can consume

hours or even days. Recently, to reduce this adverse effect, researchers have endeavoured to develop efficient approaches that reduce the I/O and computational

requirements of the ARM techniques. Sampling has been recognized as a

dominant data reduction technique that has been successfully applied to an

assortment of data mining algorithms for reducing computational overhead.

In most sampling based association rule mining algorithms, the samples

are selected randomly from the large database without considering any of the

characteristics of the database. Moreover, it has been highly difficult to determine

an optimal sample size for effectively mining association rules.

Many algorithms have been developed for mining static datasets. It is nontrivial to maintain the discovered rules when large datasets change; this is the main idea behind Incremental Association Rule Mining (IARM), which has recently received much attention from data mining researchers. Applying data mining techniques to real-world applications is a challenging task because the databases are dynamic, i.e., they change continuously due to addition, deletion, modification,


etc., of the contained data (Zhang et al 2003). Generally, if the dataset is incremental in nature, the frequent itemset discovery problem consumes more time. From time to time, new records are added to an incremental dataset. Compared to the entire dataset, the size of the increments, or the number of records added to the dataset, is generally very small. But the rules holding in the updated dataset may get distorted due to the addition of these new records (Nath et al, 2010). Hence a few new association rules may be created and

a few old ones may become obsolete. When new transactions are inserted into the

original databases, traditional batch-mining algorithms resolve this problem by

reprocessing the entire new database. However, this requires much computational time and ignores the previously mined knowledge, as stated by Hong et al (2008).

In the real world where large amounts of data grow steadily, some old

association rules can become useless and new databases may give rise to some

implicitly valid patterns or rules. Hence, updating rules or patterns is important.

The FUP algorithm proposed by Cheung et al (1996) is well known in relation to

this issue. A simple method for solving the updating problem is to reapply the

mining to the entire database, but this approach is time consuming. The FUP

algorithm uses information from old frequent itemsets to improve its

performance. The FUP algorithm has been extended to handle insertions as well

as deletions from the original database. Several other approaches to incremental

mining have been proposed (Ayan et al, 1999; Cheung et al, 1997; Hong et al,

2001; Lee et al, 2001; Ng and Lam, 1999; Ng et al, 2001; Sarda and Srinivas,

1998; Thomas et al, 1997; Velosa et al, 2000; Sarasere, 1995).
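
The general idea can be sketched as follows: counts of previously frequent itemsets are updated by scanning only the newly inserted transactions and reusing the stored counts from the original database. This is an illustrative Python fragment of incremental support-count maintenance in the spirit of FUP, under assumed toy data; it is neither the FUP algorithm itself nor the approach proposed in this thesis.

    # Illustrative incremental maintenance of support counts (FUP-like idea):
    # when an increment db+ arrives, previously frequent itemsets are re-checked
    # by scanning only db+ and reusing the stored counts from the original DB.

    def update_frequent(old_counts, old_size, increment, minsupport):
        """old_counts: {frozenset: absolute count in the original database}."""
        new_size = old_size + len(increment)
        updated = {}
        for itemset, count in old_counts.items():
            count += sum(1 for t in increment if itemset <= t)  # scan db+ only
            if count / new_size >= minsupport:
                updated[itemset] = count
        # NOTE: itemsets that were infrequent before but become frequent after
        # the update cannot be detected from old_counts alone; FUP-style
        # algorithms handle them with a limited scan of the original database.
        return updated

    old_counts = {frozenset({"bread"}): 60, frozenset({"bread", "butter"}): 45}
    increment = [{"bread", "butter"}, {"milk"}, {"bread"}]
    print(update_frequent(old_counts, old_size=100, increment=increment, minsupport=0.4))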


1.4 OBJECTIVE OF THE RESEARCH

The discovery of association rules is a computationally expensive task.

Further, the transaction databases are typically very large. It is therefore

imperative to develop fast scalable techniques for mining them.

Following are the objectives of this thesis.

· To design a novel sampling based association rule mining algorithm that

- Effectively mines association rules by taking into account factors such as the sample size and the sampling method.

- Ultimately enhances the speed, efficiency and economy of mining

association rules.

- Achieves a good tradeoff between accuracy and time.

· To design an incremental mining algorithm for growing databases that

- Extends the Pre-FUFP maintenance algorithm proposed by Hong

and Wang (2001).

- Addresses the recent items handling problem during the incremental

mining process.

- Is designed to handle the recent items with more weightage, because these items obtain better reachability among customers.

- Adaptively sets the support value based on the number of

transactions present in the incremental database rather than the

whole updated database.

1.5 SCOPE OF THE RESEARCH

In this thesis, a new sampling algorithm and an incremental algorithm for effectively mining association rules are proposed. Progressive sampling is employed in the proposed approach to decide on an optimal sample size for achieving a desirable number and quality of association rules.

The proposed work can operate on all transaction databases. In addition to

that, it can be applied to data streams as well as continuous data such as the colour structure descriptors of images. It can also be applied to medical databases to mine for possible diseases and symptoms by mapping medical data to a transaction format suitable for mining association rules and identifying constraints, which may prove to be invaluable in future medical decision making.

Another area is direct marketing (retail, banking, insurance and fundraising

industries). Direct marketing is a modern business activity with an aim to

maximize the profit generated from marketing to a selected group of customers.

A key to direct marketing is to select a subset of customers so as to

maximize the profit return while minimizing the cost.

The proposed incremental approach efficiently handles the items that are

included recently in the updated database, based on an adaptive support threshold; it is most suitable for developing better business strategies and can be used for sales

forecasting.

1.6 RESEARCH CONTRIBUTIONS

Scalability is one of the issues arising in association rule mining. Other issues arising in incremental mining include the efficiency of algorithms for the task, the conciseness of the results output by these algorithms, and the re-mining of a database after it has been updated with fresh data. Each of these issues is addressed in this thesis.


The contributions of the thesis are:

(i) A novel progressive sampling based algorithm for the problem of

efficiently sampling for association rules is presented.

(ii) An extensive analysis of the progressive sampling-based approach for

association rule mining is done by using various real life datasets.

(iii) This approach is designed by taking into account the sample size and

sampling method with the intent of enhancing speed, efficiency and

economy of mining association rules.

(iv) An efficient incremental mining approach entitled “Enhanced Pre-FUFP” is presented.

(v) The proposed approach incorporates the recent items concept to further extend the Pre-FUFP maintenance algorithm, addressing the recent items handling problem during the incremental mining process; an adaptive threshold value is incorporated to avoid the costly process of rescanning.

1.7 CHAPTER ORGANIZATION

The remainder of this dissertation is organized in the following fashion:

Chapter 2 presents the necessary background for the thesis. This includes the ways of improving the efficiency of Association Rule Mining and reviews published research related to sampling-based association rule mining

algorithms as well as incremental mining algorithms that are closely related to the

research.

Chapter 3 presents the basic concepts and existing methodologies in

sampling techniques and incremental mining algorithms.


Chapter 4 presents the overall methodology of proposed progressive

sampling approaches for association rule mining from very large transaction

databases.

Chapter 5 presents the overall methodology of proposed algorithm for

incremental mining of Association Rules.

Chapter 6 presents the datasets that have been studied for carrying out the sampling-based as well as the incremental approach to ARM. The experimental

setup, performance study conducted on the proposed algorithms and experimental

results are also discussed in this chapter.

Finally, Chapter 7 summarizes and concludes the main contributions of the

proposed work and outlines future avenues to explore.