datamining2.0

8/8/2019 DATAMINING2.0


DATA MINING

Prof. Jyotiranjan Hota



DATA MINING

Data mining is the search for relationships and global patterns that exist in large databases but are hidden among vast amounts of data, such as the relationship between patient data and medical diagnoses. Such a relationship represents valuable knowledge about the database and the objects in it, provided the database is a faithful mirror of the real world it records.

Data mining is the non-trivial extraction of implicit, previously unknown and potentially useful information from data. It encompasses a number of technical approaches, such as clustering, data summarization, classification, finding dependency networks, analyzing changes and detecting anomalies.



KDD VIS-À-VIS DATA MINING

KDD (knowledge discovery in databases) seeks knowledge from data. KDD was formalized in 1989. It is the process of identifying valid, potentially useful and ultimately understandable structure in data.

Steps:

1. Data selection
2. Data cleaning and preprocessing
3. Data transformation and reduction
4. Data mining algorithm selection
5. Post-processing and interpretation of the discovered knowledge
6. Data visualization

The KDD process is highly iterative and interactive.

Data mining is only one of the many steps involved in knowledge discovery in databases.
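The KDD steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field layout and the function names (`select`, `clean`, `transform`, `mine`, `interpret`) are assumptions made for the example, and the "mining" step is deliberately trivial. Visualization (step 6) is reduced to printing.

```python
# Hypothetical raw records: (customer_id, age, amount_spent); None marks a missing value.
raw_records = [
    (1, 25, 120.0),
    (2, None, 80.0),
    (3, 47, 300.0),
    (4, 52, None),
    (5, 31, 150.0),
    (6, 60, 410.0),
]

def select(records):
    """Step 1: keep only the fields relevant to the task (age, amount)."""
    return [(age, amount) for _, age, amount in records]

def clean(records):
    """Step 2: drop records with missing values."""
    return [r for r in records if None not in r]

def transform(records):
    """Step 3: reduce age to a coarse band (its decade)."""
    return [(age // 10 * 10, amount) for age, amount in records]

def mine(records):
    """Step 4: a trivial 'mining' step, the average spend per age band."""
    totals = {}
    for band, amount in records:
        total, count = totals.get(band, (0.0, 0))
        totals[band] = (total + amount, count + 1)
    return {band: total / count for band, (total, count) in totals.items()}

def interpret(patterns):
    """Step 5: rank the discovered patterns for presentation."""
    return sorted(patterns.items(), key=lambda item: item[1], reverse=True)

ranked = interpret(mine(transform(clean(select(raw_records)))))
for band, avg in ranked:
    print(f"age band {band}s: average spend {avg:.2f}")
```

In practice the chain is rerun many times with different selections and transformations, which is what makes the process iterative.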



DBMS VIS-À-VIS DM

If we know exactly what information we are seeking, a DBMS query suffices. If we only vaguely know the possible correlations or patterns, data mining techniques are useful. One of the tasks of data mining is hypothesis testing, in which we formulate a hypothesis and test it by sifting through the database. In this sense a DBMS already supports some primitive data mining tasks.

There are three different ways in which data mining systems use a relational DBMS.

1. DM may not use the DBMS at all: The DM system uses its own memory and storage management. The DBMS is treated as a data repository from which data is downloaded into the DM system's own memory structures before the DM algorithm starts. The advantage is that memory management can be optimized for the specific data mining algorithm. The drawback is that such systems ignore field-proven DBMS technologies such as recovery and concurrency.

2. Loosely coupled: The DBMS is used only for storage and retrieval of data. Loosely coupled SQL fetches data records as the data mining algorithm requires them. The front end of the application is implemented in a host programming language with embedded SQL statements.

3. Tightly coupled approach: Portions of the application program are selectively pushed into the database system to perform the necessary computation. Data is stored in the database and all processing is done at the database end. This avoids performance degradation and takes full advantage of database technology. The performance of this approach depends on how well the DM process is optimized when it is mapped to a query. Two approaches have been suggested: using the built-in query optimizer of the DBMS, or using an external optimizer.
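The difference between the loosely and tightly coupled styles can be illustrated with Python's standard `sqlite3` module. This is a minimal sketch under stated assumptions: the `purchases` table and its rows are invented for the example, and an in-memory SQLite database stands in for a full relational DBMS. In the loosely coupled variant, SQL only fetches rows and the counting is done in the host language; in the tightly coupled variant the same computation is pushed into the database as a query.

```python
import sqlite3
from collections import Counter

# Hypothetical table of purchases, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("a", "bread"), ("a", "milk"), ("b", "bread"), ("c", "milk"), ("c", "bread")],
)

# Loosely coupled: SQL only fetches rows; counting happens in the host language.
rows = conn.execute("SELECT item FROM purchases").fetchall()
loose_counts = Counter(item for (item,) in rows)

# Tightly coupled: the computation is pushed into the DBMS as a query.
tight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item").fetchall()
)

print(dict(loose_counts), tight_counts)
conn.close()
```

Both variants produce the same item counts; what differs is where the work is done and how much data has to leave the database.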

Related Areas of Data Mining:

STATISTICS: Statistics is a theory-rich approach to data analysis. Analysts use statistical analysis systems to detect unusual patterns and to explain patterns with statistical models such as linear models.

MACHINE LEARNING: ML is the automation of a learning process, where learning is tantamount to the construction of rules based on observations. A learning algorithm takes a data set and its accompanying information as input and returns a concept representing what was learned as output.

1. Supervised Learning: SL means learning from examples, where a training set is given whose members act as examples of the classes. The system finds a description of each class. Once the description, and hence the classification rule, has been formulated, it is used to predict the class of previously unseen objects. This is similar to discriminant analysis in statistics.

2. Unsupervised Learning: This is learning from observation and discovery. There is no training set and no prior knowledge of the classes. The system analyzes the given data to observe similarities emerging among subsets of the data. This is similar to cluster analysis in statistics.
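As a rough illustration of supervised learning, the sketch below builds a "description" of each class from a labelled training set and then uses it to predict the class of unseen points. The choice of a nearest-centroid rule, the two-dimensional points and the labels `low` and `high` are all assumptions made for this example; the slides do not prescribe a particular method.

```python
def train(examples):
    """Build one description per class: the mean (centroid) of its training points."""
    sums = {}
    for (x, y), label in examples:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def predict(centroids, point):
    """Assign an unseen point to the class whose centroid is nearest."""
    px, py = point
    return min(
        centroids,
        key=lambda label: (centroids[label][0] - px) ** 2
        + (centroids[label][1] - py) ** 2,
    )

# Hypothetical labelled training set.
training_set = [((1, 1), "low"), ((2, 1), "low"), ((8, 9), "high"), ((9, 8), "high")]
centroids = train(training_set)
print(predict(centroids, (2, 2)), predict(centroids, (7, 9)))
```

An unsupervised method would receive the same points without the labels and would have to find the two groups on its own.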

MATHEMATICAL PROGRAMMING: Most data mining tasks can be equivalently formulated as problems in mathematical programming, for which efficient algorithms are available. One of the major active research topics in this field is the support vector machine approach to classification.



DM TECHNIQUES:

The fundamental goals of data mining are:

Prediction: It uses existing variables in the database to predict unknown or future values of interest.

Description: It focuses on finding patterns that describe the data and on presenting them for user interpretation.

Another way to study data mining techniques is to classify them as:

1. User-guided or verification-driven DM
2. Discovery-driven DM, or automatic discovery of rules

Most DM techniques have elements of both models.



VERIFICATION MODEL: Here the user makes a hypothesis and tests it on the data to verify its validity.

EXAMPLE: In a supermarket with a limited budget for a mailing campaign to launch a new product, it is important to identify the section of the population most likely to buy the new product. The user formulates a hypothesis to identify potential customers and their common characteristics. Historical transaction data and demographic information can then be queried to reveal comparable purchases and the characteristics shared by those purchasers. The whole operation can be repeated with successive refinements of the hypothesis until the required limit is reached. The user may come up with a new hypothesis or refine the existing one and verify it against the database.
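One round of the verification loop might look like the sketch below. The hypothesis ("customers under 30 are more likely to buy a comparable product"), the `history` records and the helper `purchase_rate` are all hypothetical and serve only to show the pattern of formulating a hypothesis and checking it against historical data.

```python
# Hypothetical history: (age, bought_comparable_product)
history = [
    (22, True), (25, True), (28, False), (27, True),
    (35, False), (41, True), (50, False), (63, False),
]

def purchase_rate(records, predicate):
    """Share of customers matching the predicate who bought a comparable product."""
    matching = [bought for age, bought in records if predicate(age)]
    return sum(matching) / len(matching) if matching else 0.0

# Hypothesis under test: customers under 30 buy more often than the rest.
under_30 = purchase_rate(history, lambda age: age < 30)
others = purchase_rate(history, lambda age: age >= 30)
print(f"under 30: {under_30:.2f}, 30 and over: {others:.2f}")
```

If the rates do not differ enough to justify the mailing, the user would refine the predicate (for example, a different age limit or an added demographic condition) and query again.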



Discovery Model

Here the system automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalizations about the data without intervention or guidance from the user.

An example of such a model is a supermarket database that is mined to discover particular groups of customers to target for a mailing campaign. The data is searched with no hypothesis in mind other than for the system to group the customers according to the common characteristics found. Typical discovery-driven tasks are:

Discovery of association rules
Discovery of classification rules
Clustering
Discovery of frequent episodes
Deviation detection



DISCOVERY OF ASSOCIATION RULES

An association rule is an expression of the form X => Y, where X and Y are sets of items. Given a database, the goal is to discover all rules whose support and confidence are greater than or equal to the minimum support and minimum confidence, respectively.

Let L = {l1, l2, l3, ..., lm} be a set of items. Let D, the database, be a set of transactions, where each transaction T is a set of items. T supports an item x if x is in T. T is said to support a subset of items X if T supports each item x in X. The rule X => Y holds with confidence c if c% of the transactions in D that support X also support Y. The rule X => Y has support s in the transaction set D if s% of the transactions in D support X ∪ Y.

Support measures how often X and Y occur together as a percentage of the total transactions. Confidence measures how strongly the occurrence of Y depends on the occurrence of X. Patterns with high support and confidence are therefore of great interest to the end user, while patterns with very low support and confidence are of little or no significance.
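The definitions of support and confidence translate directly into code. The sketch below follows the definitions above, returning fractions rather than percentages; the four transactions in `D` and the item names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Among the transactions that support X, the fraction that also support Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical transaction database D: each transaction is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(D, {"bread", "milk"})       # support of the rule bread => milk
c = confidence(D, {"bread"}, {"milk"})  # confidence of the rule bread => milk
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

Here bread and milk occur together in 2 of 4 transactions, and 2 of the 3 transactions containing bread also contain milk. A rule miner would keep only rules whose two values meet the user-chosen minimum thresholds.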



CLUSTERING

Clustering is a method of partitioning data into groups so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms.

Example: A retailer may want to know where similarities exist in the customer base in order to create and understand different groups. The retailer can use the existing customer database or, more specifically, the transactions collected over a period of time. Clustering methods help identify different categories of customers. During the discovery process, differences between data sets are used to separate them into different groups, and similarities between data sets are used to group similar data together.
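As one possible illustration of the retailer example, the sketch below groups customers by a single number (their spend) using k-means, a common clustering method that the slides do not name. The `spend` values and the starting centers are assumptions chosen so that the run is deterministic.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data, starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to the nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical customer spend values.
spend = [10, 12, 11, 95, 100, 105]
centers, groups = kmeans_1d(spend, centers=[0.0, 50.0])
print(centers, groups)
```

No class labels are supplied: the two customer groups emerge only from the similarities and differences in the data.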



DISCOVERY OF CLASSIFICATION RULES

Classification of large data sets is an important problem in data mining. Given a database with a number of records and a set of classes such that each record belongs to one of the given classes, the classification problem is to decide the class to which a given record belongs. The classification problem is also concerned with generating a description or model for each class from the given data set. There are several classification discovery models, such as:

Decision trees
Neural networks
Genetic algorithms
Statistical models such as linear/geometric discriminants

Applications include:

1. Credit card analysis
2. Banking
3. Medical applications, etc.

Example: Domestic flights in our country were at one time operated only by Indian Airlines. Recently many private airlines have started operations for domestic travel. Some Indian Airlines customers began flying with these private airlines, and as a result Indian Airlines lost these customers. Indian Airlines wants to understand why some customers are loyal while others leave. Ultimately, the airline wants to predict which customers it is most likely to lose to its competitors. The aim is to build a model based on the historical data of loyal customers versus customers who left. This becomes a classification problem. It is a supervised learning task, as the historical data becomes the training set used to train the model. The decision tree is among the most popular classification techniques.
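The airline example can be sketched with the simplest possible decision tree, a single split on one attribute (often called a decision stump). The attribute `flights_per_year`, the training records and the function `fit_stump` are hypothetical; a real decision tree learner would consider many attributes and split recursively.

```python
# Hypothetical training set: (flights_per_year, left_airline)
customers = [(2, True), (3, True), (5, True), (8, False), (12, False), (15, False)]

def fit_stump(data):
    """Find the threshold t for which the rule 'flights < t => leaves'
    makes the fewest errors on the training set."""
    best_t, best_errors = None, len(data) + 1
    for t in sorted({flights for flights, _ in data}):
        errors = sum((flights < t) != left for flights, left in data)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

threshold, errors = fit_stump(customers)

def predict_leaves(flights_per_year):
    """Apply the learned rule to a previously unseen customer."""
    return flights_per_year < threshold

print(threshold, errors, predict_leaves(4), predict_leaves(10))
```

The learned rule doubles as a readable description of the "leaves" class, which is one reason tree-based models are attractive for this kind of question.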

NEURAL NETWORK

Neural networks (NNs) are a new paradigm in computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic attempts to model learning in the nervous system. Neural networks have a remarkable ability to derive meaning from complicated data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques.
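The idea of a mathematical structure that learns can be shown with a single artificial neuron (a perceptron), the smallest building block of a neural network. This is only a toy sketch: the task (learning the logical AND of two inputs), the learning rate and the number of epochs are assumptions chosen for illustration, and practical networks combine many such units.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Learn weights w and bias b so that (w . x + b > 0) matches the targets."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output
            # Perceptron rule: nudge the weights in the direction that reduces the error.
            w[0] += learning_rate * error * x1
            w[1] += learning_rate * error * x2
            b += learning_rate * error
    return w, b

def classify(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

# Training examples for logical AND.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_samples)
print([classify(w, b, x1, x2) for (x1, x2), _ in and_samples])
```

The weights start at zero and are adjusted only from the examples; nothing about AND is coded into the neuron itself.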



Genetic Algorithms:

Genetic algorithms are a relatively new computing paradigm, inspired by Darwin's theory of evolution. A population of individuals, each representing a possible solution to a problem, is initially created at random. Then pairs of individuals combine to produce offspring for the next generation. A mutation process is also used to randomly modify the genetic structure of some members of each new generation. The algorithm runs to generate solutions for successive generations. The probability of an individual reproducing is proportional to the goodness of the solution it represents. Hence, the quality of the solutions in successive generations improves. The process is terminated when an acceptable or optimum solution is found, or after some fixed time limit.

Genetic algorithms are appropriate for problems that require optimization with respect to some computable criterion. The paradigm can also be applied to data mining problems. The quantity to be minimized is often the number of classification errors on a training set.
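The loop described above (random initial population, fitness-weighted reproduction, crossover, mutation, termination) can be sketched as follows. The toy problem, maximizing the number of 1 bits in a bit string, and all parameter values are assumptions for illustration; in a data mining setting the fitness would instead reflect something like the number of classification errors on a training set.

```python
import random

GENES = 20  # length of each bit-string individual (illustrative choice)

def fitness(individual):
    """Goodness of a solution: the number of 1 bits (a toy criterion)."""
    return sum(individual)

def evolve(pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Initial population created at random.
    population = [[rng.randint(0, 1) for _ in range(GENES)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _generation in range(generations):
        # Reproduction chance grows with fitness; the +1 is an assumption
        # added here so that the weights can never all be zero.
        weights = [fitness(ind) + 1 for ind in population]
        next_gen = []
        for _child in range(pop_size):
            mother, father = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, GENES)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_gen.append(child)
        population = next_gen
        best = max(population + [best], key=fitness)
        if fitness(best) == GENES:  # optimum found, stop early
            break
    return best

best = evolve()
print(fitness(best), "of", GENES)
```

The sketch remembers the best individual seen so far, so the returned solution is never worse than the best member of the initial population.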



SUPPORT VECTOR MACHINES

SVMs are based on statistical learning theory and are increasingly becoming useful in data mining. The main idea is to map the data set non-linearly into a high-dimensional feature space and use a linear discriminator to classify the data. Their success has been demonstrated in the areas of regression, classification and decision-tree construction.
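The core idea, a non-linear map followed by a linear discriminator, can be seen on a tiny example. This is not an SVM implementation: the weights are picked by hand rather than learned by margin maximization, and the data, the map x -> (x, x^2) and the function names are assumptions made for illustration only.

```python
# One-dimensional points: class +1 when |x| > 1, otherwise -1.
# No single threshold on x separates the two classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def feature_map(x):
    """Non-linear map into a two-dimensional feature space: x -> (x, x^2)."""
    return (x, x * x)

def linear_discriminator(features, w=(0.0, 1.0), b=-1.0):
    """Sign of w . phi(x) + b. Here w and b are hand-picked, not learned."""
    score = w[0] * features[0] + w[1] * features[1] + b
    return 1 if score > 0 else -1

predictions = [linear_discriminator(feature_map(x)) for x in points]
print(predictions == labels)
```

In the mapped space the classes are split by the straight line x^2 = 1, which is impossible in the original one-dimensional space. An actual SVM would choose the separating hyperplane with the largest margin.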

DM Problems

Sequence Mining: It is concerned with mining sequence data. In the discovery of association rules, we are interested in finding associations between items irrespective of their order of occurrence. For example, we may be interested in the association between the purchase of a particular brand of soft drink and the occurrence of stomach upsets. But it is more relevant to identify whether there is a pattern of stomach upsets occurring after the purchase of the soft drink.
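The difference between an unordered association and an ordered sequence pattern can be made concrete with a small count. The event names and the per-customer `histories` below are invented for the example, and the helper `occurs_in_order` is a hypothetical name.

```python
def occurs_in_order(sequence, first, second):
    """True if `first` appears in the sequence and `second` appears later."""
    if first not in sequence:
        return False
    return second in sequence[sequence.index(first) + 1:]

# Hypothetical per-customer event histories, ordered in time.
histories = [
    ["soft_drink", "stomach_upset"],
    ["stomach_upset", "soft_drink"],
    ["soft_drink", "snack", "stomach_upset"],
    ["snack"],
]

# Association view: both events occur, in any order.
together = sum({"soft_drink", "stomach_upset"} <= set(h) for h in histories)
# Sequence view: the upset occurs after the purchase.
ordered = sum(occurs_in_order(h, "soft_drink", "stomach_upset") for h in histories)
print(together, ordered)
```

Three histories contain both events, but only two have the upset after the purchase, which is the pattern sequence mining is after.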



Web Mining: The WWW is a fertile area for DM research. Web mining can be broken down into the following subtasks.

1. Resource Finding: Retrieving the intended documents from the web.

2. Information Selection and Preprocessing: Automatically selecting and preprocessing specific information from the sources retrieved from the web.

3. Generalization: Automatically discovering general patterns at individual web sites as well as across multiple sites.

4. Analysis: Validation and/or interpretation of the mined patterns.

Text Mining

In text mining, text documents can be structured by means of information extraction, text categorization or NLP techniques as a preprocessing step before performing any kind of KDTs. Text mining covers:

1. Text categorization
2. Exploratory data analysis
3. Text clustering
4. Finding patterns in text databases
5. Information extraction



SPATIAL DATA MINING

It deals with spatial or location data. Developments in IT, digital mapping, remote sensing and the global diffusion of GIS place demands on developing data-driven inductive approaches to spatial analysis.
