An Introduction to Data Mining

By Rand Ali

Computer Engineering & Information Technology Department

What is Data Mining?

Extraction of interesting patterns or knowledge from huge amounts of data.

Why Data Mining

The progress of computer hardware technology has led to large supplies of powerful and affordable computers, data collection equipment and storage media.

The last decade has experienced a revolution in information availability and exchange via the Internet.

Why Data Mining

The fast-growing, tremendous amount of data collected and stored in many large data repositories has far exceeded our human ability to comprehend it without powerful tools.

As a result, data collected in large data repositories become “data tombs”—data archives that are seldom visited.

We are data rich but information poor

Data Mining Objective

Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies and scientific and medical research.

Data mining turns data tombs into “golden nuggets” of knowledge.

Data mining—searching for knowledge (interesting patterns) in your data.

Data Mining Is a Step in the Knowledge Discovery Process

Knowledge discovery as a process is an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data).

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the database)

4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
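
The seven steps above can be sketched as a tiny pipeline. This is only an illustration in plain Python, not a definitive implementation: the record layout, the two toy sources, and the function name `kdd_pipeline` are assumptions made for the example, and the "mining" step is reduced to simple frequency counting.

```python
from collections import Counter

def kdd_pipeline(sources, min_count=2):
    """Toy walk-through of the knowledge discovery steps (illustrative only)."""
    # 1. Data cleaning: drop records whose item is missing (treated as noise).
    cleaned = [[rec for rec in src if rec.get("item")] for src in sources]
    # 2. Data integration: combine the cleaned sources into one collection.
    integrated = [rec for src in cleaned for rec in src]
    # 3. Data selection: keep only the attribute relevant to the task.
    items = [rec["item"] for rec in integrated]
    # 4. Data transformation: normalize spelling and aggregate into counts.
    counts = Counter(item.strip().lower() for item in items)
    # 5. Data mining: here the candidate patterns are just item frequencies.
    patterns = counts.items()
    # 6. Pattern evaluation: keep patterns that meet a minimum count.
    interesting = {item: n for item, n in patterns if n >= min_count}
    # 7. Knowledge presentation: return a sorted, readable summary.
    return sorted(interesting.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical data from two stores.
store_a = [{"item": "Printer"}, {"item": "computer"}, {"item": None}]
store_b = [{"item": "printer "}, {"item": "Computer"}, {"item": "camera"}]
summary = kdd_pipeline([store_a, store_b])
print(summary)  # [('computer', 2), ('printer', 2)]
```

In practice the steps are iterative: the result of pattern evaluation often sends the analyst back to selection or transformation.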

What makes a pattern interesting?

A pattern is interesting if it is:

1. easily understood by humans;

2. valid on new or test data with some degree of certainty;

3. potentially useful;

4. novel.

Origins of Data Mining

Primary Data Mining Tasks

In general, data mining tasks can be classified into two categories: descriptive and predictive.

Predictive methods use some variables to predict unknown or future values of other variables.

Ex: Classification, Regression, Deviation Detection.

Descriptive methods characterize the general properties of the data in the database.

Ex: Association Rule Discovery, Clustering, Sequential Pattern Discovery.

1-Association Rule Discovery

Given a set of records, each of which contains some number of items from a given collection, association rule discovery produces dependency rules that predict the occurrence of an item based on the occurrences of other items.

2-Sequential Pattern Discovery

Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns.

An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.”
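
As a rough sketch of how such a pattern could be checked, the snippet below counts the fraction of customers who bought one item and then another within a time window. The purchase histories, customer ids, and the function name `sequence_support` are hypothetical. Real sequential pattern miners search for frequent subsequences rather than testing one given pair.

```python
def sequence_support(histories, first, then, max_gap_days=30):
    """Fraction of customers who bought `first` and later `then` within max_gap_days."""
    matches = 0
    for purchases in histories.values():
        # Each history is a list of (day, item) pairs.
        first_days = [day for day, item in purchases if item == first]
        then_days = [day for day, item in purchases if item == then]
        # The order matters: `then` must come strictly after `first`.
        if any(0 < t - f <= max_gap_days for f in first_days for t in then_days):
            matches += 1
    return matches / len(histories)

# Hypothetical purchase histories (day number, item).
histories = {
    "c1": [(1, "camera"), (12, "printer")],
    "c2": [(3, "camera"), (90, "printer")],
    "c3": [(5, "printer"), (9, "camera")],
    "c4": [(2, "camera"), (20, "printer"), (25, "paper")],
}
support = sequence_support(histories, "camera", "printer")
print(support)  # 0.5
```

Customer c2 does not count because the printer was bought too late, and c3 does not count because the order is reversed.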

3-Classification

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
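
A minimal sketch of this idea uses a 1-nearest-neighbor classifier, one of many possible models and chosen here only because it is short. The attributes (age, income in thousands), the labels, and the function names are hypothetical.

```python
import math

def train_nearest_neighbor(training_data):
    """'Train' a 1-nearest-neighbor model: it simply stores the labeled examples."""
    return list(training_data)

def predict(model, features):
    """Return the class label of the closest training object (Euclidean distance)."""
    nearest = min(model, key=lambda example: math.dist(example[0], features))
    return nearest[1]

# Hypothetical training data: (age, income in thousands) with a known class label.
training_data = [
    ((25, 30), "no"),
    ((30, 40), "no"),
    ((45, 90), "yes"),
    ((50, 110), "yes"),
]
model = train_nearest_neighbor(training_data)

# Predict the label of an object whose class is unknown.
label = predict(model, (48, 100))
print(label)  # yes
```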

Classification Example

4-Regression

Whereas classification predicts categorical (discrete, unordered) labels, regression analysis is used to predict missing or unavailable numerical data values rather than class labels.
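
As an illustration, the sketch below fits a straight line by ordinary least squares and uses it to predict a numerical value. Simple linear regression is only one form of regression analysis, and the data (years of experience versus salary in thousands) and the name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary (thousands).
xs = [1, 2, 3, 4]
ys = [30, 35, 40, 45]
slope, intercept = fit_line(xs, ys)

# Predict the unavailable numerical value for x = 5.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)  # 5.0 25.0 50.0
```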

5-Clustering

Clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels.

Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
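
One common way to form such clusters is k-means, sketched below for one-dimensional values. This is a simplified illustration under stated assumptions: the first k points seed the centroids, the number of iterations is fixed, and the data and the name `k_means` are hypothetical.

```python
def k_means(points, k, iterations=10):
    """Very small k-means for 1-D values; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical unlabeled data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means(points, k=2)
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

The index of the cluster a point falls into can then serve as a generated class label.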

6-Outlier Analysis

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
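
A simple way to flag such objects, assuming a single numeric attribute, is a z-score rule: values that lie more than a chosen number of standard deviations from the mean are reported. The threshold of 2.0, the data, and the name `find_outliers` are assumptions for this sketch, and other outlier definitions (distance-based or density-based, for example) are equally valid.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical purchase amounts; one value does not follow the general behavior.
amounts = [10, 12, 11, 13, 12, 11, 95]
outliers = find_outliers(amounts)
print(outliers)  # [95]
```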

Application 1: Market Basket Analysis

Market basket analysis studies customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.

The discovery of such associations can help to develop marketing strategies by gaining insight into which items are frequently purchased together by customers.

Possible Marketing Strategies

In one strategy, items that are frequently purchased together can be placed in proximity in order to further encourage the sale of such items together.

Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables.

The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.
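
The Boolean-vector representation can be sketched as follows. The baskets and the function name `to_boolean_vectors` are hypothetical; the universe is taken to be every item that appears in at least one basket, sorted alphabetically so that each position in a vector always refers to the same item.

```python
def to_boolean_vectors(baskets):
    """Represent each basket as a Boolean vector over the universe of items."""
    universe = sorted({item for basket in baskets for item in basket})
    vectors = [[item in basket for item in universe] for basket in baskets]
    return universe, vectors

# Hypothetical shopping baskets.
baskets = [
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"printer"},
]
universe, vectors = to_boolean_vectors(baskets)
print(universe)
for vector in vectors:
    print(vector)
```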

For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the association rule below:

computer => antivirus_software [support = 2%, confidence = 60%]   (1)

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules.

A support of 2% for Association Rule (1) means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.

A confidence of 60% means that 60% of the customers who purchased a computer also bought the antivirus software.
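
Both measures can be computed directly from the transactions. The sketch below assumes each transaction is a set of item names; the ten transactions are hypothetical toy data and are not meant to reproduce the 2% support in the rule above.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions containing every item on both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    # Transactions containing the left-hand side only.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical transactions.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
    {"camera"},
    {"printer", "camera"},
    {"paper"},
    {"camera", "paper"},
]
support, confidence = support_and_confidence(
    transactions, {"computer"}, {"antivirus_software"}
)
print(support, confidence)  # 0.3 0.6
```

Here 3 of the 10 transactions contain both items (support 30%), and 3 of the 5 transactions with a computer also contain antivirus software (confidence 60%).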

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

Such thresholds can be set by users or domain experts.
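
Threshold-based filtering might look like the sketch below, which enumerates only one-item-to-one-item rules by brute force. The transactions, the threshold values, and the name `interesting_rules` are hypothetical; practical algorithms such as Apriori avoid this exhaustive enumeration and handle larger itemsets.

```python
from itertools import permutations

def interesting_rules(transactions, min_support, min_confidence):
    """Return (a, b, support, confidence) for every rule a => b passing both thresholds."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        support = n_ab / n
        confidence = n_ab / n_a  # n_a >= 1 because a occurs in some transaction
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules

# Hypothetical transactions and user-chosen thresholds.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"printer"},
]
rules = interesting_rules(transactions, min_support=0.5, min_confidence=0.6)
for a, b, support, confidence in rules:
    print(f"{a} => {b} [support={support:.0%}, confidence={confidence:.0%}]")
```

Raising either threshold shrinks the set of reported rules; with these toy values only the two rules linking computers and antivirus software survive.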

Application 2: Data Mining & DNA Data Analysis

A great deal of biomedical research has focused on DNA data analysis.

Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention, and treatment.

An important focus in genome research is the study of DNA sequences since such sequences form the foundation of the genetic codes of all living organisms.

All DNA sequences comprise four basic building blocks (called nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).

These four nucleotides are combined to form long sequences or chains that resemble a twisted ladder.

DNA structure

Human beings have roughly 20,000 protein-coding genes (earlier estimates of around 100,000 have since been revised downward).

Most diseases are not triggered by a single gene but by a combination of genes acting together.

Association analysis methods can be used to help determine the kinds of genes that are likely to co-occur in target samples.

Such analysis would facilitate the discovery of groups of genes and the study of interactions and relationships between them.

Thank you
