An Introduction to Data Mining By Rand Ali Computer Engineering & Information Technology Department

Upload: alberto-dallas

Post on 01-Apr-2015

An Introduction to Data Mining

By Rand Ali

Computer Engineering & Information Technology Department

What is Data Mining?

Data mining is the extraction of interesting patterns or knowledge from huge amounts of data.

Why Data Mining?

The progress of computer hardware technology has led to large supplies of powerful and affordable computers, data collection equipment and storage media.

The last decade has experienced a revolution in information availability and exchange via the Internet.

Why Data Mining?

The fast-growing, tremendous amount of data, collected and stored in many large data repositories, has far exceeded our human ability to comprehend it without powerful tools.

As a result, data collected in large data repositories become “data tombs”: data archives that are seldom visited.

We are data rich but information poor

Data Mining Objective

Data mining tools perform data analysis and may uncover important data patterns, contributing greatly to business strategies and to scientific and medical research.

Data mining turns data tombs into “golden nuggets” of knowledge.

Data mining—searching for knowledge (interesting patterns) in your data.

Data mining is one step of the knowledge discovery process.

Knowledge discovery as a process is an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data).

2. Data integration (where multiple data sources may be combined).

3. Data selection (where data relevant to the analysis task are retrieved from the database).

4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations).

5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns).

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures).

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user).
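
The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a real knowledge discovery system: the two "stores", their records, the function names, and the choice of "count how many customers bought each item" as the mining step are all assumptions made for the example.

```python
# Hypothetical toy records from two sources; every name and value is made up.
store_a = [{"customer": "c1", "item": "computer", "price": 900.0},
           {"customer": "c2", "item": "printer", "price": None}]  # noisy record
store_b = [{"customer": "c1", "item": "antivirus", "price": 40.0},
           {"customer": "c3", "item": "computer", "price": 950.0}]

def clean(records):
    """Step 1: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 2: combine several sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate into one basket (set of items) per customer."""
    baskets = {}
    for r in records:
        baskets.setdefault(r["customer"], set()).add(r["item"])
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item."""
    counts = {}
    for items in baskets.values():
        for item in items:
            counts[item] = counts.get(item, 0) + 1
    return counts

def evaluate(counts, min_count):
    """Step 6: keep only patterns that meet an interestingness threshold."""
    return {item: n for item, n in counts.items() if n >= min_count}

def present(patterns):
    """Step 7: format the surviving patterns for the user."""
    return [f"{item}: bought by {n} customers" for item, n in sorted(patterns.items())]

records = integrate(clean(store_a), clean(store_b))
baskets = transform(select(records, ["customer", "item"]))
report = present(evaluate(mine(baskets), min_count=2))
```

In practice the process is iterative: an unsatisfying result at step 6 usually sends the analyst back to an earlier step with different selections or thresholds.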

What makes a pattern interesting?

A pattern is interesting if it is:

1. easily understood by humans,

2. valid on new or test data with some degree of certainty,

3. potentially useful, and

4. novel.

Origins of Data Mining

Primary Data Mining Tasks

In general, data mining tasks can be classified into two categories: descriptive and predictive.

Predictive methods use some variables to predict unknown or future values of other variables.

Examples: classification, regression, deviation detection.

Descriptive methods characterize the general properties of the data in the database.

Examples: association rule discovery, clustering, sequential pattern discovery.

1- Association Rule Discovery

Given a set of records, each of which contains some number of items from a given collection, association rule discovery produces dependency rules that predict the occurrence of an item based on the occurrences of other items.

2- Sequential Pattern Discovery

Sequential pattern mining is the discovery of frequently occurring ordered events or subsequences as patterns.

An example of a sequential pattern is “Customers who buy a Canon digital camera are likely to buy an HP color printer within a month.”
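
As a rough sketch of how such a pattern could be checked against data, the code below measures what fraction of camera buyers go on to buy a printer within 30 days. The purchase histories, the function names, and the 30-day window are assumptions chosen for illustration; real sequential pattern mining algorithms search for frequent subsequences rather than testing one fixed pair.

```python
from datetime import date

# Hypothetical purchase histories: customer -> list of (date, item).
purchases = {
    "c1": [(date(2015, 1, 5), "camera"), (date(2015, 1, 20), "printer")],
    "c2": [(date(2015, 2, 1), "camera"), (date(2015, 4, 15), "printer")],
    "c3": [(date(2015, 3, 3), "camera")],
    "c4": [(date(2015, 3, 9), "printer"), (date(2015, 3, 10), "camera")],
}

def follows_within(history, first, second, max_days):
    """True if `second` is bought after `first`, at most `max_days` later."""
    first_dates = [d for d, item in history if item == first]
    second_dates = [d for d, item in history if item == second]
    return any(0 < (s - f).days <= max_days
               for f in first_dates for s in second_dates)

def sequence_confidence(purchases, first, second, max_days=30):
    """Fraction of buyers of `first` who then buy `second` within max_days."""
    buyers = [h for h in purchases.values() if any(i == first for _, i in h)]
    if not buyers:
        return 0.0
    hits = sum(follows_within(h, first, second, max_days) for h in buyers)
    return hits / len(buyers)

confidence = sequence_confidence(purchases, "camera", "printer")
```

Only customer c1 matches here: c2 buys the printer too late, c3 never buys one, and c4 buys the printer before the camera, so order matters.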

3- Classification

Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.

The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
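
A minimal sketch of this idea, using a nearest-neighbor classifier as a stand-in for the model (the slides do not name a particular method). The training data, the two features, and the labels are hypothetical.

```python
import math

# Hypothetical training data: (age, income in thousands) -> known class label.
training = [
    ((25, 30), "no"),
    ((30, 35), "no"),
    ((45, 80), "yes"),
    ((50, 90), "yes"),
]

def predict(training, point):
    """Return the label of the training object closest to `point`."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], point))
    return label

# Objects whose class label is unknown.
label_a = predict(training, (48, 85))
label_b = predict(training, (27, 32))
```

Here the "model" is simply the stored training set plus a distance measure; other classifiers, such as decision trees, summarize the training data into explicit rules instead.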

Classification Example

4- Regression

Whereas classification predicts categorical (discrete, unordered) labels, regression analysis is used to predict missing or unavailable numerical data values rather than class labels.
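
As an illustration, the sketch below fits a straight line by ordinary least squares and uses it to predict a numerical value that is not in the data. Simple linear regression is only one of many regression methods, and the data (years of experience against salary) is invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: years of experience -> salary in thousands.
years = [1, 2, 3, 4]
salary = [30, 35, 40, 45]

slope, intercept = fit_line(years, salary)
predicted = slope * 6 + intercept  # predicted salary for 6 years of experience
```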

5- Clustering

Clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels.

Clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
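
One common way to form such clusters is k-means, sketched below for one-dimensional data. The slides do not prescribe an algorithm, so k-means, the sample ages, and the starting centers are all assumptions made for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Basic k-means on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assign every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (unchanged if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer ages with no class labels attached.
ages = [21, 23, 25, 60, 62, 65]
centers, clusters = k_means_1d(ages, centers=[20, 70])
```

The two resulting groups could then be given labels such as "younger" and "older" customers, which is the sense in which clustering generates class labels.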

6- Outlier Analysis

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
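
A simple statistical sketch of outlier detection: flag values that lie more than a chosen number of standard deviations from the mean. The threshold of 2 and the sample purchase amounts are assumptions for the example, and this test is only one of several ways to define "not complying with the general behavior".

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical purchase amounts; one is far larger than the rest.
amounts = [10, 12, 11, 13, 12, 95]
outliers = find_outliers(amounts)
```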

Application 1: Market Basket Analysis

Market basket analysis examines customer buying habits by finding associations between the different items that customers place in their “shopping baskets”.

The discovery of such associations can help to develop marketing strategies by gaining insight into which items are frequently purchased together by customers.

Possible Marketing Strategies

In one strategy, items that are frequently purchased together can be placed in proximity in order to further encourage the sale of such items together.

Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.

If we think of the universe as the set of items available at the store, then each item has a Boolean variable representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of values assigned to these variables.

The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together. These patterns can be represented in the form of association rules.
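
A small sketch of this representation, assuming a hypothetical four-item store and three made-up baskets. Each basket becomes a list of Boolean values, one per item, in a fixed item order.

```python
# Hypothetical universe of items available at the store, in a fixed order.
items = ["computer", "printer", "antivirus_software", "camera"]

def to_boolean_vector(basket, items):
    """One Boolean per item: True if the item is present in the basket."""
    return [item in basket for item in items]

baskets = [
    {"computer", "antivirus_software"},
    {"printer"},
    {"computer", "printer", "antivirus_software"},
]
vectors = [to_boolean_vector(b, items) for b in baskets]

def bought_together(vectors, i, j):
    """Number of baskets in which items at positions i and j both appear."""
    return sum(1 for v in vectors if v[i] and v[j])

together = bought_together(vectors,
                           items.index("computer"),
                           items.index("antivirus_software"))
```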

For example, the information that customers who purchase computers also tend to buy antivirus software at the same time is represented in Association Rule (1) below:

computer => antivirus_software [support = 2%, confidence = 60%]   (1)

Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules.

A support of 2% for Association Rule (1) means that 2% of all the transactions under analysis show that computer and antivirus software are purchased together.

A confidence of 60% means that 60% of the customers who purchased a computer also bought antivirus software.
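
The two measures can be computed directly from a list of transactions. In the sketch below the ten transactions are invented, so the resulting numbers (30% support, 60% confidence) are not those of Rule (1); the function names are also just illustrative.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`. Assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "antivirus_software", "printer"},
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"computer"},
    {"printer"},
    {"printer"},
    {"camera"},
    {"camera", "printer"},
    {"camera"},
]

rule_support = support(transactions, {"computer", "antivirus_software"})
rule_confidence = confidence(transactions, {"computer"}, {"antivirus_software"})
```

Three of the ten transactions contain both items (support 0.3), and three of the five computer purchases include antivirus software (confidence 0.6).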

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.

Such thresholds can be set by users or domain experts.
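
A brute-force sketch of applying both thresholds, restricted to rules with a single item on each side. The transactions and threshold values are hypothetical, and real association rule miners (Apriori, for instance) avoid enumerating every candidate the way this example does.

```python
from itertools import permutations

def interesting_rules(transactions, min_support, min_confidence):
    """Return (a, b, support, confidence) for every rule a => b between
    single items that meets both minimum thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        count_a = sum(1 for t in transactions if a in t)
        count_ab = sum(1 for t in transactions if a in t and b in t)
        support = count_ab / n
        confidence = count_ab / count_a  # count_a >= 1: a came from the data
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules

# Hypothetical transactions and user-chosen thresholds.
transactions = [
    {"computer", "antivirus_software"},
    {"computer", "antivirus_software"},
    {"computer", "printer"},
    {"printer"},
]
rules = interesting_rules(transactions, min_support=0.5, min_confidence=0.6)
```

With these thresholds only the two rules linking computers and antivirus software survive; raising the minimum confidence to 0.7 would leave just antivirus_software => computer.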

Application 2: Data Mining and DNA Data Analysis

A great deal of biomedical research has focused on DNA data analysis.

Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention, and treatment.

An important focus in genome research is the study of DNA sequences since such sequences form the foundation of the genetic codes of all living organisms.

All DNA sequences comprise four basic building blocks (called nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).

These four nucleotides are combined to form long sequences or chains that resemble a twisted ladder.
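
For analysis, a DNA sequence is typically handled as a string over the four-letter alphabet A, C, G, T. The sketch below validates such a string and counts each nucleotide; the sample sequence and the function name are made up for illustration.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def nucleotide_counts(sequence):
    """Count each of the four nucleotides; reject any other symbol."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(NUCLEOTIDES)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    counts = Counter(sequence)
    return {n: counts[n] for n in NUCLEOTIDES}

# Hypothetical short sequence.
counts = nucleotide_counts("ATGCGATACG")
```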

DNA structure

Human beings were once estimated to have around 100,000 genes; current estimates are closer to 20,000 protein-coding genes.

Most diseases are not triggered by a single gene but by a combination of genes acting together.

Association analysis methods can be used to help determine the kinds of genes that are likely to co-occur in target samples.

Such analysis would facilitate the discovery of groups of genes and the study of interactions and relationships between them.
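
The same support idea used for market baskets can be applied here by treating each sample as a "basket" of genes. The sketch below counts how often pairs of genes co-occur across samples; the gene names, the samples, and the 50% threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical target samples: the set of genes observed in each one.
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneD"},
]

def frequent_gene_pairs(samples, min_support):
    """Map each gene pair to its support, keeping pairs at or above the minimum."""
    pair_counts = Counter()
    for genes in samples:
        # Sorting gives every pair a single canonical order.
        pair_counts.update(combinations(sorted(genes), 2))
    n = len(samples)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

pairs = frequent_gene_pairs(samples, min_support=0.5)
```

Pairs that pass the threshold are only candidates for further study; co-occurrence alone does not show that the genes interact.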

Thank you