Data Mining Overview

Upload: amitav-pattnaik

Post on 10-Apr-2018


Page 1: DataMining Overview

8/8/2019 DataMining Overview

http://slidepdf.com/reader/full/datamining-overview 1/52

Data Mining Overview

Content

Case Data Mining - Supervised Learning
Case Data Mining - Unsupervised Learning
Definition
Applications
Techniques
 - Supervised Learning
 - Unsupervised Learning

Content

DM as a Business Process
DM Methodology
References

Definition

Advanced methods for exploring and modeling relationships in large amounts of data (SAS)
"Process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques." (Gartner Group)

Definition (Cont)

Process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
Since the middle of the 1900s, corporate data has increased by a factor of 100,000 because of automated operations, creating enormous opportunities to improve business decision making

Applications

Data mining is useful when there is a large amount of data and something worth learning (i.e. the resulting knowledge is worth more money than it costs to discover)
Research
Process Improvement
Marketing
Customer Relationship Management (CRM)

Applications (Cont)

CRM (cont)
 - Presenting a single image of the organization
 - Keeping a single image of the customer
 - Knowing the likes and dislikes of customers
 - Anticipating their needs and exploiting them proactively
 - Recognizing their displeasure and doing something before it is too late

Popular Applications (source: kdnuggets)

Techniques

Supervised Learning (Directed Knowledge Discovery)
 - Classification (e.g. assigning customers to predefined segments; discrete classes)
 - Estimation / Regression (e.g. value of real estate; continuous)
 - Prediction: classification or estimation for the future (e.g. which customers will close their account within 6 months)
 - Time Series Analysis

Techniques (Cont)

Unsupervised Learning (Undirected Knowledge Discovery)
 - Association Rules (Affinity Grouping): which things go together
 - Sequence Discovery: association rules based on time
 - Clustering: segmenting a diverse group into a number of similar groups / clusters
 - Dimension reduction
 - Summarization / Characterization / Generalization

Overview of Techniques - 1: Classification

Logistic Regression: Predicts probability of success; gives subset selection of variables
Classification Tree: Gives a decision tree with rules of classification
Neural Network: Is very opaque but gives a higher level of accuracy in many situations
k-Nearest Neighbor: Groups cases into neighbors and assigns a class based on the majority of cases in a neighborhood
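The k-nearest-neighbor description above (majority class among neighboring cases) can be sketched as follows. This is a minimal illustration, not a production classifier: the function name `knn_classify`, the Euclidean distance measure, and the customer data are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Return the majority class among the k training cases nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    # Rank training cases by Euclidean distance to the query and keep k.
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, monthly spend) -> segment label.
customers = [
    ((25, 40.0), "budget"),
    ((27, 45.0), "budget"),
    ((30, 50.0), "budget"),
    ((45, 200.0), "premium"),
    ((50, 220.0), "premium"),
    ((48, 210.0), "premium"),
]

segment = knn_classify(customers, (47, 190.0), k=3)
```

In practice the predictors would usually be rescaled first, since a variable with a large range (here, spend) otherwise dominates the distance.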

Illustrative Applications - Classification

Target Marketing
Attrition Prediction / Churn Analysis
Fraud Detection
Credit Scoring

Predicting, for every case, which class it belongs to or its probability of success, based on the data in its predictor variables

Overview of Techniques - 2: Prediction

Multiple Linear Regression: Gives predicted values based on a regression model
Regression Tree: Gives a decision tree with rules of prediction
k-Nearest Neighbors: Groups cases into neighbors and assigns a value based on the average of the cases in a neighborhood
Neural Network
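As a small illustration of prediction from a regression model, the sketch below fits a least-squares line with a single predictor. This is a simplification of the multiple-regression case named above; the function name `fit_line` and the area/price figures are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: floor area vs. price, constructed as 10 + 2 * area.
area = [50, 60, 80, 100, 120]
price = [110, 130, 170, 210, 250]

intercept, slope = fit_line(area, price)
predicted = intercept + slope * 90   # estimate for an unseen case
```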

Illustrative Applications - Prediction

Forecasting sales
Predicting price fluctuations
Predicting profitability of business units
Predicting market value of assets
Predicting yield or consumption of critical inputs

Predicting, for every case, a value based on the data in its predictor variables

Overview of Techniques - 3: Clustering and Dimension Reduction

k-Means Clustering: For a given number of clusters (the k value), develops clusters based on minimum distance between the cluster centers and the cases in the cluster.
Hierarchical Clustering: Builds clusters through successive steps by grouping the least dissimilar cases, finally creating a single cluster. The user can choose the number of clusters corresponding to a distance measure.
Principal Components: Creates new variables, called principal components, that are uncorrelated and that explain the majority of the variability in the original data.
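The k-means description above can be sketched with the usual assign-then-update loop. This is a minimal version under stated assumptions: the starting centers are supplied by the caller (real tools typically pick them at random), Euclidean distance is used, and the points are invented.

```python
import math

def k_means(points, centers, iterations=10):
    """Alternate between assigning each point to its nearest center and
    moving each center to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical cases with two variables each, forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```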

Dimension Reduction

When there are many dimensions (predictors), say 20, 30 or 50
Or when several predictors are correlated
Develop new variables that:
 - Explain the major portion of variability in the data, and
 - Are uncorrelated
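For just two correlated predictors, the new uncorrelated variables can be found by hand from the 2x2 covariance matrix. The sketch below does that and reports how much of the total variance the first component explains. The function name `pca_2d` and the data are assumptions; real data sets with 20 to 50 predictors would need a numerical library.

```python
import math

def pca_2d(xs, ys):
    """Principal components of two variables via the 2x2 covariance matrix.

    Returns the two eigenvalues (variance along each component, largest
    first) and the share of total variance explained by the first one.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] from its characteristic equation.
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    first, second = half_trace + spread, half_trace - spread
    return first, second, first / (first + second)

# Hypothetical predictors where y is roughly twice x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
first, second, explained = pca_2d(xs, ys)
```

Because the two predictors are almost perfectly correlated, one component carries nearly all the variability, so the second could be dropped.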

Illustrative Applications - Clustering

Market segmentation
Product grouping based on customer preferences
Grouping of business units based on performance parameters
Grouping of channel partners based on performance parameters

Grouping of homogeneous cases based on the data in predefined variables

Overview of Techniques - 4: Market Basket Analysis / Affinity

Association Rules: Predict combinations of events that will occur together, based on past occurrences

Illustrative Applications - Market Basket

Cross selling
Product placement in a store
Forecasting sales

Predicting events that occur together as antecedents and consequents, with a certain level of confidence and support (number of events)
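Support and confidence for a rule "antecedent -> consequent" can be computed directly from a list of transactions. The sketch below uses the common definitions (support as a fraction of all transactions, confidence as the conditional share); the function names and baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
rule_support = support(baskets, {"bread", "butter"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"butter"})  # 3 of 4 bread baskets
```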

DM as Business Process

Identifying the business problem (and how business benefits will be measured)
 - Planning a direct marketing campaign for a new product
 - Understanding customer attrition
Mining data to transform it into actionable information
 - Who is more likely to buy the product?
 - Which customers are likely to leave? Are they worth keeping?

DM as Business Process (Cont)

Acting on the information
 - Contacting the more likely customers
 - Offering special services to valuable customers likely to leave
Measuring the results
 - Actual business benefits achieved, as defined earlier

DM Methodology

Why Methodology?
 - Avoid learning things that are not true
 - Avoid learning things that are true but not useful

Learning that is not true

Incorrect data
Data may not be relevant (the business situation has changed)
Summarization of data may have destroyed important information (Fig 3.1 pg 47)
With a small volume of data, a pattern may emerge by chance (e.g. when India does well in cricket, the Sensex goes up)
The model set may not reflect the relevant population (e.g. an "Issue of Credit" model built only on persons who were given credit; a poll conducted on the Web)

Learning that is true but not useful

Learning things that are already known: people in areas with no cell coverage do not buy cell phones
Learning things that cannot be used: product sales are related to the weather (can you change the weather?). Bad credit history may be predictive of more insurance claims, but regulators may prohibit the use of such information

DM Methodology - 11 Steps

Step 1: Translate the business problem into a DM problem
State it in specific terms (i.e. instead of "gaining insight into customer behavior", "identify customers who are unlikely to renew their subscription")
Determine the type of problem (classification, clustering, etc.)
Decide how results will be used
 - Contact high-risk / high-value customers and try to lure them with an offer
 - Forecast the customer population in future months

DM Methodology - 11 Steps (cont)

Step 2: Select appropriate data
Input variables
 - Which ones?
   Ignore input columns with only one value
   Ignore input columns with a unique value for each row (e.g. customer name)
   Choose only one column out of two that are highly correlated (e.g. Age_Difference and Age_Ratio)
 - What should it contain: examples of all possible outcomes
 - Availability
   Ideally from the DW (data warehouse), if present, but may need to be supplemented
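The first two column-selection rules above (a single value, or a unique value per row) are easy to check mechanically. The sketch below assumes rows are dictionaries keyed by column name; `screen_columns` and the sample rows are hypothetical, and the correlation rule is not covered.

```python
def screen_columns(rows):
    """Map each column that should be ignored to the reason for ignoring it."""
    ignored = {}
    for name in rows[0]:
        distinct = {row[name] for row in rows}
        if len(distinct) == 1:
            ignored[name] = "only one value"
        elif len(distinct) == len(rows):
            ignored[name] = "unique value for each row"
    return ignored

# Hypothetical customer rows.
rows = [
    {"customer_name": "A. Rao", "country": "IN", "plan": "basic"},
    {"customer_name": "B. Sen", "country": "IN", "plan": "gold"},
    {"customer_name": "C. Das", "country": "IN", "plan": "basic"},
]
ignored = screen_columns(rows)
```

The uniqueness test is meant for identifier-like columns; a continuous measurement can also be unique in every row, so flagged columns still deserve a manual look.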

DM Methodology - 11 Steps (cont)

Step 2: Select appropriate data (Cont)
Input variables (Cont)
 - How many?
   Do not eliminate at this stage; that needs to be done later on
 - How much data?
   The more the merrier
   Needs to be optimized with respect to the cost involved in processing, etc. (Rule: if doubling the size does not improve the result much, stop)
 - How much history?
   Consider seasonality. Data that is too old may not be relevant. Typically 2-3 years for CRM

DM Methodology - 11 Steps (cont)

Step 3: Get to know the data
Data type
Descriptive statistics
Validation (Why were so many customers born in 1911? Are they really that old?)

Data Type

Columns: Categorical vs Continuous
 - Categorical: takes discrete values (number of children, marital status)
 - Continuous: takes continuous values (income)
Unordered vs Ordered columns
 - Unordered: (marital status, sex)
 - Ordered: Rank (e.g. "Low", "High")
 - Ordered: Interval (e.g. temperature)
 - Ordered: True numeric (e.g. sales in Rs., weight)

Descriptive Statistics

We can get a general idea about the way data are distributed

Alcohol
Mean                       13.00
Standard Error              0.06
Median                     13.05
Mode                       13.05
Standard Deviation          0.81
Sample Variance             0.65
Kurtosis                   -0.85
Skewness                   -0.05
Range                       3.80
Minimum                    11.03
Maximum                    14.83
Sum                      2314.11
Count                        178
Largest(1)                 14.83
Smallest(1)                11.03
Confidence Level(95.0%)     0.12
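Several rows of the table are derived from others: the standard error is the standard deviation divided by the square root of the count (0.81 / sqrt(178) is about 0.06), and the range is maximum minus minimum (14.83 - 11.03 = 3.80). A sketch of the same summary with Python's standard library follows; the function name `describe` and the five sample readings are hypothetical and are not the 178-case data set.

```python
import math
import statistics

def describe(values):
    """Summary statistics similar to the table above (sample formulas)."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "mean": statistics.mean(values),
        "standard_error": sd / math.sqrt(n),
        "median": statistics.median(values),
        "standard_deviation": sd,
        "sample_variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }

# Hypothetical alcohol readings.
sample = [11.0, 12.0, 13.0, 14.0, 15.0]
summary = describe(sample)
```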

Data Visualization

We can study the data distribution using a histogram

[Figure: three histograms of alcohol content, the first titled "Histogram - All Types of Wines". X axis: Bin - Alcohol Content (11.5, 12.5, 13.5, 14.5, More). Series: Frequency and Cumulative %.]
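A frequency and cumulative-percentage table like the one behind these charts can be built as follows. The bin limits are the ones on the chart axis; the counting rule (a value goes into the first bin whose upper limit is at least the value) is an assumption about how the original chart was produced, and the readings are invented.

```python
from bisect import bisect_left

def histogram(values, upper_bounds):
    """Return (bin label, frequency, cumulative %) rows.

    A value falls in the first bin whose upper bound is >= the value;
    anything above the last bound is counted under 'More'.
    """
    counts = [0] * (len(upper_bounds) + 1)
    for v in values:
        counts[bisect_left(upper_bounds, v)] += 1
    labels = [str(b) for b in upper_bounds] + ["More"]
    running, table = 0, []
    for label, count in zip(labels, counts):
        running += count
        table.append((label, count, 100.0 * running / len(values)))
    return table

# Hypothetical alcohol readings.
readings = [11.2, 11.5, 12.1, 12.9, 13.0, 13.4, 13.6, 14.2, 14.8, 12.5]
table = histogram(readings, [11.5, 12.5, 13.5, 14.5])
```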

Data Visualization

Visual presentation of data (e.g. graphs such as bar charts, X-Y plots of two variables, scatter charts, etc.)
Correlation between data

Validation

Incorrect values: reasons
 - Transcription error
 - Laziness (forced entry for birthday: many were born on November 11, 1911!)
 - Programming error (value of the previous field gets entered in this field)
 - Old code and new code coexist
 - Collected wrongly (time zone not considered)

Validation

Incorrect values: reasons
 - Stored incorrectly (numeric instead of character type)

"My data must be clean because no human being has touched it manually" (one CEO)
Result: 50% of the data was wrong, because no human being had touched the system clocks on the computers!
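One simple validation check for forced default entries, such as the November 11, 1911 birthday, is to look for any single value that accounts for an implausibly large share of records. The function name, the 20% threshold, and the sample dates below are assumptions for illustration.

```python
from collections import Counter
from datetime import date

def suspicious_values(values, threshold=0.2):
    """Return values that account for more than threshold of all records.

    A single birth date shared by a large share of customers usually
    signals a forced default entry rather than real data.
    """
    counts = Counter(values)
    return {v: n for v, n in counts.items() if n / len(values) > threshold}

# Hypothetical birth dates; 1911-11-11 plays the data-entry default.
birth_dates = [
    date(1911, 11, 11), date(1975, 3, 2), date(1911, 11, 11),
    date(1982, 7, 19), date(1911, 11, 11), date(1990, 1, 5),
    date(1968, 12, 30), date(1911, 11, 11),
]
flagged = suspicious_values(birth_dates)
```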

DM Methodology - 11 Steps (cont)

Step 4: Create a model set
Sampling
 - Proportionate (including multiple time frames)
 - Over sampling
Partitioning
 - Training
 - Validation
 - Test
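Partitioning and oversampling can be sketched in a few lines. The 60/20/20 split, the fixed seed, and the function names are assumptions for the example; the slides do not prescribe proportions.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def oversample(rows, is_rare, factor=3):
    """Repeat rare-class rows so each appears factor times in the model set."""
    return [r for r in rows for _ in range(factor if is_rare(r) else 1)]

cases = list(range(100))  # stand-ins for customer records
training, validation_set, test_set = partition(cases)
boosted = oversample([0, 1, 0], is_rare=lambda r: r == 1)
```

Oversampling is normally applied to the rare outcome (for example, churners) so the model sees enough examples of it; scores then need adjusting back to the true proportions.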

DM Methodology - 11 Steps (cont)

Step 5: Fix problems with data
Correct errors
Missing values
Outliers

Missing Data

Reasons
 - "Missing data" might itself be important information (e.g. not providing a TN (telephone number) may mean "do not bother me by calling"). Keep a flag

Missing Data

Reasons (Cont)
 - Nature of the problem (e.g. new customers do not have 12 months of history data). Build a separate model for those
 - Sources not providing data (e.g. external vendor not able to provide certain data). Replace by another derived value / build a separate model
 - Data was never collected

Missing Data

What to do?
 - Do nothing
 - Filter rows (introduces bias)
 - Ignore column
 - Predict new value
 - Build separate model
 - Modify operational systems to collect data

Missing Data Correction

Delete record
Problems
 - Too many rows thrown out
 - Bias introduced (all persons not wanting to state "Salary" are left out)
Replace values with:
 - Mode
 - Mean (local / global)
 - Median
 - User-specified value
Will replacement create problems?
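Replacement by mean, median or mode can be sketched as below. The function also returns a flag per row so that the fact that a value was missing is not lost. The name `impute`, the use of `None` to mark missing values, and the salary figures are assumptions.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries and return (filled_values, missing_flags)."""
    present = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](present)
    flags = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags

# Hypothetical salaries with two missing entries.
salaries = [30000, None, 45000, 52000, None, 41000]
filled, was_missing = impute(salaries, "median")
```

This computes a global replacement value; a local one would apply the same function within a subgroup of similar cases.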

Outliers

Outliers are cases that contain an unusually high or low data value in a variable.
Such records unduly influence the model. If they are not a natural occurrence, they should be removed.
Treatment depends upon the algorithm chosen (decision tree: no problem; clustering: define a separate cluster; some cases: remove / replace with Max / Min)
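Replacing outliers with a maximum or minimum allowed value can be sketched as below. The slides do not say how the limits are chosen; the 1.5 x interquartile-range rule used here is a common rule of thumb and is an assumption, as are the function names and data.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Limits at k * IQR below the first and above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(values, lower, upper):
    """Replace values outside [lower, upper] with the nearest limit."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical readings with one extreme case.
readings = [12, 14, 15, 13, 16, 14, 15, 120]
lower, upper = iqr_bounds(readings)
clipped = clip_outliers(readings, lower, upper)
```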

DM Methodology - 11 Steps (cont)

Step 6: Transform data
Normalization
Transforming
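The slides do not say which normalization is meant; two widely used options are min-max scaling and z-scores, sketched below with hypothetical income figures.

```python
import statistics

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and scale by the sample standard deviation."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical incomes.
incomes = [20000, 35000, 50000, 80000]
scaled = min_max(incomes)
standardized = z_score(incomes)
```

Either form puts variables with very different units on a comparable scale, which matters for distance-based techniques such as clustering and k-nearest neighbors.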

Transformation

Derived Variables
Create derived variables that represent something in the real world (e.g. Passenger * Miles)

Transformation

Extracting information from a column / Transformation
Date: Holiday / Working Day (e.g. 26 Jan and 15 Aug are holidays)
Date: Festive Season / Normal Season
Time: Peak Hour / Off-peak Hour
Telephone Number: Landline / Mobile
Address: Single House / Multi-unit dwelling
Categorize continuous data (e.g. Income)
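Two of the extractions above, holiday versus working day and income categories, can be sketched as small functions. Only the two fixed-date holidays named on the slide are treated as holidays, and the income cut points are invented for the example.

```python
from datetime import date

# Fixed-date holidays named on the slide: 26 Jan and 15 Aug.
FIXED_HOLIDAYS = {(1, 26), (8, 15)}

def day_type(d):
    """Label a date as 'Holiday' or 'Working Day' (fixed holidays only)."""
    return "Holiday" if (d.month, d.day) in FIXED_HOLIDAYS else "Working Day"

def income_band(income, cut_points=(250000, 1000000)):
    """Categorize a continuous income using hypothetical cut points."""
    low, high = cut_points
    if income < low:
        return "Low"
    if income < high:
        return "Medium"
    return "High"

labels = [day_type(date(2018, 8, 15)), day_type(date(2018, 4, 10))]
bands = [income_band(v) for v in (100000, 500000, 2000000)]
```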

DM Methodology - 11 Steps (cont)

Step 7: Build model
Choose one or more techniques

Step 8: Assess models
Some errors are more serious than others
Confusion matrix
Lift
RMS
Ratio of intra-cluster to inter-cluster distance
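Three of these measures can be sketched in a few lines, assuming 0/1 class labels, model scores between 0 and 1, and RMS meaning root mean squared error. The function names, the 0.5 cutoff, and the data are hypothetical.

```python
import math

def confusion_matrix(actual, predicted):
    """Counts of true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(a == 1 and p == 1 for a, p in pairs),
        "fp": sum(a == 0 and p == 1 for a, p in pairs),
        "fn": sum(a == 1 and p == 0 for a, p in pairs),
        "tn": sum(a == 0 and p == 0 for a, p in pairs),
    }

def lift(actual, scores, fraction=0.2):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda pair: -pair[0])]
    top = ranked[:max(1, round(len(ranked) * fraction))]
    return (sum(top) / len(top)) / (sum(actual) / len(actual))

def rms_error(actual, predicted):
    """Root mean squared error for numeric predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical outcomes and model scores for ten customers.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.35, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]
matrix = confusion_matrix(actual, predicted)
top_lift = lift(actual, scores, fraction=0.2)
```

The confusion matrix keeps false positives and false negatives separate, which is what allows the more serious kind of error to be weighted more heavily.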

DM Methodology - 11 Steps (cont)

Step 9: Deploy model
Choose one or more techniques

Step 10: Assess results
Example:
What was the cost of the direct marketing campaign (including DM cost)?
What were the benefits?

Step 11: Begin again
Things change over time
Better way of handling

DM and KDD

KDD (Knowledge Discovery in Databases) and DM are often used interchangeably.
Some prefer to differentiate. KDD consists of:
Selection: sourcing data
Preprocessing: correcting erroneous data, handling missing data
Transformation: transforming data to more usable formats
Data Mining: applying various algorithms
Presentation / Interpretation / Evaluation of data

SEMMA Methodology (SAS)

Sample: sample from data sets; partition into training, validation and test datasets
Explore: explore the data set statistically and graphically
Modify: transform variables, impute missing values
Model: fit predictive models, e.g. regression, tree, collaborative filtering
Assess: compare models

Miscellaneous

Data Mining Issues
Human interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large datasets (some algorithms do not scale; use sampling or parallel processing)

Miscellaneous (Cont)

Data Mining Issues (Cont)
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Changing data
Integration of KDD in traditional DBMS systems
Applications

Miscellaneous (Cont)

Future
Data Mining Query Language (DMQL) based on SQL
DMQL should bring out
 - Generalized Relation: obtained by generalizing data from input data
 - Characteristic Rule: condition satisfied by almost all records in the target class
 - Discriminant Rule: condition satisfied by the target class but not by other classes
 - Classification Rule: used to classify data

References

1. Michael Berry, Gordon Linoff, "Mastering Data Mining", Wiley Publications (Ch 1, 3, 5, 6, 7)
2. Michael Berry, Gordon Linoff, "Data Mining Techniques", Wiley Publications (Ch 7 - Overview of Data Mining Techniques)
3. Margaret Dunham, "Data Mining - Introductory and Advanced Topics", Pearson Education (Ch 1, 2, 3)