
8/8/2019 DataMining Overview


Data Mining Overview



Content
Definition
Applications
Techniques
Supervised Learning
Unsupervised Learning
Case Data Mining - Supervised Learning
Case Data Mining - Unsupervised Learning

Content (Cont)
DM as a Business Process
DM Methodology
References




Definition
Advanced methods for exploring and modeling relationships in large amounts of data (SAS)
"Process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques." (Gartner Group)




Definition (Cont)
Process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules
Since the middle of the 1900s, corporate data has increased by a factor of 100,000 due to automated operations, creating enormous opportunities to improve business decision making




Applications
Data Mining is useful when there is a large amount of data and something worth learning (i.e. the resulting knowledge is worth more money than it costs to discover)

Research

Process Improvement

Marketing

Customer Relationship Management (CRM)




Applications (Cont)
CRM (cont)
- Presenting single image of organization
- Keeping single image of customer
- Knowing likes and dislikes of customers
- Anticipating their needs and exploiting them proactively
- Recognizing their displeasure and doing something before it is too late




Popular Applications (source: kdnuggets)




Techniques
Supervised Learning (Directed Knowledge Discovery)
Classification (e.g. assigning customers to predefined segments; discrete classes)
Estimation / Regression (e.g. value of real estate; continuous)
Prediction: Classification or Estimation for the future (e.g. which customers will close their account in 6 months)
Time Series Analysis




Techniques (Cont)
Unsupervised Learning (Undirected Knowledge Discovery)
Association Rules (Affinity Grouping): Which things go together
Sequence Discovery: Association Rules based on time
Clustering: Segmenting a diverse group into a number of similar groups / clusters
Dimension reduction
Summarization / Characterization / Generalization




Overview of Techniques - 1
Classification

Logistic Regression: Predicts probability of success; gives subset selection of variables
Classification Tree: Gives a decision tree with rules of classification
Neural Network: Is very opaque but gives higher level of accuracy in many situations
k-Nearest Neighbor: Groups cases into neighbors and assigns a class based on majority of cases in a neighborhood
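
As an illustration of the k-Nearest Neighbor row, here is a minimal sketch of majority-vote classification. The customer data, the two features (age, monthly spend), the segment names and the choice of Euclidean distance are all assumptions made for the example, not part of the slides.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Return the majority class among the k training cases nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort training cases by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    # Majority vote over the labels of the k nearest cases.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, monthly spend) -> segment
customers = [
    ((25, 40), "budget"), ((27, 35), "budget"), ((30, 45), "budget"),
    ((52, 210), "premium"), ((48, 190), "premium"), ((55, 230), "premium"),
]
print(knn_classify(customers, (29, 50)))
```

In practice the features would usually be normalized first, since a column with a large scale dominates the distance.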




Illustrative Applications - Classification

Target Marketing

Attrition Prediction/Churn Analysis

Fraud Detection

Credit Scoring

Predicting, for every case, which class it belongs to or its probability of success, based on the data of its predictor variables





Prediction

Multiple Linear Regression: Gives predicted values based on a regression model
Regression Tree: Gives a decision tree with rules of prediction
k-Nearest Neighbors: Groups cases into neighbors and assigns a value based on the average of cases in a neighborhood
Neural Network




Illustrative Applications - Prediction

Forecasting sales

Predicting price fluctuations

Predicting profitability of business units

Predicting market value of assets

Predicting yield or consumption of critical inputs
Predicting, for every case, a value based on the data of its predictor variables





Clustering and Dimension Reduction

k-Means Clustering: For a given number of clusters (the k value), develops clusters based on minimum distance between the cluster centers and the cases in the cluster.
Hierarchical Clustering: Builds clusters through successive steps by grouping the least dissimilar cases, finally creating a single cluster. The user can choose the number of clusters corresponding to a distance measure.
Principal Components: Creates new variables, called Principal Components, that are uncorrelated and that explain the majority of variability in the original data.
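
The k-Means description can be sketched as the usual two-step loop (assign each case to the nearest center, then move each center to the mean of its cases). This is a minimal sketch: seeding the centers with the first k points, the fixed iteration count and the sample points are assumptions for illustration only.

```python
import math

def k_means(points, k, iterations=20):
    """Alternate between assigning cases to centers and updating the centers."""
    centers = [points[i] for i in range(k)]  # assumption: seed with the first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each case to its nearest cluster center.
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cases (keep it if the cluster is empty).
        centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical two-dimensional cases forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.2), (9.2, 8.8), (8.8, 9.2)]
centers, clusters = k_means(points, k=2)
print(centers)
```

Real implementations typically try several random seedings, because the result depends on the starting centers.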




Dimension Reduction
When there are many dimensions (predictors), say 20, 30 or 50
Or when several predictors are correlated
Develop new variables that:
- Explain the major portion of variability in data, and
- Are uncorrelated




Illustrative Applications - Clustering

Market segmentation

Product grouping based on customer preferences

Grouping of business units based on performance parameters
Grouping channel partners based on performance parameters
Grouping of homogeneous cases based on predefined variables data





Market Basket Analysis / Affinity

Association Rules: Gives prediction of combinations of events that will occur together, based on past occurrences




Illustrative Applications - Market Basket
Cross selling
Product placement in a store
Forecasting sales
Predicting events that occur together as antecedents and consequents, with a certain level of confidence and support (number of events)
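
Support and confidence for a rule "antecedent -> consequent" can be computed directly from a list of baskets. The sketch below assumes support is reported as a fraction of all baskets (it can equally be kept as a raw count) and uses made-up baskets.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets containing the antecedent items.
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    # Baskets containing both antecedent and consequent items.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})
print(support, confidence)
```

Here two of the four baskets contain both items, and two of the three bread baskets also contain butter.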




DM as Business Process
Identifying the business problem (and how business benefits will be measured)
- Planning direct marketing campaign for a new product
- Understanding customer attrition
Mining data to transform data into actionable information
- Who is more likely to buy the product?
- Which customers are likely to leave? Are they worth keeping?




DM as Business Process (Cont)
Acting on the information
- Contacting more likely customers
- Offering special services to valuable customers likely to leave
Measuring the results
- Actual business benefits achieved, as defined earlier




DM Methodology
Why Methodology?

Avoid learning that is not true

Avoid learning that is true but not useful




Learning that is not true
Incorrect data
Data may not be relevant (business situation has changed)
Summarization of data may have destroyed important information (Fig 3.1 pg 47)
With a small volume of data, a pattern may emerge by chance (when India does well in cricket, Sensex goes up)
Model set may not reflect the relevant population ("Issue of Credit" model built on persons who were given credit. Poll conducted on the Web)




Learning that is true but not useful
Learning what is already known: People in areas with no cell coverage do not buy cell phones
Learning that cannot be used: Product sales are related to weather (can you change the weather?). Bad credit history may be predictive of more insurance claims, but regulators may prohibit usage of such information




DM Methodology - 11 Steps
Step 1: Translate business problem into DM problem
State in specific terms (i.e. instead of "Gaining insight into customer behavior", "Identify customers who are unlikely to renew their subscription")
Determine type of problem (Classification, Clustering, etc.)
Decide how results will be used
- Contact high risk / high value customers and try to lure them with an offer
- Forecast customer population in future months




DM Methodology - 11 Steps (cont)
Step 2: Select appropriate Data
Input variables
- Which ones?
  Ignore input columns with only one value
  Ignore input columns with a unique value for each row (e.g. customer name)
  Choose only one column out of two having high correlation (e.g. Age_Difference and Age_Ratio)
- What should it contain: Examples of all possible outcomes
- Availability
  Ideally from DW (if present) but may need to supplement
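
The three column-screening rules (single value, unique value per row, highly correlated pair) could be sketched as below. The table, the 0.95 correlation threshold, and the decision to apply the "unique per row" rule only to non-numeric columns (so that continuous columns such as income survive) are assumptions of this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def is_numeric(values):
    return all(isinstance(v, (int, float)) for v in values)

def screen_columns(table, max_corr=0.95):
    """table maps column name -> list of values; returns the names to keep."""
    n_rows = len(next(iter(table.values())))
    keep = []
    for name, values in table.items():
        distinct = len(set(values))
        if distinct == 1:
            continue  # only one value: carries no information
        if distinct == n_rows and not is_numeric(values):
            continue  # unique label per row (e.g. customer name)
        keep.append(name)
    # Of two highly correlated numeric columns, keep only the first.
    numeric = [c for c in keep if is_numeric(table[c])]
    dropped = set()
    for i, a in enumerate(numeric):
        for b in numeric[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(table[a], table[b])) > max_corr:
                dropped.add(b)
    return [c for c in keep if c not in dropped]

# Hypothetical input table.
table = {
    "country": ["IN", "IN", "IN", "IN", "IN"],
    "customer_name": ["Asha", "Ben", "Chen", "Dev", "Eli"],
    "age_difference": [1, 3, 5, 7, 9],
    "age_ratio": [1.1, 1.3, 1.5, 1.7, 1.9],
    "income": [30, 80, 45, 20, 60],
}
print(screen_columns(table))
```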




DM Methodology - 11 Steps (cont)
Step 2: Select appropriate Data (Cont)
Input variables (Cont)
- How many?
  Do not eliminate at this stage. Needs to be done later on
- How much data?
  The more the merrier
  Needs to be optimized w.r.t. cost involved in processing, etc. (Rule: If doubling size does not improve result much, stop)
- How much history?
  Seasonality? (Consider seasonality. Data that is too old may not be relevant. Typically 2 - 3 years for CRM)




DM Methodology - 11 Steps (cont)
Step 3: Get to know Data
Data Type
Descriptive statistics
Validation (Why were so many customers born in 1911? Are they really that old?)




Data Type
Columns: Categorical vs Continuous
- Categorical: Takes discrete values (# of children, Marital Status)
- Continuous: Takes continuous values (Income)
Unordered vs Ordered Columns
- Unordered: (Marital Status, Sex)
- Ordered: Rank (e.g. "Low", "High")
- Ordered: Interval (e.g. Temperature)
- Ordered: True Numeric (e.g. Sales in Rs., Weight)




Descriptive Statistics
We can get a general idea about the way data are distributed

Alcohol
Mean                        13.00
Standard Error               0.06
Median                      13.05
Mode                        13.05
Standard Deviation           0.81
Sample Variance              0.65
Kurtosis                    -0.85
Skewness                    -0.05
Range                        3.80
Minimum                     11.03
Maximum                     14.83
Sum                       2314.11
Count                         178
Largest(1)                  14.83
Smallest(1)                 11.03
Confidence Level(95.0%)      0.12
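
Most rows of such a summary can be reproduced with Python's standard `statistics` module. The sketch below uses a tiny made-up sample, not the wine data behind the table, and leaves out kurtosis, skewness and the confidence level, which the standard library does not provide directly.

```python
import math
import statistics

def describe(values):
    """Basic descriptive statistics for a list of numbers."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "mean": statistics.mean(values),
        "standard_error": sd / math.sqrt(n),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "standard_deviation": sd,
        "sample_variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }

sample = [12.0, 13.0, 13.0, 14.0]  # hypothetical readings
stats = describe(sample)
for name, value in stats.items():
    print(f"{name:20s} {value}")
```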




Data Visualization
We can study data distribution using a histogram

[Figure: three histogram panels. First panel: "Histogram - All Types of Wines"; the other two panel titles are only partly legible (apparently Type A and Type B wines). X axis: Bin - Alcohol Content (11 to 14.5 in steps of 0.5, then "More"). Y axis: Frequency, with Cumulative % on a secondary axis.]
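
The numbers behind a histogram with a cumulative-percentage line can be tabulated as follows. This is a sketch under assumptions: bins are treated as upper-edge inclusive with a final "More" bin, and the alcohol values are invented for the example.

```python
def histogram(values, bin_edges):
    """Return (label, frequency, cumulative_percent) rows.

    Each bin counts values up to and including its edge; values above the
    last edge fall into a final "More" bin.
    """
    counts = [0] * (len(bin_edges) + 1)
    for v in values:
        for i, edge in enumerate(bin_edges):
            if v <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # larger than every edge: the "More" bin
    labels = [str(e) for e in bin_edges] + ["More"]
    rows, running = [], 0
    for label, count in zip(labels, counts):
        running += count
        rows.append((label, count, 100.0 * running / len(values)))
    return rows

alcohol = [11.2, 12.4, 12.9, 13.1, 13.4, 13.6, 14.2, 14.8]  # hypothetical values
rows = histogram(alcohol, [11.5, 12.5, 13.5, 14.5])
for label, count, cumulative in rows:
    print(f"{label:>5s} {count:3d} {cumulative:6.1f}%")
```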




Data Visualization
Visual presentation of data (e.g. graphs like bar chart, X-Y plot of two variables, scatter chart, etc.)
Correlation between data




Validation
Incorrect Values: Reasons
- Transcription error
- Laziness (forced entry for birthday: many were born on November 11, 1911!!)
- Programming error (value of previous field gets entered in this field)
- Old code and new code coexist!
- Collected wrongly (time zone not considered)




Validation
Incorrect Values: Reasons
- Stored incorrectly (numeric instead of character type)
"My data must be clean because no human being has touched it manually" .. One CEO
Result: 50% of the data was wrong, because no human being touched the system clocks on the computers!




DM Methodology - 11 Steps (cont)
Step 4: Create a model set
Sampling
- Proportionate (including multiple time frames)
- Oversampling
Partitioning
- Training
- Validation
- Test
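
Partitioning a model set into training, validation and test sets can be sketched as a seeded shuffle followed by slicing. The 60/20/20 proportions and the seed are assumptions chosen for the example; the slides do not prescribe them.

```python
import random

def partition(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # whatever remains is the test set
    )

train_set, validation_set, test_set = partition(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

The training set fits the model, the validation set is used to compare and tune candidates, and the test set is held back for the final estimate.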





Step 5: Fix Problems with Data

Correct Error

Missing Values
Outliers




Missing Data
Reasons
- "Missing Data" might be important information (e.g. not providing TN may mean "do not bother me by calling"). Keep a flag




Missing Data
Reasons (Cont)
- Nature of problem (e.g. new customers do not have 12 months of history data). Build a separate model for those
- Sources not providing data (e.g. external vendor not able to provide certain data). Replace by other derived value / build separate model
- Data was never collected




Missing Data
What to do?
- Do nothing
- Filter rows (introduces bias)
- Ignore column
- Predict new value
- Build separate model
- Modify operational systems to collect data




Missing Data Correction
Delete record
Problems
- Too many rows thrown out
- Bias introduced (all persons not wanting to state "Salary" are left out)
Replace values with:
- Mode
- Mean (Local / Global)
- Median
- User specified value
Will replacement create problems?
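
Replacing missing values with the mean, median or mode, while keeping a flag that records which values were missing, might look like this. It is a sketch: missing entries are assumed to be represented as `None`, and the salary figures are made up.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries; return (filled_values, missing_flags)."""
    present = [v for v in values if v is not None]
    fill = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[strategy](present)
    # Keep a flag: the fact that a value was missing may itself be informative.
    flags = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, flags

salary = [30, None, 50, 40, None, 40]  # hypothetical column with gaps
filled, flags = impute(salary, "median")
print(filled, flags)
```

Every imputed row receives the same value, which shrinks the column's variance; that is one way replacement can create problems.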




Outliers
Outliers are cases that contain an unusually high or low data value in a variable.
Such records unduly influence the model. If they are not a natural occurrence they should be removed
Treatment depends upon the algorithm chosen (Decision tree: no problem. Clustering: define a separate cluster. Some cases: remove / replace with Max / Min)
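
One way to "replace with Max / Min" is to cap values at chosen limits. The slides do not say how the limits are picked; the sketch below assumes the common 1.5 x IQR fences, and the data is invented.

```python
import statistics

def cap_outliers(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the nearest limit."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Clamp each value into the [low, high] interval.
    return [min(max(v, low), high) for v in values]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical column with one extreme value
capped = cap_outliers(readings)
print(capped)
```

Whether to cap, remove or keep such a case still depends on whether it is a natural occurrence and on the algorithm used.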





Step 6: Transform Data

Normalization

Transforming
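
Two common forms of normalization are min-max scaling to the range 0 to 1 and z-score standardization. The slides do not name a specific method, so both are shown here as a sketch on a made-up income column.

```python
import statistics

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo

def z_score(values):
    """Centre values on the mean and express them in standard deviations."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / sd for v in values]  # assumes sd > 0

income = [20, 30, 40, 50, 60]  # hypothetical column
print(min_max(income))
print(z_score(income))
```

Both functions fail on a constant column, which matches the earlier advice to ignore columns with only one value.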




Transformation
Derived Variables
Create derived variables that represent something in the real world (e.g. Passenger * Miles)




Transformation
Extracting Information from a column / Transformation
Date: Holiday / Working Day (e.g. 26 Jan and 15 Aug are holidays)
Date: Festive Season / Normal Season
Time: Peak Hour / Off-peak Hour
Telephone Number: Landline / Mobile
Address: Single House / Multi-unit dwelling
Categorize continuous data (e.g. Income)
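
Extracting categorical inputs from a raw timestamp and a continuous column could be sketched as follows. The two holiday dates come from the slide; the peak-hour window and the income cut-offs are hypothetical, and weekends are not treated as holidays here.

```python
from datetime import datetime

HOLIDAYS = {(1, 26), (8, 15)}  # (month, day): 26 Jan and 15 Aug
PEAK_HOURS = range(9, 18)      # hypothetical peak window, 09:00 to 17:59

def derive(timestamp, income):
    """Turn a raw timestamp and income into categorical model inputs."""
    is_holiday = (timestamp.month, timestamp.day) in HOLIDAYS
    if income < 25_000:        # hypothetical cut-offs for income bands
        band = "Low"
    elif income < 75_000:
        band = "Medium"
    else:
        band = "High"
    return {
        "day_type": "Holiday" if is_holiday else "Working Day",
        "hour_type": "Peak" if timestamp.hour in PEAK_HOURS else "Off-peak",
        "income_band": band,
    }

features = derive(datetime(2019, 8, 15, 20, 30), income=40_000)
print(features)
```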





Step 7: Build Model

Choose one or more techniques

Step 8: Assess Models

Some Errors are more serious than others

Confusion Matrix

Lift

RMS

Ratio of intra-cluster to inter-cluster distance
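
Three of the assessment measures listed above can be sketched for small lists of actual and predicted values. The definitions used are the common textbook ones (lift here is the response rate among cases flagged positive divided by the overall response rate), and the labels are made up.

```python
import math

def confusion_matrix(actual, predicted):
    """Counts of true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(a == 1 and p == 1 for a, p in pairs),
        "fp": sum(a == 0 and p == 1 for a, p in pairs),
        "fn": sum(a == 1 and p == 0 for a, p in pairs),
        "tn": sum(a == 0 and p == 0 for a, p in pairs),
    }

def lift(actual, predicted):
    """Response rate among flagged cases relative to the overall response rate."""
    cm = confusion_matrix(actual, predicted)
    flagged_rate = cm["tp"] / (cm["tp"] + cm["fp"])
    overall_rate = sum(actual) / len(actual)
    return flagged_rate / overall_rate

def rms_error(actual, predicted):
    """Root mean squared error for numeric predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [1, 0, 1, 0, 0, 0, 1, 0]     # hypothetical outcomes
predicted = [1, 0, 1, 1, 0, 0, 0, 0]  # hypothetical model output
print(confusion_matrix(actual, predicted), lift(actual, predicted))
```

Because some errors are more serious than others, the four cells are often weighted by cost instead of being collapsed into a single accuracy figure.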





Step 9: Deploy Model

Choose one or more techniques

Step 10: Assess Results
Example:
What was the cost of the direct marketing campaign? (Including DM cost)
What were the benefits?

Step 11: Begin Again

Things change over time

Better way of handling




DM and KDD
KDD (Knowledge Discovery in Databases) and DM are used interchangeably.
Some prefer to differentiate. KDD consists of:
Selection: Sourcing data
Preprocessing: Correcting erroneous data, handling missing data
Transformation: Transforming data to more usable formats
Data Mining: Applying various algorithms
Presentation / Interpretation / Evaluation of data




SEMMA Methodology (SAS)
Sample: Sample from data sets; partition into Training, Validation and Test datasets
Explore: Explore data set statistically and graphically
Modify: Transform variables, impute missing values
Model: Fit predictive models, e.g. regression, tree, collaborative filtering
Assess: Compare models




Miscellaneous

Data Mining Issues

Human Interaction

Overfitting
Outliers
Interpretation of Results
Visualization of Results
Large Datasets (some algorithms do not scale. Use sampling or parallel processing)




Miscellaneous (Cont)

Data Mining Issues (Cont)

High Dimensionality

Multimedia Data
Missing Data

Irrelevant Data

Noisy data

Changing Data

Integration of KDD in traditional DBMS systems

Applications




Miscellaneous (Cont)

Future

Data Mining Query Language (DMQL) based on SQL
DMQL should bring out
- Generalized Relation: Obtained by generalizing data from input data
- Characteristic Rule: Condition satisfied by almost all records in target class
- Discriminate Rule: Condition satisfied by target class but not by other classes
- Classify Rule: Used to classify data




References

1. Michael Berry, Gordon Linoff, "Mastering Data Mining", Wiley Publications (Ch 1, 3, 5, 6, 7)
2. Michael Berry, Gordon Linoff, "Data Mining Techniques", Wiley Publications (Ch 7: Overview of Data Mining Techniques)
3. Margaret Dunham, "Data Mining: Introductory and Advanced Topics", Pearson Education (Ch 1, 2, 3)
