data mining vs. statistics pavel brusilovsky. 2 objectives 2 intro to data mining data mining vs....

Data Mining vs. Statistics

Pavel Brusilovsky

2

Objectives

2

• Intro to Data Mining

• Data Mining vs. Statistics

• Data Mining vs. Text Mining

• Applications of Data Mining

3

Data Mining

3

• Data Mining – is a cutting edge technology to analyze diverse,

multidisciplinary and multidimensional complex data

• Data mining could identify relationships in your multidimensional and heterogeneous data that cannot be identified in any other way

• Successful application of state-of-the-art data mining technology to

marketing and sales is indicative of analytic maturity and the success of a company

• Working definition of Data Mining:– Data Mining is a process of discovering previously unknown and

potentially useful hidden pattern in your data

44

What is the Taxonomy of Data Mining?• Data mining taxonomy, based on application

– Data Mining– Text Mining– Web Mining– Image Mining…

• Data mining taxonomy, based on the usage of domain knowledge:– Verification-driven data mining

• Is associated with traditional quantitative approaches that permit a decision maker to express and verify organizational and personal domain knowledge

– Discovery driven data Mining• It tied with knowledge discovery technology capable of automatically

discovering previously unknown patterns hidden in the data– Combination of both classes leads to synergy that can produce

meaningful and reliable results that may not be obtained within the framework of each class of data mining independently

• Data mining taxonomy, based on estimation paradigm:– supervised learning– unsupervised learning

55

What is the deference between “Search” and “Discover”

Source:http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt

66

Example: Amazon.com purchase suggestion

Amazon.com increased sales by 15%, using data/text mining generated purchase suggestions

77

Data Mining and Related Fields

Statistics: “The model is king” (Hand)Data Mining: “The data is king”

88

Is Data Mining extension of Statistics?

• Data Mining and Statistics: mutual fertilization with convergence

• Statistical Data Mining (Graduate course, George Mason University)

• Statistical Data Mining and Knowledge Discovery (Hardcover) by Hamparsum Bozdogan (Editor)– An overview of Bayesian and frequentist issues that arise in

multivariate statistical modeling involving data mining

• Data Mining with Stepwise Regression (Dean Foster, Wharton School)– use interactions to capture non-linearities– use Bonferroni adjustment to pick variables to include– use the sandwich estimator to get robust standard errors

99

What are Data Mining Myths?

• Myth 1: Data mining automatically discovers hidden pattern in your data

• Myth 2: Data mining is design for business analysts who are not professional in quantitative fields

• Myth 3: Data mining findings can be easily translated into decision-maker actions

• Myth 4: Data mining encompasses decision analysis/decision support technology

1010

What are logical steps of Data Mining?SEMMA methodology (SAS Enterprise Miner)• The core process of conducting data mining study includes the following

steps (SEMMA):– Sample– Explore– Modify– Model– Assess

• SEMMA is a logical organization of the functional tool set of SAS Enterprise Miner for carrying out the core tasks of data mining

• SEMMA is focused on the model development aspects of data mining

1111

CRoss-Industry Standard Process for Data Mining (CRISP-DM)

Six phases of CRISP-DM:1. Business understanding2. Data understanding3. Data preparation4. Modeling5. Evaluation6. Model deployment

SPSS Clementine

www.crips-dm.org

1212

Statistics vs. Data Mining: Concepts

Feature Statistics Data Mining

Type of Problem Well structured Unstructured / Semi-structured

Inference Role Explicit inference plays great role in any analysis

No explicit inference

Objective of the Analysis and Data Collection

First – objective formulation, and then - data collection

Data rarely collected for objective of the analysis/modeling

Size of data set Data set is small and hopefully homogeneous

Data set is large and data set is heterogeneous

Paradigm/Approach Theory-based (deductive) Synergy of theory-based and heuristic-based approaches (inductive)

Signal-to-Noise Ratio STNR > 3 0 < STNR <= 3

Type of Analysis Confirmative Explorative

Number of variables Small Large

1313

Statistics vs. Data Mining: Regression ModelingFeature Statistics Data Mining

Number of inputs Small Large

Type of inputs Interval scaled and categorical with small number of categories (percentage of categorical variables is small)

Any mixture of interval scaled, categorical, and text variables

Multicollinearity Wide range of degree of multicollinearity with intolerance to multicollinearity

Severe multicollinearity is always there, tolerance to multicollinearity

Distributional assumptions, homoscedasticity,

outliers, missing values

Intolerance to distrubitional assumption violation, homoscedasticity,

Outliers/leverage points, missing values

Tolerance to distributional assumption violation, outliers/leverage points, and missing values

Type of model Linear / Non-linear / Parametric / Non-Parametric in low dimensional X-space (intolerance to uncharacterizable non-linearities)

Non-linear and non-parametric in high dimensional X-space with tolerance to uncharacterizable non-linearities

1414

What is an unstructured problem?

Well-structured Business Problem

Unstructured Business Problem

Definition Can be described with a high degree of completeness

Cannot be described with a high degree of completeness

Can be solved with a high degree of certainty

Cannot be resolved with a high degree of certainty

Experts usually agree on the best method and best solution

Experts often disagree about the best method and best solution

Can be easily and uniquely translated into quantitative counterpart

Cannot be easily and uniquely translated into quantitative counterpart

Goal Find the best solution Find reasonable solution

Complexity Ranges from very simple to complex

Ranges from complex to very complex

1515

What are differences between Data/Text Mining and Statistics?

• Statistical analysis is designed to deal with structured data in order to solve structured problem:– Results are software and researcher independent– Inference reflects statistical hypothesis testing

• Data mining is designed to deal with structured data in order to solve unstructured business problems– Results are software and researcher dependent (absence of

implementation standards)– Inference reflects computational properties of data mining

algorithm at hand• Text mining is designed to deal with unstructured data in order to

solve unstructured problems– Results are software and researcher dependent– Inference reflects computational properties and visualization

capability of text mining algorithm at hand

1616

When data mining technology is appropriate?

• Data mining technology is appropriate if: – The business problem is unstructured– Accurate prediction is more important than the explanation– The data include the mixture of interval, nominal, ordinal, count,

and text variables, and the role and the number of non-numeric variables are essential

– Among those variables there are a lot of irrelevant and redundant attributes

– The relationship among variables could be non-linear with uncharacterizable nonlinearities

– The data are highly heterogeneous with a large percentage of outliers, leverage points, and missing values

– The sample size is relatively large

• Important marketing and sales studies/projects have the majority of these features

1717

Accurate prediction is more important than the explanation

1818

What is Breiman Uncertainty Principle?

• Breiman uncertainty principle:

Accuracy * Interpretability = Breiman’s constant

• Breiman uncertainty principle means that

The higher method’s accuracy, the lower its interpretability, and vice versa

1919

What are great Data Mining Ideas?• Injecting randomness into function estimation procedure

• Bagging (Breiman, 1996):– Apply the same unstable algorithm to different samples (with

replacement) of the original data– Different samples yield different models– The average of the predictions of these models might be better

than the predictions from any single model

• Boosting (Friedman, Hastie, and Tibshirani (1999):– Each model is based on the same original data– The first individual model is fit to the original data– For the second model, subtract the predicted value from the

original target value, and use the difference as the target value to train the second model

– For the third model, subtract weighted average of the predictions from the original target value, and use the difference as the target value to train the third model, and so on.

2020

What are the best Data Mining Conferences?

• Annual SAS Data Mining Technology Conference– The world’s largest data mining conference that balancing

theory and practice

• Annual International Conference on Knowledge Discovery and Data Mining (KDD)– Sponsored by the American Association for Artificial Intelligence

(AAII)

• Annual International Salford Systems Data Mining Conference– Focusing on solving real world challenges– Business Applications of CART, MARS, TreeNet, and Random

Forrest– Keynote speakers: Jerome Friedman (Stanford University) and Leo

Breiman (University of California, Berkeley)

2121

What are the best data mining tools?

• Salford Systems’ Tools (CART, Random Forest, MARS, TreeNet)

• SAS Enterprise Miner/Text Miner

• SPSS Clementine

• Megaputer Intelligence PolyAnalyst

2222

Reference (Data Mining)

• Randall Matignon (2007), Neural Network Modeling Using SAS Enterprise Miner , SAS® Institute Inc.

• David J. Hand, Data Mining: Statistics and More? The American Statistician, May 1998, Vol. 52 No. 2http://www.amstat.org/publications/tas/hand.pdf

• Friedman, J.H. 1997. Data Mining and Statistics. What’s connection? Proceedings of the 29th Symposium on the Interface: Computing Science and Statistics, May 1997, Houston, Texas

• Doug Wielenga (2007), Identifying and Overcoming Common Data Mining Mistakes, SAS Global Forum Paper 073-2007

• Nathan Treloar (2002), Text Mining: Tools, Techniques, and Applications http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt.

data mining vs. statistics pavel brusilovsky. 2 objectives 2 intro to data mining data mining vs....

Documents

data mining data mining

data collection data

statistics data mining

taxonomy of data mining

data mining crisp

data mining study

definition of data mining

data mining findings