statistics in data mining 1: what is data mining?


Statistics in Data Mining

1: What is Data Mining?
2: What is Statistics?
3: Data Mining Versus Statistics
4: Statistics in Data Mining by Examples
5: An Ideal Statistical Data Mining Textbook

Prepared for Information Management Association Conference by

Pinyuen Chen, National Taiwan University & Syracuse University


1: What is Data Mining?

Excerpt from http://en.wikipedia.org/wiki/Data_Mining

Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns using tools such as classification, association rule mining, clustering, etc. Data mining is a complex topic that has links with multiple core fields such as computer science, and it adds value to rich seminal computational techniques from statistics, information retrieval, machine learning and pattern recognition.


Definitions and Remarks of Data Mining:

• The nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Frawley, Piatetsky-Shapiro, and Matheus (1992)

• The science of extracting useful information from large data sets or databases. Hand, Mannila, and Smyth (2001)

• Involves sorting through large amounts of data and picking out relevant information.

• Usually used by businesses and other organizations, but is increasingly used in the sciences to extract information from the enormous data sets generated by modern experimentation.

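From 諫逐客書 (Petition Against the Expulsion of Guest Officials): "Mount Tai does not turn away any soil, and so it becomes great; the rivers and seas do not reject small streams, and so they become deep."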


2. What is Statistics?

Excerpt from http://en.wikipedia.org/wiki/Statisics

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities. Statistics are also used for making informed decisions – and misused for other reasons – in all areas of business and government.


Some Definitions and Remarks about Statistics:

Statistics is the science of collecting, organizing, and interpreting numerical facts, which we call data. The goal of statistics is to gain understanding from data. (Introduction to the Practice of Statistics, Moore and McCabe)

The data collected by the ancient Chinese, Egyptians, Babylonians, and Greeks were all statistics long before the field was officially recognized.


Statistics is a subject of amazingly many uses and surprisingly few effective practitioners. The traditional road to statistical knowledge is blocked, for most, by a formidable wall of mathematics. Our approach here avoids that wall. --- Efron and Tibshirani (1994)

Sometimes statisticians design new techniques by applying mathematical theories; other times they try to find the theoretical basis for an empirically correct method. The beauty of the field is that one seeks the unification of theoretical validity and empirical usefulness. --- Hua Tang, a Gertrude Cox Scholar and a Ph. D. student in statistics at Stanford

Signal processing requires a signal to process. --- by an instructor in the course “Principles of Modern Radar.”


Data mining requires data to mine.

Statistics requires data to analyze.

Data mining has been defined independently of statistics though “mining data” for patterns and predictions is what statistics is all about.

Some techniques that are classified under data mining, such as CART (Classification and Regression Trees) and CHAID (Chi-squared Automatic Interaction Detector), grew out of the statistical profession more than anywhere else.

The basic ideas of probability, independence and causality, and over-fitting are the foundation on which both data mining and statistics are built.

3. Data Mining versus Statistics


Classical data mining techniques such as CART, neural networks, and nearest neighbor techniques tend to be more robust both to messier real-world data and to being used by less expert users.

The time is right … Because of the use of computers, there now exist large quantities of data available to users. If there were no data, there would be no interest in mining it.

Computer hardware has dramatically upped the ante, by several orders of magnitude, in storing and processing data, which makes some of the most powerful data mining techniques feasible today.


• Unlike statistical data analysis, data mining is not based or focused on an existing model which is to be tested or whose parameters are to be optimized.

• Although the term "data mining" is usually used in relation to analysis of data, like artificial intelligence, it is an umbrella term with varied meanings in a wide range of contexts.

• In spite of this, some exploratory data work is always required in any applied statistical analysis to get a feel for the data. So sometimes the line between good statistical practice and data mining is less than clear.


• In statistical analyses where there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, wherein the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is smartly searched. With the advent of parallel computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all subsets or exhaustive regression.
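The all-subsets idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, each model is an ordinary least-squares fit with an intercept, and models are ranked by adjusted R^2 (the slides do not name a scoring criterion). The helper names `solve`, `residual_ss`, and `all_subsets` are hypothetical.

```python
from itertools import combinations
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def residual_ss(rows, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the chosen columns."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2 for d, t in zip(design, y))


def all_subsets(rows, y):
    """Score all 2^k subsets of the k explanatory variables by adjusted R^2."""
    n, k = len(rows), len(rows[0])
    mean_y = sum(y) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    scored = []
    for size in range(k + 1):
        for cols in combinations(range(k), size):
            ss_res = residual_ss(rows, y, cols)
            adj_r2 = 1.0 - (ss_res / (n - size - 1)) / (ss_tot / (n - 1))
            scored.append((adj_r2, cols))
    return sorted(scored, reverse=True)


random.seed(0)
rows = [[random.gauss(0, 1) for _ in range(4)] for _ in range(120)]
# Assumed ground truth for the demo: only variables 0 and 2 affect the outcome.
y = [3.0 * r[0] - 2.0 * r[2] + random.gauss(0, 0.5) for r in rows]

ranked = all_subsets(rows, y)
best_score, best_cols = ranked[0]
print(len(ranked), best_cols, round(best_score, 3))
```

With k = 4 the loop fits 2^4 = 16 models. Each extra variable doubles the count, which is why the exhaustive search is only practical for moderate k.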


• Cross validation (another method used in both statistics and data mining): a technique that produces an estimate of generalization error based on resampling.

• Dividing the data into two or more separate data subsets allows one subset to be used to evaluate the generalizability of the model learned from the other data subset(s).

• A data subset used to build a model is called a training set; the evaluation data subset is called the test set.

• Common variants: the holdout method, k-fold cross validation, and the leave-one-out method.
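As a rough illustration of the resampling idea in the bullets above, the sketch below estimates the mean squared error of a straight-line fit by k-fold cross validation. The data are synthetic and the function names (`fit_line`, `k_fold_mse`) are hypothetical. Leave-one-out is obtained by setting k equal to the number of records.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def k_fold_mse(xs, ys, k, seed=0):
    """Estimate generalization error (mean squared error) by k-fold cross validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint test sets covering all records
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        # Build the model on the training set only.
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Evaluate it on the held-out test set.
        errs = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k


random.seed(1)
xs = [random.uniform(0, 10) for _ in range(100)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]  # assumed noise variance 1

cv5 = k_fold_mse(xs, ys, k=5)
loo = k_fold_mse(xs, ys, k=len(xs))  # leave-one-out is k-fold with k = n
print(round(cv5, 3), round(loo, 3))
```

Both estimates should land near the noise variance used to generate the data, since that is the error a correctly specified line cannot remove.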


• A pitfall of data mining: it can lead to discovering correlations that exist due to chance rather than due to an underlying relationship.

• However, when properly done, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs trading strategies).

• Correlation analysis has been shown to be very useful in risk management.
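The chance-correlation pitfall can be seen in a small simulation. The sketch below, with assumed sizes of 40 variables and 20 observations each, generates variables that are independent by construction and then counts how many pairs still look "significantly" correlated. The cutoff |r| > 0.444 is the approximate two-sided 5% critical value for n = 20.

```python
import random
from itertools import combinations


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


random.seed(0)
n_obs, n_vars = 20, 40
# Independent noise: by construction, no variable is related to any other.
data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

pairs = list(combinations(range(n_vars), 2))
corrs = [abs(pearson(data[i], data[j])) for i, j in pairs]
flagged = sum(c > 0.444 for c in corrs)  # pairs that would pass a 5% test
print(len(pairs), flagged, round(max(corrs), 2))
```

Searching 780 pairs at the 5% level is expected to flag a few dozen of them even though no real relationship exists, which is the reason findings from a broad search need validation on fresh data.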


Bottom Line:

From an academic standpoint at least, there is little practical difference between a statistical technique and a classical data mining technique.


4. Statistics in Data Mining by Examples

Association Rule Discovery

Clustering

Classification …..


Association Rule Discovery

Tracking customers' buying habits (Market Basket Analysis).

Association rules are used to discover elements that co-occur frequently within a data set consisting of multiple independent selections of elements (such as purchasing transactions), and to discover rules, such as correlation, which relate co-occurring elements. Questions such as "if a customer purchases product A, how likely is he to purchase product B?" can be answered by association-finding algorithms. The task is to reduce a potentially huge amount of information to a small, understandable set of statistically supported statements.

Example: Stores found that when men went to buy beer on Friday nights, they often bought diapers as well.
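The question "if a customer purchases product A, how likely is he to purchase product B?" is usually answered with three statistics: support, confidence, and lift. The sketch below computes them on a tiny set of made-up baskets. The transactions, item names, and the thresholds 0.3 and 0.7 are all illustrative assumptions, not data from the slides.

```python
from itertools import permutations

# Hypothetical market baskets; the item names are illustrative only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_stats(a, b):
    """Support, confidence, and lift of the rule {a} -> {b}."""
    s_ab = support({a, b})
    confidence = s_ab / support({a})   # estimate of P(b | a)
    lift = confidence / support({b})   # > 1: a and b co-occur more often than by chance
    return s_ab, confidence, lift


items = sorted(set().union(*transactions))
rules = []
for a, b in permutations(items, 2):
    s, confidence, lift = rule_stats(a, b)
    if s >= 0.3 and confidence >= 0.7:  # illustrative thresholds
        rules.append((a, b, round(s, 2), round(confidence, 2), round(lift, 2)))

for rule in rules:
    print(rule)
```

Filtering on minimum support and confidence is what reduces the large set of candidate rules to a small set of statistically supported statements.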


Data mining has been cited as the method by which the U.S. Army unit Able Danger supposedly had identified the 9/11 attack leader, Mohamed Atta, and three other 9/11 hijackers as possible members of an al Qaeda cell operating in the U.S. more than a year before the attack. (Sources: Able Danger; Wikinews, "U.S. Army intelligence had detected 9/11 terrorists year before, says officer")

Both the CIA and its Canadian counterpart, CSIS, have put this method of interpreting data to work for them as well, although they have not said how.


Clustering (used by both statisticians and data miners)

A method by which like records are grouped together. Companies have grouped the population by demographic information into segments that they believe are useful for direct marketing and sales. Some of these clusters may relate to their business and some of them may not. Companies can use these same clusters to structure their business and marketing offers.

Some commercially available cluster tags (excerpted from the book Building Data Mining Applications for CRM by Alex Berson, Stephen Smith, and Kurt Thearling)


Clustering is the happy medium between homogeneous clusters and the fewest number of clusters.
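The trade-off between homogeneous clusters and few clusters can be made concrete by watching the within-cluster sum of squares as the number of clusters k grows. The sketch below uses plain k-means (Lloyd's algorithm), chosen here only as one common clustering method since the slides do not name an algorithm. The data are synthetic points drawn around three assumed centers, and `k_means` is a hypothetical helper.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centers[c]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    wss = sum(min(dist2(p, c) for c in centers) for p in points)
    return centers, wss


random.seed(0)
blobs = [(0, 0), (6, 0), (3, 5)]  # assumed true group centers for the demo
points = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5)) for cx, cy in blobs for _ in range(30)]

scores = {}
for k in range(1, 7):
    # Keep the best of a few random restarts, since k-means can stop at a local optimum.
    scores[k] = min(k_means(points, k, seed=s)[1] for s in range(5))
    print(k, round(scores[k], 1))
```

More clusters always make the groups more homogeneous, so the usual heuristic is to pick the k after which the within-cluster sum of squares stops dropping sharply.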

The difference between clustering and nearest neighbor prediction

Nearest Neighbor:

• Used for prediction as well as consolidation.

• Space is defined by the problem to be solved (supervised learning).

• Generally only uses distance metrics to determine nearness.

Clustering:

• Used mostly for consolidating data into a high-level view and general grouping of records into like behaviors.

• Space is defined as default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience (unsupervised learning).

• Can use other metrics besides distance to determine nearness of two records, for example linking two points together.


Classification: Definition

Given a collection of records (training set):

• Each record contains a set of attributes; one of the attributes is the class.

• Find a model for the class attribute as a function of the values of the other attributes.

• Goal: previously unseen records should be assigned a class as accurately as possible.

• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
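The train/test workflow described above might look like the following sketch. It uses a k-nearest-neighbor classifier purely as a stand-in model. The records, the attribute names (`income`, `calls`), the class labels, and the 70/30 split are all assumptions made for the demo.

```python
import random
from collections import Counter


def knn_predict(train, record, k=5):
    """Predict the class of `record` by majority vote among its k nearest training records."""
    nearest = sorted(
        train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], record))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


random.seed(0)
# Hypothetical records: two numeric attributes plus a class attribute ("buy" / "no").
records = []
for _ in range(200):
    income, calls = random.uniform(0, 10), random.uniform(0, 10)
    label = "buy" if income + calls + random.gauss(0, 1) > 10 else "no"
    records.append(((income, calls), label))

random.shuffle(records)
split = int(0.7 * len(records))
train, test = records[:split], records[split:]  # build on train, validate on test

# Accuracy is measured only on records the model has not seen.
correct = sum(knn_predict(train, attrs) == label for attrs, label in test)
accuracy = correct / len(test)
print(len(train), len(test), round(accuracy, 2))
```

Measuring accuracy on the held-out test set, rather than on the training set, is what makes the figure an estimate of performance on previously unseen records.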


Classification: Applications

1. Direct Marketing

2. Fraud Detection

3. Customer Attrition/Churn

4. A Comparative Analysis of a 4-Group And a 6-Group Job Classification


Classification: Application 1

Direct Marketing

Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.

Approach:

• Use the data for a similar product introduced before.

• We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.

• Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).

• Use this information as input attributes to learn a classifier model.



Classification: Application 2

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.

Approach:

• Use credit card transactions and the information on the account holder as attributes (when does a customer buy, what does he buy, how often does he pay on time, etc.).

• Label past transactions as fraud or fair transactions. This forms the class attribute.

• Learn a model for the class of the transactions.

• Use this model to detect fraud by observing credit card transactions on an account.



Classification: Application 3

Customer Attrition/Churn

Goal: To predict whether a customer is likely to be lost to a competitor.

Approach:

• Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc.).

• Label the customers as loyal or disloyal.

• Find a model for loyalty.


5: An Ideal Statistical Data Mining Textbook

Content:

1. Background for Probability & Statistics

2. Introduction to Data Mining and Statistical Software

3. Multivariate Statistical Analysis

4. Principal Components Analysis

5. Regression Analysis

6. Time Series Analysis

7. Classification & Discriminant Analysis

8. Clustering Analysis
