Classical Techniques: Statistics, Neighborhoods, and Clustering

Post on 21-Dec-2015



What is Statistics?

• Statistics is a branch of mathematics concerned with the collection and description of data.

• Statistics was, in fact, born from humble beginnings: real-world problems in business, biology, and gambling.

• Knowing statistics will help the average business person make better everyday decisions by allowing them to assess risk and uncertainty when all the facts either aren’t known or can’t be collected.

• Today, data mining has been defined independently of statistics.


Data, Counting and Probability

• One thing that is always true about statistics is that there is always data involved.

• Statistics can help greatly in this process by answering several important questions about the data:

– What patterns are there in my database?

– What is the chance that an event will occur?

– Which patterns are significant?

– What is the high-level summary of the data that gives me some idea of what is contained in my database?

• One of the great values of statistics is in presenting a high-level view of the database that provides some useful information without requiring every record to be understood in detail.
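Two of the questions above, the high-level summary and the chance of an event, can be illustrated with a minimal sketch. The records, the column choices, and the helper names `summarize` and `empirical_probability` are hypothetical and are not taken from the slides; the probability is estimated simply by counting.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical records: (customer_id, age, eye_color). Made up for illustration.
records = [
    (1, 62, "brown"),
    (2, 53, "green"),
    (3, 47, "blue"),
    (4, 32, "brown"),
    (5, 21, "blue"),
    (6, 27, "brown"),
]

def summarize(values):
    """Return a small high-level summary of a numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

def empirical_probability(values, event):
    """Estimate P(event) as the fraction of values equal to `event`."""
    counts = Counter(values)
    return counts[event] / len(values)

ages = [age for _, age, _ in records]
eyes = [eye for _, _, eye in records]

age_summary = summarize(ages)
p_brown = empirical_probability(eyes, "brown")

print(age_summary)
print(f"Estimated chance of brown eyes: {p_brown:.2f}")
```

The summary describes the age column without reading every record, and the counted fraction is only an estimate of the chance, based on the records that happen to be in the database.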


Histogram

• The first step in understanding statistics, then, is to understand how the data is collected into a higher-level form. One of the most notable ways of doing this is with the histogram.
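As a rough sketch of how a histogram collects records into a higher-level form, the code below counts values per fixed-width bin. The `ages` list, the bin width of 10, and the function name `histogram` are assumptions chosen for illustration, not values from the slides.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.

    A bin is identified by its lower edge and covers the half-open
    interval [edge, edge + bin_width).
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical customer ages.
ages = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]
bin_width = 10

age_hist = histogram(ages, bin_width)

# Print a simple text histogram, one row per non-empty bin.
for start, n in age_hist.items():
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * n}")
```

Ten individual ages are reduced to five bin counts, which is the kind of summary a histogram chart displays.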


Histogram (cont’d)


Figure 6-1: An example database of customers with different predictor types


Statistics for Prediction

• Regression is a powerful and commonly used tool in statistics and is discussed here.

• Linear Regression:

– The simplest form of regression is simple linear regression, which contains only one predictor and one prediction.

– The relationship between the two can be mapped in a two-dimensional space, with the records plotted by prediction value along the Y-axis and predictor value along the X-axis.
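A minimal sketch of simple linear regression with one predictor (X) and one prediction (Y) follows. It assumes the line is fitted by ordinary least squares, which the slides do not state explicitly; the data points and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical records: predictor values on the X-axis, prediction values on
# the Y-axis. The points lie exactly on y = 2x + 1 to keep the example simple.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

# Use the fitted line to predict the value for a new predictor value.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)
```

With real data the points would scatter around the line rather than fall on it, and the fitted line would be the one that minimizes the squared vertical distances to the points.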


Statistics for Prediction (cont’d)


Nearest Neighbor

• Clustering and the nearest neighbor prediction techniques are among the oldest techniques used in data mining.

• Clustering is what its name suggests: like records are grouped, or clustered, together.

• Nearest neighbor is a prediction technique that is quite similar to clustering: to predict the prediction value for one record, look for records with similar predictor values in the historical database and use the prediction value from the record that is ‘nearest’ to the unclassified record.
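The lookup described above can be sketched as follows. This is an illustration under stated assumptions: 'nearest' is measured with Euclidean distance over numeric predictors, and the historical records (age, income in thousands, and whether the customer responded) and the function name `nearest_neighbor_predict` are hypothetical.

```python
import math

def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record nearest to `query`.

    `history` is a list of (predictors, prediction) pairs, where `predictors`
    is a tuple of numbers with the same length as `query`.
    """
    _, nearest_prediction = min(
        history, key=lambda record: math.dist(record[0], query)
    )
    return nearest_prediction

# Hypothetical historical database: (age, income in thousands) -> responded?
history = [
    ((25, 30), "no"),
    ((47, 80), "yes"),
    ((33, 45), "no"),
    ((52, 95), "yes"),
]

# Unclassified records whose prediction value we want to estimate.
print(nearest_neighbor_predict(history, (45, 75)))
print(nearest_neighbor_predict(history, (28, 35)))
```

In practice the predictors would usually be put on comparable scales first, since otherwise the predictor with the largest numeric range dominates the distance.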


How to use Nearest Neighbor for Prediction

• One of the essential elements underlying the concept of clustering is that one particular object (whether it is a car, a food item, or a customer) can be closer to another object than can some third object.
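To make the idea of one object being closer to another than a third object concrete, here is a small sketch. It assumes each object is described by numeric predictors and that closeness is measured with Euclidean distance; the three customers and the helper names `distance` and `closer_of` are invented for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance between two objects given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closer_of(target, first, second):
    """Return whichever of `first` or `second` lies closer to `target`."""
    if distance(target, first) <= distance(target, second):
        return first
    return second

# Hypothetical customers described by (age, income in thousands).
amy = (30, 40)
bob = (32, 42)
carl = (60, 90)

print(distance(amy, bob))   # small: Amy and Bob are similar
print(distance(amy, carl))  # large: Amy and Carl are not
print(closer_of(amy, bob, carl))
```

Any measure that can rank pairs of objects this way could play the same role; Euclidean distance is just one common choice.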

• The nearest neighbor prediction algorithm, simply stated, is as follows:

– Objects that are ‘near’ each other will also have similar prediction values. Thus, if you know the prediction value of one of the objects, you can predict it for its nearest neighbors.