data mining: an introduction

Data Mining: An Introduction

Billy Mutell

“The Library of Babel” Analogy

Network of bookshelves with every book ever written

All the books one could possibly imagine must exist somewhere in this library

Books have titles like ‘Axaxxas mlo’, ‘The Bible’ & ‘Tomorrow's Winning Lottery Numbers’

Roughly $25^{1,312,000}$, or about $1.956 \times 10^{1,834,097}$, volumes in the library
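The size of the library can be checked with logarithms instead of building the huge integer. The sketch below assumes 25 distinct symbols and 1,312,000 character positions per book (the base and exponent of the volume count); the variable names are illustrative only.

```python
import math

SYMBOLS = 25                 # assumed alphabet size (the base of the count)
CHARS_PER_BOOK = 1_312_000   # assumed characters per book (the exponent)

# log10 of 25**1_312_000, computed without constructing the number itself
log10_volumes = CHARS_PER_BOOK * math.log10(SYMBOLS)

exponent = math.floor(log10_volumes)         # power of ten
mantissa = 10 ** (log10_volumes - exponent)  # leading digits

print(f"{mantissa:.3f} x 10^{exponent:,}")
```

The printed value should agree with the scientific-notation form of the volume count.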

May be viewed as a metaphor for information in today’s society, where the amount of data keeps growing but there is not enough information

Content

•General Information

•Approaches to searching for information

•Project and plans

• The nontrivial extraction of implicit, previously unknown, and potentially useful information from data

• The science of extracting useful information from large data sets or databases

What is Data Mining?

• As the volume of data increased, new techniques had to be created to extract information from it

How Did it Evolve to What We Have Today?

Information Retrieval

Statistics

Machine Learning Algorithms

Database Management

Data Mining

Practical Applications

Government Intelligence

Insurance

Bank Finance

Branch Evaluation

Pharmaceutical Reactions in Patients

There are two models for mining data

Predictive: Predicts unknown values based on known results from other data

Includes: Regression, Classification, Time Series Analysis

Classification: Maps data into predefined groups

Example: Identifying potential credit risks

Time Series Analysis: Examining the value of an attribute as it varies over time

Example: Choosing stocks

Descriptive: Identifies patterns or relationships in data

Includes: Clustering, Association Rules, Sequence Discovery

Clustering: Similar to Classification, but the groups are derived from the data rather than predefined

Association Rules: Identifies items that tend to occur together in the data

Example: If someone buys jelly, they’re probably buying peanut butter

Sequence Discovery: Highlights patterns in temporal sequences

Example: If someone buys a CD player, they’ll probably buy CDs within a week
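The jelly and peanut butter example can be made concrete with the standard support and confidence measures for association rules. This is a minimal sketch; the baskets are invented purely for illustration and the function names are hypothetical.

```python
# Hypothetical shopping baskets, invented for illustration only
baskets = [
    {"jelly", "peanut butter", "bread"},
    {"jelly", "peanut butter"},
    {"jelly", "milk"},
    {"bread", "milk"},
    {"peanut butter", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_conf = confidence({"jelly"}, {"peanut butter"}, baskets)
print(f"jelly => peanut butter: confidence {rule_conf:.2f}")
```

In this toy data, two of the three baskets with jelly also contain peanut butter, so the rule holds in about two thirds of the cases where it applies.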

• Statistical Based Algorithms

• Decision Tree Based Algorithms

• Rule Based Algorithms

• Distance Based Algorithms

Information Analysis

$y_i = c_0 + c_1 x_i + \epsilon_i$

Linear Regression Examples

Regression: Estimation of an output value based on input values; fits the input data to a formula that produces the output

Statistical Based Algorithms

$y = c_0 + c_1 x_1 + \dots + c_n x_n$

By determining the regression coefficients {c0, c1, …, cn}, we can estimate the relationship between the output parameter, y, and the input parameters, {x1,…, xn}
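For the simplest case of a single input, the coefficients c0 and c1 have a closed-form least-squares solution. The sketch below is one way to compute them in plain Python; the data points are hypothetical and `fit_line` is an illustrative name, not part of any library.

```python
def fit_line(xs, ys):
    """Least-squares estimates (c0, c1) for the model y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

# Hypothetical observations lying close to y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

c0, c1 = fit_line(xs, ys)
print(f"y = {c0:.2f} + {c1:.2f} x")
```

With more than one input the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.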

Dead or Alive? => Dead

Woman or Man? => Man

Mathematician or Non-Mathematician? => Mathematician

Modern or Ancient? => Ancient

Pythagoras!

Decision Tree Example: 20 Questions
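A game of 20 Questions can be sketched as a walk down a tree, with one answer chosen at each question. In the sketch below only the branch leading to Pythagoras is filled in; the other branches are left as `None`, and the data layout and the `classify` helper are assumptions made for illustration.

```python
# An internal node is (question, {answer: subtree}); a leaf is a plain string.
# Branches that the example does not explore are left as None.
TREE = ("Dead or alive?", {
    "alive": None,
    "dead": ("Woman or man?", {
        "woman": None,
        "man": ("Mathematician or non-mathematician?", {
            "non-mathematician": None,
            "mathematician": ("Modern or ancient?", {
                "modern": None,
                "ancient": "Pythagoras",
            }),
        }),
    }),
})

def classify(tree, answers):
    """Follow one answer per question from the root until a leaf is reached."""
    node = tree
    for answer in answers:
        if node is None or isinstance(node, str):
            break  # reached a leaf or an unexplored branch
        _question, branches = node
        node = branches[answer]
    return node

print(classify(TREE, ["dead", "man", "mathematician", "ancient"]))
```

Each question splits the remaining candidates, which is the same role an attribute test plays at a node of a decision tree.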

Rule Based Algorithms

Works well to perform classification through if-then analysis

Trees imply an order in which the splits are applied; rules have no order

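A rule $r = \langle a, c \rangle$ pairs an antecedent $a$ (the "if" part) with a consequent $c$ (the "then" part):

If $90 \le grade$, then class = A

If $80 \le grade$ and $grade < 90$, then class = B

If $70 \le grade$ and $grade < 80$, then class = C

If $60 \le grade$ and $grade < 70$, then class = D

If $grade < 60$, then class = F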
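Rule-based classification of letter grades can be sketched as a list of independent if-then rules, assuming cutoffs at 90, 80, 70 and 60. Because the conditions are mutually exclusive, exactly one rule fires regardless of the order in which they are checked. The function name is illustrative.

```python
def classify_grade(grade):
    """Apply five if-then rules; exactly one fires for any numeric grade."""
    rules = [
        (lambda g: 90 <= g, "A"),
        (lambda g: 80 <= g < 90, "B"),
        (lambda g: 70 <= g < 80, "C"),
        (lambda g: 60 <= g < 70, "D"),
        (lambda g: g < 60, "F"),
    ]
    # Collect every rule whose condition holds; order does not matter here
    matches = [label for condition, label in rules if condition(grade)]
    assert len(matches) == 1, "rules should be mutually exclusive and exhaustive"
    return matches[0]

for grade in (95, 85, 72, 60, 41):
    print(grade, classify_grade(grade))
```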

Parametric vs Nonparametric Models

Parametric Model: Describes the relationship between input and output through algebraic equations in which some parameters are left unspecified

Nonparametric Model: Data driven and more appropriate for mining applications

Builds the model from the input data, whereas parametric methods assume a model ahead of time

More flexible than Parametric Models and generally easier to work with

• Quest to improve customer/movie predictability through data mining and linear regression

• The winning team receives a $1,000,000 prize

• Must beat Cinematch, Netflix’s current program to predict movie preferences

• http://www.netflixprize.com/

Netflix: A Case Study

What others have done so far:

“If I have seen further, it is by standing on the shoulders of giants.”

-Isaac Newton 1676

There are currently 31,443 contestants on 25,713 teams from 167 different countries.

Important to remember that everyone is given the same incomplete data, and we have to use it to predict the rest of the data (unknown to us, known to Netflix)

Current leaders are from Budapest, Hungary, and their predictions are 8.7% better than Cinematch’s

K-Nearest Neighbor Algorithm (k-NN)

A set of $n$ pairs $(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)$ is given, where the $x_i$ take values in a metric space $X$ upon which is defined a metric $d$, and the $\theta_i$ take values in the set $\{1, 2, \ldots, M\}$ of possible classes. Each $\theta_i$ is considered to be an index of the category to which the $i$th individual belongs, and each $x_i$ is the outcome of the set of measurements made upon that individual.

A new pair $(x, \theta)$ is given, where only the measurement $x$ is observable, and it is desired to estimate $\theta$ by using the information in the set of correctly classified points. Thus, we will call

$$x'_n \in \{x_1, x_2, \ldots, x_n\}$$

the nearest neighbor of $x$ if

$$\min_{i = 1, 2, \ldots, n} d(x_i, x) = d(x'_n, x)$$

The Nearest-Neighbor classification decision method gives to $x$ the category $\theta'_n$ of its nearest neighbor $x'_n$
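The nearest-neighbor rule translates almost directly into code: pick the labelled point that minimizes the distance to the new measurement and return its class. The sketch below assumes one-dimensional measurements with absolute difference as the metric; the sample data and names are hypothetical.

```python
def nearest_neighbor(x, samples, d):
    """Return the class of the sample closest to x under the metric d.

    samples is a list of (x_i, theta_i) pairs.
    """
    _x_nn, theta_nn = min(samples, key=lambda pair: d(pair[0], x))
    return theta_nn

def abs_distance(a, b):
    """Metric on the real line, chosen here only for illustration."""
    return abs(a - b)

# Hypothetical labelled points: class 1 near the low end, class 2 near the high end
samples = [(1.0, 1), (2.0, 1), (8.0, 2), (9.0, 2)]

print(nearest_neighbor(2.5, samples, abs_distance))
print(nearest_neighbor(7.0, samples, abs_distance))
```

Any function that satisfies the metric axioms can be passed as `d`, which is why the definition is stated for a general metric space.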

K-Nearest Neighbor Algorithm (k-NN)

If k=3, we classify the dot x as a triangle ($x \in$ TRIANGLE)

If k=5, we classify the dot x as a square ($x \in$ SQUARE)

Name       Gender  Height (m)  Output
Kristina   F       1.6         Short
Jim        M       2           Tall
Maggie     F       1.9         Medium
Martha     F       1.88        Medium
Stephanie  F       1.7         Short
Bob        M       1.85        Medium
Kathy      F       1.6         Short
Dave       M       1.7         Short
Worth      M       2.2         Tall
Steven     M       2.1         Tall
Debbie     F       1.8         Medium
Todd       M       1.95        Medium
Kim        F       1.9         Medium
Amy        F       1.8         Medium
Wynette    F       1.75        Medium

Suppose we want to know what the entry <Pat, F, 1.6> would be classified as…

Set K=5 and find the K nearest neighbors by height:

<Kristina, F, 1.6> => SHORT

<Kathy, F, 1.6> => SHORT

<Stephanie, F, 1.7> => SHORT

<Dave, M, 1.7> => SHORT

<Wynette, F, 1.75> => MEDIUM

Thus KNN would classify <Pat, F, 1.6> as SHORT
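The classification of Pat can be reproduced with a short k-NN sketch over the height table. It assumes that distance is measured on height alone (the neighbor list includes Dave, which suggests gender is ignored) and that ties are broken by table order; both are assumptions, and `knn_classify` is an illustrative name.

```python
from collections import Counter

# (name, gender, height in metres, class) rows copied from the table
PEOPLE = [
    ("Kristina", "F", 1.6, "Short"), ("Jim", "M", 2.0, "Tall"),
    ("Maggie", "F", 1.9, "Medium"), ("Martha", "F", 1.88, "Medium"),
    ("Stephanie", "F", 1.7, "Short"), ("Bob", "M", 1.85, "Medium"),
    ("Kathy", "F", 1.6, "Short"), ("Dave", "M", 1.7, "Short"),
    ("Worth", "M", 2.2, "Tall"), ("Steven", "M", 2.1, "Tall"),
    ("Debbie", "F", 1.8, "Medium"), ("Todd", "M", 1.95, "Medium"),
    ("Kim", "F", 1.9, "Medium"), ("Amy", "F", 1.8, "Medium"),
    ("Wynette", "F", 1.75, "Medium"),
]

def knn_classify(height, people, k):
    """Majority class among the k rows whose height is closest to `height`."""
    # sorted() is stable, so equally distant rows keep their table order
    nearest = sorted(people, key=lambda row: abs(row[2] - height))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0], nearest

label, neighbors = knn_classify(1.6, PEOPLE, k=5)
print(label, [row[0] for row in neighbors])
```

Four of the five neighbors are Short and one is Medium, so the majority vote gives Short.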

Take data from Netflix and sift through it

Develop a function that maps non-linear data to a linear format so that clustering and regression can be applied

Map data to matrices in $\mathbb{R}^n$

Use Support Vector Machines to map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed

Create a way to interpret this data in the form of movie recommendations

Also…

Use k-NN Approach along with Latent Semantic Indexing techniques to analyze scripts and key thematic plots and look for correlations/clusters
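The idea of mapping inputs to a higher dimensional space can be illustrated without training a Support Vector Machine. In the sketch below, one-dimensional points that no single threshold can separate become separable by a straight line after the hypothetical map x -> (x, x^2). The data, the map `phi`, and the hand-picked weights are all assumptions for illustration; nothing here is learned or claimed to be a maximal-margin solution.

```python
def phi(x):
    """Hypothetical feature map from the real line into R^2."""
    return (x, x * x)

def decide(x, w=(0.0, 1.0), b=-1.25):
    """Sign of w . phi(x) + b; w and b are hand-picked, not learned."""
    z = phi(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1

# Hypothetical labels: +1 when |x| > 1, otherwise -1.
# No single threshold on x separates the two classes.
data = [(-2.0, 1), (-1.5, 1), (-0.5, -1), (0.0, -1), (0.5, -1), (1.5, 1), (2.0, 1)]

predictions = [decide(x) for x, _label in data]
print(predictions == [label for _x, label in data])
```

A Support Vector Machine would choose the hyperplane automatically, and typically works with the mapped space implicitly through a kernel function.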

What I plan to do from here:

Questions?
