Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Post on 21-Dec-2015


Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine

ICS 278: Data Mining

Lecture 2: Measurement and Data

Today’s lecture

• Questions on homework?

• Office hours tomorrow: 9:30 to 11

• Outline of today’s lecture:
  – From lecture 1: various tasks in data mining
  – Chapter 2: Measurement and Data
    • Types of measurement
    • Distance measures
    • Multidimensional scaling

• Discussion of class projects

Slides from Lecture 1……

Different Data Mining Tasks

• Exploratory Data Analysis

• Descriptive Modeling

• Predictive Modeling

• Discovering Patterns and Rules

• + others….

Exploratory Data Analysis

• Getting an overall sense of the data set
  – Computing summary statistics:
    • Number of distinct values, max, min, mean, median, variance, skewness, ...
• Visualization is widely used
  – 1d histograms
  – 2d scatter plots
  – Higher-dimensional methods
• Useful for data checking
  – E.g., finding that a variable is always integer valued or positive
  – Finding that some variables are highly skewed
• Simple exploratory analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms
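The summary statistics and data checks listed on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `summarize` and the `incomes` values are made up for the example, and variance is computed with the 1/n convention.

```python
import statistics

def summarize(values):
    """Return basic summary statistics and simple data checks for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    # Population variance (divides by n).
    var = statistics.pvariance(values, mu=mean)
    std = var ** 0.5
    # Skewness as the third standardized moment; near 0 for symmetric data.
    skew = sum((v - mean) ** 3 for v in values) / (n * std ** 3) if std > 0 else 0.0
    return {
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "variance": var,
        "skewness": skew,
        # Data checks mentioned on the slide.
        "all_integer": all(float(v).is_integer() for v in values),
        "all_positive": all(v > 0 for v in values),
    }

# Hypothetical values with one large outlier, so the variable is highly skewed.
incomes = [20, 22, 25, 27, 30, 31, 35, 400]
summary = summarize(incomes)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A large gap between mean and median, together with a large positive skewness, is the kind of signal that "looking" at the data is meant to surface before any modeling.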

Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)

Different Data Mining Tasks

• Exploratory Data Analysis

• Descriptive Modeling

• Predictive Modeling

• Discovering Patterns and Rules

• + others….

Descriptive Modeling

• Goal is to build a “descriptive” model
  – e.g., a model that could simulate the data if needed
  – models the underlying process
• Examples:
  – Density estimation:
    • estimate the joint distribution P(x1, …, xp)
  – Cluster analysis:
    • Find natural groups in the data
  – Dependency models among the p variables
    • Learning a Bayesian network for the data

Example of Descriptive Modeling

[Figure: scatter plot titled “Anemia Patients and Controls”; x-axis: Red Blood Cell Volume; y-axis: Red Blood Cell Hemoglobin Concentration; legend: Anemia Group, Control Group]

Example of Descriptive Modeling

[Figure: two scatter plots of Red Blood Cell Hemoglobin Concentration against Red Blood Cell Volume; first panel titled “Anemia Patients and Controls” with legend Anemia Group, Control Group; second panel titled “EM Iteration 25”]

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -, 128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -, 128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

[Figure: page-category sequences extracted from the logs for five users (User 1 to User 5); the sequences shown are 5115, 11111151511151, 77777777, 111333 and 3333131113332232]

Learning User Navigation Patterns from Web Logs

Clusters of Probabilistic State Machines (Cadez, Heckerman, et al., 2003)

[Figure: three clusters (Cluster 1, Cluster 2, Cluster 3), each drawn as a probabilistic state machine over the states A, B, C, E]

Motivation: capture heterogeneity of Web surfing behavior

WebCanvas algorithm and software - currently in new SQLServer

Another Example of Descriptive Modeling

• Learning Directed Graphical Models (aka Bayes Nets)
  – goal: learn directed relationships among p variables
  – techniques: directed (causal) graphs
  – challenge: distinguishing between correlation and causation
  – example: Do yellow fingers cause lung cancer?

[Figure: graph with nodes “yellow fingers”, “cancer” and “smoking”; the link from yellow fingers to cancer is marked “?”; hidden cause: smoking]

Different Data Mining Tasks

• Exploratory Data Analysis

• Descriptive Modeling

• Predictive Modeling

• Discovering Patterns and Rules

• + others….

Predictive Modeling

• Predict one variable Y given a set of other variables X
  – Here X could be a p-dimensional vector
  – Classification: Y is categorical
  – Regression: Y is real-valued

• In effect this is function approximation, learning the relationship between Y and X

• Many, many algorithms for predictive modeling in statistics and machine learning

• Often the emphasis is on predictive accuracy, less emphasis on understanding the model

Predictive Modeling: Fraud Detection

• Credit card fraud detection
  – Credit card losses in the US are over $1 billion per year
  – Roughly 1 in 50k transactions are fraudulent
• Approach
  – For each transaction estimate p(fraudulent | transaction)
  – Model is built on historical data of known fraud/non-fraud
  – High probability transactions investigated by fraud police
• Example:
  – Fair-Isaac/HNC’s fraud detection software based on neural networks, led to reported fraud decreases of 30 to 50%
  – http://www.fairisaac.com/fairisaac
• Issues
  – Significant feature engineering/preprocessing
  – false alarm rate vs missed detection: what is the tradeoff?
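The false alarm versus missed detection tradeoff is easier to see with numbers. The sketch below applies Bayes' rule using the slide's base rate of roughly 1 in 50k fraudulent transactions; the detection rate of 0.9 and the three false alarm rates are hypothetical values chosen purely for illustration, and `alert_precision` is a made-up helper name.

```python
def alert_precision(prior, detection_rate, false_alarm_rate):
    """P(fraud | alert) by Bayes' rule."""
    true_alerts = prior * detection_rate            # fraud that is flagged
    false_alerts = (1 - prior) * false_alarm_rate   # legitimate but flagged
    return true_alerts / (true_alerts + false_alerts)

prior = 1 / 50_000  # from the slide: roughly 1 in 50k transactions
precisions = {}
for far in (0.01, 0.001, 0.0001):  # hypothetical false alarm rates
    precisions[far] = alert_precision(prior, detection_rate=0.9, false_alarm_rate=far)
    print(f"false alarm rate {far}: P(fraud | alert) = {precisions[far]:.4f}")
```

Because fraud is so rare, even a small false alarm rate means that most flagged transactions are legitimate, which is why the operating point has to be chosen with investigation cost in mind.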

Predictive Modeling: Customer Scoring

• Example: a bank has a database of 1 million past customers, 10% of whom took out mortgages

• Use machine learning to rank new customers as a function of p(mortgage | customer data)

• Customer data
  – History of transactions with the bank
  – Other credit data (obtained from Experian, etc.)
  – Demographic data on the customer or where they live
• Techniques
  – Binary classification: logistic regression, decision trees, etc.
  – Many, many applications of this nature

Predictive Modeling: Telephone Call Modeling

• Background
  – AT&T has about 100 million customers
  – It logs 200 million calls per day, 40 attributes each
  – 250 million unique telephone numbers
  – Which are business and which are residential?
• Approach (Pregibon and Cortes, AT&T, 1997)
  – Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data)
  – Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6 GB RAM, terabyte disk farm)
• Status:
  – running daily at AT&T
  – HTML interface used by AT&T marketing

From C. Cortes and D. Pregibon, Giga-mining, in Proceedings of the ACM SIGKDD Conference, 1997

Different Data Mining Tasks

• Exploratory Data Analysis

• Descriptive Modeling

• Predictive Modeling

• Discovering Patterns and Rules

• + others….

Structure: Models and Patterns

• Model = abstract representation of a process
  e.g., very simple linear model structure Y = aX + b
  – a and b are parameters determined from the data
  – Y = aX + b is the model structure
  – Y = 0.9X + 0.3 is a particular model
  – “All models are wrong, some are useful” (G.E. Box)
• Pattern represents “local structure” in a data set
  – E.g., if X > x then Y > y with probability p
  – or a pattern might be a small cluster of outliers in multi-dimensional space

Pattern Discovery

• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally
• Given market basket data we might discover that
  – If customers buy wine and bread then they buy cheese with probability 0.9
  – These are known as “association rules”
• Given multivariate data on astronomical objects
  – We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
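The probability attached to an association rule such as "wine and bread, then cheese" is usually estimated as a conditional relative frequency (often called confidence), alongside the fraction of baskets containing all the items (support). The sketch below computes both on a tiny made-up basket list; `rule_stats` and the data are illustrative assumptions, not part of the lecture.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Baskets containing every antecedent item.
    n_ante = sum(1 for b in baskets if antecedent <= b)
    # Baskets containing antecedent and consequent items together.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_both / len(baskets)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Hypothetical market basket data.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "cheese"},
]
support, confidence = rule_stats(baskets, {"wine", "bread"}, {"cheese"})
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```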

Example of Pattern Discovery

ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDDDCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBBBDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBACDCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDABDDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADCCBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAABADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCDCCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBCBDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBCCDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB

Example of Pattern Discovery

• IBM “Advanced Scout” System
  – Bhandari et al. (1997)
  – Every NBA basketball game is annotated,
    • e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
    • This creates a huge untapped database of information
  – IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”
  – IBM claimed around 1998 that all NBA teams except 1 were using this software… the “other team” was Chicago.

Components of Data Mining Algorithms

• Representation:
  – Determining the nature and structure of the representation to be used
• Score function
  – quantifying and comparing how well different representations fit the data
• Search/Optimization method
  – Choosing an algorithmic process to optimize the score function; and
• Data Management
  – Deciding what principles of data management are required to implement the algorithms efficiently.

What’s in a Data Mining Algorithm?

• Task
• Representation
• Score Function
• Search/Optimization
• Data Management
• Models, Parameters

An Example: Linear Regression

• Task: Regression
• Representation: Y = Weighted linear sum of X’s
• Score Function: Least-squares
• Search/Optimization: Gaussian elimination
• Data Management: None specified
• Models, Parameters: Regression coefficients
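To make the component view concrete, here is a minimal sketch of linear regression with exactly these pieces: a weighted linear sum as the representation, least squares as the score, and Gaussian elimination on the normal equations (X^T X) w = X^T y as the search step. The function names and the toy data (generated from the particular model Y = 0.9X + 0.3) are assumptions for illustration; a production implementation would normally use a numerically more robust solver.

```python
def gaussian_solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w

def fit_least_squares(X, y):
    """Minimize the sum of squared errors by solving the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend a column of ones for the intercept
    p = len(rows[0])
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * t for r, t in zip(rows, y)) for a in range(p)]
    return gaussian_solve(XtX, Xty)

# Toy data generated from Y = 0.9 X + 0.3 without noise.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.9 * x[0] + 0.3 for x in X]
b, a = fit_least_squares(X, y)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The returned list holds the regression coefficients (intercept first), which is the "Models, Parameters" output in the component breakdown.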

An Example: Decision Trees (C4.5 or CART)

• Task: Classification
• Representation: Hierarchy of axis-parallel linear class boundaries
• Score Function: Cross-validated accuracy
• Search/Optimization: Greedy Search
• Data Management: None specified
• Models, Parameters: Decision tree classifier

An Example: Hierarchical Clustering

• Task: Clustering
• Representation: Tree of clusters
• Score Function: Various
• Search/Optimization: Greedy search
• Data Management: None specified
• Models, Parameters: Dendrogram

An Example: Association Rules

• Task: Pattern Discovery
• Representation: Rules: if A and B then C with prob p
• Score Function: No explicit score
• Search/Optimization: Systematic search
• Data Management: Multiple linear scans
• Models, Parameters: Set of Rules

Data Measurement

Measurement

[Figure: diagram linking “Real world” and “Data”, with a “Relationship in real world” corresponding to a “Relationship in data”]

Mapping domain entities to symbolic representations

Nominal or Categorical Variable

Here, numerical values just "name" the attribute uniquely. No ordering is implied, e.g., jersey numbers in basketball: a player with number 30 is not more of anything than a player with number 15, and certainly not twice whatever number 15 is.

Measurements, cont.

ordinal measurement - attributes can be rank-ordered. Distances between attributes do not have any meaning. E.g., on a survey you might code Educational Attainment as 0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college; 4=college degree; 5=post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as from 3 to 4? No. The interval between values is not interpretable in an ordinal measure.

interval measurement - distance between attributes does have meaning. E.g., when we measure temperature (in Fahrenheit), the distance from 30 to 40 is the same as the distance from 70 to 80. The interval between values is interpretable. Averages make sense, but ratios don't: 80 degrees is not twice as hot as 40 degrees.

Measurements, cont.

ratio measurement - there is an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in the past six months. Why? Because you can have zero clients and because it is meaningful to say that "...we had twice as many clients in the past six months as we did in the previous six months."

Hierarchy of Measurements

Scales

Scale                 Legal transforms                       Example
Nominal/Categorical   Any one-one mapping                    Hair color, employment
Ordinal               Any order preserving transform         Severity, preference
Interval              Multiply by constant, add a constant   Temperature, calendar time
Ratio                 Multiply by constant                   Weight, income
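The "legal transforms" column can be checked numerically: a statement is meaningful on a scale only if it survives every legal transform of that scale. The sketch below is an illustration with made-up values; it uses the Fahrenheit to Celsius conversion as an interval-scale transform, pounds to kilograms as a ratio-scale transform, and an arbitrary monotone recoding for an ordinal variable.

```python
def f_to_c(f):
    """Fahrenheit to Celsius: multiply by a constant and add a constant."""
    return (f - 32) * 5 / 9

# Interval scale: ratios of values change under the transform ...
ratio_f = 80 / 40
ratio_c = f_to_c(80) / f_to_c(40)
# ... but ratios of differences are preserved.
diff_ratio_f = (80 - 70) / (40 - 30)
diff_ratio_c = (f_to_c(80) - f_to_c(70)) / (f_to_c(40) - f_to_c(30))

# Ratio scale: multiplying by a constant preserves ratios of values.
LB_TO_KG = 0.45359237
ratio_lb = 200 / 100
ratio_kg = (200 * LB_TO_KG) / (100 * LB_TO_KG)

# Ordinal scale: any order-preserving recoding keeps the ranking.
codes = [0, 1, 2, 3, 4, 5]              # e.g., education levels
recoded = [c ** 2 + 1 for c in codes]   # arbitrary monotone recoding
same_order = (sorted(range(len(codes)), key=codes.__getitem__)
              == sorted(range(len(recoded)), key=recoded.__getitem__))

print(f"80F/40F = {ratio_f}, same temperatures in Celsius = {ratio_c:.2f}")
print(f"difference ratios: {diff_ratio_f} vs {diff_ratio_c:.2f}")
print(f"weight ratios: {ratio_lb} vs {ratio_kg:.2f}; order preserved: {same_order}")
```

The first line of output shows why "80 degrees is twice as hot as 40 degrees" is not a meaningful statement: the ratio depends on the unit chosen.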

Why is this important?

• As we will see…
  – Many models require data to be represented in a specific form
  – e.g., real-valued vectors
    • Linear regression, neural networks, support vector machines, etc.
    • These models implicitly assume interval-scale data (at least)
  – What do we do with non-real valued inputs?
    • Nominal with M values:
      – Not appropriate to “map” to 1 to M (maps to an interval scale)
      – Why? w_1 x employment_type + w_2 x city_name
      – Could use M binary “indicator” variables
        » But what if M is very large? (e.g., cluster into groups of values)
    • Ordinal?
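The M binary "indicator" variables mentioned above can be sketched as follows. The helper name `one_hot` and the example employment values are assumptions for illustration; the point is that no artificial ordering or spacing is imposed on the categories, unlike a mapping to the integers 1 to M.

```python
def one_hot(values):
    """Map a nominal variable with M distinct values to M binary indicator variables."""
    categories = sorted(set(values))            # fixed column order
    index = {c: k for k, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1                        # exactly one indicator is set
        rows.append(row)
    return categories, rows

# Hypothetical nominal variable.
employment = ["student", "employed", "retired", "employed"]
categories, encoded = one_hot(employment)
print(categories)
for value, row in zip(employment, encoded):
    print(value, row)
```

With very large M this produces very wide, sparse vectors, which is the concern the slide raises; grouping rare values before encoding is one possible mitigation.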

Mixed data

• Many real-world data sets have multiple types of variables,
  – e.g., demographic data sets for marketing
  – Nominal: employment type, ethnic group
  – Ordinal: education level
  – Interval: income, age
• Unfortunately, many data analysis algorithms are suited to only one type of data (e.g., interval)
• Exception: decision trees
  – Trees operate by subgrouping variable values at internal nodes
  – Can operate effectively on binary, nominal, ordinal, interval
  – We will see more details later…

Distance Measures

• Many data mining techniques are based on similarity or distance measures between objects.

• Two methods for computing similarity or distance:
  1. Explicit similarity measurement for each pair of objects
  2. Similarity obtained indirectly based on vector of object attributes.
• Metric: d(i,j) is a metric iff
  1. d(i,j) ≥ 0 for all i, j and d(i,j) = 0 iff i = j
  2. d(i,j) = d(j,i) for all i and j
  3. d(i,j) ≤ d(i,k) + d(k,j) for all i, j and k
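The three metric conditions (non-negativity with zero only for identical objects, symmetry, and the triangle inequality) can be tested empirically on a sample of points. The sketch below is only a sanity check under stated assumptions: passing on a finite sample does not prove a function is a metric, but a single failure disproves it. The names `is_metric`, `euclidean` and `squared_euclidean` are illustrative.

```python
from itertools import product

def is_metric(d, points, tol=1e-12):
    """Check the metric conditions for d on every pair and triple of the given points."""
    for i, j in product(points, repeat=2):
        if d(i, j) < -tol:                       # 1a. non-negativity
            return False
        if (d(i, j) <= tol) != (i == j):         # 1b. zero iff identical
            return False
        if abs(d(i, j) - d(j, i)) > tol:         # 2. symmetry
            return False
    for i, j, k in product(points, repeat=3):
        if d(i, j) > d(i, k) + d(k, j) + tol:    # 3. triangle inequality
            return False
    return True

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
print("Euclidean passes:", is_metric(euclidean, points))
print("Squared Euclidean passes:", is_metric(squared_euclidean, points))
```

Squared Euclidean distance fails on the three collinear points, since 4 > 1 + 1 violates the triangle inequality.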

Vector data and distance matrices

• Data may be available as n “vectors” each p-dimensional

• Or “data” itself may be an n x n matrix of similarities or distances

Distance

• Notation: n objects with p measurements

$$x(i) = \left( x_1(i), x_2(i), \ldots, x_p(i) \right)$$

• Most common distance metric is Euclidean distance:

$$d_E(i,j) = \left( \sum_{k=1}^{p} \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$$

• Makes sense in the case where the different measurements are commensurate; each variable measured in the same units.
• If the measurements are different, say length and weight, Euclidean distance is not necessarily meaningful.

Standardization

When variables are not commensurate, we can standardize them by dividing by the sample standard deviation. This makes them all equally important.

The estimate for the standard deviation of x_k:

$$\hat{\sigma}_k = \left( \frac{1}{n} \sum_{i=1}^{n} \left( x_k(i) - \bar{x}_k \right)^2 \right)^{1/2}$$

where $\bar{x}_k$ is the sample mean:

$$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_k(i)$$

(When might standardization *not* be such a good idea? Hint: think of extremely skewed data and outliers, e.g., Gates income)
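The effect of standardization on Euclidean distance can be sketched with a small example in which one variable is measured on a much larger numeric scale than the other. The data below (length in millimetres, weight in kilograms) are hypothetical, and the helper names are assumptions; the standard deviation uses the 1/n form given above.

```python
def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def column_stats(data):
    """Per-variable sample mean and standard deviation (1/n form)."""
    n, p = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(p)]
    stds = [(sum((row[k] - means[k]) ** 2 for row in data) / n) ** 0.5
            for k in range(p)]
    return means, stds

def standardize(data):
    """Divide every variable by its sample standard deviation."""
    _, stds = column_stats(data)
    return [[x / s for x, s in zip(row, stds)] for row in data]

# Hypothetical objects: (length in mm, weight in kg).
data = [[1000.0, 1.0], [1010.0, 5.0], [1100.0, 1.2], [1900.0, 3.0]]
z = standardize(data)

print("raw:          d(0,1) = %.2f, d(0,2) = %.2f"
      % (euclidean(data[0], data[1]), euclidean(data[0], data[2])))
print("standardized: d(0,1) = %.2f, d(0,2) = %.2f"
      % (euclidean(z[0], z[1]), euclidean(z[0], z[2])))
```

On the raw data, object 1 looks closest to object 0 because length dominates the sum; after standardization the large weight difference counts and object 2 becomes the nearer neighbor. A single extreme outlier would inflate the standard deviation of its variable and shrink all other differences on it, which is the caveat raised in the hint.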

Weighted Euclidean distance

If we have some idea of the relative importance of each variable, we can weight them:

$$d_{WE}(i,j) = \left( \sum_{k=1}^{p} w_k \left( x_k(i) - x_k(j) \right)^2 \right)^{1/2}$$

Other Distance Metrics

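• Minkowski or $L_\lambda$ metric:

$$d(i,j) = \left( \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|^{\lambda} \right)^{1/\lambda}$$

• Manhattan, city block or $L_1$ metric:

$$d(i,j) = \sum_{k=1}^{p} \left| x_k(i) - x_k(j) \right|$$

• $L_\infty$ metric:

$$d(i,j) = \max_k \left| x_k(i) - x_k(j) \right|$$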
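The Minkowski family on this slide can be sketched in a few lines: the exponent (called `lam` here, an assumed name for the metric's parameter) gives the Manhattan distance at 1, the Euclidean distance at 2, and approaches the maximum coordinate difference as it grows. The function names are illustrative.

```python
def minkowski(u, v, lam):
    """Minkowski distance with exponent lam >= 1."""
    return sum(abs(a - b) ** lam for a, b in zip(u, v)) ** (1.0 / lam)

def manhattan(u, v):
    """City block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

u, v = (0.0, 0.0), (3.0, 4.0)
for lam in (1, 2, 5, 50):
    print(f"lambda = {lam:2d}: {minkowski(u, v, lam):.4f}")
print("L1:", manhattan(u, v), " L-infinity:", chebyshev(u, v))
```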

Additive Distances

• Each variable contributes independently to the measure of distance.

• May not always be appropriate…

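[Figure: object i and object j, each described by the variables height, diameter, height2, …, height100, i.e., height(i), diameter(i), height2(i), …, height100(i) and height(j), diameter(j), height2(j), …, height100(j)]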

Dependence among Variables

• Covariance and correlation measure linear dependence
• Assume we have two variables or attributes X and Y and n objects taking on values x(1), …, x(n) and y(1), …, y(n). The sample covariance of X and Y is:

$$\mathrm{Cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)$$

• The covariance is a measure of how X and Y vary together.
  – it will be large and positive if large values of X are associated with large values of Y, and small X with small Y

Sample correlation coefficient

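• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Sample correlation coefficient

$$\rho(X,Y) = \frac{\sum_{i=1}^{n} \left( x(i) - \bar{x} \right) \left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{n} \left( x(i) - \bar{x} \right)^2 \sum_{i=1}^{n} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$$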
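A short sketch of the sample covariance (1/n form) and the sample correlation coefficient shows why the standardization matters: rescaling X changes the covariance but leaves the correlation untouched. The function names and toy values are assumptions for illustration.

```python
def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with the 1/n convention."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Sample correlation coefficient, always between -1 and +1."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]      # exact linear relationship
rescaled = [100.0 * x for x in xs]    # same variable in different units

print("Cov(X, Y)      =", covariance(xs, ys))
print("Cov(100 X, Y)  =", covariance(rescaled, ys))
print("rho(X, Y)      =", correlation(xs, ys))
print("rho(100 X, Y)  =", correlation(rescaled, ys))
```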

Sample Correlation Matrix

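[Figure: sample correlation matrix for data on characteristics of Boston suburbs, with a color scale from -1 through 0 to +1; labeled variables include business acreage, nitrous oxide, percentage of large residential lots, average # rooms and median house value]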

Mahalanobis distance

$$d_{MH}(i,j) = \left( \left( x(i) - x(j) \right)^T \Sigma^{-1} \left( x(i) - x(j) \right) \right)^{1/2}$$

(Σ is the covariance matrix of the features.)

1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features

Price:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically rather than linearly with the number of features.
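For two features the Mahalanobis distance can be written out with an explicit 2 x 2 matrix inverse, which keeps the sketch below free of any linear algebra library. The data and function names are hypothetical; with more features one would invert (or factor) the full covariance matrix, which is where the quadratic cost noted above comes from.

```python
def covariance_matrix_2d(data):
    """2 x 2 sample covariance matrix (1/n form) of a list of (x, y) pairs."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / n
    syy = sum((y - my) ** 2 for _, y in data) / n
    sxy = sum((x - mx) * (y - my) for x, y in data) / n
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_2d(u, v, cov):
    """Mahalanobis distance between 2-d points u and v for covariance matrix cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]   # inverse of a 2 x 2 matrix
    dx, dy = u[0] - v[0], u[1] - v[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return q ** 0.5

# Hypothetical, strongly correlated features.
data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
cov = covariance_matrix_2d(data)

along = mahalanobis_2d((0.0, 0.0), (1.0, 1.0), cov)    # step along the correlation
across = mahalanobis_2d((0.0, 0.0), (1.0, -1.0), cov)  # step against it
print(f"along the trend: {along:.3f}, across the trend: {across:.3f}")
```

Both steps have the same Euclidean length, but the step against the correlation is far more unusual for this data, and the Mahalanobis distance reflects that.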

What about…

[Figure: scatter plot of Y against X]

$\rho(X,Y) = ?$

Covariance and correlation capture linear dependence.

Are X and Y dependent?
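The slide's figure is not recoverable here, so the following is an assumed stand-in rather than the lecture's own example: a standard case in which Y is completely determined by X, yet the sample correlation is zero because the relationship is not linear.

```python
def correlation(xs, ys):
    """Sample correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: Y is an exact function of X, symmetric about X = 0.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

rho = correlation(xs, ys)
print("rho(X, Y) =", rho)   # zero, although Y depends entirely on X
```

A correlation of zero therefore rules out linear dependence only; it says nothing about dependence in general.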

Binary Vectors

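Counts over the variables for two binary items i and j:

         j=1    j=0
  i=1    n11    n10
  i=0    n01    n00

(e.g., n01 = number of variables where item j = 1 and item i = 0)

• matching coefficient

$$\frac{n_{11} + n_{00}}{n_{11} + n_{10} + n_{01} + n_{00}}$$

• Jaccard coefficient

$$\frac{n_{11}}{n_{11} + n_{10} + n_{01}}$$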
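Both coefficients follow directly from the four counts n11, n10, n01 and n00. The sketch below computes them for two made-up binary vectors; the function names are illustrative assumptions.

```python
def contingency(i, j):
    """Return (n11, n10, n01, n00) for two equal-length binary vectors."""
    n11 = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return n11, n10, n01, n00

def matching_coefficient(i, j):
    """Fraction of variables on which the two items agree (1-1 or 0-0)."""
    n11, n10, n01, n00 = contingency(i, j)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard_coefficient(i, j):
    """Like matching, but 0-0 agreements are ignored."""
    n11, n10, n01, _ = contingency(i, j)
    return n11 / (n11 + n10 + n01)

# Hypothetical binary vectors, e.g., items bought by two customers.
item_i = [1, 1, 0, 0, 1, 0, 0, 0]
item_j = [1, 0, 0, 0, 1, 1, 0, 0]
print("counts:", contingency(item_i, item_j))
print("matching:", matching_coefficient(item_i, item_j))
print("Jaccard:", jaccard_coefficient(item_i, item_j))
```

The Jaccard coefficient is the usual choice when shared absences carry little information, as with sparse data where most entries are 0.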

Other distance metrics

• Categorical variables
  – Number of matches divided by number of dimensions
• Distances between strings of different lengths
  – e.g., “Patrick J. Smyth” and “Padhraic Smyth”
  – Edit distance
• Distances between images and waveforms
  – Shift-invariant, scale invariant
  – i.e., d(x,y) = min_{a,b} ((ax + b) - y)
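Edit distance between strings of different lengths can be computed by dynamic programming. The sketch below implements the common Levenshtein variant (unit cost for insertion, deletion and substitution), which is an assumption: the slide does not say which edit costs are intended.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character edits turning s into t."""
    # previous[j] holds the distance between the current prefix of s and t[:j].
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # delete cs
                               current[j - 1] + 1,       # insert ct
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))
print(edit_distance("Patrick J. Smyth", "Padhraic Smyth"))
```

The table has one cell per pair of character positions, so time grows with the product of the two string lengths.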

Transforming Data

• Duality between form of the data and the model
  – Useful to bring data onto a “natural scale”
  – Some variables are very skewed, e.g., income
• Common transforms: square root, reciprocal, logarithm, raising to a power
  – Often very useful when dealing with skewed real-world data
• Logit: transforms from the interval (0, 1) to the real line

$$\mathrm{logit}(p) = \log \frac{p}{1 - p}$$
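The logit transform and its inverse can be sketched as follows, together with a logarithm applied to a skewed variable. The income values are made up for illustration, and `inverse_logit` is simply the standard logistic function.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a real number back to (0, 1)."""
    return 1 / (1 + math.exp(-z))

for p in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"logit({p}) = {logit(p):+.3f}")

# Hypothetical skewed incomes: a log transform compresses the long right tail.
incomes = [20_000, 35_000, 50_000, 120_000, 10_000_000]
log_incomes = [math.log10(v) for v in incomes]
print([round(v, 2) for v in log_incomes])
```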

Page 58: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Multidimensional Scaling (MDS)

• Say we have data in the form of an N x N matrix of dissimilarities
  – 0’s on the diagonal
  – Symmetric

• Examples
  – Perceptual dissimilarity of N objects in cognitive science experiments
  – String-edit distance between N protein sequences

• MDS:
  – Find k-dimensional coordinates for each of the N objects such that the Euclidean distances in the “embedded” space match the set of dissimilarities as closely as possible

Page 59: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Multidimensional Scaling (MDS)

• MDS criterion

$$ S = \frac{\sum_{i,j} \bigl( d(i,j) - \delta(i,j) \bigr)^2}{\sum_{i,j} d(i,j)^2} $$

  where δ(i,j) are the original dissimilarities and d(i,j) is the Euclidean distance in the “embedded” k-dim space

• Optimization: find the set of N k-dimensional positions that minimize S
  – If original dissimilarities are Euclidean
     • -> linear algebra solution (equivalent to principal components)
  – Non-Euclidean (more typical)
     • Local iterative hill-climbing, e.g., move each point to decrease S, repeat
     • Complexity is O(N^2 k) per iteration (iteration = move all points)
  – See Faloutsos and Lin (1995) for FastMap: O(Nk) approximation for large N

• Often used for visualization, e.g., k=2, 3
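The hill-climbing idea can be sketched as follows. This is a deliberately naive illustration, not the procedure from the lecture: it assumes the criterion S is the sum of squared differences between embedded distances and original dissimilarities, normalized by the sum of squared embedded distances, and it uses a simple coordinate-wise trial-move scheme. The names `stress`, `mds_hill_climb`, and their parameters are hypothetical.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def stress(coords, delta):
    """S = sum (d - delta)^2 / sum d^2 over pairs i < j,
    where d is the distance in the embedded space (assumed form)."""
    num = den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(coords[i], coords[j])
            num += (d - delta[i][j]) ** 2
            den += d ** 2
    return num / den if den > 0 else float("inf")

def mds_hill_climb(delta, k=2, sweeps=200, step=1.0, seed=0):
    """Move one coordinate of one point at a time; keep a move only
    if it lowers S. Halve the step when a full sweep finds nothing."""
    rng = random.Random(seed)
    n = len(delta)
    coords = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
    best = stress(coords, delta)
    for _ in range(sweeps):
        improved = False
        for i in range(n):
            for axis in range(k):
                for move in (step, -step):
                    old = coords[i][axis]
                    coords[i][axis] = old + move
                    s = stress(coords, delta)
                    if s < best:
                        best, improved = s, True
                    else:
                        coords[i][axis] = old  # undo the trial move
        if not improved:
            step /= 2.0
    return coords, best

# Hypothetical input: dissimilarities that really are 2-d Euclidean.
points = [[0, 0], [3, 0], [3, 4], [0, 4]]
delta = [[euclidean(p, q) for q in points] for p in points]
coords, final_stress = mds_hill_climb(delta, k=2)
print(final_stress)
```

Because this sketch recomputes the full criterion for every trial move, a sweep costs more than the O(N^2 k) quoted above; updating only the terms that involve the moved point would bring the cost down.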

Page 60: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

MDS example: input distance data

         Chicago  Raleigh  Boston  Seattle  S.F.  Austin  Orlando
Chicago        0
Raleigh      641        0
Boston       851      608       0
Seattle     1733     2363    2488        0
S.F.        1855     2406    2696      684     0
Austin       972     1167    1691     1764  1495       0
Orlando      994      520    1105     2565  2458    1015        0
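Since only the lower triangle is listed, an MDS routine that expects a full N x N matrix needs the table mirrored across the diagonal first. A minimal sketch of that step, using the values from the table; the helper name `to_full_matrix` is hypothetical.

```python
cities = ["Chicago", "Raleigh", "Boston", "Seattle", "S.F.", "Austin", "Orlando"]

# Lower triangle of the distance table, one row per city.
lower = [
    [0],
    [641, 0],
    [851, 608, 0],
    [1733, 2363, 2488, 0],
    [1855, 2406, 2696, 684, 0],
    [972, 1167, 1691, 1764, 1495, 0],
    [994, 520, 1105, 2565, 2458, 1015, 0],
]

def to_full_matrix(lower_rows):
    """Mirror a lower-triangular table into a symmetric N x N matrix."""
    n = len(lower_rows)
    full = [[0] * n for _ in range(n)]
    for i, row in enumerate(lower_rows):
        if len(row) != i + 1:
            raise ValueError("row %d should have %d entries" % (i, i + 1))
        for j, value in enumerate(row):
            full[i][j] = value
            full[j][i] = value
    return full

D = to_full_matrix(lower)

# The two properties MDS input is expected to have.
n = len(D)
assert all(D[i][i] == 0 for i in range(n))                         # zero diagonal
assert all(D[i][j] == D[j][i] for i in range(n) for j in range(n))  # symmetric
print(D[cities.index("Chicago")][cities.index("Seattle")])  # 1733
```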

Page 61: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Page 62: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Result of MDS

Page 63: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

MDS: Example data

Page 64: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

MDS: 2d embedding of face images

Page 65: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Page 66: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Data Quality

• Individual measurements
  – Random noise in individual measurements
     • Variance (precision)
     • Bias
     • Random data entry errors
     • Noise in label assignment (e.g., class labels in medical data sets)
  – Systematic errors
     • E.g., all ages > 99 recorded as 99
     • More individuals aged 20, 30, 40, etc. than expected
  – Missing information
     • Missing at random
        – Questions on a questionnaire that people randomly forget to fill in
     • Missing systematically
        – Questions that people don’t want to answer
        – Patients who are too ill for a certain test
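Some of these problems can be screened for with very simple summaries. The sketch below is one possible set of checks, not a method from the lecture: it looks at the share of ages ending in each digit (to spot rounding to 20, 30, 40, ...), the share of values sitting exactly at a cap such as 99, and the fraction of missing entries. All function names and the toy `ages` list are hypothetical.

```python
from collections import Counter

def last_digit_shares(ages):
    """Share of observed ages ending in each digit 0-9.
    Without rounding, each share should be roughly 0.1."""
    counts = Counter(age % 10 for age in ages if age is not None)
    total = sum(counts.values())
    return {digit: counts.get(digit, 0) / total for digit in range(10)}

def share_at_cap(ages, cap=99):
    """Share of observed ages recorded exactly at the cap value."""
    observed = [a for a in ages if a is not None]
    return sum(a == cap for a in observed) / len(observed)

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

# Hypothetical toy data; None marks a missing value.
ages = [20, 30, 30, 41, 40, None, 99, 99, 50, 37, None, 60]

print(last_digit_shares(ages)[0])  # share of ages ending in 0
print(share_at_cap(ages))
print(missing_rate(ages))
```

These summaries only flag suspicious patterns. Whether values are missing at random or systematically cannot be read off the missing rate alone; it requires comparing the records with and without the value.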

Page 67: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Data Quality

• Collections of measurements
  – Ideal case = random sample from population of interest
  – Real case = often a biased sample of some sort
  – Key point: patterns or models built on the training data may only be valid on future data that comes from the same distribution

• Examples of non-randomly sampled data
  – Medical study where subjects are all students
  – Geographic dependencies
  – Temporal dependencies
  – Stratified samples
     • E.g., 50% healthy, 50% ill
  – Hidden systematic effects
     • E.g., market basket data the weekend of a large sale in the store
     • E.g., Web log data during finals week

Page 68: Data Mining Lectures Lecture 2: Data Measurement Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 2: Measurement and Data

Next Lecture

• Chapter 3

– Exploratory data analysis and visualization