Upload: aubrey-boone

Post on 05-Jan-2016

Page 1: CSC323 – Week 3 Outline  Quiz  Associations between two variables Scatter plots Correlation coefficient  Linear regression analysis

CSC323 – Week 3

Outline

Quiz

Associations between two variables

• Scatter plots

• Correlation coefficient

Linear regression analysis

Association between two variables

Example: University fees for the Big Ten universities

Data were collected to study the association between the percentage of students from out of state and the tuition paid by nonresident students (in thousands of dollars).

Does tuition increase with the percentage of nonresident students?

University        Tuition ($1,000) (Y)    Nonresidents (%) (X)
Northwestern      16.4                    72
Illinois           7.6                     8
Minnesota          8.7                    23
Ohio State         9.3                     9
Penn State        10.7                    18
Purdue             9.6                    27
Indiana           10.2                    29
Iowa               8.6                    31
Wisconsin          9.1                    35
Michigan          15.9                    30
Michigan State    10.5                     9

Example: Size of diamond and price of ring

The data come from a full-page advertisement placed in the Straits Times newspaper issue of February 29, 1992, by a Singapore-based retailer of diamond jewelry. The variables are the size of the diamond in carats (1 carat = 0.2 gram) and the price of ladies’ rings (single diamond stone) in Singapore dollars.

Carats    Singapore dollars
.17       355
.16       328
.17       350
.18       325
.25       642
…         …

How would you describe the association between the two variables?

Association between variables

Data are collected for the two variables on each individual/unit.

Two variables are associated if changes in one variable correspond to changes in the second variable.

If there is a strong association, knowing one variable helps predict the other.

Example: number of programs running and CPU usage.

If the association is weak, information about one variable is not very useful in studying the other.

Example: number of users and CPU usage.

Useful terminology

The following terms are often used:

Response variable (dependent variable): measures the outcome of the study.

Explanatory variable (independent variable): explains or causes changes in the response variable.

Can you identify this distinction in the examples shown earlier?

1) Tuition = response variable; Nonresidents = explanatory variable

2) Carat = explanatory variable; Price = response variable

Scatter plots: displaying data about two variables

Scatter plots show the relationship between two quantitative variables. The independent variable appears on the x-axis (horizontal axis) and the dependent variable appears on the y-axis (vertical axis). Each observation is represented by a point in the plot.

[Figure: scatter plot of Tuition and Nonresident students; the points for NWU and UMich are labeled.]

Interpreting scatter plots

1. Look for the overall pattern and for striking deviations

2. Define form, direction and strength of the relationship:

a. Form: roughly linear if the points follow a straight line, or nonlinear…

b. Direction: positive or negative?

c. Strength: how closely the points follow a clear form

3. Check for the presence of outliers, individual values that fall outside the overall pattern

4. Two variables are positively (negatively) associated if an increase in one variable corresponds to an increase (decrease) in the other variable.

2000 Presidential Elections

Did the butterfly ballots confuse voters? Did voters for Al Gore instead cast their votes for other candidates?

Bush spokesman Ari Fleischer stated on Nov. 9 that "Palm Beach County is a Pat Buchanan stronghold and that's why Pat Buchanan received 3,407 votes there."

What is the level of support that Pat Buchanan enjoys in Palm Beach County? The published election results show the association between the vote totals for Pat Buchanan and the total population for Florida counties.

Is the association positive or negative? Is the form of the relationship almost linear?

Example: House data in Albuquerque (NM) in 1993

[Figure: scatter plot of Selling price ($100) and Annual taxes ($).]

Interpret the graph: form, direction & strength of the relationship

Another example: The statistics of poverty and inequality

Data from U.N.E.S.C.O. 1990 Demographic Year Book. For 97 countries in the world, data are given for birth rates and for an index of the Gross National Product.

The previous plot shows a non-linear association! Sometimes we can make it linear by applying a transformation to the variables. Possible transformations are, for example, “ln”, “exp”, “sqrt”. Here we consider ln(GNP), the natural log of GNP.

[Figure: scatter plot of Birth rate (1,000 pop) and Log G.N.P.]
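The effect of a log transformation can be illustrated with a small Python sketch. The numbers are made up for illustration (they are not the U.N.E.S.C.O. data): the birth rate is constructed to fall linearly in ln(GNP), and `slopes` is a hypothetical helper that measures the slope between consecutive points.

```python
import math

# Hypothetical data, NOT the U.N.E.S.C.O. values:
# birth rate is built to decrease linearly in ln(GNP).
gnp = [100, 300, 1000, 3000, 10000, 30000]
birth_rate = [60 - 5 * math.log(g) for g in gnp]


def slopes(xs, ys):
    """Slope of the segment joining each pair of consecutive points."""
    return [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(len(xs) - 1)]


# On the original scale the slope changes from segment to segment: a curved pattern.
raw_slopes = slopes(gnp, birth_rate)

# After taking ln(GNP) every segment has the same slope: a straight line.
log_gnp = [math.log(g) for g in gnp]
log_slopes = slopes(log_gnp, birth_rate)

print([round(s, 5) for s in raw_slopes])
print([round(s, 5) for s in log_slopes])
```

A constant slope between consecutive points is what "linear" means here; in real data the transformed points would only scatter around a line rather than fall exactly on it.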

Measure of Linear Association

If there is a strong linear association between the variables, then the cloud of points on the scatter plot will be close to a line.

[Figure: scatter plot of Birth rate (1,000 pop) and Log G.N.P.]

The Correlation Coefficient r

The correlation coefficient r measures the direction and the strength of the linear relationship between two variables.

• It is a value between –1 and 1

• The closer r is to 1 or –1, the stronger the linear association is.

• Positive values of r imply a positive association, negative values imply a negative association

• Values of r close to 0 imply weak linear association.

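It is defined as

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where X has average $\bar{x}$ and standard deviation $s_x$, and Y has average $\bar{y}$ and standard deviation $s_y$.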
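As a rough sketch, the definition of r can be turned directly into Python: standardize each x and y value, multiply the pairs, sum, and divide by n – 1. The function name `correlation` is an illustrative choice; the data are the Big Ten tuition figures from the earlier slide.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation coefficient r: sum of products of standardized values over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Big Ten data: X = nonresidents (%), Y = tuition ($1,000)
nonresidents = [72, 8, 23, 9, 18, 27, 29, 31, 35, 30, 9]
tuition = [16.4, 7.6, 8.7, 9.3, 10.7, 9.6, 10.2, 8.6, 9.1, 15.9, 10.5]

r = correlation(nonresidents, tuition)
print(round(r, 2))
```

The result is positive, which agrees with the upward pattern in the tuition scatter plot.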

Examples of correlation

[Figure: Birth rate (1,000 pop) and Log G.N.P.; r = -0.74, negative association.]

[Figure: Selling price ($100) and Annual taxes ($); r = 0.65, positive association.]

Diamond rings data

N = 48               Average    s.d.      Min     Max
X  Carat             0.20       0.056     0.12    0.35
Y  Price in US $     865.144    213.64    385     1879

[Figure: scatter plot "Diamond carats vs Price in US$", with Carat and Price in US dollars on the axes.]

Strong positive association

r = 0.989

Positive Correlation

In each plot there are 100 points. The correlation coefficient measures the amount of clustering around a line.

If r is close to 1, then points lie close to a straight line!!

Negative Correlation

Negative correlation: as x increases, y tends to decrease.

If r is close to –1, then points lie close to a straight line!!

Guess the correlation

Match the diagrams with the following correlations: –0.93, –0.75, –0.20, 0.27, 0.63, 1.0

Change of scale

These are the low and high temperatures in Boulder (CO) for the month of April 1996. The first scatter plot uses degrees Fahrenheit and the second plot uses degrees centigrade. Notice that °C = 5/9 × (°F – 32).

Are the correlations between low and high temperatures in the two graphs different?

r = 0.74 (first plot)    r = ? (second plot)

Different correlations?

In which diagram below is the correlation coefficient the largest? The smallest?

Outliers and nonlinear association

How are the data sets different?

Plot the data: the nature of the association between x and y is very different. The correlation coefficient can be misleading in the presence of outliers or non-linear association. Check the scatter plot of the data.

Perfect association! Why is r not equal to 1?

Outliers change the value of r. What would the value of r be without the outliers?

r = 0.82
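Both effects can be reproduced with a few made-up points. The sketch below is only an illustration under stated assumptions: the data are hypothetical (they are not the data sets plotted on the slide), and `correlation` is an illustrative helper that computes r from sums of deviations.

```python
import math
from statistics import mean


def correlation(xs, ys):
    """Correlation coefficient r computed from sums of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]

# Perfect but non-linear association: y is determined exactly by x (y = x^2),
# yet there is no linear trend, so r is 0 rather than 1.
parabola = [x ** 2 for x in xs]
r_parabola = correlation(xs, parabola)

# Perfect linear association: r is 1 (up to rounding).
line = [2 * x + 1 for x in xs]
r_line = correlation(xs, line)

# Same line with the last point replaced by a hypothetical outlier.
with_outlier = line[:-1] + [-20]
r_outlier = correlation(xs, with_outlier)

print(r_parabola, round(r_line, 3), round(r_outlier, 3))
```

A single outlying point is enough here to turn a perfect positive correlation into a negative one, which is why the scatter plot should always be checked before r is reported.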

Which of the following diagrams should be summarized by r?

(1) (2) (3)

Ecological Correlations

Ecological correlations are based on rates or averages. They can be misleading as they tend to overstate the strength of the association. The following example deals with the relationship between income and education level for individuals in 3 states (A, B, C).

This shows the averages. The correlation is almost 1!!

This shows individual data. The correlation is now moderate. Variability within each state!!!
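The gap between the two correlations can be reproduced with a small hypothetical data set. In the sketch below the (education, income) pairs are invented for illustration, four individuals per state, and `correlation` and `averages` are illustrative helper names.

```python
import math
from statistics import mean


def correlation(xs, ys):
    """Correlation coefficient r computed from sums of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (education in years, income in $1,000) pairs for three states.
states = {
    "A": [(8, 20), (12, 20), (10, 14), (10, 26)],
    "B": [(10, 30), (14, 30), (12, 24), (12, 36)],
    "C": [(12, 40), (16, 40), (14, 34), (14, 46)],
}


def averages(people):
    """Average education and average income for one state."""
    return mean([e for e, _ in people]), mean([i for _, i in people])


# Ecological correlation: one point per state (the state averages).
state_points = [averages(people) for people in states.values()]
r_averages = correlation([e for e, _ in state_points], [i for _, i in state_points])

# Individual-level correlation: one point per person.
individuals = [pair for people in states.values() for pair in people]
r_individuals = correlation([e for e, _ in individuals], [i for _, i in individuals])

print(round(r_averages, 3), round(r_individuals, 3))
```

Averaging removes the variability within each state, so the three state averages line up perfectly while the individual points do not.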

Summary

The correlation coefficient r varies between –1 and 1. r = 0 means there is no linear association between X and Y. If r = 1 or –1, then the points in a scatter plot lie on a straight line.

Positive r indicates positive association between X and Y. Negative r indicates negative association between X and Y.

Both variables X and Y must be quantitative.

The correlation coefficient between X and Y is the same as the correlation between Y and X.

r does not change if we change the units of measurement for X and Y

The correlation measures only the linear relationship between two variables

r can be strongly affected by the presence of outliers.
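Two of these properties, symmetry and invariance under a change of units, are easy to check numerically. The temperatures below are hypothetical values chosen for illustration (not the Boulder data), and `correlation` and `to_celsius` are illustrative helper names.

```python
import math
from statistics import mean


def correlation(xs, ys):
    """Correlation coefficient r computed from sums of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def to_celsius(f):
    """Convert degrees Fahrenheit to degrees centigrade."""
    return 5 / 9 * (f - 32)


# Hypothetical daily low and high temperatures in degrees Fahrenheit.
low_f = [30, 35, 28, 40, 45, 38, 50]
high_f = [55, 60, 50, 70, 68, 65, 80]

r_fahrenheit = correlation(low_f, high_f)
r_centigrade = correlation([to_celsius(t) for t in low_f],
                           [to_celsius(t) for t in high_f])
r_swapped = correlation(high_f, low_f)  # roles of X and Y exchanged

print(round(r_fahrenheit, 4), round(r_centigrade, 4), round(r_swapped, 4))
```

All three values agree up to rounding, because r is built from standardized values, which do not depend on the unit of measurement or on which variable is called X.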

Correlation does not mean Causation!!

The correlation between teachers’ salaries and the consumption of alcohol over a period of years turned out to be almost 0.90. Do the teachers drink?

Both variables moved together, because both are influenced by a third variable (confounding variable) which is the long run growth in national income and population.
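A toy simulation can make the role of the third variable concrete. Everything in this sketch is hypothetical: the two series are built from a shared upward trend plus unrelated fluctuations, with no link from one series to the other, and the names `salaries`, `alcohol`, `noise_a` and `noise_b` are illustrative.

```python
import math
from statistics import mean


def correlation(xs, ys):
    """Correlation coefficient r computed from sums of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


years = range(20)  # the shared trend (stand-in for long-run growth)

# Fluctuations specific to each series; by construction they are uncorrelated.
noise_a = [1 if t % 2 == 0 else -1 for t in years]
noise_b = [1 if t % 4 < 2 else -1 for t in years]

# Each series depends only on the trend and its own fluctuation.
salaries = [2 * t + a for t, a in zip(years, noise_a)]
alcohol = [3 * t + b for t, b in zip(years, noise_b)]

r_observed = correlation(salaries, alcohol)      # very high
r_without_trend = correlation(noise_a, noise_b)  # zero

print(round(r_observed, 3), r_without_trend)
```

The high correlation comes entirely from the common trend; once the trend is taken away, nothing connects the two series.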

A "bad example" published in The New York Times' weekly science supplement, "Science Times", on August 22, 1989. It stated, "The experts have also developed startling evidence of the cat's renowned ability to survive, this time in the particular setting of New York City, where cats are prone at this time of year to fall from open windows in tall buildings. Researchers call the phenomenon feline high-rise syndrome." "Even more surprising, the longer the fall, the greater the chance of survival. Only one of 22 cats that plunged from above 7 stories died, and there was only one fracture among the 13 that fell more than 9 stories."

The following graph displays the number of radios in the U.K. from 1924 to 1937 and the number of mental defectives per 10,000 people for the same years.

A social scientist states: “as more people gave up intellectual pursuits like reading for listening to the radio, general atrophy of the brain set in and led to increased mental disability” ?!?!?!

Data mining

Search very large databases for patterns and associations that are hidden in vast amounts of data.

For instance:

• Market basket data: purchases recorded by the cash scanners of a national retail chain

• Web log data: logs of the visits to a certain website

Exploratory data analysis techniques are used to discover information from huge datasets!

Because of the very large dimension of the datasets, efficient algorithms are necessary to “mine” the data.

Data mining is cross-disciplinary: statistical methods made efficient by computer scientists!

Correlation is often used in data mining to construct “association rules”, i.e. to learn about the associations among variables.

Association is often confused with causation in data mining!

A supermarket manager observes that there is a strong positive correlation between the sales of hamburgers and hotdogs, and between the sales of hotdogs and barbecue sauce.

He decides to sell hotdogs at a large discount, hoping to increase profit by simultaneously raising the price of the barbecue sauce.

What is the causal model (cause & effect) that is assumed by the manager?

Will the manager make money on this sale?