Correlation, OLS (simple) regression, logistic regression, reading tables

Upload: ferdinand-cameron

Post on 16-Jan-2016


Page 1: Correlation, OLS (simple) regression, logistic regression, reading tables

Correlation, OLS (simple) regression, logistic regression, reading tables

Page 2: Correlation, OLS (simple) regression, logistic regression, reading tables

Review – What are the odds?

• “Test” statistics – say, the “r” – help us evaluate whether there is a relationship between variables that goes beyond chance

• If there is, one can reject the null hypothesis of no relationship

• But in the social sciences, one cannot take more than five chances in one-hundred of incorrectly rejecting the null hypothesis

• Here is how we proceed:

– Computers automatically determine whether the test statistic’s coefficient (expressed numerically, such as .03) is of sufficient magnitude to reject the null hypothesis

– How large must a coefficient be? That varies. In any case, if a computer decides that it’s large enough, it automatically assigns one, two or three asterisks (*, **, ***).

– One asterisk is the minimal level required for rejecting the null hypothesis. It is written p < .05, meaning less than five chances in 100 that a coefficient of that magnitude (size) could be produced by chance.

– If the coefficient is so large that the probability is less than one in one-hundred that it was produced by chance, the computer assigns two asterisks (**)

– An even better result is three asterisks (***), where the probability that a coefficient was produced by chance is less than one in a thousand
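The asterisk rule described on this slide can be written out as a short Python sketch. The function name `significance_stars` is hypothetical, and the cutoffs are simply the three levels named above; a given statistics package or journal may label them differently.

```python
def significance_stars(p_value):
    """Return the asterisks conventionally attached to a p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    if p_value < 0.001:
        return "***"   # less than one chance in a thousand
    if p_value < 0.01:
        return "**"    # less than one chance in a hundred
    if p_value < 0.05:
        return "*"     # less than five chances in a hundred
    return ""          # not significant: the null hypothesis is not rejected


for p in (0.2, 0.03, 0.004, 0.0002):
    print(p, significance_stars(p) or "NS")
```

Note that a p-value of exactly .05 earns no asterisk here, because the rule is "less than" .05.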

Page 3: Correlation, OLS (simple) regression, logistic regression, reading tables

CORRELATION

Page 4: Correlation, OLS (simple) regression, logistic regression, reading tables

Correlation

• r: simple relationship between variables

– Coefficients range between -1 and +1 (0 = no relationship)

• R: multiple correlation – cumulative association of multiple variables

• Computers automatically test correlations for statistical significance (this does not imply there is a causal relationship – that’s up to researchers to hypothesize)

Correlations

                                HEIGHT    WEIGHT
HEIGHT   Pearson Correlation    1.000     .719**
         Sig. (2-tailed)        .         .000
         N                      26        26
WEIGHT   Pearson Correlation    .719**    1.000
         Sig. (2-tailed)        .000      .
         N                      26        26

**. Correlation is significant at the 0.01 level (2-tailed).

Sig. (2-tailed) means that the significance of the relationship was computed without specifying the direction of the effect. Of course, here we see that the relationship is positive - both variables rise and fall together.

[Scatterplot of WEIGHT (vertical axis, 100 to 240) by HEIGHT (horizontal axis, 58 to 76)]

Correlation “matrix”: displays relationships between variables

Used when independent and dependent variables are continuous
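To make the r statistic concrete, here is a minimal Python sketch of how a Pearson correlation is computed. The function name `pearson_r` and the seven height/weight pairs are hypothetical illustrations; they are not the 26 cases behind the table on this slide.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (2 or more cases)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (inches) and weights (pounds), NOT the slide's data
heights = [60, 62, 65, 68, 70, 72, 75]
weights = [115, 130, 128, 160, 155, 190, 210]

r = pearson_r(heights, weights)
print(round(r, 3))  # positive: both variables rise and fall together
```

A result near +1 or -1 indicates a strong relationship, and a result near 0 indicates little or none. Whether it is statistically significant is a separate test.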

Page 5: Correlation, OLS (simple) regression, logistic regression, reading tables

Correlation matrices

• When continuous variables are used, data analysis often begins with a correlation matrix

• Correlation matrices display the simple, “bivariate” relationships between every possible combination of continuous variables. Dependent variables that use continuous measures are usually included.

• The same variables run in the same order down the left and across the top

– When a variable intersects with itself, “1” is inserted as a placeholder

[Correlation matrix from the article cited below; surviving variable labels: Effort, Male]

Richard B. Felson and Jeremy Staff, “Explaining the Academic Performance-Delinquency Relationship,” Criminology (44:2, 2006)

Page 6: Correlation, OLS (simple) regression, logistic regression, reading tables

REGRESSION

Page 7: Correlation, OLS (simple) regression, logistic regression, reading tables

Regression (ordinary - known as “OLS”)

STATISTICS

• r² – coefficient of determination: proportion of the variation in the dependent variable accounted for by the independent variable (R² – summary effect of multiple IV’s on the DV)

• b or B. Reports the unit change in the DV for each unit change in the IV. Unlike r’s, which are on a scale of -1 to +1, b’s and B’s are not “standardized,” so they cannot be compared across variables measured in different units.

– For our purposes, it makes no difference whether it is lowercase (b) or uppercase (B).

• SE – the standard error. All coefficients include an error component. The larger this error as a proportion of the b or B, the less likely that the b or B will prove statistically significant.

Table columns: Indep. variables | B | SE | p

Procedure

1. Dependent variable is understood - it is “embedded” in the table (here it is “citizen perceptions of social disorder,” a continuous measure)

2. Independent variables normally run down the left column

3. Significant relationships (p < .05) are denoted two ways - with asterisks, and/or a p-value column

4. When assessing a relationship, note whether the B or b is positive (no sign) or negative (- sign).

Used when independent and dependent variables are continuous

Joshua C. Hinkle and Sue-Ming Yang, “A New Look Into Broken Windows: What Shapes Individuals’ Perceptions of Social Disorder?,” Journal of Criminal Justice (42: 2014, 26-35)
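The three statistics named on this slide (b, SE and r²) can be computed by hand for a one-IV regression. The sketch below uses made-up data and a hypothetical helper named `simple_ols`; it is an illustration of the arithmetic, not the procedure or data used in the cited article.

```python
import math


def simple_ols(xs, ys):
    """Return (b, se_b, r2) for a one-IV regression of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three paired cases")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    b = sxy / sxx                      # unit change in DV per unit change in IV
    a = mean_y - b * mean_x            # intercept
    # Residual sum of squares: variation in the DV the line does not explain
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - sse / syy                 # coefficient of determination
    se_b = math.sqrt(sse / (n - 2) / sxx)   # standard error of b
    return b, se_b, r2


# Hypothetical data: IV and DV are both continuous
iv = [1, 2, 3, 4, 5, 6]
dv = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

b, se_b, r2 = simple_ols(iv, dv)
print(f"b = {b:.3f}, SE = {se_b:.3f}, r2 = {r2:.3f}")
```

Here the b is positive and its standard error is small relative to its size, which is the pattern that tends to produce a statistically significant coefficient.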

Page 8: Correlation, OLS (simple) regression, logistic regression, reading tables

LOGISTIC REGRESSION

Page 9: Correlation, OLS (simple) regression, logistic regression, reading tables

STATISTICS

• b (or B) is the logistic regression coefficient

– For our purposes, it makes no difference whether it is lowercase or uppercase.

• Exp b, the “odds ratio,” reports how the odds of the DV change with a one-unit change in the IV. An Exp b of exactly 1 means that a one-unit change in the IV leaves the odds of the DV unchanged. No relationship between variables can be assumed.

• Exp b’s greater than 1 indicate a positive relationship, less than 1 a negative relationship

– Arrest decreases (negative b) the odds of repeat victimization by 22 percent (1 - .78 = .22), but the effect is non-significant (no asterisk)

– Not reported (positive b) increases the odds of repeat victimization by 89 percent (1 + .89), or 1.89 times, a statistically significant change

– Prior victimization increases the odds of repeat victimization 408 percent or 5.08 times, also statistically significant (it’s not 508 percent because Exp b’s begin at 1)
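The percentage reading of Exp b used in these three examples is simply the distance from 1, times 100. A minimal Python sketch follows; the function name `percent_change_in_odds` is hypothetical, and the three Exp b values are the ones quoted on this slide.

```python
def percent_change_in_odds(exp_b):
    """Convert an odds ratio (Exp b) into a percentage change in the odds."""
    if exp_b <= 0:
        raise ValueError("an odds ratio must be positive")
    return (exp_b - 1) * 100


# Exp b values quoted on this slide
examples = [("Arrest", 0.78), ("Not reported", 1.89), ("Prior victimization", 5.08)]
for label, exp_b in examples:
    change = percent_change_in_odds(exp_b)
    direction = "increases" if change > 0 else "decreases"
    print(f"{label}: {direction} the odds by {abs(change):.0f} percent")
```

Subtracting 1 first is what turns "5.08 times" into "408 percent" rather than 508 percent.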

Logistic regression

Used when dependent variable is nominal (i.e., two mutually exclusive categories, 0/1) and independent variables are nominal or continuous

Dependent variable: Risk of a future assault (0,1)

Richard B. Felson, Jeffrey M. Ackerman and Catherine A. Gallagher, “Police Intervention and the Repeat of Domestic Assault,” Criminology (43:3, 2005)

Page 10: Correlation, OLS (simple) regression, logistic regression, reading tables

Huh?

Multiple:           Original    2X                  3X
In words:           Original    two times larger    three times larger
As a percentage:    Original    100% larger         200% larger

Page 11: Correlation, OLS (simple) regression, logistic regression, reading tables

Class exercise

• Come up with a goofy hypothesis that can be tested using logistic regression

– Dependent variable must be nominal (0 and 1)

– Independent variable can be nominal or continuous (continuous preferred)

• Make believe that you collected data on x number of cases, coded each for the dependent and independent variables, and entered it into a statistics program

• Make up a coefficient for the b statistic reported by your statistics package, that reflects the relationship between these variables.

– If the relationship is positive, the b must be positive - if it’s negative, the b must be negative

– For the coefficient, arbitrarily pick a fraction between .015 and 1.85. The larger the fraction (whether positive or negative) the stronger the relationship, and the more likely it is statistically significant

Page 12: Correlation, OLS (simple) regression, logistic regression, reading tables

Logistic regression – going from b to exp(b)

• Use an exponents calculator

– http://www.rapidtables.com/calc/math/Exponent_Calculator.htm

• For “number,” always enter the constant 2.72 (an approximation of e, the base of natural logarithms)

• For “exponent,” enter the b or B value, also known as the “log-odds”

• The result is the odds ratio, also known as exp(b)

• In the left example the b is 1.21, and the exp(b) is 3.36.

– Meaning, for each unit change in the IV, the odds of the DV increase 236 percent

• In the right example the b is -.610 (note the negative sign) and the exp(b) is .543

– Meaning, for each unit change in the IV, the odds of the DV decrease 46 percent (1.00 - .54)
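The calculator steps on this slide amount to raising a constant to the power b. The Python sketch below does the same thing for the two b values given here. The function name `odds_ratio` is hypothetical; `math.e` is Python's built-in value of the constant that the slide rounds to 2.72.

```python
import math


def odds_ratio(b, base=math.e):
    """Raise the base to the power b to turn log-odds into an odds ratio."""
    return base ** b


for b in (1.21, -0.610):
    exact = odds_ratio(b)              # uses e = 2.71828...
    rounded = odds_ratio(b, 2.72)      # uses the slide's rounded constant
    print(f"b = {b}: exp(b) = {exact:.3f} with e, {rounded:.3f} with 2.72")
```

Using the rounded constant 2.72 reproduces the 3.36 and .543 shown on the slide. The unrounded constant gives a marginally smaller value for the first example, a difference too small to change the interpretation.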

Page 13: Correlation, OLS (simple) regression, logistic regression, reading tables

READING TABLES

Page 14: Correlation, OLS (simple) regression, logistic regression, reading tables

OLS regression

OLS regression analysis predicting perception of social disorder (DV)

Table columns: IV’s | B | S.E. | p

DV is continuous

Logistic regression

Logistic regression analysis predicting feeling unsafe (DV)

Table columns: IV’s | B | Exp B | S.E. | p

DV is nominal – 0 and 1

Joshua C. Hinkle and Sue-Ming Yang, “A New Look Into Broken Windows: What Shapes Individuals’ Perceptions of Social Disorder?,” Journal of Criminal Justice (42: 2014, 26-35)

Page 15: Correlation, OLS (simple) regression, logistic regression, reading tables

Logistic regression – Effects of broken homes on children

Research questions

•Use the column Exp(B) and percentages to describe the effects of significant variables

•Describe the levels of significance using words

Dependent variable: conviction for crime of violence (logistic regression)

Delphine Theobald, David P. Farrington and Alex R. Piquero, “Childhood Broken Homes and Adult Violence: An Analysis of Moderators and Mediators,” Journal of Criminal Justice (41:1, 2013)

Page 16: Correlation, OLS (simple) regression, logistic regression, reading tables

• Youths from broken homes were 236 percent more likely to be convicted of a crime of violence. The effect was significant, with less than 1 chance in 100 that it was produced by chance.

• Youths with poor parental supervision were 128 percent more likely to be convicted of a violent crime. The effect was significant, with less than 5 chances in 100 that it was produced by chance.

Page 17: Correlation, OLS (simple) regression, logistic regression, reading tables

Logistic regression – Economic adversity → criminal cooperation

• DV: “co-offending”

• Different ways to measure the main IV’s (each is a separate independent variable)

• Additional, “control” independent variables. Each is measured on a scale or, if it is a nominal variable (e.g., gender), is coded 0 or 1, with 0 usually denoting the “reference” category, meaning the value to which the results are compared. Here the reference category for race is “non-white”, and for gender it is “female.”

• A “model” is a unique combination of independent variables

• Regression coefficient. Positive means IV and DV go up and down together, negative means as one rises the other falls.

Holly Nguyen and Jean Marie McGloin, “Does Economic Adversity Breed Criminal Cooperation? Considering the Motivation Behind Group Crime,” Criminology (51:4, 2013)

Page 18: Correlation, OLS (simple) regression, logistic regression, reading tables

Research questions

•What is the relationship between the size of gatherings and substance use?

•What is the relationship between the presence of peers and substance use?

•What is the relationship between the behavior of peers and substance use?

“Poisson” logistic regression* – effects of audience characteristics on substance use

Alcohol and cannabis use at adolescent parties

* “Poisson” best when comparing counts of things

“Standardizing” makes the b’s comparable

Owen Gallupe and Martin Bouchard, “Adolescent Parties and Substance Use: A Situational Approach to Peer Influence,” Journal of Criminal Justice (41: 2013, 162-171)

Page 19: Correlation, OLS (simple) regression, logistic regression, reading tables

Findings

•Higher levels of substance use tend to occur in smaller gatherings

•Less alcohol use in the presence of close friends

•Except that alcohol/cannabis use is higher when friends are also using

•Peer behavior is the key

Page 20: Correlation, OLS (simple) regression, logistic regression, reading tables

COMPLICATIONS

Page 21: Correlation, OLS (simple) regression, logistic regression, reading tables

• Probability statistics are the most common way to evaluate relationships, but they are being criticized for suggesting misleading results.

• We normally use p values to accept or reject null hypotheses. But the actual meaning is more subtle:

– Formally, a p < .05 means that, if an association between variables was tested an infinite number of times, a test statistic coefficient as large as the one actually obtained (say, an r of .3) would come up less than five times in a hundred if the null hypothesis of no relationship was actually true.

• For our purposes, as long as we keep in mind the inherent sloppiness of social science, and the difficulties of accurately quantifying social science phenomena, it’s sufficient to use p-values to accept or reject null hypotheses.

• We should always be skeptical of findings of “significance,” and particularly when very large samples are involved, as even weak relationships will tend to be statistically significant. (See next slide.)
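The point about very large samples can be illustrated with the standard t test for a correlation coefficient. In the Python sketch below, the correlation of .05 and the three sample sizes are hypothetical, the function name `t_for_correlation` is made up for illustration, and the cutoff of 1.96 is an assumed large-sample approximation of the two-tailed .05 critical value.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a Pearson r of size r based on n cases."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


CRITICAL_T = 1.96   # assumed two-tailed .05 cutoff; close approximation for large n

weak_r = 0.05       # hypothetical, very weak correlation
for n in (100, 2000, 10000):
    t = t_for_correlation(weak_r, n)
    verdict = "significant" if abs(t) > CRITICAL_T else "not significant"
    print(f"n = {n}: t = {t:.2f} ({verdict} at p < .05)")
```

The correlation never changes, yet it crosses the significance threshold once the sample grows large enough. That is why the size of an effect has to be inspected separately from its asterisks.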

A caution on hypothesis testing…

Page 22: Correlation, OLS (simple) regression, logistic regression, reading tables

Statistical significance v. size of the effect

• Once we are confident that an effect was NOT caused by chance, we need to inspect its magnitude

• Consider this example from an article that investigated the “marriage effect”

– Logistic regression was used to measure the association of disadvantage (coded 0/1) and the probability of arrest (Y/N) under four conditions (not important here)

Bianca E. Bersani and Elaine Eggleston Doherty, “When the Ties That Bind Unwind: Examining the Enduring and Situational Processes of Change Behind the Marriage Effect,” Criminology (51:2, 2013)

– Without knowing more, it seems that the association between these two variables is confirmed in model 1 (p < .05) and model 4 (p < .001).

• But just how meaningful are these associations? Since logistic regression was used, we can calculate exp B’s.

– For model 1, the exp B is 1.08, meaning that “disadvantaged” persons are eight percent more likely to have been arrested than non-disadvantaged persons

– For model 4 the exp B climbs to 1.38, or 38 percent (a little more than one-third more likely)

• Since standard error decreases as sample size increases, large samples have a well-known tendency to label even the most trivial relationships as “significant”

• Aside from exp B, r² is another statistic that can help clue us in on just how meaningful relationships are “in the real world”

                Model 1            Model 2            Model 3            Model 4
                b     Sig   SE     b     Sig   SE     b     Sig   SE     b     Sig   SE
Disadvantage    .078  *     .037   .119  NS    .071   .011  NS    .107   .320  ***   .091
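The table's b and SE values can be turned into odds ratios and rough significance checks with a few lines of Python. This is a sketch under an assumption: it divides each b by its SE and uses a normal approximation for the two-tailed probability, which may not be the exact test the article's software applied. The name `wald_summary` is hypothetical.

```python
import math

# (b, SE) for Disadvantage in each model, copied from the table above
models = {
    "Model 1": (0.078, 0.037),
    "Model 2": (0.119, 0.071),
    "Model 3": (0.011, 0.107),
    "Model 4": (0.320, 0.091),
}


def wald_summary(b, se):
    """Return (exp_b, z, two_sided_p) using a normal approximation."""
    exp_b = math.exp(b)                     # odds ratio
    z = b / se                              # coefficient relative to its error
    p = math.erfc(abs(z) / math.sqrt(2))    # two-tailed normal probability
    return exp_b, z, p


for name, (b, se) in models.items():
    exp_b, z, p = wald_summary(b, se)
    print(f"{name}: exp(b) = {exp_b:.2f}, z = {z:.2f}, p = {p:.4f}")
```

Under this approximation the pattern of asterisks in the table is reproduced: models 1 and 4 fall below .05 and .001 respectively, while models 2 and 3 do not reach significance. The odds ratios show how modest the model 1 effect is despite its asterisk.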

Page 23: Correlation, OLS (simple) regression, logistic regression, reading tables

Sometimes probabilities are given in a dedicated column - there may be no “asterisks,” or they may be in an unusual place

Probability that a coefficient was generated by chance: * <.05, ** <.01, *** <.001

Asterisks are at the end of variable names

Shelley Johnson Listwan, Christopher J. Sullivan, Robert Agnew, Francis T. Cullen and Mark Colvin, “The Pains of Imprisonment Revisited: The Impact of Strain on Inmate Recidivism,” Justice Quarterly (30:1, 2013)

Page 24: Correlation, OLS (simple) regression, logistic regression, reading tables

Sometimes different dependent variables run across the top

Probability that a coefficient was generated by chance: * <.05, ** <.01, *** <.001

Daniel P. Mears, Joshua C. Cochran, Brian J. Stults, Sarah J. Greenman, Avinash S. Bhati and Mark Greenwald, “The ‘True’ Juvenile Offender: Age Effects and Juvenile Court Sanctioning,” Criminology (52:2, 2014)

Page 25: Correlation, OLS (simple) regression, logistic regression, reading tables

Sometimes relationships between variables are given separately for different measures of the dependent variable

Richard B. Felson and Keri B. Burchfield, “Alcohol and the Risk of Physical and Sexual Assault Victimization,” Criminology (42:4, 2004)

Sometimes relationships between variables are given separately for each value of a “control” variable (here the control variable is neighborhood disadvantage)

Yuning Wu, Ivan Y. Sun and Ruth A. Triplett, “Race, Class or Neighborhood Context: Which Matters More in Measuring Satisfaction With Police?,” Justice Quarterly (26:1, 2009)

[Callouts on both tables mark the dependent variable and the independent variables]

Page 26: Correlation, OLS (simple) regression, logistic regression, reading tables

And just when you thought you had it “down”…

It’s rare, but sometimes categories of the dependent variable run in rows, and the independent variable categories run in columns.

Hypothesis: SOCP (intensive supervision) → fewer violations

Jodi Lane, Susan Turner, Terry Fain and Amber Sehgal, “Evaluating an Experimental Intensive Juvenile Probation Program: Supervision and Official Outcomes,” Crime & Delinquency (51:1, 2005)

Page 27: Correlation, OLS (simple) regression, logistic regression, reading tables

Parking lot exercise

Page 28: Correlation, OLS (simple) regression, logistic regression, reading tables
Page 29: Correlation, OLS (simple) regression, logistic regression, reading tables

Final exam practice

Page 30: Correlation, OLS (simple) regression, logistic regression, reading tables

• The final exam will ask the student to interpret a table. The hypothesis will be provided.

• Students must identify the dependent and independent variables

• Students must recognize whether relationships are positive or negative

• Students must recognize whether relationships are statistically significant, and if so, to what extent

• Students must be able to explain the effects described by odds ratios (exp b) using percentages

• Students must be able to recognize and interpret how the effects change:

– As one moves across models (different combinations of the independent variables)

– As one moves across different levels of the dependent variable

• For more information about reading tables please refer to the week 14 slide show and its many examples

• IMPORTANT: Tables must be interpreted strictly according to the techniques learned in this course. Leave personal opinions behind. For example, if a relationship supports the notion that wealth causes crime, then wealth causes crime!

Sample question and answer on next slide

Page 31: Correlation, OLS (simple) regression, logistic regression, reading tables

Hypothesis: Unstructured socializing and other factors → youth violence

1. In which model does Age have the greatest effect?

– Model 1

2. What is its numerical significance?

– .001

3. Use words to explain #2

– Less than one chance in 1,000 that the relationship between age and violence is due to chance

4. Use Odds Ratio (same as Exp b) to describe the percentage effect of Age on Violence in Model 1

– For each year of age increase, violence is seventeen percent more likely

5. What happens to Age as it moves from Model 2 to Model 3? What seems most responsible?

– Age becomes non-significant. Most likely cause is introduction of variable Deviant Peers.