skillshare - regression analysis for data journalism

Post on 16-Apr-2017

Category:

Data & Analytics

REGRESSION ANALYSIS FOR DATA-JOURNALISM

(AN INTRODUCTION)

Camila Salazar

School of Data Fellow

@milamila07

Outline

1. Target audience
2. A step beyond descriptive statistics
3. What is regression analysis?
4. Example: the effect of education on wages
5. Other types of regression analysis useful in data journalism
6. Using regression models in data journalism

TARGET AUDIENCE

• Data journalists

• School of Data Fellows

• People with basic knowledge of statistics

• Journalism students

A STEP BEYOND DESCRIPTIVE STATISTICS

So you are in the newsroom...

There’s a big debate in your country about the importance of education.

Your editor asks you to write a story about the importance of education.

First step: descriptive statistics

You find data about education in your country and start calculating the descriptive statistics.

With descriptive statistics you find:

- How many people have a college degree.
- Unemployment according to the level of education.
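As a rough illustration of this first step, the sketch below computes both figures in Python. The records are invented for the example (they are not the survey data used later in this talk), and the field names `education` and `employed` are assumptions.

```python
from collections import defaultdict

# Hypothetical survey records, invented for illustration.
people = [
    {"education": "college", "employed": True},
    {"education": "college", "employed": True},
    {"education": "college", "employed": True},
    {"education": "college", "employed": False},
    {"education": "high school", "employed": True},
    {"education": "high school", "employed": True},
    {"education": "high school", "employed": False},
    {"education": "high school", "employed": False},
    {"education": "primary", "employed": True},
    {"education": "primary", "employed": False},
]

# Share of people with a college degree.
college_share = sum(p["education"] == "college" for p in people) / len(people)

# Unemployment rate for each level of education.
counts = defaultdict(lambda: [0, 0])  # level -> [unemployed, total]
for p in people:
    counts[p["education"]][0] += not p["employed"]
    counts[p["education"]][1] += 1
unemployment_rate = {level: u / n for level, (u, n) in counts.items()}

print(f"College degree: {college_share:.0%}")
for level, rate in unemployment_rate.items():
    print(f"Unemployment ({level}): {rate:.0%}")
```

These numbers describe the sample, but they say nothing about how much one more year of schooling changes a wage.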

And... You interview young people who are still in high school and don’t want to go to college. With your story, you want to show them how much going to college could improve their future earnings.

You can’t answer this question using descriptive statistics :(

But... you can calculate how much an extra year of schooling increases wages using regression analysis!

WHAT IS REGRESSION ANALYSIS?

Regression analysis is a statistical tool for investigating relationships between variables.

It helps you explain how the value of a dependent variable (Y) changes when an independent variable (X) is varied, holding all other variables fixed.

For example:

Health (Y): dependent variable

Vegetable consumption (X), exercise (X), sleep (X): independent variables

The linear regression

It’s a method for modeling the linear relationship between a dependent variable Y and one or more explanatory variables.

[Equation on slide, with its parts labeled: dependent variable, independent variable, coefficient, error term]

We are interested in estimating B (the coefficient). It captures the effect X has on Y, holding all other factors fixed.
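Written out, a linear regression is usually expressed as follows. The notation is an assumption, since the equation on the slide did not survive in this transcript: Y is the dependent variable, each X is an independent variable, each β with a subscript of 1 or higher is a coefficient (the B above), β₀ is an intercept, and ε is the error term.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon
```

With a single explanatory variable this reduces to a straight line, and the coefficient is its slope: the change in Y for a one-unit change in X.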

For example, you want to explain the effect of education on wages.

[Diagram: Wage, Education, Experience. One region is labeled "Variation in wage that has to do with education", another "Variation in wage that has to do with experience".]

What is a linear regression?

• You have to formulate a hypothesis about the relationships of interest.
• Have some theory behind your assumptions.
• There are some essential assumptions and statistical properties of the regression that you have to consider.

EXAMPLE: THE EFFECT OF EDUCATION ON WAGES

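Example

• Database with 994 observations.
• 3 variables: wage (in dollars), experience, years of education.
• The equation to estimate:

Wage = B0 + B1 × Education + B2 × Experience + e

(B0 is the intercept, B1 and B2 are the coefficients, and e is the error term.)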

Example: coefficients

An additional year of education increases wage by $161.68, holding all other factors fixed.

An additional year of experience increases wage by $16.54, holding all other factors fixed.
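To see where such coefficients come from, here is a minimal sketch of ordinary least squares in plain Python. It does not use the survey data from the slides: it simulates 500 hypothetical workers whose wages are built with an effect of 150 per year of education and 20 per year of experience (values chosen only for illustration), then checks that the regression recovers them. The helper names `solve` and `ols` are made up for this sketch; in practice you would more likely use a statistics package such as R or Python's statsmodels.

```python
import random

def solve(a, b):
    """Solve the linear system a·x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    Each row must start with 1.0 so that the first coefficient is the intercept.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Simulated (not real) data with known effects plus random noise.
rng = random.Random(0)
education = [rng.randint(6, 18) for _ in range(500)]   # years of education
experience = [rng.randint(0, 30) for _ in range(500)]  # years of experience
wage = [200 + 150 * e + 20 * x + rng.gauss(0, 50)
        for e, x in zip(education, experience)]

rows = [[1.0, e, x] for e, x in zip(education, experience)]
intercept, b_education, b_experience = ols(rows, wage)

print(f"One more year of education:  {b_education:+.2f} dollars")
print(f"One more year of experience: {b_experience:+.2f} dollars")
```

The estimates land close to, but not exactly on, 150 and 20 because of the random noise in the simulated wages.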

Example: p-value

But, what is the p-value?

With statistics you can’t be 100% certain.

A relatively simple way to interpret p-values is to think of them as representing how likely a result like yours would be if only chance were at work.

Null hypothesis: a hypothesis that the researcher tries to disprove, reject or nullify.

“Education has NO explanatory power over wages”

“Men are NOT taller than women on average”

To test the null-hypothesis we use the p-value.

The p-value is the probability of getting a result at least as extreme as the one you observed, assuming the null hypothesis is true.

If your p-value is small (below 0.05, by convention), you have strong evidence against the null hypothesis.

“Men are significantly taller than women, p=0.01.” That means that, if men were NOT actually taller than women on average, a difference at least this large would show up only 1% of the time through random chance alone.
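One way to build intuition for the p-value is a permutation test, sketched below in Python. This is not how regression software computes its p-values; it is a simulation-based stand-in for the same idea. The heights are invented for the example.

```python
import random

# Invented heights in centimetres, for illustration only.
men = [178, 182, 171, 180, 185, 174, 181, 169]
women = [165, 170, 162, 173, 172, 166, 176, 164]

def mean(values):
    return sum(values) / len(values)

observed = mean(men) - mean(women)  # observed gap between the group averages

# Under the null hypothesis the labels "man" and "woman" carry no information
# about height, so we shuffle them and count how often chance alone produces
# a gap at least as large as the observed one.
rng = random.Random(42)
pooled = men + women
n_shuffles = 10_000
at_least_as_large = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    gap = mean(pooled[:len(men)]) - mean(pooled[len(men):])
    if gap >= observed:
        at_least_as_large += 1

p_value = at_least_as_large / n_shuffles
print(f"Observed gap: {observed:.1f} cm, one-sided p-value: {p_value:.4f}")
```

A small p-value here means that shuffled labels rarely reproduce the observed gap, which is the sense in which the result is unlikely to be due to chance.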

P-value

It tells you whether the coefficient is statistically significant. With a low p-value (below 10%, 5% or 1%, depending on the threshold you choose), you can reject the null hypothesis that the coefficient is equal to zero (that the variable has no explanatory power). In this case, the coefficients are significant. That means that education and experience have explanatory power on wage.

R-squared: This indicates how well the explanatory variables explain the variability of the dependent variable.

In this case: 33.8% of the variability of wage is explained by the years of education and years of experience.
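As a sketch of how R-squared is computed, the example below fits a line with a single explanatory variable (the slides use two) to five invented data points and compares the variation left unexplained with the total variation in wage. The numbers are made up and have nothing to do with the 33.8% above.

```python
# Invented data: years of education and wage in dollars.
education = [8, 10, 12, 14, 16]
wage = [200, 400, 500, 400, 500]

n = len(wage)
mean_x = sum(education) / n
mean_y = sum(wage) / n

# Least-squares line for one explanatory variable.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(education, wage))
         / sum((x - mean_x) ** 2 for x in education))
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in education]

# R-squared = 1 - (unexplained variation / total variation).
ss_residual = sum((y - p) ** 2 for y, p in zip(wage, predicted))
ss_total = sum((y - mean_y) ** 2 for y in wage)
r_squared = 1 - ss_residual / ss_total

print(f"R-squared: {r_squared:.2f}")  # 0.60
```

Here 60% of the variation in the invented wages is accounted for by years of education; the remaining 40% is left in the error term.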

OTHER TYPES OF REGRESSION ANALYSIS

The logistic regression

Imagine you want to estimate the probability that a person with a college degree is employed.

The linear regression wouldn’t be very useful.

It is a regression model where the dependent variable (Y) is categorical. For example (binary):

1 = unemployed, 0 = employed

It is used to estimate the probability of a binary response based on one or more independent variables.
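The sketch below shows the core idea in Python: a logistic regression passes a linear score through the logistic function, so the prediction always stays between 0 and 1, which a straight line cannot guarantee. The intercept and coefficient are invented for illustration, with Y coded as above (1 = unemployed).

```python
import math

def logistic(score):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-score))

# Hypothetical values, invented for illustration (not estimated from data).
intercept = 0.5
b_education = -0.2  # change in the score per extra year of education

for years in (6, 12, 16):
    probability = logistic(intercept + b_education * years)
    print(f"{years} years of education -> P(unemployed) = {probability:.2f}")
```

With a negative coefficient, the predicted probability of being unemployed falls as years of education rise, but it never drops below 0.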

Explanatory variables:

- Age
- Education
- Family income
- Occupation

[Diagram: explanatory variables → logistic regression → Employed / Unemployed]

The model would tell you, for example, that a person with a college degree is three times more likely to be employed than a person who only went to high school.

• The coefficients cannot be interpreted as the rate of change in the dependent variable.

• You check the sign of the coefficients.

• You can calculate marginal effects or odds ratios (logit).
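As a minimal sketch of the odds-ratio reading, the Python example below exponentiates a logit coefficient. The coefficient and the baseline probability are invented, and the outcome is coded 1 = employed here (an assumption made for readability).

```python
import math

# Hypothetical logit coefficient for "has a college degree" (1 = yes, 0 = no)
# in a model where the outcome is 1 = employed. Invented for illustration.
b_college = math.log(3)  # about 1.10

# Exponentiating a logit coefficient gives the odds ratio.
odds_ratio = math.exp(b_college)
print(f"Odds ratio: {odds_ratio:.2f}")

# Odds are not probabilities: convert back to see the difference.
baseline_probability = 0.60  # assumed chance of employment without a degree
baseline_odds = baseline_probability / (1 - baseline_probability)
odds_with_degree = baseline_odds * odds_ratio
probability_with_degree = odds_with_degree / (1 + odds_with_degree)
print(f"Without degree: {baseline_probability:.2f}, "
      f"with degree: {probability_with_degree:.2f}")
```

Tripling the odds does not triple the probability (a probability of 0.60 cannot triple), so a story should say clearly whether a figure refers to odds or to probabilities.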

USING REGRESSION MODELS IN DATA JOURNALISM

Some examples

"Does School Pay Off? How Much?" - El Financiero (Costa Rica), winner of the Data Journalism Awards 2014.

http://www.elfinancierocr.com/gnfactory/especiales/2015/calculadorasalarial/

Some advice

• Statistical analysis can be complex. If you’re not sure, seek advice from an expert!
• Be transparent with your methodology.
• Study a lot!
• https://www.coursera.org/ Free courses!

References

- Wooldridge (2010). Introductory Econometrics.
- Long (1997). Regression Models for Categorical and Limited Dependent Variables.
- Costa Rica National Survey of Income and Spending (2004).

THANKS :) @milamila07

schoolofdata.org
