skillshare - regression analysis for data journalism
Post on 16-Apr-2017
REGRESSION ANALYSIS FOR DATA-JOURNALISM
(AN INTRODUCTION)
Camila Salazar, School of Data Fellow
@milamila07
Outline
1. Target audience
2. A step beyond descriptive statistics
3. What is regression analysis?
4. Example: the effect of education on wages
5. Other types of regression analysis useful in data journalism
6. Using regression models in data journalism
TARGET AUDIENCE
Target audience
• Data journalists
• School of Data Fellows
• People with basic knowledge of statistics
• Journalism students
A STEP BEYOND DESCRIPTIVE STATISTICS
So you are in the newsroom...
There’s a big debate in your country about the importance of education
Your editor asks you to make a story about the importance of education
First step: descriptive statistics
You find data about education in your country and start calculating the descriptive statistics.
Descriptive statistics
With descriptive statistics you find:
-How many people have a college degree.
-Unemployment according to the level of education.
And...
You interview young people who are still in high school and don’t want to go to college. You want your story to show them how much they could improve their future earnings by going to college.
You can’t answer this question using descriptive statistics :(
But...
You can calculate how much an extra year of schooling increases wages using regression analysis!
WHAT IS REGRESSION ANALYSIS?
What is regression analysis?
Regression analysis is a statistical tool for the
investigation of relationships between variables.
It helps you explain how the value of a dependent
variable (Y) changes when an independent variable
(X) is varied, holding all other variables fixed.
For example:
Dependent variable (Y): health
Independent variables (X): vegetable consumption, exercise, sleep
The linear regression
It’s a method for modeling the linear relationship between a dependent variable Y and one or more explanatory variables.
[Equation figure with labeled parts: dependent variable, independent variable, coefficient, error term]
We are interested in estimating B (the
coefficient). It captures the effect X has on Y,
holding all other factors fixed.
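The equation on this slide did not survive in the transcript. A common way to write a linear regression with a single explanatory variable is sketched below. The symbols match the labels on the slide (Y is the dependent variable, X the independent variable, B the coefficient, and the last term the error). The intercept A is an assumption, since the slide's labels do not mention one.

```latex
Y = A + B X + \varepsilon
```

Read this way, B is the change in Y associated with a one-unit increase in X, and the error term collects everything that affects Y but is not in the model.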
For example, you want to explain the effect of education on wages.
[Diagram: overlapping circles for Wage, Education and Experience. One overlap is the variation in wage that has to do with education, the other is the variation in wage that has to do with experience.]
What is a linear regression?
• You have to formulate a hypothesis about the relationships of interest.
• Have some theory behind your assumptions.
• There are some essential assumptions and statistical properties of the regression that you have to consider.
EXAMPLE: THE EFFECT OF EDUCATION ON WAGES
Example
• Database with 994 observations.
• 3 variables: wage (in dollars), experience, years of education.
• The equation to estimate: [equation image not preserved in this transcript]
Example: coefficients
An additional year of education increases wage by $161.68, holding all other factors fixed.
An additional year of experience increases wage by $16.54, holding all other factors fixed.
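To make the coefficient interpretation concrete, here is a minimal sketch of how such coefficients could be estimated. It assumes the model is wage = b0 + b1 * education + b2 * experience + error, which is a guess at the slide's missing equation. The data are simulated, not the survey used in the slides, and the "true" effects of 160 and 15 are made up. The helper `fit_ols` is a hypothetical name, written with the standard library only so the example runs anywhere.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients for y = b0 + b1*x1 + b2*x2 + ... (normal equations)."""
    X = [[1.0] + list(r) for r in rows]      # prepend a column of ones for the intercept
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = []
    for i in range(k):
        row = [sum(x[i] * x[j] for x in X) for j in range(k)]
        row.append(sum(x[i] * yi for x, yi in zip(X, y)))
        A.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Simulated data, invented for illustration (NOT the survey from the slides).
random.seed(1)
TRUE_EDUCATION, TRUE_EXPERIENCE = 160.0, 15.0    # made-up "true" effects in dollars
people, wages = [], []
for _ in range(994):
    education = random.randint(0, 18)     # years of schooling
    experience = random.randint(0, 40)    # years of work experience
    noise = random.gauss(0, 300)          # everything else that affects wages
    people.append((education, experience))
    wages.append(200 + TRUE_EDUCATION * education
                 + TRUE_EXPERIENCE * experience + noise)

intercept, b_education, b_experience = fit_ols(people, wages)
print(f"One more year of education: about ${b_education:.2f} more in wages")
print(f"One more year of experience: about ${b_experience:.2f} more in wages")
```

Because both variables are in the model at once, each estimate is the effect of that variable with the other held fixed, which is the "holding all other factors fixed" part of the interpretation. In practice you would use a statistics package rather than hand-written algebra, since it also reports standard errors and p-values.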
Example: p-value
But, what is the p-value?
With statistics you can’t be 100% certain.
A relatively simple way to interpret p-values is to think of them as representing how likely a result like yours would be if only chance were at work.
Null hypothesis: a hypothesis that the researcher tries to disprove, reject or nullify.
“Education has NO explanatory power over wages”
“Men are NOT taller than women on average”
To test the null hypothesis we use the p-value.
The p-value is the probability of getting a result at least as extreme as the one you observed if the null hypothesis were true.
If your p-value is small (< 0.05), you have strong evidence to reject the null hypothesis.
“Men are significantly taller than women, p=0.01.” That means that if men were NOT actually taller than women, a difference this large would appear by random chance only 1% of the time.
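The idea of a result "happening by chance" can be simulated directly. The sketch below uses a permutation test, which is not the test that regression software reports for coefficients but follows the same logic: if the null hypothesis were true, the labels "men" and "women" would be interchangeable, so we shuffle them many times and count how often the shuffled gap is at least as large as the real one. The heights are invented for illustration and `permutation_p_value` is a hypothetical helper name.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Share of random relabelings whose mean gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)    # under the null hypothesis the labels are arbitrary
        gap = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if gap >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles

# Hypothetical heights in cm, invented for illustration.
men = [178, 182, 175, 180, 185, 177, 181, 176, 183, 179]
women = [165, 170, 162, 168, 172, 160, 166, 169, 163, 167]

p = permutation_p_value(men, women)
print(f"p-value: {p:.4f}")
```

A small value means that shuffled labels almost never reproduce a gap as large as the real one, so chance alone is a poor explanation for it.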
Example
The p-value tells you whether a coefficient is statistically significant. With a low p-value (less than 10%, 5% or 1%, depending on the threshold you choose) you can reject the null hypothesis that the coefficient is equal to zero (that is, that the variable has no explanatory power). In this case, the coefficients are significant. That means that education and experience have explanatory power on wage.
Example
R-squared: This indicates how well the explanatory variables explain the variability of the dependent variable.
In this case: 33.8% of the variability of wage is explained by the years of education and years of
experience.
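As a rough illustration of where a figure like this comes from, R-squared can be computed as one minus the ratio of unexplained variation to total variation. For a least-squares fit with an intercept this equals the share of variability explained. The wages and predictions below are invented for illustration and are not the slide's data. `r_squared` is a hypothetical helper name.

```python
def r_squared(actual, predicted):
    """One minus (unexplained variation / total variation)."""
    mean_actual = sum(actual) / len(actual)
    # Total variation: how far the actual values are from their own mean.
    total = sum((a - mean_actual) ** 2 for a in actual)
    # Unexplained variation: how far the actual values are from the predictions.
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - unexplained / total

# Hypothetical wages and model predictions, invented for illustration.
wages = [900, 1200, 1500, 2100, 2600]
predicted = [1000, 1250, 1400, 2000, 2650]

print(f"R-squared: {r_squared(wages, predicted):.3f}")
```

A model that always predicts the average wage scores 0, and a model that predicts every wage exactly scores 1.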
OTHER TYPES OF REGRESSION ANALYSIS
The logistic regression
Imagine you want to estimate the probability that a person with a college degree is employed.
The linear regression wouldn’t be very useful.
Logistic regression is a regression model where the dependent variable (Y) is categorical. For example (binary): 1 = unemployed, 0 = employed.
It is used to estimate the probability of a binary response based on one or more independent variables.
Explanatory variables:
-Age
-Education
-Family income
-Occupation
[Diagram: the explanatory variables feed a logistic regression that predicts one of two outcomes, employed or unemployed]
The model would tell you, for example, that a person with a college degree is three times more likely to be employed than a person who only went to high school.
• The coefficients cannot be interpreted directly as the rate of change in the dependent variable.
• You check the sign of the coefficients.
• You can calculate marginal effects or odds ratios (logit).
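The points above can be sketched with a few lines of arithmetic. The coefficients below are hypothetical, chosen only so the odds ratio comes out close to three, and the outcome is coded 1 = employed (the reverse of the coding shown earlier) to match the "more likely to be employed" reading. The example shows how a logit coefficient turns into an odds ratio and into a change in probability.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical logit coefficients, invented for illustration (outcome: 1 = employed).
INTERCEPT = 0.4
B_COLLEGE = 1.1    # coefficient on a college-degree dummy (1 = has a degree)

# Odds ratio: how many times larger the odds of employment are with a degree.
odds_ratio = math.exp(B_COLLEGE)

# Predicted probabilities for the two groups.
p_no_degree = logistic(INTERCEPT)
p_degree = logistic(INTERCEPT + B_COLLEGE)

# Marginal effect of the dummy: the change in predicted probability.
marginal_effect = p_degree - p_no_degree

print(f"Odds ratio: {odds_ratio:.2f}")
print(f"P(employed | no degree) = {p_no_degree:.2f}")
print(f"P(employed | degree)    = {p_degree:.2f}")
print(f"Marginal effect: {marginal_effect:.2f}")
```

Note that an odds ratio of about three means the odds are three times higher, not the probability. When reporting, it is worth saying which of the two you mean.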
USING REGRESSION MODELS IN DATA JOURNALISM
Some examples
"Does School Pay Off? How Much?" - El Financiero (Costa Rica), winner of the Data Journalism Awards 2014.
http://www.elfinancierocr.com/gnfactory/especiales/2015/calculadorasalarial/
“Presidential Pardons Heavily Favor Whites” - ProPublica
http://www.propublica.org/article/shades-of-mercy-presidential-forgiveness-heavily-favors-whites
Methodology: http://www.propublica.org/article/how-propublica-analyzed-pardon-data
Some advice
• Statistical analysis can be complex. If you’re not sure, seek advice from an expert!
• Be transparent with your methodology.
• Study a lot!
• https://www.coursera.org/ Free courses!
References
-Wooldridge (2010). Introductory Econometrics.
-Long (1997). Regression Models for Categorical and Limited Dependent Variables.
-Costa Rica National Survey of Income and Spending (2004).
THANKS :) @milamila07
schoolofdata.org