In this week's lecture we reviewed bivariate and multiple linear regressions. You also learned how they relate to the Potential Outcomes Framework and how, under certain conditions, they can be used for causal inference.
In this lab session we will review how to run t-tests and build simple balance tables, and then use bivariate and multiple linear regression to explore the relationship between education and wages.
We will use a package called `wooldridge`, which contains dozens of datasets used by Jeffrey M. Wooldridge for pedagogical purposes in his book Introductory Econometrics.
# Load packages. Install them first, in case you don't have them yet.
library(wooldridge) # To get our example's dataset
library(broom) # To format regression output
library(stargazer) # To format regression tables
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(knitr) # To create tables with kable()
library(kableExtra) # To format tables with kable_styling()
Now let's load and take a look at our data.
data("wage1")
?wage1 #Using the help function you can explore what the variables in the dataset contain.
#You can also have a general view:
str(wage1)
head(wage1)
#Or look at the whole thing using the function View(wage1) in your console.
Let's take this dataset as an example to review how to perform a t-test in R (task 4 of your first assignment).
Remember, the idea of using a t-test here is to check the statistical significance of a difference in means. In other words, we want to know how likely it is that the observed difference in the mean value of a variable between two groups or samples is due to random chance alone (our null hypothesis of no difference).
In this case, let's imagine we want to check the statistical significance of the difference in mean years of education (an interval variable) between men and women (a categorical variable) in our dataset.
The general syntax for a t-test is simply as follows, where y is the variable whose group means you want to compare and x is the variable that defines the two groups.
t.test(y ~ x, data = your_data)
Is the difference in average years of education between men and women statistically significant?
t.test(educ ~ female, data = wage1)
##
## Welch Two Sample t-test
##
## data: educ by female
## t = 1.9693, df = 517.81, p-value = 0.04946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.001124132 0.940597569
## sample estimates:
## mean in group 0 mean in group 1
## 12.78832 12.31746
#OR
t.test(wage1$educ[wage1$female==1],wage1$educ[wage1$female == 0])
##
## Welch Two Sample t-test
##
## data: wage1$educ[wage1$female == 1] and wage1$educ[wage1$female == 0]
## t = -1.9693, df = 517.81, p-value = 0.04946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.940597569 -0.001124132
## sample estimates:
## mean of x mean of y
## 12.31746 12.78832
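If you prefer the test results as a data frame (handy for pulling values into tables later), the broom package can also tidy t-test output. A minimal sketch:

```r
# Tidy the t-test into a one-row data frame (estimates, statistic, p-value, CI, ...)
broom::tidy(t.test(educ ~ female, data = wage1))
```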
Now let's take a look at how we can create a simple, good-looking balance table. The idea here is to compare the mean values of different variables across populations or groups. In our case, let's see whether men and women differ in average wages, years of education, years of experience, and number of years in their current job (tenure).
Here is one way to create the table.
wage1 %>%
  dplyr::group_by(female) %>%
  dplyr::select(wage, educ, exper, tenure) %>%
  dplyr::summarise_all(mean) %>% # apply mean() to all the variables we selected
  t() %>% # transpose the data frame
  as.data.frame() %>%
  tibble::rownames_to_column("Variable") %>% # rownames_to_column() replaces the deprecated add_rownames()
  dplyr::filter(Variable != "female") %>% # drop the redundant grouping variable
  dplyr::rename(Male = "1", Female = "2") %>% # name the columns by what they contain
  dplyr::mutate(Difference = Female - Male) %>%
  dplyr::mutate_if(is.numeric, round, 3) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()
Variable | Male | Female | Difference |
---|---|---|---|
wage | 7.099 | 4.588 | -2.512 |
educ | 12.788 | 12.317 | -0.471 |
exper | 17.558 | 16.429 | -1.130 |
tenure | 6.474 | 3.615 | -2.859 |
wage1 %>%
  dplyr::group_by(nonwhite) %>%
  dplyr::select(educ, exper) %>%
  dplyr::summarise_all(mean) %>% # apply mean() to all the variables we selected
  t() %>% # transpose the data frame
  as.data.frame() %>%
  tibble::rownames_to_column("Variable") %>% # rownames_to_column() replaces the deprecated add_rownames()
  dplyr::filter(Variable != "nonwhite") %>% # drop the redundant grouping variable
  dplyr::rename(White = "1", Non_white = "2") %>% # name the columns by what they contain
  dplyr::mutate(Difference = Non_white - White) %>%
  dplyr::mutate_if(is.numeric, round, 3) %>%
  knitr::kable() %>%
  kableExtra::kable_styling()
Variable | White | Non_white | Difference |
---|---|---|---|
educ | 12.642 | 11.870 | -0.772 |
exper | 16.951 | 17.593 | 0.641 |
t.test(educ ~ nonwhite, data = wage1)
##
## Welch Two Sample t-test
##
## data: educ by nonwhite
## t = 1.6471, df = 61.233, p-value = 0.1047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1650514 1.7082090
## sample estimates:
## mean in group 0 mean in group 1
## 12.64195 11.87037
t.test(exper ~ nonwhite, data = wage1)
##
## Welch Two Sample t-test
##
## data: exper by nonwhite
## t = -0.32877, df = 65.737, p-value = 0.7434
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.536204 3.253562
## sample estimates:
## mean in group 0 mean in group 1
## 16.95127 17.59259
# See the relationship with lm(): the slope on nonwhite equals the difference in group means, and its p-value matches the t-test above
lm(exper ~ nonwhite, data = wage1) %>% summary()
##
## Call:
## lm(formula = exper ~ nonwhite, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.593 -11.951 -3.772 9.049 33.407
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.9513 0.6252 27.112 <2e-16 ***
## nonwhite 0.6413 1.9514 0.329 0.743
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.58 on 524 degrees of freedom
## Multiple R-squared: 0.0002061, Adjusted R-squared: -0.001702
## F-statistic: 0.108 on 1 and 524 DF, p-value: 0.7426
Imagine that, using the `wage1` dataset, you are interested in exploring the relationship between education and people's salaries. You want to look for the causal effect that getting one more year of education (`educ`) has on people's average hourly wage (`wage`).
You start by exploring their relationship:
plot(wage1$educ, wage1$wage)
cor(wage1$educ, wage1$wage)
## [1] 0.4059033
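If you want to eyeball how well a straight line fits this cloud of points, you can overlay the least-squares line on the scatterplot. A quick sketch using base R:

```r
# Scatterplot of education vs. hourly wage with the fitted regression line on top
plot(wage1$educ, wage1$wage)
abline(lm(wage ~ educ, data = wage1), col = "red")
```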
As we saw in the lecture, you can resort to regression to address this question using observational data.
So you formalize the relationship you're after in a first, short model:
\[wage = \alpha + \kappa education + u\]
Remember from the lecture (slide 18) that we can relate a regression equation to the potential outcomes framework as follows.
\[\kappa = E[Y_1|D=1] - E[Y_0|D=0]\]
\[\alpha = E[Y_0|D=0] \]
\[\alpha + \kappa = E[Y_1|D=1] \]
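For a binary 'treatment' you can see this mapping directly in the data. Here is a minimal sketch using female as D, purely to illustrate the equations above (pof_check is an illustrative name, not part of the lab):

```r
# With a binary D, the intercept is the mean outcome for the D = 0 group
# and intercept + slope is the mean outcome for the D = 1 group
pof_check <- lm(wage ~ female, data = wage1)
coef(pof_check)[1]                    # alpha: mean wage when female == 0
sum(coef(pof_check))                  # alpha + kappa: mean wage when female == 1
mean(wage1$wage[wage1$female == 0])   # matches the intercept
mean(wage1$wage[wage1$female == 1])   # matches alpha + kappa
```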
Now that we know what we are after, how do we get these estimates in R?
The general syntax for running a regression model in R is the following:
your_model <- lm(dependent_variable ~ independent_variable, data = your_dataset)
Now try to create an object with your model, based on the bivariate regression equation we specified above, in which `wage` is our outcome of interest and `educ` is our independent or 'treatment' variable.
naive_model <- lm(wage ~ educ, data = wage1)
You have created an object that contains all the coefficients, standard errors, and other information from your model. To see the estimates, you could use the base R function summary(), but base R is not the best at visualizing regression output.
The `tidy()` function from the `broom` package, instead, constructs a data frame that summarizes the model's statistical findings. You can see what else you can do with `broom` by running: vignette("broom").
See the difference:
summary(naive_model)
##
## Call:
## lm(formula = wage ~ educ, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3396 -2.1501 -0.9674 1.1921 16.6085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.90485 0.68497 -1.321 0.187
## educ 0.54136 0.05325 10.167 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.378 on 524 degrees of freedom
## Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632
## F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16
broom::tidy(naive_model)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.905 0.685 -1.32 1.87e- 1
## 2 educ 0.541 0.0532 10.2 2.78e-22
# You can save into an object so you can easily pull the values later
naive_model_df <- broom::tidy(naive_model)
# Remember magrittr can make your life easier
naive_model_df <- lm(wage ~ educ, data = wage1) %>% broom::tidy()
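Once the tidied output is stored, you can pull individual values by filtering on the term column. A short sketch:

```r
# Pull the educ coefficient and its p-value from the tidied data frame
naive_model_df %>%
  dplyr::filter(term == "educ") %>%
  dplyr::select(estimate, p.value)
```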
How would you interpret these results?
We can see that the coefficient for educ (our kappa, our NATE) suggests that one more year of education is associated with a $0.54 increase in hourly wages.
You can also see the value of the intercept (alpha, or E[Y_0|D = 0]) and the significance levels of our coefficients.
What is the difference in expected hourly wages between a person who studies up to high school (13 years) and a person with a college degree (17 years)?
#Different ways to get to the same:
(coef(naive_model)[1] + coef(naive_model)["educ"] * 17) -
(coef(naive_model)[1] + coef(naive_model)["educ"] * 13)
## (Intercept)
## 2.165437
predict(naive_model, newdata = data.frame(educ = 17)) -
predict(naive_model, newdata = data.frame(educ = 13))
## 1
## 2.165437
coef(naive_model)["educ"] * (17 - 13)
## educ
## 2.165437
predict(naive_model, newdata = data.frame(educ = 25))
## 1
## 12.62913
coef(naive_model)[1] + coef(naive_model)["educ"] * 25
## (Intercept)
## 12.62913
summary(naive_model)$r.squared
## [1] 0.1647575
Looking for the ATE.
But you know that this survey data does not fulfill the assumptions needed to infer causality with much confidence. It is highly likely that we are picking up the effect of variables that affect both Y (wages) and D (years of education). People do not get more education by random chance: there can be different causes that lead people both to get more education and to earn more. That is, you suspect that `educ` is not independent of the error term.
After spending a good amount of time reading about the gender pay gap, you conclude that gender could be one of those confounding variables. So you decide to include the `female` variable in the model.
Now your model looks like this:
\[wage = \alpha^l + \kappa^l \, education + \beta \, gender + u\]
Our interpretation of the regression equation in POF terms will not change much. Kappa (long) is still the coefficient we are most interested in: the treatment effect of one more year of `educ` on `wage`.
But remember that multiple regression only uses the unique variation in each regressor that is not explained by the other regressors, so including an additional term (`female`), if it is associated with education and wage, will affect our estimate of the treatment effect (see the sketch after the next code chunk for an illustration).
Let's see: run a multiple regression using the model above. The general syntax for it is this:
your_model <- lm(dependent_variable ~ independent_variable1 + independent_variable2, data = your_dataset)
less_naive_model <- lm(wage ~ educ + female, data = wage1)
less_naive_model_df <- broom::tidy(less_naive_model)
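To see what 'only the unique variation' means in practice, here is a minimal sketch of the partialling-out (Frisch-Waugh-Lovell) logic behind the multiple regression; educ_resid is just an illustrative name:

```r
# 1. Strip out of educ the part that female can explain
educ_resid <- resid(lm(educ ~ female, data = wage1))
# 2. Regress wage on what is left: the slope equals the educ coefficient
#    from the multiple regression above
coef(lm(wage1$wage ~ educ_resid))
coef(less_naive_model)["educ"]
```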
How would you interpret these results now?
What does the coefficient for the female
variable tell us?
What is the expected wage for a woman with 20 years of education?
What would the difference be for a man with the same amount of education?
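One way to check your answers to the last two questions is to plug the values into the fitted model with predict(). A sketch, assuming the less_naive_model object created above:

```r
# Expected hourly wage for a woman with 20 years of education
predict(less_naive_model, newdata = data.frame(educ = 20, female = 1))
# Same education, but male
predict(less_naive_model, newdata = data.frame(educ = 20, female = 0))
# The gap between the two is simply the coefficient on female
coef(less_naive_model)["female"]
```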
In order to see how biased the estimate from our more naïve or 'short' model was, we can compare the regression coefficients between our two models. Let's use the `stargazer` package to do so.
The function (also called) `stargazer()` allows you to see the results of different models side by side. The general syntax is:
stargazer::stargazer(name_of_model_1, name_of_model_2, type = "format of output")
#try out "text" and "html" as outputs
You can define very precisely the format and information included in your tables with stargazer; check the package documentation and vignettes for examples.
stargazer::stargazer(naive_model, less_naive_model,
type = "text")
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## wage
## (1) (2)
## --------------------------------------------------------------------
## educ 0.541*** 0.506***
## (0.053) (0.050)
##
## female -2.273***
## (0.279)
##
## Constant -0.905 0.623
## (0.685) (0.673)
##
## --------------------------------------------------------------------
## Observations 526 526
## R2 0.165 0.259
## Adjusted R2 0.163 0.256
## Residual Std. Error 3.378 (df = 524) 3.186 (df = 523)
## F Statistic 103.363*** (df = 1; 524) 91.315*** (df = 2; 523)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Remember that to render the HTML output from stargazer you need to set `results='asis'` in your chunk options.
stargazer::stargazer(naive_model, less_naive_model,
type = "html")
Dependent variable: wage

| | (1) | (2) |
|---|---|---|
| educ | 0.541*** | 0.506*** |
| | (0.053) | (0.050) |
| female | | -2.273*** |
| | | (0.279) |
| Constant | -0.905 | 0.623 |
| | (0.685) | (0.673) |
| Observations | 526 | 526 |
| R2 | 0.165 | 0.259 |
| Adjusted R2 | 0.163 | 0.256 |
| Residual Std. Error | 3.378 (df = 524) | 3.186 (df = 523) |
| F Statistic | 103.363*** (df = 1; 524) | 91.315*** (df = 2; 523) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
naive_model_df[2,2] - less_naive_model_df[2,2]
## estimate
## 1 0.03490714
# Or, the same
coef(naive_model)["educ"] - coef(less_naive_model)["educ"]
## educ
## 0.03490714
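This gap is exactly what the omitted-variable-bias formula predicts: the short-model coefficient equals the long-model coefficient plus the effect of female on wage times the slope from regressing female on educ. A sketch to verify it (delta_1 is an illustrative name):

```r
# Slope from the auxiliary regression of the omitted variable on educ
delta_1 <- coef(lm(female ~ educ, data = wage1))["educ"]
# Bias term: should reproduce the difference between the two educ coefficients above
coef(less_naive_model)["female"] * delta_1
```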
Can you think of an additional observed confounding variable that could also have an effect on education and wages, and that you should include in your regression model?
Specify the model to include the additional variable you thought of, see your results in a tidy format using broom::tidy(), and include them in your stargazer() table (one possible specification is sketched below).
How would you interpret the results?
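As one possible illustration (not the only defensible choice), you could add years of experience as an extra control; extended_model is just an illustrative name:

```r
# One possible extension: also control for years of labour-market experience
extended_model <- lm(wage ~ educ + female + exper, data = wage1)
broom::tidy(extended_model)

# Compare all three specifications side by side
stargazer::stargazer(naive_model, less_naive_model, extended_model, type = "text")
```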