Welcome

In this week's lecture we reviewed bivariate and multiple linear regression. You also learned how these models relate to the Potential Outcomes Framework and how, under certain conditions, they can be used for causal inference.

In this lab session we will:

  • Take a step back to review how to create tables and perform t-tests in R
  • Learn to run regression models using R
  • Learn how to present your results in neat tables

We will use a package called wooldridge that contains dozens of datasets used by Jeffrey M. Wooldridge for pedagogical purposes in his book Introductory Econometrics.

# Load packages. Install them first, in case you don't have them yet, e.g.:
# install.packages(c("wooldridge", "broom", "stargazer", "tidyverse", "knitr", "kableExtra"))

library(wooldridge) # To get our example's dataset 
library(broom) # To format regression output 
library(stargazer) # To format regression tables 
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(knitr) # To create tables with kable()
library(kableExtra) # To format tables with kable_styling()

Now let's load and take a look at our data.

data("wage1")

?wage1 # Use the help function to explore what the variables in the dataset contain.

# You can also get a general overview:
str(wage1)
head(wage1)

# Or look at the whole dataset by running View(wage1) in your console.


Review: balance tables and t-tests

Let's take this dataset as an example to review how to perform a t-test in R (task 4 of your first assignment).

Remember, the idea of using a t-test here is to check the statistical significance of a difference in means. In other words, we want to know how likely it is that the observed difference between the mean values of a variable in two groups or samples is due to random chance (our null hypothesis).

In this case, let's imagine we want to check the statistical significance of the difference in means of years of education (interval variable) between men and women (categorical) in our dataset.

The general syntax for a t-test is as follows, where y is the numeric variable whose means you want to compare and x is the variable defining the two groups:

t.test(y ~ x, data = your_data)

Is the difference in average years of education between men and women statistically significant?

t.test(educ ~ female, data = wage1)
## 
##  Welch Two Sample t-test
## 
## data:  educ by female
## t = 1.9693, df = 517.81, p-value = 0.04946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.001124132 0.940597569
## sample estimates:
## mean in group 0 mean in group 1 
##        12.78832        12.31746
# Or, equivalently, passing the two group vectors directly:
t.test(wage1$educ[wage1$female==1],wage1$educ[wage1$female == 0])
## 
##  Welch Two Sample t-test
## 
## data:  wage1$educ[wage1$female == 1] and wage1$educ[wage1$female == 0]
## t = -1.9693, df = 517.81, p-value = 0.04946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.940597569 -0.001124132
## sample estimates:
## mean of x mean of y 
##  12.31746  12.78832

The p-value of about 0.049 means we can, just barely, reject the null hypothesis of equal means at the 5% level.

Now let's take a look at how we can create a simple, good-looking balance table. The idea here is to compare the mean values of different variables across populations or groups. In our case, let's see whether men and women differ in average wages, years of education, years of experience, and number of years in their current job.

Here is one way to create the table.

wage1 %>% 
  dplyr::group_by(female) %>% 
  dplyr::select(wage, educ, exper, tenure) %>% 
  dplyr::summarise_all(mean) %>% # apply mean() to all the variables we selected 
  t() %>% # transpose the data frame 
  as.data.frame() %>%  
  tibble::rownames_to_column("Variable") %>% # (replaces the deprecated add_rownames()) 
  dplyr::filter(Variable != "female") %>% # drop the redundant grouping variable 
  dplyr::rename(Male = "1", Female = "2") %>% # give the columns meaningful names 
  dplyr::mutate(Difference = Female - Male) %>% 
  dplyr::mutate_if(is.numeric, round, 3) %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling()
Variable     Male    Female   Difference
wage        7.099     4.588       -2.512
educ       12.788    12.317       -0.471
exper      17.558    16.429       -1.130
tenure      6.474     3.615       -2.859

Your turn

  • Create a table to see if white and non-white respondents in the survey are balanced in terms of education levels and job experience.
wage1 %>% 
  dplyr::group_by(nonwhite) %>% 
  dplyr::select(educ, exper) %>% 
  dplyr::summarise_all(mean) %>% # apply mean() to all the variables we selected 
  t() %>% # transpose the data frame 
  as.data.frame() %>%  
  tibble::rownames_to_column("Variable") %>% # (replaces the deprecated add_rownames()) 
  dplyr::filter(Variable != "nonwhite") %>% # drop the redundant grouping variable 
  dplyr::rename(White = "1", Non_white = "2") %>% # give the columns meaningful names 
  dplyr::mutate(Difference = Non_white - White) %>% 
  dplyr::mutate_if(is.numeric, round, 3) %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling()
Variable    White   Non_white   Difference
educ       12.642      11.870       -0.772
exper      16.951      17.593        0.641
  • Run a t-test to check the statistical significance of the differences in means that you observe.
t.test(educ ~ nonwhite, data = wage1)
## 
##  Welch Two Sample t-test
## 
## data:  educ by nonwhite
## t = 1.6471, df = 61.233, p-value = 0.1047
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1650514  1.7082090
## sample estimates:
## mean in group 0 mean in group 1 
##        12.64195        11.87037
t.test(exper ~ nonwhite, data = wage1)
## 
##  Welch Two Sample t-test
## 
## data:  exper by nonwhite
## t = -0.32877, df = 65.737, p-value = 0.7434
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.536204  3.253562
## sample estimates:
## mean in group 0 mean in group 1 
##        16.95127        17.59259
# See the relationship with lm(): regressing a variable on a binary indicator
# reproduces the difference in means as the slope coefficient
lm(exper ~ nonwhite, data = wage1) %>% summary()
## 
## Call:
## lm(formula = exper ~ nonwhite, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.593 -11.951  -3.772   9.049  33.407 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.9513     0.6252  27.112   <2e-16 ***
## nonwhite      0.6413     1.9514   0.329    0.743    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.58 on 524 degrees of freedom
## Multiple R-squared:  0.0002061,  Adjusted R-squared:  -0.001702 
## F-statistic: 0.108 on 1 and 524 DF,  p-value: 0.7426

Bivariate linear regression


Imagine that, using the wage1 dataset, you are interested in exploring the relationship between education and people's salaries. You want to estimate the causal effect that one more year of education (educ) has on people's average hourly wage (wage).

You start by exploring their relationship:

plot(wage1$educ, wage1$wage)

cor(wage1$educ, wage1$wage)
## [1] 0.4059033



As we saw in the lecture, you can resort to regression to address this question using observational data.

So you formalize the relationship you're after in a first, short model:

\[wage = \alpha + \kappa education + u\]


Remember from the lecture (slide 18) that we can relate a regression equation to the potential outcomes framework as follows (a quick numerical check comes after the list).


  • Kappa, the regression coefficient for our independent variable education, is equal to the Naïve Average Treatment Effect or NATE.

\[\kappa = E[Y_1|D=1] - E[Y_0|D=0]\]

  • The constant term (intercept) in our regression equation is equal to the expected outcome when not treated, for those untreated.

\[\alpha = E[Y_0|D=0] \]

  • The sum of the constant term plus the regression coefficient in our bivariate model is equal to the expected outcome when treated, for those treated.

\[\alpha + \kappa = E[Y_1|D=1] \]
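
To make this correspondence concrete, here is a minimal sketch that uses the binary female variable as a stand-in 'treatment', purely for illustration (the object name pof_check is made up; lm() itself is introduced properly below):

# With a binary 'treatment', the intercept equals the mean outcome among
# the untreated, and intercept + coefficient equals the mean among the treated
pof_check <- lm(wage ~ female, data = wage1)

coef(pof_check)["(Intercept)"]       # alpha
mean(wage1$wage[wage1$female == 0])  # the same number

sum(coef(pof_check))                 # alpha + kappa
mean(wage1$wage[wage1$female == 1])  # the same number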

Now that we know what we are after, how do we get these estimates in R?

Running a linear model in R

The general syntax for running a regression model in R is the following:

your_model <- lm(dependent_variable ~ independent_variable, data = your_dataset)

Now try to create an object with your model, based on the bivariate regression equation we specified above, in which wage is our outcome of interest and educ is our independent or 'treatment' variable.

naive_model <- lm(wage ~ educ, data = wage1)

You have created an object that contains all the coefficients, standard errors, and other information from your model. To see the estimates, you could use the base R function summary(), but base R is not the best at visualizing regression output.

The tidy() function from the broom package, by contrast, constructs a data frame that summarizes the model's statistical findings. You can see what else you can do with broom by running vignette("broom").

See the difference:

summary(naive_model)
## 
## Call:
## lm(formula = wage ~ educ, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3396 -2.1501 -0.9674  1.1921 16.6085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.90485    0.68497  -1.321    0.187    
## educ         0.54136    0.05325  10.167   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.378 on 524 degrees of freedom
## Multiple R-squared:  0.1648, Adjusted R-squared:  0.1632 
## F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
broom::tidy(naive_model)
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -0.905    0.685      -1.32 1.87e- 1
## 2 educ           0.541    0.0532     10.2  2.78e-22
# You can save the tidied output into an object so you can easily pull the values later

naive_model_df <- broom::tidy(naive_model)

# Remember magrittr can make your life easier

naive_model_df <- lm(wage ~ educ, data = wage1) %>% broom::tidy()
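
For instance, to pull a single coefficient out of the tidied data frame later on, here is a minimal sketch using dplyr verbs (already loaded via the tidyverse):

# Extract the educ coefficient from the tidied output
naive_model_df %>% 
  dplyr::filter(term == "educ") %>% 
  dplyr::pull(estimate)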

How would you interpret these results?

  • We can see that the coefficient for educ (our kappa, our NATE) suggests that one more year of education is associated with a $0.54 increase in hourly wages.

  • You can also see the value of the intercept (alpha, or E[Y_0|D = 0]) and the significance levels of our coefficients.

  • What is the difference in expected hourly wages between a person who studied up to high school (13 years) and a person with a college degree (17 years)?

# Different ways to get the same result:

(coef(naive_model)[1] + coef(naive_model)["educ"] * 17) - 
(coef(naive_model)[1] + coef(naive_model)["educ"] * 13)
## (Intercept) 
##    2.165437
predict(naive_model, newdata = data.frame(educ = 17)) -
predict(naive_model, newdata = data.frame(educ = 13))
##        1 
## 2.165437
coef(naive_model)["educ"] * (17 - 13)
##     educ 
## 2.165437
  • What is the expected hourly wage for someone with 25 years of education?
predict(naive_model, newdata = data.frame(educ = 25))
##        1 
## 12.62913
coef(naive_model)[1] + coef(naive_model)["educ"] * 25
## (Intercept) 
##    12.62913
  • Under this model specification, what percentage of the variation in wage is explained by years of education?
summary(naive_model)$r.squared
## [1] 0.1647575

Multiple regression

Looking for the ATE.

But you know that this survey data does not fulfill the assumptions needed to infer causality with much confidence. It is highly likely that we are picking up the effect of variables that affect both Y (wages) and D (levels of education). People do not get more education by random chance: there can be different causes that lead people both to get more education and to earn more. That is, you suspect that educ is not independent of the error term.

After a good deal of reading about the gender pay gap, you suspect that gender may be one of those confounding variables. So you decide to include the female variable in the model.

Now your model looks like this:

\[wage = \alpha^l + \kappa^l education + \beta gender + u\]

Our interpretation of the regression equation in POF terms will not change much. Kappa (long) is still the coefficient we are most interested in: the treatment effect of one more year of educ on wage.

But remember that multiple regression only uses the unique variation in each regressor that is not explained by the other regressors, so including an additional term (female), if it is associated with both education and wage, will affect our treatment effect estimate.
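
You can check this 'unique variation' logic directly with a short residual-based sketch (this is the Frisch-Waugh-Lovell result; the object name educ_resid is made up). The slope you get here should match the educ coefficient from the multiple regression you are about to run:

# Strip out the part of educ that is explained by female...
educ_resid <- resid(lm(educ ~ female, data = wage1))

# ...and regress wage on the remaining 'unique' variation in educ.
# The slope equals the educ coefficient from lm(wage ~ educ + female)
coef(lm(wage1$wage ~ educ_resid))["educ_resid"]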

Let's see: run a multiple regression using the model above. The general syntax for it is this:

your_model <- lm(dependent_variable ~ independent_variable1 + independent_variable2, data = your_dataset)

less_naive_model <- lm(wage ~ educ + female, data = wage1)

less_naive_model_df <- broom::tidy(less_naive_model)

How would you interpret these results now?

  • What does the coefficient for the female variable tell us?

  • What is the expected wage for a woman with 20 years of education?

  • What is the difference if a man had the same amount of education? (See the sketch below.)
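
One way to check your answers to the last two questions, as a sketch using predict() on the model above (the scenario values, 20 years of education and female = 1 or 0, come straight from the questions):

# Expected wage for a woman (female = 1) with 20 years of education
predict(less_naive_model, newdata = data.frame(educ = 20, female = 1))

# For a man with the same education; the difference between the two
# predictions is simply the coefficient on female
predict(less_naive_model, newdata = data.frame(educ = 20, female = 0))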

Observing biases

In order to see how biased the estimate from our more naïve or 'short' model was, we can compare the regression coefficients between our two models. Let's use the stargazer package to do so.

The function, also called stargazer(), allows you to see the results of different models side by side. The general syntax is:

stargazer::stargazer(name_of_model_1, name_of_model_2, type = "format of output") 

# Try out "text" and "html" as outputs

You can define the format and the information included in your tables very precisely with stargazer; check the package documentation and vignettes for examples.

  • Create a stargazer table with both models. Try out using "text" and "html" for the output.
stargazer::stargazer(naive_model, less_naive_model, 
                     type = "text")
## 
## ====================================================================
##                                   Dependent variable:               
##                     ------------------------------------------------
##                                           wage                      
##                               (1)                      (2)          
## --------------------------------------------------------------------
## educ                        0.541***                0.506***        
##                             (0.053)                  (0.050)        
##                                                                     
## female                                              -2.273***       
##                                                      (0.279)        
##                                                                     
## Constant                     -0.905                   0.623         
##                             (0.685)                  (0.673)        
##                                                                     
## --------------------------------------------------------------------
## Observations                  526                      526          
## R2                           0.165                    0.259         
## Adjusted R2                  0.163                    0.256         
## Residual Std. Error     3.378 (df = 524)        3.186 (df = 523)    
## F Statistic         103.363*** (df = 1; 524) 91.315*** (df = 2; 523)
## ====================================================================
## Note:                                    *p<0.1; **p<0.05; ***p<0.01

Remember that to render the HTML output from stargazer you need to set results='asis' in your chunk options.

stargazer::stargazer(naive_model, less_naive_model, 
                     type = "html")
=====================================================================
                                   Dependent variable: wage
                          -------------------------------------------
                                  (1)                     (2)
---------------------------------------------------------------------
educ                            0.541***                0.506***
                                (0.053)                 (0.050)
female                                                 -2.273***
                                                        (0.279)
Constant                        -0.905                   0.623
                                (0.685)                 (0.673)
---------------------------------------------------------------------
Observations                      526                     526
R2                               0.165                   0.259
Adjusted R2                      0.163                   0.256
Residual Std. Error         3.378 (df = 524)        3.186 (df = 523)
F Statistic            103.363*** (df = 1; 524)  91.315*** (df = 2; 523)
=====================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01
  • How biased was the estimate from our first model compared to our second model?
naive_model_df[2,2] - less_naive_model_df[2,2]
##     estimate
## 1 0.03490714
# Or, the same
coef(naive_model)["educ"] - coef(less_naive_model)["educ"]
##       educ 
## 0.03490714
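
This difference is exactly what the standard omitted-variable-bias decomposition predicts: the bias equals the long-model coefficient on the omitted variable times the coefficient from the auxiliary regression of that variable on educ. A minimal sketch of that check (delta is a made-up name):

# Auxiliary regression: how is the omitted variable related to educ?
delta <- coef(lm(female ~ educ, data = wage1))["educ"]

# Bias = (long-model coefficient on female) * delta
coef(less_naive_model)["female"] * delta # should match the ~0.0349 above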

Your turn

  • Can you think of additional observed confounding variables that could affect both education and wages and that you should include in your regression model?

  • Specify the model to include an additional variable you thought of, inspect your results in a tidy format using broom::tidy(), and include them in your stargazer() table. (If you get stuck, see the sketch below.)

  • How would you interpret the results?
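
If you get stuck, here is one possible sketch, using exper (years of labor-market experience) as the additional control; the model name third_model is made up:

# Add experience as a further control and tidy the output
third_model <- lm(wage ~ educ + female + exper, data = wage1)
broom::tidy(third_model)

# Compare all three models side by side
stargazer::stargazer(naive_model, less_naive_model, third_model,
                     type = "text")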