The Backdoor Criterion and Basics of Regression in R
Welcome
Introduction!
Welcome to our fourth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.
During this week's lecture you reviewed bivariate and multiple linear regressions. You also learned how Directed Acyclic Graphs (DAGs) can be leveraged to gather causal estimates.
In this lab session we will:
- Use the ggdag and dagitty packages to assess your modeling strategy
- Review how to run regression models using R
- Illustrate omitted variable and collider bias
Packages
# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.
set.seed(42) #for consistent results in randomization
library(wooldridge) # To get our example's dataset 
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To visualize data (this package is also loaded by library(tidyverse))
library(ggdag) # To dagify and plot our DAG objects in R
library(dagitty) # To perform analysis in our DAG objects in R
library(stargazer) # To render nicer regression output
data("wage1") # calls the wage1 dataset from the woorldridge packageWorking with DAGs in R
Last week we learned about the general syntax of the ggdag package:
- We created dagified objects with ggdag::dagify()
- We plotted our DAGs with ggdag::ggdag()
- We discussed how to specify the coordinates of our nodes with a coordinate list
Today, we will learn how the ggdag and dagitty packages can help us identify our paths and the adjustment sets needed to fulfill the backdoor criterion.
Let's take one of the DAGs from our review slides:
coord_dag <- list(
  x = c(d = 0, p = 0, b = 1, a = 1 , c = 2, y = 2),
  y = c(d = 0, p = 2, b = 1, a = -1, c = 2, y = 0)
)
our_dag <- ggdag::dagify(d ~ p + a, # p and a pointing at d
                         b ~ p + c, # p and c pointing at b
                         y ~ d + a + c, # d, a, and c pointing at y
                         coords = coord_dag, # our coordinates from the list up there
                         exposure = "d", # we declare out exposure variable
                         outcome = "y") # we declare out outcome variable
ggdag::ggdag(our_dag) + 
  theme_dag() # equivalent to theme_void()
Learning about our paths and what adjustments we need
As you have seen, when we dagify a DAG in R, a dagitty object is created. These objects tell R that we are dealing with DAGs.
This is very important because in addition to plotting them, we can do analyses on the DAG objects. A package that complements ggdag is the dagitty package.
Today, we will focus on two functions from the dagitty package:
- dagitty::paths(): Returns a list with two components: paths, which gives the actual paths, and open, which shows whether each path is open (d-connected) or closed (d-separated).
- dagitty::adjustmentSets(): Lists the sets of covariates that would allow for unbiased estimation of causal effects, assuming that the causal graph is correct.
We just need to input our DAG object.
Paths
Let's see what the output of the dagitty::paths() function looks like:
dagitty::paths(our_dag)
## $paths
## [1] "d -> y"                "d <- a -> y"           "d <- p -> b <- c -> y"
## 
## $open
## [1]  TRUE  TRUE FALSE
We see under $paths the three paths we declared during the manual exercise:
- d -> y
- d <- a -> y
- d <- p -> b <- c -> y
Additionally, $open tells us whether each path is open. In this case, we see that the second path is the only open non-causal path, so we would need to condition on a to close it.
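If you want to double-check this in code, dagitty::paths() also accepts a conditioning set through its Z argument. A minimal sketch, reusing the our_dag object from above:
dagitty::paths(our_dag, Z = "a") # with a in the conditioning set, only the causal path d -> y should remain open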
We can also use ggdag to present the open paths visually with the ggdag_paths() function, as such:
ggdag::ggdag_paths(our_dag) +
  theme_dag()
Covariate adjustment
In addition to listing all the paths and sorting the backdoors manually, we can use the dagitty::adjustmentSets() function.
With this function, we just need to input our DAG object and it will return the different sets of adjustments.
dagitty::adjustmentSets(our_dag)
## { a }
As we can see, in this DAG there is only one option: we need to control for a.
We can also use ggdag to present the adjustment sets visually with the ggdag_adjustment_set() function, as such:
Also, do not forget to set the argument shadow = TRUE, so that the arrows from the adjusted nodes are included.
ggdag::ggdag_adjustment_set(our_dag, shadow = T) +
  theme_dag()
If you want to learn more about DAGs in R
- ggdag documentation: https://ggdag.malco.io/
- dagitty reference manual: https://cran.r-project.org/web/packages/dagitty/dagitty.pdf
- What is dagitty?: https://cran.r-project.org/web/packages/dagitty/vignettes/dagitty4semusers.html
NOW LET'S MOVE TO REGRESSION
Introduction to Regression
Linear regression is largely used to predict the value of an outcome variable based on one or more input explanatory variables. As we previously discussed, regression addresses a simple mechanical problem, namely, what is our best guess of y given an observed x.
- Regression can be utilized without thinking about causes as a predictive or summarizing tool
- It would not be appropriate to give causal interpretations to any \(\beta\), unless we establish the fulfilment of certain assumptions
Bivariate regression
In bivariate regression, we are modeling a variable \(y\) as a mathematical function of one variable \(x\). We can generalize this in a mathematical equation as such:
\[y = \beta_{0} + \beta_{1}x + ϵ\]
Multiple linear regression
In multiple linear regression, we are modeling a variable \(y\) as a mathematical function of multiple variables \((x, z, m)\). We can generalize this in a mathematical equation as such:
\[y = \beta_{0} + \beta_{1}x + \beta_{2}z + \beta_{3}m + ϵ\]
Exploratory questions
Let's illustrate this with an example
We will use the wage1 dataset from the wooldridge package. These are data from the 1976 Current Population Survey, used by Jeffrey M. Wooldridge for pedagogical purposes in his book Introductory Econometrics.
If you want to check the contents of the wage1 data frame, you can type ?wage1 in your console.
Visualizing
With regression we can answer EXPLORATORY QUESTIONS. For example:
What is the relationship between education and respondents' salaries?
We can start by exploring the relationship visually with our newly attained ggplot2 skills:
ggplot(wage1, aes(x = educ, y = wage)) +
  geom_point(color = "grey60") +
  geom_smooth(method = "lm", se = F, color = "#CC0055") +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Hourly wage (USD)") 
The lm() function
This question can be formalized mathematically as:
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + ϵ\]
Our interest here would be to build a model that predicts the hourly wage of a respondent (our outcome variable) using the years of education (our explanatory variable). Fortunately for us, R provides us with a very intuitive syntax to model regressions.
The general syntax for running a regression model in R is the following:
your_model_biv <- lm(outcome_variable ~ explanatory_variable, data = your_dataset) # for a bivariate regression
your_model_mult <- lm(outcome_variable ~ explanatory_variable_1 + explanatory_variable_2, data = your_dataset) # for multiple regression
Now let's create our own model and save it into the model_1 object, based on the bivariate regression we specified above, in which wage is our outcome variable, educ is our explanatory variable, and our data come from the wage1 object:
model_1 <- lm(wage ~ educ, data = wage1)
summary() and broom::tidy()
We have created an object that contains the coefficients, standard errors and further information from your model. In order to see the estimates, you could use the base R function summary(). This function is very useful when you want to print your results in your console.
Alternatively, you can use the tidy() function from the broom package. The function constructs a data frame that summarizes the model's statistical findings. You can see what else you can do with broom by running vignette("broom"). The broom::tidy() function is useful when you want to store the values for future use (e.g., visualizing them).
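To illustrate that last point, here is a minimal sketch of how you could store the tidied estimates and visualize them (the model_1_tidy name is just an example):
model_1_tidy <- broom::tidy(model_1, conf.int = TRUE) # store the estimates with 95% confidence intervals
ggplot(model_1_tidy, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() + # point estimates with their confidence intervals
  geom_hline(yintercept = 0, linetype = "dashed") + # reference line at zero
  coord_flip() + # flip the axes so the coefficient names are easier to read
  theme_minimal() +
  labs(x = "", y = "Estimate (95% CI)")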
Let's try both options in the console up there. You just need to copy this code below the model_1 code.
summary(model_1)
broom::tidy(model_1)
Exercise
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + ϵ\]
| | Dependent variable: |
| | Hourly wage |
| Years of education | 0.541*** |
| | (0.053) |
| Constant | -0.905 |
| | (0.685) |
| Observations | 526 |
| R2 | 0.165 |
| Adjusted R2 | 0.163 |
| Residual Std. Error | 3.378 (df = 524) |
| F Statistic | 103.363*** (df = 1; 524) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
How would you interpret the results of our model_1?
- What does the constant mean?
- What does the educ coefficient mean?
- Do these coefficients carry any causal meaning?
Adding more nuance to our models
As we have discussed in previous sessions, we live in a very complex world. It is very likely that our exploration of the relationship between education and respondents' salaries is open to multiple sources of bias.
Looking back at the US in 1976, can you think of possible variables that could be in the mix?
How about the sex or the ethnicity of a worker?
Let's explore this visually
What is the relationship between education and respondents' salaries conditional on the sex of the worker?
ggplot(wage1, aes(x = educ, y = wage, color = as.factor(female))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Hourly wage (USD)",
       color = "Female") +
  theme(legend.position = "bottom")
Check what happens when we replace color = as.factor(female) with color = female.
- What insights can we gather from this graph?
Multiple linear regression
This question can be formalized mathematically as:
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + \beta_2Female + ϵ\]
Our interest here would be to build a model that predicts the hourly wage of a respondent (our outcome variable) using the years of education and their sex (our explanatory variables).
Let's remember the syntax for running a regression model in R:
your_model_biv <- lm(outcome_variable ~ explanatory_variable, data = your_dataset) # for a bivariate regression
your_model_mult <- lm(outcome_variable ~ explanatory_variable_1 + explanatory_variable_2, data = your_dataset) # for multiple regression
Now let's create our own model, save it into the model_2 object, and print the results, based on the formula we specified above, in which wage is our outcome variable, educ and female are our explanatory variables, and our data come from the wage1 object:
model_2 <- lm(wage ~ educ + female, data = wage1)
summary(model_2)
## 
## Call:
## lm(formula = wage ~ educ + female, data = wage1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9890 -1.8702 -0.6651  1.0447 15.4998 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.62282    0.67253   0.926    0.355    
## educ         0.50645    0.05039  10.051  < 2e-16 ***
## female      -2.27336    0.27904  -8.147 2.76e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.186 on 523 degrees of freedom
## Multiple R-squared:  0.2588, Adjusted R-squared:  0.256 
## F-statistic: 91.32 on 2 and 523 DF,  p-value: < 2.2e-16
Exercise
\[Hourly\ wage = \beta_0 + \beta_1Years\ of\ education + \beta_2Female + ϵ\]
| | Dependent variable: |
| | Hourly wage |
| Years of education | 0.506*** |
| | (0.050) |
| Female | -2.273*** |
| | (0.279) |
| Constant | 0.623 |
| | (0.673) |
| Observations | 526 |
| R2 | 0.259 |
| Adjusted R2 | 0.256 |
| Residual Std. Error | 3.186 (df = 523) |
| F Statistic | 91.315*** (df = 2; 523) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
How would you interpret the results of our model_2?
- What does the constant mean?
- What does the educ coefficient mean?
- What does the female coefficient mean?
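By the way, regression tables like the ones shown in these exercises can be rendered with the stargazer package we loaded at the beginning. A minimal sketch (the labels are just illustrative; type = "text" prints to the console, while "html" or "latex" are meant for knitted documents):
stargazer(model_1, model_2, type = "text", # print both models side by side
          dep.var.labels = "Hourly wage",
          covariate.labels = c("Years of education", "Female"))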
Predicting from our models
As we discussed previously, when we do not have our causal inference hats on, the main goal of linear regression is to predict an outcome value on the basis of one or multiple predictor variables.
R has a generic function predict() that helps us arrive at the predicted values on the basis of our explanatory variables.
The syntax of predict() is the following:
predict(name_of_the_model, newdata = data.frame(explanatory1 = value, explanatory2 = value))
Say that based on our model_2, we are interested in the expected average hourly wage of a woman with 15 years of education.
predict(model_2, newdata = data.frame(educ = 15, female = 1))
##        1 
## 5.946237
- What does this result tell us?
- What happens when you change female to 0? What does the result mean?
- Can you think of a way to find the difference in the expected hourly wage between a male with 16 years of education and a female with 17?
predict(model_2, newdata = data.frame(educ = 16, female = 0)) - predict(model_2, newdata = data.frame(educ = 15, female = 0))
##         1 
## 0.5064521
Quiz
Here are some questions for you. Note that there are multiple ways to reach the same answer (a sketch for checking your answers with predict() follows the quiz):
What is the expected hourly wage of a male with 15 years of education?
- $8.22
- $9.50
- $5.34
- $3
How much more on average does a male worker earn than a female counterpart?
- $2.27
- In our data, males on average earn less than females
- $1.20
- $4.50
How much more is a worker expected to earn for every additional year of education, keeping sex constant?
- $0.90
- $1.20
- $0.5
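One possible way to check your answers is to reuse predict() and the stored coefficients; a quick sketch (small differences are due to rounding):
predict(model_2, newdata = data.frame(educ = 15, female = 0)) # expected hourly wage of a male with 15 years of education
coef(model_2)["female"] # the female-male wage gap, holding education constant (note the sign)
coef(model_2)["educ"] # expected change in wage for one additional year of education, holding sex constant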
DAGs and modeling

As we can remember from our slides, we were introduced to a set of key rules for understanding how to employ DAGs to guide our modeling strategy (the short sketch after this list checks two of them in code):
- A path is open or unblocked at non-colliders (confounders or mediators)
- A path is (naturally) blocked at colliders
- An open path induces statistical association between two variables
- Absence of an open path implies statistical independence
- Two variables are d-connected if there is an open path between them
- Two variables are d-separated if the path between them is blocked
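We can verify a couple of these rules on the our_dag object from the first part of the tutorial. A minimal sketch using dagitty's d-separation helpers:
dagitty::dseparated(our_dag, "p", "c") # TRUE: the only path between p and c runs through the collider b, so it is naturally blocked
dagitty::dconnected(our_dag, "p", "c", "b") # TRUE: conditioning on the collider b opens that path and d-connects p and c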
In this portion of the tutorial we will demonstrate how different biases come into play when we model our relationships of interest.
What happens when we control for a collider?
The case of beauty, talent, and celebrity

As our DAG showcases, we assume that earning celebrity status is a function of an individual's beauty and talent.
We will simulate data that reflect this assumption. In our world, someone gains celebrity status if the sum of their units of beauty and talent is greater than 8.
# beauty - 1000 observations with mean 5 units of beauty and sd 1.5 (arbitrary scale)
beauty <- rnorm(1000, 5, 1.5)
# talent - 1000 observations with mean 3 units of talent and sd 1 (arbitrary scale)
talent <- rnorm(1000, 3, 1)
# celebrity - binary
celebrity_status <- ifelse(beauty + talent > 8, "Celebrity", "Not Celebrity") # celebrity if the sum of units is greater than 8
celebrity_df <- dplyr::tibble(beauty, talent, celebrity_status) # we make a df with our values
head(celebrity_df, 10)
## # A tibble: 10 x 3
##    beauty talent celebrity_status
##     <dbl>  <dbl> <chr>           
##  1   7.06 5.33   Celebrity       
##  2   4.15 3.52   Not Celebrity   
##  3   5.54 3.97   Celebrity       
##  4   5.95 3.38   Celebrity       
##  5   5.61 2.00   Not Celebrity   
##  6   4.84 2.40   Not Celebrity   
##  7   7.27 3.17   Celebrity       
##  8   4.86 0.0715 Not Celebrity   
##  9   8.03 2.15   Celebrity       
## 10   4.91 3.80   Celebrity
In this case, as our simulation suggests, we have a collider structure. We can see that celebrity status is a function of beauty and talent. Also, we can infer from the way we defined the variables that beauty and talent are d-separated (i.e., the path between them is closed because celebrity is a collider).
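Before we challenge the literature, we can run a quick sanity check on the simulated data. A minimal sketch: the marginal correlation between beauty and talent should be close to zero (they are d-separated), while conditioning on the collider, here by splitting the data by celebrity status, should induce a negative correlation within each group.
cor(celebrity_df$beauty, celebrity_df$talent) # marginal correlation: should be close to zero
celebrity_df %>%
  dplyr::group_by(celebrity_status) %>%
  dplyr::summarise(cor_beauty_talent = cor(beauty, talent)) # within each group the correlation turns negative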
Say you are interested in researching the relationship between beauty and talent for your Master's thesis. While doing your literature review, you encounter a series of papers that find a negative relationship between the two and state that more beautiful people tend to be less talented. The model that these research teams used was the following:
\[Y_{Talent} = \beta_0 + \beta_1Beauty + \beta_2Celebrity\]
Your scientific hunch makes you believe that celebrity is a collider and that, by controlling for it in their models, the researchers are inducing collider bias, or endogenous bias. You decide to move forward with your thesis by laying out a criticism of previous work in the field, given that you consider the formalization of their models erroneous. You use the same data the previous papers used but, based on your logic, do not control for celebrity status. This is what you find:
True model
true_model_celebrity <- lm(talent ~ beauty, data = celebrity_df)
summary(true_model_celebrity)
## 
## Call:
## lm(formula = talent ~ beauty, data = celebrity_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9225 -0.6588 -0.0083  0.6628  3.5877 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.962209   0.107595  27.531   <2e-16 ***
## beauty      0.006545   0.020755   0.315    0.753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9865 on 998 degrees of freedom
## Multiple R-squared:  9.964e-05,  Adjusted R-squared:  -0.0009023 
## F-statistic: 0.09945 on 1 and 998 DF,  p-value: 0.7526
ggplot(celebrity_df, aes(x=beauty, 
                         y=talent)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Beauty",
       y = "Talent")
Biased model from previous literature
Let's see:
biased_model_celebrity <- lm(talent ~ beauty + celebrity_status, data = celebrity_df)
summary(biased_model_celebrity)
## 
## Call:
## lm(formula = talent ~ beauty + celebrity_status, data = celebrity_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4244 -0.5394  0.0110  0.5064  2.9429 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    5.37834    0.13983   38.46   <2e-16 ***
## beauty                        -0.32668    0.02265  -14.43   <2e-16 ***
## celebrity_statusNot Celebrity -1.51375    0.06808  -22.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.807 on 997 degrees of freedom
## Multiple R-squared:  0.3316, Adjusted R-squared:  0.3302 
## F-statistic: 247.3 on 2 and 997 DF,  p-value: < 2.2e-16
ggplot(celebrity_df, aes(x=beauty, y=talent, color = celebrity_status)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Beauty",
       y = "Talent",
       color = "")
As we can see, by controlling for a collider, the previous literature was inducing a non-existent association between beauty and talent, also known as collider or endogenous bias.
What happens when we fail to control for a confounder?
Shoe size and salary

# sex - replicate male and female 500 times each
sex <- rep(c("Male", "Female"), each = 500) 
# shoe size - random number with mean 38 and sd 2, plus 4 if the observation is male
shoesize <- rnorm(1000, 38, 2) +  (4 * as.numeric(sex == "Male"))
# salary - a random number with mean 25 and sd 2, plus 5 if the observation is male
salary <- rnorm(1000, 25, 2) + (5 * as.numeric(sex == "Male"))
salary_df <- dplyr::tibble(sex, shoesize, salary)
head(salary_df, 10)
## # A tibble: 10 x 3
##    sex   shoesize salary
##    <chr>    <dbl>  <dbl>
##  1 Male      42.5   28.6
##  2 Male      41.4   28.4
##  3 Male      38.6   29.2
##  4 Male      38.0   27.7
##  5 Male      39.4   32.2
##  6 Male      42.7   28.2
##  7 Male      41.7   32.6
##  8 Male      40.5   27.1
##  9 Male      40.4   31.7
## 10 Male      43.1   28.8
Say now one of your peers tells you about this new study that suggests that shoe size has an effect on an individual's salary. You are a bit skeptical and read it. The model that these researchers apply is the following:
\[Y_{Salary} = \beta_0 + \beta_1ShoeSize\]
Your scientific hunch makes you believe that this relationship could be confounded by the sex of the respondent. You think that by failing to control for sex in their models, the researchers are inducing omitted variable bias. You decide to open their replication files and control for sex. This is what you find:
\[Y_{Salary} = \beta_0 + \beta_1ShoeSize + \beta_2Sex\]
True model
true_model_salary <- lm(salary ~ shoesize + sex, data = salary_df)
summary(true_model_salary)
## 
## Call:
## lm(formula = salary ~ shoesize + sex, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2341 -1.3698 -0.0501  1.3595  6.4303 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.73879    1.15886  22.210   <2e-16 ***
## shoesize    -0.02030    0.03044  -0.667    0.505    
## sexMale      5.05924    0.17616  28.720   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.981 on 997 degrees of freedom
## Multiple R-squared:  0.6129, Adjusted R-squared:  0.6121 
## F-statistic: 789.2 on 2 and 997 DF,  p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(x = "Shoe size",
       y = "Salary",
       color = "")
Biased model from previous literature
biased_model_salary <- lm(salary ~ shoesize, data = salary_df)
summary(biased_model_salary)
## 
## Call:
## lm(formula = salary ~ shoesize, data = salary_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8777 -1.9101 -0.0511  1.8496  7.9774 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.68865    1.17280   3.145  0.00171 ** 
## shoesize     0.59429    0.02925  20.319  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.676 on 998 degrees of freedom
## Multiple R-squared:  0.2926, Adjusted R-squared:  0.2919 
## F-statistic: 412.9 on 1 and 998 DF,  p-value: < 2.2e-16
ggplot(salary_df, aes(x=shoesize, y=salary)) +
  geom_point() +
  geom_smooth(method = "lm", se = F) +
  theme_minimal() +
  labs(x = "Shoe size",
       y = "Salary")
As we can see, by failing to control for a confounder, the previous literature was creating a non-existent association between shoe size and salary, incurring omitted variable bias.