Welcome to our fifth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.
During this week’s lecture you reviewed randomization in experimental setups. You also learned how matching can be leveraged to gather causal estimates.
In this lab session we will:
MatchIt
packagegml()
# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To create plots (this package is also loaded by library(tidyverse))
library(purrr) # To repeat code across our list in the balance table purrr::map() (this package is also loaded by library(tidyverse))
library(broom) # To format regression output
library(stargazer) # To format model output
library(knitr) # To create HTML tables with kable()
library(kableExtra) # To format the HTML tables
library(MatchIt) # To match
Today we will work with data from the Early Childhood Longitudinal Study (ECLS).
<-
ecls_df ::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%204/data/ecls.csv") %>%
readr::select(-childid, -race, -w3daded,
dplyr-w3momed, -w3inccat) #drop these columns (-)
names(ecls_df) #checking the names of the variables in the data frame
## [1] "catholic" "race_white" "race_black" "race_hispanic"
## [5] "race_asian" "p5numpla" "p5hmage" "p5hdage"
## [9] "w3daded_hsb" "w3momed_hsb" "w3momscr" "w3dadscr"
## [13] "w3income" "w3povrty" "p5fstamp" "c5r2mtsc"
## [17] "c5r2mtsc_std"
(Example inspired by Simon Ejdemyr: https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html)
MatchIt
: https://cran.r-project.org/web/packages/MatchIt/vignettes/matchit.pdfCobalt
: (optional library for matching plots and extra features): https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt_A0_basic_use.htmlkableExtra
: (for formatting tables): https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.htmlStargazer
: (for formatting model output): https://www.jakeruss.com/cheatsheets/stargazer/As you may have seen in this week’s application paper, balance tables feature prominently in work that utilizes matching.
In binary treatment setups, balance tables present an overview of the difference in means for the groups across covariates.
Let’s take the dataset as an example to review how to compare differences in means and build balance tables in R. There are multiple ways to explore these kinds of questions. In this lab we will leverage t-tests to check the statistical significance of the difference in means.
In other words, we want to know whether the observed differences in average value of a variable between the two groups or samples can be due to random chance (our null hypothesis).
In this case, let’s imagine we want to check the statistical significance of the differences in the composition of the catholic and public school samples in the w3povrty
(under the poverty line) variable.
The general syntax for a t-test is simply as follows. The vectors refer to the variables whose mean you want to compare.
t.test(y ~ x, data = your_data)
t.test(w3povrty ~ catholic, data = ecls_df)
##
## Welch Two Sample t-test
##
## data: w3povrty by catholic
## t = 26.495, df = 4034.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## 0.1678107 0.1946307
## sample estimates:
## mean in group 0 mean in group 1
## 0.21918633 0.03796562
Now let’s take a look at how can we create a simple and good looking balance table. The idea here is to compare the mean values of different variables across populations or groups. In our case, let’s see whether the catholic and public school groups differ across the covariates in the dataset:
Here is one way to create the table (you can adapt this code for the assignment).
# create a list with the covariates
<- c("race_white", "race_black", "race_hispanic", "race_asian", "p5numpla",
list_cov "p5hmage", "p5hdage", "w3daded_hsb", "w3momed_hsb", "w3momscr", "w3dadscr",
"w3income", "w3povrty", "p5fstamp", "c5r2mtsc", "c5r2mtsc_std")
%>% # our data frame
ecls_df ::summarize_at(list_cov, funs(list(broom::tidy(t.test(. ~ catholic))))) %>% # sequentially run t-tests across all the covariates in the list_cov (note that you have to change the "treatment")
dplyr::map(1) %>% # maps into a list
purrr::bind_rows(.id='variables') %>% # binds list into a single data frame and names the id column "variables"
dplyr::select(variables, estimate1, estimate2, p.value) %>% # select only the names, group means, and p-values
dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric variables to three places
dplyr::kable(col.names = c("Variable", "Control (Catholic = 0)", "Treat (Catholic = 1)", "P value")) %>% # create kable table and rename headings
knitr::kable_styling() # style kable table for our knitted document kableExtra
Variable | Control (Catholic = 0) | Treat (Catholic = 1) | P value |
---|---|---|---|
race_white | 0.556 | 0.725 | 0.000 |
race_black | 0.136 | 0.054 | 0.000 |
race_hispanic | 0.181 | 0.131 | 0.000 |
race_asian | 0.066 | 0.052 | 0.025 |
p5numpla | 1.133 | 1.093 | 0.000 |
p5hmage | 37.561 | 39.575 | 0.000 |
p5hdage | 40.392 | 41.988 | 0.000 |
w3daded_hsb | 0.488 | 0.259 | 0.000 |
w3momed_hsb | 0.464 | 0.227 | 0.000 |
w3momscr | 43.114 | 47.492 | 0.000 |
w3dadscr | 42.713 | 46.356 | 0.000 |
w3income | 54889.159 | 82074.301 | 0.000 |
w3povrty | 0.219 | 0.038 | 0.000 |
p5fstamp | 0.129 | 0.015 | 0.000 |
c5r2mtsc | 50.209 | 52.389 | 0.000 |
c5r2mtsc_std | -0.031 | 0.194 | 0.000 |
In this tutorial we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use matching to get more credible causal estimates of Catholic schooling.
%>%
ecls_df ::group_by(catholic) %>%
dplyr::summarize(n_students = n(),
dplyravg_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>%
round(3) %>% # round the results
::kable() %>% # create kable table
knitr::kable_styling() # view kable table kableExtra
catholic | n_students | avg_math | std_error |
---|---|---|---|
0 | 9568 | -0.031 | 0.010 |
1 | 1510 | 0.194 | 0.022 |
We can see that we have many more students that did not attend Catholic school than those who did. Also, the Catholic school students have a higher average math score.
We can naively compare the students on their standardized math scores (c5r2mtsc_std). As you remember, the Naive Average Treatment Effect (NATE) is the difference in the means of the observed outcomes of the two groups:
\[NATE = E(y^1 | D = 1) - E(y^0 | D = 0)\]
In this case, the NATE would be difference between the average math scores.
Do you remember what we did in the last section?
We could subtract the avg_math
for the catholic = 1 and the avg_math
for the catholic = 0
\[NATE = 0.194 - (-0.031)\] \[NATE = 0.194 + 0.031 = 0.225\]
t.test()
t.test(c5r2mtsc_std ~ catholic, data = ecls_df)
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = -9.1069, df = 2214.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.2727988 -0.1761292
## sample estimates:
## mean in group 0 mean in group 1
## -0.03059583 0.19386817
lm()
lm(c5r2mtsc_std ~ catholic, data = ecls_df) %>%
::stargazer(.,type = "text") stargazer
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic 0.224***
## (0.028)
##
## Constant -0.031***
## (0.010)
##
## -----------------------------------------------
## Observations 11,078
## R2 0.006
## Adjusted R2 0.006
## Residual Std. Error 0.997 (df = 11076)
## F Statistic 66.096*** (df = 1; 11076)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
MatchIt
MatchIt
is designed for causal inference with a dichotomous treatment variable and a set of pretreatment control variables. Any number or type of dependent variables can be used.
We have three steps:
MatchIt::matchit()
MatchIt::match.data()
or MatchIt::get_matches()
The basic syntax is as follows:
<- MatchIt::matchit(treatment ~ x1 + x2, data = mydata) # NOTE. We include treatment ~ covariates
match_process <- MatchIt::get_matches(match_process) #when matching with replacement
matched_df <- MatchIt::match.data(match_process) #for other cases
matched_df <- lm(outcome ~ trearment, data = matched_df) matched_model
where treat is the dichotomous treatment variable, and x1 and x2 are pre-treatment covariates, all of which are contained in the data frame mydata.
As you can see, the outcome variable is not included in the matching procedure.
MatchIt
is capable of using several matching methods:
Exact (method = “exact”): The simplest version of matching is exact. This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.
Subclassification (method = “subclass”): When there are many covariates (or some covariates can take a large number of values), finding sufficient exact matches will often be impossible. The goal of subclassification is to form subclasses, such that in each of them the distribution (rather than the exact values) of covariates for the treated and control groups are as similar as possible.
Nearest Neighbor (method = “nearest”): Nearest neighbor matching selects the best control matches for each individual in the treatment group. Matching is done using a distance measure (propensity score) specified by the distance option (default = logit).
As well as optimal matching, full matching, genetic matching, and coarsened exact matching, all of which are detailed in the documentation.
A few additional arguments are important to know about:
distance: this refers to propensity scores. There are many options for how to calculate these within MatchIt.
discard: specifies whether to discard units that fall outside some measure of support of the distance measure (default is “none”, discard no units). For example, if some treated units have extremely high propensity scores that are higher than any control units, we could drop those.
replace: a logical value indicating whether each control unit can be matched to more than one treated unit (default is replace = FALSE, each control unit is used at most once).
ratio: the number of control units to match to each treated unit (default is ratio = 1).
There are also some optional arguments for most of the matching methods, which you can read about in the documentation if you are interested.
We can use a combination of the results from our balance table and theory to identify which variables to use for matching. Let’s perform an exact match with:
# first we must omit missing values (MatchIt does not allow missings)
<- ecls_df %>%
match_data ::select(catholic, c5r2mtsc_std, race_white, p5hmage,
dplyr%>%
w3income, p5numpla, w3momed_hsb) na.omit()
str(match_data) #Let's see how many observations we have
## tibble [9,267 × 7] (S3: tbl_df/tbl/data.frame)
## $ catholic : num [1:9267] 0 0 0 1 0 0 0 0 0 0 ...
## $ c5r2mtsc_std: num [1:9267] 0.982 0.594 0.491 1.451 2.596 ...
## $ race_white : num [1:9267] 1 1 1 1 1 1 1 0 1 1 ...
## $ p5hmage : num [1:9267] 47 41 43 38 47 30 41 38 28 31 ...
## $ w3income : num [1:9267] 62500 45000 62500 87500 150000 ...
## $ p5numpla : num [1:9267] 1 1 1 1 1 1 1 1 1 1 ...
## $ w3momed_hsb : num [1:9267] 0 0 0 0 0 0 0 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:1811] 3 20 22 27 33 36 45 48 52 55 ...
## ..- attr(*, "names")= chr [1:1811] "3" "20" "22" "27" ...
# perform exact match
<- MatchIt::matchit(catholic ~ race_white + p5hmage + w3income +
exact_match + w3momed_hsb,
p5numpla method = "exact",
data = match_data)
# Try seeing the output in the console with summary(exact_match)
# grab the matched data into a new data frame
<- MatchIt::match.data(exact_match)
data_exact_match
str(data_exact_match)
## matchdata [5,405 × 9] (S3: matchdata/tbl_df/tbl/data.frame)
## $ catholic : num [1:5405] 0 0 0 1 0 0 0 0 0 0 ...
## $ c5r2mtsc_std: num [1:5405] 0.982 0.594 0.491 1.451 2.596 ...
## $ race_white : num [1:5405] 1 1 1 1 1 1 1 1 1 1 ...
## $ p5hmage : num [1:5405] 47 41 43 38 47 41 38 38 27 40 ...
## $ w3income : num [1:5405] 62500 45000 62500 87500 150000 ...
## $ p5numpla : num [1:5405] 1 1 1 1 1 1 1 1 1 1 ...
## $ w3momed_hsb : num [1:5405] 0 0 0 0 0 0 0 0 1 1 ...
## $ weights : Named num [1:5405] 0.147 1.744 1.031 1 0.191 ...
## ..- attr(*, "names")= chr [1:5405] "1" "2" "3" "4" ...
## $ subclass : Factor w/ 460 levels "1","2","3","4",..: 381 153 122 1 365 27 221 15 8 249 ...
## ..- attr(*, "names")= chr [1:5405] "1" "2" "3" "4" ...
## - attr(*, "na.action")= 'omit' Named int [1:1811] 3 20 22 27 33 36 45 48 52 55 ...
## ..- attr(*, "names")= chr [1:1811] "3" "20" "22" "27" ...
## - attr(*, "weights")= chr "weights"
## - attr(*, "subclass")= chr "subclass"
# estimate effect again with new data frame
<- lm(c5r2mtsc_std ~ catholic, data = data_exact_match)
exact_match_model
::stargazer(exact_match_model, type = "text") stargazer
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic -0.101***
## (0.030)
##
## Constant 0.319***
## (0.015)
##
## -----------------------------------------------
## Observations 5,405
## R2 0.002
## Adjusted R2 0.002
## Residual Std. Error 0.934 (df = 5403)
## F Statistic 11.340*** (df = 1; 5403)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Now we can see that the mean in the group that did not attend Catholic school is actually about 0.10 higher than the mean for those who did. The results are statistically significant given that the confidence interval does not contain zero, and we have a fairly small p-value.
If we want to perform non-exact matching, we can use propensity scores. We can generate these manually using a logit model on the unmatched data set.
When we extract propensity scores, we model the propensity of each unit of falling under the treatment group based on their values on a set of covariates. This is how we would do this manually:
# create a new column with income by the thousands for more interpretable output
<- ecls_df %>%
ecls_df ::mutate(w3income_1k = w3income / 1000)
dplyr
# estimate logit model
<- glm(catholic ~ race_white + w3income_1k +
m_ps + p5numpla + w3momed_hsb,
p5hmage family = binomial(link = "logit"), # you can also use a probit link here
data = ecls_df)
# extract predicted probabilities
# type = "response" option tells R to output probabilities of the form P(Y = 1|X)
<- dplyr::tibble(pr_score = predict(m_ps, type = "response"),
prs_df catholic = m_ps$model$catholic) # the actual values
Let’s plot the propensity scores by treatment group to explore common support:
# Histogram
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
geom_histogram(alpha = 0.5) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Propensity Score Distribution: Treatment and Control Groups",
x = "Propensity Score",
y = "Count",
fill = "Catholic School Attendance")
# Density plot
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
geom_density(alpha = 0.5) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Propensity Score Distribution: Treatment and Control Groups",
x = "Propensity Score",
y = "Density",
fill = "Catholic School Attendance")
# Jittered point plot
ggplot(prs_df, aes(x = pr_score, y = factor(catholic), color = factor(catholic))) +
geom_jitter() +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Propensity Score Distribution: Treatment and Control Groups",
x = "Propensity Score",
y = "Group",
color = "Catholic School Attendance")
MatchIt
can generate propensity scores itself, so we don’t need to manually go through the process above. Let’s try putting together a non-exact matching formula yourself! Try:
<- MatchIt::matchit(catholic ~ race_white + w3income + p5hmage +
one_match + w3momed_hsb,
p5numpla method = "nearest",
ratio = 1,
replace = TRUE,
data = match_data)
summary(one_match)
##
## Call:
## MatchIt::matchit(formula = catholic ~ race_white + w3income +
## p5hmage + p5numpla + w3momed_hsb, data = match_data, method = "nearest",
## replace = TRUE, ratio = 1)
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.1927 0.1379 0.6486 1.0007 0.2086
## race_white 0.7411 0.5914 0.3418 . 0.1497
## w3income 82568.9357 55485.0210 0.5777 1.1373 0.1565
## p5hmage 39.5932 37.5658 0.3874 0.6383 0.0408
## p5numpla 1.0917 1.1298 -0.1242 0.6132 0.0076
## w3momed_hsb 0.2234 0.4609 -0.5703 . 0.2375
## eCDF Max
## distance 0.3109
## race_white 0.1497
## w3income 0.3062
## p5hmage 0.1893
## p5numpla 0.0277
## w3momed_hsb 0.2375
##
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.1927 0.1927 0.0000 0.9954 0.0000
## race_white 0.7411 0.7493 -0.0186 . 0.0081
## w3income 82568.9357 81775.6653 0.0169 1.0052 0.0036
## p5hmage 39.5932 39.6169 -0.0045 1.0179 0.0015
## p5numpla 1.0917 1.0777 0.0459 1.1580 0.0031
## w3momed_hsb 0.2234 0.2226 0.0018 . 0.0007
## eCDF Max Std. Pair Dist.
## distance 0.0030 0.0001
## race_white 0.0081 0.0625
## w3income 0.0081 0.0396
## p5hmage 0.0074 0.1131
## p5numpla 0.0126 0.0846
## w3momed_hsb 0.0007 0.0586
##
## Percent Balance Improvement:
## Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
## distance 100.0 -552.2 100.0 99.0
## race_white 94.6 . 94.6 94.6
## w3income 97.1 95.9 97.7 97.3
## p5hmage 98.8 96.0 96.2 96.1
## p5numpla 63.1 70.0 59.2 54.6
## w3momed_hsb 99.7 . 99.7 99.7
##
## Sample Sizes:
## Control Treated
## All 7915. 1352
## Matched (ESS) 187.29 1352
## Matched 510. 1352
## Unmatched 7405. 0
## Discarded 0. 0
We can interpret the resulting output as follows:
Now, let’s plot the propensity scores for the treated and untreated units.
# simple plot - check out the cobalt package for nicer options, or use ggplot2 to create your own!
plot(one_match, type = "hist")
Let’s extract the data from one_match and creating a balance table like the one we did before, just this time using the new data. Scroll down for the answer when you’re ready.
# grab data set
<- MatchIt::get_matches(one_match)
data_prop_match
# create list of covariates for the table
<- c("race_white", "p5hmage", "w3income", "p5numpla", "w3momed_hsb")
list_cov
%>% # our data frame
data_prop_match ::summarize_at(list_cov, funs(list(broom::tidy(t.test(. ~ catholic))))) %>% # sequentially run t-tests across all the covariates in the list_cov (note that you have to change the "treatment")
dplyr::map(1) %>% # maps into a list
purrr::bind_rows(.id='variables') %>% # binds list into a single data frame and names the id column "variables"
dplyr::select(variables, estimate1, estimate2, p.value) %>% # select only the names, group means, and p-values
dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric variables to three places
dplyr::kable(col.names = c("Variable", "Control (Catholic = 0)", "Treat (Catholic = 1)", "P value")) %>% # create kable table and rename headings
knitr::kable_styling() # style kable table for our knitted document kableExtra
Variable | Control (Catholic = 0) | Treat (Catholic = 1) | P value |
---|---|---|---|
race_white | 0.749 | 0.741 | 0.628 |
p5hmage | 39.617 | 39.593 | 0.906 |
w3income | 81775.665 | 82568.936 | 0.659 |
p5numpla | 1.078 | 1.092 | 0.216 |
w3momed_hsb | 0.223 | 0.223 | 0.963 |
Those means look very close. Hooray.
Finally, let’s estimate on the matched data set:
<- lm(c5r2mtsc_std ~ catholic, data = data_prop_match)
prop_match_model ::stargazer(prop_match_model, type = "text") stargazer
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic -0.038
## (0.037)
##
## Constant 0.248***
## (0.026)
##
## -----------------------------------------------
## Observations 2,704
## R2 0.0004
## Adjusted R2 0.00003
## Residual Std. Error 0.955 (df = 2702)
## F Statistic 1.068 (df = 1; 2702)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
As with the exact matching, we can see that those that did not attend Catholic school performed better on the test than those who did. Still, we see that our results in this instance are not statistically significant.