Example inspired by Simon Ejdemyr: https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html
library(dplyr) # for data wrangling
library(readr) # for loading the .csv data
library(ggplot2) # for creating plots
library(stargazer) # for formatting model output
library(kableExtra) # for formatting data frames
library(MatchIt) # for matching
set.seed(123) # for consistent results
First, let's read in the data. We will remove a few variables that won't be useful for our analysis. Then, we will compare the students pre-matching on their standardized math scores (c5r2mtsc_std).
ecls <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%204/data/ecls.csv")
ecls <- ecls %>%
dplyr::select(-childid, -race, -w3daded,
-w3momed, -w3inccat) #drop these columns (-)
ecls %>%
dplyr::group_by(catholic) %>%
dplyr::summarize(n_students = n(),
mean_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>%
round(3) %>% # round the results
kable() %>% # create kable table
kable_styling() # view kable table
catholic | n_students | mean_math | std_error |
---|---|---|---|
0 | 9568 | -0.031 | 0.010 |
1 | 1510 | 0.194 | 0.022 |
We can see that we have many more students that did not attend Catholic school than those who did, and the Catholic school students have a mean math score that is higher.
First, let's calculate the NATE of Catholic school on student achievement (without adjusting for any covariates) to compare the means of the two groups.
summary(lm(c5r2mtsc_std ~ catholic, data = ecls))
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = ecls)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5299 -0.6634 0.0609 0.6770 3.4643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03060 0.01019 -3.002 0.00269 **
## catholic 0.22446 0.02761 8.130 4.75e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9971 on 11076 degrees of freedom
## Multiple R-squared: 0.005932, Adjusted R-squared: 0.005842
## F-statistic: 66.1 on 1 and 11076 DF, p-value: 4.752e-16
Now, let's create a balance table to identify the relevant co-variates in the unmatched dataset.
ecls %>%
dplyr::group_by(catholic) %>%
dplyr::summarize_all(funs(mean(., na.rm = T))) %>% # summarize the mean of every variable
t() %>% # transpose data
as.data.frame() %>% # after transposing, convert from matrix to df
add_rownames("variable") %>% # rownames to explicit column
dplyr::rename(no_catholic = "1", catholic = "2") %>% # rename columns that are created by as.data.frame
dplyr::mutate(difference = catholic - no_catholic,
differencePerc = difference / (catholic + no_catholic)) %>% # create new columns for difference between groups
dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric columns
arrange(desc(differencePerc)) %>%
knitr::kable() %>% # create kable table
kableExtra::kable_styling() # view kable table
variable | no_catholic | catholic | difference | differencePerc |
---|---|---|---|---|
c5r2mtsc_std | -0.031 | 0.194 | 0.224 | 1.375 |
catholic | 0.000 | 1.000 | 1.000 | 1.000 |
w3income | 54889.159 | 82074.301 | 27185.141 | 0.198 |
race_white | 0.556 | 0.725 | 0.169 | 0.132 |
w3momscr | 43.114 | 47.492 | 4.379 | 0.048 |
w3dadscr | 42.713 | 46.356 | 3.643 | 0.041 |
p5hmage | 37.561 | 39.575 | 2.014 | 0.026 |
c5r2mtsc | 50.209 | 52.389 | 2.180 | 0.021 |
p5hdage | 40.392 | 41.988 | 1.596 | 0.019 |
p5numpla | 1.133 | 1.093 | -0.040 | -0.018 |
race_asian | 0.066 | 0.052 | -0.014 | -0.118 |
race_hispanic | 0.181 | 0.131 | -0.050 | -0.161 |
w3daded_hsb | 0.488 | 0.259 | -0.229 | -0.306 |
w3momed_hsb | 0.464 | 0.227 | -0.237 | -0.343 |
race_black | 0.136 | 0.054 | -0.083 | -0.436 |
w3povrty | 0.219 | 0.038 | -0.181 | -0.705 |
p5fstamp | 0.129 | 0.015 | -0.114 | -0.788 |
We can see that the largest magnitude differences are for:
MatchIt
MatchIt
is designed for causal inference with a dichotomous treatment variable and a set of pretreatment control variables. Any number or type of dependent variables can be used.
The basic syntax is as follows:
matched <- matchit(treat ~ x1 + x2, data = mydata)
MatchIt is capable of using several matching methods:
Exact (method = "exact"): The simplest version of matching is exact. This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.
Subclassification (method = "subclass"): When there are many covariates (or some covariates can take a large number of values), finding sufficient exact matches will often be impossible. The goal of subclassification is to form subclasses, such that in each of them the distribution (rather than the exact values) of covariates for the treated and control groups are as similar as possible.
As well as optimal matching, full matching, genetic matching, and coarsened exact matching, all of which are detailed in the documentation.
A few additional arguments are important to know about:
distance: this refers to propensity scores. There are many options for how to calculate these within MatchIt.
discard: specifies whether to discard units that fall outside some measure of support of the distance measure (default is "none", discard no units). For example, if some treated units have extremely high propensity scores that are higher than any control units, we could drop those.
replace: a logical value indicating whether each control unit can be matched to more than one treated unit (default is replace = FALSE, each control unit is used at most once).
ratio: the number of control units to match to each treated unit (default is ratio = 1).
There are also some optional arguments for most of the matching methods, which you can read about in the documentation if you are interested.
We can use a combination of the results from our balance table and theory to identify which variables to use for matching. Let's perform an exact match with:
# first we must omit missing values (MatchIt does not allow missings)
match_data <- ecls %>%
dplyr::select(catholic, c5r2mtsc_std, race_white, p5hmage,
w3income, p5numpla, w3momed_hsb) %>%
na.omit()
str(match_data) #Let's see how many observations do we have
## tibble [9,267 × 7] (S3: tbl_df/tbl/data.frame)
## $ catholic : num [1:9267] 0 0 0 1 0 0 0 0 0 0 ...
## $ c5r2mtsc_std: num [1:9267] 0.982 0.594 0.491 1.451 2.596 ...
## $ race_white : num [1:9267] 1 1 1 1 1 1 1 0 1 1 ...
## $ p5hmage : num [1:9267] 47 41 43 38 47 30 41 38 28 31 ...
## $ w3income : num [1:9267] 62500 45000 62500 87500 150000 ...
## $ p5numpla : num [1:9267] 1 1 1 1 1 1 1 1 1 1 ...
## $ w3momed_hsb : num [1:9267] 0 0 0 0 0 0 0 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:1811] 3 20 22 27 33 36 45 48 52 55 ...
## ..- attr(*, "names")= chr [1:1811] "3" "20" "22" "27" ...
# perform exact match
exact_match <- MatchIt::matchit(catholic ~ race_white + p5hmage + w3income +
p5numpla + w3momed_hsb,
method = "exact",
data = match_data)
# Try seeing the output in the console with summary(exact_match)
# grab the matched data
data_exact_match <- MatchIt::match.data(exact_match)
str(data_exact_match)
## 'data.frame': 5405 obs. of 9 variables:
## $ catholic : num 0 0 0 1 0 0 0 0 0 0 ...
## $ c5r2mtsc_std: num 0.982 0.594 0.491 1.451 2.596 ...
## $ race_white : num 1 1 1 1 1 1 1 1 1 1 ...
## $ p5hmage : num 47 41 43 38 47 41 38 38 27 40 ...
## $ w3income : num 62500 45000 62500 87500 150000 ...
## $ p5numpla : num 1 1 1 1 1 1 1 1 1 1 ...
## $ w3momed_hsb : num 0 0 0 0 0 0 0 0 1 1 ...
## $ weights : num 0.147 1.744 1.031 1 0.191 ...
## $ subclass : int 381 153 122 1 365 27 221 15 8 249 ...
# estimate effect again
summary(lm(c5r2mtsc_std ~ catholic, data = data_exact_match))
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = data_exact_match)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5618 -0.5838 0.0506 0.6410 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.31932 0.01453 21.981 < 2e-16 ***
## catholic -0.10072 0.02991 -3.368 0.000764 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9336 on 5403 degrees of freedom
## Multiple R-squared: 0.002094, Adjusted R-squared: 0.00191
## F-statistic: 11.34 on 1 and 5403 DF, p-value: 0.0007638
Now we can see that the mean in the group that did not attend Catholic school is actually about 0.10 higher than the mean for those who did. The results are statistically significant given that the confidence interval does not contain zero, and we have a fairly small p-value.
If we want to perform non-exact matching, we can use propensity scores. We can generate these manually using a probit model on the unmatched data set.
# create a new column with income by the thousands for more interpretable output
ecls <- ecls %>%
dplyr::mutate(w3income_1k = w3income / 1000)
# estimate logit model
m_ps <- glm(catholic ~ race_white + w3income_1k +
p5hmage + p5numpla + w3momed_hsb,
family = binomial(link = "probit"), # you can also use a logit link here
data = ecls)
# extract predicted probabilities
# type = "response" option tells R to output probabilities of the form P(Y = 1|X)
prs_df <- dplyr::tibble(pr_score = predict(m_ps, type = "response"),
catholic = m_ps$model$catholic) # the actual values
Let's plot the propensity scores by treatment group to explore common support:
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
geom_histogram(alpha = 0.5) +
theme_minimal() +
labs(x = "Propensity Score Distribution: Treatment and Control Groups",
fill = "Catholic School Attendance")
ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
geom_density(alpha = 0.5) +
theme_minimal() +
labs(x = "Propensity Score Distribution: Treatment and Control Groups",
fill = "Catholic School Attendance")
MatchIt
can generate propensity scores itself, so we don't need to manually go through the process above. Let's try putting together a non-exact matching formula yourself! Try:
one_match <- MatchIt::matchit(catholic ~ race_white + w3income + p5hmage +
p5numpla + w3momed_hsb,
method = "nearest",
ratio = 1,
replace = TRUE,
data = match_data)
summary(one_match)
##
## Call:
## MatchIt::matchit(formula = catholic ~ race_white + w3income +
## p5hmage + p5numpla + w3momed_hsb, data = match_data, method = "nearest",
## ratio = 1, replace = TRUE)
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.1927 0.1379 0.0845 0.0549 5.67e-02
## race_white 0.7411 0.5914 0.4916 0.1497 0.00e+00
## w3income 82568.9357 55485.0210 43961.0872 27083.9146 2.50e+04
## p5hmage 39.5932 37.5658 6.5506 2.0274 2.00e+00
## p5numpla 1.0917 1.1298 0.3910 -0.0380 0.00e+00
## w3momed_hsb 0.2234 0.4609 0.4985 -0.2375 0.00e+00
## eQQ Mean eQQ Max
## distance 0.0548 7.60e-02
## race_white 0.1501 1.00e+00
## w3income 27069.1775 6.25e+04
## p5hmage 2.2544 7.00e+00
## p5numpla 0.0399 2.00e+00
## w3momed_hsb 0.2374 1.00e+00
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med eQQ Mean
## distance 0.1927 0.1927 0.0846 0.0000 0.0037 0.0036
## race_white 0.7411 0.7493 0.4336 -0.0081 0.0000 0.0000
## w3income 82568.9357 81775.6653 46655.3555 793.2703 0.0000 2221.9818
## p5hmage 39.5932 39.6169 5.1757 -0.0237 0.0000 0.1434
## p5numpla 1.0917 1.0777 0.2839 0.0141 0.0000 0.0139
## w3momed_hsb 0.2234 0.2226 0.4162 0.0007 0.0000 0.0130
## eQQ Max
## distance 1.39e-02
## race_white 0.00e+00
## w3income 6.25e+04
## p5hmage 9.00e+00
## p5numpla 1.00e+00
## w3momed_hsb 1.00e+00
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 99.9987 93.426 93.4405 81.6882
## race_white 94.5656 0.000 100.0000 100.0000
## w3income 97.0711 100.000 91.7915 0.0000
## p5hmage 98.8326 100.000 93.6413 -28.5714
## p5numpla 63.0544 0.000 65.1961 50.0000
## w3momed_hsb 99.6886 0.000 94.5111 0.0000
##
## Sample sizes:
## Control Treated
## All 7915 1352
## Matched 1151 1352
## Unmatched 6764 0
## Discarded 0 0
We can interpret the resulting output as follows:
Now, let's plot the propensity scores for the treated and untreated units.
# simple plot - check out the cobalt package for fancier options, or use ggplot2 to create your own!
plot(one_match, type = "hist")
# grab data set
data_prop_match <- MatchIt::match.data(one_match)
# check balance
data_prop_match %>%
dplyr::group_by(catholic) %>%
dplyr::summarise_all(funs(mean)) %>%
t() %>%
as.data.frame() %>%
add_rownames("variable") %>%
dplyr::rename(no_catholic = "1", catholic = "2") %>%
dplyr::mutate(difference = catholic - no_catholic,
differencePerc = difference / (catholic + no_catholic)) %>%
dplyr::mutate_if(is.numeric, round, 3) %>%
knitr::kable() %>%
kableExtra::kable_styling()
variable | no_catholic | catholic | difference | differencePerc |
---|---|---|---|---|
catholic | 0.000 | 1.000 | 1.000 | 1.000 |
c5r2mtsc_std | 0.376 | 0.210 | -0.166 | -0.283 |
race_white | 0.741 | 0.741 | 0.000 | 0.000 |
p5hmage | 39.492 | 39.593 | 0.101 | 0.001 |
w3income | 80248.128 | 82568.936 | 2320.808 | 0.014 |
p5numpla | 1.079 | 1.092 | 0.013 | 0.006 |
w3momed_hsb | 0.236 | 0.223 | -0.013 | -0.028 |
distance | 0.189 | 0.193 | 0.003 | 0.009 |
weights | 1.000 | 1.000 | 0.000 | 0.000 |
Those means look very close. Hooray.
Finally, estimate the treatment effect on the matched data set:summary(lm(c5r2mtsc_std ~ catholic, data = data_prop_match))
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = data_prop_match)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5271 -0.5706 0.0549 0.6095 3.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37552 0.02680 14.013 < 2e-16 ***
## catholic -0.16585 0.03646 -4.548 5.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9092 on 2501 degrees of freedom
## Multiple R-squared: 0.008204, Adjusted R-squared: 0.007808
## F-statistic: 20.69 on 1 and 2501 DF, p-value: 5.661e-06
As with the exact matching, we can see that those that did not attend Catholic school performed better on the test than those who did, and the results are statistically significant.
We can see the direction and magnitude of the covariate effects using a simple linear model:
lm_matched <- lm(c5r2mtsc_std ~ catholic + race_white + p5hmage +
I(w3income / 1000) + p5numpla + w3momed_hsb,
data = data_prop_match)
# use the stargazer package to view the output
# NOTE: you MUST include results = "asis" in the chunk header for this to be visible once knitted
stargazer(lm_matched, type = "text")
=============================================== Dependent variable:
--------------------------- c5r2mtsc_std
----------------------------------------------- catholic -0.178***
(0.035)
race_white 0.315***
(0.040)
p5hmage 0.012***
(0.003)
I(w3income/1000) 0.003***
(0.0004)
p5numpla -0.051
(0.058)
w3momed_hsb -0.287***
(0.042)
Constant -0.477***
(0.154)
Observations 2,503
R2 0.114
Adjusted R2 0.112
Residual Std. Error 0.860 (df = 2496)
F Statistic 53.716*** (df = 6; 2496)
=============================================== Note: p<0.1; p<0.05; p<0.01