Session 05 - Matching

The Effect of Catholic School on Student Achievement

Example inspired by Simon Ejdemyr: https://sejdemyr.github.io/r-tutorials/statistics/tutorial8.html

Reference links:

MatchIt: https://cran.r-project.org/web/packages/MatchIt/vignettes/matchit.pdf
Cobalt (optional library for matching plots and extra features): https://cran.r-project.org/web/packages/cobalt/vignettes/cobalt_A0_basic_use.html
kableExtra (for formatting data frames): https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
Stargazer (for formatting model outputs): https://www.jakeruss.com/cheatsheets/stargazer/
Video overview of matching concepts: https://fr.coursera.org/lecture/crash-course-in-causality/overview-of-matching-JQfPC

library(dplyr) # for data wrangling
library(readr) # for loading the .csv data
library(ggplot2) # for creating plots
library(stargazer) # for formatting model output
library(kableExtra) # for formatting data frames
library(MatchIt) # for matching

set.seed(123) # for consistent results

Data Prep and EDA

First, let's read in the data. We will remove a few variables that won't be useful for our analysis. Then, we will compare the students pre-matching on their standardized math scores (c5r2mtsc_std).

Exploration of the data

ecls <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%204/data/ecls.csv")

ecls <- ecls %>% 
  dplyr::select(-childid, -race, -w3daded,
         -w3momed, -w3inccat) #drop these columns (-)

ecls %>%
  dplyr::group_by(catholic) %>%
  dplyr::summarize(n_students = n(),
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>% 
  round(3) %>% # round the results
  kable() %>% # create kable table
  kable_styling() # view kable table

catholic	n_students	mean_math	std_error
0	9568	-0.031	0.010
1	1510	0.194	0.022

We can see that we have many more students that did not attend Catholic school than those who did, and the Catholic school students have a mean math score that is higher.

NATE

First, let's calculate the NATE of Catholic school on student achievement (without adjusting for any covariates) to compare the means of the two groups.

summary(lm(c5r2mtsc_std ~ catholic, data = ecls))

## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = ecls)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5299 -0.6634  0.0609  0.6770  3.4643 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03060    0.01019  -3.002  0.00269 ** 
## catholic     0.22446    0.02761   8.130 4.75e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9971 on 11076 degrees of freedom
## Multiple R-squared:  0.005932,   Adjusted R-squared:  0.005842 
## F-statistic:  66.1 on 1 and 11076 DF,  p-value: 4.752e-16

Balance Table

Now, let's create a balance table to identify the relevant co-variates in the unmatched dataset.

ecls %>%
  dplyr::group_by(catholic) %>%
  dplyr::summarize_all(funs(mean(., na.rm = T))) %>% # summarize the mean of every variable
  t() %>% # transpose data
  as.data.frame() %>% # after transposing, convert from matrix to df
  add_rownames("variable") %>% # rownames to explicit column
  dplyr::rename(no_catholic = "1", catholic = "2") %>% # rename columns that are created by as.data.frame
  dplyr::mutate(difference = catholic - no_catholic,
         differencePerc = difference / (catholic + no_catholic)) %>% # create new columns for difference between groups
  dplyr::mutate_if(is.numeric, round, 3) %>% # round numeric columns
  arrange(desc(differencePerc)) %>% 
  knitr::kable() %>% # create kable table
  kableExtra::kable_styling() # view kable table

variable	no_catholic	catholic	difference	differencePerc
c5r2mtsc_std	-0.031	0.194	0.224	1.375
catholic	0.000	1.000	1.000	1.000
w3income	54889.159	82074.301	27185.141	0.198
race_white	0.556	0.725	0.169	0.132
w3momscr	43.114	47.492	4.379	0.048
w3dadscr	42.713	46.356	3.643	0.041
p5hmage	37.561	39.575	2.014	0.026
c5r2mtsc	50.209	52.389	2.180	0.021
p5hdage	40.392	41.988	1.596	0.019
p5numpla	1.133	1.093	-0.040	-0.018
race_asian	0.066	0.052	-0.014	-0.118
race_hispanic	0.181	0.131	-0.050	-0.161
w3daded_hsb	0.488	0.259	-0.229	-0.306
w3momed_hsb	0.464	0.227	-0.237	-0.343
race_black	0.136	0.054	-0.083	-0.436
w3povrty	0.219	0.038	-0.181	-0.705
p5fstamp	0.129	0.015	-0.114	-0.788

We can see that the largest magnitude differences are for:

variables on race
w3momed_hsb: dummy variable with 1 = mother's education level is high school or below, 0 = some college or above
w3income: family income
w3povrty: poverty score dummy
p5fstamp: food stamp dummy

Matching with `MatchIt`

MatchIt is designed for causal inference with a dichotomous treatment variable and a set of pretreatment control variables. Any number or type of dependent variables can be used.

The basic syntax is as follows:


matched <- matchit(treat ~ x1 + x2, data = mydata)

where treat is the dichotomous treatment variable, and x1 and x2 are pre-treatment co-variates, all of which are contained in the data framemydata. As you can see, the outcome variable is not included in the matching procedure.

MatchIt is capable of using several matching methods:

Exact (method = "exact"): The simplest version of matching is exact. This technique matches each treated unit to all possible control units with exactly the same values on all the covariates, forming subclasses such that within each subclass all units (treatment and control) have the same covariate values.
Subclassification (method = "subclass"): When there are many covariates (or some covariates can take a large number of values), finding sufficient exact matches will often be impossible. The goal of subclassification is to form subclasses, such that in each of them the distribution (rather than the exact values) of covariates for the treated and control groups are as similar as possible.
Nearest Neighbor (method = "nearest"): Nearest neighbor matching selects the best control matches for each individual in the treatment group. Matching is done using a distance measure (propensity score) specified by the distance option (default = logit).
As well as optimal matching, full matching, genetic matching, and coarsened exact matching, all of which are detailed in the documentation.

A few additional arguments are important to know about:

distance: this refers to propensity scores. There are many options for how to calculate these within MatchIt.
discard: specifies whether to discard units that fall outside some measure of support of the distance measure (default is "none", discard no units). For example, if some treated units have extremely high propensity scores that are higher than any control units, we could drop those.
replace: a logical value indicating whether each control unit can be matched to more than one treated unit (default is replace = FALSE, each control unit is used at most once).
ratio: the number of control units to match to each treated unit (default is ratio = 1).
There are also some optional arguments for most of the matching methods, which you can read about in the documentation if you are interested.

Exact Matching

We can use a combination of the results from our balance table and theory to identify which variables to use for matching. Let's perform an exact match with:

race_white: Is the student white (1) or not (0)?
p5hmage: Mother’s age
w3income: Family income
p5numpla: Number of places the student has lived for at least 4 months
w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?

# first we must omit missing values (MatchIt does not allow missings)
match_data <- ecls %>% 
  dplyr::select(catholic, c5r2mtsc_std, race_white, p5hmage, 
         w3income, p5numpla, w3momed_hsb) %>% 
  na.omit()

str(match_data) #Let's see how many observations do we have

## tibble [9,267 × 7] (S3: tbl_df/tbl/data.frame)
##  $ catholic    : num [1:9267] 0 0 0 1 0 0 0 0 0 0 ...
##  $ c5r2mtsc_std: num [1:9267] 0.982 0.594 0.491 1.451 2.596 ...
##  $ race_white  : num [1:9267] 1 1 1 1 1 1 1 0 1 1 ...
##  $ p5hmage     : num [1:9267] 47 41 43 38 47 30 41 38 28 31 ...
##  $ w3income    : num [1:9267] 62500 45000 62500 87500 150000 ...
##  $ p5numpla    : num [1:9267] 1 1 1 1 1 1 1 1 1 1 ...
##  $ w3momed_hsb : num [1:9267] 0 0 0 0 0 0 0 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:1811] 3 20 22 27 33 36 45 48 52 55 ...
##   ..- attr(*, "names")= chr [1:1811] "3" "20" "22" "27" ...

# perform exact match
exact_match <- MatchIt::matchit(catholic ~ race_white + p5hmage + w3income +
                         p5numpla + w3momed_hsb, 
                       method = "exact", 
                       data = match_data)

# Try seeing the output in the console with summary(exact_match)

# grab the matched data
data_exact_match <- MatchIt::match.data(exact_match)

str(data_exact_match)

## 'data.frame':    5405 obs. of  9 variables:
##  $ catholic    : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ c5r2mtsc_std: num  0.982 0.594 0.491 1.451 2.596 ...
##  $ race_white  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ p5hmage     : num  47 41 43 38 47 41 38 38 27 40 ...
##  $ w3income    : num  62500 45000 62500 87500 150000 ...
##  $ p5numpla    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ w3momed_hsb : num  0 0 0 0 0 0 0 0 1 1 ...
##  $ weights     : num  0.147 1.744 1.031 1 0.191 ...
##  $ subclass    : int  381 153 122 1 365 27 221 15 8 249 ...

# estimate effect again
summary(lm(c5r2mtsc_std ~ catholic, data = data_exact_match))

## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = data_exact_match)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5618 -0.5838  0.0506  0.6410  3.1143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.31932    0.01453  21.981  < 2e-16 ***
## catholic    -0.10072    0.02991  -3.368 0.000764 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9336 on 5403 degrees of freedom
## Multiple R-squared:  0.002094,   Adjusted R-squared:  0.00191 
## F-statistic: 11.34 on 1 and 5403 DF,  p-value: 0.0007638

Now we can see that the mean in the group that did not attend Catholic school is actually about 0.10 higher than the mean for those who did. The results are statistically significant given that the confidence interval does not contain zero, and we have a fairly small p-value.

Propensity Scores

If we want to perform non-exact matching, we can use propensity scores. We can generate these manually using a probit model on the unmatched data set.

# create a new column with income by the thousands for more interpretable output
ecls <- ecls %>% 
  dplyr::mutate(w3income_1k = w3income / 1000) 

# estimate logit model
m_ps <- glm(catholic ~ race_white + w3income_1k + 
            p5hmage + p5numpla + w3momed_hsb,
            family = binomial(link = "probit"), # you can also use a logit link here
            data = ecls)

# extract predicted probabilities
# type = "response" option tells R to output probabilities of the form P(Y = 1|X)
prs_df <- dplyr::tibble(pr_score = predict(m_ps, type = "response"),
                     catholic = m_ps$model$catholic) # the actual values

Let's plot the propensity scores by treatment group to explore common support:

ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
  geom_histogram(alpha = 0.5) +
  theme_minimal() +
  labs(x = "Propensity Score Distribution: Treatment and Control Groups",
       fill = "Catholic School Attendance")

ggplot(prs_df, aes(x = pr_score, fill = factor(catholic))) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(x = "Propensity Score Distribution: Treatment and Control Groups",
       fill = "Catholic School Attendance")

Non-Exact Matching

MatchIt can generate propensity scores itself, so we don't need to manually go through the process above. Let's try putting together a non-exact matching formula yourself! Try:

nearest neighbor matching
with replacement
with a one-to-one ratio
on the match_data dataset

All the info you need to complete this is available in the Exploring Matching section. Scroll down for the solution when you're ready ;-)

one_match <- MatchIt::matchit(catholic ~ race_white + w3income + p5hmage +
                       p5numpla + w3momed_hsb,
                     method = "nearest", 
                     ratio = 1, 
                     replace = TRUE,
                     data = match_data)

summary(one_match)

## 
## Call:
## MatchIt::matchit(formula = catholic ~ race_white + w3income + 
##     p5hmage + p5numpla + w3momed_hsb, data = match_data, method = "nearest", 
##     ratio = 1, replace = TRUE)
## 
## Summary of balance for all data:
##             Means Treated Means Control SD Control  Mean Diff  eQQ Med
## distance           0.1927        0.1379     0.0845     0.0549 5.67e-02
## race_white         0.7411        0.5914     0.4916     0.1497 0.00e+00
## w3income       82568.9357    55485.0210 43961.0872 27083.9146 2.50e+04
## p5hmage           39.5932       37.5658     6.5506     2.0274 2.00e+00
## p5numpla           1.0917        1.1298     0.3910    -0.0380 0.00e+00
## w3momed_hsb        0.2234        0.4609     0.4985    -0.2375 0.00e+00
##               eQQ Mean  eQQ Max
## distance        0.0548 7.60e-02
## race_white      0.1501 1.00e+00
## w3income    27069.1775 6.25e+04
## p5hmage         2.2544 7.00e+00
## p5numpla        0.0399 2.00e+00
## w3momed_hsb     0.2374 1.00e+00
## 
## 
## Summary of balance for matched data:
##             Means Treated Means Control SD Control Mean Diff eQQ Med  eQQ Mean
## distance           0.1927        0.1927     0.0846    0.0000  0.0037    0.0036
## race_white         0.7411        0.7493     0.4336   -0.0081  0.0000    0.0000
## w3income       82568.9357    81775.6653 46655.3555  793.2703  0.0000 2221.9818
## p5hmage           39.5932       39.6169     5.1757   -0.0237  0.0000    0.1434
## p5numpla           1.0917        1.0777     0.2839    0.0141  0.0000    0.0139
## w3momed_hsb        0.2234        0.2226     0.4162    0.0007  0.0000    0.0130
##              eQQ Max
## distance    1.39e-02
## race_white  0.00e+00
## w3income    6.25e+04
## p5hmage     9.00e+00
## p5numpla    1.00e+00
## w3momed_hsb 1.00e+00
## 
## Percent Balance Improvement:
##             Mean Diff. eQQ Med eQQ Mean  eQQ Max
## distance       99.9987  93.426  93.4405  81.6882
## race_white     94.5656   0.000 100.0000 100.0000
## w3income       97.0711 100.000  91.7915   0.0000
## p5hmage        98.8326 100.000  93.6413 -28.5714
## p5numpla       63.0544   0.000  65.1961  50.0000
## w3momed_hsb    99.6886   0.000  94.5111   0.0000
## 
## Sample sizes:
##           Control Treated
## All          7915    1352
## Matched      1151    1352
## Unmatched    6764       0
## Discarded       0       0

We can interpret the resulting output as follows:

Summary of balance for all data: Comparison of the means for all the data without matching
Summary of balance for matched data: Comparison of means for matched data. Looking for them to become similar.
Percent balance improvement: Higher is better, close to 100 is ideal.
Sample sizes: How many units were matched in the control/treatment groups.

Now, let's plot the propensity scores for the treated and untreated units.

# simple plot - check out the cobalt package for fancier options, or use ggplot2 to create your own!
plot(one_match, type = "hist")

Try extracting the data from one_match and creating a balance table like the one we did before, just this time using the new data. Scroll down for the answer when you're ready.

# grab data set
data_prop_match <- MatchIt::match.data(one_match)

# check balance
data_prop_match %>%
  dplyr::group_by(catholic) %>%
  dplyr::summarise_all(funs(mean)) %>%
  t() %>% 
  as.data.frame() %>% 
  add_rownames("variable") %>% 
  dplyr::rename(no_catholic = "1", catholic = "2") %>% 
  dplyr::mutate(difference = catholic - no_catholic,
         differencePerc = difference / (catholic + no_catholic)) %>% 
  dplyr::mutate_if(is.numeric, round, 3) %>% 
  knitr::kable() %>% 
  kableExtra::kable_styling()

variable	no_catholic	catholic	difference	differencePerc
catholic	0.000	1.000	1.000	1.000
c5r2mtsc_std	0.376	0.210	-0.166	-0.283
race_white	0.741	0.741	0.000	0.000
p5hmage	39.492	39.593	0.101	0.001
w3income	80248.128	82568.936	2320.808	0.014
p5numpla	1.079	1.092	0.013	0.006
w3momed_hsb	0.236	0.223	-0.013	-0.028
distance	0.189	0.193	0.003	0.009
weights	1.000	1.000	0.000	0.000

Those means look very close. Hooray.

Finally, estimate the treatment effect on the matched data set:

summary(lm(c5r2mtsc_std ~ catholic, data = data_prop_match))

## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = data_prop_match)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5271 -0.5706  0.0549  0.6095  3.0581 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.37552    0.02680  14.013  < 2e-16 ***
## catholic    -0.16585    0.03646  -4.548 5.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9092 on 2501 degrees of freedom
## Multiple R-squared:  0.008204,   Adjusted R-squared:  0.007808 
## F-statistic: 20.69 on 1 and 2501 DF,  p-value: 5.661e-06

As with the exact matching, we can see that those that did not attend Catholic school performed better on the test than those who did, and the results are statistically significant.

We can see the direction and magnitude of the covariate effects using a simple linear model:

lm_matched <- lm(c5r2mtsc_std ~ catholic + race_white + p5hmage +
                  I(w3income / 1000) + p5numpla + w3momed_hsb, 
                 data = data_prop_match)

# use the stargazer package to view the output
# NOTE: you MUST include results = "asis" in the chunk header for this to be visible once knitted
stargazer(lm_matched, type = "text")

=============================================== Dependent variable:
--------------------------- c5r2mtsc_std
----------------------------------------------- catholic -0.178***
(0.035)

race_white 0.315***
(0.040)

p5hmage 0.012***
(0.003)

I(w3income/1000) 0.003***
(0.0004)

p5numpla -0.051
(0.058)

w3momed_hsb -0.287***
(0.042)

Constant -0.477***
(0.154)

Observations 2,503
R2 0.114
Adjusted R2 0.112
Residual Std. Error 0.860 (df = 2496)
F Statistic 53.716*** (df = 6; 2496)
=============================================== Note: p<0.1; p<0.05; p<0.01