Measuring the effect of a soda tax on sugar-added drink consumption
After the very successful impact evaluations you have performed in the past weeks, you are contacted by the local government of Pawnee, Indiana. The city is interested in your advice to assess a policy intervention passed with the support of councilwoman Leslie Knope.
The city of Pawnee has been in the spotlight recently, as it has come to be known as the child obesity and diabetes capital of the state of Indiana. Some of the city's constituents point to the fast-food culture and the soda sizes at restaurants in town as a source of the problem. The largest food chain in Pawnee, Paunch Burger, offers its smallest soda size at a whopping 64 oz (about 1.9 liters).
The "soda tax", as it came to be known, came to effect initially at a couple of districts. Fortunately for you, based on an archaic law, residents of Indiana have to demonstrate their residence in the district they intend to dine before being served at any of the restaurants. The latter fact means that Pawnee inhabitants can only buy sugar-added drinks in their respective home district.

1.1 Exploring the data
set.seed(42) #for consistent results
library(dplyr) # to wrangle our data
library(tidyr) # to wrangle our data - pivot_longer()
library(ggplot2) # to render our graphs
library(readr) # for loading the .csv data
library(lmtest) # to gather our clustered standard errors - coeftest()
library(plm) # to gather our clustered standard errors - vcovHC()
library(kableExtra) # to render better formatted tables
library(stargazer) # for formatting your model output
soda_tax_df <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/hertiestats2/master/data/soda_tax_df.csv") # simulated data
names(soda_tax_df) # to check the names of the variables in our data
## [1] "id" "district" "treatment" "pre_tax" "post_tax"
Our dataset, soda_tax_df, contains the following information:

id: A unique numeric identifier for each of the 7,500 inhabitants of Pawnee
district: The name of the district in which the corresponding unit lives
treatment: A binary variable that signals whether the subject lived in a district where the tax was implemented
pre_tax: The weekly sugar-added drink consumption in ounces before the tax was imposed
post_tax: The weekly sugar-added drink consumption in ounces after the tax was imposed
Let's check how many units there are in every district and which districts have treated units.
We can use the base R table() function, which performs a categorical tabulation of the data (each value of the variable and its frequency), to check how many people live in each district.
table(soda_tax_df$district) %>%
knitr::kable(col.names = c("District", "Inhabitants")) %>% # create kable table
kableExtra::kable_styling() # view kable table
District | Inhabitants |
---|---|
City Hall | 1250 |
Old Eagleton | 1250 |
River Park | 1250 |
Snake Lounge | 1250 |
The Pit | 1250 |
Wamapoke Quarter | 1250 |
Furthermore, we can create a two-way cross-table to check how many treated units live in each district.
table(soda_tax_df$treatment, soda_tax_df$district) %>%
knitr::kable() %>% # create kable table
kableExtra::kable_styling() # view kable table
Treatment | City Hall | Old Eagleton | River Park | Snake Lounge | The Pit | Wamapoke Quarter |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1250 | 1250 | 1250 |
1 | 1250 | 1250 | 1250 | 0 | 0 | 0 |
Let's check the distribution of sugar-added drink consumption at different points in time.
We can also utilize summary() to check the distribution of our variables. In this case, we can look at the distribution of our variable of interest at the two points in time:
summary(soda_tax_df$pre_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 400.6 506.6 509.7 610.8 1687.6
ggplot(soda_tax_df, aes(x = pre_tax)) +
geom_vline(xintercept = 509.7, linetype = "dashed") +
geom_density(fill = "#f1a340", alpha = 0.3) +
theme_minimal() +
labs(x = "Pre-tax soda consumtion (oz)",
y = "Density")+
ggtitle("Soda consumption before the tax, in all districts")

summary(soda_tax_df$post_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 327.0 445.9 449.8 566.1 1705.6
ggplot(soda_tax_df, aes(x = post_tax)) +
geom_density(fill = "#998ec3", alpha = 0.3) +
geom_vline(xintercept = 449.8, linetype = "dashed") +
theme_minimal() +
labs(x = "Post-tax soda consumtion (oz)",
y = "Density") +
ggtitle("Soda consumption after the tax, in all districts")
1.2 Modeling and estimating
So far we have ignored time in our estimations. Up until this point, most of the tools we have learned produce estimates of the counterfactual through explicit assignment rules that work randomly or as-if-randomly (e.g. randomized experiments, regression discontinuity, and instrumental variable set-ups).
Difference-in-differences compares the changes in outcomes over time between units under different treatment states. This allows us to correct for any differences between the treatment and comparison groups that are constant over time, assuming that their trends over time are parallel.
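In the simplest two-group, two-period case, this amounts to a difference of differences in group means (the notation below is purely illustrative; the regression formulation appears in part c):
\[\hat{\beta}_{DD} = (\bar{Y}^{\,treated}_{post} - \bar{Y}^{\,treated}_{pre}) - (\bar{Y}^{\,control}_{post} - \bar{Y}^{\,control}_{pre})\]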
a. Calculating without time
If we did not have the pre_tax baseline measure, we would likely utilize post_tax to explore the average effect on the treated. In this case, we would model this as:
after_model <- lm(post_tax ~ treatment, data = soda_tax_df)
stargazer(after_model, type = "text")
===============================================
                        Dependent variable:
                    ---------------------------
                              post_tax
-----------------------------------------------
treatment                    -146.918***
                               (3.798)

Constant                     523.273***
                               (2.686)

-----------------------------------------------
Observations                    7,500
R2                              0.166
Adjusted R2                     0.166
Residual Std. Error      164.465 (df = 7498)
F Statistic         1,496.245*** (df = 1; 7498)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
We could read this result substantively as: those who lived in districts where the tax was implemented consumed on average 146.9 ounces less of sugar-added drinks per week than those who lived in districts where the tax was not put in place. This calculation gives us a comparison of the treatment and control groups after treatment.
To believe the results of our after_model, we would need to believe that the mean ignorability of treatment assignment assumption is fulfilled. We would have to think carefully about possible factors that could differentiate our treatment and control groups. We use a treatment indicator based on the districts where the measure could be implemented. Treatment was not fully randomly assigned, so there may be many potential confounders that create baseline differences in consumption for those living in Old Eagleton compared to those in Snake Lounge, and that also affect the after-treatment comparisons.
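One quick, partial check is whether the two groups already differed in their baseline consumption before the tax. This only inspects the pre-treatment outcome, not unobserved confounders; a minimal sketch:
soda_tax_df %>%
  dplyr::group_by(treatment) %>% # treated vs. control districts
  dplyr::summarize(mean_pre_tax = mean(pre_tax)) # baseline (pre-tax) consumption by group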
If we think about the mechanics behind this naive calculation, we are just comparing the average observed outcomes for those treated and not treated after the tax was imposed:
soda_tax_df %>%
dplyr::group_by(treatment) %>%
dplyr::summarize(mean = mean(post_tax)) %>%
knitr::kable(col.names = c("Treatment", "Average after tax")) %>%
kableExtra::kable_styling()
Treatment | Average after tax |
---|---|
0 | 523.2726 |
1 | 376.3548 |
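Note that the difference between these two group means, 376.35 − 523.27 ≈ −146.92 ounces, reproduces the treatment coefficient from after_model exactly: a regression on a single binary indicator is just a difference in means. We can verify this directly:
mean(soda_tax_df$post_tax[soda_tax_df$treatment == 1]) - mean(soda_tax_df$post_tax[soda_tax_df$treatment == 0]) # ~ -146.92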
b. Including the time dimension (First differences on treatment indicator)
We can introduce the time component to our calculation by incorporating the pre-treatment levels of sugar-added drink consumption, which gives us the diff-in-diff estimand. We can calculate this in a fairly straightforward manner by creating a variable capturing the change:

change: The difference in sugar-added drink consumption between the post-tax and pre-tax periods
soda_tax_df <- soda_tax_df %>%
dplyr::mutate(change = post_tax - pre_tax) # create the change variable (post- minus pre-tax consumption)
did_model <- lm(change ~ treatment, data = soda_tax_df)
stargazer(did_model, after_model, type = "text")
============================================================
                              Dependent variable:
                    ----------------------------------------
                           change               post_tax
                             (1)                  (2)
------------------------------------------------------------
treatment                -149.744***          -146.918***
                           (0.246)              (3.798)

Constant                  14.967***            523.273***
                           (0.174)              (2.686)

------------------------------------------------------------
Observations                7,500                7,500
R2                          0.980                0.166
Adjusted R2                 0.980                0.166
Residual Std. Error        10.671              164.465
                        (df = 7498)           (df = 7498)
F Statistic           369,242.400***        1,496.245***
                      (df = 1; 7498)        (df = 1; 7498)
============================================================
Note:                            *p<0.1; **p<0.05; ***p<0.01
We could read this result substantively as: those who lived in districts where the tax was implemented reduced their weekly sugar-added drink consumption by, on average, 149.7 ounces more than those who lived in districts where the tax was not put in place. This calculation gives us the change, or difference, in sugar-added drink consumption for the treatment and control groups.
To believe the results of our did_model, we would need to believe that there are parallel trends between the two groups.
Note: when simulating the data, post_tax was defined as \(post\_tax = 15 + pre\_tax - 150 \times treatment + error\).
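Purely as an illustration, a data-generating process like the one above could be written in R as follows; the error distribution and its spread are assumptions here, not something stated in the simulation note:
# Hypothetical illustration of the data-generating process above.
# The normal error with sd = 10 is an assumption, not taken from the source.
soda_tax_sim <- soda_tax_df %>%
  dplyr::mutate(post_tax_sim = 15 + pre_tax - 150 * treatment + rnorm(dplyr::n(), mean = 0, sd = 10))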
c. Including the time dimension (Regression formulation of the DiD model)
Remember the formula from the lecture where we estimate the diff-in-diff effect with time and treatment dummies? We can re-format our data to gather our diff-in-diff estimand:
\[Y_{it} = \beta_0 + \beta_1 D^*_i + \beta_2 P_t + \beta_{DD} (D^*_i \times P_t) + q_{it}\]
We need to convert our data to a long format to render the time and treatment dummy variables. This is what our data look like:
head(soda_tax_df, 6)
## # A tibble: 6 x 6
## id district treatment pre_tax post_tax change
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Snake Lounge 0 1688. 1706. 17.9
## 2 2 Snake Lounge 0 427. 438. 11.0
## 3 3 Snake Lounge 0 566. 560. -6.80
## 4 4 Snake Lounge 0 607. 624. 17.0
## 5 5 Snake Lounge 0 573. 607. 34.2
## 6 6 Snake Lounge 0 496. 502. 5.72
We will utilize the pivot_longer() function from tidyr to re-format our data frame.
soda_tax_df_long <- soda_tax_df %>%
dplyr::select(-change) %>% # drop the change variable; we only need the original columns
tidyr::pivot_longer(cols = c(pre_tax, post_tax), names_to = "period", values_to = "soda_drank") %>% # grab columns, put names to new variable period and values to new variable soda_drank
dplyr::mutate(after_tax = ifelse(period == "post_tax", 1, 0)) # create dummy for period
head(soda_tax_df_long, 6)
## # A tibble: 6 x 6
## id district treatment period soda_drank after_tax
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 Snake Lounge 0 pre_tax 1688. 0
## 2 1 Snake Lounge 0 post_tax 1706. 1
## 3 2 Snake Lounge 0 pre_tax 427. 0
## 4 2 Snake Lounge 0 post_tax 438. 1
## 5 3 Snake Lounge 0 pre_tax 566. 0
## 6 3 Snake Lounge 0 post_tax 560. 1
We can see that under our long format, we have two entries for every individual. We also have our variable after_tax, which represents \(P_t\), where 0 and 1 mark the pre- and post-tax periods respectively. We can now run our regression based on the formula:
did_long <- lm(soda_drank ~ treatment + after_tax + treatment * after_tax, data = soda_tax_df_long) #running our model
did_long_clustered_se <- coeftest(did_long, vcov=vcovHC(did_long,type="HC0",cluster="district")) # clustering our standard errors at the district level
stargazer(did_long_clustered_se, type = "text")
===============================================
                        Dependent variable:
                    ---------------------------

-----------------------------------------------
treatment                      2.827
                              (3.799)

after_tax                    14.967***
                              (3.836)

treatment:after_tax         -149.744***
                              (5.372)

Constant                     508.306***
                              (2.708)

===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
(We can think about the results of this regression in terms of the lab's slide 10: we are mostly interested in the coefficient on the interaction term, which corresponds to \(\beta_{DD}\), our diff-in-diff estimate.)
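As a side note, a common alternative route to cluster-robust standard errors for lm objects is vcovCL() from the sandwich package; this is only a sketch of an equivalent call, not the approach used above:
# Hypothetical alternative: cluster-robust SEs via sandwich::vcovCL()
library(sandwich)
did_long_clustered_alt <- lmtest::coeftest(did_long, vcov = sandwich::vcovCL(did_long, cluster = ~ district)) # cluster at the district level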
soda_tax_df_long %>%
dplyr::group_by(period, treatment) %>% #group by period and treatment
dplyr::summarize(soda_consumed = mean(soda_drank)) %>% #render averages
tidyr::pivot_wider(names_from = period, values_from = soda_consumed) %>%
dplyr::select(treatment, pre_tax, post_tax) %>%
dplyr::arrange(desc(treatment)) %>%
knitr::kable(col.names = c("Treatment", "Pre-tax", "Post-tax"),
digits = 2) %>%
kableExtra::kable_styling(full_width = F) %>%
kableExtra::add_header_above(c("", "Period" = 2)) #add header for period
Treatment | Pre-tax | Post-tax |
---|---|---|
1 | 511.13 | 376.35 |
0 | 508.31 | 523.27 |
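From this table we can recover the diff-in-diff estimate by hand: \((376.35 - 511.13) - (523.27 - 508.31) \approx -134.78 - 14.96 \approx -149.74\) ounces, which matches the interaction coefficient above. In R:
(376.35 - 511.13) - (523.27 - 508.31) # difference-in-differences recovered from the group means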
soda_tax_df %>%
select(-id, -district) %>%
group_by(treatment) %>%
summarize_all(mean) %>% # average pre_tax and post_tax by treatment group
pivot_longer(cols = c(pre_tax, post_tax),
names_to = "period",
values_to = "soda_drank") %>%
mutate(time = ifelse(period == "post_tax", 2, 1)) %>%
group_by(treatment) %>%
ggplot(aes(x = factor(time), y = soda_drank, color = factor(treatment))) +
geom_point() +
geom_line(aes(x = time, y = soda_drank)) +
labs(x = "Time periods", y = "Ounces of soda drank per week", color = "Treatment group")+
theme_minimal()

1.3 Drafting some brief recommendations
Based on your analysis of the data at hand, you decide to recommend that the tax measure should move forward in the rest of Pawnee. You state that it is a very good example of a Pigouvian tax, which captures the negative externalities not included in the market price of sugar-added drinks. The findings suggest that the tax reduced weekly sugar-added drink consumption by about 150 liquid ounces (almost 4.5 liters).
Your evaluation report is so convincing that the Director of the Parks Department, Ron Swanson, is even doubting libertarianism.
