Measuring the effect of a soda tax on sugar-added drink consumption
After the very successful impact evaluations you have performed in the past weeks, you are contacted by the local government of Pawnee, Indiana. The city is interested in your advice to assess a policy intervention passed with the support of councilwoman Leslie Knope.
The city of Pawnee has been in the spotlight recently, as it has come to be known as the child obesity and diabetes capital of the state of Indiana. Some of the city's constituents point to the fast-food culture and the soda sizes at restaurants in town as a source of the problem. The largest food chain in Pawnee, Paunch Burger, offers its smallest soda size at a whopping 64 oz (about 1.9 liters).
The "soda tax", as it came to be known, came to effect initially at a couple of districts. Fortunately for you, based on an archaic law, residents of Indiana have to demonstrate their residence in the district they intend to dine before being served at any of the restaurants. The latter fact means that Pawnee inhabitants can only buy sugar-added drinks in their respective home district.

1.1 Exploring the data
set.seed(42) #for consistent results
library(dplyr) # to wrangle our data
library(tidyr) # to wrangle our data - pivot_longer()
library(ggplot2) # to render our graphs
library(readr) # for loading the .csv data
library(lmtest) # to gather our clustered standard errors - coeftest()
library(plm) # to gather our clustered standard errors - vcovHC()
library(kableExtra) # to render better formatted tables
library(stargazer) # for formatting your model output
soda_tax_df <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/hertiestats2/master/data/soda_tax_df.csv") # simulated data
names(soda_tax_df) # to check the names of the variables in our data
## [1] "id" "district" "treatment" "pre_tax" "post_tax"
Our dataset, soda_tax_df, contains the following information:

id: A unique numeric identifier for each of the 7,500 inhabitants of Pawnee
district: The name of the district in which the corresponding unit lives
treatment: A binary variable that signals whether the subject lived in a district where the tax was implemented
pre_tax: The weekly sugar-added drink consumption in ounces before the tax was imposed
post_tax: The weekly sugar-added drink consumption in ounces after the tax was imposed
Let's check how many units there are in every district and which districts have treated units.
We can use the base R table() function, which performs a categorical tabulation of the data (each value of the variable and its frequency), to check how many people live in each district.
table(soda_tax_df$district) %>%
knitr::kable(col.names = c("District", "Inhabitants")) %>% # create kable table
kableExtra::kable_styling() # view kable table
District | Inhabitants |
---|---|
City Hall | 1250 |
Old Eagleton | 1250 |
River Park | 1250 |
Snake Lounge | 1250 |
The Pit | 1250 |
Wamapoke Quarter | 1250 |
Furthermore, we can create a two-way cross-table to check how many treated units live in each district.
table(soda_tax_df$treatment, soda_tax_df$district) %>%
knitr::kable() %>% # create kable table
kableExtra::kable_styling() # view kable table
Treatment | City Hall | Old Eagleton | River Park | Snake Lounge | The Pit | Wamapoke Quarter |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1250 | 1250 | 1250 |
1 | 1250 | 1250 | 1250 | 0 | 0 | 0 |
Let's check the distribution of sugar-added drink consumption at different points in time.
We can also utilize summary() to check the distribution of our variables. In this case, we can look at the distribution of our variable of interest at the two points in time:
summary(soda_tax_df$pre_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 400.6 506.6 509.7 610.8 1687.6
ggplot(soda_tax_df, aes(x = pre_tax)) +
geom_vline(xintercept = 509.7, linetype = "dashed") +
geom_density(fill = "#f1a340", alpha = 0.3) +
theme_minimal() +
labs(x = "Pre-tax soda consumtion (oz)",
y = "Density")+
ggtitle("Soda consumption before the tax, in all districts")

summary(soda_tax_df$post_tax)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 327.0 445.9 449.8 566.1 1705.6
ggplot(soda_tax_df, aes(x = post_tax)) +
geom_density(fill = "#998ec3", alpha = 0.3) +
geom_vline(xintercept = 449.8, linetype = "dashed") +
theme_minimal() +
labs(x = "Post-tax soda consumtion (oz)",
y = "Density") +
ggtitle("Soda consumption after the tax, in all districts")
1.2 Modeling and estimating
So far we have ignored time in our estimations. Up until this point, most of the tools we have learned produce estimates of the counterfactual through explicit assignment rules that work randomly or as-if-randomly (e.g. randomized experiments, regression discontinuity, and instrumental variable set-ups).
Difference-in-differences compares the changes in outcomes over time between units under different treatment states. This allows us to correct for any differences between the treatment and comparison groups that are constant over time, assuming that their trends over time are parallel.
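In the simplest two-group, two-period case, this amounts to a difference of differences in group means (the notation below is purely illustrative; the regression formulation appears in part c):
\[\hat{\beta}_{DD} = (\bar{Y}^{\,treated}_{post} - \bar{Y}^{\,treated}_{pre}) - (\bar{Y}^{\,control}_{post} - \bar{Y}^{\,control}_{pre})\]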
a. Calculating without time
If we did not have the pre_tax baseline measure, we would likely utilize post_tax to explore the average effect on the treated. In this case, we would model this as:
after_model <- lm(post_tax ~ treatment, data = soda_tax_df)
stargazer(after_model, type = "text")
===============================================
                        Dependent variable:
                    ---------------------------
                              post_tax
-----------------------------------------------
treatment                    -146.918***
                               (3.798)

Constant                     523.273***
                               (2.686)

-----------------------------------------------
Observations                    7,500
R2                              0.166
Adjusted R2                     0.166
Residual Std. Error      164.465 (df = 7498)
F Statistic         1,496.245*** (df = 1; 7498)
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
We could read this result substantively as: those who lived in districts where the tax was implemented consumed on average 146.9 ounces less of sugar-added drinks per week than those who lived in districts where the tax was not put in place. This calculation gives us a comparison of the treatment and control groups after treatment.
To believe the results of our after_model, we would need to believe that the mean ignorability of treatment assignment assumption is fulfilled. We would have to think carefully about possible factors that could differentiate our treatment and control groups. We use a treatment indicator based on the districts where the measure could be implemented. Treatment was not fully randomly assigned, so there may be many potential confounders that create baseline differences in consumption for those living in Old Eagleton compared to those in Snake Lounge, and that also affect the after-treatment comparisons.
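One quick, partial check is whether the two groups already differed in their baseline consumption before the tax. This only inspects the pre-treatment outcome, not unobserved confounders; a minimal sketch:
soda_tax_df %>%
  dplyr::group_by(treatment) %>% # treated vs. control districts
  dplyr::summarize(mean_pre_tax = mean(pre_tax)) # baseline (pre-tax) consumption by group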
If we think about the mechanics behind this naive calculation, we are just comparing the average observed outcomes for those treated and not treated after the tax was imposed:
soda_tax_df %>%
dplyr::group_by(treatment) %>%
dplyr::summarize(mean = mean(post_tax)) %>%
knitr::kable(col.names = c("Treatment", "Average after tax")) %>%
kableExtra::kable_styling()
Treatment | Average after tax |
---|---|
0 | 523.2726 |
1 | 376.3548 |
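Note that the difference between these two group means, 376.35 − 523.27 ≈ −146.92 ounces, reproduces the treatment coefficient from after_model exactly: a regression on a single binary indicator is just a difference in means. We can verify this directly:
mean(soda_tax_df$post_tax[soda_tax_df$treatment == 1]) - mean(soda_tax_df$post_tax[soda_tax_df$treatment == 0]) # ~ -146.92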
b. Including the time dimension (First differences on treatment indicator)
We can introduce the time component to our calculation by incorporating the pre-treatment levels of sugar-added drink consumption, which gives us the diff-in-diff estimand. We can calculate this in a fairly straightforward manner by creating a variable capturing the change:

change: The difference in sugar-added drink consumption between the post-tax and pre-tax periods
soda_tax_df <- soda_tax_df %>%
dplyr::mutate(change = post_tax - pre_tax) # create the change variable (post- minus pre-tax consumption)
did_model <- lm(change ~ treatment, data = soda_tax_df)
stargazer(did_model, after_model, type = "text")
============================================================
                              Dependent variable:
                    ----------------------------------------
                           change               post_tax
                             (1)                  (2)
------------------------------------------------------------
treatment                -149.744***          -146.918***
                           (0.246)              (3.798)

Constant                  14.967***            523.273***
                           (0.174)              (2.686)

------------------------------------------------------------
Observations                7,500                7,500
R2                          0.980                0.166
Adjusted R2                 0.980                0.166
Residual Std. Error        10.671              164.465
                        (df = 7498)           (df = 7498)
F Statistic           369,242.400***        1,496.245***
                      (df = 1; 7498)        (df = 1; 7498)
============================================================
Note:                            *p<0.1; **p<0.05; ***p<0.01
We could read this result substantively as: those who lived in districts where the tax was implemented reduced their weekly sugar-added drink consumption by, on average, 149.7 ounces more than those who lived in districts where the tax was not put in place. This calculation gives us the change, or difference, in sugar-added drink consumption for the treatment and control groups.
To believe the results of our did_model, we would need to believe that there are parallel trends between the two groups.
Note: when simulating the data, post_tax was defined as \(post\_tax = 15 + pre\_tax - 150 \times treatment + error\).
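Purely as an illustration, a data-generating process like the one above could be written in R as follows; the error distribution and its spread are assumptions here, not something stated in the simulation note:
# Hypothetical illustration of the data-generating process above.
# The normal error with sd = 10 is an assumption, not taken from the source.
soda_tax_sim <- soda_tax_df %>%
  dplyr::mutate(post_tax_sim = 15 + pre_tax - 150 * treatment + rnorm(dplyr::n(), mean = 0, sd = 10))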
c. Including the time dimension (Regression formulation of the DiD model)
Remember the formula from the lecture where we estimate the diff-in-diff effect with time and treatment dummies? We can re-format our data to gather our diff-in-diff estimand:
\[Y_{it} = \beta_0 + \beta_1 D^*_i + \beta_2 P_t + \beta_{DD} (D^*_i \times P_t) + q_{it}\]
We need to convert our data to a long format to render the time and treatment dummy variables. This is what our data look like:
head(soda_tax_df, 6)
## # A tibble: 6 x 6
## id district treatment pre_tax post_tax change
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Snake Lounge 0 1688. 1706. 17.9
## 2 2 Snake Lounge 0 427. 438. 11.0
## 3 3 Snake Lounge 0 566. 560. -6.80
## 4 4 Snake Lounge 0 607. 624. 17.0
## 5 5 Snake Lounge 0 573. 607. 34.2
## 6 6 Snake Lounge 0 496. 502. 5.72
We will utilize the pivot_longer() function from tidyr to re-format our data frame.
soda_tax_df_long <- soda_tax_df %>%
dplyr::select(-change) %>% # drop the change variable; we only need the original columns
tidyr::pivot_longer(cols = c(pre_tax, post_tax), names_to = "period", values_to = "soda_drank") %>% # grab columns, put names to new variable period and values to new variable soda_drank
dplyr::mutate(after_tax = ifelse(period == "post_tax", 1, 0)) # create dummy for period
head(soda_tax_df_long, 6)
## # A tibble: 6 x 6
## id district treatment period soda_drank after_tax
## <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 1 Snake Lounge 0 pre_tax 1688. 0
## 2 1 Snake Lounge 0 post_tax 1706. 1
## 3 2 Snake Lounge 0 pre_tax 427. 0
## 4 2 Snake Lounge 0 post_tax 438. 1
## 5 3 Snake Lounge 0 pre_tax 566. 0
## 6 3 Snake Lounge 0 post_tax 560. 1
We can see that under our long format, we have two entries for every individual. We also have our variable after_tax, which represents \(P_t\), where 0 and 1 mark the pre- and post-tax periods respectively. We can now run our regression based on the formula:
did_long <- lm(soda_drank ~ treatment + after_tax + treatment * after_tax, data = soda_tax_df_long) #running our model
did_long_clustered_se <- coeftest(did_long, vcov=vcovHC(did_long,type="HC0",cluster="district")) # clustering our standard errors at the district level
stargazer(did_long_clustered_se, type = "text")
===============================================
                        Dependent variable:
                    ---------------------------

-----------------------------------------------
treatment                      2.827
                              (3.799)

after_tax                    14.967***
                              (3.836)

treatment:after_tax         -149.744***
                              (5.372)

Constant                     508.306***
                              (2.708)

===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
(We can think about the results of this regression in terms of the lab's slide 10: we are mostly interested in the coefficient on the interaction term, which corresponds to \(\beta_{DD}\), our diff-in-diff estimate.)
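As a side note, a common alternative route to cluster-robust standard errors for lm objects is vcovCL() from the sandwich package; this is only a sketch of an equivalent call, not the approach used above:
# Hypothetical alternative: cluster-robust SEs via sandwich::vcovCL()
library(sandwich)
did_long_clustered_alt <- lmtest::coeftest(did_long, vcov = sandwich::vcovCL(did_long, cluster = ~ district)) # cluster at the district level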
soda_tax_df_long %>%
dplyr::group_by(period, treatment) %>% #group by period and treatment
dplyr::summarize(soda_consumed = mean(soda_drank)) %>% #render averages
tidyr::pivot_wider(names_from = period, values_from = soda_consumed) %>%
dplyr::select(treatment, pre_tax, post_tax) %>%
dplyr::arrange(desc(treatment)) %>%
knitr::kable(col.names = c("Treatment", "Pre-tax", "Post-tax"),
digits = 2) %>%
kableExtra::kable_styling(full_width = F) %>%
kableExtra::add_header_above(c("", "Period" = 2)) #add header for period
Treatment | Pre-tax | Post-tax |
---|---|---|
1 | 511.13 | 376.35 |
0 | 508.31 | 523.27 |
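From this table we can recover the diff-in-diff estimate by hand: \((376.35 - 511.13) - (523.27 - 508.31) \approx -134.78 - 14.96 \approx -149.74\) ounces, which matches the interaction coefficient above. In R:
(376.35 - 511.13) - (523.27 - 508.31) # difference-in-differences recovered from the group means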
soda_tax_df %>%
select(-id, -district) %>%
group_by(treatment) %>%
summarize_all(mean) %>% # average pre_tax and post_tax by treatment group
pivot_longer(cols = c(pre_tax, post_tax),
names_to = "period",
values_to = "soda_drank") %>%
mutate(time = ifelse(period == "post_tax", 2, 1)) %>%
group_by(treatment) %>%
ggplot(aes(x = factor(time), y = soda_drank, color = factor(treatment))) +
geom_point() +
geom_line(aes(x = time, y = soda_drank)) +
labs(x = "Time periods", y = "Ounces of soda drank per week", color = "Treatment group")+
theme_minimal()

1.3 Drafting some brief recommendations
Based on your analysis of the data at hand, you decide to recommend that the tax measure should move forward in the rest of Pawnee. You state that it is a very good example of a Pigouvian tax, which captures the negative externalities not included in the market price of sugar-added drinks. The findings suggest that the tax reduced weekly sugar-added drink consumption by about 150 liquid ounces (almost 4.5 liters).
Your evaluation report is so convincing that the Director of the Parks Department, Ron Swanson, is even doubting libertarianism.
