Panel Data and Fixed Effects

Our first steps in Panel Data and Fixed Effects with R

Welcome

Introduction!

Welcome to our ninth tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

During this week’s lecture you were introduced to Panel Data and Fixed Effects.

In this lab session we will:

  • Create three-way tables with janitor::tabyl()
  • Visualize trends across time with ggplot2
  • Learn how to extract our pooled, unit-fixed, and unit- and time-fixed effects estimates with Least Squares Dummy Variables (LSDV) estimation (with lm() and the de-meaning approach with plm())

Measuring the effect of a carbon tax on national carbon dioxide emissions per capita

After all the very valuable output you have generated in the past weeks, you have become a sensation within the policy analysis world. You are hired as an outside consultant by the Organization of Economic Non-Cooperation for Development (OENCD), they are interested in studying the effect of a carbon tax on national carbon dioxide emissions per capita. You are provided with data for the twenty-members of the organization from 2009 to 2019. The data can be called a balanced panel based on the description given in the lecture (i.e. each panel member is observed every year)


Packages

# These are the libraries we will use today. Make sure to install them in your console in case you have not done so previously.

set.seed(42) #for consistent results

library(dplyr) # to wrangle our data
library(tidyr) # to wrangle our data - pivot_longer()
library(lubridate) # for working with dates-times in a more intuitive manner
library(ggplot2) # to render our graphs
library(readr) # for loading the .csv data
library(kableExtra) # to render better formatted tables
library(stargazer) # for formatting your model output
library(janitor) # for data management and tabyl()
library(lmtest) # to gather our clustered standard errors - coeftest()
library(plm)  # to gather our clustered standard errors - vcovHC() and plm()

Our data

carbon_tax_df <- readr::read_csv("https://raw.githubusercontent.com/seramirezruiz/stats-ii-lab/master/Session%207/data/carbon_tax_df.csv") # simulated data

Our dataset carbon_tax_df, contains the following information:

  • country_name: Name of the country
  • country_code: Three-letter country code
  • year: Year
  • tax: Dummy for whether the carbon tax was in place
  • income_class: Categorical variable with income label (Low to High)
  • co2_per_capita: carbon dioxide emissions per capita in metric tons (T)

Exploratory analysis

Let’ explore who had the tax in place at what point

We can use what we have learned about the janitor::tabyl() function to check how many observations we have for each of the countries:

  • One-way table:
carbon_tax_df %>%
  janitor::tabyl(country_name) %>%
  janitor::adorn_pct_formatting(digits = 1) %>%
  knitr::kable(col.names = c("Country", "N", "Percent, %")) %>%
  kableExtra::kable_styling()
Country N Percent, %
Adjikistan 11 8.3%
Borovia 11 8.3%
Carpania 11 8.3%
Corinthia 11 8.3%
Freedonia 11 8.3%
Glenraven 11 8.3%
Khemed 11 8.3%
Laurania 11 8.3%
Parano 11 8.3%
Ron 11 8.3%
Rumekistan 11 8.3%
Transia 11 8.3%

We can also explore how many countries had a tax in place every year.

  • Two-way table:
carbon_tax_df %>%
  janitor::tabyl(tax, year) %>%
  janitor::adorn_totals("row") %>%
  janitor::adorn_percentages("col") %>%
  janitor::adorn_pct_formatting(digits = 1) %>%
  janitor::adorn_ns() %>% 
  knitr::kable() %>%
  kableExtra::kable_styling() %>%
  kableExtra::add_header_above(c("", "Year" = 11))
Year
tax 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 100.0% (12) 100.0% (12) 83.3% (10) 75.0% (9) 66.7% (8) 66.7% (8) 41.7% (5) 33.3% (4) 33.3% (4) 33.3% (4) 41.7% (5)
1 0.0% (0) 0.0% (0) 16.7% (2) 25.0% (3) 33.3% (4) 33.3% (4) 58.3% (7) 66.7% (8) 66.7% (8) 66.7% (8) 58.3% (7)
Total 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12) 100.0% (12)

We can further explore how many countries had a tax in place every year at the income_class cluster:

  • Three-way table:
carbon_tax_df %>%
  janitor::tabyl(tax, year, income_class) %>%
  janitor::adorn_percentages("col") %>%
  janitor::adorn_pct_formatting(digits = 1) %>%
  janitor::adorn_ns() 
## $High
##  tax       2009       2010      2011      2012      2013      2014      2015
##    0 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 33.3% (1) 33.3% (1) 33.3% (1)
##    1   0.0% (0)   0.0% (0) 33.3% (1) 33.3% (1) 66.7% (2) 66.7% (2) 66.7% (2)
##       2016      2017      2018      2019
##  33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1)
##  66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)
## 
## $`High-Middle`
##  tax       2009       2010      2011      2012      2013      2014       2015
##    0 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)   0.0% (0)
##    1   0.0% (0)   0.0% (0) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1) 100.0% (3)
##        2016       2017       2018      2019
##    0.0% (0)   0.0% (0)   0.0% (0) 33.3% (1)
##  100.0% (3) 100.0% (3) 100.0% (3) 66.7% (2)
## 
## $Low
##  tax       2009       2010       2011       2012       2013       2014
##    0 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3)
##    1   0.0% (0)   0.0% (0)   0.0% (0)   0.0% (0)   0.0% (0)   0.0% (0)
##        2015      2016      2017      2018      2019
##  100.0% (3) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)
##    0.0% (0) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1)
## 
## $`Low-Middle`
##  tax       2009       2010       2011      2012      2013      2014      2015
##    0 100.0% (3) 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 66.7% (2) 33.3% (1)
##    1   0.0% (0)   0.0% (0)   0.0% (0) 33.3% (1) 33.3% (1) 33.3% (1) 66.7% (2)
##       2016      2017      2018      2019
##  33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1)
##  66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)

As we can see, the three-way table creates a list containing the different combinations which are accesible with the $ operator. We can do a little bit of code gymnastics to generate our nicely formated output by saving the janitor::tabyl() output into an object and printing each of the tables individually.

three_way_tab <- carbon_tax_df %>%
  janitor::tabyl(tax, year, income_class) %>%
  janitor::adorn_percentages("col") %>%
  janitor::adorn_pct_formatting(digits = 1) %>%
  janitor::adorn_ns() 
High
Year
tax 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1)
1 0.0% (0) 0.0% (0) 33.3% (1) 33.3% (1) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)
High-Middle
Year
tax 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) 33.3% (1)
1 0.0% (0) 0.0% (0) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1) 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 66.7% (2)
Low-Middle
Year
tax 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 100.0% (3) 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 66.7% (2) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1)
1 0.0% (0) 0.0% (0) 0.0% (0) 33.3% (1) 33.3% (1) 33.3% (1) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)
Low
Year
tax 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 100.0% (3) 66.7% (2) 66.7% (2) 66.7% (2) 66.7% (2)
1 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) 0.0% (0) 33.3% (1) 33.3% (1) 33.3% (1) 33.3% (1)

Let’s explore visually the levels of carbon dioxide emmissions

ggplot(carbon_tax_df, aes(x = factor(year),
                          y= co2_per_capita, 
                          color = factor(tax))) +
  geom_point() + #create scatterplot
  labs(title = "Exploratory plot of CO2 emissions per capita",
       x = "Year",
       y = "CO2 emissions in metric tons (T)",
       color = "Carbon tax") +
  theme_minimal() +
  theme(legend.position="bottom") + #move legend to the bottom
  scale_color_manual(name = " ", # changes to fill dimension
                     values = c("#a7a8aa", "#cc0055"),
                     labels = c("Control", "Treatment"))

  • What do we see here?

carbon_tax_df %>%
  dplyr::mutate(year_as_date = lubridate::ymd(year, truncated = 2L), #turning numeric to date format
                income_class = factor(carbon_tax_df$income_class, 
                                      levels = c("High", "High-Middle",
                                                 "Low-Middle", "Low"))) %>% #reordering the factors
  ggplot(., aes(x = year_as_date, 
                y= co2_per_capita, 
                color = factor(tax))) +
  geom_point() + #create scatterplot
  geom_path(aes(group = 1)) + #to render consecutive lines disregarding the tax (you will likely use geom_line() for the assignment)
  facet_wrap(country_name~income_class) + #to split plot into panels based on this dimension
  scale_x_date(date_labels = "%Y") + #telling ggplot that we want the ticks to be the years in the dates
  labs(title = "Exploratory plot of CO2 emissions per capita",
       x = "Year",
       y = "CO2 emissions in metric tons (T)",
       color = "Carbon tax") +
  theme_bw() + 
  scale_color_manual(name = " ", # changes to fill dimension
                     values = c("#a7a8aa", "#cc0055"),
                     labels = c("Control", "Treatment"))

  • What do we see here?

Note: The exploratory data analysis portions of our scripts will not transfer directly to this week’s assignment; however, they will be very useful for your future endevours. Summarizing, graphing, and exploring your data will be critical to discover patterns, to spot anomalies, and to check assumptions


Modeling and estimating

We have seen during the lecture that balanced panel data can help us decompose the error term. With a balanced panel we can capture all unobserved, unit- and time-specific factors.

In the example at hand, we can think of unit-specific factors as characteristics of individual countries that are constant over time (e.g. a country that just loves big pick-up trucks). We can also think about time-specific factors that affect all countries (e.g. global economic shocks).

We can formulate this as:

\[Y_{it} = β_0 + β_1D_{it} + \theta_{i} + \delta_t + \upsilon_{it}\]

where \(\theta_i\) reflects the time-invariant traits of the units, \(\delta_t\) reflects the time-specific factors that affect everyone and \(\upsilon_{it}\) is the idiosyncratic error


We will move forward by creating three models:

  • A naive model (also known as a pooled model), where we will regress co2_per_capita on tax.
  • A model with unit-fixed effects, where we will capture the \(\theta\) portion of our error
  • A model with time- and unit-fixed effects, where we will capture our \(\delta\) and \(\theta\) portions of our error term

Naive modeling

naive_carbon <- lm(co2_per_capita ~ tax, data = carbon_tax_df)
pooled_naive_carbon <- plm(co2_per_capita ~ tax, 
                           data = carbon_tax_df, 
                           model = "pooling")

Naive model with lm()

modelsummary::modelsummary(naive_carbon, 
                           fmt = 2,
                           gof_omit = "AIC|BIC|Log.Lik.",
                           statistic = "conf.int",
                           stars = T)
Model 1
(Intercept) 10.63***
[9.93, 11.32]
tax -6.26***
[-7.38, -5.14]
Num.Obs. 132
R2 0.485
R2 Adj. 0.481
F 122.394
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Naive model with plm()

modelsummary::modelsummary(pooled_naive_carbon, 
                           fmt = 2,
                           gof_omit = "AIC|BIC|Log.Lik.")
Model 1
(Intercept) 10.63
(0.35)
tax -6.26
(0.57)
Num.Obs. 132
R2 0.485
R2 Adj. 0.481

What do we see here?

This model is telling us that on average, the \(CO_2\) emissions per capita are reduced by 6.2 metric tons when a carbon tax is put in place. Still, after all the work we have done throughout the semester, we understand that there may be a plethora of factors that could be skewing the results of this bivariate regression.


Unit-fixed effects

We will learn two ways of gathering unit- and time-fixed effects in R:

First, we will perform Least Squares Dummy Variables (LSDV) estimation with lm(), where we essentially get an individual estimate for each unit.

Second, we will run our model with plm(), which will do the same mechanics, yet it will not estimate each of the units intercepts (because it relies on the “de-meaning” approach).

lsdv_unit_fe <- lm(co2_per_capita ~ tax + country_name, data = carbon_tax_df)

unit_fe <- plm(co2_per_capita ~ tax, 
               data = carbon_tax_df, 
               index = c("country_name"), # unit 
               effect = "individual",
               model = "within")

Unit-fixed effects with lm() — Least Squares Dummy Variables (LSDV) estimation

modelsummary::modelsummary(lsdv_unit_fe, 
                           fmt = 2,
                           gof_omit = "AIC|BIC|Log.Lik.",
                           statistic = "conf.int",
                           stars = T)
Model 1
(Intercept) 6.91***
[5.67, 8.15]
tax -4.44***
[-5.37, -3.50]
country_nameBorovia 1.51
[-0.30, 3.31]
country_nameCarpania 5.11***
[3.30, 6.91]
country_nameCorinthia 6.43***
[4.68, 8.19]
country_nameFreedonia 6.12***
[4.33, 7.91]
country_nameGlenraven 0.37
[-1.51, 2.26]
country_nameKhemed -0.67
[-2.56, 1.21]
country_nameLaurania 1.53
[-0.32, 3.39]
country_nameParano 3.69***
[1.94, 5.45]
country_nameRon 2.71**
[0.91, 4.52]
country_nameRumekistan 7.43***
[5.68, 9.19]
country_nameTransia 1.88+
[-0.03, 3.80]
Num.Obs. 132
R2 0.797
R2 Adj. 0.776
F 38.844
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Unit-fixed effects with plm()

modelsummary::modelsummary(unit_fe, 
                           fmt = 2,
                           gof_omit = "AIC|BIC|Log.Lik.",
                           statistic = "conf.int",
                           stars = T)
Model 1
tax -4.44***
[-5.36, -3.51]
Num.Obs. 132
R2 0.424
R2 Adj. 0.366
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

What do we see here?

Adding unit-level fixed effects to the model, i.e. accounting for unobserved, time-invariant characteristics of countries and only focusing on within-state variation, the imposition of a carbon tax reduces \(CO_2\) per capita emissions by 4.44 metric tons. Once we have captured the variation between countries, we can see that our results from the naive model were substantially biased. We can still try to capture the time-specific portion of the error.

The results from the Least Squares Dummy Variables (LSDV) estimation are read in reference to a baseline. In this case, the constant is representing the intercept for Adjikistan. We can utilize the individual slopes for each country to say that Freedonians emit on average 6.12 more metric tons of \(CO_2\) per capita than Adjikistanians.


Unit- and time-fixed effects

We will perform our regressions with Least Squares Dummy Variables (LSDV) estimation with lm() and our simplified way with plm().

lsdv_unit_time_fe <- lm(co2_per_capita ~ tax + country_name + factor(year), 
                        data = carbon_tax_df)

unit_time_fe <- plm(co2_per_capita ~ tax, 
                    data = carbon_tax_df, 
                    index = c("country_name", "year"), # unit and time
                    model = "within", 
                    effect = "twoways")

Unit- and time-fixed effects with lm() — Least Squares Dummy Variables (LSDV) estimation

modelsummary::modelsummary(lsdv_unit_time_fe, 
                           fmt = 2,
                           gof_omit = "AIC|BIC|Log.Lik.",
                           statistic = "conf.int",
                           stars = T)
Model 1
(Intercept) 8.62***
[7.84, 9.41]
tax -3.91***
[-4.46, -3.35]
country_nameBorovia 1.27**
[0.44, 2.09]
country_nameCarpania 4.87***
[4.04, 5.69]
country_nameCorinthia 6.43***
[5.65, 7.22]
country_nameFreedonia 5.93***
[5.11, 6.74]
country_nameGlenraven -0.01
[-0.90, 0.87]
country_nameKhemed -1.06*
[-1.94, -0.17]
country_nameLaurania 1.20**
[0.33, 2.06]
country_nameParano 3.69***
[2.90, 4.48]
country_nameRon 2.47***
[1.65, 3.30]
country_nameRumekistan 7.43***
[6.64, 8.22]
country_nameTransia 1.45**
[0.54, 2.36]
factor(year)2010 -4.70***
[-5.45, -3.94]
factor(year)2011 1.89***
[1.13, 2.65]
factor(year)2012 -3.06***
[-3.83, -2.29]
factor(year)2013 0.11
[-0.67, 0.88]
factor(year)2014 -1.88***
[-2.65, -1.10]
factor(year)2015 -3.02***
[-3.84, -2.20]
factor(year)2016 -3.29***
[-4.13, -2.45]
factor(year)2017 -0.96*
[-1.80, -0.12]
factor(year)2018 -2.49***
[-3.33, -1.65]
factor(year)2019 -1.40**
[-2.22, -0.58]
Num.Obs. 132
R2 0.963
R2 Adj. 0.955
F 127.305
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Unit-fixed effects with plm()

modelsummary::modelsummary(unit_time_fe, 
                           fmt = 2,
                           gof_omit = "AIC|BIC|Log.Lik.",
                           statistic = "conf.int",
                           stars = T)
Model 1
tax -3.91***
[-4.46, -3.36]
Num.Obs. 132
R2 0.641
R2 Adj. 0.568
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

What do we see here?

Now in addition to adding unit-level fixed effects to the model, we control for time-specific factors that affect the global per capita \(CO_2\) emissions. The results suggest that the effect of a carbon-tax leads to a decrease \(CO_2\) emissions of 3.91 metric tons per capita.


Test of serial correlation for the idiosyncratic component of the errors in panel models

In our models our assumption for the errors is \(υ_{it} ∼ iid(0, σ_υ^{2})\).

With longer panels, serial correlation between errors is a problem:

\(Cor(υ_{it}, υ_i(t−1))≠0\).

We can test for serial correlation in our time and unit FE specification model, as such:

pbgtest(unit_time_fe, order = 2)
## 
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  co2_per_capita ~ tax
## chisq = 0.25101, df = 2, p-value = 0.8821
## alternative hypothesis: serial correlation in idiosyncratic errors

In here, the null hypothesis is there is serial correlation of any order up to 2 (i.e., first- and second-order). In this case, we do not find any serial correlation, so we do not need to correct our standard errors manually. If we were to find serial correlation, we could introduce robust standard errors with a similar syntax to the one used last week for clustered standard errors:

model_with_robust_se <- coeftest(unit_time_fe, vcov=vcovHC(unit_time_fe, type = "sss"))

Drafting some brief recommedations

You report back to the Organization of Economic Non-Cooperation for Development (OENCD). Based on your analysis of the data at hand, you suggest that the implementation of a carbon tax does have an effect on national carbon dioxide emissions per capita. Your results show that a carbon tax reduces \(CO_2\) emissions by 3.91 metric tons per capita.

Your results are welcomed internationally and all states move forward with the measure.