Welcome!

The practical component of the Statistics II: Statistical Modeling and Causal Inference course relies largely in R programming. Today we will center on some of the necessary skills to perform the assignments for the course. Finally, we will dive into some of the substance of the class by exploring the Potential Outcomes Framework.

In this tutorial, you will learn to:

  • identify the some of the basic functionalities of R and RStudio
  • import data sets into your R environment
  • distinguish the foundations of the Potential Outcomes Framework (POF)
  • apply dplyr verbs to retrieve estimates

R and RStudio: Basics

The RStudio layout

RStudio is an integrated development environment (IDE) for R. Think of RStudio as a front that allows us to interact, compile, and render R code in a more instinctive way. The following image shows what the standard RStudio interface looks like:

  1. Console: The console provides a means to interact directly with R. You can type some code at the console and when you press ENTER, R will run that code. Depending on what you type, you may see some output in the console or if you make a mistake, you may get a warning or an error message.

  2. Script editor: You will utilize the script editor to complete your assignments. The script editor will be the space where files will be displayed. For example, once you download and open the bi-weekly assignment .Rmd template, it will appear here. The editor is a where you should place code you care about, since the code from the console cannot be saved into a script.

  3. Environment: This area holds the abstractions you have created with your code. If you run myresult <- 5+3+2, the myresult object will appear there.

  4. Plots and files: This area will be where graphic output will be generated. Additionally, if you write a question mark before any function, (i.e. ?mean) the online documentation will be displayed here.


R Packages

For the most part, R Packages are collections of code and functions that leverage R programming to expand on the basic functionalities. Last week we met dplyr that aids R programmers in the process of data cleaning and manipulation. There are a plethora of packages in R designed to facilite the completion of tasks. In fact, this tutorial builds on the learnr package that allows us to turn R Markdown documents into interactive tutorials.

Unlike other programming languages, in R you only need to install a package once. The following times you will only need to "require" the package. As a good practice I recommend running the code to install packages only in your R console, not in the code editor. You can install a package with the following syntax

install.packages("name_of_your_package") #note that the quotation marks are mandatory at this stage

Once the package has been installed, you just need to "call it" every time you want to use it in a file by running:

library("name_of_your_package") #either of this lines will require the package
library(name_of_your_package) #library understands the code with, or without, quotation marks

It is extremely important that you do not have any lines installing packages for your assignments because the file will fail to knit


Working directory

The working directory is just a file path on your computer that sets the default location of any files you read into R, or save out of R. Normally, when you open RStudio it will have a default directory (a folder in your computer). You can check you directory by running getwd() in your console:

#this is the default in my case
getwd()
#[1] "/Users/sebastianramirezruiz"

When your RStudio is closed and you open a file from your finder in MacOS or file explorer in Windows, the default working directory will be the folder where the file is hosted


Setting your working directory

You can set you directory manually from RStudio: use the menu to change your working directory under Session > Set Working Directory > Choose Directory.

You can also use the setwd() function:

setwd("/path/to/your/directory") #in macOS
setwd("c:/path/to/your/directory") #in windows

Dealing with errors in R

Errors in R occur when code is used in a way that it is not intended. For example when you try to add two character strings, you will get the following error:

"hello" + "world"
Error in "hello" + "world": non-numeric argument to binary operator

Normally, when something has gone wrong with your program, R will notify you of an error in the console. There are errors that will prevent the code from running, while others will only produce warning messages. In the following case, the code will run, but you will notice that the string "three" is turned into a NA.

as.numeric(c("1", "2", "three"))
Warning: NAs introduced by coercion
[1]  1  2 NA

Since we will be utilizing widely used packages and functions in the course of the semester, the errors that you may come across in the process of completing your assignments will be common for other R users. Most errors occur because of typos. A Google search of the error message can take you a long way as well. Most of the times the first entry on stackoverflow.com will solve the problem.


Importing data

Last week we worked with data provided by the palmerpenguins package; however, most practical applications will require you to work with your own data. In fact, for most assignments, you will be given a data set to work with. As we discussed last week, data are messy. You know what else is messy? Data formats. You may be acquainted with a couple of them (.csv, .tsv, .xlsx).

Fortunately for us, the tidyverse has two packages that make the process of loading data sets from different formats very easy.

  • readr: The goal of readr is to provide a fast and friendly way to read rectangular data (like csv, tsv, rds, and fwf)
  • haven: The goal of haven is to enable R to read and write various data formats used by other statistical packages (like dta, sas, and sav)

You can read more about how to load different types of data in the respective documentations of the packages — readr and haven


For the assignment that went live this week, you will be required to load the earnings-pof.rds. You can do that by installing the readr package and utilizing:

#with readr
your_data_frame <- readr::read_rds("path_for_the_file")

#you can alternatively use a base R option
your_data_frame <- base::readRDS("path_for_the_file")

When in doubt just Google "How to load x_format data in R?" That will do the trick!


The double colon operator ::

You may have noted in the previous section that the functions were preceded by their package name and two colons, for example: readr::read_rds(). The double colon operator :: helps us ensure that we select functions from a particular package. We utilize the operator to explicitly state where the function is coming. This may become even more important when you are doing data analysis as part of a team further in your careers. Though it is likely that this will not be a problem during the course, we can try to employ the following convention package_name::function() to ensure that we will not encounter errors in our knitting process:

dplyr::select()

Let's look at what happens when we load tidyverse.

library(tidyverse)

#── Attaching packages ──────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 #──
#✓ ggplot2 3.3.2     ✓ purrr   0.3.4
#✓ tibble  3.0.3     ✓ dplyr   1.0.2
#✓ tidyr   1.1.2     ✓ stringr 1.4.0
#✓ readr   1.3.1     ✓ forcats 0.5.0
#── Conflicts ─────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() #──
#x dplyr::filter() masks stats::filter()
#x dplyr::lag()    masks stats::lag()

You may notice that R points out some conflicts where some functions are being masked. The default in this machine will become the filter() from the dplyr package during this session. If you were to run some code that is based on the filter() from the stats package, your code will probably result in errors.


The Potential Outcomes Framework

The POF in practice

Let's revisit the example from the lecture once again. Say we are interested in assessing the premise of Allport's hypothesis about interpersonal contact being conducive to reducing intergroup prejudice. We are studying a set of (\(n=8\)) students assigned to a dorm room with a person from their own ethnic group (contact=0) and from a different group (contact=1).

Student (i) Prejudice (C=0) Prejudice (C=1)
1 6 5
2 4 2
3 4 4
4 6 7
5 3 1
6 2 2
7 8 7
8 4 5

The data are already pre-loaded in the prejudice_df object

Data set

Today we will work with the prejudice_df. The data frame contains the following four variables:

  • student_id: numeric student identification
  • prej_0: prejudice level under \(Y_{0i}\) (Contact=0)
  • prej_1: prejudice level under \(Y_{1i}\) (Contact=1)
  • dorm_type: binary for actual treatment state
## # A tibble: 8 x 4
##   student_id prej_0 prej_1 dorm_type
##        <dbl>  <dbl>  <dbl>     <dbl>
## 1          1      6      5         0
## 2          2      4      2         1
## 3          3      4      4         0
## 4          4      6      7         0
## 5          5      3      1         1
## 6          6      2      2         1
## 7          7      8      7         0
## 8          8      4      5         0

Treatment Effects

Individual Treatment Effect (ITE)

We assume from the potential outcomes framework that each subject has a potential outcome under both treatment states. Let's take the first student in the list as an example. %>%

The figure illustrates the potential outcomes for Student 1. We see that in a reality where Student 1 is assigned to in-group dorm (contact=0) their levels of prejudice are 6. On the contrary, in a reality where Student 1 is assigned to co-ethnic dorm (contact=1) their levels of prejudice are 5.

From this illustration, we can gather the individual treatment effect (ITE) for student one. The ITE is equal to the values under treatment (contact=1) minus to the values without treatment (contact=0) or \(ITE = y_{1i} - y_{0i}\).

\[ITE = 5 - 6 = -1\]

As it was put during the lecture:

The ITE is a “comparison of two states of the world” (Cunningham, 2021): individuals are exposed to contact, and not exposed to it.

Evidently, each subject can only be observed in one treatment state at any point in time in real life. This is known as the fundamental problem (Holland, 1986) of causal inference. The Individual Treatment Effect (ITE) in reality is unattainable. Still, it provides us with a conceptual foundation for causal estimation.

Exercise: Our data are coming from a world with perfect information. In that sense, we have both potential outcomes prej_0 and prej_1. Can you think of a way to calculate the ITE for the eight students with one of the dplyr verbs we learned last week?

#you can employ dplyr::mutate() to create the new variable ite
prejudice_df %>% 
  dplyr::mutate(ite = prej_1 - prej_0)
## # A tibble: 8 x 5
##   student_id prej_0 prej_1 dorm_type   ite
##        <dbl>  <dbl>  <dbl>     <dbl> <dbl>
## 1          1      6      5         0    -1
## 2          2      4      2         1    -2
## 3          3      4      4         0     0
## 4          4      6      7         0     1
## 5          5      3      1         1    -2
## 6          6      2      2         1     0
## 7          7      8      7         0    -1
## 8          8      4      5         0     1

Average Treatment Effect (ATE)

Normally, we are not interested in the estimates of individual subjects, but rather a population. The Average Treatment Effect (ATE) is the difference in the average potential outcomes of the population.

\[ATE = E(Y_{1i}) - E(Y_{0i})\]

In other words, the ATE is the average ITE of all the subjects in the population. As you can see, the ATE as defined in the formula is also not attainable. Can you think why?

Exercise: Since our data are coming from a world with perfect information. Can you think of a way to calculate the ATE for the eight students based on what we learned last week?

#we know that the ATE is the averge of every subject's ITE. Do you remember dplyr::summarize()?
#how can we use the verbs from last week to get the average treatment effect?

prejudice_df %>%
  dplyr::mutate(ite = prej_1 - prej_0) %>%
  dplyr::summarize(mean(ite))
## # A tibble: 1 x 1
##   `mean(ite)`
##         <dbl>
## 1        -0.5

The Average Treatment Effect Among the Treated and Untreated (ATT) and (ATU)

The names for these two estimates are very self-explanatory. These two estimates are simply the average treatment effects conditional on the group subjects are assigned to.

The average treatment effect on the treated ATT is defined as the difference in the average potential outcomes for those subjects who were treated: \[ATT = E(Y_{1i}-Y_{0i} | D = 1)\]

The average treatment effect on the untreated ATU is defined as the difference in the average potential outcomes for those subjects who were not treated: \[ATU = E(Y_{1i}-Y_{0i} | D = 0)\]

Exercise: Since our data are coming from a world with perfect information. Can you think of a way to calculate the ATT and ATU for the eight students based on what we learned last week?

#we know that the ATT and ATU are the average of every subject's ITE grouped by their treatment status. Do you remember how the combination of dplyr::group_by() and dplyr::summarize() worked?
#how can we use the verbs from last week to get the average treatment effect on the treated and untreated?

prejudice_df %>%
  dplyr::mutate(ite = prej_1 - prej_0) %>%
  dplyr::group_by(dorm_type) %>%
  dplyr::summarize(mean(ite))
## # A tibble: 2 x 2
##   dorm_type `mean(ite)`
## *     <dbl>       <dbl>
## 1         0        0   
## 2         1       -1.33

What do you think these treatment group differences tell us?


The Näive Average Treatment Effect (NATE)

So far, we have worked with perfect information. Still, we know that in reality we can only observe subjects in one treatment state. This is the information we do have.

The Näive Average Treatment Effect (NATE) is the calculation we can compute based on the observed outcomes.

\[NATE = E(Y_{1i}|D{i}=1) - E(Y_{0i}|D{i}=0)\] *reads in English as: "The expected average outcome under treatment for those treated minus the expected average outcome under control for those not treated"

Exercise: Can you think of a way to calculate the NATE for the eight students employing the new observed_prej variable?

#we know that the NATE is the difference in average observed outcomes grouped by their treatment status. Do you remember how the combination of dplyr::group_by() and dplyr::summarize() worked?

prejudice_df %>%
  dplyr::mutate(observed_prej = ifelse(dorm_type == 1, prej_1, prej_0)) %>%
  dplyr::group_by(dorm_type) %>%
  dplyr::summarize(mean(observed_prej))
## # A tibble: 2 x 2
##   dorm_type `mean(observed_prej)`
## *     <dbl>                 <dbl>
## 1         0                  5.6 
## 2         1                  1.67

Note. The ìfelse() function is a very handy tool to have. It allows us to generate conditional statements. The syntax is the following:

ifelse(condition_to_meet, what_to_do_if_met, what_to_do_if_not_met)

In the case of observed_prej, we ask R to create a new variable, where if the subject is in a co-ethnic dorm, we print the prejudice value under treatment. If that condition is not met, we print the prejudice value under control.


Bias

Bias

During the lecture, we met two sources of bias:


Selection bias

Selection bias is difference in expected outcomes in the absence of treatment for the actual treatment and control group. In other words, these are the underlying differences that individuals in either group start off with.


Heterogeneous Treatment Effect (HTE) bias

Heterogeneous Treatment Effect (HTE) bias is the difference in returns to treatment (the treatment effect) between the treatment and control group, multiplied by the share of the population in control. In other words, this type of bias relates to the dissimilarities stemming for ways in which individuals in either group are affected differently by the treatment.

Exercise: Since our data are coming from a world with perfect information. Can you think of a way to explore the existence selection bias in our data?

## # A tibble: 8 x 4
##   student_id prej_0 prej_1 dorm_type
##        <dbl>  <dbl>  <dbl>     <dbl>
## 1          1      6      5         0
## 2          2      4      2         1
## 3          3      4      4         0
## 4          4      6      7         0
## 5          5      3      1         1
## 6          6      2      2         1
## 7          7      8      7         0
## 8          8      4      5         0

Exercise: Since our data are coming from a world with perfect information. Can you think of a way to explore the existence HTE bias in our data?

## # A tibble: 8 x 4
##   student_id prej_0 prej_1 dorm_type
##        <dbl>  <dbl>  <dbl>     <dbl>
## 1          1      6      5         0
## 2          2      4      2         1
## 3          3      4      4         0
## 4          4      6      7         0
## 5          5      3      1         1
## 6          6      2      2         1
## 7          7      8      7         0
## 8          8      4      5         0