Introduction to Data Manipulation

1. Introduction

Welcome!

Welcome to our second tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.

The labs are designed to reinforce the material covered during the lectures by introducing you to hands-on applications.

The practical nature of our class means that our labs will be data-centered. Throughout our class, we will get acquinted with multiple packages of the tidyverse.

Though we expect that some of you may already know them, the tidyverse is a collection of R packages that share an underlying design, syntax, and structure. They will definitely make your life easier!!

Today, we will start with a brief introduction to data manipulation through the dplyr package.

In this tutorial, you will learn to:

  • identify the purpose of a set of dplyr verbs
  • write statements in tidy syntax
  • apply dplyr verbs to solve your data manipulation challenges

This tutorial is partly based on R for Data Science, section 5.2, and Quantitative Politics with R, chapter 3.


What we will need today

We’ll practice some wrangling in dplyr using data for penguin sizes recorded by Dr. Kristen Gorman and others at several islands in the Palmer Archipelago, Antarctica. Data are originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081

You do not need to import the data to work through this tutorial - the data are already here waiting behind the scenes.

But if you do ever want to use the penguins data outside of this tutorial, they now exist in the palmerpenguins package in R.

Let’s begin!


2. Data Structure

Tidy data

Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table

The penguins data set

The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables:

  • species: a factor denoting the penguin species (Adelie, Chinstrap, or Gentoo)
  • island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed
  • culmen_length_mm: a number denoting length of the dorsal ridge of penguin bill (millimeters)
  • culmen_depth_mm: a number denoting the depth of the penguin bill (millimeters)
  • flipper_length_mm: an integer denoting penguin flipper length (millimeters)
  • body_mass_g: an integer denoting penguin body mass (grams)
  • sex: a factor denoting penguin sex (MALE, FEMALE)
  • year an integer denoting the year of the record

*Illustration by @allisonhorst*

Let’s explore the data set.

head() is a function that returns the first couple rows from a data frame. Write the R code required to explore the first observations of the penguins data set:

Notice that when you press ‘Run,’ the output of the code is returned below it! So by pressing ‘Run,’ you’ve run your first R code of the class!

head(penguins)
species island bill\_length\_mm bill\_depth\_mm flipper\_length\_mm body\_mass\_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007

3. Manipulating data with dplyr

What we will learn today

In this tutorial, you’ll learn and practice examples using some functions in dplyr to work with data. Those are:

  • select(): keep or exclude some columns
  • filter(): keep rows that satisfy your conditions
  • mutate(): add columns from existing data or edit existing columns
  • group_by(): lets you define groups within your data set
  • summarize(): get summary statistics
  • arrange(): reorders the rows according to single or multiple variables

Let’s get to work.


3.1. select()

The first verb (function) we will utilize is select(). We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set there were eight variables attached to every observation. Do you recall them? If you do not, there is no problem. You can utilize names() to retrieve the names of the variables in a data frame.

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Say we are only interested in the species, island, and year variables of these data, we can utilize the following syntax:

select(data, columns)

Activity The following code chunk would select the species, island, and year variables. What should we do to keep the body_mass_g and sex variables as well?

dplyr::select(penguins, species, island, year)
Answer

# you just need to type the names of the columns
dplyr::select(penguins, species, island, year, body_mass_g, sex)

To drop variables, use - before the variable name.

For example, select(penguins, -year) will drop the year column.


3.2. filter()

The second verb (function) we will employ is filter(). filter() lets you use a logical test to extract specific rows from a data frame. To use filter(), pass it the data frame followed by one or more logical tests. filter() will return every row that passes each logical test.

The more commonly used logical operators are:

  • ==: Equal to
  • !=: Not equal to
  • >, >=: Greater than, greater than or equal to
  • <, <=: Less than, less than or equal to
  • &, |: And, or

Say we are interested in retrieving the observations from the year 2007. We would do:

dplyr::filter(penguins, year == 2007)

Activity Can you adapt the code to retrieve all the observations of Chinstrap penguins from 2007 (remember that species contains character units)

Answer

# you just need to utilize & and type the logical operator for the species
dplyr::filter(penguins, year == 2007 & species == "Chinstrap")


3.3. The Pipe Operator: %>%

The pipe, %>%, comes from the magrittr package by Stefan Milton Bache. Packages in the tidyverse load %>% for you automatically, so you don’t usually load magrittr explicitly. This will be one of your best friends in R. >Pipes are a powerful tool for clearly expressing a sequence of multiple operations. Let’s think about baking for a second.


Activity We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and conditional logical operators we discussed?

only_2009 <- dplyr::filter(penguins, year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, year)
final_df <- only_2009_chinstraps_species_sex_year
final_df #to print it in our console
Answer

penguins %>% #we start off with out df
  dplyr::filter(year == 2009 & species == "Chinstrap") %>% #filter
  dplyr::select(species, sex, year) #select


3.4. mutate()

mutate() lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. American client and they feel more confortable with assessing the weight of the penguins in pounds. We would utilize mutate() as such:

mutate(new\_var\_name = conditions)

Activity Can you edit the following code chunk to render a new variable body_mass_kg?

penguins %>%
  dplyr::mutate(body_mass_lbs = body_mass_g/453.6)
Answer

penguins %>%
  dplyr::mutate(body_mass_kg = body_mass_g/1000) #grams divided by 1000 


3.5. group_by() and summarize()

These two verbs group_by() and summarize() tend to go together. When combined , ’summarize()` will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. For example:

  • summarize():
penguins %>%
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = T)) #max() does not know how to deal with NAs very well
heaviest\_penguin
6300
  • group_by() + summarize():
penguins %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = T))
species heaviest\_penguin
Adelie 4775
Chinstrap 4800
Gentoo 6300

Activity Can you get the weight of the lightest penguin of each species? You can use min(). What happens when in addition to species you also group by year group_by(species, year)?

Answers

penguins %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(lightest_penguin = min(body_mass_g, na.rm = T))
species lightest\_penguin
Adelie 2850
Chinstrap 2700
Gentoo 3950
penguins %>%
  dplyr::group_by(species, year) %>%
  dplyr::summarize(lightest_penguin = max(body_mass_g, na.rm = T)) 
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.


3.6. arrange()

The arrange() verb is pretty self-explanatory. arrange() orders the rows of a data frame by the values of selected columns in ascending order. You can use the desc() argument inside to arrange in descending order. The following chunk arranges the data frame based on the length of the penguins’ bill. You hint tab contains the code for the descending order alternative.

arrange(variable\_of\_interest)

penguins %>%
  dplyr::arrange(bill_length_mm)
penguins %>%
  dplyr::arrange(desc(bill_length_mm))

Activity Can you create a data frame arranged by body_mass_g of the penguins observed in the “Dream” island?

Answer

penguins %>%
  dplyr::filter(island == "Dream") %>%
  dplyr::arrange(desc(body_mass_g)) 

Previous
Next