Introduction to Data Manipulation
1. Introduction
Welcome!
Welcome to our second tutorial for the Statistics II: Statistical Modeling & Causal Inference (with R) course.
The labs are designed to reinforce the material covered during the lectures by introducing you to hands-on applications.
The practical nature of our class means that our labs will be data-centered. Throughout our class, we will get acquinted with multiple packages of the tidyverse
.
Though we expect that some of you may already know them, the tidyverse
is a collection of R packages that share an underlying design, syntax, and structure. They will definitely make your life easier!!
Today, we will start with a brief introduction to data manipulation through the dplyr
package.
In this tutorial, you will learn to:
- identify the purpose of a set of
dplyr
verbs - write statements in tidy syntax
- apply
dplyr
verbs to solve your data manipulation challenges
This tutorial is partly based on R for Data Science, section 5.2, and Quantitative Politics with R, chapter 3.
What we will need today
We’ll practice some wrangling in dplyr
using data for penguin sizes recorded by Dr. Kristen Gorman and others at several islands in the Palmer Archipelago, Antarctica. Data are originally published in: Gorman KB, Williams TD, Fraser WR (2014) PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
You do not need to import the data to work through this tutorial - the data are already here waiting behind the scenes.
But if you do ever want to use the penguins data outside of this tutorial, they now exist in the palmerpenguins package in R.
Let’s begin!
2. Data Structure
Tidy data
Generally, we will encounter data in a tidy format. Tidy data refers to a way of mapping the structure of a data set. In a tidy data set:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table
The penguins
data set
The 3 species of penguins in this data set are Adelie, Chinstrap and Gentoo. The data set contains 8 variables:
- species: a factor denoting the penguin species (Adelie, Chinstrap, or Gentoo)
- island: a factor denoting the island (in Palmer Archipelago, Antarctica) where observed
- culmen_length_mm: a number denoting length of the dorsal ridge of penguin bill (millimeters)
- culmen_depth_mm: a number denoting the depth of the penguin bill (millimeters)
- flipper_length_mm: an integer denoting penguin flipper length (millimeters)
- body_mass_g: an integer denoting penguin body mass (grams)
- sex: a factor denoting penguin sex (MALE, FEMALE)
- year an integer denoting the year of the record
*Illustration by @allisonhorst*
Let’s explore the data set.
head()
is a function that returns the first couple rows from a data frame. Write the R code required to explore the first observations of the penguins
data set:
Notice that when you press ‘Run,’ the output of the code is returned below it! So by pressing ‘Run,’ you’ve run your first R code of the class!
head(penguins)
species | island | bill\_length\_mm | bill\_depth\_mm | flipper\_length\_mm | body\_mass\_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
3. Manipulating data with dplyr
What we will learn today
In this tutorial, you’ll learn and practice examples using some functions in dplyr
to work with data. Those are:
select()
: keep or exclude some columnsfilter()
: keep rows that satisfy your conditionsmutate()
: add columns from existing data or edit existing columnsgroup_by()
: lets you define groups within your data setsummarize()
: get summary statisticsarrange()
: reorders the rows according to single or multiple variables
Let’s get to work.
3.1. select()
The first verb (function) we will utilize is select()
. We can employ it to manipulate our data based on columns. If you recall from our initial exploration of the data set there were eight variables attached to every observation. Do you recall them? If you do not, there is no problem. You can utilize names()
to retrieve the names of the variables in a data frame.
names(penguins)
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
Say we are only interested in the species, island, and year variables of these data, we can utilize the following syntax:
Activity The following code chunk would select the species, island, and year variables. What should we do to keep the body_mass_g and sex variables as well?
dplyr::select(penguins, species, island, year)
Answer
# you just need to type the names of the columns
dplyr::select(penguins, species, island, year, body_mass_g, sex)
To drop variables, use - before the variable name.
For example, select(penguins, -year)
will drop the year column.
3.2. filter()
The second verb (function) we will employ is filter()
. filter()
lets you use a logical test to extract specific rows from a data frame. To use filter()
, pass it the data frame followed by one or more logical tests. filter()
will return every row that passes each logical test.
The more commonly used logical operators are:
==
: Equal to!=
: Not equal to>
,>=
: Greater than, greater than or equal to<
,<=
: Less than, less than or equal to&
,|
: And, or
Say we are interested in retrieving the observations from the year 2007. We would do:
dplyr::filter(penguins, year == 2007)
Activity Can you adapt the code to retrieve all the observations of Chinstrap penguins from 2007 (remember that species contains character units)
Answer
# you just need to utilize & and type the logical operator for the species
dplyr::filter(penguins, year == 2007 & species == "Chinstrap")
3.3. The Pipe Operator: %>%
The pipe, %>%
, comes from the magrittr
package by Stefan Milton Bache. Packages in the tidyverse
load %>%
for you automatically, so you don’t usually load magrittr
explicitly. This will be one of your best friends in R.
>Pipes are a powerful tool for clearly expressing a sequence of multiple operations. Let’s think about baking for a second.
Activity We can leverage the pipe operator to sequence our code in a logical manner. Can you adapt the following code chunk with the pipe and conditional logical operators we discussed?
only_2009 <- dplyr::filter(penguins, year == 2009)
only_2009_chinstraps <- dplyr::filter(only_2009, species == "Chinstrap")
only_2009_chinstraps_species_sex_year <- dplyr::select(only_2009_chinstraps, species, sex, year)
final_df <- only_2009_chinstraps_species_sex_year
final_df #to print it in our console
Answer
penguins %>% #we start off with out df
dplyr::filter(year == 2009 & species == "Chinstrap") %>% #filter
dplyr::select(species, sex, year) #select
3.4. mutate()
mutate()
lets us create, modify, and delete columns. The most common use for now will be to create new variables based on existing ones. Say we are working with a U.S. American client and they feel more confortable with assessing the weight of the penguins in pounds. We would utilize mutate()
as such:
Activity Can you edit the following code chunk to render a new variable body_mass_kg?
penguins %>%
dplyr::mutate(body_mass_lbs = body_mass_g/453.6)
Answer
penguins %>%
dplyr::mutate(body_mass_kg = body_mass_g/1000) #grams divided by 1000
3.5. group_by()
and summarize()
These two verbs group_by()
and summarize()
tend to go together. When combined , ’summarize()` will create a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. For example:
summarize()
:
penguins %>%
dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = T)) #max() does not know how to deal with NAs very well
heaviest\_penguin |
---|
6300 |
group_by()
+summarize()
:
penguins %>%
dplyr::group_by(species) %>%
dplyr::summarize(heaviest_penguin = max(body_mass_g, na.rm = T))
species | heaviest\_penguin |
---|---|
Adelie | 4775 |
Chinstrap | 4800 |
Gentoo | 6300 |
Activity
Can you get the weight of the lightest penguin of each species? You can use min()
. What happens when in addition to species you also group by year group_by(species, year)
?
Answers
penguins %>%
dplyr::group_by(species) %>%
dplyr::summarize(lightest_penguin = min(body_mass_g, na.rm = T))
species | lightest\_penguin |
---|---|
Adelie | 2850 |
Chinstrap | 2700 |
Gentoo | 3950 |
penguins %>%
dplyr::group_by(species, year) %>%
dplyr::summarize(lightest_penguin = max(body_mass_g, na.rm = T))
## `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
3.6. arrange()
The arrange()
verb is pretty self-explanatory. arrange()
orders the rows of a data frame by the values of selected columns in ascending order. You can use the desc()
argument inside to arrange in descending order. The following chunk arranges the data frame based on the length of the penguins’ bill. You hint tab contains the code for the descending order alternative.
penguins %>%
dplyr::arrange(bill_length_mm)
penguins %>%
dplyr::arrange(desc(bill_length_mm))
Activity Can you create a data frame arranged by body_mass_g of the penguins observed in the “Dream” island?
Answer
penguins %>%
dplyr::filter(island == "Dream") %>%
dplyr::arrange(desc(body_mass_g))