1 Installing Tidyverse

If you haven’t done so yet, you need to install the tidyverse collection of packages. To do so, simply call the following line. (Note that this will install several R-packages and take a lot of time if you haven’t already installed these packages.)

install.packages("tidyverse")

If you already have tidyverse installed, make sure that all of your packages are up to date, either by calling

update.packages(ask = FALSE) # ask = FALSE is to avoid being prompted

or by going to the Packages tab in the lower-right RStudio pane and clicking on Update.

2 Importing Data into R

R is extremely versatile when it comes to handling data of different formats and types. Much of this functionality is made available via R-packages, which let you import data into R from Microsoft Excel, Stata, SAS, or SPSS files in addition to standard file types such as comma-separated files (.csv) and tab-separated files (.tsv). In this course, most of the data sets we use will be available directly through R and R packages, but knowing how to import data directly is a useful skill.1

First we need to find some data to import. Download the US Cars dataset that we have provided in the git repository for the course. The data is available here (this link may not work if you’re browsing this page from inside Canvas). Alternatively, you can call the following lines of code to download the dataset.

# check that we have a data folder in our working directory, and if not create
# one
if (!dir.exists("data")) {
  dir.create("data")
}

# download the dataset to data/us_cars.csv
download.file(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/us_cars.csv",
  file.path("data", "us_cars.csv"),
  mode = "wb"
)

Now we load this dataset into R via the readr package (part of tidyverse). This is a comma-separated file (note the file extension), so we use the read_csv() function.2

library(tidyverse)
us_cars <- read_csv(file.path("data", "us_cars.csv"))

When you call read_csv(), the readr package helpfully prints a message to your console with information about how the columns in the dataset were formatted when you imported the data. Take a look at this information—does everything look okay?
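
If the guessed column types are not what you want, one option is to state them explicitly. The sketch below is only an illustration: the inline CSV text and the name cars_small are made up for this example (they stand in for us_cars.csv), and I() marks the string as literal data rather than a file path, which assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical three-column excerpt standing in for data/us_cars.csv
csv_text <- "price,brand,year\n6300,toyota,2008\n2899,ford,2011\n"

# State the column types up front instead of relying on readr's guesses
cars_small <- read_csv(
  I(csv_text),
  col_types = cols(
    price = col_double(),
    brand = col_character(),
    year = col_integer() # store year as an integer rather than a double
  )
)

# Show the column specification that was used for the import
spec(cars_small)
```

Because col_types is supplied here, readr has no reason to print its guessing message; spec() lets you inspect the specification afterwards.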

An alternative to this approach is to use readr to read data directly from a URL into R. To do so, you simply use the URL instead of the file name in the call to read_csv(), like this.

us_cars <- read_csv(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/us_cars.csv"
)

2.1 Taking a Glimpse

Whenever you start working with new data, always start by taking a look at the data in raw form. The best first step is usually just to print the data to the console, either directly or by calling head() (to see the first few lines).

us_cars
## # A tibble: 2,499 × 12
##    price brand model  year title_status mileage color vin      lot state country
##    <dbl> <chr> <chr> <dbl> <chr>          <dbl> <chr> <chr>  <dbl> <chr> <chr>  
##  1  6300 toyo… crui…  2008 clean vehic…  274117 black jtez… 1.59e8 new … usa    
##  2  2899 ford  se     2011 clean vehic…  190552 silv… 2fmd… 1.67e8 tenn… usa    
##  3  5350 dodge mpv    2018 clean vehic…   39590 silv… 3c4p… 1.68e8 geor… usa    
##  4 25000 ford  door   2014 clean vehic…   64146 blue  1ftf… 1.68e8 virg… usa    
##  5 27700 chev… 1500   2018 clean vehic…    6654 red   3gcp… 1.68e8 flor… usa    
##  6  5700 dodge mpv    2018 clean vehic…   45561 white 2c4r… 1.68e8 texas usa    
##  7  7300 chev… pk     2010 clean vehic…  149050 black 1gcs… 1.68e8 geor… usa    
##  8 13350 gmc   door   2017 clean vehic…   23525 gray  1gks… 1.68e8 cali… usa    
##  9 14600 chev… mali…  2018 clean vehic…    9371 silv… 1g1z… 1.68e8 flor… usa    
## 10  5250 ford  mpv    2017 clean vehic…   63418 black 2fmp… 1.68e8 texas usa    
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>

If you want to see more rows of the data than are printed by default, try to call print() with the n argument set to the number of rows you want, or use head() in the same way.
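
As a small sketch of both approaches, the tibble numbers below is a made-up stand-in for us_cars, so the example runs without downloading anything.

```r
library(tibble)

# Hypothetical tibble with 30 rows, standing in for us_cars
numbers <- tibble(id = 1:30, square = id^2)

print(numbers, n = 15) # print the first 15 rows instead of the default 10
first_rows <- head(numbers, 15) # keep the first 15 rows as a new tibble
first_rows
```

The difference is that print() only changes what is displayed, whereas head() returns a shorter tibble that you can store and keep working with.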

Another useful function, particularly when there are many columns in the dataset, is glimpse().

glimpse(us_cars)
## Rows: 2,499
## Columns: 12
## $ price        <dbl> 6300, 2899, 5350, 25000, 27700, 5700, 7300, 13350, 14600,…
## $ brand        <chr> "toyota", "ford", "dodge", "ford", "chevrolet", "dodge", …
## $ model        <chr> "cruiser", "se", "mpv", "door", "1500", "mpv", "pk", "doo…
## $ year         <dbl> 2008, 2011, 2018, 2014, 2018, 2018, 2010, 2017, 2018, 201…
## $ title_status <chr> "clean vehicle", "clean vehicle", "clean vehicle", "clean…
## $ mileage      <dbl> 274117, 190552, 39590, 64146, 6654, 45561, 149050, 23525,…
## $ color        <chr> "black", "silver", "silver", "blue", "red", "white", "bla…
## $ vin          <chr> "jtezu11f88k007763", "2fmdk3gc4bbb02217", "3c4pdcgg5jt346…
## $ lot          <dbl> 159348797, 166951262, 167655728, 167753855, 167763266, 16…
## $ state        <chr> "new jersey", "tennessee", "georgia", "virginia", "florid…
## $ country      <chr> "usa", "usa", "usa", "usa", "usa", "usa", "usa", "usa", "…
## $ condition    <chr> "10 days left", "6 days left", "2 days left", "22 hours l…

3 Wrangling

3.1 Pivoting

One of the most important data wrangling tools when it comes to data visualization is pivoting, which can be used to transform messy data into tidy data, which in turn will make visualization much easier for us.

Let’s see what this can entail by looking at a simple dataset, table2 from the tidyr package, which contains information on tuberculosis cases in a few countries in 1999 and 2000.

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

This data is not tidy—but why not? Take a moment to consider this for yourself before moving on.

Did you figure it out? The problem is that cases and population are two different variables but don’t have their own separate columns. To fix this, we need to reshape the data by pivoting it to a wider form using pivot_wider(). Before you continue, take a look at the documentation for the function by calling help(pivot_wider) (or ?pivot_wider) in R and see if you can make sense of the manual entry for the function.

The function has many arguments but we only need to concern ourselves with data, names_from, and values_from right now. names_from should indicate which column stores the names of the variables we want to pivot, while values_from should indicate which column stores those variables’ values. Putting this together, we get the following.

table2_tidy <-
  pivot_wider(
    table2,
    names_from = "type",
    values_from = "count"
  )

table2_tidy
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Now it’s easy to visualize this data!

ggplot(table2_tidy, aes(year, cases, fill = country)) +
  geom_col(position = "dodge")

Figure 1: Tuberculosis cases in Afghanistan, Brazil, and China in 1999 and 2000

Sometimes it’s necessary to pivot in the other direction: from wide to long, in which case we need to use pivot_longer() instead. pivot_longer() is exactly the inverse of pivot_wider(). Just like you did for pivot_wider(), read the documentation for pivot_longer() and then see if you can figure out by yourself how to pivot table2_tidy back into table2.

When we use pivot_longer() we need to tell R what columns we want to stack, which we do through the cols argument. Here we simply specify the names of the columns (cases and population), and then use names_to and values_to with whatever names we think are appropriate.

table2_messy <-
  pivot_longer(
    table2_tidy,
    cols = c("cases", "population"),
    names_to = "type",
    values_to = "count"
  )
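
A second case where pivot_longer() is needed is when the values of a variable have ended up as column names. As a sketch, the table4a dataset that ships with tidyr stores the same tuberculosis counts with one column per year; the name table4a_long is just chosen for this example.

```r
library(tidyr)

# table4a stores one column of case counts per year (columns `1999` and `2000`)
table4a_long <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`), # backticks are needed for non-syntactic names
  names_to = "year",
  values_to = "cases"
)

table4a_long
```

Note that year ends up as a character column here, because its values come from column names; you would typically convert it with as.integer() before plotting.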

3.2 Manipulation

In the tidyverse vocabulary, filtering refers to the process of selecting a subset of rows (observations if the data is tidy) from your dataset, whereas selecting means selecting a subset of columns (variables if the data is tidy). We use the aptly named filter() and select() respectively for these tasks.

3.3 Filtering

filter() is used on a dataset (tidyverse functions typically take the data as their first argument) together with a number of logical expressions that specify which rows to keep in the dataset. Recall that a logical expression is a binary expression that relates the left-hand side to the right-hand side in some way, for instance to check equality or inequality, like so:

c(1, 2, 3) < c(0.2, 0.5, 3.8)
## [1] FALSE FALSE  TRUE
1:3 == 3:1
## [1] FALSE  TRUE FALSE
c("a", "b", "c") %in% c("a", "c")
## [1]  TRUE FALSE  TRUE

Let’s say that we only wanted to look at cars from Tennessee from years 2015 and onward. Then we can use filter() in the following way.

filter(us_cars, state == "tennessee", year >= 2015)
## # A tibble: 17 × 12
##    price brand model  year title_status mileage color vin      lot state country
##    <dbl> <chr> <chr> <dbl> <chr>          <dbl> <chr> <chr>  <dbl> <chr> <chr>  
##  1 31900 chev… 1500   2018 clean vehic…   22909 black 3gcu… 1.68e8 tenn… usa    
##  2 16500 buick enco…  2018 clean vehic…   20002 red   kl4c… 1.68e8 tenn… usa    
##  3 19200 chev… 1500   2018 clean vehic…    2430 white 1gcn… 1.68e8 tenn… usa    
##  4 30500 chev… 1500   2018 clean vehic…   30442 red   3gcu… 1.68e8 tenn… usa    
##  5 19000 chev… mali…  2017 clean vehic…   18414 white 1g1z… 1.68e8 tenn… usa    
##  6 27000 buick encl…  2017 clean vehic…   32107 no_c… 5gak… 1.68e8 tenn… usa    
##  7 29400 bmw   x3     2017 clean vehic…   23765 black 5uxw… 1.68e8 tenn… usa    
##  8  8500 ford  fusi…  2017 clean vehic…   78739 black 3fa6… 1.68e8 tenn… usa    
##  9 30300 ford  f-150  2019 clean vehic…   31899 white 1fte… 1.68e8 tenn… usa    
## 10 39200 ford  max    2019 clean vehic…   43617 black 1fmj… 1.68e8 tenn… usa    
## 11 41200 ford  srw    2019 clean vehic…   25638 gray  1ft7… 1.68e8 tenn… usa    
## 12 29000 ford  door   2017 clean vehic…   42291 black 1fte… 1.68e8 tenn… usa    
## 13 14500 niss… rogue  2017 clean vehic…    8677 blue  knma… 1.68e8 tenn… usa    
## 14 18200 niss… sport  2018 clean vehic…   28009 black jn1b… 1.68e8 tenn… usa    
## 15 10900 niss… sent…  2018 clean vehic…   28880 silv… 3n1a… 1.68e8 tenn… usa    
## 16 12600 niss… sent…  2017 clean vehic…   11837 gray  3n1a… 1.68e8 tenn… usa    
## 17 12700 niss… door   2017 clean vehic…   17738 black 1n4a… 1.68e8 tenn… usa    
## # ℹ 1 more variable: condition <chr>
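
Conditions separated by commas are combined with a logical AND. To match several values at once, or to express an OR, you can use %in% and | inside filter(). The sketch below uses a small made-up tibble, cars_toy, so that it runs on its own.

```r
library(dplyr)

# Hypothetical miniature version of us_cars
cars_toy <- tibble(
  brand = c("ford", "dodge", "bmw", "ford"),
  state = c("texas", "georgia", "tennessee", "tennessee"),
  year = c(2017, 2018, 2017, 2011)
)

# Cars from Texas or Tennessee AND from 2015 onward
recent_south <- filter(
  cars_toy,
  state %in% c("texas", "tennessee"),
  year >= 2015
)

# Cars that are from Georgia OR older than 2015
georgia_or_old <- filter(cars_toy, state == "georgia" | year < 2015)

recent_south
georgia_or_old
```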

3.4 Selecting

If you think of filtering as slicing a dataset horizontally, select() does the opposite, slicing a dataset vertically, by selecting a subset of the columns. The interface is similar to filter()’s. You begin with the data and then use the remaining arguments to specify which columns you want to keep.

There is a plethora of possibilities when it comes to select(). We’ll try to cover some of the most common ones here, but see the documentation for select() if you want to learn more.

The simplest option is simply to list the columns you want to keep.

select(us_cars, brand, "model") # notice that you may omit quotation marks
## # A tibble: 2,499 × 2
##    brand     model  
##    <chr>     <chr>  
##  1 toyota    cruiser
##  2 ford      se     
##  3 dodge     mpv    
##  4 ford      door   
##  5 chevrolet 1500   
##  6 dodge     mpv    
##  7 chevrolet pk     
##  8 gmc       door   
##  9 chevrolet malibu 
## 10 ford      mpv    
## # ℹ 2,489 more rows

When using select(), it’s actually possible to change the names of the columns by using a new_name = old_name pair, like this:

select(us_cars, vintage = vin, "price_in_dollars" = price)
## # A tibble: 2,499 × 2
##    vintage           price_in_dollars
##    <chr>                        <dbl>
##  1 jtezu11f88k007763             6300
##  2 2fmdk3gc4bbb02217             2899
##  3 3c4pdcgg5jt346413             5350
##  4 1ftfw1et4efc23745            25000
##  5 3gcpcrec2jg473991            27700
##  6 2c4rdgeg9jr237989             5700
##  7 1gcsksea1az121133             7300
##  8 1gks2gkc3hr326762            13350
##  9 1g1zd5st5jf191860            14600
## 10 2fmpk3j92hbc12542             5250
## # ℹ 2,489 more rows

If you instead want to drop a particular column, you just preface it with - or !.

select(us_cars, !brand, -model)
## # A tibble: 2,499 × 10
##    price  year title_status  mileage color  vin      lot state country condition
##    <dbl> <dbl> <chr>           <dbl> <chr>  <chr>  <dbl> <chr> <chr>   <chr>    
##  1  6300  2008 clean vehicle  274117 black  jtez… 1.59e8 new … usa     10 days …
##  2  2899  2011 clean vehicle  190552 silver 2fmd… 1.67e8 tenn… usa     6 days l…
##  3  5350  2018 clean vehicle   39590 silver 3c4p… 1.68e8 geor… usa     2 days l…
##  4 25000  2014 clean vehicle   64146 blue   1ftf… 1.68e8 virg… usa     22 hours…
##  5 27700  2018 clean vehicle    6654 red    3gcp… 1.68e8 flor… usa     22 hours…
##  6  5700  2018 clean vehicle   45561 white  2c4r… 1.68e8 texas usa     2 days l…
##  7  7300  2010 clean vehicle  149050 black  1gcs… 1.68e8 geor… usa     22 hours…
##  8 13350  2017 clean vehicle   23525 gray   1gks… 1.68e8 cali… usa     20 hours…
##  9 14600  2018 clean vehicle    9371 silver 1g1z… 1.68e8 flor… usa     22 hours…
## 10  5250  2017 clean vehicle   63418 black  2fmp… 1.68e8 texas usa     2 days l…
## # ℹ 2,489 more rows

Notice that, when you only have negative indexing, the function assumes that you want to keep all of the remaining columns.

It can, however, be tedious to manually select every column, in which case you may use the : operator to specify a range instead.

select(us_cars, title_status:state)
## # A tibble: 2,499 × 6
##    title_status  mileage color  vin                     lot state     
##    <chr>           <dbl> <chr>  <chr>                 <dbl> <chr>     
##  1 clean vehicle  274117 black  jtezu11f88k007763 159348797 new jersey
##  2 clean vehicle  190552 silver 2fmdk3gc4bbb02217 166951262 tennessee 
##  3 clean vehicle   39590 silver 3c4pdcgg5jt346413 167655728 georgia   
##  4 clean vehicle   64146 blue   1ftfw1et4efc23745 167753855 virginia  
##  5 clean vehicle    6654 red    3gcpcrec2jg473991 167763266 florida   
##  6 clean vehicle   45561 white  2c4rdgeg9jr237989 167655771 texas     
##  7 clean vehicle  149050 black  1gcsksea1az121133 167753872 georgia   
##  8 clean vehicle   23525 gray   1gks2gkc3hr326762 167692494 california
##  9 clean vehicle    9371 silver 1g1zd5st5jf191860 167763267 florida   
## 10 clean vehicle   63418 black  2fmpk3j92hbc12542 167656121 texas     
## # ℹ 2,489 more rows

Finally, it can sometimes be useful to match columns by name in some way, for instance if a number of the columns that you want contain a specific word. To be able to do this, dplyr provides a set of helper functions (re-exported from the tidyselect package), such as

  • starts_with(),
  • ends_with(), and
  • contains().

The manual entry for select() contains several examples using these helper functions, as do the individual entries for each function.

Consider taking a little time to play around with select() and its helper functions on the us_cars dataset.
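
To get you started, here is a sketch of the three helpers on a made-up one-row tibble, cars_toy, that reuses some of the column names from us_cars.

```r
library(dplyr)

# Hypothetical one-row tibble that reuses some of the column names in us_cars
cars_toy <- tibble(
  price = 6300, brand = "toyota", title_status = "clean vehicle",
  color = "black", state = "new jersey", country = "usa",
  condition = "10 days left"
)

co_cols <- select(cars_toy, starts_with("co")) # color, country, condition
e_cols <- select(cars_toy, ends_with("e")) # price, state
stat_cols <- select(cars_toy, contains("stat")) # title_status, state

names(co_cols)
names(e_cols)
names(stat_cols)
```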

3.5 Mutating

Frequently when visualizing data you will want to transform that data in some way. Perhaps you’re more interested in a proportion than a number, want to convert a value to a different unit, or just need to change the names of some factor variables. In these cases, mutate() is your best friend.

Let’s start with an example. Say that we want to convert the price of the cars in the us_cars dataset to Swedish kronor instead. Here’s how to do this:

mutate(us_cars, price = price * 8.92)
## # A tibble: 2,499 × 12
##      price brand     model    year title_status mileage color vin      lot state
##      <dbl> <chr>     <chr>   <dbl> <chr>          <dbl> <chr> <chr>  <dbl> <chr>
##  1  56196  toyota    cruiser  2008 clean vehic…  274117 black jtez… 1.59e8 new …
##  2  25859. ford      se       2011 clean vehic…  190552 silv… 2fmd… 1.67e8 tenn…
##  3  47722  dodge     mpv      2018 clean vehic…   39590 silv… 3c4p… 1.68e8 geor…
##  4 223000  ford      door     2014 clean vehic…   64146 blue  1ftf… 1.68e8 virg…
##  5 247084  chevrolet 1500     2018 clean vehic…    6654 red   3gcp… 1.68e8 flor…
##  6  50844  dodge     mpv      2018 clean vehic…   45561 white 2c4r… 1.68e8 texas
##  7  65116  chevrolet pk       2010 clean vehic…  149050 black 1gcs… 1.68e8 geor…
##  8 119082  gmc       door     2017 clean vehic…   23525 gray  1gks… 1.68e8 cali…
##  9 130232  chevrolet malibu   2018 clean vehic…    9371 silv… 1g1z… 1.68e8 flor…
## 10  46830  ford      mpv      2017 clean vehic…   63418 black 2fmp… 1.68e8 texas
## # ℹ 2,489 more rows
## # ℹ 2 more variables: country <chr>, condition <chr>

In this instance we simply overwrote the price variable with a new value, but mutate() can also be used to create new variables.

mutate(us_cars, price_sek = price * 8.92)
## # A tibble: 2,499 × 13
##    price brand model  year title_status mileage color vin      lot state country
##    <dbl> <chr> <chr> <dbl> <chr>          <dbl> <chr> <chr>  <dbl> <chr> <chr>  
##  1  6300 toyo… crui…  2008 clean vehic…  274117 black jtez… 1.59e8 new … usa    
##  2  2899 ford  se     2011 clean vehic…  190552 silv… 2fmd… 1.67e8 tenn… usa    
##  3  5350 dodge mpv    2018 clean vehic…   39590 silv… 3c4p… 1.68e8 geor… usa    
##  4 25000 ford  door   2014 clean vehic…   64146 blue  1ftf… 1.68e8 virg… usa    
##  5 27700 chev… 1500   2018 clean vehic…    6654 red   3gcp… 1.68e8 flor… usa    
##  6  5700 dodge mpv    2018 clean vehic…   45561 white 2c4r… 1.68e8 texas usa    
##  7  7300 chev… pk     2010 clean vehic…  149050 black 1gcs… 1.68e8 geor… usa    
##  8 13350 gmc   door   2017 clean vehic…   23525 gray  1gks… 1.68e8 cali… usa    
##  9 14600 chev… mali…  2018 clean vehic…    9371 silv… 1g1z… 1.68e8 flor… usa    
## 10  5250 ford  mpv    2017 clean vehic…   63418 black 2fmp… 1.68e8 texas usa    
## # ℹ 2,489 more rows
## # ℹ 2 more variables: condition <chr>, price_sek <dbl>

Inside mutate(), you can basically use any operation that would work on a typical vector in R. You can, for instance, convert between different vector types, or use arbitrary functions (as long as they return a new vector).

mutate(
  us_cars,
  year = as.integer(year), # convert year from a double to an integer
  state = toupper(state), # capitalize state names
  price = round(price, -2), # round to hundreds
  brand = fct_lump_prop(brand, 0.1) # lump together infrequent brands
)
## # A tibble: 2,499 × 12
##    price brand model  year title_status mileage color vin      lot state country
##    <dbl> <fct> <chr> <int> <chr>          <dbl> <chr> <chr>  <dbl> <chr> <chr>  
##  1  6300 Other crui…  2008 clean vehic…  274117 black jtez… 1.59e8 NEW … usa    
##  2  2900 ford  se     2011 clean vehic…  190552 silv… 2fmd… 1.67e8 TENN… usa    
##  3  5400 dodge mpv    2018 clean vehic…   39590 silv… 3c4p… 1.68e8 GEOR… usa    
##  4 25000 ford  door   2014 clean vehic…   64146 blue  1ftf… 1.68e8 VIRG… usa    
##  5 27700 chev… 1500   2018 clean vehic…    6654 red   3gcp… 1.68e8 FLOR… usa    
##  6  5700 dodge mpv    2018 clean vehic…   45561 white 2c4r… 1.68e8 TEXAS usa    
##  7  7300 chev… pk     2010 clean vehic…  149050 black 1gcs… 1.68e8 GEOR… usa    
##  8 13400 Other door   2017 clean vehic…   23525 gray  1gks… 1.68e8 CALI… usa    
##  9 14600 chev… mali…  2018 clean vehic…    9371 silv… 1g1z… 1.68e8 FLOR… usa    
## 10  5200 ford  mpv    2017 clean vehic…   63418 black 2fmp… 1.68e8 TEXAS usa    
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>
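
The proportion and unit-conversion cases mentioned at the start of this section follow the same pattern. The sketch below uses made-up numbers in a hypothetical tibble called brand_counts.

```r
library(dplyr)

# Hypothetical counts of cars per brand
brand_counts <- tibble(
  brand = c("ford", "dodge", "nissan"),
  n_cars = c(5, 3, 2),
  mean_mileage = c(52000, 48000, 31000) # in miles
)

brand_shares <- mutate(
  brand_counts,
  share = n_cars / sum(n_cars), # proportion of all cars
  mean_mileage_km = mean_mileage * 1.609344 # miles to kilometres
)

brand_shares
```

Here sum(n_cars) is computed over the whole column, so every row is divided by the same total.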

3.6 Rename

When you produce visualizations, you need to make sure that annotations in the plot, such as the axis titles, are legible. Many datasets, however, have variable (column) names that are not. You can always fix this when you set up your plot, but if you’re producing many visualizations it may become tedious to rename the axes with every plot.

That is why it is often useful to rename your variables during the data wrangling step. This also has the benefit of making your data wrangling steps more readable. There are multiple ways to rename variables with the tidyverse approach. We’ve already seen that you can use select() to rename variables, but if you only want to rename and not select, you should instead use rename(). Just as with select(), you use a new_name = old_name pair to do so. Let’s rename “vin” to “vintage”.

us_cars %>%
  rename(vintage = vin)
## # A tibble: 2,499 × 12
##    price brand     model    year title_status mileage color vintage    lot state
##    <dbl> <chr>     <chr>   <dbl> <chr>          <dbl> <chr> <chr>    <dbl> <chr>
##  1  6300 toyota    cruiser  2008 clean vehic…  274117 black jtezu1… 1.59e8 new …
##  2  2899 ford      se       2011 clean vehic…  190552 silv… 2fmdk3… 1.67e8 tenn…
##  3  5350 dodge     mpv      2018 clean vehic…   39590 silv… 3c4pdc… 1.68e8 geor…
##  4 25000 ford      door     2014 clean vehic…   64146 blue  1ftfw1… 1.68e8 virg…
##  5 27700 chevrolet 1500     2018 clean vehic…    6654 red   3gcpcr… 1.68e8 flor…
##  6  5700 dodge     mpv      2018 clean vehic…   45561 white 2c4rdg… 1.68e8 texas
##  7  7300 chevrolet pk       2010 clean vehic…  149050 black 1gcsks… 1.68e8 geor…
##  8 13350 gmc       door     2017 clean vehic…   23525 gray  1gks2g… 1.68e8 cali…
##  9 14600 chevrolet malibu   2018 clean vehic…    9371 silv… 1g1zd5… 1.68e8 flor…
## 10  5250 ford      mpv      2017 clean vehic…   63418 black 2fmpk3… 1.68e8 texas
## # ℹ 2,489 more rows
## # ℹ 2 more variables: country <chr>, condition <chr>

You can have spaces or special characters such as %, &, and $ in your variable names if you want to (you then need to wrap the names in backticks), but this is bad practice because it can have undesired side effects that you do best to avoid.
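
As a sketch of what this looks like in practice, the made-up tibble messy_names below has two such column names, which we then replace with names that are easier to work with.

```r
library(dplyr)

# Hypothetical tibble with non-syntactic column names
messy_names <- tibble(
  `Price ($)` = c(6300, 2899),
  `Model year` = c(2008, 2011)
)

# Such names must be wrapped in backticks every time they are used
clean_names <- rename(
  messy_names,
  price_usd = `Price ($)`,
  model_year = `Model year`
)

names(clean_names)
```

A reasonable compromise is to keep simple names like these while wrangling and to set the human-readable labels once, when you build the plot.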

3.7 Grouping, Mutating, and Summarizing

Another common operation in data visualization is to group and summarize data. This is useful when you want to visualize a summary statistic rather than the raw data. Before taking this route in data visualization, however, make it a habit to ask yourself whether this aggregation is needed. If you are able to craft a visualization where you showcase all your data without sacrificing the communicative properties of the visualization, then this is always a better alternative.

3.8 Grouping

To group data using the tidyverse methodology, we use group_by() (from dplyr). This is a very simple function. You simply list all the variables (columns) that you want to group by. Let’s group the us_cars dataset by country (USA or Canada).

group_by(us_cars, country)
## # A tibble: 2,499 × 12
## # Groups:   country [2]
##    price brand model  year title_status mileage color vin      lot state country
##    <dbl> <chr> <chr> <dbl> <chr>          <dbl> <chr> <chr>  <dbl> <chr> <chr>  
##  1  6300 toyo… crui…  2008 clean vehic…  274117 black jtez… 1.59e8 new … usa    
##  2  2899 ford  se     2011 clean vehic…  190552 silv… 2fmd… 1.67e8 tenn… usa    
##  3  5350 dodge mpv    2018 clean vehic…   39590 silv… 3c4p… 1.68e8 geor… usa    
##  4 25000 ford  door   2014 clean vehic…   64146 blue  1ftf… 1.68e8 virg… usa    
##  5 27700 chev… 1500   2018 clean vehic…    6654 red   3gcp… 1.68e8 flor… usa    
##  6  5700 dodge mpv    2018 clean vehic…   45561 white 2c4r… 1.68e8 texas usa    
##  7  7300 chev… pk     2010 clean vehic…  149050 black 1gcs… 1.68e8 geor… usa    
##  8 13350 gmc   door   2017 clean vehic…   23525 gray  1gks… 1.68e8 cali… usa    
##  9 14600 chev… mali…  2018 clean vehic…    9371 silv… 1g1z… 1.68e8 flor… usa    
## 10  5250 ford  mpv    2017 clean vehic…   63418 black 2fmp… 1.68e8 texas usa    
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>

Okay, so not much happened! If you look closely, however, you’ll see that this tibble is now grouped—but what does this mean? Well, the truth is that grouping is only useful when combined with functions that modify the data, especially summarize().
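
Grouping also changes how mutate() behaves: summary functions used inside it are evaluated within each group rather than over the whole dataset. The following sketch uses a made-up tibble, cars_toy, to express each price relative to the mean price in its own country.

```r
library(dplyr)

# Hypothetical prices for cars in two countries
cars_toy <- tibble(
  country = c("usa", "usa", "usa", "canada", "canada"),
  price = c(10000, 20000, 30000, 25000, 35000)
)

cars_by_country <- group_by(cars_toy, country)

# mean(price) is now computed separately within each country
cars_relative <- mutate(cars_by_country, price_vs_mean = price / mean(price))

# Remove the grouping again once it is no longer needed
cars_relative <- ungroup(cars_relative)
cars_relative
```

Calling ungroup() at the end is a good habit, since a forgotten grouping silently affects later calls to mutate() and summarize().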

3.9 Summarizing

summarize() takes a dataset (usually a grouped dataset) and a set of functions that take vectors as input and return summary statistics, such as

  • mean(),
  • median(),
  • sum(), or
  • quantile(..., probs = 0.25).

Let’s see how this looks in practice by computing the median and mean prices of cars in the USA and Canada.

us_cars_grouped <- group_by(us_cars, country)

summarize(
  us_cars_grouped,
  median_price = median(price),
  mean_price = mean(price),
  number_of_cars = n() # captures the size of the group
)
## # A tibble: 2 × 4
##   country median_price mean_price number_of_cars
##   <chr>          <dbl>      <dbl>          <int>
## 1 canada         30000     30357.              7
## 2 usa            16894     18735.           2492

Perfect! Note the use of the n() function that counts the number of observations in each group. In this case, it highlights the problem that can occur when we group, summarize, and plot, namely, that we lose information about the size of the groups. It would not be reasonable to draw conclusions about cars from Canada based on only seven observations.
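
The other summary functions from the list above work the same way. Here is a sketch with sum() and quantile() on a made-up tibble, cars_toy; the argument names = FALSE simply tells quantile() to return a plain number without a name attached.

```r
library(dplyr)

# Hypothetical prices: five cars in the USA and three in Canada
cars_toy <- tibble(
  country = c(rep("usa", 5), rep("canada", 3)),
  price = c(1, 2, 3, 4, 5, 10, 20, 30) * 1000
)

price_summary <- summarize(
  group_by(cars_toy, country),
  lower_quartile = quantile(price, probs = 0.25, names = FALSE),
  total_price = sum(price),
  number_of_cars = n() # always worth keeping next to the summaries
)

price_summary
```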

4 Missing Data

Missing data is an issue that is prevalent in much of the real-life data that you may encounter, particularly when it comes to data involving people. Dealing with missing data is an important topic but, as we said in the introduction, it is a topic that lies mostly outside the scope of this course. We do, however, recommend that you always take time to consider the reasons why data may be missing and for the most part make it a habit to include missing data in your visualizations instead of simply removing it.

What you do need to know, however, is that missing data (coded as NA in R) may interfere with your work in R. Let’s say that you’re working with sleep data on mammals using the msleep dataset from ggplot2, and want to compute the mean number of hours in REM sleep:

mean(msleep$sleep_rem)
## [1] NA

The result here, NA, is a consequence of the presence of NA values in the column sleep_rem in msleep and this behavior is actually the default for many functions in R. To deal with this, you have two options: 1) directly remove the observations with NA values from the data set using drop_na() or 2) set na.rm = TRUE in the call to mean(), which excludes the NA observations in that particular function call.

The second option is usually the best choice, since it means that you can still keep the NA values around when you want to produce visualizations, but in this course we’ll sometimes use drop_na() to make working with the datasets a little bit smoother.
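
Both options can be sketched on msleep as follows; the object names mean_rem and msleep_complete are just chosen for this example.

```r
library(ggplot2) # provides the msleep dataset
library(tidyr) # provides drop_na()

# Option 2: keep the data as it is and ignore NA values in this one call
mean_rem <- mean(msleep$sleep_rem, na.rm = TRUE)

# Option 1: drop every row that has a missing value in sleep_rem
msleep_complete <- drop_na(msleep, sleep_rem)

mean_rem
nrow(msleep) - nrow(msleep_complete) # number of rows removed by drop_na()
```

Be aware that calling drop_na() without naming any columns removes every row that has an NA in any column, which can discard far more data than you intended.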

5 Pipes

The pipe operator, %>%, is a staple in data science workflows because it makes data-wrangling code more modular and readable. The pipe operator can be hard to get to grips with at first, but once you do, you will never want to go back.

The pipe operator takes an object on its left-hand side (or line above) and pipes it through to an object (almost exclusively a function) on the right-hand side. The object on the left-hand side enters the function on the right-hand side as its first argument, pushing any arguments that are already there one position back.

Let’s say that we have the following function, my_frac(), which takes arguments x and y and returns x/y.

my_frac <- function(x, y) {
  x / y
}

Now we can use this function on two numbers, like so:

my_frac(2, 5)
## [1] 0.4

If we were to do this with the pipe operator instead, it would look like this:

# use the pipe operator
2 %>% my_frac(5) # equivalent to my_frac(2, 5)
## [1] 0.4
# or we can do
2 %>%
  my_frac(5)
## [1] 0.4

You’re probably thinking that this doesn’t seem very useful at all, given that we’ve now spent more code to accomplish exactly what my_frac() already accomplished very well without the pipe. Where the pipe operator comes into its own, however, is when we need to chain multiple functions.

5.1 Case Study using Pipes

Let’s say that we want to take our us_cars dataset and perform a set of operations, namely:

  1. keep only the cars that are black,
  2. group cars by state,
  3. compute the mean mileage per state,
  4. sort the states by mean mileage, and
  5. plot a simple bar chart of state versus mean mileage.

There are more or less three ways to do this:

  1. use intermediary objects,
  2. use function composition, or
  3. use pipes.

5.2 Intermediate Objects

To solve this by storing temporary objects, here’s what we would do:

black_cars <- filter(us_cars, color == "black")
black_cars_groupedby_states <- group_by(black_cars, state)
black_cars_state_mileage <-
  summarize(black_cars_groupedby_states, mean_mileage = mean(mileage))
arrange(black_cars_state_mileage, mean_mileage)
## # A tibble: 35 × 2
##    state         mean_mileage
##    <chr>                <dbl>
##  1 kentucky             7232 
##  2 arizona             17987.
##  3 washington          23432 
##  4 virginia            31971.
##  5 nevada              35147.
##  6 minnesota           36396.
##  7 pennsylvania        36828.
##  8 west virginia       37832.
##  9 california          39376.
## 10 illinois            40350.
## # ℹ 25 more rows

The largest downside to this approach is that you’re cluttering the workspace with names of intermediary objects that you need to keep track of and name with meaningful names, or alternatively name by incrementing a counter such as cars1, cars2, etc. This latter approach is actually what you’ll often end up doing, and (at least for me) often leads to mixing up the counters and ending up with the wrong results.

Storing intermediate results is sometimes precisely the right way to go about this, especially when you will later use the intermediary result, but in this case it leads to code that is hard to read and manage.

5.3 Function Composition

An alternative is to compose the functions by nesting the calls inside one another. The code above then looks like this.

arrange(
  summarize(
    group_by(
      filter(us_cars, color == "black"),
      state
    ),
    mean_mileage = mean(mileage)
  ),
  mean_mileage
)
## # A tibble: 35 × 2
##    state         mean_mileage
##    <chr>                <dbl>
##  1 kentucky             7232 
##  2 arizona             17987.
##  3 washington          23432 
##  4 virginia            31971.
##  5 nevada              35147.
##  6 minnesota           36396.
##  7 pennsylvania        36828.
##  8 west virginia       37832.
##  9 california          39376.
## 10 illinois            40350.
## # ℹ 25 more rows

Code like this is incredibly hard to manage (and read). Simply keeping track of which argument belongs to which function is strenuous in itself. On top of that, imagine that you want to swap the order of two of these operations or add another step somewhere, and it’s easy to see why this is not an attractive approach.

5.4 Pipes

With pipes, we get the following result:

us_cars %>%
  filter(color == "black") %>%
  group_by(state) %>%
  summarize(mean_mileage = mean(mileage)) %>%
  arrange(mean_mileage)
## # A tibble: 35 × 2
##    state         mean_mileage
##    <chr>                <dbl>
##  1 kentucky             7232 
##  2 arizona             17987.
##  3 washington          23432 
##  4 virginia            31971.
##  5 nevada              35147.
##  6 minnesota           36396.
##  7 pennsylvania        36828.
##  8 west virginia       37832.
##  9 california          39376.
## 10 illinois            40350.
## # ℹ 25 more rows

The code is clean and expressive. If we want to switch the order of the steps or add another, it’s only a matter of adding or removing a row of the code. On top of that, you don’t need to name any intermediate objects (but remember that doing so is sometimes actually desirable).
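
The last step from our list, the bar chart, can be added to the end of the same chain. The sketch below runs on a small made-up tibble, cars_toy, instead of us_cars, and the use of reorder() to sort the bars is just one of several possible ways to do this. Note that the plot layers are still added with +, not with the pipe.

```r
library(dplyr)
library(ggplot2)

# Hypothetical miniature version of us_cars
cars_toy <- tibble(
  state = c("texas", "texas", "ohio", "ohio", "utah"),
  color = c("black", "black", "black", "white", "black"),
  mileage = c(10000, 30000, 50000, 70000, 20000)
)

mileage_plot <- cars_toy %>%
  filter(color == "black") %>%
  group_by(state) %>%
  summarize(mean_mileage = mean(mileage)) %>%
  # the summarized data is piped in as the first argument of ggplot()
  ggplot(aes(x = reorder(state, mean_mileage), y = mean_mileage)) +
  geom_col() +
  labs(x = "State", y = "Mean mileage")

mileage_plot
```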

The tidyverse is designed specifically with the pipe operator in mind. The reason that it works so well with the tidyverse is that virtually all of the main functions in the tidyverse take a data set as the first argument, which means that piping always works. The base R functions and many other packages out there don’t necessarily abide by this principle, however, so make sure you know what your functions are doing when you’re using pipes or you may end up with unexpected (and undesirable) results.
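
When a function does not take the data as its first argument, %>% lets you mark where the left-hand side should go with a dot. The sketch below reuses my_frac() from above and the built-in mtcars dataset; lm() is a base R function whose first argument is a formula, not the data.

```r
library(magrittr) # provides %>%; also available after library(tidyverse)

my_frac <- function(x, y) {
  x / y
}

# By default the left-hand side becomes the first argument
5 %>% my_frac(2) # same as my_frac(5, 2)

# The dot placeholder sends it to another position instead
frac_result <- 5 %>% my_frac(2, .) # same as my_frac(2, 5)
frac_result

# lm() expects a formula first, so the data is passed in via the dot
fit <- mtcars %>% lm(mpg ~ wt, data = .)
```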

Note that the pipe operator that we have covered so far (%>%) is not part of the base R distribution. It comes from the magrittr package and is re-exported by many of the tidyverse packages, so simply calling library(tidyverse) or library(dplyr) will get you access to it.

5.5 The Native Pipe Operator |>

R has actually recently3 implemented its own native pipe operator, which you don’t need any additional packages to use. It is written as |>. So in the last example we covered, you may as well have written the first two lines as the following.

us_cars |>
  filter(color == "black")
## # A tibble: 516 × 12
##    price brand model  year title_status mileage color vin      lot state country
##    <dbl> <chr> <chr> <dbl> <chr>          <dbl> <chr> <chr>  <dbl> <chr> <chr>  
##  1  6300 toyo… crui…  2008 clean vehic…  274117 black jtez… 1.59e8 new … usa    
##  2  7300 chev… pk     2010 clean vehic…  149050 black 1gcs… 1.68e8 geor… usa    
##  3  5250 ford  mpv    2017 clean vehic…   63418 black 2fmp… 1.68e8 texas usa    
##  4 31900 chev… 1500   2018 clean vehic…   22909 black 3gcu… 1.68e8 tenn… usa    
##  5 20700 ford  door   2013 clean vehic…  100757 black 1ftf… 1.68e8 virg… usa    
##  6  7300 kia   forte  2018 clean vehic…   38823 black 3kpf… 1.68e8 nort… usa    
##  7 15000 chev… door   2015 clean vehic…   61578 black 2gnf… 1.68e8 ohio  usa    
##  8 11900 gmc   door   2017 clean vehic…   27965 black 1gks… 1.68e8 cali… usa    
##  9 55000 ford  srw    2017 clean vehic…   15273 black 1ft7… 1.68e8 penn… usa    
## 10 12520 gmc   door   2017 clean vehic…   28972 black 1gks… 1.68e8 cali… usa    
## # ℹ 506 more rows
## # ℹ 1 more variable: condition <chr>

For all purposes that we use it for in this course, you may as well use the native pipe operator if you prefer to. %>% has a bit of additional functionality, but we will not need it in this course.
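
As a minimal sketch that needs no packages at all (only R 4.1 or later), here is the native pipe chaining two base R functions. One difference worth knowing is that the right-hand side of |> must be written as a function call, with parentheses.

```r
# Requires R 4.1 or later; no packages are needed
root_sum <- c(1, 4, 9) |>
  sqrt() |>
  sum()

root_sum # sqrt(1) + sqrt(4) + sqrt(9) = 6

# Leaving out the parentheses, as in `c(1, 4, 9) |> sqrt`, is a syntax error
```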

If you want to read more about pipes or just need a longer introduction, we recommend taking a look at the piping section of the R for Data Science book.

Have fun piping in the tidyverse!

6 Source Code

The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/cars.Rmd.


  1. And this might very well come in handy for your project, where you will choose your data yourself.

  2. There is a read_csv2() function as well, which is for semicolon-separated data.

  3. Since version 4.1.