If you haven’t done so yet, you need to install the tidyverse collection of packages. To do so, simply call the following line. (Note that this will install several R-packages and take a lot of time if you haven’t already installed these packages.)
If you already have tidyverse installed, make sure that all of your packages are up to date, either by calling
or by going to the Packages tab in the lower-right RStudio pane and clicking on Update.
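As a sketch of the two calls referred to above: install.packages() is the standard way to install a package from CRAN, while update.packages() is one base R way to update what is already installed. Which update call the course intended is an assumption here.

```r
# Install the tidyverse collection from CRAN (only needed once)
install.packages("tidyverse")

# One possible way to update installed packages (assumed, not taken from the course);
# ask = FALSE skips the per-package confirmation prompt
update.packages(ask = FALSE)
```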
R is extremely versatile when it comes to handling data of different formats and types. Much of this functionality is made available via R packages, which let you import data into R from Microsoft Excel, Stata, SAS, or SPSS files in addition to standard file types such as comma-separated files (.csv) and tab-separated files (.tsv). In this course, most of the data sets we use will be available directly through R and R packages, but knowing how to import data directly is a useful skill.
First we need to find some data to import. Download the US Cars dataset that we have provided in the git repository for the course. The data is available here (this link may not work if you’re browsing this page from inside Canvas). Alternatively, you can call the following lines of code to download the dataset.
# check that we have a data folder in our working directory, and if not create
# one
if (!dir.exists("data")) {
  dir.create("data")
}
# download the dataset to data/us_cars.csv
download.file(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/us_cars.csv",
  file.path("data", "us_cars.csv"),
  mode = "wb"
)
Now we load this dataset into R via the readr package (part of tidyverse). This is a comma-separated file (note the file extension), so we use the read_csv() function.
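A minimal sketch of the import, assuming the file was saved to data/us_cars.csv as in the download step. The object name us_cars is the one used in the code further down.

```r
library(readr)

# Read the comma-separated file into a tibble;
# readr prints a summary of the guessed column types
us_cars <- read_csv("data/us_cars.csv")
```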
When you call read_csv(), the readr package helpfully prints a message to your console with information about how the columns in the dataset were formatted when you imported the data. Take a look at this information: does everything look okay?
An alternative to this approach is to use readr to read data directly from a URL into R. To do so, you simply use the URL instead of the file name in the call to read_csv(), like this.
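A sketch of such a call, reusing the URL from the download step (this needs an internet connection):

```r
library(readr)

# Pass the URL where the file name would otherwise go
us_cars <- read_csv(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/us_cars.csv"
)
```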
Whenever you start working with new data, always start by taking a look at the data in raw form. The best first step is usually just to print the data to the console or to call head() (to see the first few lines).
## # A tibble: 2,499 × 12
## price brand model year title_status mileage color vin lot state country
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
## 2 2899 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn… usa
## 3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor… usa
## 4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg… usa
## 5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor… usa
## 6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas usa
## 7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
## 8 13350 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali… usa
## 9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor… usa
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>
If you want to see more rows of the data than are printed by default, call print() with the n argument set to the number of rows you want, or use head() in the same way.
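For instance, assuming the data sits in data/us_cars.csv, the following sketch prints 20 rows and then keeps only the first 3. Both row counts are arbitrary choices for illustration.

```r
library(readr)

us_cars <- read_csv("data/us_cars.csv", show_col_types = FALSE)

# Print the first 20 rows instead of the default 10
print(us_cars, n = 20)

# head() returns a shorter tibble with only the first n rows
first_rows <- head(us_cars, n = 3)
first_rows
```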
Another useful function, particularly when there are many columns in the dataset, is glimpse().
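A sketch of the call, again assuming the data sits in data/us_cars.csv:

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# One line per column: name, type, and the first few values
glimpse(us_cars)
```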
## Rows: 2,499
## Columns: 12
## $ price <dbl> 6300, 2899, 5350, 25000, 27700, 5700, 7300, 13350, 14600,…
## $ brand <chr> "toyota", "ford", "dodge", "ford", "chevrolet", "dodge", …
## $ model <chr> "cruiser", "se", "mpv", "door", "1500", "mpv", "pk", "doo…
## $ year <dbl> 2008, 2011, 2018, 2014, 2018, 2018, 2010, 2017, 2018, 201…
## $ title_status <chr> "clean vehicle", "clean vehicle", "clean vehicle", "clean…
## $ mileage <dbl> 274117, 190552, 39590, 64146, 6654, 45561, 149050, 23525,…
## $ color <chr> "black", "silver", "silver", "blue", "red", "white", "bla…
## $ vin <chr> "jtezu11f88k007763", "2fmdk3gc4bbb02217", "3c4pdcgg5jt346…
## $ lot <dbl> 159348797, 166951262, 167655728, 167753855, 167763266, 16…
## $ state <chr> "new jersey", "tennessee", "georgia", "virginia", "florid…
## $ country <chr> "usa", "usa", "usa", "usa", "usa", "usa", "usa", "usa", "…
## $ condition <chr> "10 days left", "6 days left", "2 days left", "22 hours l…
One of the most important data wrangling tools when it comes to data visualization is pivoting, which can be used to transform messy data into tidy data, which in turn will make visualization much easier for us.
Let’s see what this can entail by looking at a simple dataset, table2 from the tidyr package, which contains information on tuberculosis cases in a few countries in 1999 and 2000.
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
This data is not tidy—but why not? Take a moment to consider this for yourself before moving on.
Did you figure it out? The problem is that cases and population are two different variables but don’t have their own separate columns. To fix this, we need to reshape the data by pivoting it to a wider form using pivot_wider().
Before you continue, take a look at the documentation for the function by calling help(pivot_wider) (or ?pivot_wider) in R and see if you can make sense of the manual entry for the function.
The function has many arguments but we only need to concern ourselves with data, names_from, and values_from right now. names_from should indicate which column stores the names of the variables we want to pivot, while values_from should indicate which column stores those variables’ values. Putting this together, we get the following.
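A sketch of the call that matches this description. In table2 the variable names are stored in type and their values in count; the result is stored as table2_tidy, the name used later in the text.

```r
library(tidyr)

# type holds the variable names, count holds their values
table2_tidy <- pivot_wider(table2, names_from = type, values_from = count)
table2_tidy
```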
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Now it’s easy to visualize this data!
Figure 1: Tuberculosis cases in Afghanistan, Brazil, and China in 1999 and 2000
Sometimes it’s necessary to pivot in the other direction: from wide to long, in which case we need to use pivot_longer() instead. pivot_longer() is the inverse of pivot_wider(). Just like you did for pivot_wider(), read the documentation for pivot_longer() and then see if you can figure out by yourself how to pivot table2_tidy back into table2.
When we use pivot_longer() we need to tell R which columns we want to stack, which we do through the cols argument. Here we simply specify the names of the columns (cases and population), and then use names_to and values_to with whatever names we think are appropriate.
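One way to write this, as a sketch: table2_tidy is recreated first so that the snippet runs on its own, and the result name table2_long is a hypothetical choice. The names "type" and "count" are picked to match the original table2.

```r
library(tidyr)

# Wide version of table2, as produced by pivot_wider()
table2_tidy <- pivot_wider(table2, names_from = type, values_from = count)

# Stack cases and population back into a name column and a value column
table2_long <- pivot_longer(
  table2_tidy,
  cols = c(cases, population),
  names_to = "type",
  values_to = "count"
)
table2_long
```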
In the tidyverse vocabulary, filtering refers to the process of selecting a subset of rows (observations if the data is tidy) from your dataset, whereas selecting means selecting a subset of columns (variables if the data is tidy). We use the aptly named filter() and select() respectively for these tasks.
filter() is used on a dataset (tidyverse functions almost always take the data as their first argument) together with a number of logical expressions that specify which rows to keep in the dataset. Recall that a logical expression is a binary expression that relates the left-hand side to the right-hand side in some way, for instance to check equality or inequality, like so:
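As an illustration, the hypothetical vector x below (not taken from the course material) is compared element by element against a single number. Each comparison returns one TRUE or FALSE per element.

```r
x <- c(1, 2, 3)  # hypothetical example vector

x > 2   # is each element greater than 2?
x == 2  # is each element equal to 2?
x != 2  # is each element different from 2?
```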
## [1] FALSE FALSE TRUE
## [1] FALSE TRUE FALSE
## [1] TRUE FALSE TRUE
Let’s say that we only wanted to look at cars from Tennessee from years 2015 and onward. Then we can use filter() in the following way.
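A sketch of such a call, assuming the data sits in data/us_cars.csv. State names are stored in lower case in this dataset, and tn_recent is a hypothetical name for the result.

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# Conditions separated by commas must all hold for a row to be kept
tn_recent <- filter(us_cars, state == "tennessee", year >= 2015)
tn_recent
```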
## # A tibble: 17 × 12
## price brand model year title_status mileage color vin lot state country
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 31900 chev… 1500 2018 clean vehic… 22909 black 3gcu… 1.68e8 tenn… usa
## 2 16500 buick enco… 2018 clean vehic… 20002 red kl4c… 1.68e8 tenn… usa
## 3 19200 chev… 1500 2018 clean vehic… 2430 white 1gcn… 1.68e8 tenn… usa
## 4 30500 chev… 1500 2018 clean vehic… 30442 red 3gcu… 1.68e8 tenn… usa
## 5 19000 chev… mali… 2017 clean vehic… 18414 white 1g1z… 1.68e8 tenn… usa
## 6 27000 buick encl… 2017 clean vehic… 32107 no_c… 5gak… 1.68e8 tenn… usa
## 7 29400 bmw x3 2017 clean vehic… 23765 black 5uxw… 1.68e8 tenn… usa
## 8 8500 ford fusi… 2017 clean vehic… 78739 black 3fa6… 1.68e8 tenn… usa
## 9 30300 ford f-150 2019 clean vehic… 31899 white 1fte… 1.68e8 tenn… usa
## 10 39200 ford max 2019 clean vehic… 43617 black 1fmj… 1.68e8 tenn… usa
## 11 41200 ford srw 2019 clean vehic… 25638 gray 1ft7… 1.68e8 tenn… usa
## 12 29000 ford door 2017 clean vehic… 42291 black 1fte… 1.68e8 tenn… usa
## 13 14500 niss… rogue 2017 clean vehic… 8677 blue knma… 1.68e8 tenn… usa
## 14 18200 niss… sport 2018 clean vehic… 28009 black jn1b… 1.68e8 tenn… usa
## 15 10900 niss… sent… 2018 clean vehic… 28880 silv… 3n1a… 1.68e8 tenn… usa
## 16 12600 niss… sent… 2017 clean vehic… 11837 gray 3n1a… 1.68e8 tenn… usa
## 17 12700 niss… door 2017 clean vehic… 17738 black 1n4a… 1.68e8 tenn… usa
## # ℹ 1 more variable: condition <chr>
If you think of filtering as slicing a dataset horizontally, select() does the opposite, slicing a dataset vertically, by selecting a subset of the columns. The interface is similar to filter()’s. You begin with the data and then through various arguments select which columns you want to keep.
There is a plethora of possibilities when it comes to select(). We’ll try to cover some of the most common ones here, but see the documentation for select() if you want to learn more.
The simplest option is to list the columns you want to keep.
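For example, keeping only brand and model could be sketched like this (assuming the data sits in data/us_cars.csv; brand_model is a hypothetical name):

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# List the columns to keep, in the order you want them
brand_model <- select(us_cars, brand, model)
brand_model
```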
## # A tibble: 2,499 × 2
## brand model
## <chr> <chr>
## 1 toyota cruiser
## 2 ford se
## 3 dodge mpv
## 4 ford door
## 5 chevrolet 1500
## 6 dodge mpv
## 7 chevrolet pk
## 8 gmc door
## 9 chevrolet malibu
## 10 ford mpv
## # ℹ 2,489 more rows
When using select(), it’s actually possible to change the names of the columns by using a new_name = old_name pair, like this:
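A sketch that selects vin and price under new names, assuming the data sits in data/us_cars.csv (renamed is a hypothetical name for the result):

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# The new name goes on the left, the existing column on the right
renamed <- select(us_cars, vintage = vin, price_in_dollars = price)
renamed
```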
## # A tibble: 2,499 × 2
## vintage price_in_dollars
## <chr> <dbl>
## 1 jtezu11f88k007763 6300
## 2 2fmdk3gc4bbb02217 2899
## 3 3c4pdcgg5jt346413 5350
## 4 1ftfw1et4efc23745 25000
## 5 3gcpcrec2jg473991 27700
## 6 2c4rdgeg9jr237989 5700
## 7 1gcsksea1az121133 7300
## 8 1gks2gkc3hr326762 13350
## 9 1g1zd5st5jf191860 14600
## 10 2fmpk3j92hbc12542 5250
## # ℹ 2,489 more rows
If you instead want to drop a particular column, you just preface it with - or !.
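As a sketch, dropping brand and model can be written in either style (assuming the data sits in data/us_cars.csv; the result names are hypothetical):

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# Drop columns one by one with a minus sign ...
dropped_minus <- select(us_cars, -brand, -model)

# ... or negate a whole set of columns with !
dropped_bang <- select(us_cars, !c(brand, model))

dropped_minus
```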
## # A tibble: 2,499 × 10
## price year title_status mileage color vin lot state country condition
## <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 6300 2008 clean vehicle 274117 black jtez… 1.59e8 new … usa 10 days …
## 2 2899 2011 clean vehicle 190552 silver 2fmd… 1.67e8 tenn… usa 6 days l…
## 3 5350 2018 clean vehicle 39590 silver 3c4p… 1.68e8 geor… usa 2 days l…
## 4 25000 2014 clean vehicle 64146 blue 1ftf… 1.68e8 virg… usa 22 hours…
## 5 27700 2018 clean vehicle 6654 red 3gcp… 1.68e8 flor… usa 22 hours…
## 6 5700 2018 clean vehicle 45561 white 2c4r… 1.68e8 texas usa 2 days l…
## 7 7300 2010 clean vehicle 149050 black 1gcs… 1.68e8 geor… usa 22 hours…
## 8 13350 2017 clean vehicle 23525 gray 1gks… 1.68e8 cali… usa 20 hours…
## 9 14600 2018 clean vehicle 9371 silver 1g1z… 1.68e8 flor… usa 22 hours…
## 10 5250 2017 clean vehicle 63418 black 2fmp… 1.68e8 texas usa 2 days l…
## # ℹ 2,489 more rows
Notice that, when you only have negative indexing, the function assumes that you want to keep all of the remaining columns.
It can, however, be tedious to manually select every column, in which case you may use the : operator to specify a range instead.
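A sketch of a range selection, assuming the data sits in data/us_cars.csv (status_to_state is a hypothetical name):

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# Keep every column from title_status through state, in dataset order
status_to_state <- select(us_cars, title_status:state)
status_to_state
```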
## # A tibble: 2,499 × 6
## title_status mileage color vin lot state
## <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 clean vehicle 274117 black jtezu11f88k007763 159348797 new jersey
## 2 clean vehicle 190552 silver 2fmdk3gc4bbb02217 166951262 tennessee
## 3 clean vehicle 39590 silver 3c4pdcgg5jt346413 167655728 georgia
## 4 clean vehicle 64146 blue 1ftfw1et4efc23745 167753855 virginia
## 5 clean vehicle 6654 red 3gcpcrec2jg473991 167763266 florida
## 6 clean vehicle 45561 white 2c4rdgeg9jr237989 167655771 texas
## 7 clean vehicle 149050 black 1gcsksea1az121133 167753872 georgia
## 8 clean vehicle 23525 gray 1gks2gkc3hr326762 167692494 california
## 9 clean vehicle 9371 silver 1g1zd5st5jf191860 167763267 florida
## 10 clean vehicle 63418 black 2fmpk3j92hbc12542 167656121 texas
## # ℹ 2,489 more rows
Finally, it can sometimes be useful to match columns by name in some way, for instance if a number of the columns that you want contain a specific word. To be able to do this, the tidyselect package (re-exported by dplyr and tidyr) provides a set of helper functions, such as starts_with(), ends_with(), and contains(). The manual entry for select() contains several examples using these helper functions, as do the individual entries for each function.
Consider taking a little time to play around with select() and its helper functions on the us_cars dataset.
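As a starting point, here is a small sketch of the three helpers on us_cars, assuming the data sits in data/us_cars.csv. The match strings are arbitrary examples and the result names are hypothetical.

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

co_cols   <- select(us_cars, starts_with("co"))  # color, country, condition
e_cols    <- select(us_cars, ends_with("e"))     # price, mileage, state
mile_cols <- select(us_cars, contains("mile"))   # mileage

names(co_cols)
```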
Frequently when visualizing data you will want to transform that data in some way. Perhaps you’re more interested in the proportion than a number, want to convert a value to a different unit, or you just need to change the names of some factor variables. In these cases, mutate() is your best friend.
Let’s start by example. Say that we want to convert the price of the cars in the us_cars dataset to Swedish kronor instead. Here’s how to do this:
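A sketch of the conversion, assuming the data sits in data/us_cars.csv. The factor 8.92 is inferred from the printed output (6300 becomes 56196) and is an assumption, not a stated exchange rate. The name cars_sek is hypothetical.

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# Overwrite price with its value in SEK (assumed rate: 8.92 SEK per USD)
cars_sek <- mutate(us_cars, price = price * 8.92)
cars_sek
```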
## # A tibble: 2,499 × 12
## price brand model year title_status mileage color vin lot state
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 56196 toyota cruiser 2008 clean vehic… 274117 black jtez… 1.59e8 new …
## 2 25859. ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn…
## 3 47722 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor…
## 4 223000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg…
## 5 247084 chevrolet 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor…
## 6 50844 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas
## 7 65116 chevrolet pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor…
## 8 119082 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali…
## 9 130232 chevrolet malibu 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor…
## 10 46830 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas
## # ℹ 2,489 more rows
## # ℹ 2 more variables: country <chr>, condition <chr>
In this instance we simply overwrote the price variable with a new value, but mutate() can also be used to create new variables.
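A sketch that keeps price and adds a new column price_sek, assuming the data sits in data/us_cars.csv. The rate 8.92 is an assumption and cars_both is a hypothetical name.

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# A name that does not exist yet creates a new column at the end
cars_both <- mutate(us_cars, price_sek = price * 8.92)
cars_both
```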
## # A tibble: 2,499 × 13
## price brand model year title_status mileage color vin lot state country
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
## 2 2899 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn… usa
## 3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor… usa
## 4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg… usa
## 5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor… usa
## 6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas usa
## 7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
## 8 13350 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali… usa
## 9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor… usa
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## # ℹ 2,489 more rows
## # ℹ 2 more variables: condition <chr>, price_sek <dbl>
Inside mutate(), you can basically use any operation that would work on a typical vector in R. You can, for instance, convert between different vector types, or use arbitrary functions (as long as they return a new vector).
mutate(
  us_cars,
  year = as.integer(year), # convert year from a double to an integer
  state = toupper(state), # convert state names to upper case
  price = round(price, -2), # round to hundreds
  brand = fct_lump_prop(brand, 0.1) # lump together infrequent brands
)
## # A tibble: 2,499 × 12
## price brand model year title_status mileage color vin lot state country
## <dbl> <fct> <chr> <int> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6300 Other crui… 2008 clean vehic… 274117 black jtez… 1.59e8 NEW … usa
## 2 2900 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 TENN… usa
## 3 5400 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 GEOR… usa
## 4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 VIRG… usa
## 5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 FLOR… usa
## 6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 TEXAS usa
## 7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 GEOR… usa
## 8 13400 Other door 2017 clean vehic… 23525 gray 1gks… 1.68e8 CALI… usa
## 9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 FLOR… usa
## 10 5200 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 TEXAS usa
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>
When you produce visualizations, you need to make sure that annotation in the plot, such as the axis titles, is legible. Many datasets, however, have variable (column) names that are not. You can always fix this when you set up your plot, but if you’re producing many visualizations it may become tedious to rename the axes with every plot.
That is why it is often useful to rename your variables during the data wrangling step. This also has the benefit of making your data wrangling steps more readable. There are multiple ways to rename variables with the tidyverse approach. We’ve already seen that you can use select() to rename variables, but if you only want to rename and not select, you should instead use rename().
Just as with select(), you use a new_name = old_name pair to do so. Let’s rename “vin” to “vintage”.
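A sketch of the call, assuming the data sits in data/us_cars.csv (cars_renamed is a hypothetical name):

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# All columns are kept; only vin gets a new name
cars_renamed <- rename(us_cars, vintage = vin)
cars_renamed
```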
## # A tibble: 2,499 × 12
## price brand model year title_status mileage color vintage lot state
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 6300 toyota cruiser 2008 clean vehic… 274117 black jtezu1… 1.59e8 new …
## 2 2899 ford se 2011 clean vehic… 190552 silv… 2fmdk3… 1.67e8 tenn…
## 3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4pdc… 1.68e8 geor…
## 4 25000 ford door 2014 clean vehic… 64146 blue 1ftfw1… 1.68e8 virg…
## 5 27700 chevrolet 1500 2018 clean vehic… 6654 red 3gcpcr… 1.68e8 flor…
## 6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4rdg… 1.68e8 texas
## 7 7300 chevrolet pk 2010 clean vehic… 149050 black 1gcsks… 1.68e8 geor…
## 8 13350 gmc door 2017 clean vehic… 23525 gray 1gks2g… 1.68e8 cali…
## 9 14600 chevrolet malibu 2018 clean vehic… 9371 silv… 1g1zd5… 1.68e8 flor…
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmpk3… 1.68e8 texas
## # ℹ 2,489 more rows
## # ℹ 2 more variables: country <chr>, condition <chr>
You can have spaces in your variable names if you want to, but this is bad practice: spaces and special characters such as %, &, and $ in variable names can have undesired side effects that you would do best to avoid.
Another common operation in data visualization is to group and summarize data. This is useful when you want to visualize a summary statistic rather than the raw data. Before taking this route in data visualization, however, make it a habit to ask yourself whether this aggregation is needed. If you are able to craft a visualization where you showcase all your data without sacrificing the communicative properties of the visualization, then this is always a better alternative.
To group data using the tidyverse methodology, we use group_by() (from dplyr). This is a very simple function. You simply list all the variables (columns) that you want to group by. Let’s group the us_cars dataset by country (USA or Canada).
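A sketch of the call, assuming the data sits in data/us_cars.csv. The name us_cars_grouped is the one used in the summarize() code further down.

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# The data itself is unchanged; the tibble only records the grouping
us_cars_grouped <- group_by(us_cars, country)
us_cars_grouped
```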
## # A tibble: 2,499 × 12
## # Groups: country [2]
## price brand model year title_status mileage color vin lot state country
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
## 2 2899 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn… usa
## 3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor… usa
## 4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg… usa
## 5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor… usa
## 6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas usa
## 7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
## 8 13350 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali… usa
## 9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor… usa
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>
Okay, so not much happened! If you look closely, however, you’ll see that this tibble is now grouped. But what does this mean? Well, the truth is that grouping is only useful when combined with functions that modify the data, especially summarize().
summarize() takes a dataset (usually a grouped dataset) and a set of functions that take vectors as input and return summary statistics, such as mean(), median(), sum(), or quantile(..., probs = 0.25). Let’s see how this looks in practice by computing the median and mean prices of cars in the USA and Canada.
us_cars_grouped <- group_by(us_cars, country)
summarize(
  us_cars_grouped,
  median_price = median(price),
  mean_price = mean(price),
  number_of_cars = n() # captures the size of the group
)
## # A tibble: 2 × 4
## country median_price mean_price number_of_cars
## <chr> <dbl> <dbl> <int>
## 1 canada 30000 30357. 7
## 2 usa 16894 18735. 2492
Perfect! Note the use of the n() function that counts the number of observations in each group. In this case, it highlights the problem that can occur when we group, summarize, and plot, namely, that we lose information about the size of the groups. It would not be reasonable to draw conclusions about cars from Canada based on only seven observations.
Missing data is an issue that is prevalent in much of the real-life data that you may encounter, particularly when it comes to data involving people. Dealing with missing data is an important topic but, as we said in the introduction, it is a topic that lies mostly outside the scope of this course. We do, however, recommend that you always take time to consider the reasons why data may be missing and for the most part make it a habit to include missing data in your visualizations instead of simply removing it.
What you do need to know, however, is that missing data (coded as NA in R) may interfere with your work in R. Let’s say that you’re working with sleep data on mammals using the msleep dataset from ggplot2, and want to compute the mean number of hours in REM sleep:
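A sketch of that computation; msleep becomes available once ggplot2 is loaded, and rem_mean is a hypothetical name.

```r
library(ggplot2)  # provides the msleep dataset

# sleep_rem contains missing values, so the mean is missing too
rem_mean <- mean(msleep$sleep_rem)
rem_mean
```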
## [1] NA
The result here, NA, is a consequence of the presence of NA values in the column sleep_rem in msleep, and this behavior is actually the default for many functions in R. To deal with this, you have two options: 1) directly remove the observations with NA values from the data set using drop_na() or 2) set na.rm = TRUE in the call to mean(), which excludes the NA observations in that particular function call.
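Both options can be sketched as follows. The names msleep_complete, mean_dropped, and mean_narm are hypothetical, and drop_na() is restricted to sleep_rem here so that rows missing other values are kept.

```r
library(ggplot2)  # provides the msleep dataset
library(tidyr)    # provides drop_na()

# Option 1: remove rows where sleep_rem is missing, then compute the mean
msleep_complete <- drop_na(msleep, sleep_rem)
mean_dropped <- mean(msleep_complete$sleep_rem)

# Option 2: keep the data as is and ignore NA values in this call only
mean_narm <- mean(msleep$sleep_rem, na.rm = TRUE)

mean_narm
```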
The second option is usually the best choice, since it means that you can still keep the NA values around when you want to produce visualizations, but in this course we’ll sometimes use drop_na() to make working with the datasets a little bit smoother.
The pipe operator, %>%, is a staple in data science workflows because it makes the process of managing data much more manageable, modular, and readable. The pipe operator can be hard to get to grips with at first, but once you do, you will never want to go back.
The pipe operator takes an object on its left-hand side (or line above) and pipes it through to an object (almost exclusively a function) on the right-hand side. The object on the left-hand side will enter the function on the right-hand side as the first argument of that function, replacing (pushing back) any other argument that’s already there.
Let’s say that we have the following function, my_frac(), which takes arguments x and y and returns x/y.
Now we can use this function on two numbers, like so:
## [1] 0.4
If we were to do this with the pipe operator instead, it would look like this:
## [1] 0.4
## [1] 0.4
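The steps above can be sketched as follows. The numbers 2 and 5 are assumptions chosen to reproduce the result 0.4, and the two piped variants are just one plausible pair of calls.

```r
library(magrittr)  # provides %>%

# Divide x by y
my_frac <- function(x, y) {
  x / y
}

# Ordinary call
my_frac(2, 5)

# Piped: the left-hand side becomes the first argument, x
2 %>% my_frac(5)

# Piped with x given by name: the left-hand side is pushed back to y
5 %>% my_frac(x = 2)
```

All three calls evaluate 2 / 5.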
You’re probably thinking that this doesn’t seem very useful at all, given that we’ve now spent more code to accomplish exactly what my_frac() already accomplished very well without the pipe. Where the pipe operator comes into its own, however, is when we need to chain multiple functions.
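Let’s say that we want to take our us_cars dataset and perform a set of operations, namely: keep only the black cars, group them by state, compute the mean mileage in each state, and sort the result by mean mileage. There are more or less three ways to do this: store intermediate results as temporary objects, nest the function calls, or use pipes.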
To solve this by storing temporary objects, here’s what we would do:
black_cars <- filter(us_cars, color == "black")
black_cars_groupedby_states <- group_by(black_cars, state)
black_cars_state_mileage <-
  summarize(black_cars_groupedby_states, mean_mileage = mean(mileage))
arrange(black_cars_state_mileage, mean_mileage)
## # A tibble: 35 × 2
## state mean_mileage
## <chr> <dbl>
## 1 kentucky 7232
## 2 arizona 17987.
## 3 washington 23432
## 4 virginia 31971.
## 5 nevada 35147.
## 6 minnesota 36396.
## 7 pennsylvania 36828.
## 8 west virginia 37832.
## 9 california 39376.
## 10 illinois 40350.
## # ℹ 25 more rows
The largest downside to this approach is that you’re cluttering the workspace with intermediary objects that you need to keep track of and give meaningful names, or alternatively name by incrementing a counter such as cars1, cars2, etc. This latter approach is actually what you’ll often end up doing, and (at least for me) it often leads to mixing up the counters and ending up with the wrong results.
Storing intermediate results is sometimes precisely the right way to go about this, especially when you will later use the intermediary result, but in this case it leads to code that is hard to read and manage.
An alternative is to nest the function calls inside one another (function composition). The code above then looks like this.
arrange(
  summarize(
    group_by(
      filter(us_cars, color == "black"),
      state
    ),
    mean_mileage = mean(mileage)
  ),
  mean_mileage
)
## # A tibble: 35 × 2
## state mean_mileage
## <chr> <dbl>
## 1 kentucky 7232
## 2 arizona 17987.
## 3 washington 23432
## 4 virginia 31971.
## 5 nevada 35147.
## 6 minnesota 36396.
## 7 pennsylvania 36828.
## 8 west virginia 37832.
## 9 california 39376.
## 10 illinois 40350.
## # ℹ 25 more rows
Code like this is incredibly hard to manage (and read). Simply keeping track of which argument belongs to which function is strenuous in itself. On top of that, imagine that you want to swap the order of two of these operations or add another step somewhere, and it’s easy to see why this is not an attractive approach.
With pipes, we get the following result:
us_cars %>%
  filter(color == "black") %>%
  group_by(state) %>%
  summarize(mean_mileage = mean(mileage)) %>%
  arrange(mean_mileage)
## # A tibble: 35 × 2
## state mean_mileage
## <chr> <dbl>
## 1 kentucky 7232
## 2 arizona 17987.
## 3 washington 23432
## 4 virginia 31971.
## 5 nevada 35147.
## 6 minnesota 36396.
## 7 pennsylvania 36828.
## 8 west virginia 37832.
## 9 california 39376.
## 10 illinois 40350.
## # ℹ 25 more rows
The code is clean and expressive. If we want to switch the order of the steps or add another, it’s only a matter of adding or removing a row of the code. On top of that, you don’t need to name any intermediate objects (but remember that doing so is sometimes actually desirable).
The tidyverse is designed specifically with the pipe operator in mind. The reason that it works so well with the tidyverse is that virtually all of the main functions in the tidyverse take a data set as the first argument, which means that piping always works. The base R functions and many other packages out there don’t necessarily abide by this principle, however, so make sure you know what your functions are doing when you’re using pipes or you may end up with unexpected (and undesirable) results.
Note that the pipe operator that we have covered so far (%>%) is not part of the base R distribution. It comes from the magrittr package and is re-exported by tidyverse packages such as dplyr, so simply calling library(tidyverse) or library(dplyr) will get you access to it.
|>
R has, as of version 4.1.0, its own native pipe operator, which you don’t need any additional packages to use. It is written as |>. So in the last example we covered, you could just as well have written the first two lines as the following.
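A sketch of those two lines with the native pipe, assuming R 4.1.0 or later and the data in data/us_cars.csv (black_cars is the name used for this subset earlier):

```r
library(dplyr)

us_cars <- readr::read_csv("data/us_cars.csv", show_col_types = FALSE)

# Same as us_cars %>% filter(color == "black"), using the native pipe
black_cars <- us_cars |>
  filter(color == "black")
black_cars
```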
## # A tibble: 516 × 12
## price brand model year title_status mileage color vin lot state country
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
## 1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
## 2 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
## 3 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## 4 31900 chev… 1500 2018 clean vehic… 22909 black 3gcu… 1.68e8 tenn… usa
## 5 20700 ford door 2013 clean vehic… 100757 black 1ftf… 1.68e8 virg… usa
## 6 7300 kia forte 2018 clean vehic… 38823 black 3kpf… 1.68e8 nort… usa
## 7 15000 chev… door 2015 clean vehic… 61578 black 2gnf… 1.68e8 ohio usa
## 8 11900 gmc door 2017 clean vehic… 27965 black 1gks… 1.68e8 cali… usa
## 9 55000 ford srw 2017 clean vehic… 15273 black 1ft7… 1.68e8 penn… usa
## 10 12520 gmc door 2017 clean vehic… 28972 black 1gks… 1.68e8 cali… usa
## # ℹ 506 more rows
## # ℹ 1 more variable: condition <chr>
For all the purposes that we use it for in this course, you may as well use the native pipe operator if you prefer. %>% has a bit of additional functionality, but we will not need it in this course.
If you want to read more about pipes or just need a longer introduction, we recommend taking a look at the piping section of the R for Data Science book.
Have fun piping in the tidyverse!
The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/cars.Rmd.