1 Arthritis

In this worked example, we will be working with data from a treatment trial for Arthritis. The data comes from the vcd package. To begin, let’s load the package and take a glimpse() at the data.

# install.packages("vcd") # if you haven't installed it yet
library(vcd)
library(tidyverse)

glimpse(Arthritis)
## Rows: 84
## Columns: 5
## $ ID        <int> 57, 46, 77, 17, 36, 23, 75, 39, 33, 55, 30, 5, 63, 83, 66, 4…
## $ Treatment <fct> Treated, Treated, Treated, Treated, Treated, Treated, Treate…
## $ Sex       <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, …
## $ Age       <int> 27, 29, 30, 32, 46, 58, 59, 59, 63, 63, 64, 64, 69, 70, 23, …
## $ Improved  <ord> Some, None, None, Marked, Marked, Marked, None, Marked, None…

Notice that Improved is helpfully formatted as an ordinal (ordered) variable already, which is going to help ggplot to choose the correct color palettes for us.

2 Bar Charts

Let’s start with a simple bar chart to see how many people were allocated to the two treatment arms.

library(tidyverse)

ggplot(Arthritis, aes(Treatment)) +
  geom_bar()
Treatment allocation for the arthritis study.

Figure 1: Treatment allocation for the arthritis study.

2.1 Was the Treatment Effective?

Now for something more interesting. Let’s try to look at how well the treatment seems to have worked. We start with a stacked bar chart.

ggplot(Arthritis, aes(Treatment, fill = Improved)) +
  geom_bar(color = "black") +
  labs(fill = "Improvement") # add a better legend title
Treatment effect in the arthritis study.

Figure 2: Treatment effect in the arthritis study.

It’s immediately clear that the treatment has worked well! The count of people with a improvement is much higher in the treatment arm compared with the placebo arm. But how do the treatments compare in terms of people who’ve made some improvement? This is difficult to say from our approach here. This is due to two reasons:

  1. The stacked bar chart makes it hard to compare all but the first category, since the other portions of the bars are not aligned.1
  2. The bars are not proportional.2

Let’s try a grouped proportional bar chart instead. To achieve this, we need to do a little preparatory work, since ggplot2 does not offer a direct option for this purpose.

What we need to do is to prepare a data set with the proportions that we would like to plot. In the following lines, we

  1. use count() to count the number of observations in each combination of Treatment and Improved,
  2. group the data set by Treatment, and
  3. compute the proportions using mutate().
arthritis_prop <-
  Arthritis %>%
  count(Treatment, Improved) %>% # like group_by + summarize
  group_by(Treatment) %>%
  mutate(prop = n / sum(n))

Now, we’re ready to construct our plot. This will be done similarly to a standard grouped bar plot, except we will use geom_col() and map the y aesthetic, as we are working with summary data.

ggplot(arthritis_prop, aes(Treatment, prop, fill = Improved)) +
  geom_col(color = "black", position = "dodge") +
  labs(fill = "Improvement", y = "Proportion")
Grouped, proportional bar chart of the arthritis data.

Figure 3: Grouped, proportional bar chart of the arthritis data.

Now, it becomes quite easy to compare the proportion of people who showed some improvement between the two groups.

See if you can also create a proportional stacked bar chart and compare the result with your previous charts.

3 Mosaic Plots

In the lecture, we also encountered mosaic plots, which are proportional bar charts that map not only the height of the bars but also the width of the bars to group sizes. We’ll use the ggmosaic package here.

# install.packages("ggmosaic") # if don't already have the package

library(ggmosaic)

ggmosaic works directly with a ggplot2 object, adding auxiliary functions (geom_*() functions) to extend its functionality.

ggplot(Arthritis) +
  geom_mosaic(aes(x = product(Improved, Treatment), fill = Improved))

The key features to notice about this call are

  1. that the aes() specification must reside inside the call to geom_mosaic() and not in the ggplot() function, and
  2. the call to product(), which acts simply as a list of variables that you want to use in the mosaic plot.

The main benefit of the mosaic plot is that it allows us to see more of the data. With bar charts, separate visualizations were needed to provide information on the overall size of the groups (treatment arms). In contract, the mosaic plot enables us to use just a single plot.3

Try to switch the order of the variables in the product() call. What happens? How has the information you see changed?

4 Waffle Plots

Waffle plots are easy-to-read visualizations that, while quite uncommon in statistical visualizations, are much more prevalent in media outlets such as newspapers. As we mentioned in the lecture, they are at their best when some of the categories are so small that their sizes are hard to estimate in a bar chart or mosaic plot.

# install.packages("waffle") # if you haven't already installed it

The waffle package requires that our data contains counts of the categories, so before we can use it, we need to do a bit of wrangling. More precisely, you need to group the data by Improved and Treatment and then use summarize() together with either the n() function to get either counts or some other expression for proportions. We’ll settle for counts in this example. Before looking at the following code, see if you can do this by yourself. The result should be a 6-row tibble with columns Treatment, Improved and Count (or whatever you want to call it). Save the result to an object called arthritis_summarized.

arthritis_summarized <-
  Arthritis %>%
  group_by(Treatment, Improved) %>%
  summarize(Count = n())

Now we’re ready to start making waffles! The aesthetics you need to know about here are fill and values. Values is what we want to map the counts or proportions that we just created to and fill will connect our inner-most group to fill color. We start with the simplest possible waffle chart, just looking at improvement among the participants in the treatment arm.

library(waffle)

filter(arthritis_summarized, Treatment == "Treated") %>%
  ggplot(aes(fill = Improved, values = Count)) +
  geom_waffle() +
  labs(fill = "Improvement")
A basic waffle chart of the results from the treatment arm.

Figure 4: A basic waffle chart of the results from the treatment arm.

This looks kind of like a waffle chart, but there’s a lot of superfluous information that we’d rather get rid off. Thankfully, waffle provides the helpful theme_enhance_waffle() function for this. Moreover, we probably prefer the symbols (rectangles) to be squares — it’s supposed to look like a waffle after all, right? — which we can accomplish by forcing the coordinate system to be isometric using coord_equal(). Finally, I think it looks more aesthetically pleasing if we swap the black line color for white, increase their widths slightly, and change theme to the minimal variant.

filter(arthritis_summarized, Treatment == "Treated") %>%
  ggplot(aes(fill = Improved, values = Count)) +
  geom_waffle(color = "white", size = 1.5) +
  coord_equal() +
  theme_minimal() +
  theme_enhance_waffle() +
  labs(fill = "Improvement")
A prettier waffle plot.

Figure 5: A prettier waffle plot.

This looks much better, doesn’t it?

To be able to use waffle charts to see both of the treatment arms at once, we need to use something called facets, which we haven’t covered in the course yet. If you already know about facets, however, feel free to modify the plot now by faceting on the type of treatment.

5 Source Code

The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/arthritis.Rmd.


  1. This is not quite as severe a problem for this particular visualization since the inner grouping variable is ordinal. For other problems, however, this can be much more detrimental to the result.↩︎

  2. To be fair, the differences between the group sizes are small here, but this also means that we don’t lose much information when using proportional bars instead.↩︎

  3. However, it is important to note that we’re showing the relative size. If we want to display the actual counts in the outer category, we still need additional plots or must provide this information in text.↩︎