In this worked example, we will be working with data from a treatment trial for
Arthritis. The data comes from the vcd
package. To begin, let’s load the
package and take a glimpse()
at the data.
# install.packages("vcd") # if you haven't installed it yet
library(vcd)
library(tidyverse)
glimpse(Arthritis)
## Rows: 84
## Columns: 5
## $ ID <int> 57, 46, 77, 17, 36, 23, 75, 39, 33, 55, 30, 5, 63, 83, 66, 4…
## $ Treatment <fct> Treated, Treated, Treated, Treated, Treated, Treated, Treate…
## $ Sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male, Male, …
## $ Age <int> 27, 29, 30, 32, 46, 58, 59, 59, 63, 63, 64, 64, 69, 70, 23, …
## $ Improved <ord> Some, None, None, Marked, Marked, Marked, None, Marked, None…
Notice that Improved
is helpfully formatted as an ordinal (ordered) variable
already, which is going to help ggplot to choose the correct color palettes for
us.
Let’s start with a simple bar chart to see how many people were allocated to the two treatment arms.
Figure 1: Treatment allocation for the arthritis study.
Now for something more interesting. Let’s try to look at how well the treatment seems to have worked. We start with a stacked bar chart.
ggplot(Arthritis, aes(Treatment, fill = Improved)) +
geom_bar(color = "black") +
labs(fill = "Improvement") # add a better legend title
Figure 2: Treatment effect in the arthritis study.
It’s immediately clear that the treatment has worked well! The count of people with a improvement is much higher in the treatment arm compared with the placebo arm. But how do the treatments compare in terms of people who’ve made some improvement? This is difficult to say from our approach here. This is due to two reasons:
Let’s try a grouped proportional bar chart instead. To achieve this, we need to do a little preparatory work, since ggplot2 does not offer a direct option for this purpose.
What we need to do is to prepare a data set with the proportions that we would like to plot. In the following lines, we
count()
to count the number of observations in each combination of
Treatment
and Improved
,Treatment
, andmutate()
.arthritis_prop <-
Arthritis %>%
count(Treatment, Improved) %>% # like group_by + summarize
group_by(Treatment) %>%
mutate(prop = n / sum(n))
Now, we’re ready to construct our plot. This will be done similarly to a standard
grouped bar plot, except we will use geom_col()
and map the y
aesthetic, as we
are working with summary data.
ggplot(arthritis_prop, aes(Treatment, prop, fill = Improved)) +
geom_col(color = "black", position = "dodge") +
labs(fill = "Improvement", y = "Proportion")
Figure 3: Grouped, proportional bar chart of the arthritis data.
Now, it becomes quite easy to compare the proportion of people who showed some improvement between the two groups.
See if you can also create a proportional stacked bar chart and compare the result with your previous charts.
In the lecture, we also encountered mosaic plots, which are proportional bar charts that map not only the height of the bars but also the width of the bars to group sizes. We’ll use the ggmosaic package here.
ggmosaic works directly with a ggplot2 object,
adding auxiliary functions (geom_*()
functions) to extend its functionality.
The key features to notice about this call are
aes()
specification must reside inside the call to
geom_mosaic()
and not in the ggplot()
function, andproduct()
, which acts simply as a list of variables that
you want to use in the mosaic plot.The main benefit of the mosaic plot is that it allows us to see more of the data. With bar charts, separate visualizations were needed to provide information on the overall size of the groups (treatment arms). In contract, the mosaic plot enables us to use just a single plot.3
Try to switch the order of the variables in the product()
call. What
happens? How has the information you see changed?
Waffle plots are easy-to-read visualizations that, while quite uncommon in statistical visualizations, are much more prevalent in media outlets such as newspapers. As we mentioned in the lecture, they are at their best when some of the categories are so small that their sizes are hard to estimate in a bar chart or mosaic plot.
The waffle package requires that our data contains counts of the categories, so
before we can use it, we need to do a bit of wrangling. More precisely, you need
to group the data by Improved
and Treatment
and then use summarize()
together with either the n()
function to get either counts or some other
expression for proportions. We’ll settle for counts in this example. Before
looking at the following code, see if you can do this by yourself. The result
should be a 6-row tibble with columns Treatment
, Improved
and Count
(or
whatever you want to call it). Save the result to an object called
arthritis_summarized
.
Now we’re ready to start making waffles! The aesthetics you need to know about
here are fill
and values
. Values is what we want to map the counts or
proportions that we just created to and fill
will connect our inner-most group
to fill color. We start with the simplest possible waffle chart, just looking at
improvement among the participants in the treatment arm.
library(waffle)
filter(arthritis_summarized, Treatment == "Treated") %>%
ggplot(aes(fill = Improved, values = Count)) +
geom_waffle() +
labs(fill = "Improvement")
Figure 4: A basic waffle chart of the results from the treatment arm.
This looks kind of like a waffle chart, but there’s a lot of superfluous
information that we’d rather get rid off. Thankfully, waffle provides the
helpful theme_enhance_waffle()
function for this. Moreover, we probably prefer
the symbols (rectangles) to be squares — it’s supposed to look like a waffle
after all, right? — which we can accomplish by forcing the coordinate system to
be isometric using coord_equal()
. Finally, I think it looks more aesthetically
pleasing if we swap the black line color for white, increase their widths
slightly, and change theme to the minimal variant.
filter(arthritis_summarized, Treatment == "Treated") %>%
ggplot(aes(fill = Improved, values = Count)) +
geom_waffle(color = "white", size = 1.5) +
coord_equal() +
theme_minimal() +
theme_enhance_waffle() +
labs(fill = "Improvement")
Figure 5: A prettier waffle plot.
This looks much better, doesn’t it?
To be able to use waffle charts to see both of the treatment arms at once, we need to use something called facets, which we haven’t covered in the course yet. If you already know about facets, however, feel free to modify the plot now by faceting on the type of treatment.
The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/arthritis.Rmd.
This is not quite as severe a problem for this particular visualization since the inner grouping variable is ordinal. For other problems, however, this can be much more detrimental to the result.↩︎
To be fair, the differences between the group sizes are small here, but this also means that we don’t lose much information when using proportional bars instead.↩︎
However, it is important to note that we’re showing the relative size. If we want to display the actual counts in the outer category, we still need additional plots or must provide this information in text.↩︎