Let’s say that we have data from an unknown source, containing two variables, x and y. The data has been uploaded to the course website, and you can download and read it into your workspace with the following command.
library(tidyverse)
xy <- read_csv(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/xy.csv"
)
Let’s try to explore the univariate distributions of these two variables. We begin with a box plot.
Figure 1: A boxplot of the x variable
As you see, our call to ggplot2 is quite simple. We use ggplot() with the data as the first argument and a call to aes() for our aesthetics specification as our second argument. The call to aes() is actually optional and you can provide your mappings directly within the layers (geom functions) if you prefer.
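The call behind Figure 1 is not reproduced in this text. A minimal sketch of what it could look like, assuming default geom_boxplot() settings, is given below together with the alternative of placing the mapping inside the layer. The names box_global and box_layer are made up for illustration, and the simulated stand-in for xy is hypothetical and only used if the real data has not been read in.

```r
library(ggplot2)

# Hypothetical stand-in, used only if `xy` has not been read in as shown above.
if (!exists("xy")) {
  set.seed(1)
  xy <- data.frame(
    x = rnorm(100, mean = 55, sd = 15),
    y = rnorm(100, mean = 50, sd = 25)
  )
}

# Mapping supplied in ggplot(): inherited by every layer that is added.
box_global <- ggplot(xy, aes(x)) +
  geom_boxplot()

# Mapping supplied in the layer: applies to this geom only.
box_layer <- ggplot(xy) +
  geom_boxplot(aes(x))

box_global
```

Both objects should draw the same box plot. The difference only starts to matter once a plot has several layers that need different mappings.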
The distribution seems to be roughly symmetric, although with a somewhat longer tail to the right and a single outlier at around 100. The box plot also suggests that the distribution is unimodal, with a center around 55. But this is the problem with box plots: they can never “discover” anything but a single mode, since they are based only on quantile statistics of the distribution.
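To see why quantile-based summaries hide modes, here is a small sketch on simulated data (not the xy data). The samples and the object name five_number are hypothetical. fivenum() returns the kind of summary a box plot is built from.

```r
set.seed(1)

# Simulated samples, purely for illustration.
unimodal <- rnorm(1000, mean = 55, sd = 15)
bimodal <- c(
  rnorm(500, mean = 40, sd = 5),
  rnorm(500, mean = 70, sd = 5)
)

# Minimum, lower hinge, median, upper hinge and maximum of each sample.
five_number <- rbind(
  unimodal = fivenum(unimodal),
  bimodal = fivenum(bimodal)
)
colnames(five_number) <- c("min", "lower", "median", "upper", "max")
five_number
```

The median of the bimodal sample lands in the gap between the two modes, where there are few observations, and nothing in the five numbers signals that.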
When studying a distribution, it is usually better to start off with a plot that does less abstraction, such as a simple dot plot. In a dot plot, each observation is indicated by a dot.
When constructing the dot plot, the values are cut into intervals (much like a histogram), and observations that fall within a given interval are stacked on top of one another. The maximum width of these intervals is decided by the binwidth, which by default is 1/30 times the range of the data. The exact behavior then depends on the setting of the method argument, which by default is "dotdensity", which uses an algorithm from a paper by Wilkinson [1]. Here, we try to use the "histodot" method instead, which works exactly like a histogram. We set the binwidth to 3.
ggplot(xy, aes(x)) +
  geom_dotplot(method = "histodot", binwidth = 3) +
  guides(y = "none") # do you remember why we did this?
Figure 2: A dotplot of the x variable
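As a rough illustration of the default mentioned above (1/30 times the range of the data), the arithmetic can be sketched in base R. The vector x_demo is hypothetical and chosen only so that the numbers are easy to follow. It is not the xy data.

```r
# Hypothetical values, chosen so that the arithmetic is easy to follow.
x_demo <- c(10, 25, 40, 70, 100)

# Default binwidth described above: 1/30 of the range of the data.
default_binwidth <- diff(range(x_demo)) / 30
default_binwidth # 3
```

For your own data, replacing x_demo with xy$x gives a starting point against which a manually chosen binwidth can be compared.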
Try changing the method argument to the default, and then try playing a bit with the binwidth settings.
It’s hard to summarize this distribution in a few words, but it does not quite seem to be unimodal. There is also a slight indication that the observations cluster at certain values, which is a common problem in questionnaires, for instance, where many respondents “round off” their answers to a whole number when they are uncertain about the answer.
Perhaps this is the case here, so let’s use a histogram instead. Just as with the dot plot, we need to specify the binwidth (or number of bins, bins) here. Given the number of observations, and the dot plot we previously plotted, a bin width of five might be appropriate to remove the effect of possible round-off bias and bring out the shape of the distribution more clearly.
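The histogram call itself is not reproduced in this text. A minimal sketch, assuming nothing beyond the bin width of five mentioned above, could look like this. The name hist_x is made up for illustration, and the simulated stand-in for xy is hypothetical and only used if the real data has not been read in.

```r
library(ggplot2)

# Hypothetical stand-in, used only if `xy` has not been read in as shown above.
if (!exists("xy")) {
  set.seed(1)
  xy <- data.frame(
    x = rnorm(100, mean = 55, sd = 15),
    y = rnorm(100, mean = 50, sd = 25)
  )
}

# Histogram of x with bins that are five units wide.
hist_x <- ggplot(xy, aes(x)) +
  geom_histogram(binwidth = 5)

hist_x
```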
Figure 3: A histogram of the x variable
Play around a little bit with the various settings of bins and/or binwidth to see how large the effects of these parameters can be.
We could also try a density plot here. Density plots are best when you know that your data is measured on a continuous scale and you can make strong assumptions about the distributions of your variables. This is because the choice of parameters for the density kernel has a large effect on the result.
Here, we try a few different choices to highlight this.
ggplot(xy, aes(x)) +
  geom_density() + # kernel = "gaussian", the default
  geom_density(col = "dark orange", bw = 10)
Figure 4: Density plots with various density estimate choices
Try adding another density estimate, this time with a rectangular kernel.
The default density layer uses Silverman’s rule of thumb (see ?bw.nrd0) for choosing the bandwidth, but there are other choices that are probably better, such as bw = "SJ". Try this out too.
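To get a feeling for how the two bandwidth selectors differ, the sketch below computes both on simulated bimodal data (not the xy data). The sample x_sim and the other object names are hypothetical. bw.nrd0() and bw.SJ() are the base R functions behind the default and bw = "SJ", respectively.

```r
set.seed(1)

# Simulated bimodal sample, purely for illustration.
x_sim <- c(rnorm(100, mean = 40, sd = 5), rnorm(100, mean = 70, sd = 5))

# Silverman's rule of thumb, written out by hand ...
bw_by_hand <- 0.9 * min(sd(x_sim), IQR(x_sim) / 1.34) * length(x_sim)^(-1 / 5)

# ... and the two built-in selectors.
bw_silverman <- bw.nrd0(x_sim)
bw_sj <- bw.SJ(x_sim)

c(by_hand = bw_by_hand, nrd0 = bw_silverman, SJ = bw_sj)
```

The rule of thumb depends only on the spread and size of the sample, which is why it tends to smooth over features such as multiple modes.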
It can sometimes be useful to display several plots of univariate distributions in a compact design. One way to do so is via the patchwork package, which can be used to patch together plots that you have created separately.
Let’s install the package first, for instance by calling install.packages("patchwork").
patchwork is simple and intuitive yet powerful. Most of its functionality can be obtained simply by using operators (such as +) to patch plots together.
+ and | are used to patch plots horizontally, and / to patch them vertically.
In the following plot, we patch a density plot and a histogram together.
library(patchwork)
p1 <- ggplot(xy, aes(x)) +
  geom_density(bw = "SJ")
p2 <- ggplot(xy, aes(y)) +
  geom_histogram(bins = 12)
p1 | p2
Figure 5: Density plot of x and histogram of y
Next, we try to stack them vertically, using p1 / p2.
Figure 6: Density plot of x and histogram of y.
And here is an example where we go one step further, using a more complicated layout and adding labels to the plots.
p3 <- ggplot(xy, aes(x, y = 1)) +
  geom_violin()
pw <- (p1 | p2) / p3
pw + plot_annotation(tag_levels = "A")
Simple, right? Play around with patchwork and its various options.
It’s particularly useful to know that if the plots you’re combining share a legend, such as a color legend, patchwork can merge these guides into one; you request this with plot_layout(guides = "collect").
For more examples, check out the patchwork documentation.
We’ve now plotted several types of plots of the distributions in this dataset and might well feel comfortable drawing a few conclusions about the data. However, stopping here would be a severe mistake. The reason is that we’ve overlooked plotting the joint distribution of these variables, which can easily be done using a scatter plot. We will explore this in much more detail in the next part of the course.
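The scatter plot call is not reproduced in this text. A minimal sketch, assuming a plain geom_point() layer, could look like this. The name scatter_xy is made up for illustration, and the simulated stand-in for xy is hypothetical and will of course not show the same pattern as the real data.

```r
library(ggplot2)

# Hypothetical stand-in, used only if `xy` has not been read in as shown above.
if (!exists("xy")) {
  set.seed(1)
  xy <- data.frame(
    x = rnorm(100, mean = 55, sd = 15),
    y = rnorm(100, mean = 50, sd = 25)
  )
}

# Joint distribution of x and y: one point per observation.
scatter_xy <- ggplot(xy, aes(x, y)) +
  geom_point()

scatter_xy
```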
Figure 7: The terrifying Datasaurus Rex
What is the take-home message here? Well, it’s twofold:
The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/drex.Rmd.
[1] Remember him? The author of the original grammar of graphics.