1 Probing an Unknown Dataset

Let’s say that we have data from an unknown source, containing two variables, x and y. The data has been uploaded to the course website, and you can download and read it into your workspace with the following command.

library(tidyverse)

xy <- read_csv(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/xy.csv"
)

1.1 Box Plot

Let’s try to explore the univariate distributions of these two variables. We begin with a box plot.

ggplot(xy, aes(x)) +
  geom_boxplot()

Figure 1: A boxplot of the x variable

As you see, our call to ggplot2 is quite simple. We use ggplot() with the data as the first argument and a call to aes() for our aesthetics specification as our second argument. The call to aes() is actually optional and you can provide your mappings directly within the layers (geom functions) if you prefer.

The distribution seems to be reasonably symmetric but with a longer tail to the right and with a single outlier at around 100. The box plot also indicates that the distribution is uni-modal, with a center around 55. But this is the problem with box plots: they can never “discover” anything but a single mode, since box plots are based on statistics only related to the quantiles of the distribution.

1.2 Dot Plot

When studying a distribution, it is usually better to start off with a plot that does less abstraction, such as a simple dot plot. In a dot plot, each observation is indicated by a dot.

When constructing the dot plot, the values that are used are cut in intervals (much like a histogram), and observations that fall within a given interval are stacked on top of one another. The maximum width of these intervals is decided by the binwidth, which by default is 1/30 times the range of the data. The exact behavior then depends on the setting of the method argument, which by default is "dotdensity", which uses an algorithm from a paper by Wilkinson¹. Here, we try to use the "histodot" method instead, which works exactly like a histogram. We set the binwidth to 3.

ggplot(xy, aes(x)) +
  geom_dotplot(method = "histodot", binwidth = 3) +
  guides(y = "none") # do you remember why we did this?

Figure 2: A dotplot of the x variable

Try changing the method argument to the default, and then try playing a bit with the binwidth settings.

It’s hard to summarize this distribution in a few words, but it does not quite seem to be uni-modal. There is also a slight indication that the values may be biased towards certain values here, which for instance is a common problem in questionnaires, where many will “round off” their answers to a whole number when they are uncertain about the answer.

1.3 Histogram

Perhaps this is the case here, so let’s use a histogram instead. Just as with the dot plot, we need to specify the binwidth (or number of bins, bins) here. Given the number of observations here, and the dot plot we previously plotted, a bin width of five might be appropriate to remove the effect of possible round-off bias and elucidate the shape of the distribution better.

ggplot(xy, aes(x)) +
  geom_histogram(binwidth = 5)

Figure 3: A histogram of the x variable

Play around a little bit with the various settings of bins and/or binwidth to see how large the effects of these parameters can be.

1.4 Density Plot

We could also try a density plot here. Density plots are best when you know that your data is measured on a continuous scale and you can make strong assumptions about the distributions of your variables. This is because the choice of parameters for the density kernel has a large effect on the result.

Here, we try a few different choices to highlight this.

ggplot(xy, aes(x)) +
  geom_density() + # kernel = "gaussian", the default
  geom_density(col = "dark orange", bw = 10)

Figure 4: Density plots with various density estimate choices

Try adding another density estimate, this time with a rectangular kernel.

The default density layer uses Silverman’s rule-of-thumb (see ?bw.nrd0) for choosing band width but there are other choices that are probably better, such as bw = "SJ". Try this out too.

1.5 Compactly Displaying Several Univariate Distributions

It can sometimes be useful to display several plots of univariate distributions in a compact design. One way to do so is via the patchwork package, which can be used to patch together plots that you have created separately.

Let’s install the package first.

install.packages("patchwork")

patchwork is simple and intuitive yet powerful. Most of its functionality can be obtained simply by using unary operators (such as +) to patch plots together.

+ and | are used to patch plots horizontally, and
/ to patch them vertically.

In the following plot, we patch a histogram and density plot together.

library(patchwork)

p1 <- ggplot(xy, aes(x)) +
  geom_density(bw = "SJ")

p2 <- ggplot(xy, aes(y)) +
  geom_histogram(bins = 12)

p1 | p2

Figure 5: Density plot of x and histogram of y

Next, we try to stack them horizontally.

p1 / p2

Figure 6: Density plot of x and histogram of y.

And here is an example where we go one step further, using a more complicated layout and adding labels to the plots.

p3 <- ggplot(xy, aes(x, y = 1)) +
  geom_violin()

pw <- (p1 | p2) / p3

pw + plot_annotation(tag_levels = "A")

Simple, right? Play around with patchwork and it’s various options.

It’s particularly useful to know that if you have a shared legend, such as a color legend, between the plots you’re combining, patchwork will automatically merge these guides into one.

For more examples, check out documentation.

2 Rawr!

We’ve now plotted several types of plots of the distributions of this dataset and might well feel comfortable about making a few conclusions about the data. However, stopping here would be a severe mistake. The reason is that we’ve overlooked plotting the joint distribution of these variables, which can be easily done using a scatter plot. We will explore this in much more detail in the next part of the course.

ggplot(xy, aes(x, y)) +
  geom_point()

Figure 7: The terrifying Datasaurus Rex

What is the take-home message here? Well, it’s twofold:

When you visualize distributions of several variables one by one, you are projecting these distributions from a multivariate space into a univariate space. In doing so, you effectively ignore all the ways in which these variables are related to each another. To be fair, visualizing many variables at once isn’t always easy (or even possible). However, in this case, we would have been far better off considering a scatter plot from the start.
Knowing your data is critical. You should never throw yourself into a dataset without at least some understanding of what the variables represent, and ideally, how they were collected. Make it a habit to familiarize yourself with the dataset before you start visualizing it. For most of the data we’ll tackle in this course, documentation is often available, which you can access by looking at the help file for the dataset.

3 Source Code

The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/drex.Rmd.

Remember him? The author of the original grammar of graphics.↩︎

Worked Example: D-Rex

Johan Larsson

Behnaz Pirzamanbein