The 1988 Chilean national plebiscite ended the reign of Chile’s then president (and former dictator) Augusto Pinochet. Those opposed to extending Pinochet’s rule for another eight years won with 56% of the vote. In this worked example we’ll probe the results of this plebiscite (referendum) by looking at a survey of voting intentions conducted a couple of months before the vote.
The dataset is available in the carData package. Start by installing it (if you haven’t already).
# install the carData package
install.packages("carData")
Then load the dataset directly into your workspace by calling the data() function as shown next.
data("Chile", package = "carData")
There are eight variables in this dataset:
region
population
sex
age
education
income
statusquo
vote
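Before recoding anything, it can help to look at how the variables are stored. The following is a minimal sketch using base R only; the object names factor_levels and na_counts are introduced here for illustration and are not part of the original example.

```r
# Load the survey data (assumes the carData package is installed)
data("Chile", package = "carData")

# Overview of variable types and the first few values
str(Chile)

# Raw levels of each factor; these are the terse codes that get recoded later
factor_levels <- lapply(Filter(is.factor, Chile), levels)
factor_levels

# Number of missing values in each variable
na_counts <- colSums(is.na(Chile))
na_counts
```

The missing-value counts are worth keeping in mind, since several of the plots further down treat NA as a category of its own.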
Before we start visualizing this dataset, we need to format it properly. The names of the variables are fine, but the factor levels are hard to decipher. The education variable is also not coded as an ordinal variable, although it should be. We save the formatted dataset in a new object called Chile2.
library(tidyverse)

Chile2 <- Chile %>%
  as_tibble() %>%
  mutate(
    # replace the single-letter codes with readable labels
    region = recode(
      region,
      "C" = "Central",
      "M" = "Metropolitan Santiago",
      "N" = "North",
      "S" = "South",
      "SA" = "City of Santiago"
    ),
    sex = recode(sex, "F" = "Women", "M" = "Men"),
    # recode education as an ordered factor, from lowest to highest level
    education = recode_factor(
      education,
      "P" = "Primary",
      "S" = "Secondary",
      "PS" = "Post-Secondary",
      .ordered = TRUE
    ),
    vote = recode(
      vote,
      "A" = "Abstain",
      "N" = "No",
      "U" = "Undecided",
      "Y" = "Yes"
    )
  )

Chile2
Chile2
## # A tibble: 2,700 x 8
## region population sex age education income statusquo vote
## <fct> <int> <fct> <int> <ord> <int> <dbl> <fct>
## 1 North 175000 Men 65 Primary 35000 1.01 Yes
## 2 North 175000 Men 29 Post-Secondary 7500 -1.30 No
## 3 North 175000 Women 38 Primary 15000 1.23 Yes
## 4 North 175000 Women 49 Primary 35000 -1.03 No
## 5 North 175000 Women 23 Secondary 35000 -1.10 No
## 6 North 175000 Women 28 Primary 7500 -1.05 No
## 7 North 175000 Men 26 Post-Secondary 35000 -0.786 No
## 8 North 175000 Women 24 Secondary 15000 -1.11 No
## 9 North 175000 Women 41 Primary 15000 -1.01 Undecided
## 10 North 175000 Men 41 Primary 15000 -1.30 No
## # … with 2,690 more rows
That looks much better.
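If you want to convince yourself that the ordinal coding of education works as intended, one option is to reproduce it with base R and cross-tabulate the result against the original codes. This is only a sketch of an equivalent approach; education_ordered is a name made up for this illustration.

```r
data("Chile", package = "carData")

# Base R counterpart of the recode_factor() call: the order of `levels`
# defines the ranking, and `labels` supplies the readable names
education_ordered <- factor(
  Chile$education,
  levels = c("P", "S", "PS"),
  labels = c("Primary", "Secondary", "Post-Secondary"),
  ordered = TRUE
)

levels(education_ordered)
is.ordered(education_ordered)

# Cross-tabulate old codes against new labels to confirm the mapping
table(Chile$education, education_ordered, useNA = "ifany")
```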
Now let’s start by simply checking what the people in this survey actually intended to vote.
ggplot(Chile2, aes(vote)) +
geom_bar()
As we can see, the referendum seems to engage much of the population under study. The no voters lead by a slim margin here, but many respondents are still undecided.
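The bar chart can be backed up with the underlying counts and shares. The sketch below works on the raw Chile data from carData so that it runs on its own, which means the vote levels appear as the original codes (A = Abstain, N = No, U = Undecided, Y = Yes); vote_counts is a name chosen for this illustration.

```r
library(dplyr)

data("Chile", package = "carData")

vote_counts <- Chile %>%
  count(vote) %>%               # one row per level, plus a row for NA
  mutate(share = n / sum(n))    # proportion of all respondents

vote_counts
```

Comparing the shares for N and Y gives a sense of how slim the margin is, and the NA row shows how many respondents did not report an intention at all.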
Let’s try to flesh out these differences in more detail. We begin by studying age and voting intentions. Our first attempt is a proportional stacked density plot, which focuses on the proportions of voting intentions at each age.
But before we do so, we will switch to a better default color palette for categorical (discrete) data, since the default palette in ggplot2 is not ideal. A good choice could, for instance, be the “Accent” palette from RColorBrewer, but to spice it up we’ll go with a color palette inspired by The Darjeeling Limited by Wes Anderson, which is available via the wesanderson package.[1]
# install.packages("wesanderson")
library(wesanderson)
# wrapper that applies the Darjeeling2 palette to discrete fill scales
darjeeling_pal <- function(...) {
  scale_fill_manual(
    ...,
    values = wes_palette("Darjeeling2"),
    na.value = "grey"
  )
}
# this option replaces the default palette for all future ggplots
options(ggplot2.discrete.fill = darjeeling_pal)
ggplot(Chile2, aes(age, fill = vote)) +
  geom_density(position = "fill", col = "white")
Voting intentions and age.
This is probably quite fine as a comparison between the different voting categories, but it makes it very hard to see the actual age distribution within each category. An alternative is to use the ggridges package, which lets us plot vertically shifted density plots.
# install.packages("ggridges")
library(ggridges)
ggplot(Chile2, aes(age, vote, fill = vote)) +
  geom_density_ridges2() +
  theme(legend.position = "none") # we don't need the legend here
Distributions of age and voting intentions.
All of the distributions are to some extent right-skewed.[2] The problem with this plot, however, is that it does not really answer the question we’re interested in, which is to sift out to what degree various voter groups favor the different options.
Taken together, it’s clear that older respondents are more prone to vote yes to another term for Pinochet, while no voters, as well as people who intend to abstain, are more frequent among the younger voters. One deceptive thing about these visualizations is that missing data appears more prevalent in the plots than is in fact the case. In the first plot, this happens because the NA values are relatively more frequent among the older ages, and this part of the chart is enlarged in the proportional version because this age category is in the minority. In the second, ggridges version, this happens simply because each geom shows the distribution within each category and all of them sum to the same area.
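One way to check how prevalent the missing votes really are is to tabulate them directly. The sketch below splits age into four equal-width groups with cut_interval() from ggplot2 and computes the share of missing votes in each group. The choice of four groups and the names na_by_age, n_missing_vote, and share_missing_vote are assumptions made for this illustration.

```r
library(dplyr)
library(ggplot2) # provides cut_interval()

data("Chile", package = "carData")

na_by_age <- Chile %>%
  filter(!is.na(age)) %>%                      # age itself has missing values
  mutate(age_group = cut_interval(age, 4)) %>% # four equal-width age groups
  group_by(age_group) %>%
  summarise(
    n = n(),                                   # respondents in the group
    n_missing_vote = sum(is.na(vote)),         # missing votes in the group
    share_missing_vote = mean(is.na(vote))     # share of missing votes
  )

na_by_age
```

Reading n next to share_missing_vote makes it easier to judge whether a large NA band in the proportional plot reflects many respondents or just a small age group.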
We now continue to build our plot by introducing information on sex as well. The best way to do this is to use facets. When you are faceting on a single variable, you should mostly use facet_wrap(), since it works better than facet_grid() when there are many levels.
ggplot(Chile2, aes(age, fill = vote)) +
  geom_density(position = "fill", col = "white") +
  geom_rug(alpha = 0.05) + # display marginal distribution
  facet_wrap("sex")
Voting intentions across the ages and sexes.
Looking at this plot, it seems like there are a few differences between men and women, for instance that more women than men seem poised to vote yes among the younger population.
Note that we’ve also included a rug geom in our plot (via geom_rug()), which displays the marginal distribution of age within each panel. Rugs (or other geoms that display marginal distributions) are a good idea when you use proportional density plots, since they indicate to the reader (and to you) where to be skeptical about the accuracy of the results.
We can take this even one step further, by involving our variables on education, income, or population. Let’s see if education matters.
Chile2 %>%
  drop_na(education) %>% # drop the very small group with missing education
  ggplot(aes(age, fill = vote)) +
  geom_density(position = "fill", col = "white") +
  geom_rug(alpha = 0.05) +
  facet_grid(sex ~ education)
Voting intentions across age, sex, and educational attainment.
Education certainly seems to matter for voting intentions. You can, for instance, spot this among the youngest voters, where those with a secondary level of education seem less prone to vote yes and more uncertain too. Interestingly, the addition of education also seems to exacerbate the differences between the sexes in the younger population.
One thing we need to be very careful about is that we now have very little data in some regions of the plot, notably the older age groups with post-secondary education (especially the women), who seem to be very few judging by the rug plot. For an even clearer demonstration of this, see the following plot, where we’ve switched to a stacked non-proportional density plot.
Chile2 %>%
  drop_na(education) %>% # drop the very small group with missing education
  ggplot(aes(age, fill = vote)) +
  geom_density(position = "stack", col = "white") +
  facet_grid(sex ~ education)
Voting intentions across age, sex, and educational attainment. A standard stacked density plot.
This means that we should think again about our design. One possibility may be to cut age into age groups before plotting.
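A minimal sketch of that idea follows, assuming four equal-width age groups (via cut_interval()) and a proportional bar chart in place of the density plot. It starts from the raw Chile data so that it runs on its own; chile_grouped, cell_sizes, and p_grouped are names invented for this illustration, and the number of groups is a judgment call rather than a recommendation.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

data("Chile", package = "carData")

chile_grouped <- Chile %>%
  drop_na(age, education) %>%
  mutate(
    # same ordinal coding of education as used for Chile2
    education = factor(
      education,
      levels = c("P", "S", "PS"),
      labels = c("Primary", "Secondary", "Post-Secondary"),
      ordered = TRUE
    ),
    age_group = cut_interval(age, 4) # four equal-width age groups
  )

# Number of respondents in each panel-by-age-group cell
cell_sizes <- count(chile_grouped, sex, education, age_group)
cell_sizes

# Proportional bars: each bar shows the vote shares within one age group
p_grouped <- ggplot(chile_grouped, aes(age_group, fill = vote)) +
  geom_bar(position = "fill") +
  facet_grid(sex ~ education) +
  labs(x = "Age group", y = "Proportion")

p_grouped
```

Printing cell_sizes alongside the plot makes the sparse cells explicit, which the proportional bars alone would hide.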
The survey also included a question on satisfaction regarding the current state of affairs, which is recorded in the statusquo variable. Let’s take a look at this variable across age and region with a simple scatter plot.
Chile2 %>%
  ggplot(aes(age, statusquo)) +
  geom_point(alpha = 0.5) +
  facet_wrap("region") +
  ylab("Support for the Status Quo")
A simple (but unsuccessful) attempt to plot the satisfaction with the status quo in Chile against age and region.
The results are not attractive. There clearly appears to be some kind of relationship between age, location, and support for the status quo, but this plot fails to make those relationships clear, despite the attempt to mitigate overplotting by making the points semi-transparent.
Our next attempt is to try a hexagonal binning plot instead.
Chile2 %>%
  ggplot(aes(age, statusquo)) +
  geom_hex(bins = 10) +
  facet_wrap("region") +
  ylab("Support for the Status Quo")
A hexagonal binning plot for the Chile survey data.
This certainly looks better. We now see that there appears to be a difference between the various regions and age groups. Support for the current regime seems to be weak in the City of Santiago, particularly for the younger age group.
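A visual impression like this can be cross-checked with a simple numerical summary. The sketch below computes the median of statusquo by region and age group on the raw Chile data, so regions appear as the original codes (for example SA for City of Santiago). The four equal-width age groups and the name median_support are assumptions made for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2) # provides cut_interval()

data("Chile", package = "carData")

median_support <- Chile %>%
  drop_na(age, statusquo) %>%
  mutate(age_group = cut_interval(age, 4)) %>%
  group_by(region, age_group) %>%
  summarise(
    n = n(),                                 # respondents in the cell
    median_statusquo = median(statusquo),    # typical support in the cell
    .groups = "drop"
  )

median_support
```

Lower medians indicate weaker support for the status quo, and the n column shows how much data each median rests on.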
One issue with using hexagon plots like this is that they put equal emphasis on both of the variables mapped to the x and y axes. When we are primarily interested in one of the variables—the support for the status quo here—this does not communicate this message effectively. To highlight the distribution of the status quo support, instead, it might therefore be more appropriate to use a different type of design.
To do so, we now go back to the ridged density plots. We cut the age variable into categories before plotting. To top it off, we also try a more minimalist theme (theme_minimal()). Finally, please make note of the labeller argument in facet_wrap(), which is important to remember whenever you have factor levels that are not self-explanatory (as here).
Chile2 %>%
  drop_na(age) %>%
  mutate(`Age group` = cut_interval(age, 4)) %>%
  ggplot(aes(statusquo, region, fill = region)) +
  facet_wrap(~`Age group`, nrow = 4, labeller = label_both) +
  geom_density_ridges() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(y = "Region",
       x = "Satisfaction with the Status Quo")
Gaussian density plots of satisfaction with the status quo in Chile in 1988 separated on age and region.
The plot is now a piece of work that we should be happy with. Yes, we did lose some information when we cut the age variable into categories, but the final plot does a much better job of highlighting the goal of the visualization, which is to show satisfaction with the current government in Chile ahead of the upcoming plebiscite.
There remain some possible issues here. One is the use of density estimates, which may not be appropriate for all of these categories, particularly not the Metropolitan Santiago region, since it is small. An alternative could be to switch to histograms instead.[3]
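A rough sketch of the histogram alternative is given below, using the "binline" stat that ggridges provides. It is built on the raw Chile data so that it runs on its own, which means regions show up as the original codes. The values bins = 20 and scale = 0.95 follow the style of the ggridges vignette and are not tuned for this dataset, and p_hist is a name invented for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggridges)

data("Chile", package = "carData")

p_hist <- Chile %>%
  drop_na(age, statusquo) %>%
  mutate(`Age group` = cut_interval(age, 4)) %>%
  ggplot(aes(statusquo, region, fill = region,
             height = after_stat(density))) +
  # stat = "binline" draws ridgeline histograms instead of density curves
  geom_density_ridges(
    stat = "binline",
    bins = 20,
    scale = 0.95,          # keep neighboring ridges from overlapping
    draw_baseline = FALSE
  ) +
  facet_wrap(~`Age group`, nrow = 4, labeller = label_both) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(y = "Region",
       x = "Satisfaction with the Status Quo")

p_hist
```

Histograms make no smoothness assumption, so a small group shows up as a sparse, jagged ridge rather than as a deceptively smooth curve.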
The source code for this document is available at https://github.com/stat-lu/STAE04/blob/master/worked-example-chile-plebiscite.Rmd.
[1] Feel free to replace the palette with your favorite Wes Anderson movie palette. How about Moonrise Kingdom?
[2] The nomenclature for skewness is confusing. When a distribution is said to be skewed to the right, it means that the tail is on the right side, even though it would probably make more sense to refer to the center of the distribution (and say that it was then left-skewed), but we need to stick to convention here. For more information, go to https://en.wikipedia.org/wiki/Skewness.
[3] See https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html for a guide on how to use histograms instead of density plots.