Before we start visualizing this data set, we need to format it properly.
The names of the variables are fine, but the town variable should be formatted as an ordered factor and age and size should be formatted as integers.
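The formatting step itself is not shown here, so the following is only a minimal sketch of how it could be done with dplyr. The toy tibble, its values, and the name budget_food_raw are assumptions made for illustration; only the column names and the resulting object name budget_food come from the text and the plotting code.

```r
library(dplyr)

# Hypothetical stand-in for the raw data; the values are invented.
budget_food_raw <- tibble(
  wfood  = c(0.47, 0.31, 0.00, 0.52),
  totexp = c(1290941, 1277978, 845852, 527698),
  age    = c(43, 40, 28, 60),
  size   = c(5, 3, 1, 2),
  town   = c(2, 2, 4, 5)
)

budget_food <-
  budget_food_raw %>%
  mutate(
    town = factor(town, ordered = TRUE), # ordered factor
    age  = as.integer(age),              # whole years
    size = as.integer(size)              # number of household members
  )

# Inspect the resulting column types.
str(budget_food)
```

Calling str() afterwards is a quick way to confirm that the conversions took effect before any plotting is done.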
Let’s start naively, trying out a simple scatter plot of food expenditure versus total expenditure.
ggplot(budget_food, aes(totexp, wfood)) +
  geom_point() +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
Figure 1: A scatter plot of total expenditure versus expenditure on food as a fraction of total expenditure
Despite the large amount of overlap, we can still glean a few insights. First off, food expenditure as a proportion of total expenditure decreases with total expenditure. This agrees with our intuition. Second, there are a number of observations that report no food expenditure whatsoever. These observations are likely outliers, perhaps people living in elderly care centers or something similar. Let’s remove these observations from the data set.
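The removal step is not shown, so here is a minimal sketch of one way to do it, assuming that "no food expenditure" means a value of exactly zero in wfood. The toy tibble and its values are invented for illustration; the name budget_food_cleaned matches the object used in the plots that follow.

```r
library(dplyr)

# Hypothetical stand-in for budget_food; the values are invented.
budget_food <- tibble(
  wfood  = c(0.47, 0.00, 0.31, 0.00),
  totexp = c(1290941, 845852, 1277978, 527698)
)

# Keep only households that report some food expenditure.
budget_food_cleaned <- filter(budget_food, wfood > 0)

nrow(budget_food_cleaned)
```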
At this point we could try to combat the overlap by reducing opacity, but another trick that is sometimes overlooked is to instead try a scale transformation. Instead of a linear scale for total expenditure, we could use a logarithmic scale.
ggplot(budget_food_cleaned, aes(totexp, wfood)) +
  scale_x_log10() +
  geom_point() +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
Figure 2: A scatter plot of total expenditure versus expenditure on food with total expenditure on a logarithmic scale
Since we have learned a few new tools, let’s make use of them now. For this purpose, we are going to try a hexbin plot.
ggplot(budget_food_cleaned, aes(totexp, wfood)) +
  geom_hex() +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
Figure 3: A hexbin plot of total expenditure versus expenditure on food as a fraction of total expenditure
This plot definitely works better, showing the relationship in a clear fashion.
Now that we have managed to create a plot of food and total expenditure, let’s see if we can also find a relationship with the size of the household. Before we look at the complete picture, it is a good idea to first check what the overall relationship between household size and food expenditure is. To do so, we first construct a boxplot.
ggplot(budget_food_cleaned, aes(as.factor(size), wfood)) +
  geom_boxplot() +
  labs(x = "Household Size", y = "Food Expenditure")
Figure 4: A boxplot of food expenditure as a fraction of total expenditure versus household size
In general, food expenditure tends to be higher with larger households, which is probably what one might expect. However, this trend does not hold for households with only one or two members. Take a minute to figure out if that makes sense to you.
Let’s return to our hexbin plot design, but now introduce household size as facets.
ggplot(budget_food_cleaned, aes(totexp, wfood)) +
  geom_hex() +
  facet_wrap("size", labeller = labeller(size = label_both)) +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
Figure 5: A hexbin plot of total expenditure versus expenditure on food across household sizes
This plot looks promising but is hampered by the fact that some of the groups are small, which creates problems for the hexbin plot. In fact, hexbin plots can be problematic whenever the number of observations differs greatly across facets. This is because all facets share the same fill scale for the bin counts, so sparsely populated facets show very little contrast.
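A quick way to see how uneven the groups are is to count the observations per household size. This is a sketch on invented data: the toy tibble below is an assumption standing in for budget_food_cleaned, so the counts it prints say nothing about the real data set.

```r
library(dplyr)

# Hypothetical stand-in for budget_food_cleaned; the values are invented.
budget_food_cleaned <- tibble(size = c(1, 2, 2, 3, 3, 3, 4, 4, 7, 9))

# One row per household size, with the number of observations in column n.
size_counts <- count(budget_food_cleaned, size)

size_counts
```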
To get around this problem, we are going to lump all of the households of size 6 or larger into one level, using fct_other(), a function from the forcats package. We specify the levels we want to keep using the keep argument, and all of the others are lumped together.
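To see what fct_other() does in isolation, it can help to try it on a small factor first. The vector of household sizes below is invented for illustration; the call itself mirrors the one used on the real data.

```r
library(forcats)

# Hypothetical household sizes, for illustration only.
size <- factor(c(1, 2, 3, 4, 5, 6, 8, 2, 11))

# Keep levels "1" to "5"; every other level is replaced by "+6",
# which fct_other() places last among the levels.
size_lumped <- fct_other(size, keep = as.character(1:5), other_level = "+6")

levels(size_lumped)
table(size_lumped)
```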
budget_food_binned <-
  budget_food_cleaned %>%
  mutate(
    size = as.factor(size),
    size = fct_other(size, keep = as.character(1:5), other_level = "+6")
  )
ggplot(budget_food_binned, aes(totexp, wfood)) +
  geom_hex() +
  facet_wrap("size", labeller = labeller(size = label_both)) +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
Figure 6: A hexbin plot of total expenditure versus expenditure on food across household sizes. The +6 category represents household sizes of 6 or more people.
This is now starting to look quite interesting.
As an alternative to the hexbin plots, we could instead use a smoother to look at the relationship between total expenditure and food expenditure.
Smoothers work particularly well when you expect there to be some kind of causal relationship between two variables. They also help when you have a smaller (but still sufficiently large) number of observations in the upper and lower ranges of the variable on the x axis.
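Let’s try a smoother using a LOESS model, which fits local regressions and therefore makes few assumptions about the shape of the relationship. Note that the default fit is based on least squares, so it is not especially robust to outliers unless a robust fitting family is requested.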
ggplot(budget_food_binned, aes(totexp, wfood)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "loess") +
  facet_wrap("size", labeller = labeller(size = label_both)) +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
Figure 7: A scatter plot of total expenditure, expenditure on food as a fraction of total expenditure, and household size. A LOESS smoother, including approximate 95% confidence intervals, is used.
The problem with this plot is that the smoothers are clearly inappropriate for the extreme values of total expenditure. To work around this problem, we can create a separate data set for use exclusively in the geom_smooth() function.
To do so, we create a new data set, first grouping by the variable we are faceting on, size, and then filtering the data set to include only observations of totexp (total expenditure) that are larger than the 0.05 quantile and smaller than the 0.95 quantile. We then pass this data set to the data argument of the geom_smooth() function.
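Grouping before filtering matters because the quantiles are then computed within each household size rather than over the whole data set. The sketch below illustrates this on invented data in which the two groups live on very different scales; the tibble toy and its values are assumptions made only for this illustration.

```r
library(dplyr)

# Hypothetical data: two groups with very different ranges of totexp.
toy <- tibble(
  size   = rep(c("1", "2"), each = 21),
  totexp = c(0:20, 100 * (0:20))
)

# Trim each group at its own 0.05 and 0.95 quantiles.
trimmed <-
  toy %>%
  group_by(size) %>%
  filter(totexp > quantile(totexp, 0.05) & totexp < quantile(totexp, 0.95)) %>%
  ungroup()

# Both groups lose the same share of observations in their tails.
trimmed_counts <- count(trimmed, size)

trimmed_counts
```

Without group_by(), the cut-offs would be taken from the pooled values, and the group with small totexp values would be trimmed almost entirely from below while keeping its own largest values.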
budget_food_smoother_data <-
  budget_food_binned %>%
  group_by(size) %>%
  filter(totexp > quantile(totexp, 0.05) & totexp < quantile(totexp, 0.95))
ggplot(budget_food_binned, aes(totexp, wfood)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", data = budget_food_smoother_data) +
  facet_wrap("size", labeller = labeller(size = label_both)) +
  labs(x = "Total Expenditure (Pesetas)", y = "Food Expenditure")
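Figure 8: A scatter plot of total expenditure versus expenditure on food across household sizes, with LOESS smoothers fitted only to the observations between the 0.05 and 0.95 quantiles of total expenditure in each group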
Go ahead and see if you can figure out ways to further improve this plot. Consider using the scale transformation we touched on earlier, and explore how it works together with the smoothers and hex binning.
The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/food-for-thought.Rmd .