As soon as you encounter two numeric variables, your first instinct should
always be visualizing them on a scatter plot. For example, the airquality
dataset, which includes air quality measurements from New York between May and September 1973, presents such an opportunity.
The variables in this data set are
Ozone
: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island
Solar.R
: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind
: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia
Airport
Temp
: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
Day
, Month
: Self-explanatory.
Here are the first few rows of the data set.
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Let’s begin by studying how wind affects air pollution, specifically the level of ozone, in the air. Take a minute to consider how you would map the variables to the axes to study this relationship.
Well, our layman intuition tells us that it’s likely that the wind affects air pollution, and not the other way around, so the natural mapping here is to place wind speed on the x axis and ozone on the y axis.
library(tidyverse)
airquality %>%
ggplot(aes(Wind, Ozone)) +
geom_point() +
labs(x = "Wind Speed (mph)", y = "Ozone (ppb)")
Figure 1: Ozone and solar radiation from September to May in New York.
There is clearly a strong negative relationship between wind speed and ozone — as wind speed increases, air quality seems to improve. That’s hardly surprising if you think about it. But what kind of relationship is this? Describing it as a linear relationship is not accurate.
The best way to describe this relationship is as an inverse relationship. That is, ozone is inversely proportional to wind speed, which is equivalent to saying that the relationship could be described as \[ \text{ozone} = \frac{c}{\text{wind speed}} \] for some choice of a constant \(c\).
Take some time to plot the associations between Ozone
, Solar.R
, Wind
,
and Temp
and try to identify any patterns you observe.
These data are longitudinal in nature, so examining how these measures vary over time is a reasonable approach. First, however, we need to
combine the day and month variables into a single variable representing time.
To do this, we can use the handy make_date()
function from the lubridate package.
library(lubridate)
airquality_formatted <-
airquality %>%
as_tibble() %>% # tibbles are nicer to work with than data frames, do you know why?
mutate(Date = make_date("1973", Month, Day))
airquality_formatted
## # A tibble: 153 × 7
## Ozone Solar.R Wind Temp Month Day Date
## <int> <int> <dbl> <int> <int> <int> <date>
## 1 41 190 7.4 67 5 1 1973-05-01
## 2 36 118 8 72 5 2 1973-05-02
## 3 12 149 12.6 74 5 3 1973-05-03
## 4 18 313 11.5 62 5 4 1973-05-04
## 5 NA NA 14.3 56 5 5 1973-05-05
## 6 28 NA 14.9 66 5 6 1973-05-06
## 7 23 299 8.6 65 5 7 1973-05-07
## 8 19 99 13.8 59 5 8 1973-05-08
## 9 8 19 20.1 61 5 9 1973-05-09
## 10 NA 194 8.6 69 5 10 1973-05-10
## # ℹ 143 more rows
Now we’re ready to explore the data. The primary focus of this data set is air quality, so our main interest is in examining ozone levels over time.
Here is our first attempt: we will use the line geom, which is a natural choice for visualizing this type of data.
Figure 2: Air quality (ozone levels) in 1973
The result is possibly not quite as nice as you’d expect. Can you figure out why there are multiple holes in the plot?
The reason is that there are missing values in the data set. Normally, missing values wouldn’t necessarily pose a problem for visualization. However, in this case, observations adjacent to missing ones also disappear from the visualization when using the line geom.
A possible solution here is to drop the missing values, which allows the points to connect.
airquality_nona <- drop_na(airquality_formatted, Ozone)
p2 <-
airquality_nona %>%
ggplot(aes(Date, Ozone)) +
geom_line() +
labs(y = "Ozone (ppb)")
p2
Figure 3: Air quality (ozone levels) in 1973
One potential problem with this approach is that it’s hard to tell which date ranges correspond to missing values. To address this, adding a point geom can help make these ranges clear.
Figure 4: Air quality (ozone levels) in 1973.
It seems quite clear that the air quality takes a turn for the worse during the summer months.
One aspect currently missing from the visualization, which could provide further insight, is the relationship between air pollution and the type of day — specifically, distinguishing between weekdays and weekends. A possible
solution for this might be to color the weekend days differently. However, in this case, we’ll use the grid and axis ticks to display this information. We can achieve this by specifying scales with scale_x_date()
and inputting "weeks"
into the
minor_breaks
argument.
Figure 5: Air quality, in terms of ozone levels, in 1973
Unfortunately, it is hard to see a clear pattern here, likely because
air pollution varies significantly from day to day due to other factors.
A possible way to solve this is to create a new variable, Weekday
, in our
data set, representing the day of the week. For this purpose, we use the wday()
function from lubridate.
airquality_1973 <-
airquality_nona %>%
mutate(Weekday = wday(Date))
airquality_1973 %>%
ggplot(aes(Weekday, Ozone)) +
geom_point() +
labs(y = "Ozone (ppb)")
Figure 6: Air pollution across the week from May to June in 1973
Great, but now the point geom is probably not the best candidate for this job,
specially without adjustments like jittering and/or opacity. A better alternative could be to use box plots. To effectively use box plots, we need to convert the
Weekday
variable into a factor variable. Ensuring the correct order of days, we’ll use as.ordered()
. This conversion can be done directly in the ggplot()
call. However, it’s important to note that this approach requires manually adjusting the axis label, as it will default to "as.ordered(Weekday)"
otherwise.
airquality_1973 %>%
ggplot(aes(as.ordered(Weekday), Ozone)) +
geom_boxplot() +
labs(x = "Weekday", y = "Ozone (ppb)")
Figure 7: Air pollution across weekdays from May to June in 1973
Now, we can actually see a slight improvement in air quality over the weekends, at least in terms of the median values. At this point, it would be interesting to extend our investigation by examining the outliers in this plot to see if they correspond to specific calendar events.
Additionally, take some time to visualize the relationships between time and the other variables in the data set.
The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/airquality.Rmd.