1 Air Quality in New York

As soon as you encounter two numeric variables, your first instinct should always be visualizing them on a scatter plot. For example, the airquality dataset, which includes air quality measurements from New York between May and September 1973, presents such an opportunity.

The variables in this data set are

  • Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island

  • Solar.R: Solar radiation in Langleys in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park

  • Wind: Average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport

  • Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.

  • Day, Month: Self-explanatory.

Here are the first few rows of the data set.

head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Let’s begin by studying how wind affects air pollution, specifically the level of ozone, in the air. Take a minute to consider how you would map the variables to the axes to study this relationship.

Well, our layman intuition tells us that it’s likely that the wind affects air pollution, and not the other way around, so the natural mapping here is to place wind speed on the x axis and ozone on the y axis.

library(tidyverse)

airquality %>%
  ggplot(aes(Wind, Ozone)) +
  geom_point() +
  labs(x = "Wind Speed (mph)", y = "Ozone (ppb)")
Ozone and solar radiation from September to May in New York.

Figure 1: Ozone and solar radiation from September to May in New York.

There is clearly a strong negative relationship between wind speed and ozone — as wind speed increases, air quality seems to improve. That’s hardly surprising if you think about it. But what kind of relationship is this? Describing it as a linear relationship is not accurate.

The best way to describe this relationship is as an inverse relationship. That is, ozone is inversely proportional to wind speed, which is equivalent to saying that the relationship could be described as \[ \text{ozone} = \frac{c}{\text{wind speed}} \] for some choice of a constant \(c\).

Take some time to plot the associations between Ozone, Solar.R, Wind, and Temp and try to identify any patterns you observe.

1.1 Longitudinal Data

These data are longitudinal in nature, so examining how these measures vary over time is a reasonable approach. First, however, we need to combine the day and month variables into a single variable representing time. To do this, we can use the handy make_date() function from the lubridate package.

library(lubridate)

airquality_formatted <-
  airquality %>%
  as_tibble() %>% # tibbles are nicer to work with than data frames, do you know why?
  mutate(Date = make_date("1973", Month, Day))

airquality_formatted
## # A tibble: 153 × 7
##    Ozone Solar.R  Wind  Temp Month   Day Date      
##    <int>   <int> <dbl> <int> <int> <int> <date>    
##  1    41     190   7.4    67     5     1 1973-05-01
##  2    36     118   8      72     5     2 1973-05-02
##  3    12     149  12.6    74     5     3 1973-05-03
##  4    18     313  11.5    62     5     4 1973-05-04
##  5    NA      NA  14.3    56     5     5 1973-05-05
##  6    28      NA  14.9    66     5     6 1973-05-06
##  7    23     299   8.6    65     5     7 1973-05-07
##  8    19      99  13.8    59     5     8 1973-05-08
##  9     8      19  20.1    61     5     9 1973-05-09
## 10    NA     194   8.6    69     5    10 1973-05-10
## # ℹ 143 more rows

Now we’re ready to explore the data. The primary focus of this data set is air quality, so our main interest is in examining ozone levels over time.

Here is our first attempt: we will use the line geom, which is a natural choice for visualizing this type of data.

p1 <-
  airquality_formatted %>%
  ggplot(aes(Date, Ozone)) +
  geom_line() +
  labs(y = "Ozone (ppb)")

p1
Air quality (ozone levels) in 1973

Figure 2: Air quality (ozone levels) in 1973

The result is possibly not quite as nice as you’d expect. Can you figure out why there are multiple holes in the plot?

The reason is that there are missing values in the data set. Normally, missing values wouldn’t necessarily pose a problem for visualization. However, in this case, observations adjacent to missing ones also disappear from the visualization when using the line geom.

A possible solution here is to drop the missing values, which allows the points to connect.

airquality_nona <- drop_na(airquality_formatted, Ozone)

p2 <-
  airquality_nona %>%
  ggplot(aes(Date, Ozone)) +
  geom_line() + 
  labs(y = "Ozone (ppb)")

p2
Air quality (ozone levels) in 1973

Figure 3: Air quality (ozone levels) in 1973

One potential problem with this approach is that it’s hard to tell which date ranges correspond to missing values. To address this, adding a point geom can help make these ranges clear.

p3 <- p2 + geom_point()
p3
Air quality (ozone levels) in 1973.

Figure 4: Air quality (ozone levels) in 1973.

It seems quite clear that the air quality takes a turn for the worse during the summer months.

1.1.1 Air Pollution across the Week

One aspect currently missing from the visualization, which could provide further insight, is the relationship between air pollution and the type of day — specifically, distinguishing between weekdays and weekends. A possible solution for this might be to color the weekend days differently. However, in this case, we’ll use the grid and axis ticks to display this information. We can achieve this by specifying scales with scale_x_date() and inputting "weeks" into the minor_breaks argument.

p4 <- p3 + scale_x_date(minor_breaks = "weeks")
p4
Air quality, in terms of ozone levels, in 1973

Figure 5: Air quality, in terms of ozone levels, in 1973

Unfortunately, it is hard to see a clear pattern here, likely because air pollution varies significantly from day to day due to other factors. A possible way to solve this is to create a new variable, Weekday, in our data set, representing the day of the week. For this purpose, we use the wday() function from lubridate.

airquality_1973 <-
  airquality_nona %>%
  mutate(Weekday = wday(Date))

airquality_1973 %>%
  ggplot(aes(Weekday, Ozone)) +
  geom_point() + 
  labs(y = "Ozone (ppb)")
Air pollution across the week from May to June in 1973

Figure 6: Air pollution across the week from May to June in 1973

Great, but now the point geom is probably not the best candidate for this job, specially without adjustments like jittering and/or opacity. A better alternative could be to use box plots. To effectively use box plots, we need to convert the Weekday variable into a factor variable. Ensuring the correct order of days, we’ll use as.ordered(). This conversion can be done directly in the ggplot() call. However, it’s important to note that this approach requires manually adjusting the axis label, as it will default to "as.ordered(Weekday)" otherwise.

airquality_1973 %>%
  ggplot(aes(as.ordered(Weekday), Ozone)) +
  geom_boxplot() +
  labs(x = "Weekday", y = "Ozone (ppb)")
Air pollution across weekdays from May to June in 1973

Figure 7: Air pollution across weekdays from May to June in 1973

Now, we can actually see a slight improvement in air quality over the weekends, at least in terms of the median values. At this point, it would be interesting to extend our investigation by examining the outliers in this plot to see if they correspond to specific calendar events.

Additionally, take some time to visualize the relationships between time and the other variables in the data set.

2 Source Code

The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/airquality.Rmd.