class: middle, center, title-slide

.title[
# Two Variables
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Associations

.pull-left[

How does `\(x\)` relate to `\(y\)`?

It is straightforward when both variables are continuous.

Scatter plots are the bread and butter of visualizations of two continuous variables.

```r
library(tidyverse)

ggplot(swiss, aes(Examination, Fertility)) +
* geom_point()
```

]

.pull-right[

<div class="figure" style="text-align: center">
<img src="two-variables_files/figure-html/unnamed-chunk-2-1.png" alt="Fertility and % of draftees receiving highest mark on army examination in French-speaking provinces in Switzerland." width="360" />
<p class="caption">Fertility and % of draftees receiving highest mark on army examination in French-speaking provinces in Switzerland.</p>
</div>

]

---

## Patterns

Understanding patterns is key to being able to visualize data effectively.

<img src="two-variables_files/figure-html/unnamed-chunk-3-1.png" width="576" style="display: block; margin: auto;" />

---

## Variation

.pull-left-40[

A good visualization should describe the variation in the data.

]

.pull-right-60[

<img src="two-variables_files/figure-html/unnamed-chunk-4-1.png" width="396" style="display: block; margin: auto;" />

]

--

.pull-left-60[

<img src="two-variables_files/figure-html/unnamed-chunk-5-1.png" width="396" style="display: block; margin: auto;" />

]

.pull-right-40[

Sometimes there is a pattern to the variation.

]

---

## Time

Time has a specific order, which makes the line geom a natural choice.

Always place time on the horizontal (x) axis.

```r
ggplot(economics, aes(date, uempmed)) +
* geom_line() +
  xlab("Time") +
  ylab("Unemployment duration")
```

<div class="figure" style="text-align: center">
<img src="two-variables_files/figure-html/unnamed-chunk-6-1.png" alt="Median unemployment duration in the US."
width="648" />
<p class="caption">Median unemployment duration in the US.</p>
</div>

---

## Geoms and Time

It is often useful to combine points and lines when measurements are taken at irregular intervals or infrequently.

```r
filter(ChickWeight, Chick == 5) %>%
  ggplot(aes(Time, weight)) +
* geom_point() +
* geom_line()
```

<div class="figure" style="text-align: center">
<img src="two-variables_files/figure-html/unnamed-chunk-7-1.png" alt="Weight of a chick over time." width="432" />
<p class="caption">Weight of a chick over time.</p>
</div>

---

## Categorical Variables

Categorical variables (factors) can be used as the second dimension.

```r
ggplot(msleep, aes(vore, sleep_total)) +
  geom_boxplot()
```

<div class="figure" style="text-align: center">
<img src="two-variables_files/figure-html/unnamed-chunk-8-1.png" alt="Boxplot of total sleep duration for different mammals." width="504" />
<p class="caption">Boxplot of total sleep duration for different mammals.</p>
</div>

---

## Ordered Factors

Pay attention to whether or not the factor is **ordered**.

.pull-left[

<img src="two-variables_files/figure-html/unnamed-chunk-9-1.png" width="360" style="display: block; margin: auto;" />

]

--

.pull-right[

<img src="two-variables_files/figure-html/unnamed-chunk-10-1.png" width="360" style="display: block; margin: auto;" />

]

---

## Discrete Data

A problem when one (or both) variables are discrete (or categorical) is that points might **overlap**.

A simple solution: **opacity** (`alpha`)

.pull-left[

```r
ggplot(mpg, aes(cty, hwy)) +
  geom_point()
```

<img src="two-variables_files/figure-html/unnamed-chunk-11-1.png" width="360" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
ggplot(mpg, aes(cty, hwy)) +
* geom_point(alpha = 0.2)
```

<img src="two-variables_files/figure-html/unnamed-chunk-12-1.png" width="360" style="display: block; margin: auto;" />

]

---

## Scales

For some variables, it is more natural to study them on a non-standard scale.
Other times it is simply **convenient** (more compact).

.pull-left[

```r
ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point() +
* scale_y_continuous()
```

<img src="two-variables_files/figure-html/unnamed-chunk-13-1.png" width="360" style="display: block; margin: auto;" />

]

--

.pull-right[

```r
ggplot(msleep, aes(brainwt, sleep_total)) +
  geom_point() +
* scale_x_log10()
```

<img src="two-variables_files/figure-html/unnamed-chunk-14-1.png" width="360" style="display: block; margin: auto;" />

]

---

## Pairs Plots

A pairs plot shows the pairwise associations of several variables.

It is complicated to build in plain ggplot2 (requires **pivoting** and **facets**).

The [GGally](https://ggobi.github.io/ggally) package <a name=cite-emerson2013></a>([Emerson, Green, Schloerke, Crowley, Cook, Hofmann, and Wickham, 2013](https://www.tandfonline.com/doi/ref/10.1080/10618600.2012.694762)) presents a simpler approach.

---

```r
library(GGally)

ggpairs(mpg, columns = c("displ", "drv", "cty", "hwy"))
```

<img src="images/ggally.png" width="85%" style="display: block; margin: auto;" />

---

## References

<a name=bib-emerson2013></a>[Emerson, J. W., W. A. Green, B. Schloerke, et al.](#cite-emerson2013) (2013). "The Generalized Pairs Plot". In: _Journal of Computational and Graphical Statistics_ 22.1, pp. 79-91. ISSN: 1061-8600. DOI: [10.1080/10618600.2012.694762](https://doi.org/10.1080%2F10618600.2012.694762). URL: [https://www.tandfonline.com/doi/ref/10.1080/10618600.2012.694762](https://www.tandfonline.com/doi/ref/10.1080/10618600.2012.694762) (visited on Sep. 19, 2020).
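
---

## Appendix: Pairs Plot with Pivoting and Facets

The Pairs Plots slide notes that building a pairs plot in plain ggplot2 requires pivoting and facets. The following is a minimal sketch of one way this might be done, not necessarily the approach the authors had in mind. The choice of columns (`vars`) and the helper names `mpg_pairs`, `var_x`, `value_x`, `var_y`, and `value_y` are assumptions made for illustration, and `across()` with `.names` assumes dplyr 1.0.0 or later.

```r
library(tidyverse)

# Hypothetical choice of columns; any set of numeric variables would do.
vars <- c("displ", "cty", "hwy")

mpg_pairs <- mpg %>%
  select(all_of(vars)) %>%
  # Copy every column with a "y_" prefix so the data can be pivoted twice.
  mutate(across(everything(), ~ .x, .names = "y_{.col}")) %>%
  # First pivot: the original columns become the x variable.
  pivot_longer(all_of(vars), names_to = "var_x", values_to = "value_x") %>%
  # Second pivot: the copies become the y variable.
  pivot_longer(
    starts_with("y_"),
    names_to = "var_y",
    names_prefix = "y_",
    values_to = "value_y"
  )

# One panel per pair of variables, with free scales since the units differ.
ggplot(mpg_pairs, aes(value_x, value_y)) +
  geom_point(alpha = 0.2) +
  facet_grid(var_y ~ var_x, scales = "free")
```

Each observation ends up in every panel, so the long data has nine rows per original row. Unlike `ggpairs()`, the diagonal panels here simply plot each variable against itself, and categorical columns such as `drv` are not handled.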