class: middle, center, title-slide

.title[
# One Variable
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Visualizing a Single Variable

The best place to start is with continuous or discrete one-dimensional data.

--

.pull-left[

### Example

The time between eruptions of the Old Faithful geyser.

Always start by thinking about what the data represents.

]

.pull-right[

<div class="figure" style="text-align: center">
<img src="images/old-faithful.jpg" alt="The Old Faithful Geyser (Albert Bierstadt, Public domain, via Wikimedia Commons)" width="70%" />
<p class="caption">The Old Faithful Geyser (Albert Bierstadt, Public domain, via Wikimedia Commons)</p>
</div>

]

---

## Dot Plots

A dot plot works well for small data sets.

```r
library(tidyverse)

ggplot(faithful, aes(waiting)) +
* geom_dotplot(binwidth = 1) +
  guides(y = "none") # default guide is incorrect
```

<div class="figure" style="text-align: center">
<img src="one-variable_files/figure-html/unnamed-chunk-2-1.png" alt="Duration between eruptions of the Old Faithful geyser." width="612" />
<p class="caption">Duration between eruptions of the Old Faithful geyser.</p>
</div>

---

## Histograms

A histogram separates the data into bins and counts the number of observations in each bin.

The choice of bin width is critical:

.pull-left[

```r
faithful %>%
  ggplot(aes(waiting)) +
* geom_histogram(binwidth = 1)
```

<div class="figure" style="text-align: center">
<img src="one-variable_files/figure-html/unnamed-chunk-3-1.png" alt="Too narrow bin width. (This is actually equivalent to a bar plot.)" width="360" />
<p class="caption">Too narrow bin width.
(This is actually equivalent to a bar plot.)</p>
</div>

]

.pull-right[

```r
faithful %>%
  ggplot(aes(waiting)) +
* geom_histogram(binwidth = 3)
```

<div class="figure" style="text-align: center">
<img src="one-variable_files/figure-html/unnamed-chunk-4-1.png" alt="Reasonable bin width" width="360" />
<p class="caption">Reasonable bin width</p>
</div>

]

---

## Box Plots

.pull-left[

A box plot

- in its most common form shows:
  - middle bar: **median** (2nd quartile)
  - edges of box: 1st and 3rd **quartiles**
  - whiskers: most extreme observation within 1.5 times the inter-quartile range (IQR) of the box edge
  - points: observations more than 1.5 times the IQR beyond the box edge
- is compact,
- but is not suitable for data with multiple modes.

]

.pull-right[

```r
faithful %>%
  ggplot(aes(waiting)) +
* geom_boxplot()
```

<div class="figure" style="text-align: center">
<img src="one-variable_files/figure-html/unnamed-chunk-5-1.png" alt="Box plot of the Old Faithful data, which completely fails to describe the distribution accurately." width="360" />
<p class="caption">Box plot of the Old Faithful data, which completely fails to describe the distribution accurately.</p>
</div>

]

---

## Density Plots

A density plot

- is great when you have lots of data and the variable is continuous,
- but is sensitive to its settings (type of kernel, bandwidth).

.pull-left[

```r
faithful %>%
  ggplot(aes(waiting)) +
* geom_density(bw = 5)
```

<img src="one-variable_files/figure-html/unnamed-chunk-6-1.png" width="360" style="display: block; margin: auto;" />

]

.pull-right[

```r
faithful %>%
  ggplot(aes(waiting)) +
* geom_density(bw = 10)
```

<img src="one-variable_files/figure-html/unnamed-chunk-7-1.png" width="360" style="display: block; margin: auto;" />

]

- It is often a good idea to add a **rug** layer (`geom_rug()`) to density plots.

---

### How Does a Density Plot Work?

We start with an empty canvas.

<img src="one-variable_files/figure-html/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />

---

### How Does a Density Plot Work?

Then we mark the points with a rug layer (`geom_rug()`).

<img src="one-variable_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />

---

### How Does a Density Plot Work?

Next, we add a Gaussian (Normal) density kernel for each point.

<img src="one-variable_files/figure-html/unnamed-chunk-11-1.png" width="504" style="display: block; margin: auto;" />

---

### How Does a Density Plot Work?

Finally, we sum the kernels together.

<img src="one-variable_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

The black line is our final density estimate.

---

## Violin Plots

A violin plot is a type of density plot, often used as an alternative to box plots when you have lots of (continuous) data.

```r
faithful %>%
  ggplot(aes(waiting, y = 1)) +
* geom_violin()
```

<div class="figure" style="text-align: center">
<img src="one-variable_files/figure-html/unnamed-chunk-13-1.png" alt="geom_violin() does not work with a single variable, so we use the trick y = 1 here." width="468" />
<p class="caption">geom_violin() does not work with a single variable, so we use the trick y = 1 here.</p>
</div>

---

## Combining Layers

With ggplot2, it's easy to combine layers.

```r
faithful %>%
  ggplot(aes(waiting, y = 1)) +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "slategray2")
```

<div class="figure" style="text-align: center">
<img src="one-variable_files/figure-html/unnamed-chunk-14-1.png" alt="A combined box and violin plot" width="468" />
<p class="caption">A combined box and violin plot</p>
</div>

---

## Recap

- Few observations? Try a **dot plot**.
- Lots of data on a continuous variable? **Density** and **violin** plots are great!
- The data is uni-modal and you only care about quartiles (e.g. the median) and extreme values? The **box plot** will fit the bill nicely.
- In most other cases, a **histogram** is your best bet.
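
---

### Sketch: A Density Estimate by Hand

The kernel-summing steps from the density plot slides can be sketched in base R. This is a minimal illustration, not the code behind the figures: it assumes a Gaussian kernel with bandwidth 5 (the value used in the first `geom_density()` example), and the names `grid`, `kernels`, and `estimate` are made up for this sketch. Each kernel is scaled by 1/n so that the summed curve integrates to roughly one.

```r
# Waiting times between eruptions (built-in data set)
x <- faithful$waiting
n <- length(x)
bw <- 5 # assumed bandwidth: standard deviation of each Gaussian kernel

# Evaluation grid, extended three bandwidths beyond the data
grid <- seq(min(x) - 3 * bw, max(x) + 3 * bw, length.out = 512)

# One Gaussian kernel per observation, each scaled by 1 / n
# (rows: grid points, columns: observations)
kernels <- sapply(x, function(xi) dnorm(grid, mean = xi, sd = bw) / n)

# Summing the kernels gives the density estimate
estimate <- rowSums(kernels)

# Compare with R's built-in estimator on the same grid
reference <- density(x, bw = bw, kernel = "gaussian",
                     from = min(grid), to = max(grid), n = 512)
max(abs(estimate - reference$y)) # should be close to zero
```

Plotting `estimate` against `grid` should give a curve close to the one from `geom_density(bw = 5)`; small differences are expected because `density()` uses a binned approximation.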