class: middle, center, title-slide

.title[
# Categorical Data
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Visualizing Categorical Data

Usually, the real aim is to visualize **proportions**.

.pull-left[

### Example: happiness data from the US in 2006

```r
library(tidyverse)

data(
  "happiness",
  package = "wooldridge"
)

# see slide source code for
# data wrangling code
```
]

--

.pull-right[

### Bar Charts

Simple and readable.

```r
ggplot(happiness, aes(happy)) +
* geom_bar()
```

<img src="categorical-data_files/figure-html/unnamed-chunk-3-1.png" width="345.6" style="display: block; margin: auto;" />
]

---

## Grouped Bar Charts

A grouped bar chart is a good choice

- when the counts themselves are what matters, or
- when you have a small data set.

```r
ggplot(happiness, aes(gender, fill = happy)) +
* geom_bar(position = "dodge", col = "black")
```

<img src="categorical-data_files/figure-html/unnamed-chunk-4-1.png" width="648" style="display: block; margin: auto;" />

---

## Stacked Bar Charts

A stacked bar chart is a compact way to visualize counts, but it is seldom preferable to a grouped bar chart.

```r
ggplot(happiness, aes(gender, fill = happy)) +
* geom_bar(col = "black") # position = "stack" is the default
```

<img src="categorical-data_files/figure-html/unnamed-chunk-5-1.png" width="504" style="display: block; margin: auto;" />

---

## Proportional Grouped Bar Charts

A proportional grouped bar chart is

- a good default choice for bar charts (when you have enough data),
- but you need to summarize the data before plotting!

```r
happiness_props <-
  group_by(happiness, gender, happy) %>%
  summarize(n = n()) %>%
  # summarize() drops the last grouping level (happy), so the data are
  # still grouped by gender and the proportions sum to 1 within each gender
  mutate(prop = n / sum(n))

ggplot(happiness_props, aes(gender, prop, fill = happy)) +
* geom_col(position = "dodge", col = "black")
```

<img src="categorical-data_files/figure-html/unnamed-chunk-6-1.png" width="504" style="display: block; margin: auto;" />

---

## Proportional Stacked Bar Charts

A proportional stacked bar chart is

- a compact way to visualize proportions,
- slightly more intuitive than a proportional grouped bar chart,
- but harder to read.

```r
ggplot(happiness, aes(gender, fill = happy)) +
* geom_bar(position = "fill", col = "black") +
  labs(y = "proportion")
```

<img src="categorical-data_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" />

---

## Mosaic Plots

A mosaic plot is a type of stacked bar chart that maps category size to the width of the bars.

- requires another package: [ggmosaic](https://CRAN.R-project.org/package=ggmosaic)

```r
library(ggmosaic)

ggplot(happiness) +
* geom_mosaic(aes(product(happy, gender), fill = happy)) +
  guides(fill = "none") # remove redundant ink
```

<img src="categorical-data_files/figure-html/unnamed-chunk-8-1.png" width="432" style="display: block; margin: auto;" />

---

## What About Pie Charts?

Pie charts are OK in a few instances, but avoid them as a rule of thumb.

.pull-left[
<img src="categorical-data_files/figure-html/unnamed-chunk-9-1.png" width="288" style="display: block; margin: auto;" />
]

--

.pull-right[
<img src="categorical-data_files/figure-html/unnamed-chunk-10-1.png" width="345.6" style="display: block; margin: auto;" />
]

---

## Waffle Plots

A waffle plot is suitable when there are large differences between categories.

- again requires another package: [waffle](https://CRAN.R-project.org/package=waffle)

<div class="figure" style="text-align: center">
<img src="categorical-data_files/figure-html/unnamed-chunk-11-1.png" alt="A waffle plot of the happiness data.
Every square represents approximately 10 people. The code is quite complicated (see the source code)." width="504" />
<p class="caption">A waffle plot of the happiness data. Every square represents approximately 10 people. The code is quite complicated (see the source code).</p>
</div>

---

## Pay Attention to Mappings

The choice of mappings is critical when dealing with categorical data!

```r
*ggplot(happiness, aes(happy, fill = gender)) +
  geom_bar(position = "fill", col = "black") +
  ylab("Proportion")
```

<img src="categorical-data_files/figure-html/unnamed-chunk-12-1.png" width="504" style="display: block; margin: auto;" />

---

## Ordering

- If the variable is ordinal, format it as an ordered factor.
- If the variable is **not** ordinal, consider ordering it manually!

```r
# no ordering
happiness %>%
  count(region) %>%
  ggplot(aes(region, n)) +
  geom_col() +
  guides(x = guide_axis(n.dodge = 2)) # avoid label overlap
```

<img src="categorical-data_files/figure-html/unnamed-chunk-13-1.png" width="720" style="display: block; margin: auto;" />

---

## Ordering

- If the variable is ordinal, format it as an ordered factor.
- If the variable is **not** ordinal, consider ordering it manually!

```r
happiness %>%
  count(region) %>%
* mutate(region = fct_reorder(region, n, .desc = TRUE)) %>%
  ggplot(aes(region, n)) +
  geom_col() +
  guides(x = guide_axis(n.dodge = 2)) # avoid label overlap
```

<img src="categorical-data_files/figure-html/unnamed-chunk-14-1.png" width="720" style="display: block; margin: auto;" />
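
---

## Ordering

The ordinal case can be handled by stating the level order by hand. The following is a minimal sketch that uses only base R's `factor()`; the `satisfaction` vector and its labels are hypothetical and are not part of the happiness data.

```r
# Hypothetical ordinal responses (made up for illustration,
# not taken from the happiness data)
satisfaction <- c("medium", "low", "high", "medium", "high")

# By default, factor() sorts the levels alphabetically,
# which scrambles the natural order
default_levels <- levels(factor(satisfaction))

# Stating the levels explicitly fixes the order used on axes and legends;
# ordered = TRUE additionally marks the factor as ordinal
satisfaction_ordered <- factor(
  satisfaction,
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

default_levels                # alphabetical order
levels(satisfaction_ordered)  # natural order
```

The same pattern should carry over to a column in a data frame, for instance inside `mutate()` before the call to `ggplot()`.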