class: middle, center, title-slide

.title[
# Big Data
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Big Data

Big Data: many variables **and/or** observations

Here, we consider the case of many observations.

The problem with many observations is usually the **overlap** of data points.

.pull-left[

```r
ggplot(diamonds, aes(carat, price)) +
  geom_point()
```
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="big-data_files/figure-html/unnamed-chunk-1-1.png" alt="Carat and price of diamonds." width="345.6" />
<p class="caption">Carat and price of diamonds.</p>
</div>
]

---

## Aesthetics

Simple, but not always effective, solutions: modify aesthetics (size, opacity)

.pull-left[

```r
ggplot(diamonds, aes(carat, price)) +
* geom_point(alpha = 0.01)
```

<img src="big-data_files/figure-html/unnamed-chunk-2-1.png" width="345.6" style="display: block; margin: auto;" />
]

.pull-right[

```r
ggplot(diamonds, aes(carat, price)) +
* geom_point(cex = 0.01)
```

<img src="big-data_files/figure-html/unnamed-chunk-3-1.png" width="345.6" style="display: block; margin: auto;" />
]

---

## Facets

Sometimes it is enough to split data into facets.

```r
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.01) +
  facet_wrap(~cut)
```

<img src="big-data_files/figure-html/unnamed-chunk-4-1.png" width="720" style="display: block; margin: auto;" />

---

## Bin-Summarize or Smooth

The bin-summarize-smooth framework <a name=cite-wickham2013a></a>([Wickham, 2013](https://vita.had.co.nz/papers/bigvis.pdf)) is a principled methodology to handle large data.

- **Binning** is the process of grouping values into *bins*; **summarizing** is the act of computing summary statistics (count, mean) inside each bin.
- **Smoothing** smooths out noisy data, revealing hard-to-see patterns.

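
Binning and summarizing can also be done by hand. The following is a minimal base R sketch on the built-in `treering` series; the 100-year bin width and the names `year`, `width`, `century`, `bin_n`, and `bin_mean` are illustrative choices, not part of the framework.

```r
# Built-in yearly tree-ring widths (a ts object from the datasets package)
year <- as.numeric(time(treering))
width <- as.numeric(treering)

# Bin: assign each year to a 100-year interval (an illustrative width)
century <- floor(year / 100) * 100

# Summarize: count and mean ring width inside each bin
bin_n <- tapply(width, century, length)
bin_mean <- tapply(width, century, mean)

# First few bins, one row per bin
head(data.frame(
  century = as.numeric(names(bin_mean)),
  n = as.vector(bin_n),
  mean_width = as.vector(bin_mean)
))
```

Each row of the result replaces up to 100 raw observations with two summary numbers.
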
We have already encountered binning for one variable: histograms.

---

## Binning

Binning is a simple, but not very sophisticated, idea.

```r
trees <- tsibble::as_tsibble(treering)

ggplot(trees, aes(index, value)) +
  geom_line()
```

<img src="big-data_files/figure-html/unnamed-chunk-5-1.png" width="720" style="display: block; margin: auto;" />

--

```r
ggplot(trees, aes(index, value)) +
  geom_line(alpha = 0.2) +
  geom_hline(yintercept = 1, lty = 2) +
* stat_summary_bin(fun = mean, geom = "line", binwidth = 100)
```

<img src="big-data_files/figure-html/unnamed-chunk-6-1.png" width="720" style="display: block; margin: auto;" />

---

### 1. Two-Dimensional Binning

A type of two-dimensional histogram that maps counts to color instead of height and bins into a two-dimensional geometric shape.

```r
ggplot(diamonds, aes(carat, price)) +
* geom_bin2d(bins = 60) +
  scale_fill_distiller(palette = "Spectral")
```

<img src="big-data_files/figure-html/unnamed-chunk-7-1.png" width="504" style="display: block; margin: auto;" />

---

### 2. Hexagonal Binning

Rectangles may obscure information for some data, and it is therefore usually better to use hexagons instead, via `geom_hex()`.

```r
ggplot(diamonds, aes(carat, price)) +
* geom_hex(bins = 60) +
  scale_fill_distiller(palette = "Spectral")
```

<img src="big-data_files/figure-html/unnamed-chunk-8-1.png" width="504" style="display: block; margin: auto;" />

---

## Smoothing

Smoothing is a powerful transformation for noisy data. It is equivalent to applying a statistical model to the data.

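
One way to see the equivalence is to fit a smoother explicitly and read off its fitted values. This is a minimal sketch on the built-in `treering` series; the choice of `loess()` with `span = 0.1` and the name `trees_df` are assumptions for illustration, and this is not necessarily the model that `geom_smooth()` selects by default.

```r
# Put the built-in treering series in a plain data frame
trees_df <- data.frame(
  index = as.numeric(time(treering)),
  value = as.numeric(treering)
)

# Fit a smoother explicitly; loess with span = 0.1 is an illustrative choice
fit <- loess(value ~ index, data = trees_df, span = 0.1)

# The smooth curve is the model's fitted values at the observed years
trees_df$smooth <- predict(fit)

head(trees_df)
```
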
```r
ggplot(trees, aes(index, value)) +
  geom_line(alpha = 0.3, size = 0.3) +
  geom_hline(yintercept = 1, lty = 2) +
* geom_smooth()
```

<img src="big-data_files/figure-html/unnamed-chunk-9-1.png" width="720" style="display: block; margin: auto;" />

---

### Smoothers are Sensitive

Rule of thumb: stick with smoothers that you understand.

.pull-left[

```r
p +
* geom_smooth(method = "lm")
```

<img src="big-data_files/figure-html/unnamed-chunk-11-1.png" width="345.6" style="display: block; margin: auto;" />
]

--

.pull-right[

```r
p +
  geom_smooth(
    method = "gam",
*   formula = y ~ poly(x, 6)
  )
```

<img src="big-data_files/figure-html/unnamed-chunk-12-1.png" width="345.6" style="display: block; margin: auto;" />
]

---

### 2D Density Plots

A two-dimensional density plot is a type of two-dimensional **smoothing**. It uses a kernel density estimate, which is sensitive to parameters such as the bandwidth.

```r
ggplot(diamonds, aes(carat, price)) +
* geom_density_2d_filled()
```

<img src="big-data_files/figure-html/unnamed-chunk-13-1.png" width="576" style="display: block; margin: auto;" />

---

- For other data, a density plot may work well.

```r
ggplot(faithful, aes(eruptions, waiting)) +
  geom_density2d_filled() +
  geom_point()
```

<img src="big-data_files/figure-html/unnamed-chunk-14-1.png" width="576" style="display: block; margin: auto;" />

- Hexagonal binning is in general a **safer bet**.

---

## References

<a name=bib-wickham2013a></a>[Wickham, H.](#cite-wickham2013a) (2013). _Bin-Summarise-Smooth: A Framework for Visualising Large Data_. URL: [https://vita.had.co.nz/papers/bigvis.pdf](https://vita.had.co.nz/papers/bigvis.pdf) (visited on Sep. 16, 2020).