Big Data

class: middle, center, title-slide

.title[
# Big Data
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Big Data

Big Data: many variables **and/or** observations

Here, we consider the case of many observations.

The problem with many observations is usually the **overlap** of data points.

.pull-left[

```r
ggplot(diamonds, 
       aes(carat, price)) +
  geom_point()
```
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="big-data_files/figure-html/unnamed-chunk-1-1.png" alt="Carat and price of diamonds." width="345.6" />
<p class="caption">Carat and price of diamonds.</p>
</div>
]
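
One rough way to gauge the overlap is to count how many observations share exactly the same coordinates. The sketch below does this for the diamonds data; the variable names are illustrative only.

```r
library(ggplot2)  # provides the diamonds data set

# TRUE for every row whose (carat, price) pair already occurred in an earlier row
is_repeat <- duplicated(diamonds[, c("carat", "price")])

n_total <- nrow(diamonds)           # number of observations
n_repeat <- sum(is_repeat)          # points drawn exactly on top of an earlier point
share_repeat <- n_repeat / n_total  # proportion of such points
share_repeat
```

Exact duplicates are only a lower bound on overplotting, since nearby points also overlap once they are drawn with a nonzero size.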

---

## Aesthetics

Simple, but not always effective, solutions: modify aesthetics (size, opacity)

.pull-left[

```r
ggplot(diamonds, 
       aes(carat, price)) +
* geom_point(alpha = 0.01)
```

<img src="big-data_files/figure-html/unnamed-chunk-2-1.png" width="345.6" style="display: block; margin: auto;" />
]

.pull-right[

```r
ggplot(diamonds, 
       aes(carat, price)) +
* geom_point(size = 0.01)
```

<img src="big-data_files/figure-html/unnamed-chunk-3-1.png" width="345.6" style="display: block; margin: auto;" />
]
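
To see why a very low `alpha` is not always effective, it helps to work out how opaque a stack of overlapping points becomes. The sketch below assumes standard alpha compositing, where each point covers a fraction `alpha` of what is still visible beneath it; the exact rendering depends on the graphics device, and `stacked_opacity()` is a hypothetical helper.

```r
# Opacity of k stacked points, each drawn with transparency level alpha
stacked_opacity <- function(k, alpha = 0.01) {
  1 - (1 - alpha)^k
}

k <- c(1, 10, 100, 500)
data.frame(k = k, opacity = round(stacked_opacity(k), 3))
```

With `alpha = 0.01`, about 100 overlapping points are needed to reach roughly 63% opacity, so sparse regions of the plot become nearly invisible.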

---

## Facets

Sometimes it is enough to split data into facets.

```r
ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 0.01) +
  facet_wrap(~cut)
```

---

## Bin-Summarize or Smooth

The bin-summarize-smooth framework <a name=cite-wickham2013a></a>([Wickham, 2013](https://vita.had.co.nz/papers/bigvis.pdf)) is a principled
methodology to handle large data.

- **Binning** is the process of grouping values into *bins*; **summarizing** is the act of computing summary statistics (count, mean) inside each bin.

- **Smoothing** smooths out noisy data, revealing hard-to-see patterns.

We have already encountered binning for one variable: histograms.
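
The bin and summarize steps can also be carried out by hand. The following is a minimal sketch in base R, assuming half-carat bins (an arbitrary choice made for illustration).

```r
library(ggplot2)  # provides the diamonds data set

# Bin: assign each diamond to a half-carat interval
breaks <- seq(0, 5.5, by = 0.5)
carat_bin <- cut(diamonds$carat, breaks = breaks)

# Summarize: one count and one mean price per bin
# (bins without observations get a count of 0 and a mean of NA)
binned <- data.frame(
  bin = levels(carat_bin),
  count = as.vector(table(carat_bin)),
  mean_price = as.vector(tapply(diamonds$price, carat_bin, mean))
)
binned
```

A histogram corresponds to plotting `count` against `bin`; replacing the count with another statistic, such as the mean price, gives a binned summary.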

---

## Binning

Binning is a simple, but not very sophisticated, idea.

```r
trees <- tsibble::as_tsibble(treering)
ggplot(trees, aes(index, value)) + geom_line()
```

```r
ggplot(trees, aes(index, value)) + geom_line(alpha = 0.2) +
  geom_hline(yintercept = 1, lty = 2) +
* stat_summary_bin(fun = mean, geom = "line", binwidth = 100)
```
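
For intuition, the binned mean can be approximated by hand. This sketch uses the `treering` series from base R directly and groups years into centuries; `stat_summary_bin()` chooses its own bin boundaries, so the values need not match the plot exactly.

```r
# treering ships with base R (datasets package): yearly tree-ring widths
year <- as.numeric(time(treering))
width <- as.numeric(treering)

# Bin: map every year to the first year of its 100-year interval
century <- floor(year / 100) * 100

# Summarize: mean ring width within each interval
century_mean <- tapply(width, century, mean)
head(century_mean)
```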

---

### 1. Two-Dimensional Binning

A two-dimensional histogram: observations are binned into a two-dimensional shape (here, rectangles),
and the count in each bin is mapped to color instead of height.

```r
ggplot(diamonds, aes(carat, price)) +
* geom_bin2d(bins = 60) +
  scale_fill_distiller(palette = "Spectral")
```
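
The counts behind such a plot can be sketched with a cross-tabulation. The example assumes 60 equal-width intervals per axis created with `cut()`; the bin edges that `geom_bin2d()` picks may differ slightly.

```r
library(ggplot2)  # provides the diamonds data set

# Cut each axis into 60 equal-width intervals
carat_bin <- cut(diamonds$carat, breaks = 60)
price_bin <- cut(diamonds$price, breaks = 60)

# Cross-tabulate: one cell per rectangle, holding the number of diamonds in it
counts <- table(carat_bin, price_bin)

dim(counts)        # size of the grid
mean(counts == 0)  # share of rectangles that contain no diamonds
max(counts)        # count in the most crowded rectangle
```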

---

### 2. Hexagonal Binning

Rectangles may obscure information for some data, so it is usually better
to use hexagons instead, via `geom_hex()`.

```r
ggplot(diamonds, aes(carat, price)) +
* geom_hex(bins = 60) +
  scale_fill_distiller(palette = "Spectral")
```
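
`geom_hex()` relies on the **hexbin** package, which can also be called directly to inspect the counts. This is a sketch that assumes hexbin is installed; its `xbins` argument is not guaranteed to produce the same grid as `bins = 60` in ggplot2.

```r
library(ggplot2)  # provides the diamonds data set
library(hexbin)   # hexagonal binning

hb <- hexbin(diamonds$carat, diamonds$price, xbins = 60)

length(hb@count)  # number of non-empty hexagons
max(hb@count)     # count in the most crowded hexagon
```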

---

## Smoothing

Smoothing is a powerful transformation for noisy data.

It amounts to fitting a statistical model to the data.

```r
ggplot(trees, aes(index, value)) +
  geom_line(alpha = 0.3, size = 0.3) +
  geom_hline(yintercept = 1, lty = 2) +
* geom_smooth()
```
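
To make the model explicit, the smoother can be fitted outside ggplot2. According to the ggplot2 documentation, `geom_smooth()` defaults to `mgcv::gam()` with `y ~ s(x, bs = "cs")` and `method = "REML"` when there are 1,000 or more observations; the sketch below assumes that default and rebuilds the data from base R's `treering` (the name `trees_df` is illustrative).

```r
library(mgcv)  # ships with R as a recommended package

trees_df <- data.frame(
  index = as.numeric(time(treering)),
  value = as.numeric(treering)
)

# Penalized regression spline of ring width on year
fit <- gam(value ~ s(index, bs = "cs"), data = trees_df, method = "REML")

# The smooth curve is given by the model's fitted values
trees_df$smooth <- as.numeric(fitted(fit))
head(trees_df)
```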

---

### Smoothers are Sensitive

Rule of thumb: stick with smoothers that you understand.

.pull-left[

```r
p + 
* geom_smooth(method = "lm")
```

<img src="big-data_files/figure-html/unnamed-chunk-11-1.png" width="345.6" style="display: block; margin: auto;" />
]

.pull-right[

```r
p + geom_smooth(
  method = "gam",
* formula = y ~ poly(x, 6)
)
```

<img src="big-data_files/figure-html/unnamed-chunk-12-1.png" width="345.6" style="display: block; margin: auto;" />
]
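
The sensitivity can also be checked numerically. Since `p` is not defined on this slide, the sketch below uses the tree-ring data as a stand-in. Without an `s()` term, `method = "gam"` with `y ~ poly(x, 6)` amounts to an ordinary polynomial regression, so `lm()` is used for both fits here.

```r
trees_df <- data.frame(
  index = as.numeric(time(treering)),
  value = as.numeric(treering)
)

fit_line <- lm(value ~ index, data = trees_df)           # straight line
fit_poly <- lm(value ~ poly(index, 6), data = trees_df)  # degree-6 polynomial

# Residual sums of squares: the richer model fits at least as well in-sample
c(line = deviance(fit_line), poly = deviance(fit_poly))

# Largest gap between the two fitted curves
max(abs(fitted(fit_line) - fitted(fit_poly)))
```

A better in-sample fit does not mean the smoother is more trustworthy; high-degree polynomials in particular can behave erratically near the ends of the data.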

---

### 2D Density Plots

Two-dimensional density plots are a type of two-dimensional **smoothing**.

They rely on a kernel density estimate, which is sensitive to its parameters (such as the bandwidth).

```r
ggplot(diamonds, aes(carat, price)) +
* geom_density_2d_filled()
```
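
The effect of the bandwidth can be illustrated with `MASS::kde2d()`, the estimator that ggplot2 documents as the basis for its 2D density layers. The sketch uses the small built-in `faithful` data to keep it fast, and the factor of 4 is an arbitrary choice for illustration.

```r
library(MASS)  # provides kde2d() and bandwidth.nrd()

x <- faithful$eruptions
y <- faithful$waiting

# Rule-of-thumb bandwidths, which kde2d() uses when h is not supplied
h_default <- c(bandwidth.nrd(x), bandwidth.nrd(y))

dens_default <- kde2d(x, y, h = h_default, n = 50)
dens_narrow <- kde2d(x, y, h = h_default / 4, n = 50)

# Compare the peak heights of the two estimated surfaces
c(default = max(dens_default$z), narrow = max(dens_narrow$z))
```

In ggplot2, the bandwidth can be passed to the density layer through its `h` argument.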

---

- For other data, a density plot may work well.

```r
ggplot(faithful, aes(eruptions, waiting)) +
  geom_density_2d_filled() +
  geom_point()
```

- Hexagonal binning is in general a **safer bet**.

---

## References

<a name=bib-wickham2013a></a>[Wickham, H.](#cite-wickham2013a) (2013).
_Bin-Summarise-Smooth: A Framework for Visualising Large Data_. URL:
[https://vita.had.co.nz/papers/bigvis.pdf](https://vita.had.co.nz/papers/bigvis.pdf)
(visited on Sep. 16, 2020).