class: middle, center, title-slide

.title[
# Working with Data in R
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Data in R

- R has excellent facilities for handling data
- roughly speaking, two approaches
  * base R
  * tidyverse
- in this course we will use the tidyverse approach

.pull-left[
### Tidyverse

- a collection of packages built around a set of
  principles and design choices. 
- includes the package **ggplot2** that we will use for all our
  visualizations in this course
  
]

.pull-right[
<img src="images/tidyverse-logo.png" width="50%" style="display: block; margin: auto;" />
]

---

class: middle
  
<img src="images/tidyverse-workflow.svg" style="display: block; margin: auto;" />

---

## An Example

```r
library(tidyverse)

# import
read_csv("data/us_cars.csv") %>%
  # wrangle: rescale price by 8.65 (presumably USD to SEK)
  mutate(price = price * 8.65) %>%
  # visualize
  ggplot(aes(x = mileage, y = price, color = title_status)) +
  geom_point(alpha = 0.5) +
  labs(x = "mileage (miles)",
       y = "price (SEK)",
       color = "status") 
```

---

class: center, middle

---

## Importing Data

Data can be stored in many different formats (`.csv`, `.tsv`, `.dat`, `.sav`, etc.).
  
--

We will use the **readr** package, in which functions are named
`read_*()`.

```r
data_tsv <- read_tsv("data.tsv") 
data_csv <- read_csv("data.csv")
# and so on
```
  
--
  
Always inspect the result; datasets are not always stored in a clean way.

In this course, however, you will generally be provided data that needs little 
preprocessing.
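
As a minimal sketch of such an inspection (the file and its contents are made up for illustration), we can write a small CSV file, import it, and look at the column types. The `na` argument tells `read_csv()` which strings to treat as missing.

```r
library(readr)
library(dplyr)

# Create a tiny, made-up CSV file in a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ann,170", "Bo,n/a"), path)

# Import it; "n/a" is how this (hypothetical) file marks missing values
people <- read_csv(path, na = c("", "NA", "n/a"))

# Inspect column names, types, and the first values
glimpse(people)
```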
  
--

Many datasets are available from base R or from packages.
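
For instance, `mtcars` ships with base R and `msleep` comes with **ggplot2**. A short sketch of how such datasets might be accessed:

```r
library(ggplot2)

# Uncomment to list all datasets in the base R "datasets" package
# data(package = "datasets")

head(mtcars)   # first rows of a dataset from base R
head(msleep)   # first rows of a dataset from ggplot2
```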

---

### Using RStudio to Import Data

---

## Data Wrangling

Data is often not in the format we need; it first has to be
**wrangled**.

Our workhorses: [dplyr](https://dplyr.tidyverse.org/)
and [tidyr](https://tidyr.tidyverse.org/)

.pull-left[
### dplyr

used for data manipulation, such as

* creating new variables,
* modifying existing variables,
* filtering, and
* subsetting data.
]

.pull-right[
### tidyr

used for reshaping data, such as

* pivoting data,
* merging and separating variables, and
* extracting variables.
]
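
To give a flavor of merging and separating variables, here is a small sketch using `unite()` and `separate()` from **tidyr**. The `dates` table is made up for illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical data with year and month in separate columns
dates <- tibble(year = c(2021, 2022), month = c(3, 11))

# Merge two variables into one
united <- unite(dates, "year_month", year, month, sep = "-")

# Split the merged variable back into two (character) columns
separated <- separate(united, year_month, into = c("year", "month"), sep = "-")
```

Note that the columns in `separated` are character vectors, since they were split from text.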

---

## Data Semantics

Most data is made up of a rectangular table of **rows** and **columns**.

The elements (cells) in this table are **values**.

Each value corresponds to an **observation** and a **variable**.

There are usually multiple ways to format the same data:

<br>

.pull-left[
<table>
<caption>Version A</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> person </th>
   <th style="text-align:right;"> control </th>
   <th style="text-align:right;"> treatment </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> James </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Julia </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bixby </td>
   <td style="text-align:right;"> 2 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<table>
<caption>Version B</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> intervention </th>
   <th style="text-align:right;"> James </th>
   <th style="text-align:right;"> Julia </th>
   <th style="text-align:right;"> Bixby </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 3 </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> treatment </td>
   <td style="text-align:right;"> 5 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>
]

---

## Tidy Data
.pull-left-60[

In tidy data:
* each observation has its own row,
* each variable its own column, and
* each value its own cell.

Tidy data makes visualization easy.

In our example, the variables are **person**, **intervention**, and **result**.
]

.pull-right-40[
<table>
<caption>Tidy treatment data</caption>
 <thead>
  <tr>
   <th style="text-align:left;"> person </th>
   <th style="text-align:left;"> intervention </th>
   <th style="text-align:right;"> result </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> James </td>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Julia </td>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bixby </td>
   <td style="text-align:left;"> control </td>
   <td style="text-align:right;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> James </td>
   <td style="text-align:left;"> treatment </td>
   <td style="text-align:right;"> 5 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Julia </td>
   <td style="text-align:left;"> treatment </td>
   <td style="text-align:right;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bixby </td>
   <td style="text-align:left;"> treatment </td>
   <td style="text-align:right;"> 3 </td>
  </tr>
</tbody>
</table>
]
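
A minimal sketch of why this helps: with one column per variable, each variable maps directly onto an aesthetic in **ggplot2**. The choice of a dodged bar chart is just one possibility.

```r
library(ggplot2)
library(tibble)

# The tidy treatment data from the table
d_tidy <- tibble(
  person = rep(c("James", "Julia", "Bixby"), times = 2),
  intervention = rep(c("control", "treatment"), each = 3),
  result = c(1, 3, 2, 5, 6, 3)
)

# Each variable is mapped to one aesthetic
p <- ggplot(d_tidy, aes(x = person, y = result, fill = intervention)) +
  geom_col(position = "dodge")
p
```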

---

## Reshaping Data with tidyr

In `d`, we've stored the treatment data from before.

```r
library(tidyverse) # loads tidyr, dplyr, tibble, etc.
d <- tibble(person = c("James", "Julia", "Bixby"), 
            control = c(1, 3, 2), 
            treatment = c(5, 6, 3))
```

To reshape this into **tidy** data, we will use the function
`pivot_longer()` (from **tidyr**).

.pull-left[

```r
d_tidy <- pivot_longer(
  data = d, 
  cols = c("control", 
           "treatment"), 
  names_to = "intervention",
  values_to = "result"
)
```
]

.pull-right[

```
## # A tibble: 6 × 3
##   person intervention result
##   <chr>  <chr>         <dbl>
## 1 James  control           1
## 2 James  treatment         5
## 3 Julia  control           3
## 4 Julia  treatment         6
## 5 Bixby  control           2
## 6 Bixby  treatment         3
```
]
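
The reverse operation is `pivot_wider()`, also from **tidyr**. As a sketch, pivoting the tidy data wider again should give back the original table:

```r
library(tidyr)
library(tibble)

d <- tibble(person = c("James", "Julia", "Bixby"),
            control = c(1, 3, 2),
            treatment = c(5, 6, 3))

# Wide to long, as in the example above
d_tidy <- pivot_longer(d,
                       cols = c("control", "treatment"),
                       names_to = "intervention",
                       values_to = "result")

# Long to wide: one column per intervention again
d_wide <- pivot_wider(d_tidy,
                      names_from = intervention,
                      values_from = result)
```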

---

## Manipulating Data with dplyr

* `filter()` to filter rows (observations):

```r
filter(d_tidy, 
       person != "Bixby")
```

```
## # A tibble: 4 × 3
##   person intervention result
##   <chr>  <chr>         <dbl>
## 1 James  control           1
## 2 James  treatment         5
## 3 Julia  control           3
## 4 Julia  treatment         6
```
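
Several conditions can be combined in one call. In this sketch, which rebuilds `d_tidy` so that it runs on its own, conditions separated by commas must all hold:

```r
library(dplyr)
library(tibble)

d_tidy <- tibble(
  person = rep(c("James", "Julia", "Bixby"), each = 2),
  intervention = rep(c("control", "treatment"), times = 3),
  result = c(1, 5, 3, 6, 2, 3)
)

# Keep treatment rows with a result above 4
high_treatment <- filter(d_tidy, intervention == "treatment", result > 4)
high_treatment
```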

---

## Manipulating Data with dplyr

* `select()` to select columns (variables):

```r
select(d_tidy, person, result)
```

```
## # A tibble: 6 × 2
##   person result
##   <chr>   <dbl>
## 1 James       1
## 2 James       5
## 3 Julia       3
## 4 Julia       6
## 5 Bixby       2
## 6 Bixby       3
```
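
`select()` can also drop or rename columns. A small sketch (the new name `who` is arbitrary):

```r
library(dplyr)
library(tibble)

d_tidy <- tibble(
  person = rep(c("James", "Julia", "Bixby"), each = 2),
  intervention = rep(c("control", "treatment"), times = 3),
  result = c(1, 5, 3, 6, 2, 3)
)

# Drop a column with a minus sign
no_intervention <- select(d_tidy, -intervention)

# Select and rename in one step
renamed <- select(d_tidy, who = person, result)
```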

---

## Manipulating Data with dplyr

* `mutate()` to modify existing variables or add new ones:

```r
mutate(d_tidy, rel_result = result/max(result))
```

```
## # A tibble: 6 × 4
##   person intervention result rel_result
##   <chr>  <chr>         <dbl>      <dbl>
## 1 James  control           1      0.167
## 2 James  treatment         5      0.833
## 3 Julia  control           3      0.5  
## 4 Julia  treatment         6      1    
## 5 Bixby  control           2      0.333
## 6 Bixby  treatment         3      0.5
```
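
The example above adds a variable. As a sketch of the other use, naming an existing column on the left-hand side overwrites it (the variable `high` and its threshold are made up):

```r
library(dplyr)
library(tibble)

d_tidy <- tibble(
  person = rep(c("James", "Julia", "Bixby"), each = 2),
  intervention = rep(c("control", "treatment"), times = 3),
  result = c(1, 5, 3, 6, 2, 3)
)

modified <- mutate(
  d_tidy,
  intervention = toupper(intervention),  # modify an existing variable
  high = result > 3                      # add a new variable
)
```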

---

## Manipulating Data with dplyr

* `group_by()` and `summarize()` to group and summarize data:

```r
group_by(d_tidy, intervention) %>%
  summarize(mean_result = mean(result))
```

```
## # A tibble: 2 × 2
##   intervention mean_result
##   <chr>              <dbl>
## 1 control             2   
## 2 treatment           4.67
```
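
`summarize()` can compute several summaries at once. A sketch using `n()` (the group size) alongside the mean and maximum:

```r
library(dplyr)
library(tibble)

d_tidy <- tibble(
  person = rep(c("James", "Julia", "Bixby"), each = 2),
  intervention = rep(c("control", "treatment"), times = 3),
  result = c(1, 5, 3, 6, 2, 3)
)

# One row per intervention, one column per summary
summary_tbl <- group_by(d_tidy, intervention) %>%
  summarize(n = n(),
            mean_result = mean(result),
            max_result = max(result))
```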

---

## The Pipe Operator

The `%>%` (pipe) operator is very useful when working with data.

It takes the result of the left-hand side (or the line above) and passes it as the first argument to the next function.

```r
d_tidy %>% 
  filter(person == "James") %>%
  select(intervention, result)
```

```
## # A tibble: 2 × 2
##   intervention result
##   <chr>         <dbl>
## 1 control           1
## 2 treatment         5
```

Without the pipe, the same operation has to be written as a nested call, which reads inside-out:

```r
select(filter(d_tidy, person == "James"), intervention, result)
```

```
## # A tibble: 2 × 2
##   intervention result
##   <chr>         <dbl>
## 1 control           1
## 2 treatment         5
```
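
Since version 4.1, R also has a built-in pipe, `|>`, which behaves the same way in simple chains like this one. A sketch comparing the two, assuming R 4.1 or later:

```r
library(dplyr)
library(tibble)

d_tidy <- tibble(
  person = rep(c("James", "Julia", "Bixby"), each = 2),
  intervention = rep(c("control", "treatment"), times = 3),
  result = c(1, 5, 3, 6, 2, 3)
)

# magrittr pipe, as used in this course
with_magrittr <- d_tidy %>%
  filter(person == "James") %>%
  select(intervention, result)

# native pipe (requires R >= 4.1)
with_native <- d_tidy |>
  filter(person == "James") |>
  select(intervention, result)
```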

---

## Missing Data

Normally, missing data should be handled with care (e.g. through imputation), but in
this course we will usually simply drop it with `drop_na()`.

Missing data is denoted by `NA` in R.

Functions applied to missing data often return `NA` unless `na.rm = TRUE`
is specified.

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:right;"> sleep_rem </th>
   <th style="text-align:right;"> sleep_cycle </th>
   <th style="text-align:right;"> brainwt </th>
   <th style="text-align:right;"> bodywt </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Cheetah </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> 50.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Owl monkey </td>
   <td style="text-align:right;"> 1.8 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> 0.016 </td>
   <td style="text-align:right;"> 0.480 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Mountain beaver </td>
   <td style="text-align:right;"> 2.4 </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> NA </td>
   <td style="text-align:right;"> 1.350 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Greater short-tailed shrew </td>
   <td style="text-align:right;"> 2.3 </td>
   <td style="text-align:right;"> 0.133 </td>
   <td style="text-align:right;"> 0.000 </td>
   <td style="text-align:right;"> 0.019 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cow </td>
   <td style="text-align:right;"> 0.7 </td>
   <td style="text-align:right;"> 0.667 </td>
   <td style="text-align:right;"> 0.423 </td>
   <td style="text-align:right;"> 600.000 </td>
  </tr>
</tbody>
</table>
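
A short sketch of both points, assuming the table above is an excerpt from the `msleep` dataset in **ggplot2**:

```r
library(ggplot2)  # provides msleep
library(tidyr)    # provides drop_na()

x <- c(1, NA, 3)
mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 2, the missing value is ignored

# Drop all rows where brainwt is missing
msleep_complete <- drop_na(msleep, brainwt)
```

Calling `drop_na()` without naming columns drops every row that has a missing value in any column.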

---

## Learning More

In this course, we will only scratch the surface.

Our scope is limited to **making data ready for visualizations**.

If you want to learn more about data wrangling and tidy data, we recommend
the chapter [Wrangle](https://r4ds.had.co.nz/wrangle-intro.html) in
the free e-book *R for Data Science*.