class: middle, center, title-slide

.title[
# Working with Data in R
]
.subtitle[
## Data Visualization
]
.author[
### Johan Larsson
]
.author[
### Behnaz Pirzamanbein
]
.institute[
### The Department of Statistics, Lund University
]

---

## Data in R

- R has excellent facilities for handling data
- roughly speaking, two approaches
  * base R
  * tidyverse
- in this course we will use the tidyverse approach

--

.pull-left[
### Tidyverse

- a collection of packages built around a shared set of principles and design choices
- includes the package **ggplot2**, which we will use for all our visualizations in this course
]

.pull-right[
<img src="images/tidyverse-logo.png" width="50%" style="display: block; margin: auto;" />
]

---

class: middle

<img src="images/tidyverse-workflow.svg" style="display: block; margin: auto;" />

---

## An Example

```r
library(tidyverse)

# import
read_csv("data/us_cars.csv") %>%
  # wrangle (rescale price for the SEK axis label)
  mutate(price = price * 8.65) %>%
  # visualize
  ggplot(aes(x = mileage, y = price, color = title_status)) +
  geom_point(alpha = 0.5) +
  labs(x = "mileage (miles)", y = "price (sek)", color = "status")
```

---

class: center, middle

<img src="working-with-data-in-r_files/figure-html/unnamed-chunk-3-1.png" width="648" style="display: block; margin: auto;" />

---

## Importing Data

Data can be stored in a myriad of formats (`.csv`, `.tsv`, `.dat`, `.sav`, etc.).

--

We will use the **readr** package, in which functions are named `read_*()`.

```r
data_tsv <- read_tsv("data.tsv")
data_csv <- read_csv("data.csv")
# and so on
```

--

Always inspect the result; datasets are not always stored in a clean way. In this course, however, you will generally be provided with data that needs little preprocessing.

--

Many datasets are also available from base R or from packages.

---

### Using RStudio to Import Data

<img src="videos/rstudio-data-import.gif" style="display: block; margin: auto;" />

---

## Data Wrangling

Data is often not in the format we need; it has to be **wrangled** first.
Our workhorses: [dplyr](https://dplyr.tidyverse.org/) and [tidyr](https://tidyr.tidyverse.org/)

--

.pull-left[
### dplyr

used for data manipulation, such as

* creating new variables,
* modifying existing variables,
* filtering, and
* subsetting data.
]

--

.pull-right[
### tidyr

used for reshaping data, such as

* pivoting data,
* merging and separating variables, and
* extracting variables.
]

---

## Data Semantics

Most data is made up of a rectangular table of **rows** and **columns**.

The elements (cells) in this table are **values**.

Each value corresponds to an **observation** and a **variable**.

There are usually multiple ways to format the same data:

<br>

--

.pull-left[
<table>
<caption>Version A</caption>
<thead>
<tr> <th style="text-align:left;"> person </th> <th style="text-align:right;"> control </th> <th style="text-align:right;"> treatment </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> James </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> Julia </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> Bixby </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> </tr>
</tbody>
</table>
]

--

.pull-right[
<table>
<caption>Version B</caption>
<thead>
<tr> <th style="text-align:left;"> intervention </th> <th style="text-align:right;"> James </th> <th style="text-align:right;"> Julia </th> <th style="text-align:right;"> Bixby </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> treatment </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 3 </td> </tr>
</tbody>
</table>
]

---

## Tidy Data

.pull-left-60[
In tidy data:

* each
observation has its own row,
* each variable its own column, and
* each value its own cell.

Tidy data makes visualization easy.

In our example, the variables are **person**, **intervention**, and **result**.
]

.pull-right-40[
<table>
<caption>Tidy treatment data</caption>
<thead>
<tr> <th style="text-align:left;"> person </th> <th style="text-align:left;"> intervention </th> <th style="text-align:right;"> result </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> James </td> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 1 </td> </tr>
<tr> <td style="text-align:left;"> Julia </td> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 3 </td> </tr>
<tr> <td style="text-align:left;"> Bixby </td> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 2 </td> </tr>
<tr> <td style="text-align:left;"> James </td> <td style="text-align:left;"> treatment </td> <td style="text-align:right;"> 5 </td> </tr>
<tr> <td style="text-align:left;"> Julia </td> <td style="text-align:left;"> treatment </td> <td style="text-align:right;"> 6 </td> </tr>
<tr> <td style="text-align:left;"> Bixby </td> <td style="text-align:left;"> treatment </td> <td style="text-align:right;"> 3 </td> </tr>
</tbody>
</table>
]

<img src="images/tidy-data.png" width="853" style="display: block; margin: auto;" />

---

## Reshaping Data with tidyr

In `d`, we've stored the treatment data from before (Version A).

```r
library(tidyverse) # loads tidyr, dplyr, tibble, etc

d <- tibble(
  person = c("James", "Julia", "Bixby"),
  control = c(1, 3, 2),
  treatment = c(5, 6, 3)
)
```

--

To reshape this into **tidy** data, we will use the function `pivot_longer()` (from **tidyr**).
.pull-left[
```r
d_tidy <- pivot_longer(
  data = d,
  cols = c("control", "treatment"),
  names_to = "intervention",
  values_to = "result"
)
```
]

.pull-right[
```
## # A tibble: 6 × 3
##   person intervention result
##   <chr>  <chr>         <dbl>
## 1 James  control           1
## 2 James  treatment         5
## 3 Julia  control           3
## 4 Julia  treatment         6
## 5 Bixby  control           2
## 6 Bixby  treatment         3
```
]

---

## Manipulating Data with dplyr

* `filter()` to filter rows (observations):

```r
filter(d_tidy, person != "Bixby")
```

```
## # A tibble: 4 × 3
##   person intervention result
##   <chr>  <chr>         <dbl>
## 1 James  control           1
## 2 James  treatment         5
## 3 Julia  control           3
## 4 Julia  treatment         6
```

---

## Manipulating Data with dplyr

* `select()` to select columns (variables):

```r
select(d_tidy, person, result)
```

```
## # A tibble: 6 × 2
##   person result
##   <chr>   <dbl>
## 1 James       1
## 2 James       5
## 3 Julia       3
## 4 Julia       6
## 5 Bixby       2
## 6 Bixby       3
```

---

## Manipulating Data with dplyr

* `mutate()` to modify or add new variables:

```r
mutate(d_tidy, rel_result = result/max(result))
```

```
## # A tibble: 6 × 4
##   person intervention result rel_result
##   <chr>  <chr>         <dbl>      <dbl>
## 1 James  control           1      0.167
## 2 James  treatment         5      0.833
## 3 Julia  control           3      0.5
## 4 Julia  treatment         6      1
## 5 Bixby  control           2      0.333
## 6 Bixby  treatment         3      0.5
```

---

## Manipulating Data with dplyr

* `group_by()` and `summarize()` to group and summarize data:

```r
group_by(d_tidy, intervention) %>%
  summarize(mean_result = mean(result))
```

```
## # A tibble: 2 × 2
##   intervention mean_result
##   <chr>              <dbl>
## 1 control             2
## 2 treatment           4.67
```

---

## The Pipe Operator

The `%>%` (pipe) operator is very useful when working with data.
It takes the result of the expression on its left-hand side (or on the line above) and passes it as the first argument to the next function.

```r
d_tidy %>%
  filter(person == "James") %>%
  select(intervention, result)
```

```
## # A tibble: 2 × 2
##   intervention result
##   <chr>         <dbl>
## 1 control           1
## 2 treatment         5
```

--

Without the pipe, the same result requires nested function calls:

```r
select(filter(d_tidy, person == "James"), intervention, result)
```

```
## # A tibble: 2 × 2
##   intervention result
##   <chr>         <dbl>
## 1 control           1
## 2 treatment         5
```

---

## Missing Data

Normally, missing data should be handled with care (e.g. through imputation), but in this course we will usually just drop it with `drop_na()`.

Missing data is denoted by `NA` in R.

Many summary functions, such as `mean()` and `sum()`, return `NA` when applied to data with missing values unless `na.rm = TRUE` is specified.

<table>
<thead>
<tr> <th style="text-align:left;"> name </th> <th style="text-align:right;"> sleep_rem </th> <th style="text-align:right;"> sleep_cycle </th> <th style="text-align:right;"> brainwt </th> <th style="text-align:right;"> bodywt </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Cheetah </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 50.000 </td> </tr>
<tr> <td style="text-align:left;"> Owl monkey </td> <td style="text-align:right;"> 1.8 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 0.016 </td> <td style="text-align:right;"> 0.480 </td> </tr>
<tr> <td style="text-align:left;"> Mountain beaver </td> <td style="text-align:right;"> 2.4 </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> NA </td> <td style="text-align:right;"> 1.350 </td> </tr>
<tr> <td style="text-align:left;"> Greater short-tailed shrew </td> <td style="text-align:right;"> 2.3 </td> <td style="text-align:right;"> 0.133 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 0.019 </td> </tr>
<tr> <td style="text-align:left;"> Cow </td> <td style="text-align:right;"> 0.7 </td> <td
style="text-align:right;"> 0.667 </td> <td style="text-align:right;"> 0.423 </td> <td style="text-align:right;"> 600.000 </td> </tr>
</tbody>
</table>

---

## Learning More

In this course, we will only scratch the surface.

Our scope is limited to **making data ready for visualizations**.

If you want to learn more about data wrangling and tidy data, we recommend the chapter [Wrangle](https://r4ds.had.co.nz/wrangle-intro.html) in the free e-book *R for Data Science*.
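
---

The missing-data behaviour described on the Missing Data slide can be sketched roughly as follows. This is a minimal illustration, not part of the original material: the tibble `sleep` and the name `sleep_complete` are made up for the example, with three rows and two of the columns copied from the table on that slide.

```r
library(tidyverse) # loads tidyr, dplyr, tibble, etc

# hypothetical toy data: three rows taken from the Missing Data table
sleep <- tibble(
  name = c("Cheetah", "Owl monkey", "Cow"),
  sleep_rem = c(NA, 1.8, 0.7),
  bodywt = c(50, 0.48, 600)
)

# a single NA makes the mean NA
mean(sleep$sleep_rem)

# na.rm = TRUE removes missing values before the mean is computed
mean(sleep$sleep_rem, na.rm = TRUE)

# drop_na() (from tidyr) removes every row that has an NA in any column
sleep_complete <- drop_na(sleep)
sleep_complete
```

Here `drop_na()` would discard the Cheetah row, since its `sleep_rem` value is missing, whereas `na.rm = TRUE` only affects the single summary in which it is used.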