class: middle, center, title-slide .title[ # Statistical Models ] .subtitle[ ## Data Visualization ] .author[ ### Johan Larsson ] .author[ ### Behnaz Pirzamanbein ] .institute[ ### The Department of Statistics, Lund University ] --- ## Statistical Models Most of our work so far: **exploratory** analysis For analysis, we often also want **diagnostic**, **estimates**, and **prediction** plots. -- .pull-left[ For very simple models, `geom_smooth()` may be sufficient. ```r ggplot(cars, aes(speed, dist)) + geom_point() + * geom_smooth(method = "lm") ``` ] .pull-right[ <div class="figure" style="text-align: center"> <img src="statistical-models_files/figure-html/unnamed-chunk-1-1.png" alt="Ordinary least squares regression visualized for speed and stopping distance of cars." width="345.6" /> <p class="caption">Ordinary least squares regression visualized for speed and stopping distance of cars.</p> </div> ] --- ## Broom More complicated models usually call for more control. [broom](https://broom.tidymodels.org/) is a great tool for generating **tidy** output from models. ```r library(broom) fit <- lm(mpg ~ disp, data = mtcars) *augment(fit) ``` <table> <thead> <tr> <th style="text-align:left;"> .rownames </th> <th style="text-align:right;"> mpg </th> <th style="text-align:right;"> disp </th> <th style="text-align:right;"> .fitted </th> <th style="text-align:right;"> .resid </th> <th style="text-align:right;"> .hat </th> <th style="text-align:right;"> .sigma </th> <th style="text-align:right;"> .cooksd </th> <th style="text-align:right;"> .std.resid </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mazda RX4 </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 23.0 </td> <td style="text-align:right;"> -2.00 </td> <td style="text-align:right;"> 0.042 </td> <td style="text-align:right;"> 3.29 </td> <td style="text-align:right;"> 0.009 </td> <td style="text-align:right;"> -0.630 </td> </tr> <tr> <td style="text-align:left;"> Mazda RX4 Wag </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 160 </td> <td style="text-align:right;"> 23.0 </td> <td style="text-align:right;"> -2.00 </td> <td style="text-align:right;"> 0.042 </td> <td style="text-align:right;"> 3.29 </td> <td style="text-align:right;"> 0.009 </td> <td style="text-align:right;"> -0.630 </td> </tr> <tr> <td style="text-align:left;"> Datsun 710 </td> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 108 </td> <td style="text-align:right;"> 25.1 </td> <td style="text-align:right;"> -2.35 </td> <td style="text-align:right;"> 0.063 </td> <td style="text-align:right;"> 3.28 </td> <td style="text-align:right;"> 0.019 </td> <td style="text-align:right;"> -0.746 </td> </tr> <tr> <td style="text-align:left;"> Hornet 4 Drive </td> <td style="text-align:right;"> 21.4 </td> <td style="text-align:right;"> 258 </td> <td style="text-align:right;"> 19.0 </td> <td style="text-align:right;"> 2.43 </td> <td style="text-align:right;"> 0.033 </td> <td style="text-align:right;"> 3.27 </td> <td style="text-align:right;"> 0.010 </td> <td style="text-align:right;"> 0.761 </td> </tr> </tbody> </table> --- ## Case Study: Fertility in Switzerland ```r # dichotomize the Catholic variable swiss2 <- mutate( swiss, Catholic = ifelse(Catholic > 50, "Catholic", "Protestant") ) # fit a linear model fit <- lm(Fertility ~ ., data = swiss2) ``` ```r # get some summary statistics with broom::glance() glance(fit) ``` <table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> r.squared </th> <th style="text-align:right;"> adj.r.squared </th> <th style="text-align:right;"> sigma </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> df </th> <th style="text-align:right;"> logLik </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> BIC </th> <th style="text-align:right;"> deviance </th> <th style="text-align:right;"> df.residual </th> <th style="text-align:right;"> nobs </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.671 </td> <td style="text-align:right;"> 0.631 </td> <td style="text-align:right;"> 7.59 </td> <td style="text-align:right;"> 16.7 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> -159 </td> <td style="text-align:right;"> 331 </td> <td style="text-align:right;"> 344 </td> <td style="text-align:right;"> 2362 </td> <td style="text-align:right;"> 41 </td> <td style="text-align:right;"> 47 </td> </tr> </tbody> </table> --- ### Diagnostic Plots Visualizations are great for diagnosing models! For lm() models, usually only the standard plot() method is needed. .pull-left[ #### Residual Plot - a useful standard diagnostic plot ```r aug <- augment(fit) ggplot(aug, aes(.fitted, .std.resid)) + geom_hline(yintercept = 0, lty = 2) + geom_point() ``` ] .pull-right[ <img src="statistical-models_files/figure-html/unnamed-chunk-7-1.png" width="345.6" style="display: block; margin: auto;" /> ] --- ### Coefficient Plots Models often estimate parameters: linear regression fits return **coefficients**. Use `broom::tidy()` to extract summary statistics from the fit. ```r tidy(fit, conf.int = TRUE) ``` <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Output from broom::tidy().</caption> <thead> <tr> <th style="text-align:left;"> term </th> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> std.error </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 73.573 </td> <td style="text-align:right;"> 11.626 </td> <td style="text-align:right;"> 6.33 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 50.094 </td> <td style="text-align:right;"> 97.052 </td> </tr> <tr> <td style="text-align:left;"> Agriculture </td> <td style="text-align:right;"> -0.157 </td> <td style="text-align:right;"> 0.074 </td> <td style="text-align:right;"> -2.11 </td> <td style="text-align:right;"> 0.041 </td> <td style="text-align:right;"> -0.308 </td> <td style="text-align:right;"> -0.007 </td> </tr> <tr> <td style="text-align:left;"> Examination </td> <td style="text-align:right;"> -0.376 </td> <td style="text-align:right;"> 0.279 </td> <td style="text-align:right;"> -1.34 </td> <td style="text-align:right;"> 0.186 </td> <td style="text-align:right;"> -0.940 </td> <td style="text-align:right;"> 0.188 </td> </tr> <tr> <td style="text-align:left;"> Education </td> <td style="text-align:right;"> -0.799 </td> <td style="text-align:right;"> 0.198 </td> <td style="text-align:right;"> -4.04 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1.200 </td> <td style="text-align:right;"> -0.399 </td> </tr> <tr> <td style="text-align:left;"> CatholicProtestant </td> <td style="text-align:right;"> -6.009 </td> <td style="text-align:right;"> 3.306 </td> <td style="text-align:right;"> -1.82 </td> <td style="text-align:right;"> 0.076 </td> <td style="text-align:right;"> -12.685 </td> <td style="text-align:right;"> 0.667 </td> </tr> <tr> <td style="text-align:left;"> Infant.Mortality </td> <td style="text-align:right;"> 1.164 </td> <td style="text-align:right;"> 0.404 </td> <td style="text-align:right;"> 2.88 </td> <td style="text-align:right;"> 0.006 </td> <td style="text-align:right;"> 0.349 </td> <td style="text-align:right;"> 1.979 </td> </tr> </tbody> </table> --- It is easy to pipe the `broom::tidy()` output into ggplot2 in order to create coefficient plots. ```r tidy(fit, conf.int = TRUE) %>% filter(term != "(Intercept)") %>% ggplot(aes(estimate, term)) + geom_vline(xintercept = 0, lty = 2) + geom_point() + geom_linerange(aes(xmin = conf.low, xmax = conf.high)) ``` <img src="statistical-models_files/figure-html/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" /> Alternative: use `GGally::ggcoef()` directly on `fit` --- ### Prediction Plots <ol> <li> create a table of values to predict for </ol> `tidyr::expand()` generates combinations of variables in dataset. ```r new_data <- expand( swiss2, Agriculture = median(Agriculture), Examination = median(Examination), * Education = full_seq(Education, 2, 1), * Catholic, Infant.Mortality = median(Infant.Mortality) ) ``` <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;"> Agriculture </th> <th style="text-align:right;"> Examination </th> <th style="text-align:right;"> Education </th> <th style="text-align:left;"> Catholic </th> <th style="text-align:right;"> Infant.Mortality </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Catholic </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Protestant </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Catholic </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Protestant </td> <td style="text-align:right;"> 20 </td> </tr> </tbody> </table> --- <ol start=2> <li> augment new data with predictions from the model using <code>broom::augment()</code> </ol> ```r aug <- augment(fit, newdata = new_data, se_fit = TRUE) ``` <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <caption style="font-size: initial !important;">Predicted values, standard errors</caption> <thead> <tr> <th style="text-align:right;"> Agriculture </th> <th style="text-align:right;"> Examination </th> <th style="text-align:right;"> Education </th> <th style="text-align:left;"> Catholic </th> <th style="text-align:right;"> Infant.Mortality </th> <th style="text-align:right;"> .fitted </th> <th style="text-align:right;"> .se.fit </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Catholic </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 81.5 </td> <td style="text-align:right;"> 3.48 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Protestant </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 75.5 </td> <td style="text-align:right;"> 1.99 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Catholic </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 79.9 </td> <td style="text-align:right;"> 3.17 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:left;"> Protestant </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 73.9 </td> <td style="text-align:right;"> 1.78 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Catholic </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 78.3 </td> <td style="text-align:right;"> 2.88 </td> </tr> <tr> <td style="text-align:right;"> 54.1 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:right;"> 5 </td> <td style="text-align:left;"> Protestant </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 72.3 </td> <td style="text-align:right;"> 1.65 </td> </tr> </tbody> </table> --- <ol start=3> <li> plot the predictions </ol> ```r ggplot(aug, aes(Education, .fitted, fill = Catholic, color = Catholic)) + geom_ribbon(aes(ymin = .fitted - .se.fit, ymax = .fitted + .se.fit), color = "transparent", alpha = 0.2) + geom_line() ``` <img src="statistical-models_files/figure-html/unnamed-chunk-15-1.png" width="504" style="display: block; margin: auto;" /> --- ### Predictions Across Two Numeric Variables Heatmaps (`geom_tile()`) allow us to vary one more variable. ```r # create some new data to predict on new_data2 <- expand( swiss2, Agriculture = median(Agriculture), Examination = median(Examination), * Education = full_seq(Education, 2, 1), * Catholic, * Infant.Mortality = full_seq(Infant.Mortality, 1, 1) ) # augment with predictions aug2 <- augment(fit, newdata = new_data2) ``` --- ```r ggplot(aug2, aes(Education, Infant.Mortality, fill = .fitted)) + geom_tile() + facet_wrap(vars(Catholic)) + scale_fill_viridis_c() + labs(fill = "Fertility") ``` <div class="figure" style="text-align: center"> <img src="statistical-models_files/figure-html/unnamed-chunk-17-1.png" alt="Prediction heatmaps for fertility across variables education, infant mortality, and religion." width="720" /> <p class="caption">Prediction heatmaps for fertility across variables education, infant mortality, and religion.</p> </div> --- ## Good Practices - Start by tyding your data. - Explore the data with visualizations (and other summaries). - Spend ample time visualizing model diagnostics and predictions. - Use graphics **liberally** when presenting models. - Visualizations are easier to read than tables, but sometimes the exact numbers are also important. -- ## Tidymodels .pull-left-60[ Check out [tidymodels](https://tidymodels.org/) if you want to learn more about statistical modeling inside the tidyverse. ] .pull-right-40[ <img src="images/tidymodels.png" width="70%" style="display: block; margin: auto;" /> ] <!-- --- --> <!-- ## References --> <!-- ```{r, results = "asis", echo=FALSE} --> <!-- PrintBibliography(bib) --> <!-- ``` -->