Let’s get started working with R! In this worked example, we will learn how to install and use packages in R as well as perform some basic operations on vectors, matrices, and data frames.
When you start an R session, you enter into an environment that stores
objects that you access via their names. You can populate this environment
with objects of your own using the <-
operator by placing
the name (or variable) you want to give your object on the left-hand side, and
the object itself on the right-hand side, like this:
Then to access this object, you simply have to refer to it as
the_meaning_of_life
(no quotation marks).
You can (usually) print the contents of any variable in R by simply calling its name in the console.
## [1] 42
Vectors are one-dimensional arrays which store objects. Vectors can be created
from functions or directly using, for instance, the c()
function. Each vector
can only store objects of a single type, which means that you cannot combine
the different data types in R, such as character, numeric, and
integer. If you still decide to try to, however, you will soon notice that R
won’t tell you no. Instead, R simply converts the vector to a single type
based on a few defaults.
We talked about this in the lecture, so try to see if you can figure out how the following vectors will be converted (if at all):
Having numeric values be converted to characters is seldom what you want. The conversion from integers to doubles if often not a big problem, but can lead to improper formatting of things such as axis labels in a plot.
We can manipulate vectors algebraically using standard arithmetic operators.
## [1] 4 11
## [1] 0.8
## [1] 2 20
## [1] NA 9.753129 9.869604
If you combine two vectors there are of different length but where the shorter vector’s length is a multiple of the longer vector’s length, then R will recycle the shorter vector.
## [1] 1 4 3 8
What do you think happens if the shorter vector’s length is not a multiple of the longer vector’s length? We’ll find out a bit later.
NA
is a special value used to denote missing data. There is also another type
of value called NULL
, which is exactly what is sounds like: nothing.
We can assign NULL
to a variable
but note that creating a vector with several values, some of them NULL
, will
give a different result.
## [1] "a"
To access an element in a vector, we can use the [
operator:
## [1] "a"
And to access several elements at once, you can use a vector of indices.
## [1] "a" "e"
Accessing a contiguous sub-vector is often desirable, in which case the :
operator is a useful tool. It creates a vector of indices.
## [1] 1 2 3 4
We can use this to access a range of elements from x
.
## [1] "a" "b" "c"
It is also possible to modify elements in vectors in the same fashion.
## [1] "foo" "bar" "c" "e"
Similarly, we can remove certain elements by negative indexing in the following way.
## [1] "bar" "c" "e"
## [1] "c" "e"
Notice the use of parentheses in the second example. What do you think
x[-1:2]
would give us?
Matrices are two-dimensional arrays. Just like vectors, they can only contain a single type of values.
## [,1] [,2]
## [1,] 1 2
## [2,] 3 NA
We can access separate rows and columns by separating row and column indices with a comma.
## [1] 2
To select an entire column, simply omit the first argument (or vice verse to select an entire row).
## [1] 1 2
## [1] 2 NA
Data frames look a lot like matrices except they allow each column to have a different type. Most of the data that you will encounter in this course will be stored as data frames (or as tibbles, but more on that later).
The following code creates a simple data.frame, using the data.frame()
function.
## a b
## 1 1 a
## 2 2 b
## 3 3 c
In addition to sub-setting columns using the matrix notation (i.e. d[, 2]
for
the second row), we can subset columns using $
as follows.
## [1] 1 2 3
## [1] "a" "b" "c"
One of R’s strengths is the wide-ranging package ecosystem it sports.
One package that we will use a lot during this course is ggplot2. If you
don’t yet have that package installed, we will now do so by calling the
install.packages()
function, using ggplot2 as the first argument. Simply
call the code below.
When you call this line, R will also install all of the dependencies that
come with ggplot2, which includes other packages. If this is the first time you
run R, you might also be prompted to choose a CRAN mirror. A good default choice
is "https://cloud.r-project.org"
.
To use a package in R, you can call the library(<package-name>)
function,
after which all of that packages functions—and often also its
data sets—will be available to you directly in your working environment.
Load the ggplot2 package we just installed by calling the following.
Sometimes we will also use <package>::function()
to call a
function from a specific package. In this case, you don’t need to call library
first. This is particularly useful if you load two packages that export
two functions with the same names.
In the following line, we create a plot in ggplot2 with this syntax, but
bear in mind that this is just for show since we’re already loaded the package
using library()
.
As we said in the lecture, you will use functions a lot throughout this course (although you won’t write your own). Therefore, you really need to understand how to call functions.
Here is a simple function:
Its arguments are a
and b
and it returns the squared sum of a
and b
.
When we call it we must supply both arguments or R will throw an error.
## [1] 25
You can choose to specify arguments by location, as we did above, or by naming each argument as in the following call.
## [1] 25
When you use functions in R, however, you will find that you seldom need to supply all the arguments of a function, which is due to them having default values.
The following function, f2()
, have defaults for both of its arguments.
We can call it without supplying any argument.
## [1] -0.3473907
We could also supply only a
, by simply using one argument.
To only supply b
, you have to name the argument.
## [1] 1.732051
Many functions in R are vectorized, which means that you can apply the
function across an entire vector (or column of a matrix) without
any extra work. All of the arithmetics functions you saw previously, as well
as log()
, sqrt()
, and many more, share this feature.
For instance, say that you have the following data frame,
d <- data.frame(
name = c("Wahijd", "Mary", "Eyvind"),
diet = c("vegetarian", "meat-eater", "vegetarian"),
age = c(34, 89, 50)
)
d
## name diet age
## 1 Wahijd vegetarian 34
## 2 Mary meat-eater 89
## 3 Eyvind vegetarian 50
and want to convert the age of the three persons to days instead. We can then simply modify the entire vector in place by calling
We have already shown a few lines of code that when run produces errors.
Errors notify you that something has gone wrong, and this always happens when
calling a function (remember that everything you do in R is the result of a
function call; even 2 + 3
calls a function under the hood).
With every error, you will receive a message indicating what has gone wrong (and sometimes why). Error messages can, however, be rather cryptic. The best advice, if you don’t understand what’s going on, is to simply search for the error message ad verbatim in your favorite search engine. Chances are that someone else has been there before.
You may also occasionally receive warnings. Warnings indicate that something may not have gone as planned but that the problem hasn’t been fatal enough to cause the function to terminate. Be weary of warnings. Warnings can be harmless but they may also indicate that something isn’t working properly, and you will do well to investigate if that is the case.
Here is a line that throws a warning. Before you run the line, can you figure out what might cause it to give off a warning? What do you even expect to happen when running the code?
Conditional statements are very useful when it comes to data science, particularly for filtering observations based on the values of another variable. Conditionals are logical statements that check each value to see if it is equal to, greater/lesser than, or greater/lesser or equal to some reference.
Conditional statements return TRUE
or FALSE
and are typically invoked
through the following conditional operators:
operator | meaning |
---|---|
== |
equal to |
!= |
not equal to |
> |
greater than |
< |
lesser than |
>= |
greater than or equal to |
<= |
lesser than or equal to |
Let’s see how it works in practice. Try to predict the output of the following conditional statements before running them.
3 > 2
1/3 == 0.33 # is this surprising? why not?
TRUE != TRUE
FALSE == FALSE
TRUE == 1
FALSE != 0
"b" > "a"
"ab" < "ac"
These statements are also vectorized, which means that we can do the following.
## [1] TRUE FALSE TRUE
## [1] TRUE TRUE FALSE TRUE TRUE
%in%
OperatorAnother, very useful, conditional operator is %in%
. It checks if the left-hand
side of the operator is inside the right-hand side. Observe the following
example.
## [1] TRUE FALSE FALSE
This is very useful for checking which observations on a variable belong to
a set of levels. Here we re-use the d
data set from above to check which
persons are vegans or vegetarians.
## [1] TRUE FALSE TRUE
This will come in handy for us repeatedly throughout the course.
The easiest way to understand how a function works, what it returns, what arguments it has and what they mean, as well as what the default argument values are is to read the documentation of the function.
If you’re using R Studio (which you really should) you can simply place your
cursor on the function, click, and hit F1
to bring up the documentation for
the function in the help pane of R Studio. It is also possible to use
help(<function-name>)
or ?<function-name>
.
The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/hello-world.Rmd.