Worked Example: Hello, World!

Overview

Let’s get started working with R! In this worked example, we will learn how to install and use packages in R as well as perform some basic operations on vectors, matrices, and data frames. First off, if you’re familiar with other programming languages since before, you will notice that R is an interpreted language, which means that you don’t need to compile anything.¹

That means it’s very easy to write and run an R program. You simply put some R code in a text file with the ending .R, and then you can run that program using a terminal or a graphical user interface like R Studio. So, let’s say we have a script called helloworld.R. Then it’s just a matter of calling Rscript² on that scripts, like this.

Rscript helloworld.R

## [1] "Hello, World!"

This is just to showcase that R is really just a program that reads and interprets text files. The .R (and .Rmd) files that you will be working with are just that: text.

Objects in R

When you start an R session, you enter into an environment that stores objects that you access via their names. You can populate this environment with objects of your own using the <- operator by placing the name (or variable) you want to give your object on the left-hand side, and the object itself on the right-hand side, like this:

# store the number 42 in the_meaning_of_life
the_meaning_of_life <- 42

Then to access this object, you simply have to refer to it as the_meaning_of_life (no quotation marks).

You can (usually) print the contents of any variable in R by simply calling its name in the console.

the_meaning_of_life

## [1] 42

Vectors

Vectors are one-dimensional arrays which store objects. Vectors can be created from functions or directly using, for instance, the c() function. Each vector can only store objects of a single type, which means that you cannot combine the different data types in R, such as character, numeric, and integer. If you still decide to try to, however, you will soon notice that R won’t tell you no. Instead, R simply converts the vector to a single type based on a few defaults.

We talked about this in the lecture, so try to see if you can figure out how the following vectors will be converted (if at all):

c("a", 1, TRUE)
c(1L, 1) # 1L = integer of value 1
c(NA, "a")
c(1, 1/3)

Having numeric values be converted to characters is seldom what you want. The conversion from integers to doubles if often not a big problem, but can lead to improper formatting of things such as axis labels in a plot.

We can manipulate vectors algebraically using standard arithmetic operators.

1:2 + c(3, 9)      # adding two vectors together

## [1]  4 11

1.6 / 2

## [1] 0.8

c(1, 10) * 2       # 2 is multiplied with each element

## [1]  2 20

c(NA, 3.123, pi)^2 # pi is an object that's always available

## [1]       NA 9.753129 9.869604

If you combine two vectors thare are of different length but where the shorter vector’s length is a multiple of the longer vector’s length, then R will recycle the shorter vector.

c(1, 2, 3, 4) * c(1, 2)

## [1] 1 4 3 8

What do you think happens if the shorter vector’s length is not a multiple of the longer vector’s length? We’ll find out a bit later.

Special Values

NA is a special value used to denote missing data. There is also another type of value called NULL, which is exactly what is sounds like: nothing.

We can assign NULL to a variable

a <- NULL

but note that creating a vector with several values, some of them NULL, will give a different result.

a <- c(NULL, "a")
a

## [1] "a"

To access an element in a vector, we can use the [ operator:

x <- c("a", "b", "c", "e")
x[1]

## [1] "a"

And to access several elements at once, you can use a vector of indices.

x[c(1, 4)]

## [1] "a" "e"

Accessing a contiguous sub-vector is often desirable, in which case the : operator is a useful tool. It creates a vector of indices.

1:4

## [1] 1 2 3 4

We can use this to access a range of elements from x.

x[1:3]

## [1] "a" "b" "c"

It is also possible to modify elements in vectors in the same fashion.

x[1:2] <- c("foo", "bar")
x

## [1] "foo" "bar" "c"   "e"

Similarly, we can remove certain elements by negative indexing in the following way.

x[-1]

## [1] "bar" "c"   "e"

x[-(1:2)]

## [1] "c" "e"

Notice the use of parentheses in the second example. What do you think x[-1:2] would give us?

Matrices

Matrices are two-dimensional arrays. Just like vectors, they can only contain a single type of values.

m <- rbind(c(1, 2), 
           c(3, NA))
m

##      [,1] [,2]
## [1,]    1    2
## [2,]    3   NA

We can access separate rows and columns by separating row and column indices with a comma.

m[1, 2]

## [1] 2

To select an entire column, simply omit the first argument (or vice verse to select an entire row).

m[1, ]

## [1] 1 2

m[, 2]

## [1]  2 NA

Data Frames

Data frames look a lot like matrices except they allow each column to have a different type. Most of the data that you will encounter in this course will be stored as data frames (or as tibbles, but more on that later).

The following code creates a simple data.frame, using the data.frame() function.

d <- data.frame(a = 1:3,
                b = letters[1:3])
d

##   a b
## 1 1 a
## 2 2 b
## 3 3 c

In addition to sub-setting columns using the matrix notation (i.e. d[, 2] for the second row), we can subset columns using $ as follows.

d$a

## [1] 1 2 3

d$b

## [1] "a" "b" "c"

Packages

Installing Packages

One of R’s strengths is the wide-ranging package ecosystem it sports. One package that we will use a lot during this course is ggplot2. If you don’t yet have that package installed, we will now do so by calling the install.packages() function, using ggplot2 as the first argument. Simply call the code below.

install.packages("ggplot2")

When you call this line, R will also install all of the dependencies that come with ggplot2, which includes other packages. If this is the first time you run R, you might also be prompted to choose a CRAN mirror. A good default choice is "https://cloud.r-project.org".

Using Packages

To use a package in R, you can call the library(<package-name>) function, after which all of that packages functions—and often also its datasets—will be available to you directly in your working environment.

Load the ggplot2 package we just installed by calling the following.

library(ggplot2) # library("ggplot2") would work as well!

Sometimes we will also use <package>::function() to call a function from a specific package. In this case, you don’t need to call library first. This is particularly useful if you load two packages that export two functions with the same names.

In the following line, we create a plot in ggplot2 with this syntax, but bear in mind that this is just for show since we’re already loaded the package using library().

ggplot2::ggplot(mpg, ggplot2::aes(cty, hwy)) +
  ggplot2::geom_point()

Functions

Function Basics

As we said in the lecture, you will use functions a lot throughout this course (although you won’t write your own). Therefore, you really need to understand how to call functions.

Here is a simple function:

f1 <- function(a, b) {
  (a + b) ^ 2
}

Its arguments are a and b and it returns the squared sum of a and b. When we call it we must supply both arguments or R will throw an error.

f1(2) # error!

f1(2, 3)

## [1] 25

You can choose to specify arguments by location, as we did above, or by naming each argument as in the following call.

f1(a = 2, b = 3)

## [1] 25

When you use functions in R, however, you will find that you seldom need to supply all the arguments of a function, which is due to them having default values.

The following function, f2(), have defaults for both of its arguments.

f2 <- function(a = 3, b = 0.5) {
  sqrt(a) + 3 * log(b)
}

We can call it without supplying any argument.

f2()

## [1] -0.3473907

We could also supply only a, by simply using one argument.

f2(0) # error

To only supply b, you have to name the argument.

f2(b = 1)

## [1] 1.732051

Vectorization

Many functions in R are vectorized, which means that you can apply the function across an entire vector (or column of a matrix) without any extra work. All of the arithmetics functions you saw previously, as well as log(), sqrt(), and many more, share this feature.

For instance, say that you have the following data frame,

d <- data.frame(
  name = c("Wahijd", "Mary", "Eyvind"),
  diet = c("vegetarian", "meat-eater", "vegetarian"),
  age = c(34, 89, 50)
)
d

##     name       diet age
## 1 Wahijd vegetarian  34
## 2   Mary meat-eater  89
## 3 Eyvind vegetarian  50

and want to convert the age of the three persons to days instead. We can then simply modify the entire vector in place by calling

d$age <- d$age * 365

Errors and Warnings

Errors

We have already shown a few lines of code that when run produces errors. Errors notify you that something has gone wrong, and this always happens when calling a function (remember that everything you do in R is the result of a function call; even 2 + 3 calls a function under the hood).

With every error, you will receive a message indicating what has gone wrong (and sometimes why). Error messages can, however, be rather cryptic. The best advice, if you don’t understand what’s going on, is to simply search for the error message ad verbatim in your favorite search engine. Chances are that someone else has been there before.

Warnings

You may also occasionally receive warnings. Warnings indicate that something may not have gone as planned but that the problem hasn’t been fatal enough to cause the function to terminate. Be weary of warnings. Warnings can be harmless but they may also indicate that something isn’t working properly, and you will do well to investigate if that is the case.

Here is a line that throws a warning. Before you run the line, can you figure out what might cause it to give off a warning? What do you even expect to happen when running the code?

c(3, 2, 1) * c(1, 2)

Conditional Statements

Basic Conditional Operators

Conditional statements are very useful when it comes to data science, particularly for filtering observations based on the values of another variable. Conditionals are logical statements that check each value to see if it is equal to, greater/lesser than, or greater/lesser or equal to some reference.

Conditional statements return TRUE or FALSE and are typically invoked through the following conditional operators:

operator	meaning
`==`	equal to
`!=`	not equal to
`>`	greater than
`<`	lesser than
`>=`	greater than or equal to
`<=`	lesser than or equal to

Let’s see how it works in practice. Try to predict the output of the following conditional statements before running them.

3 > 2

1/3 == 0.33 # is this surprising? why not?

TRUE != TRUE

FALSE == FALSE

TRUE == 1

FALSE != 0

"b" > "a"

"ab" < "ac"

These statements are also vectorized, which means that we can do the following.

c("a", "b", "a") == "a"

## [1]  TRUE FALSE  TRUE

1:5 != 3

## [1]  TRUE  TRUE FALSE  TRUE  TRUE

The `%in%` Operator

Another, very useful, conditional operator is %in%. It checks if the left-hand side of the operator is inside the right-hand side. Observe the following example.

c("a", "b", "c") %in% "a"

## [1]  TRUE FALSE FALSE

This is very useful for checking which observations on a variable belong to a set of levels. Here we re-use the d dataset from above to check which persons are vegans or vegetarians.

d$diet %in% c("vegan", "vegetarian")

## [1]  TRUE FALSE  TRUE

This will come in handy for us repeatedly throughout the course.

Documentation

The easiest way to understand how a function works, what it returns, what arguments it has and what they mean, as well as what the default argument values are is to read the documentation of the function.

If you’re using R Studio (which you really should) you can simply place your cursor on the function, click, and hit F1 to bring up the documentation for the function in the help pane of R Studio. It is also possible to use help(<function-name>) or ?<function-name>.

help(mean) # or ?mean

Source Code

The source code for this document is available at https://github.com/stat-lu/STAE04/blob/master/worked-example-r.Rmd.

R actually does a bit of just-in-time compiling for you, but you will hardly notice.↩︎
Rscript is a helper application that is installed alongside R to make it easy to run scripts.↩︎