Chapter 2 Data Types

There are only 10 types of people in the world: those who understand binary, and those who don’t.
– Anonymous

We already talked about different data structures like vectors, matrices and data frames we can use in R. Now it’s time to dig a little deeper and strengthen our understanding of more basic data organization concepts before we come back to data types, high-level data structures and how to work with real-life data in R.

2.1 Fixed and Floating Point Number Representation

Before we look at the actual data types, let’s get a bit technical and explore how computers deal with numbers. The numbers we normally use are called decimal numbers, which are numbers with a base of \(b = 10\). Thus, we can decompose any real number \(x \in \mathbb{R}\) into a (possibly infinite) sum of powers of our base \(b\), each multiplied by a digit. Let’s look at an example:

\[ 42.125 = x = 4 \cdot 10^{1} + 2 \cdot 10^{0} + 1 \cdot 10^{-1} + 2 \cdot 10^{-2} + 5 \cdot 10^{-3}\]

While numbers in the decimal system can be conveniently processed by humans, computers work differently. In order to store them in memory, we have to perform a base switch with our chosen number \(x\) and rewrite it in binary notation, where \(b = 2\) and therefore only the digits 0 (absence of electrical current) and 1 (presence of electrical current) exist. We know that \(42.125 = 32 + 8 + 2 + 0.125\), which leads to the following:

\[ 42.125 = x = 1 \cdot 2^{5} + 0 \cdot 2^{4} + 1 \cdot 2^{3} + 0 \cdot 2^{2} + 1 \cdot 2^{1} + 0 \cdot 2^{0} + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3} \]

When we write the digits down it becomes clear that \(42.125_{b = 10}\) equals \(101010.001_{b = 2}\). The representation we just discussed is called fixed point representation. To handle real numbers, computers use a very similar scheme called floating point representation. It combines three pieces of information: the algebraic sign, the digits of the number, and an exponent that records how many places the ‘decimal’ point has been moved. After this decomposition our chosen number \(x\) in the decimal system looks like this:

\[ 42.125 = x = (-1)^{0} \cdot 4.2125 \cdot 10^{1}\]

All we have to do now is convert everything into binary again. For compatibility purposes we also need a standard that defines how many binary digits (bits) are used for the sign, the actual number, and the exponent. The usual convention (the IEEE 754 standard) is 32 bits for single precision numbers, divided into 1 bit for the sign, 8 bits for the exponent and 23 bits for the actual number (called the mantissa). Double precision numbers use 64 bits, with 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa.
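As a quick check of the conversion above, the following sketch rebuilds 42.125 from its binary digits, once in fixed point form and once in the sign, mantissa and exponent form. The variable names are chosen for illustration only, and the sketch ignores how the bits are actually laid out in memory.

```r
# Binary digits of 42.125: integer part 101010, fractional part .001
digits    <- c(1, 0, 1, 0, 1, 0, 0, 0, 1)
exponents <- 5:-3                        # powers of 2 from 2^5 down to 2^-3
fixed     <- sum(digits * 2^exponents)   # fixed point: sum of digit * 2^position

# Floating point idea: move the binary point 5 places to the left, which
# gives the mantissa 1.01010001 (binary) and the exponent 5
mantissa <- sum(digits * 2^(0:-8))
floating <- (-1)^0 * mantissa * 2^5

fixed      # equals 42.125
mantissa   # equals 1.31640625
floating   # equals 42.125
```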

With only a fixed number of bits available to form a number, it should be obvious that such an encoding comes at a cost. This cost is usually precision. The range of representable values is limited, and irrational numbers, as well as many ordinary decimal fractions such as 0.1, can only be stored approximately. With a basic idea of how a computer encodes decimal numbers, it should be clear why R needs different data types, so we can now look at the elementary data types our software environment provides. For a deep dive into the pitfalls of floating point arithmetic, have a look at the freely available article What Every Computer Scientist Should Know About Floating-Point Arithmetic.
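The loss of precision is easy to observe in R itself. The following lines are a small sketch of typical effects; the variable names are illustrative, and all.equal() is just one common way to compare doubles with a tolerance.

```r
exact     <- 0.1 + 0.2 == 0.3                   # FALSE: neither side is stored exactly
tolerant  <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: comparison with a small tolerance
root_test <- sqrt(2)^2 == 2                     # FALSE: sqrt(2) is irrational and gets rounded

exact
tolerant
root_test
.Machine$double.eps       # machine precision of doubles, about 2.2e-16
print(0.1, digits = 20)   # reveals the stored approximation of 0.1
```

When comparing results of floating point calculations, a tolerance-based comparison is therefore usually the safer choice than ==.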

2.2 Numeric

Atomic vectors, which are the basic building blocks of all higher order data structures, can be of different types or modes. The mode is a mutually exclusive classification of objects according to their basic structure. The most common atomic modes are numeric, complex, logical and character. Factors, covered in Section 2.6, are not a mode of their own: R stores them as integer vectors with additional attributes. A single object, even if not atomic, can only have a single mode. You can look up the mode using the command mode().

The atomic mode numeric can be divided into two types: one for whole numbers, called integer, and one for floating point numbers, called double. By now the reason for separating these two cases should be obvious: it is quite easy and memory efficient to store a whole number, as we need neither an exponent nor the whole set of conventions we discussed earlier. In contrast, storing a floating point number is more complex and consumes, depending on the desired precision, more bits and therefore more memory. Let’s have a quick look at what R has to offer here:

vec <- 1:5
mode(vec)
#R> [1] "numeric"
typeof(vec)
#R> [1] "integer"
sequence <- seq(1 , 5 , by=.5)
typeof(sequence)
#R> [1] "double"
# Mixing integers and doubles will produce a vector of type double
mixed <- c(1:5 , seq(1 , 5 , by=.5))
typeof(mixed)         
#R> [1] "double"

Generally, R stores numbers as double and will use integers only where this readily applies, for example for sequences created with the : operator, or if you specifically force R to treat a number as an integer using the as.integer() command. A vector of type integer uses less memory than a vector of type double with the same length. The reason for this is the internal representation of numbers we discussed at the beginning of this chapter.
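The following sketch illustrates these points. The L suffix for integer literals and object.size() are not used elsewhere in this chapter and are brought in here only for illustration; the exact sizes reported depend on your platform, so only the comparison is shown.

```r
typeof(1)          # "double": plain numbers are stored as double
typeof(1L)         # "integer": the suffix L requests an integer
as.integer(3.9)    # 3: the fractional part is cut off, not rounded
is.integer(5)      # FALSE, even though 5 is a whole number

# Memory comparison for vectors of the same length
ints <- integer(1000)
dbls <- double(1000)
smaller <- as.numeric(object.size(ints)) < as.numeric(object.size(dbls))
smaller            # TRUE
```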

2.3 Complex

R can handle complex numbers. It is doubtful that you will come across them when doing simple data analysis, but you likely will when performing more advanced calculations, e.g. in time series analysis. You can create a complex vector simply by adding an imaginary term to a real number.

num <- sqrt(as.complex(-64))
typeof(num)
#R> [1] "complex"
num <- 1 + 2i
typeof(num)
#R> [1] "complex"
is.complex(num)
#R> [1] TRUE
Re(num)                 # Returns real part of a complex number
#R> [1] 1
Im(num)                 # Returns imaginary part of a complex number
#R> [1] 2

To perform more calculations or obtain more information when dealing with complex numbers you can also use Mod() for the modulus and Arg() for the argument of a given complex number.
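A short sketch of these functions, using an arbitrarily chosen number z; Conj() is an additional base R function not mentioned above.

```r
z <- 3 + 4i
Mod(z)           # 5, the distance from the origin: sqrt(3^2 + 4^2)
Arg(1i)          # pi/2, the angle to the positive real axis in radians
Conj(z)          # 3-4i, the complex conjugate
z * Conj(z)      # 25+0i, which is Mod(z)^2
sqrt(-1 + 0i)    # 0+1i: complex input gives a complex result
```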

2.4 Logical

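Logicals can hold the value TRUE or FALSE and are often the output of comparisons. To save some typing, R allows you to abbreviate TRUE and FALSE as T and F, respectively. Keep in mind that T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words, so the long forms are the safer choice in scripts.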

x <- TRUE
x
#R> [1] TRUE
1 > 5
#R> [1] FALSE
bool <- c(T, F, T, F, T)
bool
#R> [1]  TRUE FALSE  TRUE FALSE  TRUE
sum(bool)               # ATTENTION: Automatic conversion to 1 (T) and 0 (F)
#R> [1] 3

When using TRUE and FALSE in calculations they are automatically converted into their underlying numerical representation. We are going to talk about this phenomenon, called coercion, in detail in Section 2.8.1.
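This conversion is what makes counting with comparisons so convenient. The following sketch uses made-up exam scores to illustrate the idea.

```r
scores <- c(55, 82, 91, 47, 68)   # hypothetical exam scores
passed <- scores >= 60            # logical vector from a comparison
passed
sum(passed)                       # 3 scores reach the threshold
mean(passed)                      # 0.6, the share of scores reaching it
which(passed)                     # positions 2, 3 and 5
scores[passed]                    # logical vectors can also select elements
```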

When it comes to logical values R provides a lot of operators to compare and evaluate objects - these are called logical operators.

Code     Description                          Syntax
>        Greater than                         a > b
>=       Greater than or equal to             a >= b
<        Less than                            a < b
<=       Less than or equal to                a <= b
==       Exactly equal to                     a == b
!=       Not equal to                         a != b
!        Logical negation (NOT)               !a
|        OR (elementwise)                     a | b
||       OR (short-circuit, single values)    a || b
&        AND (elementwise)                    a & b
&&       AND (short-circuit, single values)   a && b
%in%     Is element in group of elements      a %in% c(a,b,c)
xor()    Exclusive or (XOR)                   xor(a,b)

A lot of these logical operators can come in handy when structuring your code and using conditional statements to handle different events or cases in your program. If you don’t know how they work, you can simply try them out by giving a and b different TRUE or FALSE values, or you can construct a truth table in the following way.

values <- c(NA, FALSE, TRUE)
names(values) <- as.character(values)

outer(values, values, "&")     # Truth table for AND
#R>        <NA> FALSE  TRUE
#R> <NA>     NA FALSE    NA
#R> FALSE FALSE FALSE FALSE
#R> TRUE     NA FALSE  TRUE
outer(values, values, "|")     # Truth table for OR
#R>       <NA> FALSE TRUE
#R> <NA>    NA    NA TRUE
#R> FALSE   NA FALSE TRUE
#R> TRUE  TRUE  TRUE TRUE
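To see some of the operators at work in a conditional statement, consider the following sketch. The input value and the list of allowed answers are made up for illustration, and any() and all() are additional base R helpers that do not appear in the table.

```r
answer  <- "maybe"                  # hypothetical user input
allowed <- c("yes", "no")
v       <- c(1, 5, 10)

valid    <- answer %in% allowed     # FALSE
in_range <- v > 3 & v < 8           # FALSE TRUE FALSE (elementwise AND)
any(v > 8)                          # TRUE: at least one element matches
all(v > 8)                          # FALSE: not every element matches

if (!valid) {
  message("Unknown answer: ", answer)
}
```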

Operations using logicals can sometimes be tricky: the results may seem wrong, as if R were trying to fool you. So pay attention when dealing with them and try to get a sense of how they are evaluated. Here is an example of a tricky evaluation. To understand the following lines you should know what the exclusive or (xor) does and what differentiates it from the normal or. Trying to come up with a line of code to construct a truth table for it in R may help you understand the following:

xor(T, T) == T | T       
#R> [1] TRUE
xor(T,T)
#R> [1] FALSE
T | T
#R> [1] TRUE
FALSE == TRUE                      # Naive reading of the first line: this is FALSE
#R> [1] FALSE

Although it seems as if R is making an error here, it is not. R evaluates our line of code according to its operator precedence rules, in which == is evaluated before |, and this differs from how the expression appears at first glance. Here is how R evaluates the parts of our expression:

# Judging by the definition of XOR one would expect the result FALSE
xor(T,T) == T|T       
#R> [1] TRUE
xor(T,T)            # 1. Step
#R> [1] FALSE
xor(T,T) == T       # 2. Step
#R> [1] FALSE
xor(T,T) == T|T     # 3. Step  
#R> [1] TRUE
F|T                 # Aggregation (same as 3. step)
#R> [1] TRUE
# If you want R to behave as it seems on first look you have
# to use parentheses:

xor(T,T) == (T|T)   # Evaluates each side individually, then compares
#R> [1] FALSE

The following code illustrates the difference between | and || and shows why using || as the operator for the logical OR can sometimes come in handy. Remember that the logical OR returns TRUE if at least one element is TRUE. When the first element is TRUE, the result is independent of the second element.

rm(x)   # Make sure variable x does not exist

TRUE | x      # Element x does not exist, so R returns an Error!
#R> Error in eval(expr, envir, enclos): object 'x' not found
TRUE || x     # Element x does not exist, but the result can be determined
              # without actually touching the object: || stops evaluating
              # as soon as the result is known.
#R> [1] TRUE
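The same behavior of && is often used to guard a condition that would otherwise fail. The helper safe_sqrt() below is a hypothetical function written only for this sketch.

```r
# Hypothetical helper: square root only for a single non-negative number
safe_sqrt <- function(v) {
  # Each check is evaluated only if all checks before it returned TRUE
  if (is.numeric(v) && length(v) == 1 && v >= 0) {
    sqrt(v)
  } else {
    NA_real_
  }
}

safe_sqrt(16)      # 4
safe_sqrt("16")    # NA: is.numeric() is FALSE, so the later checks are skipped
safe_sqrt(-4)      # NA
```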

2.5 Character

Character vectors store pieces of text from a single character to whole sentences. You can easily create a character vector by putting text in quotes and R will handle the rest for you.

char <- "TRUE"
typeof(char)
#R> [1] "character"
num <- "3.14"
# num * 3      # This produces an error
as.numeric(num) * 3
#R> [1] 9.42

A common mistake is to confuse numbers that were imported as characters with numerics. The conversion shown above comes in handy when handling data, and it is good advice to check the data types if your calculations with important data look suspicious or won’t work at all.
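A minimal sketch of such a check, assuming a made-up column of imported values in which one entry is not a number:

```r
imported <- c("1.5", "2", "n/a")     # hypothetical imported column
is.character(imported)               # TRUE: the numbers arrived as text

# Entries that cannot be converted become NA (R also issues a warning,
# which is suppressed here to keep the output short)
values <- suppressWarnings(as.numeric(imported))
values                               # 1.5 2.0 NA
sum(is.na(values))                   # 1 entry could not be converted
```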

2.6 Factors

Factors are helpful to represent nominally and ordinally scaled variables.

2.6.1 Unordered Factors

x <- factor(c("yes","yes","no","yes","no"))
x 
#R> [1] yes yes no  yes no 
#R> Levels: no yes
table(x)
#R> x
#R>  no yes 
#R>   2   3
unclass(x) # Levels are automatically generated in alphabetical order!
#R> [1] 2 2 1 2 1
#R> attr(,"levels")
#R> [1] "no"  "yes"
x <- factor(c("yes","yes","no","yes","no"),
            levels = c("yes","no"))
unclass(x) # By setting levels explicitly the internal order can be defined.
#R> [1] 1 1 2 1 2
#R> attr(,"levels")
#R> [1] "yes" "no"
x[1] == x[3] # Nominally scaled variables can be checked for equality
#R> [1] FALSE
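A few more helper functions are useful when working with factors. The sketch below uses illustrative variable names and also shows a common pitfall when numbers are stored as a factor.

```r
answers <- factor(c("yes", "yes", "no", "yes", "no"))
levels(answers)          # "no" "yes"
nlevels(answers)         # 2
as.integer(answers)      # 2 2 1 2 1, the internal codes
as.character(answers)    # back to the original labels

# Pitfall: numbers stored as a factor
years <- factor(c("2010", "2020", "2010"))
as.numeric(years)                  # 1 2 1, the codes, not the years
as.numeric(as.character(years))    # 2010 2020 2010
```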

2.6.2 Ordered Factors

Ordinally scaled variables, which are often found in questionnaires, are a typical use case for ordered factors. An example is the Likert scale.

likert <- factor(c(2,3,1,5,2,4,5,2,3,3),
                 levels = c(1,2,3,4,5),
                 labels = c( "strongly disagree", "disagree", "don’t know", "agree", "strongly agree"),
                 ordered = TRUE)

likert
#R>  [1] disagree          don’t know        strongly disagree strongly agree   
#R>  [5] disagree          agree             strongly agree    disagree         
#R>  [9] don’t know        don’t know       
#R> 5 Levels: strongly disagree < disagree < don’t know < ... < strongly agree
likert[1] < likert[2]
#R> [1] TRUE
likert[1] > likert[2]
#R> [1] FALSE
likert[1] + likert[2]
#R> Warning in Ops.ordered(likert[1], likert[2]): '+' is not meaningful for ordered
#R> factors
#R> [1] NA
likert[1] == likert[5]
#R> [1] TRUE
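Because the levels of an ordered factor have a rank, functions that rely on an ordering work as well. The following sketch uses a small made-up factor of sizes.

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)
is.ordered(sizes)                   # TRUE
sort(sizes)                         # small small medium large
max(sizes)                          # large
at_least_medium <- sizes >= "medium"
at_least_medium                     # FALSE TRUE TRUE FALSE
table(sizes)                        # counts appear in level order, not alphabetically
```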

2.7 Missing and Raw Data

Besides the data types discussed so far, R supports two more. A missing or undefined value is indicated by NA, which stands for not available. NA is not a data type of its own; by default R treats it as a logical value. Many functions can handle data containing NAs and provide options to work with the respective dataset anyway. Be careful not to mistake NA for the reserved term NaN (not a number), which indicates the result of an undefined calculation such as 0/0.

num <- c(1 , 2 , NA , 4 , 5)
num + 3
#R> [1]  4  5 NA  7  8
is.na(num)
#R> [1] FALSE FALSE  TRUE FALSE FALSE
sum(is.na(num))               # Number of NAs in dataset
#R> [1] 1
which(is.na(num))             # Location/Index of the NA
#R> [1] 3
mean(num)                     # Mean of the data cannot be calculated
#R> [1] NA
mean(num , na.rm=TRUE)        # Omitting the NA when calculating the mean
#R> [1] 3
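The difference between NA and NaN can be explored with a few lines. This is only a sketch; NA_real_ is one of several typed NA constants that base R provides.

```r
0 / 0                # NaN: the result of an undefined calculation
is.nan(0 / 0)        # TRUE
is.na(0 / 0)         # TRUE: NaN also counts as missing
is.nan(NA)           # FALSE: NA is missing, but not the result of a calculation
NA == NA             # NA: comparing unknown values gives an unknown result
typeof(NA)           # "logical"
typeof(NA_real_)     # "double": a typed variant of NA
typeof(c(1.5, NA))   # "double": NA is coerced to fit the vector
```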

R also supports raw vectors, which store individual bytes and display them in hexadecimal notation. The hexadecimal system is a relative of our well-known decimal system and the binary system we talked about at the beginning of this chapter: hexadecimal numbers are numbers with base \(b = 16\). Raw vectors come in handy when reading files in binary formats. For additional information about how to work with the type raw, see the corresponding help page with ?raw.

raw(3)                        # Create empty raw vector of length 3
#R> [1] 00 00 00
x <- as.raw(15)               # Convert number to raw
x
#R> [1] 0f
typeof(x)
#R> [1] "raw"
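Raw vectors also make it easy to look at the bytes behind a piece of text. A brief sketch, using the base R functions charToRaw() and rawToChar():

```r
bytes <- charToRaw("R!")   # text to bytes
bytes                      # 52 21 (hexadecimal for 82 and 33)
as.integer(bytes)          # 82 33
rawToChar(bytes)           # "R!"
as.raw(255)                # ff, the largest value a single byte can hold
```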

2.8 Coercion, Attributes, and Class

2.8.1 Coercion

We have already seen that R sometimes coerces data types automatically. That makes it possible to calculate the sum of a vector of logical values, which tells you how many elements with the value TRUE are present. R follows strict rules when coercing data types. If a string is present in a vector, everything will be converted to strings. When there are only logicals and numbers in a vector, R converts the logicals to their numeric values, so that every TRUE becomes 1 and every FALSE becomes 0. The guiding principle is to preserve information rather than lose it, even at the cost of higher memory requirements and reduced compatibility. A graphical representation is shown in the following figure.

FIGURE 2.1: Coercion Cycle and Coercion Rules
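The coercion rules can also be checked directly in the console. The sketch below follows the order in which base R's c() promotes types; it is not derived from the figure itself.

```r
typeof(c(TRUE, 2L))         # "integer":   logical -> integer
typeof(c(2L, 3.5))          # "double":    integer -> double
typeof(c(3.5, 1+2i))        # "complex":   double  -> complex
typeof(c(TRUE, 3.5, "a"))   # "character": everything becomes a string
mixed_chr <- c(TRUE, 3.5, "a")
mixed_chr                   # "TRUE" "3.5" "a"

as.logical(c(0, 1, 5))      # FALSE TRUE TRUE: any non-zero number is TRUE
```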

Data frames and lists can handle multiple data types in the same structure, but there is a good reason not to mix everything up in one huge data frame, and this reason is math. Vectors and matrices that contain a single type of data support mathematical operations that could not be performed on a mixed-type data structure, and because they are so simple to store in memory, these operations are fast.

2.8.2 Attributes and Names

Attributes are R’s interpretation of metadata. They can be attached to any type of object. Apart from a few special attributes such as dim and class, they won’t affect your calculations or other operations; when you print the object, R lists them below the values. That makes it very convenient to store a description or any additional information you want to hand over with your object. Of course, you can access the attributes within functions, which allows you to perform special tasks if an object carries certain attributes. Here is how you assign and display attributes.

x <- 1:5
attributes(x)            # A simple numeric vector has no attributes
#R> NULL

As you can see above, the returned result is NULL, which is equivalent to neither 0 nor NA. NULL is R’s null object, comparable to what computer scientists call a null pointer or to the empty set \(\varnothing\). Every time R returns NULL, it simply expresses that there is nothing there.

description  <- list(Description = "Simple numerical vector")
attributes(x) <- description

attributes(x)
#R> $Description
#R> [1] "Simple numerical vector"

Another way of adding descriptive information to your data is the names() function. This is useful for datasets; you can think of names as a header for your data. It is by far the most common way to enrich data in R. Names behave like other attributes: they won’t affect the behavior of the vector, meaning you can still perform all calculations.

names(x)
#R> NULL
names(x) <- c("one","two","three","four","five")

x
#R>   one   two three  four  five 
#R>     1     2     3     4     5 
#R> attr(,"Description")
#R> [1] "Simple numerical vector"
x <- x^2

x
#R>   one   two three  four  five 
#R>     1     4     9    16    25 
#R> attr(,"Description")
#R> [1] "Simple numerical vector"
names(x) <- NULL         # To delete names set them to NULL
x <- unname(x)           # Alternative way to remove the names

x
#R> [1]  1  4  9 16 25
#R> attr(,"Description")
#R> [1] "Simple numerical vector"
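Single attributes can also be set and read with attr(). The vector and attribute names in this sketch are made up for illustration.

```r
heights <- c(172, 181, 165)            # hypothetical measurements
attr(heights, "unit") <- "cm"          # set a single attribute
attr(heights, "unit")                  # "cm"
attr(heights, "source")                # NULL: this attribute was never set

names(heights) <- c("Ann", "Ben", "Cem")
"unit" %in% names(attributes(heights)) # TRUE: names and unit are both stored
heights[["Ben"]]                       # 181: names can be used for indexing
```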

2.8.3 Class

A class in R is a property assigned to an object that describes the kind of data it stores. Classes are used to control how generic functions like summary() behave. It is not a mutually exclusive classification: an object can carry several classes at once. If an object has no class attribute, such as a simple numeric vector, class() reports an implicit class derived from its type, e.g. "integer" for an integer vector and "numeric" for a vector of doubles. If you transform data, e.g. from a vector to a matrix, the class reported by R changes accordingly.

num <- 1:10
class(num)
#R> [1] "integer"
typeof(num) == class(num)
#R> [1] TRUE
mod <- lm(wage ~ 1 + education)  # Regression from previous chapter
class(mod)
#R> [1] "lm"

When summarizing the object mod, which contains the linear model, and the numeric vector num, one can easily see that different outputs are produced depending on the class of the object. The same mechanism can also be used to define your own generic functions and to control their behavior dynamically and conveniently.

summary(num)
#R>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#R>    1.00    3.25    5.50    5.50    7.75   10.00
summary(mod)
#R> 
#R> Call:
#R> lm(formula = wage ~ 1 + education)
#R> 
#R> Residuals:
#R>     Min      1Q  Median      3Q     Max 
#R> -5.3396 -2.1501 -0.9674  1.1921 16.6085 
#R> 
#R> Coefficients:
#R>             Estimate Std. Error t value Pr(>|t|)    
#R> (Intercept) -0.90485    0.68497  -1.321    0.187    
#R> education    0.54136    0.05325  10.167   <2e-16 ***
#R> ---
#R> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#R> 
#R> Residual standard error: 3.378 on 524 degrees of freedom
#R> Multiple R-squared:  0.1648,    Adjusted R-squared:  0.1632 
#R> F-statistic: 103.4 on 1 and 524 DF,  p-value: < 2.2e-16
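As a minimal sketch of how such a generic function could be defined, consider the following example. The generic describe() and the class temperature are hypothetical names invented for this illustration; the mechanism shown is R's S3 dispatch via UseMethod().

```r
# Hypothetical generic: dispatches on the class of its argument
describe <- function(obj, ...) UseMethod("describe")

# Fallback method for objects without a more specific method
describe.default <- function(obj, ...) {
  paste("An object of class", class(obj)[1])
}

# Method that is called for objects of the made-up class "temperature"
describe.temperature <- function(obj, ...) {
  paste("Temperatures with mean", mean(unclass(obj)), "degrees")
}

temps <- structure(c(18, 21, 24), class = "temperature")

describe(1:10)                   # "An object of class integer"
describe(temps)                  # "Temperatures with mean 21 degrees"
inherits(temps, "temperature")   # TRUE
```

R picks the method by appending the class name to the name of the generic, which is the same convention that lets summary() behave differently for num and mod.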

Exercises

EX 1

Write down a truth table for the exclusive or.

EX 2

What does double mean in the context of computer science?

EX 3

What is TRUE + TRUE? Why?

EX 4

Create the following data structure.
#R>   x y     z
#R> 1 1 c  TRUE
#R> 2 3 b FALSE