Chapter 2 Data Types
There are only 10 types of people in the world: those who understand binary, and those who don’t.
– Anonymous
We already talked about different data structures like vectors, matrices and data frames we can use in R. Now it’s time to dig a little deeper and strengthen our understanding of more basic data organization concepts before we come back to data types, high-level data structures and how to work with real-life data in R.
2.1 Fixed and Floating Point Number Representation
Before we look at the actual data types, let’s get a bit technical and explore how computers deal with numbers. The numbers we normally use are called decimal numbers, which are numbers written with a base of \(b = 10\). Thus, we can decompose any real number \(x \in \mathbb{R}\) into a (possibly infinite) sum of powers of our base \(b\), each multiplied by a digit. Let’s look at an example:
\[ 42.125 = x = 4 \cdot 10^{1} + 2 \cdot 10^{0} + 1 \cdot 10^{-1} + 2 \cdot 10^{-2} + 5 \cdot 10^{-3}\]
While numbers in the decimal system can be conveniently processed by humans, computers work differently. To store a number in memory, we have to change its base and rewrite our chosen number \(x\) in binary notation, where \(b = 2\) and therefore only the digits 0 (absence of electrical current) and 1 (presence of electrical current) exist. We know that \(42.125 = 32 + 8 + 2 + 0.125\), which leads to the following:
\[ 42.125 = x = 1 \cdot 2^{5} + 0 \cdot 2^{4} + 1 \cdot 2^{3} + 0 \cdot 2^{2} + 1 \cdot 2^{1} + 0 \cdot 2^{0} + 0 \cdot 2^{-1} + 0 \cdot 2^{-2} + 1 \cdot 2^{-3} \]
When we write the digits down it becomes clear that \(42.125_{b = 10}\) equals \(101010.001_{b =2}\). The representation we just discussed is called fixed point representation. Computers use a very similar version to handle real numbers, called floating point representation. It combines three pieces of information: the algebraic sign, the digits of the number in the binary system, and an exponent that records how many places the ‘decimal’ point has been moved. Applied to the decimal system for illustration, our chosen number \(x\) looks like this:
\[ 42.125 = x = (-1)^{0} \cdot 4.2125 \cdot 10^{1}\]
All we have to do now is convert everything to binary again. For compatibility purposes we also need a standard that defines how many binary digits (bits) we want to use for the sign, the actual number, and the exponent. The usual convention for single precision numbers is 32 bits, divided into 1 bit for the sign, 8 bits for the exponent and 23 bits for the actual number (called the mantissa). The convention for double precision numbers is 64 bits, with 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa.
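As a worked sketch of that conversion for our example, assume the normalization of the widespread IEEE 754 standard (which the text above does not name explicitly), where the binary point is moved until exactly one 1 remains in front of it:

\[ 42.125 = 101010.001_{b = 2} = (-1)^{0} \cdot 1.01010001_{b = 2} \cdot 2^{5} \]

Under this assumption, a double precision number would store the sign bit 0, the mantissa digits after the binary point (01010001 followed by zeros, the leading 1 being implied), and the exponent 5 shifted by an offset of 1023, i.e. \(1028 = 10000000100_{b = 2}\).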
Because only a fixed number of bits is available to form a number, such a coding comes at a cost, and that cost is usually precision. The range of representable values is limited, and numbers that need infinitely many binary digits, such as irrational numbers but also simple decimal fractions like 0.1, can only be stored approximately. With a basic idea of how a computer codes decimal numbers, it should be clear why R needs different data types, so we can now start to look at the elementary data types our software environment provides. If you want a deep dive into the pitfalls of floating point arithmetic, have a look at the freely available article What Every Computer Scientist Should Know About Floating-Point Arithmetic.
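The precision cost can be made visible directly in R. The following is a minimal sketch; the variable name total is chosen for illustration only.

```r
# 0.1 and 0.2 have no finite binary expansion, so both are stored approximately
total <- 0.1 + 0.2
total == 0.3                    # FALSE: the rounding errors do not cancel out
isTRUE(all.equal(total, 0.3))   # TRUE: comparison with a small tolerance

# Irrational numbers are cut off after a fixed number of binary digits as well
sqrt(2)^2 == 2                  # FALSE

# Spacing between 1 and the next larger double (52 mantissa bits)
.Machine$double.eps == 2^-52    # TRUE
```

For this reason, comparing doubles with all.equal() is usually safer than comparing them with ==.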
2.2 Numeric
Atomic vectors, which are the base element of all higher-order data structures, can be of different types or modes. The mode is a mutually exclusive classification of objects according to their basic structure. The atomic modes are numeric, complex, logical and character (plus raw, see below); factors, which are also covered in this chapter, are built on top of integer vectors rather than being a mode of their own. A single object, even if not atomic, can only have a single mode. You can look up the mode using the command mode().
The atomic mode numeric can be divided into two types, one for whole numbers, called integer, and one for floating point numbers, called double. By now the reasons why these two cases were separated should be obvious: it is quite easy and memory efficient to store a whole number, as we need neither an exponent nor the other conventions we discussed earlier. In contrast, storing a floating point number is more complex and consumes, depending on the desired precision, more bits and therefore more memory. Let’s have a quick look at what R has to offer here:
#R> [1] "numeric"
#R> [1] "integer"
#R> [1] "double"
# Mixing integers and doubles will produce a vector of type double
mixed <- c(1:5 , seq(1 , 5 , by=.5))
typeof(mixed)
#R> [1] "double"
Generally, R stores numbers as double and uses integers only where they are the natural result (for example for sequences created with the colon operator) or if you explicitly force R to treat a number as an integer using the as.integer() command. A vector of type integer uses less memory than a vector of type double with the same length. The reason for this is the internal representation of numbers we discussed at the beginning of this chapter.
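A minimal sketch of how the two numeric types can be inspected and compared; all variable names are chosen for illustration only.

```r
dbl <- c(1, 2, 3)        # plain numbers are stored as double
int <- c(1L, 2L, 3L)     # the suffix L requests integers

mode(dbl)                # "numeric"
mode(int)                # "numeric"
typeof(dbl)              # "double"
typeof(int)              # "integer"
typeof(1:3)              # "integer": the colon operator yields integers here

as.integer(3.9)          # 3: the fractional part is dropped, not rounded

# Same length, different storage: 4 bytes vs. 8 bytes per element
int_vec <- integer(1000)
dbl_vec <- numeric(1000)
object.size(int_vec) < object.size(dbl_vec)   # TRUE
```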
2.3 Complex
R can handle complex numbers. You are unlikely to come across them in simple data analysis, but they do appear in more advanced calculations, e.g. in time series analysis. You can create a complex vector simply by adding an imaginary term to a real number.
#R> [1] "complex"
#R> [1] "complex"
#R> [1] TRUE
#R> [1] 1
#R> [1] 2
To perform more calculations or obtain more information when dealing with complex numbers you can also use Mod() for the modulus and Arg() for the argument of a given complex number.
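The helper functions for complex numbers can be tried out on a small example. This is a sketch with an arbitrarily chosen number z.

```r
z <- 3 + 4i

typeof(z)   # "complex"
Re(z)       # 3, the real part
Im(z)       # 4, the imaginary part
Mod(z)      # 5, the modulus sqrt(3^2 + 4^2)
Arg(z)      # the angle in radians, identical to atan2(4, 3)
Conj(z)     # 3-4i, the complex conjugate

# A negative real number must be converted first; sqrt(-1) alone yields NaN
sqrt(as.complex(-1))   # 0+1i
```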
2.4 Logical
Logicals can hold the value TRUE or FALSE and are often the output of comparisons. To save typing, R allows you to abbreviate TRUE and FALSE as T and F, respectively. Keep in mind that T and F are ordinary variables that can be overwritten, so spelling out TRUE and FALSE is the safer choice in programs.
#R> [1] TRUE
#R> [1] FALSE
#R> [1] TRUE FALSE TRUE FALSE TRUE
#R> [1] 3
When using TRUE and FALSE in calculations they are automatically converted into their underlying numerical representation. We are going to talk about this phenomenon, called coercion, in detail in the section on coercion below.
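Because logicals are converted to numbers in calculations, counting and averaging over a comparison is straightforward. A minimal sketch with made-up values:

```r
scores <- c(3, 8, 1, 9, 4)

passed <- scores > 3   # comparison yields a logical vector
passed                 # FALSE  TRUE FALSE  TRUE  TRUE

sum(passed)            # 3, the number of TRUE values
mean(passed)           # 0.6, the share of TRUE values
which(passed)          # 2 4 5, the positions of the TRUE values
```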
When it comes to logical values R provides a lot of operators to compare and evaluate objects - these are called logical operators.
| Code | Description | Syntax |
|---|---|---|
| > | Greater than | a > b |
| >= | Greater than or equal to | a >= b |
| < | Less than | a < b |
| <= | Less than or equal to | a <= b |
| == | Exactly equal to | a == b |
| != | Not equal to | a != b |
| ! | Logical negation (NOT) | !a |
| \| | OR (elementwise) | a \| b |
| \|\| | OR (stepwise) | a \|\| b |
| & | AND (elementwise) | a & b |
| && | AND (stepwise) | a && b |
| %in% | Is element in group of elements | a %in% c(a,b,c) |
| xor() | Exclusive or (XOR) | xor(a,b) |
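The operators from the table can be sketched on two short logical vectors. The vectors a and b are made up for illustration.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

a & b        # TRUE FALSE FALSE   elementwise AND
a | b        # TRUE  TRUE  TRUE   elementwise OR
xor(a, b)    # FALSE TRUE  TRUE   TRUE where exactly one side is TRUE
!a           # FALSE TRUE FALSE   negation

# The stepwise operators are meant for single values
a[1] && b[2]           # TRUE

# %in% checks membership for every element on the left-hand side
c(2, 7) %in% c(1, 2, 3)   # TRUE FALSE
```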
Many of these logical operators come in handy when structuring your code and using conditional statements to handle different events or cases in your program. If you don’t know how they work, you can try them out by giving a and b different TRUE or FALSE values, or you can construct a truth table in the following way.
values <- c(NA, FALSE, TRUE)
names(values) <- as.character(values)
outer(values, values, "&") # Truth table for AND
#R> <NA> FALSE TRUE
#R> <NA> NA FALSE NA
#R> FALSE FALSE FALSE FALSE
#R> TRUE NA FALSE TRUE
outer(values, values, "|") # Truth table for OR
#R> <NA> FALSE TRUE
#R> <NA> NA NA TRUE
#R> FALSE NA FALSE TRUE
#R> TRUE TRUE TRUE TRUE
Operations using logicals can sometimes be tricky and may appear to give wrong results. So pay attention when dealing with them and try to get a sense of how they are evaluated. Here is an example of a tricky evaluation. To understand the following lines you should know what the exclusive or (xor) does and what differentiates it from the normal or. Trying to come up with a line of code to construct a truth table for it in R may help to understand the following:
#R> [1] TRUE
#R> [1] FALSE
#R> [1] TRUE
#R> [1] FALSE
Although it seems as if R is making an error here, it is not. It evaluates our line of code strictly according to its rules of operator precedence, and this order differs from the way the expression reads at first glance. Here is how R evaluates the parts of our expression:
#R> [1] TRUE
#R> [1] FALSE
#R> [1] FALSE
#R> [1] TRUE
#R> [1] TRUE
# If you want R to behave as it seems on first look you have
# to use parentheses:
xor(T,T) == (T|T) # Evaluates each side individually, then compares
#R> [1] FALSE
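The role of operator precedence can be sketched step by step. The variable names are made up for illustration; the relevant rule, documented under ?Syntax, is that == binds more tightly than |.

```r
# Without parentheses R reads this as (xor(TRUE, TRUE) == TRUE) | TRUE
unbracketed <- xor(TRUE, TRUE) == TRUE | TRUE

step1 <- xor(TRUE, TRUE)   # FALSE: both sides are TRUE
step2 <- step1 == TRUE     # FALSE
step3 <- step2 | TRUE      # TRUE: anything OR TRUE is TRUE

unbracketed                # TRUE, identical to step3

# With parentheses both sides are evaluated first and then compared
bracketed <- xor(TRUE, TRUE) == (TRUE | TRUE)
bracketed                  # FALSE
```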
The following code illustrates the difference between | and || and shows why using || as operator for the logical OR can come in handy sometimes. Remember that the logical OR returns TRUE if at least one element is TRUE. When the first element is TRUE, the result is independent of the second element.
rm(x) # Make sure variable x does not exist
TRUE | x # x does not exist and | evaluates both sides, so R returns an error!
#R> Error in eval(expr, envir, enclos): object 'x' not found
TRUE || x # || stops after the first TRUE, so x is never evaluated
#R> [1] TRUE
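The same stepwise evaluation makes && useful for guarding checks that would otherwise fail. The following sketch uses a hypothetical helper function is_positive_number() that is not part of R.

```r
# && stops as soon as the result is known, so later checks are only
# evaluated when all earlier ones have passed
is_positive_number <- function(v) {
  !is.null(v) && is.numeric(v) && length(v) == 1 && v > 0
}

is_positive_number(5)        # TRUE
is_positive_number(-1)       # FALSE
is_positive_number(NULL)     # FALSE, v > 0 is never evaluated
is_positive_number("five")   # FALSE, the comparison with 0 is skipped
```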
2.5 Character
Character vectors store pieces of text from a single character to whole sentences. You can easily create a character vector by putting text in quotes and R will handle the rest for you.
#R> [1] "character"
#R> [1] 9.42
A common mistake is to confuse numbers that were imported as characters with numerics. Converting them explicitly comes in handy when handling data, and it is good advice to check the data types if calculations with important data look suspicious or won’t work at all.
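A minimal sketch of checking and converting such data; the vector imported stands in for values read from a file.

```r
imported <- c("3.14", "2.5", "10")   # numbers that arrived as text

typeof(imported)        # "character"
is.numeric(imported)    # FALSE

parsed <- as.numeric(imported)       # explicit conversion
typeof(parsed)          # "double"
sum(parsed)             # 15.64

# Text that is not a number becomes NA (R also issues a warning)
suppressWarnings(as.numeric("abc"))  # NA
```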
2.6 Factors
Factors are helpful to represent nominally and ordinally scaled variables.
2.6.1 Unordered Factors
#R> [1] yes yes no yes no
#R> Levels: no yes
#R> x
#R> no yes
#R> 2 3
#R> [1] 2 2 1 2 1
#R> attr(,"levels")
#R> [1] "no" "yes"
x <- factor(c("yes","yes","no","yes","no"),
levels = c("yes","no"))
unclass(x) # By setting levels explicitly the internal order can be defined.
#R> [1] 1 1 2 1 2
#R> attr(,"levels")
#R> [1] "yes" "no"
#R> [1] FALSE
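The internal structure of a factor, integer codes plus a set of levels, can be inspected as follows. This is a sketch with made-up data.

```r
answers <- factor(c("yes", "yes", "no", "yes", "no"))

levels(answers)       # "no" "yes": alphabetical order by default
as.integer(answers)   # 2 2 1 2 1: the internal integer codes
nlevels(answers)      # 2
is.ordered(answers)   # FALSE

# Pitfall: a factor made of digits returns its codes, not its values
f <- factor(c("10", "20", "10"))
as.integer(f)                 # 1 2 1
as.numeric(as.character(f))   # 10 20 10
```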
2.6.2 Ordered Factors
Ordinally scaled variables, which are often found in questionnaires, are a typical use case for ordered factors. An example is the Likert scale.
likert <- factor(c(2,3,1,5,2,4,5,2,3,3),
levels = c(1,2,3,4,5),
labels = c( "strongly disagree", "disagree", "don’t know", "agree", "strongly agree"),
ordered = TRUE)
likert
#R> [1] disagree don’t know strongly disagree strongly agree
#R> [5] disagree agree strongly agree disagree
#R> [9] don’t know don’t know
#R> 5 Levels: strongly disagree < disagree < don’t know < ... < strongly agree
#R> [1] TRUE
#R> [1] FALSE
#R> Warning in Ops.ordered(likert[1], likert[2]): '+' is not meaningful for ordered
#R> factors
#R> [1] NA
#R> [1] TRUE
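Ordered factors support comparisons but no arithmetic. A short sketch with a made-up variable sizes:

```r
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"),
                ordered = TRUE)

is.ordered(sizes)           # TRUE
sizes[1] < sizes[2]         # TRUE: small ranks below large
sizes >= "medium"           # FALSE  TRUE  TRUE FALSE
as.character(max(sizes))    # "large"
as.character(sort(sizes))   # "small" "small" "medium" "large"
```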
2.7 Missing and Raw Data
Besides the atomic data types, R supports two more types. A missing or undefined value is indicated by NA, which stands for not available. This is in fact not a data type of its own; by default R treats NA as a logical value. Many functions can handle data containing NAs and provide options to use the respective dataset anyway. Take care not to mistake NA for the reserved term NaN (not a number), which is the result of undefined calculations such as 0/0.
#R> [1] 4 5 NA 7 8
#R> [1] FALSE FALSE TRUE FALSE FALSE
#R> [1] 1
#R> [1] 3
#R> [1] NA
#R> [1] 3
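A minimal sketch of typical ways to detect and handle missing values; the vector obs is made up for illustration.

```r
obs <- c(4, 5, NA, 7, 8)

is.na(obs)                # FALSE FALSE  TRUE FALSE FALSE
sum(is.na(obs))           # 1, the number of missing values
which(is.na(obs))         # 3, the position of the missing value

mean(obs)                 # NA: a missing value propagates
mean(obs, na.rm = TRUE)   # 6: missing values are removed first

typeof(NA)                # "logical"
0 / 0                     # NaN
is.na(0 / 0)              # TRUE: NaN also counts as missing
is.nan(NA)                # FALSE: but NA is not NaN
```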
R also supports raw vectors, which store raw bytes and display them in hexadecimal notation. The hexadecimal system is a relative of our well-known decimal system and the binary system we talked about at the beginning of this chapter. Hexadecimal numbers are numbers with base \(b = 16\). This may come in handy when reading files in binary formats. To find additional information about how to work with the type raw, see the corresponding help page with ?raw.
#R> [1] 00 00 00
#R> [1] 0f
#R> [1] "raw"
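A few base R functions convert between raw bytes, integers and text. A short sketch:

```r
raw(3)                          # 00 00 00, three empty bytes
as.raw(15)                      # 0f
as.raw(255)                     # ff, the largest value a single byte can hold
as.integer(as.raw(255))         # 255

charToRaw("Hi")                 # 48 69, the bytes behind the two characters
rawToChar(as.raw(c(72, 105)))   # "Hi"
```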
2.8 Coercion, Attributes, and Class
2.8.1 Coercion
We have already seen that R sometimes coerces data types automatically. That makes it possible to calculate the sum of a vector of logical values, which tells you how many elements with value TRUE are present. R has strict rules on how it behaves when coercing data types. If a string is present in a vector, everything will be converted to strings. When there are only logicals and numbers in a vector, R converts the logicals to their numeric value, so that every TRUE becomes 1 and every FALSE becomes 0. The main goal is to preserve information rather than lose it, even at the cost of higher memory requirements and reduced compatibility. A graphical representation is shown in the following figure.

FIGURE 2.1: Coercion Cycle and Coercion Rules
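The coercion rules can be verified by combining values of different types with c(). A minimal sketch:

```r
typeof(c(TRUE, 2L))              # "integer"
typeof(c(TRUE, 2L, 3.5))         # "double"
typeof(c(TRUE, 2L, 3.5, "a"))    # "character"

c(TRUE, 2, "x")                  # "TRUE" "2" "x"
sum(c(TRUE, FALSE, TRUE))        # 2

# Explicit coercion in the other direction
as.logical(c(0, 1, 2))           # FALSE  TRUE  TRUE
suppressWarnings(as.integer("seven"))   # NA: no sensible conversion exists
```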
Data frames and lists can handle multiple data types in the same structure, but there is a good reason not to mix everything up in one huge data frame, and this reason is math. Vectors and matrices that contain a single type of data make mathematical operations easy, operations that could not be done with a mixed-type data structure. Because such homogeneous structures are also easy to store in memory, these operations are fast.
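The difference can be sketched with a small made-up data frame named people: each column keeps its own type, while a matrix has to agree on a single one.

```r
people <- data.frame(id = 1:3,
                     name = c("Ann", "Bob", "Cy"),
                     member = c(TRUE, FALSE, TRUE),
                     stringsAsFactors = FALSE)

sapply(people, typeof)       # id: "integer", name: "character", member: "logical"
typeof(as.matrix(people))    # "character": a matrix holds one type only

# A purely numeric matrix supports matrix algebra directly
m <- matrix(1:4, nrow = 2)
m %*% m                      # rows: 7 15 and 10 22
```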
2.8.2 Attributes and Names
Attributes are R’s interpretation of metadata. They can be attached to any type of object and, apart from a few special attributes such as names, dimensions and class, won’t affect your calculations or other operations. When you display a vector, R prints custom attributes below the values. That makes it very convenient to store a description or any additional information you want to hand over with your object. Of course, you can access the attributes within functions, which allows you to perform special tasks if an object has certain attributes. Here is how you assign and display attributes.
#R> NULL
As you can see above, the result that is being returned is NULL, which is neither equivalent to 0 nor to NA. NULL is short for the NULL pointer, which is a computer scientist’s term for something like the empty set \(\varnothing\). Every time R returns NULL it just wants to express that there is nothing there.
description <- list(Description = "Simple numerical vector")
attributes(x) <- description
attributes(x)
#R> $Description
#R> [1] "Simple numerical vector"
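Single attributes can also be set and read with attr(). A minimal sketch; the vector v and the attribute name unit are made up for illustration.

```r
v <- 1:5
is.null(attributes(v))    # TRUE: a plain vector carries no attributes

attr(v, "Description") <- "Simple numerical vector"
attr(v, "Description")    # "Simple numerical vector"
attr(v, "unit")           # NULL: this attribute was never set

sum(v)                    # 15: calculations ignore the description

# NULL has length zero, unlike 0 or NA
length(NULL)              # 0
length(NA)                # 1
```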
Another way of adding descriptive information to your data is the names() function. This is useful for datasets, and you can think of it as a header for your data. It is by far the most common way to enrich data in R. Names behave like other attributes: they won’t affect the behavior of the vector, meaning you can still perform all calculations.
#R> NULL
#R> one two three four five
#R> 1 2 3 4 5
#R> attr(,"Description")
#R> [1] "Simple numerical vector"
#R> one two three four five
#R> 1 4 9 16 25
#R> attr(,"Description")
#R> [1] "Simple numerical vector"
names(x) <- NULL # To delete names set them to NULL
x <- unname(x) # Alternative way to remove the names
x
#R> [1] 1 4 9 16 25
#R> attr(,"Description")
#R> [1] "Simple numerical vector"
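Names can also be assigned at creation time and used to select elements. A sketch with a made-up vector prices:

```r
prices <- c(apple = 1.2, pear = 0.8, kiwi = 0.5)

names(prices)         # "apple" "pear" "kiwi"
prices[["pear"]]      # 0.8: access by name instead of position

doubled <- prices * 2
names(doubled)        # unchanged: calculations keep the names

names(unname(prices)) # NULL: the names are gone
```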
2.8.3 Class
A class in R is a property assigned to an object that describes the type of stored data in some way. Classes are used to control how generic functions like summary() behave. It is not a mutually exclusive classification. If an object has no specific class assigned to it, such as a simple numeric vector, its class is usually the same as its mode. If you transform data, e.g. from a vector to a matrix, R will automatically change the class attribute.
#R> [1] "integer"
#R> [1] TRUE
#R> [1] "lm"
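A minimal sketch of how classes can be inspected. The model below is fitted on R's built-in cars data, which is an assumption made for illustration and not the wage data used in this chapter.

```r
num <- 1:10
class(num)                     # "integer"
is.null(attr(num, "class"))    # TRUE: the class is implicit, not stored

m <- matrix(num, nrow = 2)
inherits(m, "matrix")          # TRUE: the class changed with the structure

fit <- lm(dist ~ speed, data = cars)
class(fit)                     # "lm"
inherits(fit, "lm")            # TRUE
```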
When summarizing the object mod, which contains the linear model, and the numeric vector num, one can easily see that different outputs are produced depending on the class of the object. The same mechanism can also be used to define your own generic functions and to control their behavior dynamically and conveniently.
#R> Min. 1st Qu. Median Mean 3rd Qu. Max.
#R> 1.00 3.25 5.50 5.50 7.75 10.00
#R>
#R> Call:
#R> lm(formula = wage ~ 1 + education)
#R>
#R> Residuals:
#R> Min 1Q Median 3Q Max
#R> -5.3396 -2.1501 -0.9674 1.1921 16.6085
#R>
#R> Coefficients:
#R> Estimate Std. Error t value Pr(>|t|)
#R> (Intercept) -0.90485 0.68497 -1.321 0.187
#R> education 0.54136 0.05325 10.167 <2e-16 ***
#R> ---
#R> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#R>
#R> Residual standard error: 3.378 on 524 degrees of freedom
#R> Multiple R-squared: 0.1648, Adjusted R-squared: 0.1632
#R> F-statistic: 103.4 on 1 and 524 DF, p-value: < 2.2e-16
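Defining your own generic function could look like the following sketch. The generic describe() and the class temperature are hypothetical names invented for this example.

```r
# Generic function: dispatches on the class of its first argument
describe <- function(obj, ...) UseMethod("describe")

# Fallback for classes without their own method
describe.default <- function(obj, ...) {
  paste("object of class", class(obj)[1])
}

# Method for numeric vectors (integer and double)
describe.numeric <- function(obj, ...) {
  paste("numeric vector of length", length(obj))
}

# Method for the made-up class "temperature"
describe.temperature <- function(obj, ...) {
  paste(unclass(obj), "degrees Celsius")
}

temp <- structure(21.5, class = "temperature")

describe(1:10)      # "numeric vector of length 10"
describe(temp)      # "21.5 degrees Celsius"
describe("text")    # "object of class character"
```

R picks the method whose suffix matches the class of the object and falls back to describe.default() when no specific method exists.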
Exercises
EX 1
EX 2
EX 3
TRUE + TRUE? Why?
EX 4
#R> x y z
#R> 1 1 c TRUE
#R> 2 3 b FALSE