# Chapter 5 Data Management and Manipulations

In this chapter, we are going to see how to modify certain attributes of data (like names), how to extract a subset of data, how to clean up missing values, and so on. We have already seen some of these topics, but here we are going to be systematic and focus on the underlying principles.

## 5.1 Attaching Names to the Elements of an Object

We can provide a name for each element of a vector when we create it, as in this example:

```
x <- c(Left=1, Middle=2, Right=3)
x
## Left Middle Right
## 1 2 3
names(x)
## [1] "Left" "Middle" "Right"
```

Or we can first create a vector and then name its elements using the `names` function. For example, let us create a vector containing the numbers 1 to 3. By default, no names attribute is attached, so calling `names(x)` returns `NULL`. We can then give a name to each element of `x`: the first element is called `Left`, the second `Middle`, and the third `Right`. Now when we print `x`, we get the vector 1, 2, 3 with the name of each element printed above it, and calling `names(x)` again would return `Left`, `Middle`, `Right`:

```
x <- c(1, 2, 3) # create a vector
names(x) # no names attached yet
## NULL
names(x) <- c("Left", "Middle", "Right") # attach names
x
## Left Middle Right
## 1 2 3
```

Naming objects is very useful for writing readable code and self-describing objects.
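One practical payoff of names is that elements can be selected by name instead of by position. The short sketch below illustrates this with the same vector; `unname` is a base `R` function that is not otherwise used in this chapter.

```r
x <- c(Left = 1, Middle = 2, Right = 3)
x["Middle"]                 # select one element by name
x[c("Right", "Left")]       # select several, in any order
names(x)[2] <- "Center"     # rename a single element
names(x)
y <- unname(x)              # copy of x with the names stripped
names(y)                    # NULL
```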

Matrices can also have names; these are called `dimnames`:

```
M <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(M) <- list(c("Top", "Bottom"), c("Left", "Right"))
M
## Left Right
## Top 1 3
## Bottom 2 4
```

Column names and row names can be set separately using the `colnames` and `rownames` functions:

```
M <- matrix(1:4, nrow = 2, ncol = 2)
rownames(M) <- c("Top", "Bottom")
colnames(M) <- c("Left", "Right")
M
## Left Right
## Top 1 3
## Bottom 2 4
```

Alternatively, the names can be supplied through the `dimnames` argument when the matrix is created:

```
row_names <- c("Top", "Bottom")
col_names <- c("Left", "Right")
M <- matrix(1:4, nrow = 2, ncol = 2, dimnames = list(row_names, col_names))
M
## Left Right
## Top 1 3
## Bottom 2 4
```

For data frames, there is a separate function for setting the row names, the `row.names()` function. Also, the column names of a data frame are simply its names (as with a list), so to set them just use the `names()` function. Here is a quick summary:

| Object | Function to Set Column Names |
|---|---|
| data frame | `names()` |
| matrix | `colnames()` |
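As a minimal sketch of the data frame case, the block below sets both kinds of names on a small data frame. The data frame `df` and its contents are made up purely for illustration.

```r
# a small made-up data frame, used only for illustration
df <- data.frame(a = 1:2, b = c("u", "v"))
names(df) <- c("id", "label")            # set the column names
row.names(df) <- c("first", "second")    # set the row names
df
names(df)
row.names(df)
```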

Finally, lists can also have names:

```
x <- list(x = 1, y = 2, z = 3)
x
## $x
## [1] 1
##
## $y
## [1] 2
##
## $z
## [1] 3
```

## 5.2 Subsetting `R` Objects

There are three operators that can be used to extract subsets of `R` objects.

| Operator | Description |
|---|---|
| `[` | Always returns an object of the same class as the original. It can be used to select multiple elements of an object. |
| `[[` | Extracts elements of a list or a data frame. It can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame. |
| `$` | Extracts elements of a list or data frame by literal name. |

### 5.2.1 Subsetting a Vector

Vectors are basic objects in `R` and they can be subsetted using the `[` operator.

```
x <- c("X", "Y", "Z", "Z", "Z", "X")
x[1] # Extract the first element
## [1] "X"
x[2] # Extract the second element
## [1] "Y"
```

The `[` operator can be used to extract multiple elements of a vector by passing the operator an integer sequence. Here we extract the second through fourth elements of the vector.

```
x <- c("X", "Y", "Z", "Z", "Z", "X")
x[2:4]
## [1] "Y" "Z" "Z"
```

The sequence does not have to be in order; you can specify any arbitrary integer vector.

```
x <- c("X", "Y", "Z", "Z", "Z", "X")
x[c(1, 3, 5)]
## [1] "X" "Z" "Z"
```

We can also pass a logical sequence to the `[` operator to extract elements of a vector that satisfy a given condition. For example, here we want the elements of `x` that come lexicographically after the letter `T`.

```
x <- c("X", "Y", "Z", "T", "Z", "Z", "X", "T")
indices <- x > "T"
indices
## [1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
x[indices] # elements of x that come after T in the alphabet
## [1] "X" "Y" "Z" "Z" "Z" "X"
```

Another, more compact, way to do this would be to skip the creation of a logical vector and just subset the vector directly with the logical expression.

```
x <- c("X", "Y", "Z", "T", "Z", "Z", "X", "T")
x[x > "T"] # elements of x that come after T in the alphabet
## [1] "X" "Y" "Z" "Z" "Z" "X"
```
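The same idea carries over to numeric vectors, where conditions can be combined. The sketch below uses a made-up vector `v`; it also shows `which`, which returns the positions where a condition holds, and a negative index, which drops a position.

```r
v <- c(5, 12, 7, 20, 3)     # made-up values
v[v > 6]                    # elements greater than 6
v[v > 6 & v < 15]           # combine conditions with &
which(v > 6)                # positions (not values) where the condition holds
v[-1]                       # a negative index drops that position
```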

### 5.2.2 Subsetting a Matrix

Matrices can be subsetted in the usual way with \((i,j)\) type indices. Here, we create a simple 2×4 matrix with the `matrix` function.

```
M <- matrix(1:8, 2)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
M[2, 1] # the (2,1) element of the matrix
## [1] 2
M[1, 3] # the (1,3) element of the matrix
## [1] 5
```

Indices can also be missing. This behavior is used to access entire rows or columns of a matrix.

```
M <- matrix(1:8, 2)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
M[2, ] # the second row of the matrix
## [1] 2 4 6 8
M[, 3] # the third column of the matrix
## [1] 5 6
```
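Index vectors, names, and logical conditions work for matrices as well. The following sketch assumes a matrix with made-up `dimnames` (`Top`/`Bottom` and `A` to `D`), chosen only for illustration.

```r
M <- matrix(1:8, nrow = 2,
            dimnames = list(c("Top", "Bottom"), c("A", "B", "C", "D")))
M[, c(1, 3)]        # first and third columns
M["Bottom", "C"]    # select by row and column name
M[M > 4]            # a logical condition returns a plain vector
```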

#### 5.2.2.1 Dropping matrix dimensions

By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1×1 matrix. Often, this is exactly what we want, but this behavior can be turned off by setting `drop = FALSE`.

```
M <- matrix(1:8, 2)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
M[2, 1, drop = FALSE] # setting drop=FALSE returns a 1×1 matrix
## [,1]
## [1,] 2
```

Similarly, when we extract a single row or column of a matrix, `R` by default drops the dimension of length 1, so instead of getting a 1×4 matrix after extracting the second row, we get a vector of length 4. This behavior can similarly be turned off with the `drop = FALSE` option.

```
M <- matrix(1:8, 2)
M
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
M[2, , drop = FALSE] # setting drop=FALSE returns a 1×4 matrix
## [,1] [,2] [,3] [,4]
## [1,] 2 4 6 8
```
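One way to see the effect of `drop` is to inspect the dimensions of the result directly. This is a small sketch; the names `r1` and `r2` are introduced here only for illustration.

```r
M <- matrix(1:8, 2)
r1 <- M[2, ]                  # dimension dropped
r2 <- M[2, , drop = FALSE]    # dimension kept
dim(r1)                       # NULL, because r1 is a plain vector
dim(r2)
is.matrix(r1)
is.matrix(r2)
```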

### 5.2.3 Subsetting Lists

Lists in `R` can be subsetted using all three of the operators mentioned above, and all three are used for different purposes.

```
L <- list(integers = 1:4, decimal = 0.6)
L
## $integers
## [1] 1 2 3 4
##
## $decimal
## [1] 0.6
```

The `[[` operator can be used to extract single elements from a list. Here we extract the first element of the list:

```
L <- list(integers = 1:4, decimal = 0.6)
L[[1]]
## [1] 1 2 3 4
```

The `[[` operator can also use named indices so that you don’t have to remember the exact ordering of every element of the list. You can also use the `$` operator to extract elements by name.

```
L <- list(integers = 1:4, decimal = 0.6)
L[["decimal"]]
## [1] 0.6
L$decimal
## [1] 0.6
```

Notice you don’t need the quotes when you use the `$` operator.

One thing that differentiates the `[[` operator from the `$` operator is that `[[` can be used with computed indices. The `$` operator can only be used with literal names.

```
L <- list(integers = 1:4, decimal = 0.6, word = "hello")
name <- "integers"
# computed index for "integers"
L[[name]]
## [1] 1 2 3 4
# element "name" doesn’t exist! (but no error here)
L$name
## NULL
# element "integers" does exist
L$integers
## [1] 1 2 3 4
```
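A further difference, not covered above, is that `$` allows partial matching of names on lists, whereas `[[` requires an exact match unless `exact = FALSE` is given. A minimal sketch:

```r
L <- list(integers = 1:4, decimal = 0.6)
L$int                        # partial match on "integers"
L[["int"]]                   # NULL: exact matching by default
L[["int", exact = FALSE]]    # partial matching must be requested
```

Partial matching can hide typos, so spelling out the full name is usually the safer habit.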

#### 5.2.3.1 Extracting Multiple Elements of a List

The `[` operator can be used to extract multiple elements from a list. For example, if you wanted to extract the first and third elements of a list, you would do the following:

```
L <- list(integers = 1:4, decimal = 0.6, word = "hello")
# Get the 1st and 3rd elements of the list
L[c(1, 3)]
## $integers
## [1] 1 2 3 4
##
## $word
## [1] "hello"
```

Remember that the `[` operator always returns an object of the same class as the original. Since the original object was a list, the `[` operator returns a list. In the above code, we returned a list with two elements (the first and the third).

#### 5.2.3.2 Subsetting Nested Elements of a List

The `[[` operator can take an integer sequence if you want to extract a nested element of a list.

```
L <- list(integers = 1:4, decimal = 0.6, word = "hello")
# Get the 3rd element of the 1st element
L[[c(1, 3)]]
## [1] 3
# Same as above
L[[1]][[3]]
## [1] 3
# 1st element of the 2nd element
L[[c(2, 1)]]
## [1] 0.6
```

So, note that `L[c(1, 3)]` is not the same as `L[[c(1, 3)]]`.
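The difference is easy to confirm by checking the class of each result. In this sketch the names `a` and `b` are introduced only for illustration.

```r
L <- list(integers = 1:4, decimal = 0.6, word = "hello")
a <- L[c(1, 3)]      # a list holding the 1st and 3rd elements
b <- L[[c(1, 3)]]    # the 3rd value inside the 1st element
class(a)
length(a)
class(b)
b
```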

## 5.3 Missing Data

Missing values in `R` are denoted by either `NA` or `NaN`: `NaN` is used for undefined mathematical operations and `NA` is used for pretty much everything else. There is also another type of value, `NULL`, that sounds like `NA` but is actually quite different, as we will see.

### 5.3.1 `NA` and `NaN`

Let’s add a missing value by entering `NA` as an element of our vector `x` (`NA` is a logical constant, and `R` coerces it to the type of the vector it is added to, so it can be added to a numeric vector) and try to compute the mean:

```
x <- 1:10
x <- c(x,NA)
x
## [1] 1 2 3 4 5 6 7 8 9 10 NA
mean(x)
## [1] NA
```

The built-in `mean` function returns `NA` because of the missing value. Instead, we need to write:

```
mean(x, na.rm = TRUE)
## [1] 5.5
```

The `na.rm = TRUE` argument does not remove the missing value from the vector; it simply omits it from the calculation. In other words, the mean is computed without the missing value, but the vector itself does not change unless you use another command, such as `x <- x[-11]`, as we will see below. Other built-in functions such as `sum`, `min`, `max`, `var`, and `sd` accept the same argument.
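A brief sketch of the same argument with a few of these functions, using the vector from above, confirms that the vector itself is left untouched:

```r
x <- c(1:10, NA)
sum(x)                  # NA
sum(x, na.rm = TRUE)
max(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
length(x)               # x itself still has 11 elements
```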

The function `is.na` tests each element of a vector for missing values and returns a logical vector:

```
x <- c(NA, 2, NA, 1:3, NA)
is.na(x)
## [1] TRUE FALSE TRUE FALSE FALSE FALSE TRUE
```

Remember that vectors must contain data elements of the same type. To demonstrate this, let us make a vector of 10 numbers and then add a character element to it. `R` coerces the data to a character vector because we added a character object to it. I used the index `[11]` to add the character element to the vector. The vector now contains characters, so you cannot do math on it:

```
x <- 1:10
mean(x)
## [1] 5.5
x[11] <- "A" # appending a numeric vector with a character
x
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "A"
mean(x) # cannot do math on it
## Warning in mean.default(x): argument is not numeric or logical: returning
## NA
## [1] NA
```

We can use a negative index, `[-11]`, to remove the character element and the `R` function `as.integer()` to coerce the vector back to integers:

```
x <- x[-11]
x <- as.integer(x)
is.integer(x)
## [1] TRUE
mean(x)
## [1] 5.5
```

Let us create a numeric vector `x` as `1, 2, NA, 10, 3`. The `NA` here is a numeric missing value. When we call `is.na` on `x`, it returns a logical vector indicating whether each element of `x` is missing or not.

```
x <- c(1, 2, NA, 10, 3)
is.na(x)
## [1] FALSE FALSE TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE
x <- c(1, 2, NaN, NA, 4)
is.na(x)
## [1] FALSE FALSE TRUE TRUE FALSE
is.nan(x)
## [1] FALSE FALSE TRUE FALSE FALSE
```

For the first vector, `is.na` returns `FALSE` for the first two elements, `TRUE` for the third, and `FALSE` for the fourth and fifth. If we call `is.nan` on this vector, the result is all `FALSE` because it contains no `NaN` values. For the second vector, `is.na` is `TRUE` for both the `NaN` and the `NA`, whereas `is.nan` is `TRUE` only for the `NaN`.
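To make the distinction concrete, the sketch below shows two undefined operations that produce `NaN`, and how `NA` propagates through a comparison. The particular expressions are chosen here only as examples.

```r
0 / 0            # NaN: undefined operation
Inf - Inf        # NaN
NA > 1           # NA: the comparison cannot be decided
is.nan(0 / 0)
is.na(0 / 0)     # NaN also counts as missing
is.nan(NA)       # but NA is not NaN
```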

#### 5.3.1.1 Removing `NA` Values

A common task in data analysis is removing missing values (`NA`s), and this can be easily done in `R`:

```
x <- c(1, 2, NA, 10, 3)
missing <- is.na(x)
print(missing)
## [1] FALSE FALSE TRUE FALSE FALSE
x[!missing]
## [1] 1 2 10 3
```

What if there are multiple `R` objects and you want to take the subset with no missing values in any of those objects?

```
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
notmissing <- complete.cases(x, y)
notmissing
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
x[notmissing]
## [1] 1 2 4 5
y[notmissing]
## [1] "a" "b" "d" "f"
```

You can use the function `complete.cases` on data frames too.

```
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
notmissing <- complete.cases(airquality)
head(airquality[notmissing, ])
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 7 23 299 8.6 65 5 7
## 8 19 99 13.8 59 5 8
```
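For data frames, the base function `na.omit` (not used elsewhere in this chapter) is a shortcut for the same operation. The sketch below also counts the missing values per column; the names `n_incomplete` and `clean` are introduced only for illustration.

```r
n_incomplete <- sum(!complete.cases(airquality))  # rows with at least one NA
colSums(is.na(airquality))                         # number of NAs per column
clean <- na.omit(airquality)                       # drop the incomplete rows
nrow(airquality)
nrow(clean)
```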

### 5.3.2 `NULL`

`NULL` is the absence of anything. For example, functions can sometimes return `NULL` or their arguments can be `NULL`. An important difference between `NA` and `NULL` is that `NULL` is an object in its own right (of length zero) and cannot be stored as an element of an atomic vector. If `NULL` is used inside a vector, it simply disappears:

```
x <- c(NA, 1, NULL, 2, NA)
x
## [1] NA 1 2 NA
length(x)
## [1] 4
```

Thus, even though it was entered into the vector `x`, it did not get stored in `x`.

The function `is.null` tests whether an object is `NULL`. Unlike `is.na`, it checks the object as a whole rather than each element:

```
x <- NULL
is.null(x)
## [1] TRUE
x <- c(NA, 1, NULL, 2, NA)
is.null(x)
## [1] FALSE
```
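Lists behave differently from atomic vectors here: a list can hold `NULL` as one of its elements. The sketch below, with a made-up list `lst`, contrasts the lengths of `NULL` and `NA` and checks a list element by element.

```r
length(NULL)                   # 0
length(NA)                     # 1
lst <- list(a = 1, b = NULL)   # a list can hold NULL as an element
length(lst)
is.null(lst$b)
sapply(lst, is.null)           # element-by-element check
```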

## 5.4 An Extended Example

In this section, I will demonstrate various ways of subsetting and manipulating a data set, and along the way illustrate the use of some functions that we haven’t seen before, such as

- `which`
- `recode`
- `subset`

Throughout the section, I will use the `mpg` dataset from the `ggplot2` package, which should already be installed in `R`, so we just need to load it into the current session:

`library(ggplot2)`

Let us first see what kind of variables are included in `mpg` by using the `summary` function:

```
summary(mpg)
## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
```

and let’s see what they look like

```
head(mpg)
## # A tibble: 6 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
```

```
names(mpg)
## [1] "manufacturer" "model" "displ" "year"
## [5] "cyl" "trans" "drv" "cty"
## [9] "hwy" "fl" "class"
# let's save the original names
namesOriginal <- names(mpg)
```

### 5.4.1 Rename Columns (Variables)

```
# rename some of the columns
names(mpg)[c(1,8,9)] <- c("manuf", "mpg(cty)", "mpg(hwy)")
names(mpg)
## [1] "manuf" "model" "displ" "year" "cyl" "trans"
## [7] "drv" "mpg(cty)" "mpg(hwy)" "fl" "class"
```

Rename by a variable name:

```
names(mpg)[names(mpg)=="year"] <- "yr"
names(mpg)
## [1] "manuf" "model" "displ" "yr" "cyl" "trans"
## [7] "drv" "mpg(cty)" "mpg(hwy)" "fl" "class"
# put the original names back
names(mpg) <- namesOriginal
```

Renaming by conditioning on an individual column name might be useful especially when we don’t know the exact column number, for instance.
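When several columns need renaming by name, one option is a lookup vector combined with `match`. The sketch below works on a copy of the built-in `mtcars` data so that it does not depend on `ggplot2`; the new column names and the `lookup` vector are hypothetical.

```r
cars <- mtcars                                    # work on a copy
names(cars)[names(cars) == "hp"] <- "horsepower"  # rename one column by name
# hypothetical lookup: old names on the left, new names on the right
lookup <- c(mpg = "miles_per_gallon", cyl = "cylinders")
idx <- match(names(lookup), names(cars))          # positions of the old names
names(cars)[idx] <- unname(lookup)
names(cars)[1:4]
```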

### 5.4.2 Working with Missing Data

```
# let's overwrite the last observation of cty with an odd value
mpg$cty[length(mpg$cty)] <- -1
# use tail function to see that -1 is indeed a bizarre observation
tail(mpg$cty)
## [1] 18 19 21 16 18 -1
# extract the indices where cty == -1
which(mpg$cty == -1) # note that we use == not the usual =
## [1] 234
# treat these observations as missing
mpg$cty[which(mpg$cty == -1 )] <- NA
# tabulate missing values
table(is.na(mpg$cty))
##
## FALSE TRUE
## 233 1
# with missing values we must use the option 'na.rm = TRUE'
mean(mpg$cty, na.rm = TRUE)
## [1] 16.85837
```
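The same recoding can be written without `which`, by indexing directly with the logical condition. Here is a small sketch on a made-up vector in which `-1` is assumed to mark a missing observation:

```r
cty <- c(18, 21, -1, 16, -1)    # made-up values; -1 marks "missing"
cty[cty == -1] <- NA            # logical index instead of which()
sum(is.na(cty))                 # number of missing values
mean(cty, na.rm = TRUE)
```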

We can also use `any(is.na(mpg$cty))` to check whether there are any `NA`s in the data.

### 5.4.3 Computing New Variables

- Create a new variable by summing up the `cty` and `hwy` variables and append this to the `mpg` data frame as another column named `mpg_sum`:

`mpg$mpg_sum <- rowSums(mpg[8:9])`

- Alternative method using the `apply` function:

`apply(mpg[, 8:9], 1, sum)`

- Compute the average mpg of the two variables:

`mpg$mpg_avg <- mpg$mpg_sum / 2`

- Drop a variable:

`mpg$mpg_avg <- NULL`
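The steps in this list can be sketched end to end on a tiny made-up data frame, selecting the columns by name rather than by position. `rowMeans` is a base function not used in the text; it computes the row-wise average directly.

```r
df <- data.frame(cty = c(18, 21, 20), hwy = c(29, 29, 31))  # made-up rows
cols <- c("cty", "hwy")
df$mpg_sum <- rowSums(df[cols])             # row-wise sum
df$mpg_avg <- rowMeans(df[cols])            # row-wise mean
sum_apply <- apply(df[, cols], 1, sum)      # the same sums via apply
df
df$mpg_sum <- NULL                          # drop a column
names(df)
```

Selecting columns by name keeps the code working even if the column order changes.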

### 5.4.4 Recode a Continuous Variable into a New Categorical Variable

```
summary(mpg$cty)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 14.00 17.00 16.86 19.00 35.00 1
mpg$cty_cat <- NA
mpg$cty_cat[mpg$cty <= 15] <-"Low"
mpg$cty_cat[mpg$cty > 15 & mpg$cty <= 30] <- "Med"
mpg$cty_cat[mpg$cty > 30] <-"High"
# let's tabulate this new variable
table(mpg$cty_cat)
##
## High Low Med
## 2 97 134
```
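Base `R` also provides `cut`, which performs this kind of binning in one call and returns a factor. The following is a sketch on made-up values using the same cut points as above; by default the intervals are closed on the right, so 15 falls in `Low` and 30 in `Med`.

```r
cty <- c(9, 15, 16, 30, 35, NA)     # made-up values
cty_cat <- cut(cty,
               breaks = c(-Inf, 15, 30, Inf),   # (-Inf,15], (15,30], (30,Inf]
               labels = c("Low", "Med", "High"))
cty_cat
table(cty_cat, useNA = "ifany")
```

Because the result is a factor with ordered levels, `table` lists the categories as Low, Med, High instead of alphabetically.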

### 5.4.5 Recode from Continuous to Continuous using `recode` within the **car** package

To use the `recode` function we need to install the **car** package first:

```
# install.packages("car")
library(car)
mpg$cyl_2 <- recode(mpg$cyl, "4=1;5=2;6=3;8=4")
```
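If installing **car** is not an option, a similar recoding can be sketched in base `R` with a named lookup vector. This is an alternative technique, not what `recode` does internally; the vector `map` and the sample values are made up for illustration.

```r
cyl <- c(4, 6, 8, 5, 4)                           # made-up values
map <- c("4" = 1, "5" = 2, "6" = 3, "8" = 4)      # old value -> new value
cyl_2 <- unname(map[as.character(cyl)])           # look up each value by name
cyl_2
```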

### 5.4.6 Subsets of a Data Frame

- Subsets by specifying the column name:

```
mpg[1:3, c("model", "cty", "hwy")]
## # A tibble: 3 x 3
## model cty hwy
## <chr> <dbl> <int>
## 1 a4 18 29
## 2 a4 21 29
## 3 a4 20 31
```

- Subsets by specifying the row:

```
# Returns indices of rows where logical statement is TRUE
which(mpg$cty > 30)
## [1] 213 222
which(mpg$cty > 23 & mpg$cty < 28)
## [1] 101 102 104 105 106 107 194 195 196 198
which(mpg$cty < 11 | mpg$cty > 32)
## [1] 55 60 66 70 127 213 222
```

- Subsets by specifying conditions on some of the rows using the `subset` function:

```
# two more ways of conditional subsetting
sub1 <- mpg[which(mpg$cty > 30), c("cty","hwy")]
sub2 <- subset(mpg, cty > 30, select = c("cty","hwy"))
# check if they are identical
identical(sub1,sub2)
## [1] TRUE
```
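One reason to prefer `which` or `subset` over a bare logical condition is the treatment of missing values. The sketch below uses a small made-up data frame with one `NA` in the condition column:

```r
df <- data.frame(cty = c(35, NA, 12, 33), hwy = c(44, 30, 17, 41))  # made-up
a <- df[df$cty > 30, ]          # the NA in the condition yields a row of NAs
b <- df[which(df$cty > 30), ]   # which() keeps only the TRUE positions
s <- subset(df, cty > 30)       # subset() also drops NA conditions
nrow(a)
nrow(b)
nrow(s)
```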

### 5.4.7 Creating a Dummy Variable

The variable `year` has two levels: 1999 and 2008. Suppose we want to create a dummy variable for the year 2008:

```
dum2008 <- as.numeric(mpg$year == 2008)
head(dum2008)
```

Here, `dum2008 = 1` when `year == 2008` and zero otherwise.

Another example is creating a dummy for years after 1999:

```
dum1999a <- as.numeric(mpg$year > 1999)
head(dum1999a)
```

More generally, we can use the `ifelse` function to choose between two values depending on a condition. The 0-1 dummies can be written as follows; if for some reason you wanted to use, say, 4 and 7 instead, you would replace the `1` and `0` in the last two arguments:

```
dum99 <- ifelse(mpg$year == 1999, 1, 0) # refer to the column through mpg$
dum08 <- ifelse(mpg$year >= 2008, 1, 0)
```
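To see both codings side by side without loading `mpg`, here is a minimal sketch on a made-up `year` vector; the name `code47` is introduced only for illustration.

```r
year <- c(1999, 2008, 2008, 1999)      # made-up values
dum2008 <- as.numeric(year == 2008)    # 0-1 dummy
code47  <- ifelse(year == 2008, 7, 4)  # 7 for 2008, 4 otherwise
dum2008
code47
```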

### 5.4.8 Data Frame Management: ‘attach’, ‘detach’, and ‘with’

There is an easier way of working with data frames: the **attach** and **detach** functions. The `attach()` function adds the data frame to the `R` search path. When a variable name is encountered, data frames in the search path are checked for the variable in order. The `detach()` function removes the data frame from the search path.

These two produce the same table:

```
summary(mpg$cty)
mpg$cty_cat <- NA
mpg$cty_cat[mpg$cty <= 15] <-"Low"
mpg$cty_cat[mpg$cty > 15 & mpg$cty <= 30] <- "Med"
mpg$cty_cat[mpg$cty > 30] <-"High"
# let's tabulate this new variable
table(mpg$cty_cat)
```

and

```
library(ggplot2)
attach(mpg)
print(summary(cty))
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 14.00 17.00 16.86 19.00 35.00 1
cty_cat <- NA
cty_cat[cty <= 15] <-"Low"
cty_cat[cty > 15 & cty <= 30] <- "Med"
cty_cat[cty > 30] <-"High"
# let's tabulate this new variable
table(cty_cat)
## cty_cat
## High Low Med
## 2 97 134
detach(mpg)
```

The `with()` function is an alternative that offers the same convenience without changing the search path:

```
library(ggplot2)
with(mpg, {
print(summary(cty))
cty_cat <- NA
cty_cat[cty <= 15] <-"Low"
cty_cat[cty > 15 & cty <= 30] <- "Med"
cty_cat[cty > 30] <-"High"
# let's tabulate this new variable
table(cty_cat)
})
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 14.00 17.00 16.86 19.00 35.00 1
## cty_cat
## High Low Med
## 2 97 134
```

If you want to save the objects created within `with`, use the `<<-` assignment operator. It saves the object to the global environment, outside of the `with()` call.

```
library(ggplot2)
with(mpg, {
print(summary(cty))
cty_cat <<- NA
cty_cat[cty <= 15] <<-"Low"
cty_cat[cty > 15 & cty <= 30] <<- "Med"
cty_cat[cty > 30] <<-"High"
})
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 14.00 17.00 16.86 19.00 35.00 1
# let's tabulate this new variable
table(cty_cat)
## cty_cat
## High Low Med
## 2 97 134
```

```
with(mtcars, {
nokeepstats <- summary(mpg)
keepstats <<- summary(mpg)
})
```

Now try to see which of the two variables is available outside the `with()` call.
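An alternative to `<<-` is to use the value that `with()` returns, or to use the base function `within()`, which returns a modified copy of the data frame. The sketch below is one possible approach; the names `cars2` and `mpg_cat` and the cut-off of 20 are made up for illustration.

```r
keepstats <- with(mtcars, summary(mpg))   # with() returns its last value
cars2 <- within(mtcars, {                 # within() returns a modified copy
  mpg_cat <- ifelse(mpg > 20, "High", "Low")
})
keepstats
table(cars2$mpg_cat)
"mpg_cat" %in% names(mtcars)              # the original is unchanged
```

Assigning the result explicitly keeps it clear where each object is created, which is harder to trace with `<<-`.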