R Data Frame

What is a Data Frame?

Suppose you want to store the names of your employees, their age and addresses all in one dataset.

The first structure that comes to mind is a matrix. But a matrix can hold only one data type, so you can’t combine all this data in one matrix without converting everything to character data.

So, you need a new data structure to keep all this information together. That data structure is a Data Frame.

Unlike vectors or matrices, data frames place no restriction on mixing data types across columns: one column can hold numeric data, another character data, and so on. Each individual column still holds a single type.

In a nutshell, a data frame is a list of equal-length vectors. The easiest way to think of a data frame is as an Excel worksheet.
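The "list of equal-length vectors" idea can be checked directly. The following is a minimal sketch that uses the data.frame() function (covered in the next section); the column values are only illustrative.

```r
# Build a small data frame with columns of different types
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23))

is.list(df)         # TRUE: a data frame is a list underneath
length(df)          # 2: one list element per column
sapply(df, length)  # every column has the same length (3)
class(df$age)       # "numeric": each column keeps its own type
```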

Create a Data Frame

You can create a data frame using the data.frame() function.

Example: Create a data frame to store employee records

> name <- c("Bob", "Max", "Sam")
> age <- c(25,26,23)
> city <- c("New York", "Chicago", "Seattle")
> 
> df <- data.frame(name, age, city)
> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

You can also convert pre-existing structures to a data frame using the as.data.frame() function.

Example: Convert a list of vectors into a data frame

> lst <- list(name = c("Bob", "Max", "Sam"),
+             age = c(25,26,23),
+             city = c("New York", "Chicago", "Seattle"))
> df <- as.data.frame(lst)
> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

Example: Convert a matrix into a data frame

> m <- matrix(1:12, nrow = 4, ncol = 3)
> df <- as.data.frame(m)
> df
  V1 V2 V3
1  1  5  9
2  2  6 10
3  3  7 11
4  4  8 12

Keeping Characters as Characters

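Let’s have a look at the internal structure of the employee data frame created earlier. The output below comes from an R version older than 4.0.0.

Example: Print internal structure of a data frame

> df <- data.frame(name, age, city)
> str(df)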
'data.frame':	3 obs. of  3 variables:
 $ name: Factor w/ 3 levels "Bob","Max","Sam": 1 2 3
 $ age : num  25 26 23
 $ city: Factor w/ 3 levels "Chicago","New York",..: 2 1 3

The str() function provides a compact display of the internal structure of any R object.

You may have noticed that the character columns (name and city) were converted to factors.

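R versions before 4.0.0 do this by default; from R 4.0.0 onward, character columns stay as characters unless you ask for factors. In older versions, you can avoid the conversion by setting the extra argument stringsAsFactors to FALSE.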

Example: Avoid character columns being converted to factors

> df <- data.frame(name, age, city, stringsAsFactors = FALSE)
> str(df)
'data.frame':	3 obs. of  3 variables:
 $ name: chr  "Bob" "Max" "Sam"
 $ age : num  25 26 23
 $ city: chr  "New York" "Chicago" "Seattle"
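If you already have a data frame whose character columns were turned into factors, you can convert them back afterwards. This is a minimal sketch; it forces the factor conversion with stringsAsFactors = TRUE so that it behaves the same on any R version, and the name is_fac is only illustrative.

```r
# Force factor conversion so the example behaves the same on any R version
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = TRUE)
sapply(df, class)   # both columns are "factor"

# Find the factor columns and convert each one back to character
is_fac <- sapply(df, is.factor)
df[is_fac] <- lapply(df[is_fac], as.character)
sapply(df, class)   # both columns are now "character"
```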

Naming Data Frame Rows and Columns

Every column in a data frame has a name.

Even if you didn’t specify them yourself, R will take the column names from your program variables.

Example: By default, R takes the column names from program variables

> v1 <- c("Bob", "Max", "Sam")
> v2 <- c(25,26,23)
> v3 <- c("New York", "Chicago", "Seattle")
> 
> df <- data.frame(v1, v2, v3)
> df
   v1 v2       v3
1 Bob 25 New York
2 Max 26  Chicago
3 Sam 23  Seattle

But you can give columns a sensible name by using colnames() or names().

Example: Modify column names

> names(df) <- c("name", "age", "city")
> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

By default, the rows of a data frame are simply numbered 1, 2, 3, and so on, but you can assign meaningful row names with rownames().

Example: Modify row names

> rownames(df) <- c("row1", "row2", "row3")
> df
     name age     city
row1  Bob  25 New York
row2  Max  26  Chicago
row3  Sam  23  Seattle

You can use the same colnames() and rownames() functions to print the column names and row names, respectively.

Example:

> # print column names
> colnames(df)
[1] "name" "age"  "city"

> # names() returns the same column names
> names(df)
[1] "name" "age"  "city"

> # print row names
> rownames(df)
[1] "row1" "row2" "row3"
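Because names() returns an ordinary character vector, you can also rename a single column without retyping all the names. A minimal sketch, using the v1 and v2 column names from the earlier example:

```r
df <- data.frame(v1 = c("Bob", "Max", "Sam"),
                 v2 = c(25, 26, 23))

# Rename one column by matching its current name
names(df)[names(df) == "v2"] <- "age"

# Rename one column by position
names(df)[1] <- "name"

names(df)   # "name" "age"
```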

Subsetting Data Frames Like a List

Data frames possess the characteristics of lists.

So, when you subset with a single vector, they behave like lists and will return the selected columns with all rows.

Example:

> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

> # subset for 1st column
> df[1]
  name
1  Bob
2  Max
3  Sam

> # subset for 1st and 3rd column
> df[c(1,3)]
  name     city
1  Bob New York
2  Max  Chicago
3  Sam  Seattle

> # omit 3rd column
> df[-3]
  name age
1  Bob  25
2  Max  26
3  Sam  23
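Single brackets always return a data frame, even for one column. To pull a column out as a plain vector, the list operators [[ ]] and $ can be used instead. A minimal sketch of the difference (the variable names are only illustrative):

```r
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"))

one_col_df <- df["age"]     # still a data frame, with one column
age_vec    <- df[["age"]]   # the column itself, as a numeric vector
age_vec2   <- df$age        # shorthand for the same thing

class(one_col_df)              # "data.frame"
class(age_vec)                 # "numeric"
identical(age_vec, age_vec2)   # TRUE
```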

Subsetting Data Frames Like a Matrix

Data frames also possess the characteristics of matrices.

So, when you subset with two vectors, they behave like matrices and can be subset by row and column.

Example:

> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> # subset for 2nd row
> df[2,]
  name age    city
2  Max  26 Chicago

> # subset for 3rd column
> df[,3]
[1] "New York" "Chicago"  "Seattle" 

> # select single element
> df[2,3]
[1] "Chicago"

> # subset for row 1 and 2 but keep all columns
> df[1:2,]
  name age     city
1  Bob  25 New York
2  Max  26  Chicago

> # subset for both rows and columns
> df[1:2,2:3]
  age     city
1  25 New York
2  26  Chicago

> # omit 2nd row
> df[-2,]
  name age     city
1  Bob  25 New York
3  Amy  23  Seattle

> # omit 2nd row and 3rd column
> df[-2,-3]
  name age
1  Bob  25
3  Amy  23

> # subset for 'city' column
> df[,"city"]
[1] "New York" "Chicago"  "Seattle" 
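Two variations on matrix-style subsetting are often useful: selecting rows with a logical condition, and keeping a single column as a data frame with drop = FALSE. A minimal sketch, using the same employee data:

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"))

# Rows where the condition is TRUE, all columns
older <- df[df$age > 24, ]
older$name                         # Bob and Max

# A single column normally drops to a vector; drop = FALSE keeps a data frame
city_df <- df[, "city", drop = FALSE]
class(city_df)                     # "data.frame"
```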

The subset() function

There’s a more convenient way to subset a data frame using the subset() function.

The function takes three main arguments.

subset(x, subset, select)

x: The data frame you want to subset.

subset: A logical expression that selects rows.

select: A column name, or a vector of column names, to be selected.

To see how the subset() function works, let’s start with a simple data set. Suppose you have a data frame df storing employee records:

Example:

> df
  name age sex     city
1  Eve  21   F  Chicago
2  Max  24   M  Houston
3  Ray  22   M New York
4  Kim  21   F New York
5  Sam  23   M  Chicago

> # select the employee name
> subset(df, select=name)
  name
1  Eve
2  Max
3  Ray
4  Kim
5  Sam

> # select the employee name and city
> subset(df, select=c(name,city))
  name     city
1  Eve  Chicago
2  Max  Houston
3  Ray New York
4  Kim New York
5  Sam  Chicago

> # select all employees from 'Chicago'
> subset(df, subset=(city == "Chicago"))
  name age sex    city
1  Eve  21   F Chicago
5  Sam  23   M Chicago

> # select the employee name and city with age > 22
> subset(df, select=c(name,city), subset=(age > 22))
  name    city
2  Max Houston
5  Sam Chicago
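For comparison, the last subset() call can also be written with bracket subsetting. The sketch below rebuilds the employee data shown above and checks that both forms give the same result.

```r
df <- data.frame(name = c("Eve", "Max", "Ray", "Kim", "Sam"),
                 age  = c(21, 24, 22, 21, 23),
                 sex  = c("F", "M", "M", "F", "M"),
                 city = c("Chicago", "Houston", "New York", "New York", "Chicago"))

# subset(): column names can be used without quotes or the df$ prefix
a <- subset(df, subset = age > 22, select = c(name, city))

# Bracket form: rows by logical condition, columns by name
b <- df[df$age > 22, c("name", "city")]

all.equal(a, b)   # TRUE
```

One difference worth knowing: when the condition evaluates to NA for some row, subset() drops that row, while bracket subsetting returns a row filled with NA.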

Add New Rows and Columns to Data Frame

You can add new columns to a data frame using the cbind() function.

Example: Add new column to a data frame

> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> sex <- factor(c("M", "M", "F"))
> cbind(df, sex)
  name age     city sex
1  Bob  25 New York   M
2  Max  26  Chicago   M
3  Amy  23  Seattle   F
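Note that cbind() returns a new data frame and leaves df itself unchanged. To keep the new column, assign the result to a variable, or add the column in place with the $ operator. A minimal sketch (the name wider is only illustrative):

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"))

# cbind() returns a copy with the extra column; df is not modified
wider <- cbind(df, sex = c("M", "M", "F"))
ncol(df)      # 3
ncol(wider)   # 4

# Adding the column in place with $
df$sex <- c("M", "M", "F")
ncol(df)      # 4
```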

To add new rows (observations) to a data frame, use the rbind() function.

Warning:

Take extra care when adding new rows to the data frame. Adding elements of wrong type can change the type of the columns.

For example, if your data frame contains a numeric column and you add a new row as a character vector, R will convert that numeric column to character type.

Example: Add a new row to a data frame

> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> row <- data.frame(name = "Sam",
+                   age = 22, 
+                   city = "New York")
> rbind(df, row)
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle
4  Sam  22 New York
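The type change described in the warning is easy to reproduce. The following sketch contrasts adding a row as a character vector with adding it as a one-row data frame; stringsAsFactors = FALSE is set explicitly so the example behaves the same on older R versions, and the names bad and good are only illustrative.

```r
df <- data.frame(name = c("Bob", "Max"),
                 age  = c(25, 26),
                 stringsAsFactors = FALSE)

# A character vector forces the numeric 'age' column to character
bad <- rbind(df, c("Sam", "22"))
class(bad$age)    # "character"

# A one-row data frame keeps each column's type
good <- rbind(df, data.frame(name = "Sam", age = 22,
                             stringsAsFactors = FALSE))
class(good$age)   # "numeric"
```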

Combine Two Data Frames

You can combine data frames in one of two ways:

Combine the Columns

Use the cbind() function to combine the columns of two data frames side by side, creating a wider data frame.

Example:

> df1
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> df2
     sex salary
1   Male  22000
2   Male  25400
3 Female  24800

> cbind(df1, df2)
  name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

Make sure the data frames have the same height (number of rows).

Otherwise, R will either invoke the Recycling Rule to extend the shorter columns (when the larger row count is a whole multiple of the smaller one), which may or may not be what you want, or stop with an error.
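Here is a minimal sketch of what happens when the heights differ. The data frames and column names (id, flag) are made up for illustration.

```r
df3 <- data.frame(id = 1:3)
df4 <- data.frame(id = 1:4)
short <- data.frame(flag = c("a", "b"))

# 3 rows vs 2 rows: 3 is not a multiple of 2, so cbind() fails.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(cbind(df3, short),
                error = function(e) conditionMessage(e))
res

# 4 rows vs 2 rows: the shorter column is silently recycled
wide <- cbind(df4, short)
wide$flag   # a b a b
```

The silent recycling in the second case is the one to watch for, because no warning is raised.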

Combine the Rows

Use the rbind() function to stack the rows of two data frames, creating a taller data frame.

Example:

> df1
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> df2
  name age     city
1  Eve  21  Chicago
2  Ray  22  Houston
3  Kim  24 New York

> rbind(df1, df2)
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle
4  Eve  21  Chicago
5  Ray  22  Houston
6  Kim  24 New York

Make sure the data frames have the same width (same number of columns and same column names).

However, the columns need not be in the same order.
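A short sketch of the column-order point: rbind() matches the columns of data frames by name, so the second data frame below can list age before name. The values are taken from the examples above.

```r
df1 <- data.frame(name = c("Bob", "Max"),
                  age  = c(25, 26))

# Same columns, different order
df2 <- data.frame(age = 21, name = "Eve")

out <- rbind(df1, df2)
names(out)   # "name" "age": the order of the first data frame is kept
out$age      # 25 26 21
```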

Merge Data Frames by Common Column

You can merge two data frames by matching on the common column using the merge() function.

You just need to specify the two data frames and the name of the common column.

Example: Merge two data frames by common column ‘name’

> df1
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> df2
  name    sex salary
1  Max   Male  25400
2  Amy Female  24800
3  Bob   Male  22000

> merge(df1, df2, by="name")
  name age     city    sex salary
1  Amy  23  Seattle Female  24800
2  Bob  25 New York   Male  22000
3  Max  26  Chicago   Male  25400

The merge() function does not require the rows to occur in the same order.

By default, it also discards rows that appear in only one of the two data frames.
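If you want to keep unmatched rows, merge() accepts the all, all.x, and all.y arguments. The sketch below uses a hypothetical employee "Eve" (and a made-up salary) who appears only in the second data frame, to show the difference.

```r
df1 <- data.frame(name = c("Bob", "Max", "Amy"),
                  age  = c(25, 26, 23))
df2 <- data.frame(name   = c("Max", "Amy", "Eve"),   # "Eve" is hypothetical
                  salary = c(25400, 24800, 21000))

# Default: only names present in both data frames (Amy, Max)
inner <- merge(df1, df2, by = "name")

# all.x = TRUE: keep every row of df1; Bob gets NA for salary
left <- merge(df1, df2, by = "name", all.x = TRUE)

# all = TRUE: keep every row of both data frames
full <- merge(df1, df2, by = "name", all = TRUE)

nrow(inner)   # 2
nrow(left)    # 3
nrow(full)    # 4
```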

Modify a Data Frame

Modifying a data frame is pretty straightforward.

Access the element using the [] operator and assign a new value.

Example:

> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> # modify 2nd column
> df[2] <- c(23,21,22)
> df
  name age     city
1  Bob  23 New York
2  Max  21  Chicago
3  Amy  22  Seattle

> # modify 2nd row
> df[2,] <- list("Eve",24,"Houston")
> df
  name age     city
1  Bob  23 New York
2  Eve  24  Houston
3  Amy  22  Seattle

> # modify single element
> df[3,1] <- "Sam"
> df
  name age     city
1  Bob  23 New York
2  Eve  24  Houston
3  Sam  22  Seattle
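Assignments can also target rows by condition rather than by position, which is often safer when the row order may change. A minimal sketch; the new values (27 and "Boston") are made up for illustration.

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

# Change the age of the employee named "Max"
df$age[df$name == "Max"] <- 27

# Change the city for everyone younger than 25
df[df$age < 25, "city"] <- "Boston"

df$age    # 25 27 23
df$city   # "New York" "Chicago" "Boston"
```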

Editing a Data Frame

R offers two convenient ways to edit the contents of a data frame interactively: the edit() function and the fix() function.

The edit() Function

It opens up the data editor that displays your data frame in a spreadsheet-like window.

Invoke the editor like this:

> df
  name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

> temp <- edit(df)
> df <- temp

Once you are done with the changes, close the editor window.

The updated data frame will be assigned to the temp variable.

If you are happy with the changes, overwrite your data frame with the results.

The fix() Function

There’s another function called fix() which works exactly like edit() except it overwrites the data frame once you close the editor.

Use it if you are confident, because there is no undo.

> df
  name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

> fix(df)

Create an Empty Data Frame

You can create an empty data frame by using the numeric(), character(), and factor() functions to preallocate the columns, and then joining them together with data.frame().

This technique is especially useful when you want to build a data frame row by row.

Example: Create an empty data frame

> df <- data.frame(name=character(),
+                  age=numeric(),
+                  sex=factor(levels=c("M","F")),
+                  stringsAsFactors = FALSE)
> str(df)
'data.frame':	0 obs. of  3 variables:
 $ name: chr 
 $ age : num 
 $ sex : Factor w/ 2 levels "M","F":

You can even create an empty data frame of fixed size if you know the required number of rows in advance.

Example: Create an empty data frame with 3 rows

> N <- 3
> df <- data.frame(name=character(N),
+                  age=numeric(N),
+                  sex=factor(rep(NA, N), levels=c("M","F")),
+                  stringsAsFactors = FALSE)
> df
  name age  sex
1        0 <NA>
2        0 <NA>
3        0 <NA>
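Once the rows are preallocated, they can be filled in one at a time. The following is a minimal sketch; the records list stands in for whatever source the rows actually come from and is purely hypothetical.

```r
N <- 3
df <- data.frame(name = character(N),
                 age  = numeric(N),
                 stringsAsFactors = FALSE)

# Hypothetical input: one named list per employee
records <- list(
  list(name = "Bob", age = 25),
  list(name = "Max", age = 26),
  list(name = "Amy", age = 23)
)

# Fill the preallocated rows one by one
for (i in seq_len(N)) {
  df$name[i] <- records[[i]]$name
  df$age[i]  <- records[[i]]$age
}

df$name   # "Bob" "Max" "Amy"
df$age    # 25 26 23
```

Preallocating avoids growing the data frame with rbind() inside the loop, which copies the whole data frame on every iteration.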

Sorting a Data Frame

You can sort the contents of a data frame by using the order() function and specifying one of the columns as the sort key.

The order() function alone only tells you how to rearrange the rows: it returns row indices, not data values. Combine it with the subsetting operator [] to get the sorted data frame.

By default, sorting is ascending. Prefix a numeric sorting variable with a minus sign to sort in descending order.

Example: Sort the data frame by age

> df
  name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

> # sort in ascending order
> df[order(df$age),]
  name age     city
3  Amy  23  Seattle
1  Bob  25 New York
2  Max  26  Chicago

> # sort in descending order
> df[order(-df$age),]
  name age     city
2  Max  26  Chicago
1  Bob  25 New York
3  Amy  23  Seattle
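The minus sign works only for numeric columns. For character columns, or when sorting by more than one key, order() takes several sort keys and a decreasing argument. A minimal sketch, with a hypothetical fourth employee "Eve" added to create a tie on age:

```r
df <- data.frame(name = c("Bob", "Max", "Amy", "Eve"),   # "Eve" is hypothetical
                 age  = c(25, 26, 23, 25),
                 stringsAsFactors = FALSE)

# Sort by age, breaking ties by name
by_age_name <- df[order(df$age, df$name), ]
by_age_name$name     # Amy Bob Eve Max

# Sort a character column in descending order
by_name_desc <- df[order(df$name, decreasing = TRUE), ]
by_name_desc$name    # Max Eve Bob Amy

# Descending age, ascending name within ties
mixed <- df[order(-df$age, df$name), ]
mixed$name           # Max Bob Eve Amy
```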