What is a Data Frame?
Suppose you want to store the names of your employees, their ages, and their addresses all in one dataset.
The first thing that comes to mind is a matrix. But a matrix can hold only one data type, so you can’t combine all this data in one matrix without converting everything to character data.
So, you need a new data structure to keep all this information together. That data structure is a Data Frame.
Unlike vectors or matrices, data frames have no restriction on the data types of the variables; you can store numeric data, character data, and so on.
In a nutshell, a data frame is a list of equal-length vectors. The easiest way to think of a data frame is as an Excel worksheet.
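The "list of equal-length vectors" idea can be checked directly. The following is a minimal sketch using a small employee data frame made up for illustration; it only calls base R functions.

```r
# Hypothetical employee data, mirroring the example used in this tutorial
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23),
                 stringsAsFactors = FALSE)

is.list(df)         # TRUE: a data frame is built on top of a list
length(df)          # 2: one list element per column
sapply(df, length)  # every column has the same length (3)
sapply(df, class)   # each column keeps its own type
```

Because each column is a separate vector, a character column and a numeric column can sit side by side without any conversion.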
Create a Data Frame
You can create a data frame using the data.frame() function.
# Create a data frame to store employee records
name <- c("Bob", "Max", "Sam")
age <- c(25,26,23)
city <- c("New York", "Chicago", "Seattle")
df <- data.frame(name, age, city)
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Sam 23 Seattle
You can also convert pre-existing structures to a data frame using the as.data.frame() function.
# Convert a list of vectors into a data frame
lst <- list(name = c("Bob", "Max", "Sam"),
age = c(25,26,23),
city = c("New York", "Chicago", "Seattle"))
df <- as.data.frame(lst)
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Sam 23 Seattle
# Convert a matrix into a data frame
m <- matrix(1:12, nrow = 4, ncol = 3)
df <- as.data.frame(m)
df
V1 V2 V3
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
Keeping Characters as Characters
Let’s have a look at the internal structure of the employee data frame.
# Print internal structure of a data frame
df <- data.frame(name, age, city, stringsAsFactors = TRUE)
str(df)
'data.frame': 3 obs. of 3 variables:
$ name: Factor w/ 3 levels "Bob","Max","Sam": 1 2 3
$ age : num 25 26 23
$ city: Factor w/ 3 levels "Chicago","New York",..: 2 1 3
The str() function provides a compact display of the internal structure of any R object.
You may have noticed that the character columns (name and city) were converted to factors. In R versions before 4.0.0 this was the default behavior; since R 4.0.0 the default is stringsAsFactors = FALSE, so character columns stay as characters unless you ask for factors as above. On older versions, you can avoid the conversion by setting the stringsAsFactors argument to FALSE.
df <- data.frame(name, age, city, stringsAsFactors = FALSE)
str(df)
'data.frame': 3 obs. of 3 variables:
$ name: chr "Bob" "Max" "Sam"
$ age : num 25 26 23
$ city: chr "New York" "Chicago" "Seattle"
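If you already have a data frame whose text columns became factors, one way to turn them back into characters is sketched below. It assumes the employee data from this section; the helper variable is_fac is a name chosen here for illustration.

```r
# Build the data frame with factors on purpose
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = TRUE)

# Find the factor columns and convert only those to character
is_fac <- sapply(df, is.factor)
df[is_fac] <- lapply(df[is_fac], as.character)

sapply(df, class)   # name and city are character again, age stays numeric
```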
Naming Data Frame Rows and Columns
Every column in a data frame has a name. Even if you didn’t specify them yourself, R will take the column names from your program variables.
v1 <- c("Bob", "Max", "Sam")
v2 <- c(25,26,23)
v3 <- c("New York", "Chicago", "Seattle")
df <- data.frame(v1, v2, v3)
df
v1 v2 v3
1 Bob 25 New York
2 Max 26 Chicago
3 Sam 23 Seattle
But you can give the columns sensible names by using colnames() or names().
names(df) <- c("name", "age", "city")
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Sam 23 Seattle
By default, the row names of a data frame are just the row numbers, but you can assign meaningful ones with rownames().
rownames(df) <- c("row1", "row2", "row3")
df
name age city
row1 Bob 25 New York
row2 Max 26 Chicago
row3 Sam 23 Seattle
You can use the same colnames() and rownames() functions to print the column names and row names, respectively.
# print column names
colnames(df)
[1] "name" "age" "city"
# print column names (names() gives the same result)
names(df)
[1] "name" "age" "city"
# print row names
rownames(df)
[1] "row1" "row2" "row3"
Subsetting Data Frames Like a List
Data frames possess the characteristics of lists. So, when you subset with a single vector, they behave like lists and will return the selected columns with all rows.
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Sam 23 Seattle
# subset for 1st column
df[1]
name
1 Bob
2 Max
3 Sam
# subset for 1st and 3rd column
df[c(1,3)]
name city
1 Bob New York
2 Max Chicago
3 Sam Seattle
# omit 3rd column
df[-3]
name age
1 Bob 25
2 Max 26
3 Sam 23
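List-style subsetting with single brackets always returns a data frame. To pull out the column itself as a vector, the usual list operators [[ ]] and $ apply as well. A minimal sketch, assuming the same employee data:

```r
df <- data.frame(name = c("Bob", "Max", "Sam"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

class(df["age"])     # "data.frame": single brackets keep the container
class(df[["age"]])   # "numeric": double brackets extract the column vector
df$age               # same vector as df[["age"]]

# A vector is what most functions expect
max(df$age)
```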
Subsetting Data Frames Like a Matrix
Data frames also possess the characteristics of matrices. So, when you subset with two vectors, they behave like matrices and can be subset by row and column.
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
# subset for 2nd row
df[2,]
name age city
2 Max 26 Chicago
# subset for 3rd column
df[,3]
[1] "New York" "Chicago" "Seattle"
# select single element
df[2,3]
[1] "Chicago"
# subset for row 1 and 2 but keep all columns
df[1:2,]
name age city
1 Bob 25 New York
2 Max 26 Chicago
# subset for both rows and columns
df[1:2,2:3]
age city
1 25 New York
2 26 Chicago
# omit 2nd row
df[-2,]
name age city
1 Bob 25 New York
3 Amy 23 Seattle
# omit 2nd row and 3rd column
df[-2,-3]
name age
1 Bob 25
3 Amy 23
# subset for 'city' column
df[,"city"]
[1] "New York" "Chicago" "Seattle"
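Note that selecting a single column with matrix-style indexing returns a plain vector rather than a data frame. If you need to keep the data frame structure, the drop argument of the [ operator controls this. A short sketch, assuming the same employee data:

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

v <- df[, "city"]                 # simplified to a character vector
d <- df[, "city", drop = FALSE]   # stays a one-column data frame

class(v)
class(d)
```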
The subset() function
There’s a more convenient way to subset a data frame: the subset() function.
The function takes three main arguments.
subset(x, subset, select)
x: The data frame you want to subset.
subset: A logical expression that selects rows.
select: A column name, or a vector of column names, to be selected.
To see how the subset() function works, let’s start with a simple data set. Suppose you have a data frame df storing employee records:
df
name age sex city
1 Eve 21 F Chicago
2 Max 24 M Houston
3 Ray 22 M New York
4 Kim 21 F New York
5 Sam 23 M Chicago
# select the employee name
subset(df, select=name)
name
1 Eve
2 Max
3 Ray
4 Kim
5 Sam
# select the employee name and city
subset(df, select=c(name,city))
name city
1 Eve Chicago
2 Max Houston
3 Ray New York
4 Kim New York
5 Sam Chicago
# select all employees from 'Chicago'
subset(df, subset=(city == "Chicago"))
name age sex city
1 Eve 21 F Chicago
5 Sam 23 M Chicago
# select the employee name and city with age > 22
subset(df, select=c(name,city), subset=(age > 22))
name city
2 Max Houston
5 Sam Chicago
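For comparison, the same selection can be written with bracket indexing. The sketch below rebuilds the employee records from this section and checks that both forms give the same result.

```r
df <- data.frame(name = c("Eve", "Max", "Ray", "Kim", "Sam"),
                 age  = c(21, 24, 22, 21, 23),
                 sex  = c("F", "M", "M", "F", "M"),
                 city = c("Chicago", "Houston", "New York", "New York", "Chicago"),
                 stringsAsFactors = FALSE)

# subset(): column names can be used without quotes or df$
a <- subset(df, subset = age > 22, select = c(name, city))

# Bracket indexing: the logical row filter and column names are spelled out
b <- df[df$age > 22, c("name", "city")]

isTRUE(all.equal(a, b))
```

One practical difference: subset() drops rows where the condition evaluates to NA, whereas bracket indexing returns those rows filled with NA.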
Add New Rows and Columns to Data Frame
You can add new columns to a data frame using the cbind() function.
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
sex <- factor(c("M", "M", "F"))
cbind(df, sex)
name age city sex
1 Bob 25 New York M
2 Max 26 Chicago M
3 Amy 23 Seattle F
To add new rows (observations) to a data frame, use the rbind() function.
Warning:
Take extra care when adding new rows to a data frame. Adding elements of the wrong type can change the type of the columns.
For example, if your data frame contains a numeric column and you append a row given as a character vector, R will convert that numeric column to character. Passing the new row as a one-row data frame, as below, keeps each column’s type.
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
row <- data.frame(name = "Sam",
age = 22,
city = "New York")
rbind(df, row)
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
4 Sam 22 New York
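The type change described in the warning is easy to reproduce. The sketch below appends the same new row twice, once as a character vector and once as a one-row data frame, and compares the resulting type of the age column. The variable names bad and good are chosen here for illustration.

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

# Row supplied as a character vector: age is coerced to character
bad <- rbind(df, c("Sam", "22", "New York"))
class(bad$age)

# Row supplied as a one-row data frame: age stays numeric
good <- rbind(df, data.frame(name = "Sam", age = 22, city = "New York",
                             stringsAsFactors = FALSE))
class(good$age)
```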
Combine Two Data Frames
You can combine data frames in one of two ways:
Combine the Columns
Use the cbind() function to combine the columns of two data frames side by side, creating a wider data frame.
df1
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
df2
sex salary
1 Male 22000
2 Male 25400
3 Female 24800
cbind(df1, df2)
name age city sex salary
1 Bob 25 New York Male 22000
2 Max 26 Chicago Male 25400
3 Amy 23 Seattle Female 24800
Make sure the data frames have the same height (number of rows).
Otherwise, R either stops with an error or, when one row count is an exact multiple of the other, recycles the shorter columns, which may or may not be what you want.
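A simple guard is to compare the row counts before combining. The sketch below uses two small data frames invented for illustration, with 3 and 2 rows, and captures the error that cbind() raises in that case instead of letting it stop the script.

```r
df1 <- data.frame(name = c("Bob", "Max", "Amy"), stringsAsFactors = FALSE)
df2 <- data.frame(salary = c(22000, 25400))   # only 2 rows

# Check the heights first
same_height <- nrow(df1) == nrow(df2)
same_height

# Attempt the combination and capture the error message, if any
msg <- tryCatch(
  { cbind(df1, df2); "no error" },
  error = function(e) conditionMessage(e)
)
msg
```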
Combine the Rows
Use the rbind() function to stack the rows of two data frames, creating a taller data frame.
df1
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
df2
name age city
1 Eve 21 Chicago
2 Ray 22 Houston
3 Kim 24 New York
rbind(df1, df2)
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
4 Eve 21 Chicago
5 Ray 22 Houston
6 Kim 24 New York
Make sure the data frames have the same width (same number of columns and same column names).
However, the columns need not be in the same order.
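To illustrate that rbind() matches columns by name rather than by position, here is a minimal sketch with two small data frames (made up for illustration) whose columns appear in a different order.

```r
df1 <- data.frame(name = c("Bob", "Max"), age = c(25, 26),
                  stringsAsFactors = FALSE)
df2 <- data.frame(age = c(21, 22), name = c("Eve", "Ray"),
                  stringsAsFactors = FALSE)

out <- rbind(df1, df2)

names(out)   # column order follows the first data frame
out$name     # names from both data frames, correctly aligned
out$age
```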
Merge Data Frames by Common Column
You can merge two data frames by matching on the common column using the merge() function. You just need to specify the two data frames and the name of the common column.
# Merge two data frames by common column 'name'
df1
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
df2
name sex salary
1 Max Male 25400
2 Amy Female 24800
3 Bob Male 22000
merge(df1, df2, by="name")
name age city sex salary
1 Amy 23 Seattle Female 24800
2 Bob 25 New York Male 22000
3 Max 26 Chicago Male 25400
The merge() function does not require the rows to occur in the same order.
By default, it also discards rows that appear in only one of the data frames; set all = TRUE (or all.x / all.y) to keep them.
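The effect of unmatched rows can be seen with a small sketch. The data below is invented for illustration: Bob appears only in df1 and Eve only in df2. The all and all.x arguments of merge() control whether such rows are kept.

```r
df1 <- data.frame(name = c("Bob", "Max", "Amy"), age = c(25, 26, 23),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Max", "Amy", "Eve"),
                  salary = c(25400, 24800, 21000),
                  stringsAsFactors = FALSE)

inner <- merge(df1, df2, by = "name")                # only Max and Amy match
outer <- merge(df1, df2, by = "name", all = TRUE)    # keep every row, fill gaps with NA
left  <- merge(df1, df2, by = "name", all.x = TRUE)  # keep every row of df1

nrow(inner)
nrow(outer)
nrow(left)
```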
Modify a Data Frame
Modifying a data frame is pretty straightforward. Access the element using the [] operator and simply assign a new value.
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
# modify 2nd column
df[2] <- c(23,21,22)
df
name age city
1 Bob 23 New York
2 Max 21 Chicago
3 Amy 22 Seattle
# modify 2nd row
df[2,] <- list("Eve",24,"Houston")
df
name age city
1 Bob 23 New York
2 Eve 24 Houston
3 Amy 22 Seattle
# modify single element
df[3,1] <- "Sam"
df
name age city
1 Bob 23 New York
2 Eve 24 Houston
3 Sam 22 Seattle
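Assignments can also target rows by condition instead of by position, which is often safer when the row order may change. A minimal sketch, assuming the original employee data; the new values are arbitrary.

```r
df <- data.frame(name = c("Bob", "Max", "Amy"),
                 age  = c(25, 26, 23),
                 city = c("New York", "Chicago", "Seattle"),
                 stringsAsFactors = FALSE)

# modify the age of the row(s) where name is "Max"
df[df$name == "Max", "age"] <- 30

# modify a whole column at once using vectorized arithmetic
df$age <- df$age + 1

df
```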
Editing a Data Frame
R offers convenient ways to edit the data frame contents: the edit() function and the fix() function.
The edit() Function
It opens up the data editor that displays your data frame in a spreadsheet-like window. Invoke the editor like this:
df
name age city sex salary
1 Bob 25 New York Male 22000
2 Max 26 Chicago Male 25400
3 Amy 23 Seattle Female 24800
temp <- edit(df)
df <- temp

Once you are done with the changes, close the editor window. The updated data frame will be assigned to the temp variable. If you are happy with the changes, overwrite your data frame with the results.
The fix() Function
There’s another function called fix(), which works exactly like edit() except that it overwrites the data frame once you close the editor.
Use it only if you are confident, because there is no undo.
df
name age city sex salary
1 Bob 25 New York Male 22000
2 Max 26 Chicago Male 25400
3 Amy 23 Seattle Female 24800
fix(df)

Create an Empty Data Frame
You can create an empty data frame using the numeric(), character(), and factor() functions to preallocate the columns; then join them together using data.frame().
This technique is especially useful when you want to build a data frame row by row.
df <- data.frame(name=character(),
age=numeric(),
sex=factor(levels=c("M","F")),
stringsAsFactors = FALSE)
str(df)
'data.frame': 0 obs. of 3 variables:
$ name: chr
$ age : num
$ sex : Factor w/ 2 levels "M","F":
You can even create an empty data frame of fixed size if you know the required number of rows in advance.
# Create an empty data frame with 3 rows
N <- 3
df <- data.frame(name=character(N),
age=numeric(N),
sex=factor(rep(NA, N), levels=c("M","F")),
stringsAsFactors = FALSE)
df
name age sex
1 0 <NA>
2 0 <NA>
3 0 <NA>
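Once the rows are preallocated, they can be filled in one at a time. The sketch below assumes the records arrive as a list of (name, age) pairs; the records object and its contents are hypothetical.

```r
N <- 3
df <- data.frame(name = character(N),
                 age  = numeric(N),
                 stringsAsFactors = FALSE)

# Hypothetical incoming records, one per employee
records <- list(list("Bob", 25), list("Max", 26), list("Amy", 23))

# Fill the preallocated rows one by one
for (i in seq_len(N)) {
  df$name[i] <- records[[i]][[1]]
  df$age[i]  <- records[[i]][[2]]
}

df
```

Filling preallocated rows avoids growing the data frame with rbind() inside the loop, which copies the whole data frame on every iteration.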
Sorting a Data Frame
You can sort the contents of a data frame by using the order() function and specifying one of the columns as the sort key.
The order() function alone only tells you how to rearrange the rows: it returns the row indices in sorted order, not the data values. Combine it with the subsetting operator [] to get the sorted data frame.
By default, sorting is ascending. Prefix a numeric sorting variable with a minus sign to sort in descending order.
# Sort the data frame by age
df
name age city
1 Bob 25 New York
2 Max 26 Chicago
3 Amy 23 Seattle
# sort in ascending order
df[order(df$age),]
name age city
3 Amy 23 Seattle
1 Bob 25 New York
2 Max 26 Chicago
# sort in descending order
df[order(-df$age),]
name age city
2 Max 26 Chicago
1 Bob 25 New York
3 Amy 23 Seattle
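The minus sign only works for numeric keys. For character keys, or for sorting by more than one column, order() accepts several keys and a decreasing argument. A minimal sketch, using a slightly extended employee table made up for illustration (Eve is added so that two rows share the same age):

```r
df <- data.frame(name = c("Bob", "Max", "Amy", "Eve"),
                 age  = c(25, 26, 23, 25),
                 stringsAsFactors = FALSE)

# sort by age, breaking ties by name
by_age_name <- df[order(df$age, df$name), ]
by_age_name$name    # "Amy" "Bob" "Eve" "Max"

# sort a character column in descending order
by_name_desc <- df[order(df$name, decreasing = TRUE), ]
by_name_desc$name   # "Max" "Eve" "Bob" "Amy"

# age descending, then name ascending
mixed <- df[order(-df$age, df$name), ]
mixed$name          # "Max" "Bob" "Eve" "Amy"
```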