## What is a Data Frame?

Suppose you want to store the names of your employees, their age and addresses all in one dataset.

The first structure that comes to mind is a matrix. But a matrix can hold only one data type, so you can’t combine all this data in one matrix without converting everything to character data.

So, you need a new data structure to keep all this information together. That data structure is a Data Frame.

Unlike vectors or matrices, data frames do not force all the values into one data type: each column can have its own type, so you can store numeric data in one column, character data in another, and so on.

In a nutshell, a data frame is a list of equal-length vectors. The easiest way to think of a data frame is as an Excel worksheet.
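The list-of-vectors view can be checked directly in R. The following is a minimal sketch; the `employees` data frame and its values are made up for illustration.

```r
# Hypothetical sample data, invented for this illustration
employees <- data.frame(
  name = c("Bob", "Max", "Sam"),
  age  = c(25, 26, 23),
  stringsAsFactors = FALSE
)

is.list(employees)         # TRUE: a data frame is built on top of a list
length(employees)          # 2: one list element per column
sapply(employees, class)   # each column keeps its own type
sapply(employees, length)  # every column has the same length (3)
```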

## Create a Data Frame

You can create a data frame using the `data.frame()` function.

``````
# Create a data frame to store employee records
name <- c("Bob", "Max", "Sam")
age <- c(25,26,23)
city <- c("New York", "Chicago", "Seattle")

df <- data.frame(name, age, city)
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle
``````

You can also convert pre-existing structures to a data frame using the `as.data.frame()` function.

``````
# Convert a list of vectors into a data frame
lst <- list(name = c("Bob", "Max", "Sam"),
            age = c(25, 26, 23),
            city = c("New York", "Chicago", "Seattle"))
df <- as.data.frame(lst)
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle
``````
``````
# Convert a matrix into a data frame
m <- matrix(1:12, nrow = 4, ncol = 3)
df <- as.data.frame(m)
df
V1 V2 V3
1  1  5  9
2  2  6 10
3  3  7 11
4  4  8 12
``````
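The default `V1`, `V2`, `V3` names come from the matrix having no column names. As a small sketch (the names `a`, `b`, `c` and the variable `df_m` are arbitrary placeholders), naming the matrix columns first carries the names over to the data frame:

```r
m <- matrix(1:12, nrow = 4, ncol = 3)

# Hypothetical column names, chosen only for this example
colnames(m) <- c("a", "b", "c")

df_m <- as.data.frame(m)
names(df_m)  # the matrix column names are reused
```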

## Keeping Characters as Characters

Let’s have a look at the internal structure of the employee data frame we created with `data.frame(name, age, city)`.

``````
# Print internal structure of a data frame
str(df)
'data.frame':	3 obs. of  3 variables:
$ name: Factor w/ 3 levels "Bob","Max","Sam": 1 2 3
$ age : num  25 26 23
$ city: Factor w/ 3 levels "Chicago","New York",..: 2 1 3
``````

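The `str()` function provides a compact display of the internal structure of any R object.

You may have noticed that the character columns (`name` and `city`) were converted to factors. R versions before 4.0.0 do this by default; since R 4.0.0 the default is `stringsAsFactors = FALSE`, so character columns stay as characters. On older versions, or to be explicit, you can avoid the conversion by setting the extra argument `stringsAsFactors` to `FALSE`.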

``````
df <- data.frame(name, age, city, stringsAsFactors = FALSE)
str(df)
'data.frame':	3 obs. of  3 variables:
$ name: chr  "Bob" "Max" "Sam"
$ age : num  25 26 23
$ city: chr  "New York" "Chicago" "Seattle"
``````
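If a data frame already contains factor columns, they can be converted back afterwards. The sketch below is one possible approach, not the only one; `df_f` and `is_fct` are names introduced here for illustration, and `stringsAsFactors = TRUE` is set explicitly so the example behaves the same on any R version.

```r
name <- c("Bob", "Max", "Sam")
age  <- c(25, 26, 23)
city <- c("New York", "Chicago", "Seattle")

# Force the factor conversion so the result does not depend on the R version
df_f <- data.frame(name, age, city, stringsAsFactors = TRUE)

# Find the factor columns and convert each one back to character
is_fct <- sapply(df_f, is.factor)
df_f[is_fct] <- lapply(df_f[is_fct], as.character)

sapply(df_f, class)  # name and city are character again, age is still numeric
```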

## Naming Data Frame Rows and Columns

Every column in a data frame has a name. Even if you didn’t specify them yourself, R will take the column names from your program variables.

``````
v1 <- c("Bob", "Max", "Sam")
v2 <- c(25,26,23)
v3 <- c("New York", "Chicago", "Seattle")

df <- data.frame(v1, v2, v3)
df
v1 v2       v3
1 Bob 25 New York
2 Max 26  Chicago
3 Sam 23  Seattle
``````

But you can give columns a sensible name by using `colnames()` or `names()`.

``````
names(df) <- c("name", "age", "city")
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle
``````

By default, the row names of a data frame are just the row numbers, but you can replace them with your own using `rownames()`.

``````
rownames(df) <- c("row1", "row2", "row3")
df
name age     city
row1  Bob  25 New York
row2  Max  26  Chicago
row3  Sam  23  Seattle
``````

You can use the same `colnames()`, `names()`, and `rownames()` functions to print the column names and row names, respectively.

``````
# print column names
colnames(df)
[1] "name" "age"  "city"

# names() returns the same column names
names(df)
[1] "name" "age"  "city"

# print row names
rownames(df)
[1] "row1" "row2" "row3"
``````
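The same functions can also change a single name or undo custom row names. This is a small sketch using a made-up data frame `df_n`:

```r
# Hypothetical data frame for illustration
df_n <- data.frame(v1 = c("Bob", "Max", "Sam"),
                   v2 = c(25, 26, 23),
                   stringsAsFactors = FALSE)

# Rename only the column currently called "v2"
names(df_n)[names(df_n) == "v2"] <- "age"
names(df_n)

# Set custom row names, then reset them to the default 1, 2, 3
rownames(df_n) <- c("row1", "row2", "row3")
rownames(df_n) <- NULL
rownames(df_n)
```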

## Subsetting Data Frames Like a List

Data frames possess the characteristics of lists. So, when you subset with a single vector, they behave like lists and will return the selected columns with all rows.

``````
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Sam  23  Seattle

# subset for 1st column
df[1]
name
1  Bob
2  Max
3  Sam

# subset for 1st and 3rd column
df[c(1,3)]
name     city
1  Bob New York
2  Max  Chicago
3  Sam  Seattle

# omit 3rd column
df[-3]
name age
1  Bob  25
2  Max  26
3  Sam  23
``````
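List-style subsetting with `[` always returns a data frame. To pull out the column itself as a vector, the list operators `[[` and `$` can be used. A brief sketch, with `df_s` as an illustrative data frame:

```r
# Hypothetical data frame for illustration
df_s <- data.frame(name = c("Bob", "Max", "Sam"),
                   age  = c(25, 26, 23),
                   stringsAsFactors = FALSE)

class(df_s["age"])    # "data.frame": single brackets keep the data frame
class(df_s[["age"]])  # "numeric": double brackets extract the vector
df_s$age              # the $ operator extracts the same vector by name

# A vector can be passed straight to functions such as mean()
mean(df_s$age)
```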

## Subsetting Data Frames Like a Matrix

Data frames also possess the characteristics of matrices. So, when you subset with two vectors, they behave like matrices and can be subset by row and column.

``````
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

# subset for 2nd row
df[2,]
name age    city
2  Max  26 Chicago

# subset for 3rd column
df[,3]
[1] "New York" "Chicago"  "Seattle"

# select single element
df[2,3]
[1] "Chicago"

# subset for row 1 and 2 but keep all columns
df[1:2,]
name age     city
1  Bob  25 New York
2  Max  26  Chicago

# subset for both rows and columns
df[1:2,2:3]
age     city
1  25 New York
2  26  Chicago

# omit 2nd row
df[-2,]
name age     city
1  Bob  25 New York
3  Amy  23  Seattle

# omit 2nd row and 3rd column
df[-2,-3]
name age
1  Bob  25
3  Amy  23

# subset for 'city' column
df[,"city"]
[1] "New York" "Chicago"  "Seattle"
``````
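Two details of matrix-style subsetting are worth a short sketch: selecting a single column drops the result to a vector unless `drop = FALSE` is given, and the row index can be a logical condition. The data frame `df_m` below is made up for illustration.

```r
# Hypothetical data frame for illustration
df_m <- data.frame(name = c("Bob", "Max", "Amy"),
                   age  = c(25, 26, 23),
                   city = c("New York", "Chicago", "Seattle"),
                   stringsAsFactors = FALSE)

df_m[, "city"]                # a plain character vector
df_m[, "city", drop = FALSE]  # still a one-column data frame

# Keep only the rows where age is greater than 24
older <- df_m[df_m$age > 24, ]
older
```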

## The subset() function

There’s a more convenient way to subset a data frame using the `subset()` function.

The function takes three main arguments.

`subset(x, subset, select)`

x: The data frame you want to subset.

subset: A logical expression that selects rows.

select: A column name, or a vector of column names, to be selected.

The examples below pass `subset` and `select` by name, so their order in the call does not matter.

To see how the `subset()` function works, let’s start with a simple data set. Suppose you have a data frame `df` storing employee records:

``````
df
name age sex     city
1  Eve  21   F  Chicago
2  Max  24   M  Houston
3  Ray  22   M New York
4  Kim  21   F New York
5  Sam  23   M  Chicago

# select the employee name
subset(df, select=name)
name
1  Eve
2  Max
3  Ray
4  Kim
5  Sam

# select the employee name and city
subset(df, select=c(name,city))
name     city
1  Eve  Chicago
2  Max  Houston
3  Ray New York
4  Kim New York
5  Sam  Chicago

# select all employees from 'Chicago'
subset(df, subset=(city == "Chicago"))
name age sex    city
1  Eve  21   F Chicago
5  Sam  23   M Chicago

# select the employee name and city with age > 22
subset(df, select=c(name,city), subset=(age > 22))
name    city
2  Max Houston
5  Sam Chicago
``````
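For comparison, the last query can also be written with bracket indexing. The sketch below rebuilds the employee records under the illustrative name `emp`. One difference to keep in mind: `subset()` drops rows where the condition is `NA`, while bracket indexing returns them as rows of `NA`.

```r
# Employee records from the example above, stored under a hypothetical name
emp <- data.frame(name = c("Eve", "Max", "Ray", "Kim", "Sam"),
                  age  = c(21, 24, 22, 21, 23),
                  sex  = c("F", "M", "M", "F", "M"),
                  city = c("Chicago", "Houston", "New York", "New York", "Chicago"),
                  stringsAsFactors = FALSE)

# The same selection written two ways
with_subset   <- subset(emp, subset = age > 22, select = c(name, city))
with_brackets <- emp[emp$age > 22, c("name", "city")]

# Compare the two results
all.equal(with_subset, with_brackets)
```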

## Add New Rows and Columns to Data Frame

You can add new columns to a data frame using the `cbind()` function.

``````
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

sex <- factor(c("M", "M", "F"))
cbind(df, sex)
name age     city sex
1  Bob  25 New York   M
2  Max  26  Chicago   M
3  Amy  23  Seattle   F
``````
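Note that `cbind()` returns a new data frame and leaves the original untouched. A short sketch of the difference, using the illustrative names `df_c` and `wider`, along with the `$` assignment form that adds a column in place:

```r
# Hypothetical data frame for illustration
df_c <- data.frame(name = c("Bob", "Max", "Amy"),
                   age  = c(25, 26, 23),
                   stringsAsFactors = FALSE)

# cbind() returns a new, wider data frame
wider <- cbind(df_c, sex = factor(c("M", "M", "F")))
ncol(df_c)   # 2: the original is unchanged
ncol(wider)  # 3

# Assigning to a new name with $ adds the column to df_c itself
df_c$sex <- factor(c("M", "M", "F"))
names(df_c)
```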

To add new rows (observations) to a data frame, use the `rbind()` function.

Warning:

Take extra care when adding new rows to a data frame. Adding elements of the wrong type can change the type of the columns.

For example, if your data frame contains a numeric column and you add the new row as a character vector, R will convert that column to character, leaving every column as character type. Passing the new row as a data frame (or a list) keeps each value in its own type.

``````
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

row <- data.frame(name = "Sam",
                  age = 22,
                  city = "New York")
rbind(df, row)
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle
4  Sam  22 New York
``````
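The warning above can be reproduced with a short sketch. The names `df_r`, `bad`, and `good` are introduced here for illustration only.

```r
# Hypothetical data frame for illustration
df_r <- data.frame(name = c("Bob", "Max", "Amy"),
                   age  = c(25, 26, 23),
                   city = c("New York", "Chicago", "Seattle"),
                   stringsAsFactors = FALSE)

# c() coerces 22 to "22", so the whole new row is character
bad <- rbind(df_r, c("Sam", 22, "New York"))
class(bad$age)   # "character": the numeric column was converted

# A one-row data frame keeps each value in its own type
good <- rbind(df_r, data.frame(name = "Sam", age = 22, city = "New York",
                               stringsAsFactors = FALSE))
class(good$age)  # "numeric"
```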

## Combine Two Data Frames

You can combine data frames in one of two ways:

### Combine the Columns

Use the `cbind()` function to combine the columns of two data frames side by side, creating a wider data frame.

``````
df1
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

df2
sex salary
1   Male  22000
2   Male  25400
3 Female  24800

cbind(df1, df2)
name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800
``````

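Make sure the data frames have the same height (number of rows).

Otherwise, R will try to apply the Recycling Rule to extend the shorter columns, which may or may not be what you want. Recycling only works when the larger number of rows is a whole multiple of the smaller one; if it is not, `cbind()` stops with an error.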
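A small sketch of both outcomes, using made-up data frames (`four`, `two`, `three`) and `tryCatch()` to capture the error message instead of stopping the script:

```r
# Hypothetical data frames with 4, 2, and 3 rows
four  <- data.frame(id = 1:4)
two   <- data.frame(flag = c("a", "b"), stringsAsFactors = FALSE)
three <- data.frame(id = 1:3)

# 4 is a whole multiple of 2, so the shorter column is recycled
wide <- cbind(four, two)
wide$flag  # "a" "b" "a" "b"

# 3 is not a multiple of 2, so cbind() fails; capture the message as text
msg <- tryCatch(cbind(three, two),
                error = function(e) conditionMessage(e))
msg
```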

### Combine the Rows

Use the `rbind()` function to stack the rows of two data frames, creating a taller data frame.

``````
df1
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

df2
name age     city
1  Eve  21  Chicago
2  Ray  22  Houston
3  Kim  24 New York

rbind(df1, df2)
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle
4  Eve  21  Chicago
5  Ray  22  Houston
6  Kim  24 New York
``````

Make sure the data frames have the same width (same number of columns and same column names).

However, the columns need not be in the same order.
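The following sketch illustrates the point about column order: `rbind()` matches data frame columns by name. The data frames `top` and `bottom` are made up for this example.

```r
# Hypothetical data frames; note the swapped column order in `bottom`
top    <- data.frame(name = c("Bob", "Max"), age = c(25, 26),
                     stringsAsFactors = FALSE)
bottom <- data.frame(age = 21, name = "Eve",
                     stringsAsFactors = FALSE)

# Columns are matched by name, and the result uses the order of `top`
stacked <- rbind(top, bottom)
stacked
```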

## Merge Data Frames by Common Column

You can merge two data frames by matching on the common column using the `merge()` function. You just need to specify the two data frames and the name of the common column.

``````
# Merge two data frames by common column 'name'
df1
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

df2
name    sex salary
1  Max   Male  25400
2  Amy Female  24800
3  Bob   Male  22000

merge(df1, df2, by="name")
name age     city    sex salary
1  Amy  23  Seattle Female  24800
2  Bob  25 New York   Male  22000
3  Max  26  Chicago   Male  25400
``````

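The `merge()` function does not require the rows to occur in the same order.

By default, it also discards rows that appear in only one data frame or the other. Set `all = TRUE` (or `all.x` / `all.y` for one side only) to keep them, with `NA` filling the missing values.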
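To see how unmatched rows are treated, here is a brief sketch with two made-up data frames, `left` and `right`, that share only some names:

```r
# Hypothetical data: Bob appears only in `left`, Eve only in `right`
left  <- data.frame(name = c("Bob", "Max", "Amy"),
                    age  = c(25, 26, 23),
                    stringsAsFactors = FALSE)
right <- data.frame(name   = c("Max", "Amy", "Eve"),
                    salary = c(25400, 24800, 21000),
                    stringsAsFactors = FALSE)

inner    <- merge(left, right, by = "name")                # matched rows only
outer    <- merge(left, right, by = "name", all = TRUE)    # all rows, NA where missing
left_all <- merge(left, right, by = "name", all.x = TRUE)  # every row of `left`

nrow(inner)     # 2
nrow(outer)     # 4
nrow(left_all)  # 3
```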

## Modify a Data Frame

Modifying a data frame is pretty straightforward. Access the element using the `[]` operator and simply assign a new value.

``````
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

# modify 2nd column
df[2] <- c(23,21,22)
df
name age     city
1  Bob  23 New York
2  Max  21  Chicago
3  Amy  22  Seattle

# modify 2nd row
df[2,] <- list("Eve",24,"Houston")
df
name age     city
1  Bob  23 New York
2  Eve  24  Houston
3  Amy  22  Seattle

# modify single element
df[3,1] <- "Sam"
df
name age     city
1  Bob  23 New York
2  Eve  24  Houston
3  Sam  22  Seattle
``````
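Assignments can also target rows by condition rather than by position. A minimal sketch, assuming a made-up data frame `df_u`:

```r
# Hypothetical data frame for illustration
df_u <- data.frame(name = c("Bob", "Max", "Amy"),
                   age  = c(25, 26, 23),
                   city = c("New York", "Chicago", "Seattle"),
                   stringsAsFactors = FALSE)

# Change the age of the row whose name is "Max"
df_u$age[df_u$name == "Max"] <- 27

# Change the city for every row that matches a condition
df_u[df_u$city == "Seattle", "city"] <- "Portland"

# Update a whole column at once
df_u$age <- df_u$age + 1
df_u
```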

## Editing a Data Frame

R offers two convenient ways to edit the contents of a data frame interactively: the `edit()` function and the `fix()` function.

### The edit() Function

It opens up the data editor that displays your data frame in a spreadsheet-like window. Invoke the editor like this:

``````
df
name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

temp <- edit(df)
df <- temp
``````

Once you are done with the changes, close the editor window. The updated data frame will be assigned to the `temp` variable. If you are happy with the changes, overwrite your data frame with the results.

### The fix() Function

There’s another function called `fix()` which works exactly like `edit()` except it overwrites the data frame once you close the editor.

Use it if you are confident, because there is no undo.

``````
df
name age     city    sex salary
1  Bob  25 New York   Male  22000
2  Max  26  Chicago   Male  25400
3  Amy  23  Seattle Female  24800

fix(df)
``````

## Create an Empty Data Frame

You can create an empty data frame by using the `character()`, `numeric()`, and `factor()` functions to create zero-length columns, then joining them together using `data.frame()`.

This technique is especially useful when you want to build a data frame row by row.

``````
df <- data.frame(name=character(),
                 age=numeric(),
                 sex=factor(levels=c("M","F")),
                 stringsAsFactors = FALSE)
str(df)
'data.frame':	0 obs. of  3 variables:
$ name: chr
$ age : num
$ sex : Factor w/ 2 levels "M","F":
``````

You can even preallocate a data frame of fixed size if you know the required number of rows in advance. The columns start out filled with placeholder values (empty strings, zeros, and `NA`).

``````
# Create a preallocated data frame with 3 rows
N <- 3
df <- data.frame(name=character(N),
                 age=numeric(N),
                 sex=factor(rep(NA, N), levels=c("M","F")),
                 stringsAsFactors = FALSE)
df
name age  sex
1        0 <NA>
2        0 <NA>
3        0 <NA>
``````
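Once the rows are preallocated, they can be filled in one at a time. The loop below is a sketch under the assumption that the incoming values are available as vectors; `records`, `new_names`, and `new_ages` are illustrative names.

```r
N <- 3

# Preallocate the rows with placeholder values
records <- data.frame(name = character(N),
                      age  = numeric(N),
                      stringsAsFactors = FALSE)

# Hypothetical incoming values, one per row
new_names <- c("Bob", "Max", "Amy")
new_ages  <- c(25, 26, 23)

# Fill the data frame row by row
for (i in seq_len(N)) {
  records$name[i] <- new_names[i]
  records$age[i]  <- new_ages[i]
}
records
```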

## Sorting a Data Frame

You can sort the contents of a data frame by using the `order()` function and specifying one of the columns as the sort key.

The `order()` function alone does not return data values. It returns the row positions in sorted order, which tells you how to rearrange the rows. Combine it with the subsetting operator `[]` to get the sorted data frame.

By default, sorting is ascending. Prefix a numeric sorting variable with a minus sign to sort in descending order; for other types, pass `decreasing = TRUE` to `order()`.

``````
# Sort the data frame by age
df
name age     city
1  Bob  25 New York
2  Max  26  Chicago
3  Amy  23  Seattle

# sort in ascending order
df[order(df$age),]
name age     city
3  Amy  23  Seattle
1  Bob  25 New York
2  Max  26  Chicago

# sort in descending order
df[order(-df$age),]
name age     city
2  Max  26  Chicago
1  Bob  25 New York
3  Amy  23  Seattle
``````
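`order()` also accepts several sort keys, with later keys breaking ties in earlier ones. The sketch below uses a made-up data frame `df_o` to sort by city and then by descending age, and to sort a character column in descending order with `decreasing = TRUE`.

```r
# Hypothetical data frame for illustration
df_o <- data.frame(name = c("Bob", "Max", "Amy", "Eve"),
                   age  = c(25, 26, 23, 28),
                   city = c("New York", "Chicago", "Seattle", "Chicago"),
                   stringsAsFactors = FALSE)

# Sort by city (ascending), then by age (descending) within each city
by_city <- df_o[order(df_o$city, -df_o$age), ]
by_city$name  # "Eve" "Max" "Bob" "Amy"

# The minus sign does not work on character data; use decreasing = TRUE
by_name_desc <- df_o[order(df_o$name, decreasing = TRUE), ]
by_name_desc$name  # "Max" "Eve" "Bob" "Amy"
```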