What is a Factor?
In real-world problems, you often encounter data that can be classified into categories.
For example, suppose a survey was conducted of a group of seven individuals, who were asked to identify their hair color and gender.
The result might appear as follows:
Name | Hair color | Gender |
Amy | Blonde | Female |
Bob | Black | Male |
Eve | Black | Female |
Kim | Red | Female |
Max | Blonde | Male |
Ray | Brown | Male |
Sam | Black | Male |
Here, hair color and gender are examples of categorical data. To store such categorical data, R has a special data structure called a factor.
A factor is a vector whose elements are drawn from a fixed set of values. The different values that the factor can take are called levels.
Create a Factor
In R, you can create a factor with the factor() function.
# Factor storing hair color values
hcolors <- c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black")
f <- factor(hcolors)
f
[1] Blonde Black Black Red Blonde Brown Black
Levels: Black Blonde Brown Red
# Factor storing gender values
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male Female Female Male Male Male
Levels: Female Male
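To see what a factor stores internally, the short sketch below inspects the hair color factor from the example above. It uses only base R functions; the comments describe what each call returns.

```r
# Rebuild the hair color factor from the example above
hcolors <- c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black")
f <- factor(hcolors)

class(f)         # "factor"
typeof(f)        # "integer": each element is stored as an integer code
as.integer(f)    # 2 1 1 4 2 3 1 (positions in the levels vector)
levels(f)        # the labels that the integer codes point to
as.character(f)  # converts back to the original strings
```

Storing integer codes plus a small table of labels is what lets R treat the values as categories rather than as free text.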
Factor Levels
A factor looks like a vector, but it has special properties. Levels are one of them.
Notice that when you print the factor, R displays the distinct levels below the factor. R keeps track of all the possible values in a vector, and each value is called a level of the associated factor.
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male Female Female Male Male Male
Levels: Female Male
The levels() function shows all the levels of a factor.
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
levels(f)
[1] "Female" "Male"
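As a small follow-up sketch, nlevels() counts the levels, and for a character vector the default levels are simply the sorted unique values. Both functions are base R.

```r
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)

# Number of distinct levels
nlevels(f)   # 2

# By default, the levels of a character vector are its sorted unique values
identical(levels(f), sort(unique(gender)))   # TRUE
```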
Specifying Levels Explicitly
If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels.
Consider the following example of a vector consisting of directions:
# Factor with missing level "South"
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
f
[1] North West North East North West East
Levels: East North West
Notice that the levels of your new factor do not contain the value “South”. So, R thinks that North, West, and East are the only possible levels. However, in practice, it makes sense to have all the possible directions as levels of your factor.
To add all the possible levels explicitly, specify the levels argument of factor().
# Specifying all the possible levels explicitly
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions,
            levels = c("North", "East", "South", "West"))
f
[1] North West North East North West East
Levels: North East South West
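One practical effect of declaring every level is sketched below: table() reports a zero count for the unobserved level, and a value that is not among the declared levels becomes NA. The variable all_dirs and the value "Up" are made up for illustration.

```r
directions <- c("North", "West", "North", "East", "North", "West", "East")
all_dirs <- c("North", "East", "South", "West")  # hypothetical helper vector

f <- factor(directions, levels = all_dirs)

# "South" appears in the counts with a value of 0
table(f)

# A value outside the declared levels ("Up" here) is stored as NA
g <- factor(c("North", "Up"), levels = all_dirs)
is.na(g)   # FALSE TRUE
```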
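You can also add a level to an existing factor with the levels() function. Append the new level to the current ones. Assigning a reordered vector instead would rename the existing levels by position and silently change the data (North would become East, for example).
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
levels(f) <- c(levels(f), "South")
f
[1] North West North East North West East
Levels: East North West South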
Factor Labels
R lets you assign abbreviated names for the levels. You can do this by specifying the labels argument of factor().
directions <- c("North", "West", "South", "East", "West", "North", "South")
f <- factor(directions,
            levels = c("North", "East", "South", "West"),
            labels = c("N", "E", "S", "W"))
f
[1] N W S E W N S
Levels: N E S W
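Labels are also handy in the other direction, when data arrives as numeric codes and you want readable names. The sketch below assumes a made-up coding scheme (1 = Low, 2 = Medium, 3 = High) that is not part of the examples above.

```r
# Hypothetical numeric codes, e.g. from a survey export
codes <- c(1, 2, 2, 1, 3)

# levels lists the raw values; labels gives the name for each, in the same order
f <- factor(codes,
            levels = c(1, 2, 3),
            labels = c("Low", "Medium", "High"))

as.character(f)   # "Low" "Medium" "Medium" "Low" "High"
```

The levels and labels vectors are matched by position, so they must be the same length and in the same order.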
Ordered Factors
Sometimes data has a natural order among its values.
For example, sports analysts use a three-point scale to determine how well a sports team is competing: loss < tie < win.
In market research, it’s very common to use a five-point scale to measure perceptions: strongly disagree < disagree < neutral < agree < strongly agree.
Data that can be placed in order or on a scale like this is known as ordinal data.
In R, ordinal data is stored in a special kind of factor called an ordered factor.
To create an ordered factor, use the factor() function with the argument ordered = TRUE. If you do not supply levels, R sorts them alphabetically, which happens to give the right order here.
# Create ordinal levels
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record,
            ordered = TRUE)
f
[1] win tie loss tie loss win win
Levels: loss < tie < win
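For a scale such as the five-point one mentioned earlier, alphabetical sorting would not give the intended order, so the levels need to be passed explicitly. The sketch below uses made-up responses and a hypothetical helper vector likert_levels, and shows that an ordered factor supports comparisons.

```r
# Hypothetical survey responses
responses <- c("agree", "neutral", "strongly agree", "disagree", "agree")

# The intended order, from lowest to highest
likert_levels <- c("strongly disagree", "disagree", "neutral",
                   "agree", "strongly agree")

f <- factor(responses, levels = likert_levels, ordered = TRUE)

is.ordered(f)          # TRUE
f >= "agree"           # TRUE FALSE TRUE FALSE TRUE
as.character(max(f))   # "strongly agree"
```

Comparisons like these raise a warning and return NA on an unordered factor, which is the main practical difference between the two kinds.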
You can also reverse the order of the levels using the rev() function.
# Reverse the order of levels
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record,
            ordered = TRUE,
            levels = rev(levels(f)))
f
[1] win tie loss tie loss win win
Levels: win < tie < loss
Recode Factor Levels
Suppose you have dining experience data with three levels (good, average and bad), and you want to recode the levels to happy, neutral and unhappy. You can do this with the revalue() function from the plyr package.
experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)
f
[1] good average bad good bad good average
Levels: average bad good
plyr::revalue(f, c("good"="happy", "average"="neutral", "bad"="unhappy"))
[1] happy neutral unhappy happy unhappy happy neutral
Levels: neutral unhappy happy
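If you would rather not depend on plyr, base R offers a similar recoding: assigning a named list to levels(), where each name is the new level and each value is the old level it replaces. This is a sketch of that alternative, not the method used above.

```r
experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)

# Names are the new levels; values are the old levels they replace
levels(f) <- list(happy = "good", neutral = "average", unhappy = "bad")

as.character(f)
# "happy" "neutral" "unhappy" "happy" "unhappy" "happy" "neutral"
```

Unlike assigning a plain character vector, the list form matches levels by name, so it does not depend on the current order of the levels.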
Drop Unused Factor Levels
If you have no observations in one of the levels, you can drop it using the droplevels() function.
# Drop unused level "tie"
record <- c("win", "loss", "loss", "win", "loss", "win")
f <- factor(record,
            levels = c("loss", "tie", "win"))
f
[1] win loss loss win loss win
Levels: loss tie win
droplevels(f)
[1] win loss loss win loss win
Levels: loss win
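Unused levels often appear after subsetting, because a subset keeps every level of the original factor. The sketch below, with a made-up record vector and the hypothetical name no_ties, shows the level surviving the filter and then being dropped.

```r
record <- c("win", "tie", "loss", "win", "loss", "win")
f <- factor(record)

# Filter out the ties; the "tie" level is still attached to the result
no_ties <- f[f != "tie"]
levels(no_ties)               # "loss" "tie" "win"

# droplevels() removes the level that no longer has observations
levels(droplevels(no_ties))   # "loss" "win"
```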
Summarizing a Factor
The summary() function gives you a quick overview of the contents of a factor.
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
summary(f)
Female Male
3 4
The table() function tabulates the number of observations in each level.
table(f)
f
Female Male
3 4
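Counts are often easier to compare as proportions. As a brief sketch using base R, prop.table() converts the result of table() into shares of the total; the variable names counts and shares are chosen here for illustration.

```r
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)

counts <- table(f)
counts["Male"]               # count for a single level

# Share of observations in each level (the values sum to 1)
shares <- prop.table(counts)
round(shares, 2)
```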