R Factor

What is a Factor?

In real-world problems, you often encounter data that can be classified into categories.

For example, suppose a survey was conducted of a group of seven individuals, who were asked to identify their hair color and gender.

The result might appear as follows:

Name  Hair color  Gender
Amy   Blonde      Female
Bob   Black       Male
Eve   Black       Female
Kim   Red         Female
Max   Blonde      Male
Ray   Brown       Male
Sam   Black       Male

Here, hair color and gender are examples of categorical data. To store such data, R has a special data structure called a factor.

A factor is a vector whose elements are drawn from a fixed set of categories. The distinct values that the factor can take are called levels.

Create a Factor

In R, you can create a factor with the factor() function.

# Factor storing hair color values
hcolors <- c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black")
f <- factor(hcolors)
f
[1] Blonde Black  Black  Red    Blonde Brown  Black 
Levels: Black Blonde Brown Red

# Factor storing gender values
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male   Female Female Male   Male   Male  
Levels: Female Male

Factor Levels

A factor looks like a vector, but it has special properties. Levels are one of them.

Notice that when you print the factor, R displays the distinct levels below the values. R keeps track of the distinct values in the vector, and each distinct value is a level of the factor.

gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male   Female Female Male   Male   Male  
Levels: Female Male

The levels() function returns all the levels of a factor as a character vector.

gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
levels(f)
[1] "Female" "Male" 
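
Internally, R stores a factor as a vector of integer codes that index into its levels. The sketch below reuses the gender vector from above to show those codes with as.integer() and to count the levels with nlevels(). It is meant as an illustration of the storage model, not as something you normally need to do.

```r
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)

# Number of distinct levels
n <- nlevels(f)

# Integer codes: 1 refers to the first level ("Female"), 2 to the second ("Male")
codes <- as.integer(f)

# Indexing the levels by the codes recovers the original strings
restored <- levels(f)[codes]

print(n)
print(codes)
print(restored)
```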

Specifying Levels Explicitly

If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels.

Consider the following example of a vector consisting of directions:

# Factor with missing level "South"
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
f
[1] North West  North East  North West  East 
Levels: East North West

Notice that the levels of your new factor do not contain the value “South”. So, R thinks that North, West, and East are the only possible levels. However, in practice, it makes sense to have all the possible directions as levels of your factor.

To add all the possible levels explicitly, you specify the levels argument of factor().

# Specifying all the possible levels explicitly
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions,
            levels = c("North", "East", "South", "West"))
f
[1] North West  North East  North West  East 
Levels: North East South West
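
One practical effect of declaring every level shows up when you count observations. As a small sketch (the names f_auto and f_full are chosen here for illustration), compare the tables produced with and without the explicit levels: only the second one reports "South", with a count of zero.

```r
directions <- c("North", "West", "North", "East", "North", "West", "East")

# Levels inferred from the data versus levels declared explicitly
f_auto <- factor(directions)
f_full <- factor(directions, levels = c("North", "East", "South", "West"))

counts_auto <- table(f_auto)  # has no entry for "South"
counts_full <- table(f_full)  # "South" appears with a count of 0

print(counts_auto)
print(counts_full)
```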

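You can also add a level to an existing factor with the levels() function. Note that assigning to levels() renames the existing levels by position, so append the new level to the current ones rather than supplying a reordered list, which would silently relabel your data.

directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
# Append "South" to the existing levels; the data values are unchanged
levels(f) <- c(levels(f), "South")
f
[1] North West  North East  North West  East 
Levels: East North West South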

Factor Labels

R lets you assign abbreviated names for the levels. You can do this by specifying the labels argument of factor().

directions <- c("North", "West", "South", "East", "West", "North", "South")
f <- factor(directions,
            levels = c("North", "East", "South", "West"),
            labels = c("N", "E", "S", "W"))
f
[1] N W S E W N S
Levels: N E S W
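
The labels argument is also handy when the raw data uses numeric codes. In the sketch below, the coding scheme (1 = Female, 2 = Male) and the variable name gender_codes are hypothetical, made up for illustration.

```r
# Hypothetical survey coding: 1 = Female, 2 = Male
gender_codes <- c(1, 2, 1, 1, 2, 2, 2)

# levels lists the raw codes; labels gives the name shown for each code
f <- factor(gender_codes,
            levels = c(1, 2),
            labels = c("Female", "Male"))
print(f)
```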

Ordered Factors

Sometimes the categories in your data have a natural order.

For example, sports analysts use a three-point scale to determine how well a sports team is competing: loss < tie < win.

In market research, it’s very common to use a five-point scale to measure perceptions: strongly disagree < disagree < neutral < agree < strongly agree.

Data that can be placed in order or on a scale like this is known as ordinal data.

In R, ordinal data is stored in a special kind of factor called an ordered factor.

To create an ordered factor, use the factor() function with the argument ordered = TRUE. Unless you specify the levels argument, the levels are sorted alphabetically, which here happens to match the intended order.

# Create ordinal levels
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record, 
            ordered = TRUE)
f
[1] win  tie  loss tie  loss win  win 
Levels: loss < tie < win

You can also reverse the order of the levels by passing them through the rev() function.

# Reverse the order of levels
# (levels(f) refers to the ordered factor f created in the previous example)
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record,
            ordered = TRUE,
            levels = rev(levels(f)))
f
[1] win  tie  loss tie  loss win  win 
Levels: win < tie < loss
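
When alphabetical order does not match the natural order, you need to pass the levels yourself. The following sketch uses hypothetical shirt sizes (the data and the names sizes and size_f are made up for illustration) and shows that comparison operators and max() follow the declared order.

```r
# Hypothetical data; alphabetical sorting would give large < medium < small
sizes <- c("medium", "small", "large", "small", "medium")
size_f <- factor(sizes,
                 levels = c("small", "medium", "large"),
                 ordered = TRUE)

# Comparison operators respect the level order
is_bigger <- size_f > "small"

# max() and min() are defined for ordered factors
largest <- max(size_f)

print(size_f)
print(is_bigger)
print(largest)
```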

Recode Factor Levels

Suppose you have dining experience data with three levels: good, average, and bad, and you want to recode them as happy, neutral, and unhappy. You can do this with the revalue() function from the plyr package, which is not part of base R and must be installed separately.

experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)
f
[1] good    average bad     good    bad     good    average
Levels: average bad good

plyr::revalue(f, c("good"="happy", "average"="neutral", "bad"="unhappy"))
[1] happy   neutral unhappy happy   unhappy happy   neutral
Levels: neutral unhappy happy
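
If you would rather not depend on plyr, base R can do the same recoding. A minimal sketch: assigning a named list to levels() maps each new name (on the left) to the old level it replaces (on the right). Unlike revalue(), the resulting levels follow the order of the list.

```r
experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)

# Each list entry reads: new name = old level
levels(f) <- list(happy = "good", neutral = "average", unhappy = "bad")
print(f)
```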

Drop Unused Factor Levels

If a level has no observations, you can drop it using the droplevels() function.

# Drop unused level "tie"
record <- c("win", "loss", "loss", "win", "loss", "win")
f <- factor(record,
            levels = c("loss", "tie", "win"))

f
[1] win  loss loss win  loss win 
Levels: loss tie win

droplevels(f)
[1] win  loss loss win  loss win 
Levels: loss win
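
Unused levels often appear after subsetting, because a subset of a factor keeps all the levels of the original. As a sketch (the variable name decided is chosen here for illustration), filtering out the ties leaves an empty "tie" level behind until droplevels() removes it.

```r
record <- c("win", "tie", "loss", "win", "loss", "win")
f <- factor(record)

# Keep only the decided games; the "tie" level is still attached
decided <- f[f != "tie"]
levels_before <- levels(decided)

# Remove the now-empty level
decided <- droplevels(decided)
levels_after <- levels(decided)

print(levels_before)
print(levels_after)
```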

Summarizing a Factor

The summary() function gives you a quick overview of a factor: the number of observations in each level.

gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
summary(f)
Female   Male 
     3      4

The table() function also counts the observations in each level, and prints the name of the tabulated variable above the counts.

table(f)
f
Female   Male 
     3      4
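
table() also accepts more than one factor, and prop.table() turns counts into proportions. The sketch below returns to the survey from the start of this article; the names hair_f, gender_f, counts, and shares are chosen here for illustration.

```r
hcolors <- c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black")
gender  <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")

hair_f   <- factor(hcolors)
gender_f <- factor(gender)

# Two-way table: rows are hair colors, columns are genders
counts <- table(hair_f, gender_f)
print(counts)

# Share of each hair color in the whole sample
shares <- prop.table(table(hair_f))
print(shares)
```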