## What is a Factor?

In real-world problems, you often encounter data that can be classified into categories.

For example, suppose a survey was conducted of a group of seven individuals, who were asked to identify their hair color and gender.

The result might appear as follows:

| Name | Hair color | Gender |
| --- | --- | --- |
| Amy | Blonde | Female |
| Bob | Black | Male |
| Eve | Black | Female |
| Kim | Red | Female |
| Max | Blonde | Male |
| Ray | Brown | Male |
| Sam | Black | Male |

Here, hair color and gender are examples of categorical data. To store such categorical data, R has a special data structure called a factor.

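A factor is a vector whose values are drawn from a fixed set of categories. Internally, R stores each value as an integer code that points to one of those categories. The different values that the factor can take are called levels.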
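One way to see what a factor stores is to look at its integer codes next to its levels. This is a minimal sketch; the variable names `hair` and `codes` are illustrative only.

```r
# A small factor built from four hair color values
hair <- factor(c("Blonde", "Black", "Black", "Red"))

# The distinct values, sorted alphabetically by default
levels(hair)

# The integer code stored for each element (its position in levels(hair))
codes <- as.integer(hair)
codes

# Looking the codes up in the levels recovers the original strings
levels(hair)[codes]
```

The codes are what R keeps for each element; the level names are stored once, as an attribute of the factor.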

## Create a Factor

In R, you can create a factor with the `factor()` function.

``````
# Factor storing hair color values
hcolors <- c("Blonde", "Black", "Black", "Red", "Blonde", "Brown", "Black")
f <- factor(hcolors)
f
[1] Blonde Black  Black  Red    Blonde Brown  Black
Levels: Black Blonde Brown Red
``````
``````
# Factor storing gender values
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male   Female Female Male   Male   Male
Levels: Female Male
``````

## Factor Levels

A factor looks like a vector, but it has special properties. Levels are one of them.

Notice that when you print the factor, R displays the distinct levels below the factor. R keeps track of all the possible values in a vector, and each value is called a level of the associated factor.

``````
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
f
[1] Female Male   Female Female Male   Male   Male
Levels: Female Male
``````

The `levels()` function shows all the levels from a factor.

``````
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
levels(f)
[1] "Female" "Male"
``````
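As a small companion to `levels()`, base R's `nlevels()` returns how many levels a factor has. The sketch below reuses the gender data; the name `n` is illustrative.

```r
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)

# Number of distinct levels in the factor
n <- nlevels(f)
n

# levels() returns a plain character vector, so it can be indexed as usual
levels(f)[2]
```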

## Specifying Levels Explicitly

If your vector contains only a subset of all the possible levels, then R will have an incomplete picture of the possible levels.

Consider the following example of a vector consisting of directions:

``````
# Factor with missing level "South"
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
f
[1] North West  North East  North West  East
Levels: East North West
``````

Notice that the levels of your new factor do not contain the value “South”. So, R thinks that North, West, and East are the only possible levels. However, in practice, it makes sense to have all the possible directions as levels of your factor.

To add all the possible levels explicitly, you specify the levels argument of `factor()`.

``````
# Specifying all the possible levels explicitly
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions,
            levels = c("North", "East", "South", "West"))
f
[1] North West  North East  North West  East
Levels: North East South West
``````
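One practical effect of declaring every level is that tabulations report a zero for categories that never occur. The sketch below illustrates this with the same directions data. The names `f_all`, `counts`, and `g`, and the value "Up", are illustrative only.

```r
directions <- c("North", "West", "North", "East", "North", "West", "East")
f_all <- factor(directions,
                levels = c("North", "East", "South", "West"))

# "South" appears in the table with a count of 0
counts <- table(f_all)
counts

# Values that are not among the declared levels become NA
g <- factor(c("North", "Up"),
            levels = c("North", "East", "South", "West"))
is.na(g)
```

The second part is worth keeping in mind: a misspelled value is silently turned into `NA` rather than raising an error.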

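You can also add a level to an existing factor with the `levels()` function. Append the new level to the existing ones. Assigning a completely new vector instead would rename the existing levels by position and silently change the data.

``````
directions <- c("North", "West", "North", "East", "North", "West", "East")
f <- factor(directions)
# Append "South" to the existing levels; the stored values are unchanged
levels(f) <- c(levels(f), "South")
f
[1] North West  North East  North West  East
Levels: East North West South
``````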

## Factor Labels

R lets you assign abbreviated names for the levels. You can do this by specifying the labels argument of `factor()`.

``````
directions <- c("North", "West", "South", "East", "West", "North", "South")
f <- factor(directions,
            levels = c("North", "East", "South", "West"),
            labels = c("N", "E", "S", "W"))
f
[1] N W S E W N S
Levels: N E S W
``````

## Ordered Factors

Sometimes data has some kind of natural order between elements.

For example, sports analysts use a three-point scale to determine how well a sports team is competing: loss < tie < win.

In market research, it’s very common to use a five-point scale to measure perceptions: strongly disagree < disagree < neutral < agree < strongly agree.

Data that can be placed in order or on a scale like this is known as ordinal data.

In R, there is a special data type for ordinal data. This type is called ordered factors.

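To create an ordered factor, use the `factor()` function with the argument `ordered = TRUE`. If you do not supply `levels`, R sorts them alphabetically, which in this example happens to give the intended order loss < tie < win.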

``````
# Create ordinal levels
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record,
            ordered = TRUE)
f
[1] win  tie  loss tie  loss win  win
Levels: loss < tie < win
``````
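Because `factor()` sorts levels alphabetically unless told otherwise, it is often safer to spell out the order with the `levels` argument. The sketch below uses a hypothetical `sizes` vector that is not part of the examples above. It also shows that ordered factors support comparison operators.

```r
# Alphabetical sorting would give large < medium < small, which is wrong here,
# so the intended order is stated explicitly
sizes <- c("medium", "small", "large", "small")
s <- factor(sizes,
            levels = c("small", "medium", "large"),
            ordered = TRUE)

# Comparisons follow the level order, not the alphabet
s[1] > s[2]      # is "medium" greater than "small"?
s < "large"      # element-wise comparison against a level

# max() and min() are defined for ordered factors
largest <- max(s)
largest
```

Comparisons like these raise a warning and return `NA` on an unordered factor, which is one practical reason to mark ordinal data as ordered.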

You can also reverse the order of levels using the `rev()` function.

``````
# Reverse the order of levels
# (f is the ordered factor created in the previous example)
record <- c("win", "tie", "loss", "tie", "loss", "win", "win")
f <- factor(record,
            ordered = TRUE,
            levels = rev(levels(f)))
f
[1] win  tie  loss tie  loss win  win
Levels: win < tie < loss
``````

## Recode Factor Levels

Suppose you have dining experience data with three levels (good, average, and bad) and you want to recode them as happy, neutral, and unhappy. You can do this with the `revalue()` function from the plyr package.

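``````
library(plyr)

experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)
# Map each old level name to its new name
f <- revalue(f, c("good" = "happy", "average" = "neutral", "bad" = "unhappy"))
f

[1] happy   neutral unhappy happy   unhappy happy   neutral
Levels: neutral unhappy happy
``````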
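If you would rather not depend on plyr, the same recoding can be sketched in base R by assigning a named list to `levels()`, where each name is the new level and each value is the old one. This is an alternative to `revalue()`, not what it does internally. Note that the resulting level order follows the list.

```r
experience <- c("good", "average", "bad", "good", "bad", "good", "average")
f <- factor(experience)

# new name = old name; the list order becomes the new level order
levels(f) <- list(happy = "good", neutral = "average", unhappy = "bad")

f
levels(f)
```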

## Drop Unused Factor Levels

If you have no observations in one of the levels, you can drop it using the `droplevels()` function.

``````
# Drop unused level "tie"
record <- c("win", "loss", "loss", "win", "loss", "win")
f <- factor(record,
            levels = c("loss", "tie", "win"))

f
[1] win  loss loss win  loss win
Levels: loss tie win

droplevels(f)
[1] win  loss loss win  loss win
Levels: loss win
``````
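Unused levels most often appear after subsetting, because a subset of a factor keeps all the levels of the original. The sketch below shows this; the name `no_ties` is illustrative.

```r
record <- factor(c("win", "tie", "loss", "win", "loss"))

# Remove the "tie" observations; the "tie" level itself is still there
no_ties <- record[record != "tie"]
levels(no_ties)

# droplevels() removes levels that no longer have any observations
cleaned <- droplevels(no_ties)
levels(cleaned)
```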

## Summarizing a Factor

The `summary()` function will give you a quick overview of the contents of a factor.

``````
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)
summary(f)
Female   Male
     3      4
``````

The function `table()` tabulates observations.

``````
table(f)
f
Female   Male
     3      4
``````
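Counts are often easier to read as proportions. As a minimal sketch, base R's `prop.table()` converts the table of counts into shares of the total. The names `counts` and `shares` are illustrative.

```r
gender <- c("Female", "Male", "Female", "Female", "Male", "Male", "Male")
f <- factor(gender)

# Counts per level
counts <- table(f)

# Share of observations in each level (counts divided by their sum)
shares <- prop.table(counts)
round(shares, 2)
```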