A histogram is a graphical display of continuous data using bars of different heights.
It is similar to a bar graph, except that a histogram groups the data into bins. The height of each bar shows the number of elements in that bin.
Histograms are a great way to display the distribution or variation of data over a range.
The hist() function
In R, you can create a histogram using the hist() function.
It has many options and arguments to control things such as bin size, labels, titles and colors.
The most commonly used arguments of the hist() function are:
|Argument||Description|
|x||A numeric vector of values to plot as a histogram|
|breaks||A single number suggesting the number of bins, or a vector of break points|
|freq||If TRUE, hist() plots counts; if FALSE, it plots probability densities|
|labels||If TRUE, draws labels on top of bars|
|density||The density of shading lines, in lines per inch|
|angle||The slope of shading lines, in degrees|
|col||The color (or a vector of colors) used to fill the bars|
|border||The color to be used for the border of the bars|
|main||An overall title for the plot|
|xlab||The label for the x axis|
|ylab||The label for the y axis|
|…||Other graphical parameters|
Create a Histogram
To get started, you need a set of data to work with.
Let’s use the built-in faithful data set as an example.
Here are the first six observations of the data set.
Example: First six observations of the ‘Faithful’ data set
> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
Faithful data set
The faithful data set contains 272 observations from the Old Faithful Geyser in Yellowstone National Park, Wyoming, USA.
Each observation consists of two measurements, both in minutes: the duration of the eruption (eruptions) and the waiting time until the next eruption (waiting).
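A quick way to check these facts before plotting is to inspect the data frame directly. The sketch below uses only base R functions (nrow(), names() and summary()) on the built-in faithful data frame:

```r
# Confirm the size and column names of the built-in data set
nrow(faithful)    # number of observations
names(faithful)   # column names

# Minimum, quartiles, mean and maximum of the waiting times (minutes)
summary(faithful$waiting)
```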
To create a histogram, just pass the vector to the hist() function.
Example: Create a histogram of time between eruptions of Old Faithful
> hist(faithful$waiting)
Choose the Number of Bins
How well a histogram represents the data depends heavily on the number of bins used to plot it.
Too few bins hide important details about the distribution, while too many bins add a lot of noise and make the overall shape hard to see.
By default, the hist() function chooses the number of bins using Sturges' formula and then rounds the break points to convenient values that cover the range of the data.
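To see what the default does on this data, you can compare the bin count suggested by Sturges' formula (the rule behind the default breaks = "Sturges") with the break points hist() actually uses. This is a sketch that relies on nclass.Sturges() from the grDevices package, which is loaded by default, and on plot = FALSE so that nothing is drawn:

```r
x <- faithful$waiting

# Bin count suggested by Sturges' formula: ceiling(log2(n) + 1)
suggested <- nclass.Sturges(x)

# Compute the histogram without drawing it and inspect the result
h <- hist(x, plot = FALSE)
suggested          # suggested number of bins
h$breaks           # break points R actually chose
length(h$counts)   # resulting number of bins
```

The two numbers need not match, because the break points are rounded to convenient values.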
However, there are a couple of ways to manually set the number of bins.
1. You can tell R the number of bars you want in the histogram by giving a single number as a value to the breaks argument.
Example: Specify the number of bars you want in the histogram
> hist(faithful$waiting,
+      breaks = 20)
Just keep in mind that the number is only a suggestion.
R uses it to compute round, evenly spaced break points (via the pretty() function), so the actual number of bins may differ from the number you asked for.
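One way to check how far the result is from your request is to compute the histogram without drawing it. The sketch below relies on plot = FALSE, which makes hist() return the histogram object instead of plotting it:

```r
# Ask for 20 bins, but do not draw anything
h <- hist(faithful$waiting, breaks = 20, plot = FALSE)

h$breaks           # break points R settled on
length(h$counts)   # number of bins actually used, which may not be 20
```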
2. You can tell R exactly where to put the breaks by giving a vector of break points as the breaks argument. The break points must cover the full range of the data.
Example: Histogram with custom breaks
> hist(faithful$waiting,
+      breaks = c(40,45,55,60,65,70,75,85,90,100))
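Note that the break points in this example are not evenly spaced. In that case hist() plots densities rather than counts by default, so that a wide bin does not look more important just because it collects more observations. A sketch that inspects this on the returned object (the names custom_breaks and bins are only for illustration):

```r
custom_breaks <- c(40, 45, 55, 60, 65, 70, 75, 85, 90, 100)
h <- hist(faithful$waiting, breaks = custom_breaks, plot = FALSE)

h$equidist   # FALSE: the bins have different widths

# Width, raw count and density of each bin, side by side
bins <- data.frame(width   = diff(h$breaks),
                   count   = h$counts,
                   density = round(h$density, 4))
bins
```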
Coloring a Histogram
Use the col argument to change the color used for the bars.
Example: Change the bar color
> hist(faithful$waiting,
+      col="dodgerblue3")
By using the border argument, you can even change the color used for the border of the bars.
Example: Change the color used for the border of the bars
> hist(faithful$waiting,
+      col="lightblue1",
+      border="dodgerblue3")
Create a Hatched Histogram
Creating hatched charts in R is rather easy: just specify the density argument in the hist() function.
By default, the plot is hatched with 45° slanting lines; you can change this with the angle argument.
Example: Create a hatched histogram with 60° slanting lines
> hist(faithful$waiting,
+      col="dodgerblue3",
+      density=25,
+      angle=60)
Adding Titles and Axis Labels
You can add your own title and axis labels by specifying the following arguments.
|main||Main plot title|
|xlab||x-axis label|
|ylab||y-axis label|
Example: Add the title and axis labels to your plot
> hist(faithful$waiting,
+      col="dodgerblue3",
+      main="Time between eruptions of Old Faithful",
+      xlab="Time (minutes)")
Add Value Markers
Often you want to draw attention to specific values or observations in your graphic to provide unique insight. You can do this by adding markers to your graphic.
For example, adding a mean line gives you an idea of how much of the distribution lies above and below the average.
You can add such a marker by using the abline() function.
Example: Add mean line in the histogram
> hist(faithful$waiting,
+      col="lightblue1")
> abline(v=mean(faithful$waiting),
+        col="dodgerblue3",
+        lty=2,
+        lwd=2)
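The same approach works for any summary value. As a sketch, the following marks both the mean and the median and adds a legend to tell them apart; the colors, line types and legend position are arbitrary choices:

```r
x <- faithful$waiting
hist(x, col = "lightblue1")

# Vertical markers at the mean (dashed) and the median (dotted)
abline(v = mean(x),   col = "dodgerblue3", lty = 2, lwd = 2)
abline(v = median(x), col = "firebrick",   lty = 3, lwd = 2)

# Legend explaining the two lines
legend("topleft",
       legend = c("Mean", "Median"),
       col = c("dodgerblue3", "firebrick"),
       lty = c(2, 3), lwd = 2)
```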
Another example is placing values on top of the bars, which makes the graph easier to read accurately.
You can add them by setting the labels argument to TRUE.
Example: Show values on top of each bar in the histogram
> hist(faithful$waiting,
+      col="dodgerblue3",
+      labels=TRUE)
Plotting a Kernel Density Estimate (KDE)
A histogram gives you a rough sense of the density of the underlying distribution of your data.
A smoother way of describing your data is to estimate the probability density function (PDF) of your variable.
You can use the density() function to approximate the sample density and then use the lines() function to draw the approximation.
By default, the hist() function plots the counts in the histogram. By setting the freq argument to FALSE, you can plot the densities instead, which puts the bars on the same scale as the density curve.
Example: Add a kernel density estimate to a histogram
> hist(faithful$waiting,
+      col="lightblue1",
+      freq = FALSE)
> lines(density(faithful$waiting),
+       col="dodgerblue3",
+       lwd=2)
To fill the density plot, use the polygon() function.
Example: Fill the density plot
> hist(faithful$waiting,
+      col="lightblue1",
+      freq = FALSE)
> lines(density(faithful$waiting))
> polygon(density(faithful$waiting),
+         col=rgb(1,0,1,.2))
Instead of setting freq = FALSE, you can achieve the same result by setting the argument prob = TRUE (short for probability = TRUE).
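To make the density scale concrete: bar heights are chosen so that the total area of the bars is 1, which is what lets a density curve be drawn over them. A sketch that checks this numerically, using plot = FALSE to get the histogram object without drawing it (the Riemann sum for the curve is only a rough approximation):

```r
h <- hist(faithful$waiting, plot = FALSE)

# Each bar's area is its height (density) times its width
bar_areas <- h$density * diff(h$breaks)
sum(bar_areas)   # total area under the histogram

# The kernel density estimate also integrates to approximately 1
d <- density(faithful$waiting)
kde_area <- sum(d$y) * diff(d$x[1:2])   # grid spacing is constant
kde_area
```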
Plot Multiple Histograms
Often you want to compare the distributions of different variables within your data.
You can overlay the histograms by setting the add argument of the second histogram to TRUE.
Example: Overlay two histograms
> # random numbers
> h1 <- rnorm(1000,6)
> h2 <- rnorm(1000,4)
> hist(h1,
+      col=rgb(1,0,0,0.5))
> hist(h2,
+      col=rgb(0,0,1,0.5),
+      add=TRUE)
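Because the first call fixes the axis limits and each call picks its own bins, the second histogram can end up clipped or binned differently from the first. One possible refinement, sketched here with a hypothetical shared_breaks vector and a fixed seed for reproducibility, is to compute both histograms on the same break points and set the axis limits explicitly:

```r
set.seed(42)                 # fixed seed, only for reproducibility
h1 <- rnorm(1000, 6)
h2 <- rnorm(1000, 4)

# One set of break points covering both samples
shared_breaks <- seq(floor(min(h1, h2)), ceiling(max(h1, h2)), by = 0.5)

# Compute both histograms first, without drawing them
c1 <- hist(h1, breaks = shared_breaks, plot = FALSE)
c2 <- hist(h2, breaks = shared_breaks, plot = FALSE)

# Draw the first with limits that fit both, then overlay the second
plot(c1, col = rgb(1, 0, 0, 0.5),
     xlim = range(shared_breaks),
     ylim = c(0, max(c1$counts, c2$counts)),
     main = "Two overlaid histograms", xlab = "Value")
plot(c2, col = rgb(0, 0, 1, 0.5), add = TRUE)

legend("topright", legend = c("h1", "h2"),
       fill = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)))
```

Using the same bins for both samples makes bar heights directly comparable, at the cost of choosing the bin width (0.5 here) by hand.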