
Box plot

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Jhguch (talk | contribs) at 18:50, 22 September 2006. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Summary

Figure 1. Box Plot of Data from the Michelson-Morley Experiment.

In descriptive statistics, a boxplot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically depicting the five-number summary, which consists of the smallest non-outlier observation, lower quartile (Q1), median, upper quartile (Q3), and largest non-outlier observation.

Boxplots can show differences between populations without any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate the spread (variance) and skewness of the data, and help identify outliers. Boxplots can be drawn either horizontally or vertically.

Definition

A plain-text version might look like this:

                            +-----+-+    
  *           o     |-------|   + | |---|
                            +-----+-+    
                                         
+---+---+---+---+---+---+---+---+---+---+   number line
0   1   2   3   4   5   6   7   8   9  10

For this data set (values are approximate, based on the figure):

  • smallest observation (outliers excluded, minimum or min) = 5
  • lower (first) quartile (Q1) = 7
  • median (second quartile) (Med) = 8.5
  • upper (third) quartile (Q3) = 9
  • largest observation (outliers excluded, maximum or max) = 10
  • mean = 8
  • interquartile range, IQR = Q3 - Q1 = 2
  • the value 3.5 is a "mild" outlier, between 1.5*(IQR) and 3*(IQR) below Q1
  • the value 0.5 is an "extreme" outlier, more than 3*(IQR) below Q1
  • the smallest value that is not an outlier is 5
  • the data are skewed to the left (negatively skewed)
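
As a check on the outlier thresholds in the list above, the arithmetic can be sketched in a few lines of Python. The variable names and the `classify` helper are illustrative only and do not come from the article; the numbers are the approximate values read off the figure.

```python
# Quartiles from the example above (approximate, read off the figure).
q1, q3 = 7.0, 9.0
iqr = q3 - q1                     # interquartile range: 2.0

# Thresholds below Q1 (all outliers in the example lie on the low side).
inner_fence = q1 - 1.5 * iqr      # 4.0: values below this are outliers
outer_fence = q1 - 3.0 * iqr      # 1.0: values below this are "extreme"

def classify(x):
    """Label a low-side value as 'extreme', 'mild', or 'not an outlier'."""
    if x < outer_fence:
        return "extreme"
    if x < inner_fence:
        return "mild"
    return "not an outlier"

labels = {x: classify(x) for x in (0.5, 3.5, 5.0)}
print(labels)
# {0.5: 'extreme', 3.5: 'mild', 5.0: 'not an outlier'}
```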

The horizontal lines (the "whiskers") extend from each end of the box to the most extreme observed value that lies within 1.5 times the interquartile range (the length of the box) of that end. Each whisker must end at an observed value, so it spans all the values outside the box that are not more than 1.5 times the interquartile range away from the box. Values beyond the whiskers are outliers, and a distance of three times the interquartile range from the box marks the boundary between "mild" and "extreme" outliers.
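
The whisker rule above can be sketched for raw data as follows. This is a minimal sketch, not a reference implementation: it assumes quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method, which can differ slightly from Tukey's hinges in small samples. The function name `tukey_boxplot_stats` and the sample data are hypothetical; the data were chosen only so that the quartiles match the example values.

```python
import statistics

def tukey_boxplot_stats(data, k=1.5):
    """Return quartiles, whisker ends, and outliers under the 1.5*IQR rule."""
    xs = sorted(data)
    # Assumption: "inclusive" quartiles; other quartile conventions exist.
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - k * iqr, q3 + k * iqr          # 1.5 * IQR
    outer_lo, outer_hi = q1 - 2 * k * iqr, q3 + 2 * k * iqr  # 3 * IQR

    # Whiskers end at the most extreme observed values inside the inner fences.
    inside = [x for x in xs if inner_lo <= x <= inner_hi]
    whiskers = (inside[0], inside[-1])

    extreme = [x for x in xs if x < outer_lo or x > outer_hi]
    mild = [x for x in xs
            if (x < inner_lo or x > inner_hi) and outer_lo <= x <= outer_hi]
    return {"q1": q1, "median": med, "q3": q3, "iqr": iqr,
            "whiskers": whiskers, "mild": mild, "extreme": extreme}

# Hypothetical sample with Q1 = 7, median = 8.5, Q3 = 9 under this convention.
sample = [0.5, 3.5, 5, 7, 7.5, 8, 8.5, 8.5, 9, 9, 9.5, 10, 10]
stats = tukey_boxplot_stats(sample)
print(stats["whiskers"], stats["mild"], stats["extreme"])
# (5, 10) [3.5] [0.5]
```

Note that the whiskers stop at 5 and 10, the last observed values inside the fences at 4 and 12, rather than at the fences themselves.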

There are alternative implementations of this detail of the box plot in various software packages, such as the whiskers extending to at most the 5th and 95th (or some more extreme) percentiles. Such approaches do not conform to Tukey's definition, with its emphasis on the median in particular and counting methods in general, and they tend to produce "outliers" for all data sets larger than ten, no matter what the shape of the distribution.
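
To illustrate the point about percentile-based whiskers, the sketch below counts how many points fall outside each kind of whisker for two simulated samples of different shape. The helper names, the sample size of 1000, and the fixed seed are assumptions made for the illustration; quartiles and percentiles use the default method of `statistics.quantiles`.

```python
import random
import statistics

def outside_percentile_whiskers(xs):
    """Count points beyond the 5th and 95th percentiles."""
    cuts = statistics.quantiles(xs, n=20)   # 19 cut points: 5%, 10%, ..., 95%
    lo, hi = cuts[0], cuts[-1]
    return sum(1 for x in xs if x < lo or x > hi)

def outside_tukey_whiskers(xs, k=1.5):
    """Count points more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    return sum(1 for x in xs if x < q1 - k * iqr or x > q3 + k * iqr)

rng = random.Random(0)  # fixed seed so the run is repeatable
uniform_sample = [rng.random() for _ in range(1000)]
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for name, sample in (("uniform", uniform_sample), ("normal", normal_sample)):
    print(name,
          outside_percentile_whiskers(sample),
          outside_tukey_whiskers(sample))
```

By construction about 10% of the points lie outside the percentile whiskers in both samples, whatever the shape. Under the 1.5*IQR rule the uniform sample has no points beyond the whiskers, because its fences lie outside the range of the data, and the normal sample has only a small fraction.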

Visualization

Figure 2. Boxplot and Probability Density Function (pdf) of a Normal N(0, 1σ²) Population

The boxplot is a quick graphical approach for examining one or more sets of data. Boxplots may seem more primitive than a histogram or probability density function (pdf), but they have their benefits. Besides saving space on paper, boxplots are quicker to generate by hand. Histograms and probability density functions also depend on choices or assumptions that the boxplot avoids: the binning technique can heavily influence the appearance of a histogram, and an incorrect variance calculation will heavily affect a fitted probability density function.

Because a statistical distribution is often more intuitive to read than a boxplot, comparing the boxplot with the probability density function (theoretical histogram) of a Normal N(0, 1σ²) distribution may help in understanding the boxplot (Figure 2).
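
The landmarks of such a comparison can be computed directly for a standard normal population. The sketch below uses `statistics.NormalDist` from the Python standard library and assumes the 1.5*IQR whisker rule from the Definition section; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist(mu=0.0, sigma=1.0)   # standard normal population

q1 = z.inv_cdf(0.25)                # lower quartile, about -0.6745 sigma
q3 = z.inv_cdf(0.75)                # upper quartile, about +0.6745 sigma
iqr = q3 - q1                       # about 1.349 sigma

lower_fence = q1 - 1.5 * iqr        # about -2.698 sigma
upper_fence = q3 + 1.5 * iqr        # about +2.698 sigma

# Share of the population expected beyond the whisker limits (both tails).
frac_outside = z.cdf(lower_fence) + (1.0 - z.cdf(upper_fence))

print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
print(f"fences = {lower_fence:.3f}, {upper_fence:.3f}")
print(f"fraction beyond fences = {frac_outside:.4f}")
```

Under these assumptions the box covers the central 50% of the population, and roughly 0.7% of a normal population is expected to fall beyond the whisker limits.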

History

The boxplot was invented in 1977 by American statistician John Tukey.


See also