Box plot: Difference between revisions

From Wikipedia, the free encyclopedia
[[Image:Michelsonmorley-boxplot.svg|thumb|Figure 1. Box plot of data from the [[Michelson-Morley Experiment]]]]

In [[descriptive statistics]], a '''boxplot''' (also known as a '''box-and-whisker diagram''' or '''plot''' or more rarely as a '''candlestick chart''') is a convenient way of graphically depicting groups of numerical data through their [[five-number summary|five-number summaries]] (the smallest observation, lower [[quartile]] (Q1), [[median]], upper [[quartile]] (Q3), and largest observation). A boxplot may also indicate which observations, if any, might be considered [[outlier]]s. The boxplot was invented in 1977 by the American statistician [[John Tukey]].

Boxplots can be useful to display differences between [[statistical population|populations]] without making any assumptions about the underlying [[probability distribution|statistical distribution]]. The spacings between the different parts of the box help indicate the degree of [[statistical dispersion|dispersion]] (spread) and [[skewness]] in the data, and identify [[outlier]]s. Boxplots can be drawn either horizontally or vertically.

== Construction ==

For a [[data set]], one constructs a horizontal box plot in the following manner:
*Calculate the first [[quartile]] (<Math>x_{.25}</Math>), the [[median]] (<Math>x_{.50}</Math>) and third [[quartile]] (<Math>x_{.75}</Math>)
*Calculate the [[interquartile range]] (IQR) by subtracting the first quartile from the third quartile. (<Math>x_{.75}-x_{.25}</Math>)
*Construct a box above the number line bounded on the left by the first quartile (<Math>x_{.25}</Math>) and on the right by the third quartile (<Math>x_{.75}</Math>).
*Indicate where the median lies inside the box with a symbol or with a line dividing the box at the median value.
*The mean value of the data can also be labeled with a point.
*Any observation that lies more than <Math>\scriptstyle 1.5 \cdot\mathrm{IQR}</Math> below the first quartile or more than <Math>\scriptstyle 1.5 \cdot\mathrm{IQR}</Math> above the third quartile is considered an [[outlier]]. Connect the smallest value that is not an [[outlier]] to the box with a horizontal line or "whisker". Optionally, mark the position of this value more clearly with a small vertical line. Likewise, connect the largest value that is not an outlier to the box with a "whisker" (and optionally mark it with another small vertical line).
*Indicate outliers by open and closed dots. "Extreme" outliers, those which lie more than three times the IQR to the left of the first quartile or to the right of the third quartile, are indicated by an open dot. "Mild" outliers, those which lie more than 1.5 times the IQR beyond the first or third quartile but are not extreme outliers, are indicated by a closed dot. (Sometimes no distinction is made between "mild" and "extreme" outliers.)
*Add an appropriate label to the number line and title the boxplot.
*A boxplot may be constructed in a similar manner vertically as opposed to horizontally by merely interchanging "bottom" for "left", "top" for "right" and "vertical" for "horizontal" in the above description.
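
The construction steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the helper names <code>quantile</code> and <code>box_plot_stats</code> and the sample data are made up for this example, and the quartiles are computed by linear interpolation between order statistics, which is only one of several conventions in use.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between order statistics.

    This is an assumed convention; other quartile definitions exist.
    """
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def box_plot_stats(data):
    """Return the quantities needed to draw a box plot of a non-empty data set."""
    values = sorted(data)
    q1, median, q3 = (quantile(values, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1

    # Observations beyond 1.5 * IQR from the box are outliers;
    # those beyond 3 * IQR are "extreme", the rest are "mild".
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    inside = [v for v in values if inner_low <= v <= inner_high]
    outside = [v for v in values if v < inner_low or v > inner_high]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at the most extreme observations that are not outliers.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "mild": [v for v in outside if outer_low <= v <= outer_high],
        "extreme": [v for v in outside if v < outer_low or v > outer_high],
    }


# Made-up sample data, chosen so that the quartiles fall on observed values.
stats = box_plot_stats([-20, 2, 3, 4, 5, 6, 7, 8, 14])
print(stats)
```

For this sample the quartiles are 3, 5 and 7, so the IQR is 4, the whiskers end at 2 and 8, the value 14 is a "mild" outlier and the value -20 is an "extreme" outlier.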

== Example ==

A plain-text version might look like this:

                             +-----+-+
   o           *     |-------|     | |---|
                             +-----+-+
 +---+---+---+---+---+---+---+---+---+---+---+---+   number line
 0   1   2   3   4   5   6   7   8   9  10  11  12

For this [[data set]]:
* smallest non-[[outlier]] observation = 5 (left "whisker"); the left "whisker" would have ended at 4, which is <Math>Q1-</Math><Math>\scriptstyle 1.5\cdot\mathrm{IQR}</Math>, had there been an observation with that value
* lower (first) quartile (<Math>Q1</Math>, <Math>x_{.25}</Math>) = 7
* median (second quartile) (<Math>Med</Math>, <Math>x_{.5}</Math>) = 8.5
* upper (third) quartile (<Math>Q3</Math>, <Math>x_{.75}</Math>) = 9
* largest non-outlier observation = 10 (right "whisker")
* [[interquartile range]], <Math>\mathrm{IQR} = Q3-Q1 = 2</Math>
* the value 3.5 is a "mild" [[outlier]], between <Math>\scriptstyle 1.5 \cdot\mathrm{IQR}</Math> and <Math>\scriptstyle 3\cdot\mathrm{IQR}</Math> below <Math>Q1</Math>
* the value 0.5 is an "extreme" [[outlier]], more than <Math>\scriptstyle 3\cdot\mathrm{IQR}</Math> below <Math>Q1</Math>
* the data is [[skewness|skewed]] to the left (''negatively skewed'')
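
As an illustration, the summary values listed above are enough to redraw the plain-text plot. The following Python sketch assumes four character columns per unit of the number line; the function name <code>render_box_plot</code>, its parameters and the constant <code>SCALE</code> are hypothetical and serve only this example.

```python
SCALE = 4  # character columns per unit of the number line (an assumption)


def render_box_plot(low, q1, median, q3, high, mild=(), extreme=(), axis_max=12):
    """Return the lines of a plain-text horizontal box plot.

    Positions are rounded to the nearest character column.
    """
    def col(x):
        return round(x * SCALE)

    width = axis_max * SCALE + 1
    edge = [" "] * width  # top and bottom edges of the box
    mid = [" "] * width   # whiskers, box walls, median and outliers

    # Box edges, with corners at the quartiles and a mark at the median.
    for c in range(col(q1), col(q3) + 1):
        edge[c] = "-"
    for x in (q1, median, q3):
        edge[col(x)] = "+"
        mid[col(x)] = "|"

    # Whiskers from the box to the extreme non-outlier observations.
    for c in range(col(low) + 1, col(q1)):
        mid[c] = "-"
    for c in range(col(q3) + 1, col(high)):
        mid[c] = "-"
    mid[col(low)] = "|"
    mid[col(high)] = "|"

    # Closed dot (*) for "mild" outliers, open dot (o) for "extreme" ones.
    for x in mild:
        mid[col(x)] = "*"
    for x in extreme:
        mid[col(x)] = "o"

    axis = "+" + ("-" * (SCALE - 1) + "+") * axis_max
    labels = "".join(str(i).ljust(SCALE) for i in range(axis_max + 1)).rstrip()
    edge_line = "".join(edge).rstrip()
    return [edge_line, "".join(mid).rstrip(), edge_line, axis, labels]


# Values taken from the list above.
lines = render_box_plot(5, 7, 8.5, 9, 10, mild=[3.5], extreme=[0.5])
print("\n".join(lines))
```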

The horizontal lines (the "whiskers") extend to at most 1.5 times the box width (the [[interquartile range]]) from either or both ends of the box. They must end at an observed value, thus connecting all the values outside the box that are not more than 1.5 times the box width away from the box. Three times the box width marks the boundary between "mild" and "extreme" outliers. In this boxplot, "mild" and "extreme" outliers are differentiated by closed and open dots, respectively.

There are alternative implementations of this detail of the box plot in various software packages, such as the whiskers extending to at most the 5<sup>th</sup> and 95<sup>th</sup> (or some more extreme) percentiles. Such approaches do not conform to [[John Tukey|Tukey's]] definition, with its emphasis on the median in particular and counting methods in general, and they tend to produce "outliers" for all data sets larger than ten, no matter what the shape of the distribution.<ref>
There are also several minor variations on how to calculate the [[quartile]]s (see also [[Quantile#Estimating the quantiles|quantile]]), and Tukey (1977) originally proposed instead using another variant that he named "hinges". The difference between the definitions is no more than the difference between two consecutive data values, however, so it is always dwarfed by [[sampling error|sampling variability]] and so is of little practical consequence.</ref>
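
The contrast between the two whisker conventions can be illustrated with a short Python sketch. The function names and the evenly spaced sample are assumptions made for this example, and the quantiles again use linear interpolation, one convention among several. The sample has no unusual values, yet the percentile rule still flags its two end points, whereas the <math>\scriptstyle 1.5 \cdot\mathrm{IQR}</math> rule flags nothing.

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation (an assumed convention)."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def tukey_outliers(data):
    """Observations more than 1.5 * IQR beyond the quartiles."""
    values = sorted(data)
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    step = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - step or v > q3 + step]


def percentile_outliers(data, lower=0.05, upper=0.95):
    """Observations beyond whiskers drawn at the given percentiles."""
    values = sorted(data)
    low, high = quantile(values, lower), quantile(values, upper)
    return [v for v in values if v < low or v > high]


# Evenly spaced made-up sample: 1, 2, ..., 21.
sample = list(range(1, 22))
tukey_flagged = tukey_outliers(sample)
percentile_flagged = percentile_outliers(sample)
print("1.5 * IQR rule:", tukey_flagged)
print("5th/95th percentile rule:", percentile_flagged)
```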

== Visualization ==

[[Image:Boxplot vs PDF.png|thumb|Figure 2. Boxplot and Probability Density Function (pdf) of a Normal N(0,1σ<sup>2</sup>) Population]]

The boxplot is a quick graphic for examining one or more sets of data. Boxplots may seem more primitive than a [[histogram]] or [[kernel density estimation|kernel density estimate]] but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). The choice of the [[Histogram#Number of bins and width|number and width of bins]] can heavily influence the appearance of a histogram, and the choice of bandwidth can heavily influence the appearance of a kernel density estimate.

Because a statistical distribution is often more intuitive to read than a boxplot, comparing the boxplot against the probability density function (theoretical histogram) for a normal N(0,1σ<sup>2</sup>) distribution may be a useful tool for understanding the boxplot (Figure 2).
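
The positions of the box and the outlier boundaries for a standard normal population can be worked out numerically. The sketch below uses <code>NormalDist</code> from Python's standard <code>statistics</code> module (Python 3.8 or later); the variable names are chosen for this example only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of the standard normal distribution (about -0.6745 and +0.6745).
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1  # about 1.349 standard deviations

# Boundaries beyond which an observation counts as an outlier
# (about -2.698 and +2.698 standard deviations).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass inside the box and beyond the outlier boundaries.
inside_box = std_normal.cdf(q3) - std_normal.cdf(q1)
beyond_fences = std_normal.cdf(lower_fence) + (1.0 - std_normal.cdf(upper_fence))

print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
print(f"outlier boundaries = {lower_fence:.4f}, {upper_fence:.4f}")
print(f"inside box = {inside_box:.1%}, beyond boundaries = {beyond_fences:.2%}")
```

Under these assumptions the box covers half of the population, and roughly 0.7% of observations from a normal population would be drawn as outliers.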

==See also==

* [[Exploratory data analysis]]

==References==

* John W. Tukey. "''Exploratory Data Analysis''". [[Addison-Wesley]], Reading, MA. 1977.
* Michael Frigge, David C. Hoaglin and Boris Iglewicz. "[http://links.jstor.org/sici?sici=0003-1305%28198902%2943%3A1%3C50%3ASIOTB%3E2.0.CO%3B2-E Some Implementations of the Boxplot]". ''The American Statistician''. Vol. 43 (1), February 1989. 50–54.
* Yoav Benjamini. "[http://links.jstor.org/sici?sici=0003-1305%28198811%2942%3A4%3C257%3AOTBOAB%3E2.0.CO%3B2-%23 Opening the Box of a Boxplot]". ''The American Statistician''. Vol. 42 (4), November 1988. 257–262.
* Peter J. Rousseeuw, Ida Ruts and John W. Tukey. "[http://links.jstor.org/sici?sici=0003-1305%28199911%2953%3A4%3C382%3ATBABB%3E2.0.CO%3B2-K The Bagplot: A Bivariate Boxplot]". ''The American Statistician''. Vol. 53 (4), November 1999. 382–387.

==Notes==
{{reflist}}

==External links==
* [http://www.lcgceurope.com/lcgceurope/data/articlestandard/lcgceurope/132005/152912/article.pdf Visual Presentation of Data by Means of Box Plots] (PDF)
* [http://www.physics.csbsju.edu/stats/box2.html On-line box plot calculator with explanations and examples]
* [http://www.duncanwil.co.uk/boxplot.html Box and Whisker Diagrams: getting Microsoft Excel to plot them for you]
* [http://peltiertech.com/Excel/Charts/BoxWhisker.html Box and Whisker Plots in Microsoft Excel]
* [http://blog.immeria.net/2007/01/box-plot-and-whisker-plots-in-excel.html Box plot and whisker plots in Excel 2007]
* [http://informationandvisualization.de/blog/box-plot Box plot explanation, examples and a javascript/css-based box plot]

[[Category:Statistical charts and diagrams]]
[[Category:Statistics]]

[[de:Boxplot]]
[[es:Diagrama de caja]]
[[fr:Boîte à moustaches]]
[[it:Box-plot]]
[[nl:Boxplot]]
[[ja:箱ひげ図]]
[[pl:Wykres pudełkowy]]
[[sv:Lådagram]]

Revision as of 02:13, 3 April 2008