Talk:Q–Q plot

WikiProject Statistics (Rated B-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as B-Class on the quality scale.
This article has been rated as High-importance on the importance scale.
 
WikiProject Mathematics (Rated B-class, Low-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating: B-Class, Low-importance
Field: Probability and statistics


Expected value of the k-th order statistic

Header added. —Nils von Barth (nbarth) (talk) 04:01, 20 April 2009 (UTC)

OK, remind me: What is the expected value of the k-th order statistic, and why isn't that in the article? Septentrionalis 16:26, 7 March 2006 (UTC)

Good question – AFAICT, the (uniform) k-th order statistic doesn’t have a simple expression, hence why people use estimates. I’ve included this in the article.
—Nils von Barth (nbarth) (talk) 04:01, 20 April 2009 (UTC)
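
For readers who want to see what "using estimates" can look like in practice, here is a minimal sketch. It assumes the case of interest is the standard normal distribution (an assumption, since the reply above does not spell this out), estimates the expected order statistics by simulation, and compares them with Blom's approximation Phi^{-1}((k - 0.375)/(n + 0.25)). The function names are hypothetical and the trial count and seed are arbitrary.

```python
import random
from statistics import NormalDist


def simulated_normal_order_means(n, trials=20000, seed=0):
    """Monte Carlo estimate of E[X_(k)], k = 1..n, for X ~ N(0, 1)."""
    rng = random.Random(seed)
    sums = [0.0] * n
    for _ in range(trials):
        sample = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        for k, x in enumerate(sample):
            sums[k] += x
    return [s / trials for s in sums]


def blom_scores(n):
    """Blom's approximation to the expected normal order statistics."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((k - 0.375) / (n + 0.25))
            for k in range(1, n + 1)]


if __name__ == "__main__":
    n = 10
    pairs = zip(simulated_normal_order_means(n), blom_scores(n))
    for k, (mc, approx) in enumerate(pairs, start=1):
        print(f"k={k:2d}  simulated={mc:+.3f}  Blom={approx:+.3f}")
```

The simulated values and the approximation should agree to roughly two decimal places for moderate n; the simulation is only a rough check, not an exact computation.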

Incomplete

This page is incomplete. It only discusses normal quantile plots; there are also quantile-quantile plots, where the quantiles of two different distributions (e.g. the weights of adult human males and the weights of adult human females) are plotted against each other. I am very new here (this is my first time writing anything!). This would be complicated to edit, with formulas, charts, etc. I have no idea how to do it, but someone should (see, e.g., William S. Cleveland's Visualizing Data). What do I do? Thanks! —The preceding unsigned comment was added by Plf515 (talk • contribs).

Hmm, I am not sure what other kind of QQ plot you're talking about. QQ plots, at least I had thought, were short for quantile-quantile plots. If you plot one data set against another, is it to compare two distributions? Do the data sets have to be the same size? Is there a description on a website, by any chance? I'm certainly happy to help with editing, and am quite curious to add this bit of stat knowledge to my learning. Thanks, --TeaDrinker 02:09, 23 November 2006 (UTC)
Ahh some googling does seem to show what you're talking about. Mathworld seems to discuss it as comparison of two data sets [1]. --TeaDrinker 02:12, 23 November 2006 (UTC)

I added edits at the same time that you did; now I see only yours; will mine show up? Plf515 02:19, 23 November 2006 (UTC)plf515

Curiously, I don't see the edit in the history (you can check the past history by clicking the history tab at the top of the screen). Did you get a note saying either "edit conflict" or "database locked"? The former would have looked like an edit window; the latter would look like the talk page, with red writing across the top saying something about it being a preview. (Database errors are relatively rare, but it looks like the database was momentarily locked a few minutes ago.) --TeaDrinker 02:26, 23 November 2006 (UTC)
I think it got lost. I did see a "database locked" message, but then it went away. Oh well, perhaps when I thought I hit save I hit preview. Here's what I had to say, more or less.

The plots shown in the diagram are very useful (e.g. to check that the residuals from a linear regression are normally distributed), and correct (they look nice, too; what software did you use?). But they are only part of what can be done with these types of plots. As another example, suppose you are interested in comparing the weights of adult human males vs. adult human females. Clearly, males are heavier (on average), but are there differences in the shapes of the distributions? One way to check is to have a QQ plot where one axis has quantiles of the distribution for men, the other for women.

That's a trivial example; one place where some controversy has arisen is in whether the distributions of male and female IQs are equal - they have the same mean and sd, but some claim that male IQs have a different distribution than female IQs, with more in the extreme ends. This isn't the place for a lengthy discussion of that, though; it's just to give you the idea. Plf515 02:34, 23 November 2006 (UTC)plf515

Just working it out in my head, if two data sets come from exactly the same distribution, the quantiles should probably fall roughly on a line. Likewise, if two data sets came from different normal distributions (i.e. different mean and variance), their quantiles should also fall on a line. The same would be true of most distributions which are location-scale families in their parameters. I think, however, that other distributions which dramatically change shape with different parameters (gamma, for instance) would not form a line for most parameter values. Is there a source which makes this explicit? --TeaDrinker 02:59, 23 November 2006 (UTC)
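
The conjecture above can be checked informally with a small simulation. The sketch below is only an illustration under stated assumptions: it uses equal sample sizes so that sorted values can be paired directly, and it uses the Pearson correlation of the paired quantiles as a crude measure of how close the Q-Q points are to a straight line. The parameter values, sample size, seed, and function names are arbitrary choices, not taken from any source.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def qq_correlation(sample_a, sample_b):
    """Correlation of the Q-Q points of two equal-sized samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    return pearson(sorted(sample_a), sorted(sample_b))


def demo(n=5000, seed=1):
    """Return (normal-vs-normal, gamma-vs-gamma) Q-Q correlations."""
    rng = random.Random(seed)
    # Two normals differing only in location and scale.
    normal_a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    normal_b = [rng.gauss(10.0, 3.0) for _ in range(n)]
    # Two gammas whose shape parameter (not just scale) differs.
    gamma_a = [rng.gammavariate(0.5, 1.0) for _ in range(n)]
    gamma_b = [rng.gammavariate(5.0, 1.0) for _ in range(n)]
    return (qq_correlation(normal_a, normal_b),
            qq_correlation(gamma_a, gamma_b))


if __name__ == "__main__":
    r_normal, r_gamma = demo()
    print(f"normal vs normal: r = {r_normal:.4f}")
    print(f"gamma(0.5) vs gamma(5): r = {r_gamma:.4f}")
```

A correlation very close to 1 for the normal pair and a visibly lower one for the gamma pair would be consistent with the location-scale argument, though a correlation is no substitute for looking at the plot itself.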


One source is William S. Cleveland's book Visualizing Data (a worthwhile purchase for anyone who likes statistics). There's also discussion in the R help pages for qqnorm and qqplot. I don't recall distributions like gamma, particularly. It would be pretty easy to try it out and simulate data from gammas with different parameters. My intuition is that you are right. But, for what I am talking about, the issue is not so much which theoretical distribution is best, but how well the two distributions match.

1. Are they different at the median? 2. Are they different at various other quantiles? 3. Are these differences constant across the distribution (e.g. is the difference between a man in the 1st percentile and a woman in the 1st percentile the same as between a man at the median and a woman at the median)? If not, how does the difference change across the distribution?

I gotta go, but will check in over the weekend. Thanks for your warm welcome, happy thanksgiving Plf515 03:13, 23 November 2006 (UTC)plf515
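
The three questions above can be made concrete with a short sketch. It assumes linear interpolation between order statistics as the quantile definition (one of several conventions), which lets the two samples have different sizes. The function names and the default probabilities are hypothetical choices for illustration.

```python
def quantile(sorted_data, p):
    """Quantile of pre-sorted data by linear interpolation at p * (n - 1)."""
    if not sorted_data:
        raise ValueError("data must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    pos = p * (len(sorted_data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    frac = pos - lo
    return sorted_data[lo] * (1 - frac) + sorted_data[hi] * frac


def quantile_differences(sample_a, sample_b,
                         probs=(0.01, 0.25, 0.5, 0.75, 0.99)):
    """Difference quantile(b) - quantile(a) at each probability in probs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return {p: quantile(b, p) - quantile(a, p) for p in probs}


if __name__ == "__main__":
    base = list(range(101))                       # 101 points
    shifted = [5 + 0.5 * i for i in range(201)]   # 201 points, same shape
    stretched = [1.2 * x for x in base]           # same location, wider
    print(quantile_differences(base, shifted))    # roughly constant
    print(quantile_differences(base, stretched))  # grows with p
```

A roughly constant difference corresponds to a Q-Q line parallel to the diagonal (a pure shift); a difference that changes with the probability corresponds to a different slope or a curve.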

Plotting positions

I have added a section about plotting positions today, with some references. DFH 18:43, 29 January 2007 (UTC)


I made an edit on 2 January 2009 to remove the list of plotting position formulas and added a new reference. This change was undone by Melcombe with a note "important to have this even if someone else makes other choice". The issue is not about "choice". The referenced recent article (Makkonen, L. Bringing closure to the plotting position controversy. Communications in Statistics - Theory and Methods 37, 460-467) resolves the century-old controversy on the plotting positions by showing that the whole concept of choosing a plotting position is a misunderstanding. There is only one correct plotting position formula and the others are incorrect. One should stop referring to the obsolete formulas, as their use results in serious errors at the tails of the distribution (Makkonen, L. Problems in the extreme value analysis. Structural Safety, 30, 405-419) and gives a misleading impression that there is a choice. (LJM)
Surely the issue is about choice: Makkonen picked one to declare the only correct one. This extraordinary claim of "resolving" the issue has been rebutted in two thorough statistical papers (Cook, NJ. Comments on "Plotting positions in extreme value analysis" (The role of sampling error in extreme value analysis). J. Appl. Meteorol. Climatol. 2011;50:255–66, as well as Cook, NJ. Rebuttal of "Problems in the extreme value analysis". Structural Safety 2012;34:418–423). To quote: 'Seeking to overturn 60 years of theoretical and practical development from two generations of eminent statisticians is a bold venture, and one would think it would have been accompanied by a rigorous and comprehensive proof. But the claim [...] rests solely on untested assertions that the Weibull estimator is "unique" and is "exact". A review by Cook [...] used rigorous statistical proofs to show this is not true.' The plotting position according to Weibull is actually the expectation of the mean frequency. This is not exact, but has a variance of

m*(N-m+1)/(N+1)^2/(N+2).

Makkonen ignores sampling error that would only vanish in case of considering infinitely many samples (while in practice we deal with only one). Cook argues convincingly that the other choices, rather than being obsolete, are actually more correct for many purposes. For example, taking Makkonen's own example of wind speed data, evaluated for the design reduced variate (at P=0.98 probability), shows the Weibull estimator with the worst performance (10% bias) among the alternatives.
(Zoltan A. Fekete (Zoli))
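
For readers following the argument, the variance expression quoted above is the variance of the m-th order statistic of N independent uniform(0, 1) values, whose mean is the Weibull position m/(N+1). The sketch below checks both by simulation and also shows the common one-parameter family (m - a)/(N + 1 - 2a), where a = 0 gives the Weibull formula; the other values of a (for example a = 0.5) are included only as illustrations and are not drawn from the papers cited in this thread. Function names, trial count, and seed are hypothetical.

```python
import random


def plotting_positions(n, a=0.0):
    """Positions (m - a) / (n + 1 - 2a), m = 1..n; a = 0 gives m / (n + 1)."""
    return [(m - a) / (n + 1 - 2 * a) for m in range(1, n + 1)]


def order_statistic_variance(m, n):
    """Variance m (n - m + 1) / ((n + 1)^2 (n + 2)) quoted in the thread."""
    return m * (n - m + 1) / ((n + 1) ** 2 * (n + 2))


def simulate_uniform_order_statistic(m, n, trials=20000, seed=0):
    """Sample mean and variance of the m-th smallest of n uniform draws."""
    rng = random.Random(seed)
    values = [sorted(rng.random() for _ in range(n))[m - 1]
              for _ in range(trials)]
    mean = sum(values) / trials
    var = sum((v - mean) ** 2 for v in values) / (trials - 1)
    return mean, var


if __name__ == "__main__":
    m, n = 3, 10
    mean, var = simulate_uniform_order_statistic(m, n)
    print(f"simulated mean {mean:.4f} vs m/(N+1) = {m / (n + 1):.4f}")
    print(f"simulated var  {var:.5f} vs formula = "
          f"{order_statistic_variance(m, n):.5f}")
    print("a = 0.0:", [round(p, 3) for p in plotting_positions(n)])
    print("a = 0.5:", [round(p, 3) for p in plotting_positions(n, a=0.5)])
```

The simulation only confirms the mean and variance of the uniform order statistic; it does not by itself settle which plotting position is preferable for a given purpose.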

The unique plotting position proven

That the Weibull expression m/(N+1) is the only correct choice for the plotting position has now been mathematically proven from the first principles of probability theory. Please read Makkonen, L., Pajari, M. and Tikanmäki, M. (2013). Closure to "Problems in the extreme value analysis". Structural Safety 40, 65-70.

(LasseMakkonen (talk) 19:37, 4 January 2013 (UTC))

A fundamental confusion

A Q-Q plot is a (nonparametric) technique for comparing two batches of data (I deliberately avoided the technical term "sample"). It is far more informative than comparing data moments computed from two batches, and makes no assumptions about the underlying statistical populations.

The entry mistakenly asserts that a Q-Q plot compares a batch of data with a hypothetical probability density function (pdf). This is not a Q-Q plot but a rankit, a scatter plot of the data against the expected values of the order statistics for the hypothetical pdf. The maintained hypothesis is that the data are a random sample from a population whose probability law is the pdf in question. The most common form of rankit is the normal probability plot, whose interpretation should be taught in every first course in statistics. Be that as it may, the normal probability plot and related techniques should not, in my view, be part of this entry. If what I write here is correct, this entry will have to be completely rewritten.

I concur with those of you above who cite favorably William Cleveland's Visualizing Data. 132.181.160.42 (talk) 22:49, 13 December 2007 (UTC)

You are right; after all, Q-Q plot means Quantile-Quantile plot. There is a web page from the NIST/SEMATECH e-Handbook of Statistical Methods that explains it well. This Handbook is an update of the National Bureau of Standards Handbook 91, Experimental Statistics.[2] NIST can be considered a reliable and authoritative source. The text is not covered by copyright,[3] which means that we can reuse it for Wikipedia – provided that we properly acknowledge this, for example like Template:1911: This article incorporates text from ....  --Lambiam 07:35, 14 December 2007 (UTC)
Please feel free to use Template:NIST-PD for this purpose. Btyner (talk) 16:23, 16 February 2008 (UTC)

Hello, this article does not need to be rewritten completely. A Q-Q plot is both: I. comparing data with a distribution, and II. comparing datasets (even with different numbers of observations). If any handbooks make further distinctions, those handbooks have nothing to do with statistical practice. And one last note: the quantiles can come from a theoretical distribution or from observed data! —Preceding unsigned comment added by 84.57.60.225 (talk) 10:00, 11 February 2008 (UTC)

As I've been taught, Q-Q plots are divided into two groups: normal Q-Q plots and general Q-Q plots. The former are based on a Gaussian distribution while the latter use two somehow related distributions. This distinction is also made in the Geostatistical Analyst Tool in ArcGIS from ESRI. Circushead (talk) 19:47, 3 July 2008 (UTC)

The way I think about Q-Q plots is that they are a tool to visually compare two distributions. Does it really matter that much if one of the two is based on theory rather than quantiles that were derived from data? I agree that these two possibilities could be described in more detail in the article, but I don't think a complete rewrite is necessary. 195.176.238.195 (talk) 07:24, 25 July 2008 (UTC)

Let's look at part of the Quantile-Quantile Plot page from the NIST/SEMATECH e-Handbook of Statistical Methods referenced above:

The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.

A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. By a quantile, we mean the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%) quantile is the point at which 30% percent of the data fall below and 70% fall above that value.

A 45-degree reference line is also plotted. If the two sets come from a population with the same distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions.

Note that, unlike the current Wikipedia article, "non-normal" or "given" distributions are not mentioned. The two data sets in a Q-Q plot are peers, and in no necessary relation to a known distribution, normal or otherwise. Yet the similarity of the underlying distributions may still be compared with a Q-Q plot. The current article presumes a particular use that is not implied by the definition found in the NIST reference. My vote is a rewrite. Metaxis (talk) 21:36, 15 December 2008 (UTC)

Thanks for bringing this up – the terminology is very confused.
I’ve now carefully distinguished P-P plot, Q-Q plot, and probability plot, which is sometimes used as a general term, and sometimes used specifically to mean Q-Q plot (or another plot).
As reflected in the references, Q-Q plot is used generally to simply mean “plotting two sets of quantiles against each other”. The quantiles can come (non-parametrically) from two sample sets, or (parametrically) from a sample set and a theoretical distribution, in which case the plot can also be used to estimate parameters. Both applications are widely referred to as Q-Q plots, while rankit is used narrowly to mean either a normal probability plot, or the quantiles of the normal distribution that are used.
I’ve heavily revised all these pages, as I found them very confusing; hopefully they’re clearer now (and better referenced)!
—Nils von Barth (nbarth) (talk) 04:11, 20 April 2009 (UTC)
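
As an illustration of the parametric use mentioned above, the sketch below builds the points of a normal Q-Q plot for one sample and fits a least-squares line through them; under a normal model the slope roughly estimates the standard deviation and the intercept roughly estimates the mean. It assumes the plotting position (k - 0.5)/n, which is only one of several conventions, and the function names, sample parameters, and seed are hypothetical.

```python
import random
from statistics import NormalDist


def normal_qq_points(sample):
    """Return (theoretical N(0, 1) quantiles, sorted sample values)."""
    n = len(sample)
    std_normal = NormalDist()
    # Plotting position (k - 0.5) / n is an assumed convention here.
    theoretical = [std_normal.inv_cdf((k - 0.5) / n)
                   for k in range(1, n + 1)]
    return theoretical, sorted(sample)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def estimate_normal_parameters(sample):
    """Rough (mean, sd) estimates read off the normal Q-Q line."""
    slope, intercept = fit_line(*normal_qq_points(sample))
    return intercept, slope


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(50.0, 8.0) for _ in range(5000)]
    mean_hat, sd_hat = estimate_normal_parameters(data)
    print(f"estimated mean {mean_hat:.2f}, estimated sd {sd_hat:.2f}")
```

Replacing the theoretical quantiles with the sorted values of a second sample of the same size turns the same points into a two-sample Q-Q plot, which is the non-parametric use described above.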

Gallery

I think that adding a gallery of normal Q-Q plots of all major distributions (at least gamma, Cauchy, beta, possibly more) would be a good idea. I haven't found such a resource on the Internet, and it would be a very good resource for finding which distribution fits the data best. Tomato86 (talk) 15:12, 26 September 2010 (UTC)