Talk:Q–Q plot
This is the talk page for discussing improvements to the Q–Q plot article. This is not a forum for general discussion of the article's subject.
This article is rated B-class on Wikipedia's content assessment scale.
Expected value of the k-th order statistic
- Header added. —Nils von Barth (nbarth) (talk) 04:01, 20 April 2009 (UTC)
OK, remind me: What is the expected value of the k-th order statistic, and why isn't that in the article? Septentrionalis 16:26, 7 March 2006 (UTC)
- Good question – AFAICT, the (uniform) k-th order statistic doesn’t have a simple expression, hence why people use estimates. I’ve included this in the article.
- —Nils von Barth (nbarth) (talk) 04:01, 20 April 2009 (UTC)
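A hedged note on the quantities in this thread, with a small sketch. For n i.i.d. uniform(0, 1) draws, the k-th order statistic is Beta(k, n+1-k) distributed, so its mean is k/(n+1). It is the median of that statistic, and the mean of the k-th order statistic of a normal sample, that lack a simple closed form, which is where approximations come in. The function names, the trial count, and the use of Blom's approximation below are illustrative choices, not anything taken from the article.

```python
import random
from statistics import NormalDist


def uniform_order_stat_mean(k, n):
    """Exact mean of the k-th smallest of n i.i.d. uniform(0, 1) draws."""
    return k / (n + 1)


def blom_normal_score(k, n):
    """Blom's approximation to the mean of the k-th standard normal order statistic."""
    return NormalDist().inv_cdf((k - 0.375) / (n + 0.25))


def simulated_order_stat_mean(k, n, draw, trials=20000, seed=0):
    """Monte Carlo estimate of E[X_(k)]; draw(rng) must return one observation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(draw(rng) for _ in range(n))
        total += sample[k - 1]  # k is 1-indexed
    return total / trials


if __name__ == "__main__":
    k, n = 3, 10
    # Uniform case: closed form versus simulation.
    print(uniform_order_stat_mean(k, n),
          simulated_order_stat_mean(k, n, lambda rng: rng.random()))
    # Normal case: approximation versus simulation.
    print(blom_normal_score(k, n),
          simulated_order_stat_mean(k, n, lambda rng: rng.gauss(0.0, 1.0)))
```

In both cases the simulated value should land close to the formula; the normal case agrees only approximately because Blom's expression is itself an approximation.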
Incomplete
This page is incomplete. It only discusses quantile normal plots; there are also quantile-quantile plots, where the quantiles of two different distributions (e.g. the weights of adult human males and the weights of adult human females) are plotted against each other. I am very new here (this is my first time writing anything!). This would be complicated to edit, with formulas, charts, etc. I have no idea how to do it, but someone should (see, e.g., William S. Cleveland's Visualizing Data). What do I do? Thanks! —The preceding unsigned comment was added by Plf515 (talk • contribs).
- Hmm, I am not sure what other kind of QQ plot you're talking about. QQ plots, at least I had thought, were short for quantile-quantile plots. If you plot one data set against another, is it to compare two distributions? Do the data sets have to be the same size? Is there a description on a website, by any chance? I'm certainly happy to help with editing, and am quite curious to add this bit of stat knowledge to my learning. Thanks, --TeaDrinker 02:09, 23 November 2006 (UTC)
- Ahh some googling does seem to show what you're talking about. Mathworld seems to discuss it as comparison of two data sets [1]. --TeaDrinker 02:12, 23 November 2006 (UTC)
I added edits at the same time that you did; now I see only yours; will mine show up? Plf515 02:19, 23 November 2006 (UTC)plf515
- Curiously, I don't see the edit in the history (you can check the past history by clicking the history tab at the top of the screen). Did you get a note saying either "edit conflict" or "database locked"? The former would have looked like an edit window, the latter would look like the talk page, with red writing across the top saying something about it being a preview. (Database errors are relatively rare, but it looks like the database was momentarily locked a few minutes ago). --TeaDrinker 02:26, 23 November 2006 (UTC)
- I think it got lost. I did see a database locked, but then it went away. Oh well, perhaps when I thought I hit save I hit preview. Here's what I had to say, more or less.
The plots shown in the diagram are very useful (e.g. to check that the residuals from a linear regression are normally distributed), and correct (they look nice, too; what software did you use?). But they are only part of what can be done with these types of plots. As another example, suppose you are interested in comparing the weights of adult human males vs. adult human females. Clearly, males are heavier (on average), but are there differences in the shapes of the distributions? One way to check is to have a QQ plot where one axis has quantiles of the distribution for men, the other for women.
That's a trivial example; one place where some controversy has arisen is in whether the distributions of male and female IQs are equal - they have the same mean and sd, but some claim that male IQs have a different distribution than females, with more in the extreme ends. This isn't the place for a lengthy discussion of that, though; it's just to give you the idea. Plf515 02:34, 23 November 2006 (UTC)plf515
- Just working it out in my head, if two data sets come from exactly the same distribution, the quantiles should probably fall roughly in a line. Likewise, if two data sets came from different normal distributions (i.e. different mean and variance), they should both fall in a line. The same would be true of most distributions which are location-scale families in their parameters. I think, however, that other distributions which dramatically change shape with different parameters (gamma for instance), would not form a line for most parameter values. Is there a source which makes this explicit? --TeaDrinker 02:59, 23 November 2006 (UTC)
- One source is William S. Cleveland's book Visualizing Data (a worthwhile purchase for anyone who likes statistics). There's also discussion in the R help pages for qqnorm and qqplot. I don't recall distributions like gamma, particularly. It would be pretty easy to try it out and simulate data from gammas with different parameters. My intuition is that you are right. But, for what I am talking about, the issue is not so much which theoretical distribution is best, but how well the two distributions match.
1. Are they different at the median? 2. Are they different at various other quantiles? 3. Are these differences constant across the distribution (e.g. Is the difference between a man in the 1st percentile and a woman in the 1st percentile the same as between a man at the median and a woman at the median?) If not, how does the difference change across the distribution?
I gotta go, but will check in over the weekend. Thanks for your warm welcome, happy thanksgiving Plf515 03:13, 23 November 2006 (UTC)plf515
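The two-sample comparison discussed in this thread can be sketched roughly as follows. This is a minimal illustration, assuming linearly interpolated empirical quantiles on a common probability grid so that the two data sets need not be the same size; the helper names and the simulated weight figures are hypothetical and are not taken from Cleveland's book or from R's qqplot.

```python
import random


def empirical_quantile(sorted_data, p):
    """Linearly interpolated quantile of already-sorted data at probability p in [0, 1]."""
    pos = p * (len(sorted_data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_data) - 1)
    frac = pos - lo
    return sorted_data[lo] * (1 - frac) + sorted_data[hi] * frac


def qq_pairs(x, y, num_points=None):
    """Return (quantile of x, quantile of y) pairs on a shared probability grid."""
    xs, ys = sorted(x), sorted(y)
    m = num_points or min(len(xs), len(ys))
    probs = [(i + 0.5) / m for i in range(m)]
    return [(empirical_quantile(xs, p), empirical_quantile(ys, p)) for p in probs]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical weights in kg; the two samples deliberately differ in size.
    men = [rng.gauss(80.0, 12.0) for _ in range(500)]
    women = [rng.gauss(68.0, 10.0) for _ in range(350)]
    pairs = qq_pairs(men, women, num_points=99)
    # Difference between the groups at the 1st, 50th and 99th grid points.
    for idx in (0, 49, 98):
        q_men, q_women = pairs[idx]
        print(idx + 1, round(q_men - q_women, 2))
```

Plotting the pairs gives the Q-Q plot. If one data set is an exact affine transform of the other, the points fall exactly on a straight line, which is the location-scale case raised above; whether the quantile differences stay constant across the grid addresses question 3.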
Plotting positions
I have added a section about plotting positions today, with some references. DFH 18:43, 29 January 2007 (UTC)
I made an edit on 2 January 2009 to remove the list of plotting position formulas and added a new reference. This change was undone by Melcombe with a note "important to have this even if someone else makes other choice". The issue is not about "choice". The referenced recent article (Makkonen, L. Bringing closure to the plotting position controversy. Communications in Statistics - Theory and Methods 37, 460-467) resolves the century-old controversy on the plotting positions by showing that the whole concept of choosing a plotting position is a misunderstanding. There is only one correct plotting position formula and the others are incorrect. One should stop referring to the obsolete formulas, as their use results in serious errors at the tails of the distribution (Makkonen, L. Problems in the extreme value analysis. Structural Safety, 30, 405-419) and gives a misleading impression that there is a choice. (LJM)
Surely the issue is about choice: Makkonen picked one to declare the only correct one. This extraordinary claim of "resolving" the issue has been rebutted in two thorough statistical papers (Cook, NJ. Comments on "Plotting positions in extreme value analysis" (The role of sampling error in extreme value analysis). J. Appl. Meteorol. Climatol. 2011;50:255–66, as well as Cook, NJ. Rebuttal of "Problems in the extreme value analysis". Structural Safety 2012;34:418–423). To quote:
'Seeking to overturn 60 years of theoretical and practical development from two generations of eminent statisticians is a bold venture, and one would think it would have been accompanied by a rigorous and comprehensive proof. But the claim [...] rests solely on untested assertions that the Weibull estimator is "unique" and is "exact". A review by Cook [...] used rigorous statistical proofs to show this is not true.'
The plotting position according to Weibull is actually the expectation of the mean frequency. This is not exact, but has a variance of
m*(N-m+1)/(N+1)^2/(N+2).
Makkonen ignores sampling error that would only vanish in case of considering infinitely many samples (while in practice we deal with only one). Cook argues convincingly that the other choices, rather than being obsolete, are actually more correct for many purposes. For example, taking Makkonen's own example of wind speed data, evaluated for the design reduced variate (at P=0.98 probability), shows the Weibull estimator with the worst performance (10% bias) among the alternatives.
(Zoltan A. Fekete (Zoli))
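For readers following the numbers in this thread, here is a small sketch, under stated assumptions, of the quantity being argued about. The cumulative probability at the m-th smallest of N observations is Beta(m, N-m+1) distributed, so its mean is the Weibull position m/(N+1) and its variance is the expression quoted above. The Hazen, Blom and Gringorten formulas are included as commonly cited alternatives; the function names and the simulation size are illustrative, and the code takes no side in the dispute.

```python
import random
from statistics import fmean, pvariance


def weibull_position(m, n):
    return m / (n + 1)


def hazen_position(m, n):
    return (m - 0.5) / n


def blom_position(m, n):
    return (m - 0.375) / (n + 0.25)


def gringorten_position(m, n):
    return (m - 0.44) / (n + 0.12)


def rank_probability_variance(m, n):
    """Variance of F(X_(m)), i.e. of a Beta(m, n - m + 1) variable."""
    return m * (n - m + 1) / ((n + 1) ** 2 * (n + 2))


def simulate_rank_probability(m, n, trials=20000, seed=0):
    """Simulate F(X_(m)) via uniform draws; returns (mean, variance)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        values.append(sample[m - 1])  # m is 1-indexed
    return fmean(values), pvariance(values)


if __name__ == "__main__":
    m, n = 1, 9
    print(weibull_position(m, n), rank_probability_variance(m, n))
    print(simulate_rank_probability(m, n))
    print(hazen_position(m, n), blom_position(m, n), gringorten_position(m, n))
```

The simulated mean and variance should sit close to m/(N+1) and the quoted variance; the spread is what the sampling-error argument above refers to.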
The unique plotting position proven
That the Weibull expression m/(N+1) is the only correct choice for the plotting position has now been mathematically proven from the first principles of probability theory. Please read Makkonen, L., Pajari, M. and Tikanmäki, M. (2013). Closure to "Problems in the extreme value analysis". Structural Safety 40, 65-70.
(LasseMakkonen (talk) 19:37, 4 January 2013 (UTC))
A fundamental confusion
A Q-Q plot is a (nonparametric) technique for comparing two batches of data (I deliberately avoided the technical term "sample"). It is far more informative than comparing data moments computed from two batches, and makes no assumptions about the underlying statistical populations.
The entry mistakenly asserts that a Q-Q plot compares a batch of data with a hypothetical probability density function (pdf). This is not a Q-Q plot but a rankit, a scatter plot of the data against the expected values of the order statistics for the hypothetical pdf. The maintained hypothesis is that the data are a random sample from a population whose probability law is the pdf in question. The most common form of rankit is the normal probability plot, whose interpretation should be taught in every first course in statistics. Be that as it may, the normal probability plot and related techniques should not, in my view, be part of this entry. If what I write here is correct, this entry will have to be completely rewritten.
I concur with those of you above who cite favorably William Cleveland's Visualizing Data. 132.181.160.42 (talk) 22:49, 13 December 2007 (UTC)
- You are right; after all, Q-Q plot means Quantile-Quantile plot. There is a web page from the NIST/SEMATECH e-Handbook of Statistical Methods that explains it well. This Handbook is an update of the National Bureau of Standards Handbook 91, Experimental Statistics.[2] NIST can be considered a reliable and authoritative source. The text is not covered by copyright,[3] which means that we can reuse it for Wikipedia – provided that we properly acknowledge this, for example like Template:1911: This article incorporates text from .... --Lambiam 07:35, 14 December 2007 (UTC)
- Please feel free to use Template:NIST-PD for this purpose. Btyner (talk) 16:23, 16 February 2008 (UTC)
Hello, this article does not need to be rewritten completely. A Q-Q plot is both: I. comparing data with a distribution and II. comparing datasets (even with different numbers of observations). If any handbooks make further distinctions, these handbooks have nothing to do with statistical practice. And one last note: the quantiles can come from a theoretical distribution or from observed data!! —Preceding unsigned comment added by 84.57.60.225 (talk) 10:00, 11 February 2008 (UTC)
- As I've been taught, Q-Q plots are divided into two groups: normal Q-Q plots and general Q-Q plots. The first are based on a Gaussian distribution while the latter use two somehow related distributions. This distinction is also made in the Geostatistical Analyst Tool in ArcGIS from ESRI. Circushead (talk) 19:47, 3 July 2008 (UTC)
The way I think about Q-Q plots is that they are a tool to visually compare two distributions. Does it really matter that much if one of the two is based on theory rather than quantiles that were derived from data? I agree that these two possibilities could be described in more detail in the article, but I don't think a complete rewrite is necessary. 195.176.238.195 (talk) 07:24, 25 July 2008 (UTC)
Let's look at part of the Quantile-Quantile Plot page from the NIST/SEMATECH e-Handbook of Statistical Methods referenced above:
The quantile-quantile (q-q) plot is a graphical technique for determining if two data sets come from populations with a common distribution.
A q-q plot is a plot of the quantiles of the first data set against the quantiles of the second data set. By a quantile, we mean the fraction (or percent) of points below the given value. That is, the 0.3 (or 30%) quantile is the point at which 30% percent of the data fall below and 70% fall above that value.
A 45-degree reference line is also plotted. If the two sets come from a population with the same distribution, the points should fall approximately along this reference line. The greater the departure from this reference line, the greater the evidence for the conclusion that the two data sets have come from populations with different distributions.
Note that, unlike the current Wikipedia article, "non-normal" or "given" distributions are not mentioned. The two data sets in a Q-Q plot are peers, and in no necessary relation to a known distribution, normal or otherwise. Yet the similarity of the underlying distributions may still be compared with a Q-Q plot. The current article presumes a particular use that is not implied by the definition found in the NIST reference. My vote is a rewrite. Metaxis (talk) 21:36, 15 December 2008 (UTC)
- Thanks for bringing this up – the terminology is very confused.
- I’ve now carefully distinguished P-P plot, Q-Q plot, and probability plot, which is sometimes used as a general term, and sometimes used specifically to mean Q-Q plot (or another plot).
- As reflected in the references, Q-Q plot is used generally to simply mean “plotting two quantiles against each other”, which can come (non-parametrically) from two sample sets, or parametrically comparing a sample set against a theoretical distribution and using it to estimate parameters – both applications are widely referred to as Q-Q plots, while rankit is used narrowly to mean either a normal probability plot, or the quantiles of the normal distribution that are used.
- I’ve heavily revised all these pages, as I found them very confusing; hopefully they’re clearer now (and better referenced)!
- —Nils von Barth (nbarth) (talk) 04:11, 20 April 2009 (UTC)
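To make the parametric use described above concrete, here is a minimal sketch of a sample-versus-theoretical Q-Q computation, assuming a normal reference distribution and Hazen-style positions (i - 0.5)/n. Both are illustrative choices (the plotting position is itself debated elsewhere on this page), and the function names are hypothetical.

```python
from statistics import NormalDist, fmean


def normal_qq_pairs(data, dist=None):
    """Pair each ordered observation with a theoretical quantile of dist."""
    dist = dist or NormalDist()  # standard normal unless told otherwise
    n = len(data)
    ordered = sorted(data)
    theoretical = [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def fit_line(pairs):
    """Ordinary least-squares line through the Q-Q points: (intercept, slope)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


if __name__ == "__main__":
    # Idealised data: exact quantiles of a normal with mean 10 and sd 2.
    source = NormalDist(10.0, 2.0)
    data = [source.inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
    intercept, slope = fit_line(normal_qq_pairs(data))
    print(intercept, slope)
```

Against the standard normal, the intercept and slope of the fitted line act as rough location and scale estimates; against the correct distribution the points fall on the 45-degree reference line described in the NIST text quoted above.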
Gallery
I think that adding a gallery of normal Q-Q plots of all major distributions (at least gamma, Cauchy, beta, possibly more) would be a good idea. I haven't found such a resource on the Internet. And it would be a very good resource for finding which distribution fits the data best. Tomato86 (talk) 15:12, 26 September 2010 (UTC)
Bug in caption?!
I think the existing caption is incorrect:
"that the July distribution is skewed to the left compared to the March distribution"
Instead, it should say "to the right" --- add the intuition that "July has warmer temperatures on average." — Preceding unsigned comment added by Gforman44 (talk • contribs) 16:59, 12 September 2018 (UTC)
Opening/closing dates of SR 20
The graph of opening/closing dates of SR 20 doesn't look like a Q-Q plot. The horizontal axis is indeed a quantile scale, but the vertical axis has units of days, so I don't think this is a Q-Q plot. --Macrakis (talk) 20:02, 1 April 2024 (UTC)
@Nbarth:, I believe you added this on 2009-04-19. Comments? --Macrakis (talk) 20:07, 1 April 2024 (UTC)
- @Macrakis It’s been 15 years, so I’d need to think to see what was going on; I was probably rearranging existing content. From a quick look, this is a Q–Q plot comparing against the theoretical normal distribution (vertical axis is the difference from mean/median, think standard deviations; horizontal axis is quantile). See the last paragraph in this section, beginning “Another common use of Q–Q plots is to compare the distribution of a sample to a theoretical distribution ”. Does this help? The caption certainly needs clarification! —Nils von Barth (nbarth) (talk) 21:30, 1 April 2024 (UTC)
- @Nbarth: Maybe I'm misunderstanding, but the vertical axis seems to be in absolute units (days), not in distributional units (quantiles). --Macrakis (talk) 21:37, 1 April 2024 (UTC)
- Even if it were in SDs, that doesn't reflect the empirical distribution. --Macrakis (talk) 21:39, 1 April 2024 (UTC)
- @Macrakis You’re definitely right that it’s not a Q–Q plot on absolute axes! OTOH, it certainly looks like some kind of normal probability plot – it would be a straight line if the distribution were normal. I think it’s a rankit plot – the horizontal values are quantiles of the assumed normal distribution (the graphs and descriptions seem to correspond). In any case, this is complicated and confusing (and graphs two series too!), and shouldn't be the primary example! I’ll move the example to the rankit page, fix the image description and caption, and just have a link from Q–Q plot to Rankit, giving that as a more complicated example. WDYT? —Nils von Barth (nbarth) (talk) 23:16, 1 April 2024 (UTC)
- Sounds good! Macrakis (talk) 15:39, 2 April 2024 (UTC)