# Talk:Quantile

WikiProject Statistics (Rated C-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

C  This article has been rated as C-Class on the quality scale.
Mid  This article has been rated as Mid-importance on the importance scale.

## Objection to current definition

In my [Mood, Graybill, Boes] the quantile is defined in a completely different way:

The q-th quantile of a random variable X or of the corresponding distribution is defined as the smallest number ξ that satisfies F_X(ξ) <= q.

I find that definition simpler to explain and understand, and more general. Quantiles are continuous, not necessarily chopped in 4, 5, 10 or 100 or whatever. Else it would not be possible to express an irrational quantile (e.g. 1/sqrt(2)). And let's face it, what would the world be without irrational quantiles? :-)

Also this article, as it stands, does not clearly recognise that quantiles are properties of the distribution (be it continuous or discrete), and not of a sample. From a sample, we can only estimate (and not calculate, by the way) these attributes.--PizzaMargherita 20:18, 4 October 2005 (UTC)

While that definition may be more general, it is not completely different. However, quantiles are discrete, not continuous (see "with k an integer satisfying 0 < k < q"). A continuous quantile would essentially be an inverse CDF, where quantile 1/sqrt(2) of a 4-quantile (read: quartile) would be equal to ${\displaystyle CDF^{-1}({\frac {1}{4\cdot {\sqrt {2}}}})}$
Moreover, I must object to that definition via reductio ad absurdum, seeing as a 4th quantile of a 100-quantile would result in F_X(ξ) <= 4, but assuming that F_X(ξ) is the CDF of a random variable, this would result in all values of a random variable, seeing as F_X cannot go beyond 1. I say reductio ad absurdum because 4 is an invalid argument to the inverse CDF, ${\displaystyle CDF^{-1}(4)}$, unless you're using complex values in your calculations, as in ${\displaystyle \sin ^{-1}({\sqrt {2}})={\frac {\pi }{2}}-i{\frac {\ln(2{\sqrt {2}}+3)}{2}}}$ (I got this from a TI-89 calculator). I think it is safe to assume that it is absurd, but this is all dependent upon my assumption that F_X(ξ) is the CDF of a random variable. NormDor 02:08, 20 February 2006 (UTC)
"quantiles are discrete, not continuous" - Precisely my objection to the current definition.
"A continuous quantile would essentially be an inverse CDF." - No, it wouldn't. Not all CDFs are invertible.
"quantile 1/sqrt(2) of a 4-quantile" - This is not allowed by the current definition, which mandates k to be an integer.
I'm not sure I understand your "reductio ad absurdum" argument against MGB's definition, but I accept that it allows q-th quantiles to be defined for q<0 and q>1. However, that's easy enough to work around:
The q-th quantile (with 0 ≤ q ≤ 1) of a random variable X or of the corresponding distribution is defined as the smallest number ξ that satisfies F_X(ξ) ≤ q.
I don't understand what complex numbers have to do with any of this. I'm not aware of any extensions in measure theory that allow the inverse CDF to be defined for complex arguments, or indeed for arguments outside [0,1], unlike the well-known extension for logarithms and therefore inverse trigonometric functions.
Finally, your recent "Error Correction" to the article is wrong. "P" is not the CDF. I think the article needs a thorough review. I'll do it when I find some time. PizzaMargherita 23:28, 20 February 2006 (UTC)
I had just never thought of the q in a q-th quantile as lying in [0, 1] (when I see "k-th something" I take k to be an integer, because I just don't see ".83752-th quantile" working in a sentence), and thus my main argument is incorrect in the first place. However, I don't see how a continuous CDF can fail to be invertible.
""P" is not the CDF." Oops... my bad.
"I'm not aware of any extensions" neither am I, I was just using it as an example in my flawed argument.
Personally I disagree with such usages but I have found other usages similar to what you have described. I guess you could call it a 1-quantile without the restriction of k being an integer. NormDor 06:02, 21 February 2006 (UTC)
I just checked the edit history and you can blame this entire fiasco on me. Sorry. NormDor 06:20, 21 February 2006 (UTC)
Hey, no worries. Take it easy. PizzaMargherita 07:07, 21 February 2006 (UTC)
All CDFs of discrete random variables are invertible when you interpolate between every discrete element, as is allowed with quantiles (weighted average). The interpolation would result in a CDF that is a strictly increasing continuous function satisfying the "ONTO" and "ONE-TO-ONE" properties required for the existence of an inverse function. Regarding the complex values, they were the result of mistaking (0,1)-bounded continuous values for discrete values, resulting in CDF(X)>1, which is impossible (ergo the absurdity in a strict, unclarified interpretation of the given equation) unless you consider complex values. --ANONYMOUS COWARD0xC0DE 04:21, 2 April 2007 (UTC)

It would be nice to have some NON-mathematical discussions of these terms - you guys make it too hard!

## Equivalent Characterisation

I think there might be a slight mistake in the equivalent characterization of the p- and q-quantile. Should the second line not be ${\displaystyle P(X>x)\leq p}$ or something similar?

No, I do not believe that the definition should be ${\displaystyle P(X>x)\leq p}$, because it is my opinion that it is almost always done as ${\displaystyle P(X. I believe that the author added the second equivalent to account for the actual method used in estimating the quantiles, where sometimes you round up and sometimes you round down. Along those lines, I believe that the two equivalent functions could just as easily be expressed as ${\displaystyle P(X\leq x)\cong p{\mbox{ and }}P(X\geq x)\cong 1-p}$. I hope I didn't go on and on as much as I did last time (⇩⇩⇩See below⇩⇩⇩!!!); sorry about that, I don't know what I was thinking. NormDor 13:38, 22 June 2006 (UTC)

## Two images

Two images from Yrithinnd's list of Commons images might be of interest here?

--DO11.10 00:19, 5 May 2007 (UTC)

## Tertile

I came across the word tertile, but found no article for it, so I made a stub. The word is infrequent in use, but I ask that someone with a better command of the statistical lingo than mine refine the article. --Shingra 08:29, 23 July 2007 (UTC)

The tertile stub has since been rewritten and transwikified: tertile. --Shingra (talk) 11:24, 23 November 2007 (UTC)

## Dr.Math

I came across this http://mathforum.org/library/drmath/view/60969.html which I believe has a much simpler explanation.

Should we put it in the external links at least? —Preceding unsigned comment added by 189.33.225.219 (talk) 05:41, 25 December 2007 (UTC)

## Estimating the quantiles of a population

I had added the "Quantiles of a sample" section, distinct from the "Quantiles of a population" section, because I have used the ideas of the former for several years, and had assumed that they were commonplace. However, since the time of adding that section, I have not seen any published works that are similar to my approach. Thus, I must conclude that my approach is actually original research — and I have removed it. Quantling (talk) 20:39, 10 August 2009 (UTC)

Quantiles of a sample, revisited: Can we really compute a 1 percentile or 99 percentile if we have, say, only ${\displaystyle N=2}$ points drawn from some distribution? The expected (average) percentile of the smaller of two points is 33 1/3 and the expected percentile of the larger of two points is 66 2/3. (See order statistics.) I have no problem computing percentiles between 33 1/3 and 66 2/3 by interpolating the two sampled values, but I am tempted to say that, for percentiles that are below 33 1/3 or above 66 2/3, we don't have enough information to estimate the value. Does that make sense? If so, is it appropriate to touch on these ideas in the article? Quantling (talk) 18:39, 2 February 2010 (UTC)

The distinction manifests itself in interpolation as well. If I had to choose one of ${\displaystyle N=3}$ points to represent the 35 percentile, I would choose the smallest value (with an expected percentile of 25, which is only 10 away from 35) rather than the middle value (50 percentile). However, the article tells me that the smallest value represents the percentile range up to at most 33 1/3 and that, among the three points, the middle value is the appropriate choice. Quantling (talk) 18:53, 2 February 2010 (UTC)
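For readers who want to check the order-statistics figures quoted above, here is a minimal sketch. It assumes a continuous distribution, so that the expected percentile of the k-th smallest of N draws is k/(N+1); the function names `expected_percentile`, `nearest_rank` and `simulated_percentile` are made up for this illustration and do not come from the article.

```python
from fractions import Fraction
import random

def expected_percentile(k, n):
    """Expected population percentile (as a fraction of 1) of the
    k-th smallest of n draws from a continuous distribution."""
    return Fraction(k, n + 1)

def nearest_rank(p, n):
    """Rank k whose expected percentile is closest to the target p."""
    return min(range(1, n + 1),
               key=lambda k: abs(expected_percentile(k, n) - Fraction(p)))

def simulated_percentile(k, n, trials=20_000, seed=0):
    """Monte Carlo check using U(0,1), where the CDF value of a draw
    is the draw itself."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sorted(rng.random() for _ in range(n))[k - 1]
    return total / trials

# N = 2: expected percentiles 1/3 and 2/3
print(expected_percentile(1, 2), expected_percentile(2, 2))
# N = 3, target 35th percentile: the smallest value (expected 25th) is nearest
print(nearest_rank(Fraction(35, 100), 3))
```

This only illustrates the "expected percentile" reading used in the comments above; it does not settle which estimator the article should recommend.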

The current "Estimating the quantiles of a population" section now addresses these issues, by listing all R and SAS approaches for estimating quantiles. Quantling (talk) 15:28, 22 March 2010 (UTC)

## Preferred method

I have removed the following

• Monte Carlo simulations show that method R-5 is the preferred method for continuous data. <ref name="schoonjans">Schoonjans F, De Bacquer D, Schmid P (2011). "Estimation of population percentiles". Epidemiology. 22 (5): 750–751. doi:10.1097/EDE.0b013e318225c1de.</ref>

as this is a letter in a medical journal and it looks as if the writers are expressing personal opinions after having looked at a particular case, rather than a widely held view among statisticians.--Rumping (talk) 22:59, 13 December 2011 (UTC)

## Quartiles

I've added Quartiles to the list of Specialized quantiles. Is there a reason it wasn't there in the first place, while less common quantiles like duo-deciles are present? --Adam Matan (talk) 11:35, 8 September 2013 (UTC)

## Quantile estimation from a sample

### Week of September 6, 2015

Dear Leegrc,

The recent edit on estimating sample quantiles was undone by you. According to the Wikipedia rules, fully referenced new information should not be reverted, especially not when based on unreferenced arguments.

Your argument that the quantile at the kth value depends on "the context and purpose" suggests that the matter is somehow subjective. But this is mathematics. The quantiles are defined by the CDF, and the CDF is defined by the non-exceedance probabilities. The probability itself is not subjective, of course, but is defined by Kolmogorov's axioms. It follows that the quantile point values for the order ranks cannot depend on anything related to their use.

On the contrary, they are known exactly, please read the reference http://dx.doi.org/10.1155/2014/326579

Best regards,

RTELAM

RATLAM (talk) 08:33, 12 September 2015 (UTC)

### Week of September 13, 2015

Thank you for your note. This is how I think about it. Yes, I agree that when a population is fully described, its quantiles are fully characterized by the CDF, except for ties such as the median of an even number of elements, where there is some ambiguity. However, when one has a sample drawn from an unknown population, one does not know the CDF of the population. What one then chooses for the estimate of the quantile can depend upon the assumptions and purposes. For example, if it is known that the sample is drawn from some normal distribution with known variance but unknown mean, that can be used to interpolate when h (from the big table in the article) is not an integer. On the other hand, if it is known that the sample is drawn from a uniform distribution with unknown endpoints, that can be used to interpolate in a different way. R and the other packages support multiple approaches, not simply in hopes that people will accidentally choose the wrong option, but because there are circumstances where each is applicable. Even with so many options, the list is not and cannot be exhaustive.
Your critique of these thoughts would be appreciated. Thank you Leegrc (talk) 12:14, 14 September 2015 (UTC)
The article you have been citing (Makkonen, L., Pajari, M. (2014) Defining sample quantiles by the true rank probability. Journal of Probability and Statistics, ID 326579. http://dx.doi.org/10.1155/2014/326579) has zero citations according to Google Scholar. It fails the notability criterion of Wikipedia and I have removed those citations. Leegrc (talk) 15:20, 14 September 2015 (UTC)

Thank you for your response. I can see your point that the method of interpolation should be chosen according to the CDF when it is known, or can be deduced from the data. This is analogous to transforming the probability axis according to the assumed CDF in order to obtain a linear relationship, and then interpolating linearly.

However, my edit was not about the interpolation method. It was about the points between which interpolation is made. These points represent the probability that a random observation selected from the population does not exceed the hth value in the sample. They are known exactly, and since we are discussing order statistics here, they are independent of the CDF. These probabilities are h/(N+1). For example, a new random observation does not exceed the highest value of two observations in a sample with the probability of 2/3 (regardless of the CDF!). Since the quantile function is defined as the inverse of the CDF, only R-6 gives the exact and correct value of h. It follows that all the other formulas given in the table are in error.

This issue is discussed in the reference Makkonen and Pajari (2014) in detail. The paper has been published in a peer reviewed journal. It has not been cited yet because it was published only nine months ago.

The point that I tried to make in my edit to the page “Empirical distribution function” was quite the same, i.e., that the traditional definition of EDF could be improved, because it gives erroneous point values and steps. That the steps are actually 1/(N+1), not 1/N, is outlined in Makkonen and Pajari (2014). However, this particular fact has also been proven and published, with illustrations, in the following reference: Makkonen, L. (2008) Bringing closure to the plotting position controversy. Communications in Statistics – Theory and Methods, 37, 460-467. This paper is highly cited (52 citations in Google Scholar so far) and could be used as the reference. A hint towards this finding was also given as a figure on p. 149 in the textbook Madsen et al (2006) “Methods of Structural Safety” ISBN-10: 0486445976. The first edition of this book was published in 1986 and the book has 2545 citations in Google Scholar. RATLAM (talk) 16:43, 18 September 2015 (UTC)

### Week of September 20, 2015

I believe that you are saying, for example, that if one has sampled an odd number of values and one wants to estimate the median of the underlying population then there is never any excuse to use anything other than the middle of the sampled values. A perhaps obvious counterexample is when one does know the population CDF; then one should choose its median regardless of what values were sampled. Another counterexample: suppose there is a distribution that is known to be 40% at zero and 60% at some unknown point x. In that case, rather than choosing the middle of the sampled values, which could happen to be zero, one should choose the non-zero value that is seen (assuming that a non-zero value is seen). Do these examples address your concerns? Leegrc (talk) 13:51, 21 September 2015 (UTC)

To both of your questions, my answer is no. Instead of using the midmost of the observed values, it is preferable to associate a probability p_h to each of the order-ranked values x_h and find a curve which fits these points (x_h, p_h). This curve is an estimate for the CDF, and the inverse of this estimate is used to determine the quantiles, e.g. the median. This is how methods R-4...R-9 are deduced. In all of them, a broken line connects point (x_h, p_h) to point (x_{h+1}, p_{h+1}). However, a better fit is often obtained when the curve is not forced to go via the points (x_h, p_h). In such a case, the estimated median and the midmost x_h seldom coincide.

All methods R-4...R-9 in your counterexamples give the same median. Therefore, they are not helpful in clarifying why only R-6 is adequate. My point is that methods R-4, R-5 and R-7...R-9 should be abandoned because they associate a wrong non-exceedance probability p_h to x_h, as shown in Makkonen, L. Bringing closure to the plotting position controversy. Communications in Statistics – Theory and Methods 37, 460-467 and Makkonen, L., Pajari, M. (2014) Defining Sample Quantiles by the True Rank Probability. Journal of Probability and Statistics, Article ID 326579, http://dx.doi.org/10.1155/2014/326579. RATLAM (talk) 17:30, 22 September 2015 (UTC)

Are you saying that the only legitimate p_h value for the h-th smallest of n sampled values is p_h = h/(n+1)? As a specific case, are you saying that the middle of an odd number of sampled values should always get p_h = 1/2? These are the results of Makkonen et al., right? If I do not have that right, please give an example where you think the middle of an odd number of sampled values would get a p_h ≠ 1/2 … or give an example where the curve that is not forced to go via the points (x_h, p_h) gives other than 1/2 for the middle of an odd number of sampled values. Please also explain how R-6 is appropriate for your example. Leegrc (talk) 19:09, 22 September 2015 (UTC)
Are you saying that R-6 is not always correct, but when it is incorrect so must be the alternatives R-1, …, R-5, R-7, …, R-9, and others? Leegrc (talk) 19:11, 22 September 2015 (UTC)

The answer is yes to your first three questions. p_h = h/(n+1) is the probability of a random observation not to exceed the h-th smallest value in the sample. This is always correct. Also the other formulas (except R-4) give the correct probability for the middle observation, but not for any other rank. Consider an example with N = 3 and h = 3, i.e. the probability of not exceeding the largest value in the sample of three. For this case, the formulas in the table give the following p_h:

R-4: 1
R-5: 5/6
R-6: 3/4
R-7: 1
R-8: 4/5
R-9: 21/26

Only one of these is the correct probability for this random process, and that is 3/4.

RATLAM (talk) 20:27, 22 September 2015 (UTC)
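The six values listed above can be reproduced with a short sketch. It assumes that the table in the article uses the usual Hyndman–Fan parametrisation of h as a function of p (R-4: h = Np; R-5: h = Np + 1/2; R-6: h = (N+1)p; R-7: h = (N−1)p + 1; R-8: h = (N+1/3)p + 1/3; R-9: h = (N+1/4)p + 3/8) and solves each for p. The function name `plotting_position` is invented for this illustration.

```python
from fractions import Fraction

def plotting_position(method, h, n):
    """Probability p_h that each formula assigns to the h-th smallest
    of n sampled values (assumed Hyndman-Fan definitions, solved for p)."""
    h, n = Fraction(h), Fraction(n)
    formulas = {
        "R-4": lambda: h / n,
        "R-5": lambda: (h - Fraction(1, 2)) / n,
        "R-6": lambda: h / (n + 1),
        "R-7": lambda: (h - 1) / (n - 1),
        "R-8": lambda: (h - Fraction(1, 3)) / (n + Fraction(1, 3)),
        "R-9": lambda: (h - Fraction(3, 8)) / (n + Fraction(1, 4)),
    }
    return formulas[method]()

# Largest of three sampled values: N = 3, h = 3
for method in ("R-4", "R-5", "R-6", "R-7", "R-8", "R-9"):
    print(method, plotting_position(method, 3, 3))
```

The sketch only reproduces the arithmetic; which of these probabilities is the appropriate one is exactly what is being debated in this thread.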

I would say that "p_h = h/(n+1) is the probability of a random observation not to exceed the h-th smallest value in the sample" needs some qualifications. I agree that it is true for a continuous distribution for which no point has a finite probability of occurring. However, it would be false for my 40% / 60% example with 1,000,001 samples, in that the middle of the sampled values will almost surely be the non-zero value and the probability that a random observation will not exceed the middle value is thus a negligible hair under 100%, not 50%. As another example, suppose the underlying true population has exactly 3 equally probable unknown values. With 1,000,000 samples drawn from that distribution, the largest sampled value will be the largest of the three values with probability 100% − (2/3)^(1,000,000), and the probability that a random observation will not exceed the largest value is much closer to 100% than it is to 1,000,000 / 1,000,001.
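The two discrete examples can be checked with exact binomial arithmetic. The following is a rough sketch with smaller sample sizes than those mentioned above, to keep the numbers within floating-point range; the function names are invented for the illustration.

```python
from math import comb

def prob_median_is_zero(n, p_zero=0.4):
    """P(the middle of n sampled values is 0), n odd, when each draw is 0
    with probability p_zero and some non-zero x otherwise.
    The median is 0 exactly when more than half of the draws are 0."""
    need = n // 2 + 1
    return sum(comb(n, j) * p_zero**j * (1 - p_zero)**(n - j)
               for j in range(need, n + 1))

def prob_max_is_largest(n, values=3):
    """P(the largest of n draws equals the largest of `values`
    equally probable population values)."""
    return 1 - ((values - 1) / values) ** n

print(prob_median_is_zero(3))     # small sample: the median is often 0
print(prob_median_is_zero(501))   # larger sample: the median is almost never 0
print(prob_max_is_largest(50))    # already very close to 1
```

With the sample median almost surely equal to the non-zero value, the chance that a new observation does not exceed it is close to 100%, which is the point made above against a blanket h/(n+1).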
More importantly, I would say that we are solving two different problems here. Even with a continuous distribution that has no finite probabilities (that is, no parts that look like a discrete distribution), the statement that "p_h = h/(n+1) is the probability of a random observation not to exceed the h-th smallest value in the sample" is not the same as, for example, the statement that the "best" estimate for the k-th q-quantile that can be derived from a sample of q−1 values is necessarily the k-th smallest sampled value, because what is meant by "best" depends upon your loss function. For example, in some circumstances the mode is more important than the mean; perhaps the relevant loss function gives credit only for an estimate that is within ϵ of the actual population quantile (for which mode is best) rather than being a root-mean-square-deviation valuation (for which mean is best). In that case, the best estimate is where the mode is, which is given by R-7. Sometimes instead, people care more about the median. Sometimes, with discrete distributions, people care more about "less than" than "not more than". And so on.
To be very concrete, suppose we have a sample of 2 values drawn from a uniform distribution with unknown endpoints [a, b] and we want to estimate the first quartile of the population, which is (3/4)a + (1/4)b. What estimate based upon the 2 sampled values will give you the smallest root-mean-square-deviation from (3/4)a + (1/4)b? What estimate based upon the 2 values will give you the highest probability of getting a value within ϵ of (3/4)a + (1/4)b? Are these different?
Lastly, I believe that the calculations that you attribute to Makkonen and Pajari (2014) should be instead attributed to Karl Pearson and his work on the beta distribution, and that the result is about 100 years old. Leegrc (talk) 12:11, 23 September 2015 (UTC)

I was discussing continuous distributions. The estimation is done by fitting a distribution by using the points (x_h, p_h), and the loss function depends on how well this is done. Obviously, any such estimation should use all the available information. This includes the exact non-exceedance probabilities of the ranked data p_h. So, why not use them? Regardless of how "best" is measured, it makes no sense to use, instead, some other incorrect probabilities as the basis of the fitting.

Pearson derived the beta distribution, and Emil Gumbel utilized it in 1958 to show that p_h = h/(n+1) is the expectation of the CDF associated with the rank h. That this actually equals the probability of not exceeding the h-th value is a more recent finding, the application of which to quantile estimation is the subject of the paper by Makkonen and Pajari (2014). RATLAM (talk) 19:40, 23 September 2015 (UTC)

I believe you are stating that R-6 is the only correct solution, so long as we are talking about a continuous distribution (without discrete-distribution-like parts) and so long as we are talking about the knots (points) between which the interpolations are made. The text (that I deleted) from the article omitted those qualifications.
Even in the continuous case, though … let me try to get even more concrete by working an example. Suppose we have sampled two points independently and uniformly at random from an unknown interval [a,b]. Call the smaller one x and the larger one y. Suppose that our loss function is that we want to minimize the expected square deviation of our estimate from the true value. Suppose we are looking for p = 1/3, i.e., the 1st 3-quantile aka the 33 1/3 percentile. R-6 would say to use the smaller of the two sampled points, x, but let's do the math. Because the problem is translation and scale invariant, we know that the optimal solution will be of the form (2−h)x + (h−1)y for some value h. Once the form of the solution is selected, we can assume that a = 0 and b = 1 without loss of generality. We choose h to minimize the cost function
${\displaystyle \Phi (h)=\int _{0}^{1}\int _{x}^{1}([(2-h)x+(h-1)y]-p)^{2}~2~dy~dx}$,
where 2 dy dx is the joint probability density for the drawing of the two points, 0 ≤ x ≤ y ≤ 1, and where ([(2−h)x + (h−1)y] − p)^2 is the loss function. These integrals are not hard to evaluate, yielding:
${\displaystyle \Phi (h)={\frac {1}{6}}\left(h^{2}-(1+4p)h+(1+6p^{2})\right)}$.
Minimizing the value of Φ(h) with respect to h is easy. The answer we get is that h = 2p + 1/2 minimizes the loss function; this formula is R-5. When p = 1/3 then h = 7/6; the estimate with lowest expected loss is Q_p = (5/6)x + (1/6)y. For this loss function, we have shown that Q_p = (5/6)x + (1/6)y, which has Φ(7/6) = 11/216 ≈ 0.0509, is better than Q_p = x, the R-6 solution with Φ(1) = 1/18 ≈ 0.0556, even though the R-6 solution falls squarely on the knot defined by the smaller point x. Leegrc (talk) 13:00, 24 September 2015 (UTC)
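The closed form for Φ(h) can be double-checked without doing the integrals by hand. The sketch below uses the standard moments of the smaller value x and the larger value y of two independent U(0,1) draws (E[x] = 1/3, E[y] = 2/3, E[x²] = 1/6, E[xy] = 1/4, E[y²] = 1/2) and expands the squared error; the function names are invented for this illustration.

```python
from fractions import Fraction

# Moments of the minimum x and maximum y of two independent U(0,1) draws
E_X, E_Y = Fraction(1, 3), Fraction(2, 3)
E_XX, E_XY, E_YY = Fraction(1, 6), Fraction(1, 4), Fraction(1, 2)

def expected_loss(h, p):
    """E[((2-h)x + (h-1)y - p)^2], expanded term by term."""
    a, b = 2 - h, h - 1
    second_moment = a * a * E_XX + 2 * a * b * E_XY + b * b * E_YY
    first_moment = a * E_X + b * E_Y
    return second_moment - 2 * p * first_moment + p * p

def closed_form(h, p):
    """Phi(h) as given in the comment above."""
    return (h * h - (1 + 4 * p) * h + (1 + 6 * p * p)) / 6

p = Fraction(1, 3)
print(expected_loss(Fraction(7, 6), p))  # 11/216, the R-5 choice h = 2p + 1/2
print(expected_loss(Fraction(1), p))     # 1/18, the R-6 choice h = 1
```

Both routes agree, so under this particular squared-error loss the minimiser is h = 2p + 1/2; the sketch says nothing about whether squared error is the right criterion.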

### Week of September 27, 2015

50 000 new random observations taken from a uniform distribution falling into three bins.

That the loss function can be best minimized by some quantile formula does not prove anything of the correctness of that formula. This is because the loss function has no direct connection with the probability, and is thus biased in the probabilistic sense. What your example demonstrates is merely that it is possible to obtain a useful result by combining a biased method with suitably adjusted incorrect quantiles.

The probabilistically correct way to compare the quantile formulas is the following. Draw a line (estimated CDF) through points (x(h), p(h)) and (y(h), p(h)) and determine, by this line, X and Y that correspond to the p-values 1/3 and 2/3. Then, take from the original distribution a new random observation z, and see to which bin 1: [0, X), 2: [X, Y), 3: [Y, 1] it belongs. Repeat this and count the relative frequency of observations falling into each bin. The result where the relative frequencies in the bins approach the same value shows that the method is good (in the same way as an equal frequency of the results 1-6 shows that a die is fair). The Hazen formula (R-5) and the Weibull formula (R-6) provide the following result when 50 000 random observations are taken.

This figure is for the uniform distribution, discussed above, but the same result is obtained for any other continuous distribution. Only by using R-6, a unique asymptotic relative frequency is obtained, showing that all other quantile formulas are incorrect. RATLAM (talk) 11:27, 30 September 2015 (UTC)
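A rough simulation of the bin experiment described above, for two sampled points from U(0,1). It rests on two assumptions that are not spelled out in the comment: that a fresh sample of two points is drawn for every new observation z, and that the estimated CDF is the straight line through (x, p_1) and (y, p_2), with p_1 = 1/3, p_2 = 2/3 for R-6 and p_1 = 1/4, p_2 = 3/4 for R-5. The function name `bin_frequencies` is invented for this illustration.

```python
import random

def bin_frequencies(p_low, p_high, trials=50_000, seed=0):
    """Relative frequency of a new U(0,1) observation z falling in
    [0, X), [X, Y), [Y, 1], where X and Y are read off the line through
    (x, p_low) and (y, p_high) at p = 1/3 and p = 2/3."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(trials):
        x, y = sorted((rng.random(), rng.random()))  # assumed: redrawn each time
        slope = (y - x) / (p_high - p_low)
        X = x + (1 / 3 - p_low) * slope
        Y = x + (2 / 3 - p_low) * slope
        z = rng.random()
        if z < X:
            counts[0] += 1
        elif z < Y:
            counts[1] += 1
        else:
            counts[2] += 1
    return [c / trials for c in counts]

print("R-6 (Weibull):", bin_frequencies(1 / 3, 2 / 3))
print("R-5 (Hazen):  ", bin_frequencies(1 / 4, 3 / 4))
```

Under these assumptions the R-6 bins each collect about one third of the observations, while the R-5 bins come out near 7/18, 2/9 and 7/18. Whether equal bin frequencies are the right yardstick is the point in dispute in the following comments.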

I guess I have to disagree with the statement "That the loss function can be best minimized by some quantile formula does not prove anything of the correctness of that formula." What it does prove is that the formula is correct for minimizing that loss function. Are you arguing that it is inappropriate ("biased") for anyone to want to minimize this loss function in this circumstance? If so, on what basis? We are in agreement that R-6 is most appropriate if one wants the expected value of the fraction of future draws to be the quantile in question (… or at least R-6 is most appropriate in the subscenario of a continuous-only distribution when R-6 gives h to be an integer), but what if the loss function is a better representation of what I care about than this result about this expected value? Least-squares approaches are very popular in a wide variety of scenarios….
Is it that you are arguing that approaches other than R-6 can be appropriate for real world problems, but that these real world problems should not be characterized as instances of the quantile estimation problem? Evidence from the nomenclature usage in popular statistical packages suggests that quantile estimation is what many people call what they are doing when they use any of the formulae in the article's table. Leegrc (talk) 15:46, 30 September 2015 (UTC)

Quantiles are not defined by their ability to minimize some function, but in the probabilistic way explained in the beginning of the "Quantile" page. Their correctness should be viewed against the definition, of course. It is not inappropriate to use a loss function, but already your example shows that it can be misleading in the probabilistic sense.

The second paragraph of your reply nails it. Yes! People indeed think that they are using quantiles, when they are not. This is why these practices require sharpening and why, in my opinion, the "Quantile" page of Wikipedia should be edited further. RATLAM (talk) 13:53, 1 October 2015 (UTC)

I think we are now in complete agreement that people can have legitimate uses for formulae other than R-6, even for plotting purposes, structural design, extreme values, and so forth. The only question I see up for debate is what these quantile-like quantities that are produced should be called. Based upon common usage, it is clear to me that they should be called "quantiles" or "quantile estimates" or similar. Leegrc (talk) 15:15, 1 October 2015 (UTC)

Whether we are in agreement or not depends on the meaning of "legitimate". Any quantile formula other than R-6 will result in a poor estimate of the CDF, because incorrect point values of the probability are then used as the basis of the estimation. A good empirical estimate of the CDF is crucial in determining extreme values for structural design, as an example. Hence, in this connection, formulas other than R-6 are certainly not legitimate, and typically result in underestimating the risk of extreme events.

If one is, instead, interested in estimating something else than the CDF, then the situation may be different. For example, commonly, people estimate a parameter of a CDF in the sense that if one would take more samples, their mean would asymptotically approach the parameter of the underlying CDF. In this case, the result may be improved by using points other than those of R-6. This is because the parameters of a non-linear CDF are non-additive (p is additive by definition), which means that the mean of a distribution parameter actually does not approach the true parameter value. The error arising from this can be compensated by making another error: distorting the quantiles from their true values.

In this light, the incorrect probabilities should not be called anything that implies a probabilistic interpretation, such as quantiles. Neither is there any point in calling them estimates, since the exact probabilities are known, and there is no need to estimate them - only to correct for an inadequate method. Perhaps "distorted quantiles" would be appropriate. RATLAM (talk) 13:27, 2 October 2015 (UTC)

Yes, I agree that if the experiment is to draw N points from a continuous (and "discrete-free") distribution, sort them, giving x_1 ≤ … ≤ x_N, and draw an additional point y then it will be the case that the probability is trivially computed as Pr[y ≤ x_k] = k/(N+1). That is, if we repeat this experiment many times, drawing x values and the y value each time, and compute the fraction of experiments that give y ≤ x_k then that will converge to k/(N+1) as the number of experiments goes to infinity.
However, that isn't the experiment that is repeated in many real world situations. Very often the set of x values is given once (the past observations) and the repeated experiment is draws of the y value (possible future observations). In this case, the convergence of the fraction of experiments that achieve y ≤ x_k is to CDF(x_k) for the given value of x_k rather than to k/(N+1). That is, when one is hoping to find the k-th (N+1)-quantile, the use of the given value of x_k does not guarantee the desired probability k/(N+1) even in the limit as the number of experiments goes to infinity. One could use x_k nonetheless, because that is correct for the related experiment where the x values are also repeatedly chosen, but such a choice is not required. Other reasonable approaches include using loss functions, such as least squares. Leegrc (talk) 17:35, 2 October 2015 (UTC)
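The two experiments contrasted above can be simulated side by side. This is a minimal sketch for U(0,1), where CDF(x) = x; the function names and the fixed sample values 0.2, 0.5, 0.6 are invented for the illustration.

```python
import random

def redrawn_frequency(k, n, trials=50_000, seed=1):
    """Fraction of experiments with y <= x_k when the n sample values
    AND the new observation y are redrawn every time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = sorted(rng.random() for _ in range(n))
        if rng.random() <= xs[k - 1]:
            hits += 1
    return hits / trials

def fixed_frequency(xs, k, trials=50_000, seed=2):
    """Fraction of new observations y with y <= x_k when the sample xs
    is given once and only y is redrawn."""
    rng = random.Random(seed)
    x_k = sorted(xs)[k - 1]
    hits = sum(rng.random() <= x_k for _ in range(trials))
    return hits / trials

sample = [0.2, 0.5, 0.6]                      # hypothetical past observations
print(redrawn_frequency(3, 3))                # approaches k/(N+1) = 3/4
print(fixed_frequency(sample, 3))             # approaches CDF(x_3) = 0.6
```

The first frequency settles near k/(N+1), the second near CDF(x_k) for the particular sample in hand, which is the distinction drawn in the comment above.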

### Week of October 4, 2015

Pr[y ≤ x_k] = k/(N+1) is the probability of a new observation not to exceed the k-th smallest value in a sample. This is order statistics, so that Pr does not depend on the actual values in the sample. Thus, also in the experiment described in the second paragraph of your reply, Pr[y ≤ x_k] = k/(N+1). Counting frequencies of y ≤ x_k for a given value of x_k cannot be done, since we have no given values of x, as we do not know the CDF. We wish to estimate the CDF and the quantiles, and for that we have estimates of x_k (the observations in the sample) and their known rank probabilities Pr_k. RATLAM (talk) 18:10, 4 October 2015 (UTC)

Frequently, a set {x_1, …, x_N} is what the scientist starts with when trying to estimate the quantiles, so I disagree with the statement "Counting frequencies of y ≤ x_k for a given value of x_k cannot be done". Also, the argument that the scientist can randomly draw new sets {x_1, …, x_N} at will is, in my experience, seldom true. Leegrc (talk) 11:56, 5 October 2015 (UTC)

Perhaps, the discussion is drifting away from "Estimating quantiles from a sample" which is the title of this section of the "Quantile" page. So, I would like to summarize. The quantile function is defined as the inverse of the CDF. The CDF is defined by the probability that a random observation, selected from the population, does not exceed some value x. For the h-th smallest value in a sample of N observations, this probability is h/(N+1). Therefore, R-6 in the table of the "Quantile" page gives the correct quantiles for the points. The other formulas provide erroneous quantiles. They may provide "best unbiased estimates" only for parameters that have no definite connection with the probability. The use of such parameters may be misleading, and it is important to be aware of this. The "Quantile" page of Wikipedia could be considerably improved by pointing out these issues. RATLAM (talk) 10:17, 6 October 2015 (UTC)

Perhaps it will help if I point out precisely where I think this logic fails. We agree that if one draws the set {x1, … xN} each time one draws the value y then the statement "For the h-th smallest value in a sample of N observations, this probability is h/(N+1)" is true. Otherwise, I disagree that such a statement follows from the previous statements. Specifically, unless I am very lucky, chances are that the value of xh that I start with will not be the h-th (N+1)-quantile, and no number of repeated random selections of y will change that. Leegrc (talk) 11:37, 6 October 2015 (UTC)

The value of xh has nothing to do with order-statistics. Example: Our randomly chosen sample consists of three men in a room, and we ask what is the probability Pr of a randomly selected man entering the room next not to be taller than the h-th shortest man in this sample. The answer is Pr = h/(N+1). For example, the probability of the incomer not to be taller than anyone in the sample is 3/4. This is easy to understand by noting that, once the newcomer has entered the room, there are four men in it, each randomly chosen, so that the probability of one of them (e.g. the newcomer) to be the tallest is 1/4. All this has nothing whatsoever to do with how tall the men actually are. RATLAM (talk) 13:41, 6 October 2015 (UTC)
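
The counting argument in the three-men example can be checked by brute force. The sketch below is only an illustration under two stated assumptions: all heights are distinct, and every order in which the N+1 men could enter is equally likely. The function name `exact_nonexceedance` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def exact_nonexceedance(n):
    """Exact Pr(newcomer <= h-th shortest of the n men already present), h = 1..n.

    Heights are represented only by their ranks 0..n, so the result cannot
    depend on how tall anyone actually is.
    """
    counts = [0] * n
    total = 0
    for order in permutations(range(n + 1)):
        in_room, newcomer = sorted(order[:n]), order[n]
        total += 1
        for h in range(1, n + 1):
            if newcomer <= in_room[h - 1]:
                counts[h - 1] += 1
    return [Fraction(c, total) for c in counts]

probs = exact_nonexceedance(3)
print(probs)  # one exact probability per rank h = 1, 2, 3
```

Because only ranks enter the calculation, this covers the experiment in which the men in the room and the newcomer are all drawn afresh; it says nothing about a room whose occupants are held fixed.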

That example appears to be assuming that there are four men, three are chosen randomly to enter the room first, and then the fourth enters. Yes, in that case the numbers in the example are correct. However, there are other experiments. Perhaps I know that all men have a height that is uniformly distributed across some interval, but I don't know what the interval is. The three men I see are 1.9 meters, 1.8 meters and 1.5 meters. In that case my best guess for the third quartile should not be 1.9 meters. Instead I use R-5 to compute h = 2.75. With a slight reinterpretation of R-5, I use this to interpolate between x1 = 1.5 and x3 = 1.9 to get 1.85 meters. If I were a gambling man, I would bet real money that my 1.85 meters answer is going to be closer than the 1.9 meters answer in a test where we measure the height of the 75,000-th shortest man in a hold out set of size 99,999. Leegrc (talk) 14:44, 6 October 2015 (UTC)

There are no "other experiments" that need to be discussed here. The first sentence in the section Estimating quantiles from a sample of the Quantile page reads: "When one has a sample drawn from an unknown population, the cumulative distribution function and quantile function of the underlying population are not known and the task becomes that of estimating the quantiles." This is our problem, not something related to known distributions or intervals. RATLAM (talk) 11:08, 7 October 2015 (UTC)

I do not follow your argument. Are you saying that I would win the gamble but that I am still wrong? (Do you agree that I would win the gamble?) The quoted sentence does say that we will need to estimate the quantiles, and I agree. Are you saying that the existence of the need means that the gambling example is an inappropriate use of quantiles? Leegrc (talk) 11:44, 7 October 2015 (UTC)

Our problem is defined in the quoted sentence, in which I have now emphasized important words in Italics. Your example is a very different problem.

But I will briefly reply. In your example, where you know the distribution, estimation with the observations only is a poor alternative. One should utilize the information on the CDF when available. You have no bet, because nobody is suggesting that the estimation should be done so that an observed value as such (here 1.9 meters) is taken as a quantile in your example. Rather, a probability h/(N+1) is associated with each observed value, and the quantiles are obtained by the CDF fitted to all the points. RATLAM (talk) 14:48, 7 October 2015 (UTC)

Let me try this step by step to explain why my example is not "a very different problem". Which of the following statements do you disagree with?
1. The task of finding the height of the 75,000th shortest person in a set of 99,999 people can be called "the task of finding the third quartile of that set of people" even if we know that those 99,999 heights were drawn uniformly at random from an unspecified interval.
2. By "estimating" the height of the 75,000th shortest person, we mean trying to come up with a height that will be near to the actual height of that person and, in particular, evaluation of an estimator for efficiency, bias, etc. should be based upon how the estimator approaches the height of that person.
3. If we use a sample of size 3 that is randomly drawn without replacement from the 99,999 people then the task of estimation of the third quartile of the set of 99,999 people using only the height of these three people can be called the task of "quantile estimation from a sample".
4. If we use R-6 on the sample of size 3 then R-6 will indicate h = 3.
5. The largest of the three heights is the R-6 estimate for the third quartile of heights in the set of 99,999 people. That is, QR-6 = x3.
6. Use of R-5 on this sample of size 3 will give h = 2.75.
7. Although not the typical way, this value of h can be achieved by calculating a weighted mean QR-5 variant = (1/8)x1 + (7/8)x3.
8. QR-5 variant is more efficient and less biased than QR-6 for estimating the height of the 75,000th shortest person in the set of 99,999 people.
Leegrc (talk) 12:04, 8 October 2015 (UTC)
Actually, I have debugged the math and it isn't R-5 either that is optimal. The optimal formula that uses only x1 and xN is looking like h = (N + 1 − 2/N)p + 1/N, where N is the size of the sample and the size of the population (e.g., 99,999) tends to infinity. This is R-5 only for the sample of size 2; it is R-8-like for a sample size of 3 and is not on the article's chart for larger sample sizes. In the case of estimating the third quartile with a sample of size 3, the formula gives h = (3 + 1 − 2/3)(3/4) + 1/3 = 2 5/6, which continues to differ from the h = 3 result from R-6. Leegrc (talk) 17:32, 9 October 2015 (UTC)
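
The arithmetic in the list and the follow-up can be reproduced exactly with rational numbers. This is a sketch, not the article's definitions: the helper names are hypothetical, R-6 and R-5 are taken as h = (N+1)p and h = Np + 1/2 (which give the h = 3 and h = 2.75 quoted above), and `interpolate_endpoints` encodes the "slight reinterpretation" as a straight line between x1 and xN.

```python
from fractions import Fraction

def h_r6(n, p):
    """R-6 position, assumed here to be h = (N + 1) p."""
    return (n + 1) * p

def h_r5(n, p):
    """R-5 position, assumed here to be h = N p + 1/2."""
    return n * p + Fraction(1, 2)

def h_endpoint(n, p):
    """The formula proposed in the comment above: h = (N + 1 - 2/N) p + 1/N."""
    return (n + 1 - Fraction(2, n)) * p + Fraction(1, n)

def interpolate_endpoints(xs_sorted, h):
    """Linear interpolation between x_1 (at h = 1) and x_N (at h = N) only."""
    n = len(xs_sorted)
    w = (h - 1) / (n - 1)  # weight on the largest value
    return (1 - w) * xs_sorted[0] + w * xs_sorted[-1]

heights = [Fraction(3, 2), Fraction(9, 5), Fraction(19, 10)]  # 1.5, 1.8, 1.9 m
p = Fraction(3, 4)
n = len(heights)

for name, h in [("R-6", h_r6(n, p)), ("R-5", h_r5(n, p)), ("endpoint", h_endpoint(n, p))]:
    print(name, "h =", h, "estimate =", float(interpolate_endpoints(heights, h)))
```

The code only reproduces the positions and the resulting height estimates; it does not settle which estimate is closer to the true third quartile.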

### Week of October 11, 2015

I disagree with the last statement (8). As discussed, the probability of not exceeding the h-th smallest value in the sample of N is h/(N+1). Because CDF is defined by the non-exceedance probability, R-6 gives the best empirical estimate of the quantiles. “Efficiency” and “bias” are related to the expected value. Nothing can be said about them based on a single artificial sample. RATLAM (talk) 09:15, 14 October 2015 (UTC)

Let me try to recap, with an even simpler experiment. We choose 3 values independently from the probability distribution that is uniform over some interval [a, b], in order to infer the point at which the CDF of this uniform distribution equals 75%, which occurs at (a + 3b)/4 exactly. We agree that the expected value of the largest of the three, x3, will be (a + 3b)/4. We agree that this is not the same as saying that 50% of the time x3 will be less than or equal to (a + 3b)/4. We agree that this is not the same as saying that the expected value E[{x3 − (a + 3b)/4}²] minimizes the quadratic loss formula. We agree that these three objectives (mean = (a + 3b)/4, median = (a + 3b)/4, and minimum square deviation around (a + 3b)/4) are different ways to home in on (a + 3b)/4, which is the absolutely true point at which the CDF of the uniform distribution equals 75%. Although you might characterize it as a distorted use of the words, we agree that it is common usage for people to characterize an approach based upon any of these three objectives as "quantile estimation from a sample". Do I have that right? Leegrc (talk) 13:05, 14 October 2015 (UTC)
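
A short worked derivation for this uniform example, offered as a sketch; the symbol t is introduced here only for the derivation. For three independent draws from the uniform distribution on [a, b], the largest value x3 satisfies

${\displaystyle \Pr[x_{3}\leq t]=\left({\frac {t-a}{b-a}}\right)^{3},\qquad a\leq t\leq b,}$

so that

${\displaystyle E[x_{3}]=a+{\tfrac {3}{4}}(b-a)={\frac {a+3b}{4}},\qquad \Pr \left[x_{3}\leq {\frac {a+3b}{4}}\right]=\left({\tfrac {3}{4}}\right)^{3}={\tfrac {27}{64}}\approx 0.42,\qquad \operatorname {median} [x_{3}]=a+2^{-1/3}(b-a)\approx a+0.794\,(b-a).}$

The mean of x3 therefore lands exactly on the 75% point while its median does not, which is the distinction drawn between the first two objectives.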

Yes, you have it right. A poor estimate is still an estimate. The point that should, in my opinion, be clear on the Quantile-page is that because R-6 gives the correct probabilities associated with order ranked observations, there is no benefit from using any of the other formulas in estimating quantiles, albeit they have historically been used. RATLAM (talk) 20:42, 15 October 2015 (UTC)

The goal is to find (a + 3b)/4 because 75% of future draws will not exceed that value. However, almost surely any formula for making an estimate from three sampled values will not give exactly (a + 3b)/4. That is, any formula for making an estimate from three sampled values will almost surely fail to give us a 75% nonexceedance probability for future draws. We could give up here, or we could relax our goal somewhat. One option for getting near the true goal of achieving exactly 75% nonexceedance regardless of the sample of three values is the use of R-6, where we achieve the goal that on average the estimator gives 75% nonexceedance. That is, R-6 is almost surely failing to deliver 75% nonexceedance for any particular sample of three values, but it is guaranteeing 75% nonexceedance on average. Who are we to say that R-6 is better at approximating the true every-sample goal than an approach that guarantees 75% nonexceedance on median, by which I mean that the median value of the estimator is exactly (a + 3b)/4, the 75% nonexceedance value? Leegrc (talk) 13:50, 16 October 2015 (UTC)

We know that the non-exceedance probability is h/(N+1) exactly. CDF and the quantiles are defined by the non-exceedance probability. Probability has a unique value assigned to an event. Therefore, it makes no sense to speak of its average or median. RATLAM (talk) 09:10, 17 October 2015 (UTC)

### Week of October 18, 2015

I think that we have found the heart of the disagreement. I claim that if we take N values randomly from a continuous probability distribution then the nonexceedance probability for the h-th smallest of the values will almost surely not be h/(N+1). This appears to be in direct contradiction to the statement that the nonexceedance probability will be exactly h/(N+1) in this scenario. Leegrc (talk) 12:09, 19 October 2015 (UTC)

Your claim is incomprehensible, because the concept of "almost surely" applies to an outcome of an event. Probability is not an outcome, but a specific number assigned to an event. The condition x ≤ xh defines an event in a two-dimensional space with variates x and xh, and P(x ≤ xh) is the probability assigned to this event. A mathematical proof that P(x ≤ xh) equals h/(N+1) has been given by Madsen et al. (2006) and Makkonen and Pajari (2014). If you wish to claim otherwise, you would need to point out an error in this proof. RATLAM (talk) 07:38, 20 October 2015 (UTC)

We are trying to understand what it means to estimate a quantile from a sample. I see you stating that it must be the case that we are talking about the probability of an event in a two-dimensional space of variates. That is where I disagree. I state that instead:
We are trying to find a fixed formula to compute a value Qp from a set of sample values, such that the one-dimensional variate x will give the probability of the event Pr[x ≤ Qp] = p.
In other words, my objection with the references you cite is not that they fail to present the math correctly for the two-dimensional case but that they should be considering the one-dimensional case instead. With the one-dimensional variate, it is almost surely the case that we will fail to get the desired probability (regardless of our formula for Qp). Instead, we must be satisfied with being close; and each user is free to define what "close" means depending upon context and purpose. Leegrc (talk) 11:55, 20 October 2015 (UTC)

But then, to be useful at all, each of the formulas in the Table should be connected to a specific definition for "close". This would mean that when discussing quantiles, one should always outline which "closeness" criterion they are based on. This would not be in harmony with the specific mathematical definition of quantiles (which requires the use of R-6). Moreover, for most methods in the Table, no valid closeness criterion has been presented in the literature. Based on Monte Carlo simulations, some of them have been claimed to give unbiased estimates for the distribution parameters for a particular distribution, but these "proofs" include a fundamental error discussed in detail in Makkonen and Pajari (2014).

How to get from the initial two-dimensional case of order ranked observations to the one-dimensional case of the final estimate of the CDF is explained graphically in Fig. 2 of reference Makkonen et al. (2012) Closure to Problems in the extreme value analysis. Structural Safety 40, 65-67.RATLAM (talk) 16:50, 23 October 2015 (UTC)

There is a description provided with each formula in the Notes column of the table. There is talk of "medians" and "modes" that give a start (at least) towards explaining the motivation / closeness criteria. Although I do not know the details by heart, surely each of those formulae was originally based upon some rationale given in some journal article. If you find the table notes lacking, please update them to make them clearer.
We do not agree about "This would not be in harmony with the specific mathematical definition of quantiles (which requires the use of R-6)." I continue to believe that the specific mathematical definition of quantiles from a sample is the straightforward one-dimensional variate formulation now highlighted just above. An argument that starts with the wrong, two-dimensional formulation in order to show a "fundamental error" in the correct, one-dimensional formulation is not of any use to me in the present discussion. Leegrc (talk) 18:28, 23 October 2015 (UTC)

The formulas in the Table date from the time when it was thought that P(x ≤ xh) needs to be estimated in some way, i.e. when it was not understood that h/(N+1) gives the exact non-exceedance probability associated with each order ranked observation xh. Since all of them, except R-6, result in unnecessarily poor estimates, the Table should be removed from the page rather than the notes in it updated.

The two-dimensional formulation of considering Ph(x ≤ xh) is not "wrong". On the contrary, order-ranking is the foundation of the analysis, because it is the only way to use random observations for estimating the underlying CDF and the quantiles. The starting point is to plot the points (Ph, xh). Using them, one can then make a fit and proceed to estimating the actual CDF. RATLAM (talk) 18:47, 24 October 2015 (UTC)

### Week of October 25, 2015

I fear that we are at a standstill. We agree about the repeated assertions for the two-dimensional variate formulation, including the exact nonexceedance formula for it, and we agree that some of these results do not carry over to the one-dimensional variate formulation, but I do not agree that this implies that the two-dimensional variate formulation is more appropriate than the one-dimensional variate formulation. We agree that order ranking is important to many analyses including the present situation, but I do not agree that this implies that the two-dimensional variate formulation is more appropriate than the one-dimensional variate formulation. We agree that the plot (Ph, xh) is relevant, but I do not agree that this implies that the two-dimensional variate formulation is more appropriate than the one-dimensional variate formulation.
To the contrary, the common usage of the formulae other than R-6 strongly suggests that the common interpretation is that the one-dimensional variate formulation is more appropriate than the two-dimensional variate formulation when it comes to estimating quantiles from a sample. Leegrc (talk) 13:15, 26 October 2015 (UTC)

All the formulas in the Table are based on treating xh as a variable, i.e. a two-dimensional interpretation. In regard to the interplay between ordinary statistics and order statistics, there is no issue of one being more appropriate than the other. One starts the analysis by order-ranking the observations, associates the rank probabilities to them, and next uses some fitting method to transfer to the one-dimensional interpretation, i.e., to making the final estimate of the CDF and the quantiles. Obviously, it makes sense to begin this procedure by using the correct rank probabilities (R-6) rather than something else.

But, I think that I am repeating the arguments already made, and I would like to close this long general discussion on my part here. It would be good to discuss the specific improvements to be made on the Quantile page. RATLAM (talk) 09:03, 28 October 2015 (UTC)

Indeed we are repeating arguments. One last try in terms of symbols; maybe that will help. For samples of size N from an unknown probability distribution described by a continuous probability density function, the goal is a formula Qp: ℝ^N → ℝ that for a desired probability p takes a set S of N sampled values and produces Qp(S), an estimate of the desired quantile. Ideally our chosen formula Qp(S) works perfectly regardless of the sampled values S,
Criterion 1: ∀S, Prx[x ≤ Qp(S)] = p.
If we are willing to accept that the criterion is met almost surely rather than always then we are hoping for:
Criterion 2: PrS[Prx[x ≤ Qp(S)] = p] = 1.
Unfortunately this is not the case; we fail almost surely regardless of our chosen formula Qp(S):
Unfortunate Fact 1: PrS[Prx[x ≤ Qp(S)] = p] = 0.
Thus we will have to settle for something different. The only possibility that you consider correct is R-6, which satisfies the criterion:
Criterion 3: Prx, S[x ≤ Qp(S)] = p
… when p=h/(N+1) for some integer h ∈ {1, …, N}. Please do not misunderstand me; I agree that Criterion 3 is a decent criterion and R-6 is a decent choice. However, Criterion 3 is not the only criterion that approximates the desired Criterion 2, and thus R-6 is not the only decent choice. To be specific, an alternative to R-6 that also decently approximates the unachievable Criterion 2 is
Criterion 4: PrS[Prx[x ≤ Qp(S)] ≤ p] = 1/2.
In words, we know that for a given sample S, the formula Qp(S) will almost surely not produce a nonexceedance value, Prx[x ≤ Qp(S)], equal to p but the median of the produced nonexceedance values is p. Depending on context and purpose, Criterion 4 may be more useful than Criterion 3 at approximating the unachievable Criterion 2.
Key question: If you believe that Criterion 2 is not the ideal we are trying to approximate, would you explain your reasoning? Leegrc (talk) 15:15, 28 October 2015 (UTC)
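
The criteria above can be probed numerically for one concrete case. The sketch below assumes a Uniform(0, 1) population, for which the nonexceedance value Prx[x ≤ Qp(S)] is just Qp(S) clipped to [0, 1], and uses R-6 at an integer rank, i.e. Qp(S) = xh with p = h/(N+1). The names `nonexceedances` and `r6_at_rank`, the choice N = 3, h = 3, and the seed are all illustrative assumptions.

```python
import random

def nonexceedances(estimator, n, trials, rng):
    """Collect Pr_x[x <= Q_p(S)] over many samples S from Uniform(0, 1)."""
    values = []
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        q = estimator(sample)
        values.append(min(1.0, max(0.0, q)))  # Uniform(0, 1) CDF at q
    return values

def r6_at_rank(h):
    """R-6 at p = h/(N+1): the estimate is the h-th smallest value itself."""
    return lambda sample: sample[h - 1]

rng = random.Random(1)  # fixed seed so the run is repeatable
n, h = 3, 3
p = h / (n + 1)
fr = nonexceedances(r6_at_rank(h), n, 100_000, rng)

mean_fr = sum(fr) / len(fr)                      # Criterion 3 looks at this mean
share_below = sum(f <= p for f in fr) / len(fr)  # Criterion 4 asks for 1/2 here
share_exact = sum(f == p for f in fr) / len(fr)  # Criterion 2 asks for 1 here

print(mean_fr, share_below, share_exact)
```

For this case the mean should come out near p, while the share of samples with nonexceedance at most p should come out near (3/4)³ = 27/64 rather than 1/2, so the two criteria pick out different estimators.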

### Week of November 1, 2015

It is self-evident that an estimate of a continuous variable will almost surely not give the exact result. Therefore, Criterion 2 is of no help. To compare the methods, one needs a criterion that provides a measure of the goodness of the estimate. Here, that criterion is:

Criterion 5: How well does the method predict Pr[x ≤ Qp(S)] = p when p=h/(N+1) for some integer h ∈ {1, …, N}?

Let Frp(S) be the estimate of the probability based on sample S, and let us call it the sample frequency. The frequency interpretation of probability tells us that the mean of Frp(S) over all possible samples S is Pr. Thus, criterion 5 becomes

Criterion 6: How well does the method predict E[Frp(S)] = p when p=h/(N+1) for some integer h ∈ {1, …, N} ?

Regarding the methods in the Table, R-6 provides the correct probability Pr = E[Frp(S)], whereas the other methods do not. RATLAM (talk) 19:16, 4 November 2015 (UTC)

Some questions:
1. I am thinking that the Criterion 5 text, "How well does the method predict Pr[x ≤ Qp(S)] = p when p=h/(N+1)," is equivalent to the text "How close is Prx[x ≤ Qp(S)] to p when p=h/(N+1)"; is that right? If not, please elaborate.
2. I think that you are then defining Frp(S) equal to Prx[x ≤ Qp(S)]; is that right? I worry that I am misunderstanding this one; if I am then would you restate it in symbols using Prx, PrS, Ex, ES, etc?
3. I think that you are then observing that ES[Frp(S)] = p for R-6 (but not the others) when p=h/(N+1); is that right?
Thank you. 𝕃eegrc (talk) 21:24, 4 November 2015 (UTC)

1. Yes
2. Yes, and there is an important reason for using the symbol Frp(S) instead of Prx. This is that, when Prx[x ≤ Qp(S)] = p is estimated based on a sample S, then Prx is not a probability. It is a sample estimate of probability. The mean of sample estimates is the probability.
3. Yes.

RATLAM (talk) 19:03, 5 November 2015 (UTC)

Thank you. May I get some clarification on the answer to Question 2? I can see that, for any fixed value of S, the right-hand side of Frp(S) = Prx[x ≤ Qp(S)] is an exact probability of nonexceedance over possible values of x, which is applicable to that fixed value of S. I think you are saying that it can also be considered an estimate of an exact probability, but I am not positive that I know which exact probability it could be considered an estimate of. Is the exact probability Prx, S[x ≤ Qp(S)] the one being estimated? If the expression Prx, S[x ≤ Qp(S)] is not the exact probability being estimated, or if there is a more intuitive representation of this exact probability, would you provide it, in symbols if feasible? 𝕃eegrc (talk) 13:13, 6 November 2015 (UTC)

### Week of November 8, 2015

Frp(S) is an estimate, based on sample S, of the probability Prx[x ≤ G(p)] at points p=h/(N+1). This probability equals Prx[x ≤ G(h/(N+1))] = h/(N+1) = ES[Frp(S)]. Here G(p) is the underlying quantile function that Qp(S) tries to estimate. RATLAM (talk) 16:39, 9 November 2015 (UTC)

If G(p) is the underlying quantile function then Prx[x ≤ G(p)] evaluates to p by definition, at least when the probability density function is continuous, yes? I think you are saying that Frp(S) is an estimate of p.
I think you are transforming the problem of trying to estimate G(p) with Qp(S) to the problem of trying to estimate Prx[x ≤ G(p)] = p with Frp(S) = Prx[x ≤ Qp(S)]. Although G(p) is not a probability, you are then observing that Prx[x ≤ G(p)] = p is a probability. You are then stating that since Frp(S) is an estimate of a probability, only functions Qp(S) that give an expectation ES[Frp(S)] = ES[Prx[x ≤ Qp(S)]] equal to p are reasonable. Despite that R-6 does not satisfy this criterion for the transformed problem for almost any value of p, you are saying that R-6 is best because it does satisfy the criterion ES[Prx[x ≤ Qp(S)]] = p when the probability distribution has a continuous density function and for the specific values of p = h/(N+1), which you are saying is good enough.
Can you provide a citation indicating that it is never okay to use, e.g., a median instead of a mean when trying to estimate a probability? That is, who ruled out trying to achieve medianS[Frp(S)] = p? 𝕃eegrc (talk) 19:19, 9 November 2015 (UTC)

One could say that it was Andrey Kolmogorov, whose probability axioms define probability in the probability theory. The third axiom means that probability is additive. Therefore, its best estimate is the mean. This is also evident in the frequentist probability that is defined as: "The relative frequency of occurrence of an event, observed in a number of repetitions of the experiment, is a measure of the probability of that event." The relative frequency here is the sum of sample frequencies scaled by the number of samples, i.e. the mean of sample frequencies (not the median). RATLAM (talk) 15:41, 11 November 2015 (UTC)

When we are trying to estimate quantiles from a sample, we already know the value of p; we are not trying to estimate it. Rather, the question is how can we evaluate whether a value Qp(S) is good. A reasonable way to go about that is to ask whether Prx[x ≤ Qp(S)] is near p. Can you clarify why you think the goal of getting near p can be equated with trying to estimate p? For example, a randomized Qp(S) that returns +∞ with probability p and returns −∞ with probability 1−p would lead to an unbiased estimator of p via its nonexceedance values, even though the two realized Prx[x ≤ Qp(S)] values, 1 and 0, are as far from p as you can get. 𝕃eegrc (talk) 16:09, 12 November 2015 (UTC)
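
The randomized estimator in this counterexample is easy to write down. The sketch below assumes a Uniform(0, 1) population purely so that a CDF is available to evaluate; the function names, p = 0.75, and the seed are illustrative.

```python
import math
import random

def randomized_q(p, rng):
    """Ignore the sample entirely: return +inf with probability p, else -inf."""
    return math.inf if rng.random() < p else -math.inf

def uniform01_cdf(x):
    """CDF of Uniform(0, 1), an assumed population for this illustration."""
    return min(1.0, max(0.0, x))

rng = random.Random(2)  # fixed seed so the run is repeatable
p, trials = 0.75, 100_000
fr = [uniform01_cdf(randomized_q(p, rng)) for _ in range(trials)]

mean_fr = sum(fr) / trials                           # averages out near p
mean_abs_gap = sum(abs(f - p) for f in fr) / trials  # yet each value is far from p

print(mean_fr, mean_abs_gap)
```

Every realized nonexceedance value is 0 or 1, so the average can match p while no single value is anywhere near it; the mean absolute gap works out to about 2p(1 − p).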

The problem is described by Criterion 5. Criterion 5 equals Criterion 6. Thus, the problem is finding a method so that ES[Frp(S)] approaches p when S gets large. This method is R-6. RATLAM (talk) 12:08, 13 November 2015 (UTC)

When I wrote
I am thinking that the Criterion 5 text, "How well does the method predict Pr[x ≤ Qp(S)] = p when p=h/(N+1)," is equivalent to the text "How close is Prx[x ≤ Qp(S)] to p when p=h/(N+1)"; is that right?
you replied "Yes." I seek clarification as to why you are stating that the goal of getting the Pr[x ≤ Qp(S)] values to be close to p must be equated with the goal of using those values to estimate p. 𝕃eegrc (talk) 12:50, 13 November 2015 (UTC)

### Week of November 15, 2015

We simply wish to find a method that best meets Criteria 5 and 6. We may use the fact that with a continuous variate x, x* = Q(p*) is close to x** = Q(p**) if and only if p* is close to p**. RATLAM (talk) 20:11, 15 November 2015 (UTC)

Yes, if a function is continuous we can look at either its domain or its image for defining "close to". However, why aren't we free to instead use our favorite cost function in order to achieve "close to"? Why are we required to pretend that we are trying to estimate p, which is a value we already know? 𝕃eegrc (talk) 13:43, 16 November 2015 (UTC)

Stating that ES[Frp(S)] = ES{Prx[x ≤ Qp(S)]} = Prx[x ≤ G(p)] = p does not require pretending anything, because it is the definition of probability (see frequentist probability). This definition transforms Criterion 5 into Criterion 6. Thus, the requirement to use Criterion 6, i.e. the mean, could not be more fundamental.

One could still say that the method does not matter if "close to" could not be quantified. But it can. This is explained in connection with the figure above (Week of September 27, 2015). Of course, we should use the theoretically justified and quantitatively best method. It is R-6. RATLAM (talk) 11:23, 17 November 2015 (UTC)

Indeed, we are in agreement that ES[Frp(S)] = ES{Prx[x ≤ Qp(S)]} = Prx[x ≤ G(p)] = p is true for R-6 (for continuous probability densities at specific values of p). However your statement that this sequence of equalities forces "close to" to mean "estimate" is where you lose me. 𝕃eegrc (talk) 13:00, 17 November 2015 (UTC)

Perhaps this problem originated on 9th November when you wrote: "You are then stating that since Frp(S) is an estimate of a probability, only functions Qp(S) that give an expectation ES[Frp(S)] = ES[Prx[x ≤ Qp(S)]] equal to p are reasonable."

This was not the crux of my argument. Frp(S) represents a sample, and the argument is: Only functions Qp(S) that give an expectation ES[Frp(S)] = ES[Prx[x ≤ Qp(S)]] equal to p are reasonable, because probability is defined so that ES[Frp(S)] = Prx[x ≤ G(p)] = p. RATLAM (talk) 16:46, 17 November 2015 (UTC)

We are trying to find a function Qp(S) that will give a result close to G(p). We can exploit the continuity of the nonexceedance calculation and try to find a function Qp(S) that will give Frp(S) = Prx[x ≤ Qp(S)] close to p. The definition of probability has no opinion about whether "Frp(S) is close to p" must mean "ES[Frp(S)] = p". 𝕃eegrc (talk) 13:27, 18 November 2015 (UTC)

The question related to Criterion 6 is not: How close Frp(S) is to p?. It is: How close ES[Frp(S)] is to p? RATLAM (talk) 18:03, 18 November 2015 (UTC)

When I am trying to estimate a quantile from a sample, I am trying to come up with a value Qp(S) that is close to G(p). Do you agree? If I want to, I can exploit the continuity of the quantile function at p to reduce the problem to that of trying to come up with a value Qp(S) for which Frp(S) = Pr[x ≤ Qp(S)] is close to p. Do you agree? There is no expectation over S mentioned yet. Do you agree? What I don't get is why "the definition of probability" has any say as to what definition of "close to" is best for my particular application. Sometimes I may care most about root-mean-square deviation; sometimes I may care most about absolute deviation; etc. 𝕃eegrc (talk) 18:41, 18 November 2015 (UTC)

I agree on the first two statements, but not the third one. Expectation over S has been mentioned, because p has been mentioned and p = ES[Frp(S)].

In statistics, one sample value does not give the truth. The truth comes out in a repeated test. This is where the "definition of the probability" enters the scene.

Your comment on what you may care most applies to measures of the variable of interest, i.e. on the x-axis. On the p-axis one is interested in the probability only (i.e. the mean of Frp(S), not some other measure of Frp(S)), because it is the probability that needs to be associated with x in order to estimate the CDF or its inverse.

We need a statistically waterproof verification for the method we are using, and the choice of the method must be based on the definition of probability and its consequences. Otherwise, our quantile estimate will be biased, and no one knows how much. Since Prx[x ≤ xh] = h/(N+1) = ES[Frp(S)], a unique probability h/(N+1) is associated with xh also when we have only one sample. This is done in method R-6. When doing so, and considering any given distribution, it comes out that the number of hits of random x-values in bins (-∞,G(p1)], …, (G(ph), G(ph+1)], …, (G(pN),+∞) approaches a uniform distribution, when the number of samples S increases. This is a fundamental property of the original distribution, too. No other criterion for "close to" gives this result. RATLAM (talk) 14:22, 20 November 2015 (UTC)

Where you wrote "Prx[x ≤ xh]", did you mean "Prx,S[x ≤ xh]"? (Or is it more in line with your "definition of probability" argument to say that you meant "ES[Prx[x ≤ xh]]"?)
What you intend to convey with "(-∞,G(p1)], …, (G(ph), G(ph+1)], …, (G(pN),+∞)" is not at all clear to me because there is no mention of Qp(S) and thus seemingly no dependence on whether we have chosen R-6. Should I interpret, for instance, "G(ph)" in this expression as "G(ES[Prx[x ≤ Qh/(N+1)(S)]])"?
We are trying to get Frp(S) = Pr[x ≤ Qp(S)] to be close to p rather than using the former (in isolation or together with repetitions of the experiment) to estimate/predict the latter. Thus, I still do not see why I should have to follow a recipe that is designed to estimate/predict the latter.
To clarify, perhaps it would be of value to back up a step. In my formulation of "close to" in terms of loss/distance functions, I can focus on how close Frp(S) = Pr[x ≤ Qp(S)] is to p because the quantile function is assumed to be continuous at p. That is, the continuity of the quantile function tells me that if I have a sequence of values that tends to p then applying the quantile function to this sequence will give a sequence of values tending to G(p), which is important because getting close to G(p) is what I am after. Is continuity of the quantile function at p sufficient in your formulation of the problem? That is, what formulation of "Qp(S) close to G(p)" will necessarily follow from getting "Frp(S) = Pr[x ≤ Qp(S)] close to p" in the sense you mean, ES[Frp(S)] = p? Or, saying it another way, how does your definition of getting it right on the p-axis translate into the language of getting it right on the x-axis? For example, if you were arguing for medians rather than means you could observe that medianS[Frp(S)] = p does imply medianS[Qp(S)] = G(p) when the quantile function is continuous at p because the quantile function is monotonically increasing. (Alas, you prefer means over medians.) In symbols, what can we say in x-axis language about how Qp(S) for R-6 is close to G(p)? 𝕃eegrc (talk) 18:29, 20 November 2015 (UTC)

### Week of November 22, 2015

Yes, using your formalism, I meant Prx,S[x ≤ xh].

Your comment “I still do not see why I should have to follow a recipe that is designed to estimate/predict the latter” appears to include a misunderstanding. Taking the probability as the mean of sample frequencies has nothing to do with estimating or predicting probability. It is the definition of probability.

Let us try again with the bin-criterion demonstrated in the figure. Perhaps, this will clarify the issues related to some of your questions. Considering any given distribution, it comes out that the number of hits of random x-values in bins (-∞,G(p1)], …, (G(ph), G(ph+1)], …, (G(pN),+∞), where ph = h/(N+1), approaches uniform distribution, when the number of samples S increases. This is equivalent to the fact that the number of hits of p = F(x)-values in bins [0,p1],(ph,ph+1] …, (pN,1] also approaches uniform distribution. For a given sample S and given Qp(S), take m random x-values, and record, how many times the values of the inverse of Qp(S) hit in each bin [0,p1],(ph,ph+1] …, (pN,1]. Repeat this with a new sample S and so on. Then, the total number of hits in each bin [0,p1],(ph,ph+1] …, (pN,1] approaches uniform distribution. This is the case only if ph, the probability associated to xh in Qp(S), is h/(N+1). RATLAM (talk) 22:11, 22 November 2015 (UTC)

Do I have this description of your argument right?: We know that the value we compute, Qp(S), has a non-exceedance value Prx[x ≤ Qp(S)], which we are hoping is close to p. As we can see, the non-exceedance value is the probability of an event. In this case, the probability is for the random variable x with a given/fixed value of S. However, we can also look at Qp(S) as an event over random variables x and S and compute a (second) probability, Prx,S[x ≤ Qp(S)]. This probability is mathematically equivalent to an expectation, ES[Prx[x ≤ Qp(S)]]. R-6 is best because it gives that this second probability is exactly equal to p (at least for continuous probability densities, some particular values of p, and when S is assumed to be a fixed number of independent, identically distributed draws), whereas the other R-* formulae do not give p in these circumstances.
If I have all that right then maybe all you need to do is convince me that getting Prx,S[x ≤ Qp(S)] to be close to p is more important than getting Prx[x ≤ Qp(S)] to be close to p. 𝕃eegrc (talk) 14:09, 23 November 2015 (UTC)

You have that right. For a given p, xp,S = Qp(S) is a variate which depends on N variates x1,..., xN. However, for a given p and S, F(xp,S) = Prx[x ≤ Qp(S)] is a number, the value of which cannot be calculated because F is unknown. Therefore, it is impossible to tell whether this number is close to p or not. Even more important is the fact that, in statistics, the goodness of an estimate can never be based on one sample. In other words, for given p and S, it is irrelevant to ask whether Prx[x ≤ Qp(S)] is close to p or not. RATLAM (talk) 13:30, 24 November 2015 (UTC)

I don't quite get the xp,S and F notation, but I will try with notation that I am more comfortable with and maybe we can proceed from there. I agree that we cannot compute Frp(S) = Prx[x ≤ Qp(S)] for a given set S. However, we can select a universe (S ~ exactly N independent, identically distributed draws) and compute the likes of ES[Frp(S)], medianS[Frp(S)], and ES[(Frp(S) − p)²]. The first of these, ES[Frp(S)], can also be written as the probability Prx,S[x ≤ Qp(S)]. Your argument is that normally we would be able to use any of these three (or others) in our definition of "close to" but, because Frp(S) is a probability, we do not have this freedom. Instead we must choose only a definition of "close to" that involves a formula that can itself be written as a probability, leaving us with only ES[Frp(S)]. Is that right? 𝕃eegrc (talk) 21:01, 24 November 2015 (UTC)

Yes. RATLAM (talk) 11:34, 25 November 2015 (UTC)

When I am trying to get a probability Frp(S) = Prx[x ≤ Qp(S)] close to a probability p, of the many criteria for "close to" I am forced to use ES[Frp(S)] = p because the left-hand side of ES[Frp(S)] = p can be written as a probability Prx,S[x ≤ Qp(S)]? I am finding that hard to accept. 𝕃eegrc (talk) 17:13, 25 November 2015 (UTC)

You don't want to compare apples with oranges. What you need is a statistical measure of Frp(S) that can be compared with p. p is a probability. Therefore, your measure must be a probability, i.e., it must be ES[Frp(S)]. RATLAM (talk) 07:57, 26 November 2015 (UTC)

I don't need a statistical measure of Frp(S) to equal p. Instead, I need Frp(S) to be close to p. In particular, I reserve the right to minimize a cost function rather than to achieve an equality. However, even if we continue with your argument that it must be probabilities and that the probabilities must be present in an equality: the goal of medianS[Frp(S)] = p can be written as the condition that PrS[Frp(S) ≤ p] = PrS[Frp(S) > p], at least under your assumption of a continuous probability density. Like the condition ES[Frp(S)] = p, the condition PrS[Frp(S) ≤ p] = PrS[Frp(S) > p] has a probability equaling a probability; apples to apples. These probabilities for the median condition have the added advantage that they are written with an explicit Frp(S), unlike with the ES[Frp(S)] = p condition, where the left-hand side written as a probability, Prx,S[x ≤ Qp(S)], gains x as a subscript and loses the explicit mention of Frp(S). 𝕃eegrc (talk) 14:04, 27 November 2015 (UTC)

I was not using the word "equal". "Comparing with p", of course, means quantifying "close to p". For doing that, one needs an estimator of p. You are arguing that medianS[Frp(S)] is a better estimator of p than ES[Frp(S)].

Consider an analogy: A coin is flipped ten times, and this sampling is repeated S times. We then get sample frequencies for the coin landing heads-up, Fr1, Fr2,..FrS. We wish to know how close to being fair the coin is. To find that, we need to compare an estimator of Frp(S) with the probability p = 0.5 of an ideally fair coin. Since p is a probability, and probability is defined as ES[Frp(S)], when S gets large, our unbiased estimator is the mean, and our criterion is "how close ES[Frp(S)] is to p". When we test many coins, the one for which ES[Frp(S)] is closest to p is the fairest one.

But, in analogy with this, you insist that the criterion should not be the one used above, but instead "how close medianS[Frp(S)] is to p". Note that the distribution related to an unfair coin may not be symmetric.

Why? Do you actually think that such a criterion would provide us with a better selection of the fairest coin?

RATLAM (talk) 14:45, 28 November 2015 (UTC)

### Week of November 29, 2015

Who says that to get "close to" p one needs a formal "estimator"?
I think R-6 is one of several fine approaches; I do not consider it inferior.
I don't see any quantiles in your analogy; what distribution are we sampling from so as to estimate a quantile? Regardless, off the top of my head, my approach to the "is the coin fair?" example might be Bayesian. I would start by asking the engineers who are making the coins whether they have any prior distribution for the parameter p. If not, maybe I would start with a beta distribution, which is the conjugate prior, and use either beta(1, 1), the uniform prior, or beta(0.5, 0.5), the Jeffreys prior. From there I could easily compute the posterior distribution for p using the total number of observed heads and the total number of observed tails. From there, I would ask the engineers what matters most to them. Is there a hard boundary, an ε such that only p ∈ [1/2 − ε, 1/2 + ε] will do? (Or, is there a requirement to minimize a root-mean-square deviation? Etc.) If there is a hard boundary based upon an ε, I would integrate the posterior from 1/2 − ε to 1/2 + ε. That is the number I would use to quantify "close to" for each coin as I compare multiple coins. I might then suggest that further testing be done on those coins that look best. 𝕃eegrc (talk) 17:35, 30 November 2015 (UTC)

How do you define "close to"?

How do you quantify "close to" without using an estimator?

The other approaches are not "fine", because they give an erroneous probability P(x ≤ xh), as has been discussed (e.g. Week of October 18, 2015).

The analogy was not about quantiles, but about how to evaluate "close to". I would like to understand what you imply, and hence repeat my question: Do you think that the median of sample frequencies results in a better quantitative estimate of the probability than the mean? RATLAM (talk) 08:48, 1 December 2015 (UTC)

How to define and quantify "close to" depends upon context and purpose. Generally, I would use the expectation of a cost function. Reasonable cost functions include one that equally penalizes a value outside of a strict tolerance [p − ε, p + ε], or one that measures absolute deviation, or one that measures root-mean-squared deviation.
I don't know what you mean by "P(x ≤ xh)" because you left off the subscripts. All approaches, including R-6, give an erroneous probability Prx[x ≤ x(h)] almost surely. (That is, PrS[Prx[x ≤ x(h)] = h/(N + 1)] = 0.) This is why we are haggling over the definition of "close to". We agree that only R-6 gives Prx,S[x ≤ x(h)] = h/(N+1), but we disagree as to whether this is the only reasonable definition of "close to".
In the example you give for "is the coin fair?" you allow exactly 10 samples for each experiment, so that the achieved values can be only multiples of 1/10, yes? In that case, I would find the median to be a poor way of estimating the probability of heads. Fortunately, in the quantiles example, Frp(S) can be anything in (0, 1) so the median does not suffer from this granularity. In fact, if one's context and purpose suggest that minimizing the expected absolute deviation of Frp(S) from p is desirable then the median condition medianS[Frp(S)] = p will do quite nicely. 𝕃eegrc (talk) 15:31, 1 December 2015 (UTC)
If I understand you correctly, you are asserting:
1. Instead of trying to quantify how close Qp(S) is to G(p) we must focus on quantifying how close Frp(S) = Prx[x ≤ Qp(S)] is to p.
2. "Close to" means that some statistic of Frp(S) must equal p.
3. Because p is a probability, the above-mentioned statistic for Frp(S) must be a probability.
4. The above-mentioned statistic for Frp(S) need not be of the form PrS[…], we are free to consider probabilities of the form Prx,S[…].
My thoughts on these are:
1. I do not agree with "must", but do agree that this is a reasonable reduction of the original problem.
2. I completely disagree. There is no reason we cannot use cost functions.
3. I have some sympathy for this one, except …
4. That the probability need not be PrS removes quite a bit of the punch from this otherwise plausible requirement. Sure, you came up with a probability, but only by reviving a random variable that we did away with in Step 1.
𝕃eegrc (talk) 15:55, 1 December 2015 (UTC)
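
The two candidate criteria in this exchange, ES[Frp(S)] = p and medianS[Frp(S)] = p, can be compared numerically. The following is a minimal sketch, not part of either editor's argument. It assumes a uniform distribution on (0, 1), so that F(x) = x and Frp(S) for R-6 at p = h/(N+1) is simply the h-th smallest observation. The function name `order_statistic_draws` and the choices N = 5, h = 2 are illustrative assumptions.

```python
import random
import statistics


def order_statistic_draws(n, h, repeats, seed=0):
    """Return `repeats` draws of the h-th smallest of n uniform(0, 1) values."""
    rng = random.Random(seed)
    return [sorted(rng.random() for _ in range(n))[h - 1] for _ in range(repeats)]


N, H = 5, 2  # illustrative sample size and rank
draws = order_statistic_draws(N, H, repeats=40_000)

mean_fr = statistics.fmean(draws)      # Monte Carlo estimate of E_S[F(x_(h))]
median_fr = statistics.median(draws)   # Monte Carlo estimate of median_S[F(x_(h))]

print(f"mean   of F(x_(h)): {mean_fr:.4f}   h/(N+1)         = {H / (N + 1):.4f}")
print(f"median of F(x_(h)): {median_fr:.4f}   (h-1/3)/(N+1/3) = {(H - 1/3) / (N + 1/3):.4f}")
```

Under these assumptions the mean should land near h/(N+1) and the median near (h − 1/3)/(N + 1/3). This illustrates that the two criteria point to different plotting positions, but it does not say which criterion one ought to adopt.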

Thank you for clarifying the nature of our disagreement. I think that it is becoming clear now, so that we should finally be able to sort this out.

The main disagreement is reflected in your response "How to define and quantify "close to" depends upon context and purpose". I agree with that. However, what you are missing is that, in our task, the context and purpose are known! The purpose is to estimate the quantiles, which is a well defined statistics problem, for which there exists a specific statistical measure.

About the coin analogy: You imply that the only reason for the median not to work better than the mean is granularity? What if we have 1000 tosses per experiment?

I will comment on your understanding of my assertions below.

1. Yes. It is the same problem, so that we can use "can" instead of "must" if you wish. However, I am not comfortable with your wording here, as it implies that this is our goal. Comparing random samples individually with p does not lead to anything. We need a statistical measure. Thus, only the next issue (2.) is relevant.

2. Not equal. The statistic should be a consistent estimator of p. The mean of sample frequencies is the consistent estimator of a probability.

3. Yes. But this can be formulated more generally: It should be a consistent estimator of p.

4. See, 3. above.

As to your thoughts about issue 2, I agree that one can, in principle, use the minimum of a cost function as the statistical measure here. However, for that measure to be reasonable, it would have to be a consistent estimator of p. Since the consistent estimator of p is the mean of Frp(S), the function used must be the squared-error loss function (see, loss function). Thus, it cannot be freely chosen, and using it reduces the estimation problem to applying the mean of Frp(S) in the first place.

RATLAM (talk) 21:42, 2 December 2015 (UTC)

No, 1/1000 is still granular.
Perhaps you will forgive me for focusing on item #2 because that is where I perceive that we are furthest apart. In order to prove that Frp(S) is "close to" the known value p, who says that we must use a consistent estimator? The cost function, say, ES[| Frp(S) − p |] is not comparing p to an individual sample S. This cost function directly quantifies "close to" in terms of the expected distance to the known value p. Who has overruled such a natural approach?
In case it comes up later … are you using consistent estimator in terms of the size of an individual sample N = | S | growing without bound or in terms of the number of Frp(S) values growing without bound? 𝕃eegrc (talk) 23:57, 2 December 2015 (UTC)
[Figure: Bin analysis for random samples from uniform distribution, Weibull vs. Jenkinson. 50 000 new random observations taken from a uniform distribution falling into three bins. Mean (Weibull) vs. median (Jenkinson).]

You ask who says that we must use a consistent estimator. The estimation theory says so, for example "Consistency is a relatively weak property and is considered necessary of all reasonable estimators" (http://www.cc.gatech.edu/~lebanon/notes/consistency.pdf).

We now appear to be able to reduce the core of our disagreement into one question: Is the mean or median of sample frequencies to be favoured when estimating a probability?

This I believe because you imply, in your reply to the coin analogy (in which N is constant and number of FrS values grows without bound), that you would favour the median (in the absence of the granularity issue), and argue for using a cost function that corresponds to the median estimator (Loss function: "the median is the estimator that minimizes expected loss experienced under the absolute-difference loss function").

Accordingly, I will resolve this issue by showing that the mean must be favoured.

THEORY:

Frequentist probability: "As the number of trials approaches infinity, the relative frequency will converge exactly to the true probability". The mean of sample frequencies equals the total relative frequency. Therefore, as the number of trials approaches infinity, the mean of sample frequencies converges exactly to the true probability. Thus, the mean of sample frequencies is a strongly consistent estimator of probability.

A reasonable estimator should be consistent, but the median of sample frequency does not converge to the true probability. This is easy to see, because the mean does that exactly, and the mean and the median converge to a different value (for an asymmetric distribution). Hence, the median of sample frequencies is an inconsistent estimator of probability.

In fact, in our problem, we take the mean over the whole parameter space S, and that mean ES[Frp(S)] equals probability by definition. Obviously, no estimator can do a better job. This is the reason why we should not even consider formulas other than R-6 when estimating quantiles by a sample.

NUMERICAL EXAMPLE:

In the figure enclosed, the probabilistic bin-analysis, described on 22 November, is applied to a Monte Carlo simulation for a uniform distribution between 0 and 1. Three random values of x are chosen, the first two of them are order-ranked (x1 and x2), a straight line L is drawn through points (x1,p1) and (x2,p2) where p1 and p2 are the plotting positions, and the probability estimate of the third value, i.e. L(x3), is taken from this line. This cycle is repeated 50 000 times both for Weibull formula (mean) and Jenkinson's formula (median). Frequency of hits of L(x3) in bins 1 (probability interval [0,1/3]), 2 ((1/3,2/3]) and 3 ((2/3,1]) is recorded. "Exact" refers to the original distribution. The method based on the mean criterion works as expected, whereas the method based on the median criterion is far from ideal.

RATLAM (talk) 21:56, 3 December 2015 (UTC)
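
The Monte Carlo procedure described in the numerical example above can be sketched as follows. This is a reconstruction under stated assumptions, not the code behind the figure. The exact Jenkinson formula is not given in this discussion, so the median-based positions (h − 1/3)/(N + 1/3) are used as a stand-in. Values of L(x3) below 0 or above 1 are counted in the first and last bin respectively. The function name `bin_fractions` is hypothetical.

```python
import random


def bin_fractions(p1, p2, repeats=50_000, seed=1):
    """Fraction of L(x3) values falling at or below 1/3, in (1/3, 2/3], and above 2/3."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(repeats):
        x1, x2 = sorted((rng.random(), rng.random()))
        while x1 == x2:  # guard against a (practically impossible) tie
            x1, x2 = sorted((rng.random(), rng.random()))
        x3 = rng.random()
        # Straight line through (x1, p1) and (x2, p2), evaluated at x3.
        estimate = p1 + (x3 - x1) * (p2 - p1) / (x2 - x1)
        if estimate <= 1 / 3:
            counts[0] += 1
        elif estimate <= 2 / 3:
            counts[1] += 1
        else:
            counts[2] += 1
    return [c / repeats for c in counts]


N = 2
weibull = bin_fractions(1 / (N + 1), 2 / (N + 1))                         # h/(N+1)
median_based = bin_fractions((1 - 1/3) / (N + 1/3), (2 - 1/3) / (N + 1/3))  # assumed stand-in

print("Weibull      :", [round(f, 3) for f in weibull])
print("median-based :", [round(f, 3) for f in median_based])
```

With these assumptions the Weibull positions should give roughly equal bin frequencies, while the median-based positions should give a middle bin below 1/3. Whether equal bin frequencies are the right yardstick is the point under dispute in this thread.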

If you do not mind, I would like to focus on my Item #2. Why, if I am trying to get Prx[x ≤ Qp(S)] to be close to p must I use an estimator of p (instead of a loss function)? 𝕃eegrc (talk) 01:23, 4 December 2015 (UTC)

Because p that minimizes the expected loss experienced under a loss function is an estimator of p. Our problem is to find p which is close to Prx[x ≤ Qp(S)], not vice versa.

We know the ordered x-values xh of a sample, but we do not know F(xh). Therefore, we need to associate to each xh a certain probability ph, called plotting position. This ph depends on the rank h but not on the actual value of xh. The crucial point here is that it is ph we have to find, not xh. Take e.g. the loss function S[F(xh) - ph]². The relevant question is, how to choose ph to minimize the loss function when xh gets all possible values, and the answer is ph = ES(F(xh)) = h/(N+1). The reverse question: With a given p, how to choose x to minimize the loss function, results in nothing useful.

Different criteria have historically been used to determine ph. They each result in different ph-values. This is easy to see by taking uniform distribution between 0 and 1, assuming that x1=1/3, x2=2/3 and plotting the Qp:s according to the methods R-4 - R-9 on paper. It is statistically impossible to have two or more quantiles for a given distribution and given p. Thus, only one of the corresponding criteria can be correct. As I showed in my previous reply, it is R-6.

RATLAM (talk) 19:54, 4 December 2015 (UTC)
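
For reference, the plotting positions that come out of the squared-error and absolute-error loss functions can be written down explicitly. This is a sketch in standard notation, assuming F is continuous so that F(xh) is distributed as the h-th order statistic of N uniform variates. The symbol U_h is introduced here only for illustration.

```latex
U_h = F(x_{(h)}) \sim \mathrm{Beta}(h,\; N+1-h), \qquad
\mathrm{E}_S[U_h] = \frac{h}{N+1}

\arg\min_{c}\; \mathrm{E}_S\!\left[(U_h - c)^2\right] = \mathrm{E}_S[U_h] = \frac{h}{N+1}

\arg\min_{c}\; \mathrm{E}_S\!\left[\,\lvert U_h - c\rvert\,\right]
  = \operatorname{median}_S[U_h] \approx \frac{h - 1/3}{N + 1/3}
```

The first minimizer is the R-6 position and the second is approximately the R-8 position, so the choice of loss function determines which plotting position results.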

### Week of December 6, 2015

I think I better understand your argument now that I see that you consider the loss function approach to be a form of estimation. Please tell me where you think I go astray in the following:
1. Qp(S) is an estimator function whether it is motivated by a desire to solve an equation or minimize a loss function.
2. Qp(S) estimates the p-th quantile, G(p), of an unknown distribution.
3. Qp(S) takes as its input a sample S.
4. Qp(S) is the estimate of G(p) that arises from the sample S.
5. In this case, a consistent estimator is an estimator having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to G(p).
6. Frp(S) = Prx[x ≤ Qp(S)]
7. Frp(S) is an estimator function even though we cannot compute it when the distribution Prx is unknown.
8. Frp(S) estimates Prx[x ≤ G(p)] = p
9. Frp(S) takes as its input a sample S.
10. Frp(S) is the estimate of p that arises from the sample S.
11. In this case, a consistent estimator is an estimator having the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to p.
12. Under reasonable continuity assumptions, the consistency of Frp(S) implies the consistency of Qp(S), and vice-versa.
13. All the R-* estimators are consistent, when expressed in Qp(S) (i.e., x-axis) form. In particular, they all give the same answer, G(p), in the limit as the number of data points used N = | S | increases indefinitely.
14. All the R-* estimators are consistent, when expressed in Frp(S) (i.e., p-axis) form. In particular, they all give the same answer, p, in the limit as the number of data points used N = | S | increases indefinitely.
15. None of the Qp(S) forms for the R-* estimators are unbiased because the unknown distribution is arbitrary and thus ES[Qp(S)] could be anything.
16. Of the Frp(S) forms for the R-* estimators, only R-6 is unbiased, though only for some values of p and some values of N. For R-6, we have that ES[Frp(S)] = p whenever h = (N+1)p is an integer.
17. If eliminating the bias of the Frp(S) form of the estimator is the highest priority then R-6 is the best choice.
18. If, e.g., minimizing ES[| Frp(S) − p |] is the highest priority then R-6 is not the best choice.
𝕃eegrc (talk) 02:27, 6 December 2015 (UTC)
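
Item 13 in the list above (all R-* estimators giving the same answer as N = | S | grows) can be checked with a small simulation. This is a sketch assuming a uniform distribution on (0, 1), for which G(p) = p. It uses the Hyndman and Fan parameterisation ph = (h − α)/(N + 1 − α − β) with linear interpolation between ranks. The names `METHODS`, `quantile` and `estimates` are illustrative, not taken from any library.

```python
import random

# (alpha, beta) pairs for p_h = (h - alpha) / (N + 1 - alpha - beta).
METHODS = {
    "R-4": (0.0, 1.0),
    "R-5": (0.5, 0.5),
    "R-6": (0.0, 0.0),
    "R-7": (1.0, 1.0),
    "R-8": (1 / 3, 1 / 3),
    "R-9": (3 / 8, 3 / 8),
}


def quantile(sorted_x, p, alpha, beta):
    """Interpolate linearly between the points (x_(h), p_h); clamp outside the ranks."""
    n = len(sorted_x)
    rank = p * (n + 1 - alpha - beta) + alpha  # fractional rank where p_h equals p
    if rank <= 1:
        return sorted_x[0]
    if rank >= n:
        return sorted_x[-1]
    lower = int(rank)  # floor, since rank is positive here
    frac = rank - lower
    return sorted_x[lower - 1] + frac * (sorted_x[lower] - sorted_x[lower - 1])


def estimates(n, p=0.3, seed=2):
    """Quantile estimates from one uniform(0, 1) sample of size n, one per method."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    return {name: quantile(xs, p, a, b) for name, (a, b) in METHODS.items()}


small, large = estimates(10), estimates(100_000)
for name in METHODS:
    print(f"{name}: N=10 -> {small[name]:.4f}   N=100000 -> {large[name]:.4f}")
```

For the small sample the six estimates generally differ visibly, while for the large sample they should nearly coincide and sit close to p = 0.3. The sketch says nothing about behaviour at fixed N, which is where the disagreement in this thread lies.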

I am sorry, but I have no resources to continue this discussion any further. It seems to be endless and keeps diverging into irrelevant issues. The important aspects have been outlined above, e.g. in "THEORY" and there is no reason to repeat them. There are also journal publications that include clear explanations of this topic, in particular, Makkonen and Pajari (2014). Numerical simulations attached to this discussion also speak for themselves.

None of the items of your list are both accurate and relevant to our problem. I mention a few problems. The formalism is not clear, e.g., Qp(S) cannot be both an estimator (item 1.) and an estimate (item 4). "Number of data points" in items 5 and 11 should read "number of samples". Items 13 and 14 are irrelevant to our problem, which is to estimate quantiles by a sample, i.e. our N is fixed. The estimators of item 15 are not useful for anything, so that their bias is irrelevant. 16 is true, but hides the most important issue, which is that only R-6 provides a consistent estimator. 17 and 18 are very misleading statements. Generally speaking, they are true, but there is no "If" in our problem. For our problem, we definitely need to use the mean as the estimator, because it is the only consistent estimator of probability. This means that the loss function minimized must be the squared error. We do not want to minimize anything else, because that would lead to an estimator which is not the mean, and to an incorrect quantile.

I believe that the main reason for these difficulties is that thinking of this problem in terms of the reverse of the CDF is difficult. It is much easier to consider this in terms of estimating the CDF. Below is this explanation.

Estimating CDF and quantiles by a sample
CDF is defined so that it provides the connection between a variable X and its non-exceedance probability P. The quantile function is the reverse of CDF, so that it gives the same connection. Thus, when we have a sample x1, x2, ...xN, our task is:
Estimate the connection between X and P using the sample.
The only way to start with this task is to use order-statistics, which provides us the connection between the rank h (1,2,...N) and the probability ph associated with that rank. This connection, ph = h/(N+1), is valid irrespectively of the values of X and the underlying distribution. Thus, we know the probabilities associated with our observations falling in each rank. Moreover, when estimating a quantile, we have a given probability p in mind. Therefore, we do not need to estimate the probability. We only need to estimate the variable value associated with a given probability.
Since all the information we have are our observations x1, x2, ... xN, we must use them. Of course, X associated with p1 is the observation x1, and so forth. We can then plot these points (xh,h/(N+1)) and estimate the CDF in between these points e.g. linearly. This is method R-6. All other methods give an incorrect probability ph as the starting point of the analysis.

There it is in its simplicity. I doubt if anyone has been following our extensive discussion, but if so, I would advise them to neglect all of it, and consider this explanation only.

I would like to see the Quantile page improved, and accordingly, also the Q-Q plot page, where the plotting positions are discussed. I would be pleased to comment on concrete suggestions towards that goal. But this discussion, I need to close on my part now. Thank you.

RATLAM (talk) 20:50, 9 December 2015 (UTC)

Thank you for taking the time to respond yet again. I agree that we have reached the time for closure and I regret that agreement continues to be elusive. It is still clear to me that the theory you are pushing is both WP:FRINGE and false. I had hoped to persuade you of the latter, but have been unsuccessful. We can call it quits here, but in case you find more energy to discuss this, the following are my responses to your most recent points.
1. I agree with your observation about an estimator not being an estimate; I should have used distinct notation there. Excepting the specification of the limits, G(p) and p, Items 5 and 11 come directly from the Wikipedia page on consistent estimator and my subsequent results on consistent estimators lead directly from them. That you think "consistency" is not about N = | S | increasing without bound is symptomatic of the flaws in your reasoning. I repeat, all of the R-* estimators are consistent estimators. R-6 differentiates itself only in that it is unbiased (though only on the p-axis and only when h = p(N+1) is an integer).
2. With regard to your summary explanation, we are in complete agreement that ES[Prx[x ≤ xh]] = h/(| S |+1). However, the expectation over S is only relevant for bounded values of N = | S | when one is talking about whether an estimator is unbiased. I like unbiased estimators for quantiles in many contexts, but when minimizing expected absolute deviation is more important then minimizing expected absolute deviation is more important.
I agree with your statement that pretty much all of the text of this discussion can be deleted; it won't help any third party to read it. I will delete everything prior to 9 December 2015, though feel free to "undo" if you want. Although our discussions did not bear fruit, I thank you for your diligent contributions. Fare thee well. 𝕃eegrc (talk) 21:42, 9 December 2015 (UTC)

### Week of December 15, 2015

This long-lasting discussion between two parties is related to a dispute about plotting positions that has lasted more than a century. So far, it seems to be closed without agreement. I wonder whether the following contribution could help in finding some kind of convergence.

In methods R-4 … R-9, a probability estimate of the form ph= (h+a)/(N+b) is associated to each observed, order-ranked value xh. a and b are constants specific to each method. The different values of a and b form the essential difference between the methods. ph is an estimate of F(xh), the true probability of x not exceeding xh. We can concentrate on the properties of (xh,ph) pairs where h is 1,...,N, because all other (x,p) pairs predicted by the methods are obtained either by interpolation or extrapolation.

By a loose definition, a consistent estimator of a parameter is a sequence which converges to the correct value of the parameter when the number of samples “grows to infinity”. Leegrc states that all methods R-4 … R-9 are consistent estimators of Qp(S). I disagree. It is true and nice to know that when N increases indefinitely, all such estimates Qp(S) converge in probability to G(p), but this does not mean that the methods are consistent estimators because N is the size of one sample, not the number of samples.

Instead of Qp(S) I would like to use notation Qi(p,Sj) where i = 4,…,9 refers to the index of methods R-4 … R-9 and j to the index of the sample. In order statistics, a sample consists of a set of order-ranked values x1,...,xN. Increasing the number of samples (=M) means increasing the number of such sets, each of size N, not increasing N. Given a probability p, the sequence provided by any of the methods above is Qi(p,SM) which does not converge with increasing M. It follows that none of the methods is a consistent estimator of G(p) as such.

We can construct a consistent estimator A e.g. by calculating the mean of successive values, i.e. A = [Qi(p,S1)+...+Qi(p,SM)]/M which is a consistent estimator of parameter E(F(xh)) when ph = h/(N+1). In the same way, B = Median[Qi(p,S1),...,Qi(p,SM)] can be regarded as an approximately consistent estimator of Median(F(xh)) when ph = (h-1/3)/(N+1/3). When we estimate two different parameters, the estimators are different and the existence of a consistent estimator does not tell which method is better.

The previous discussion has dealt with the question: If Qi(p,Sj) is close to G(p), does it matter how we define “close to”? Is it e.g. OK to minimize the sum of absolute values of deviations p-F(xh) as in R-8 or should we minimize the sum of squared deviations as in R-6? When the loss functions are minimized, the resulting ph,6 and ph,8 are different. The original question can now be reformulated: Does it matter whether we are close to p1 or close to p2 when p1 ≠ p2? The answer should be obvious but let’s continue. It follows that Q6(ph,6,S) = Q8(ph,8,S) = xh. Any estimator e applied to M samples gives e(Q6) = e(Q8) which both converge to the same value, say x0. If G(p) is the parameter to be estimated and e is a consistent estimator of G(p), then x0 = G(ph,6) = G(ph,8). This is impossible because ph,6 ≠ ph,8. Consequently, either R-6 or R-8 is wrong or they both are.

Since both LEEGRC and RATLAM approve R-6, there is no need to promote it here anymore. All other methods applying plotting positions other than h/(N+1) contradict R-6 in the same way as R-8. If we prefer estimates of G(p) which are unbiased and consistent for N probability values, we choose R-6 from the Table. It is very difficult to find a situation in which a biased estimate, given by any other method and converging to a parameter other than G(p) should be preferred. Therefore, the article could be improved by adding a warning and, in the long run, by introducing new methods, e.g. R-6 modified in such a way that the interpolation is carried out on probability paper, and by moving the methods other than R-6 to a historical review with explanation. BERKOM 18:59, 18 December 2015 (UTC) BERKOM 13:10, 22 December 2015 (UTC) — Preceding unsigned comment added by BERKOM (talkcontribs)
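
For reference, the plotting positions ph = (h+a)/(N+b) of methods R-4 … R-9 referred to in the comment above can be tabulated for a small sample. This is a minimal sketch. The (a, b) values below follow the usual Hyndman and Fan definitions of these methods, rewritten in the (h+a)/(N+b) form. The names `AB` and `plotting_positions` are illustrative.

```python
from fractions import Fraction

# (a, b) constants such that p_h = (h + a) / (N + b).
AB = {
    "R-4": (Fraction(0), Fraction(0)),          # h / N
    "R-5": (Fraction(-1, 2), Fraction(0)),      # (h - 1/2) / N
    "R-6": (Fraction(0), Fraction(1)),          # h / (N + 1)
    "R-7": (Fraction(-1), Fraction(-1)),        # (h - 1) / (N - 1)
    "R-8": (Fraction(-1, 3), Fraction(1, 3)),   # (h - 1/3) / (N + 1/3)
    "R-9": (Fraction(-3, 8), Fraction(1, 4)),   # (h - 3/8) / (N + 1/4)
}


def plotting_positions(n, a, b):
    """Exact plotting positions p_1, ..., p_n for constants a and b."""
    return [(h + a) / (n + b) for h in range(1, n + 1)]


N = 4  # illustrative sample size
for name, (a, b) in AB.items():
    positions = ", ".join(str(p) for p in plotting_positions(N, a, b))
    print(f"{name}: {positions}")
```

Exact fractions are used so that the differences between the methods at small N are visible without rounding.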

### Week of July 17, 2016

BERKOM (talkcontribs) writes:
"It is very difficult to find a situation in which a biased estimate, given by any other method and converging to a parameter other than G(p) should be preferred. Therefore …"
Fortunately, all listed methods R-4 through R-9, not just R-6, converge to the same value as N=| S | goes to infinity; they are all consistent. Whether one cares about bias on the p-axis for specific values of p vs. minimizing absolute deviation, etc. is another matter, over which people are free to disagree, hence the plethora of equally important options. No "warning" or markings as "historical" would be appropriate. 𝕃eegrc (talk) 14:19, 21 July 2016 (UTC)

### Week of November 20, 2016

It is true that when N increases indefinitely, all methods R-4 ... R-9 converge to the same value. More generally, when b is an arbitrary positive real number, the probability h/(N+b) associated to each observed order-ranked value xh also produces an estimator which converges to the same value when N increases indefinitely. It is obvious that such a convergence alone is not enough to justify any of the methods in any relevant case, i.e. when N is definite. BERKOM 14:16, 22 November 2016 (UTC)

Agreed that consistency is only one of several criteria one could desire for an estimator. Agreed that since all of R-4, …, R-9 are consistent, consistency does not help us choose any of R-4, …, R-9 as better than the rest. If the cumulative probability distribution function is continuous then R-6 will be best in the situation where one first samples N+1 random real values, then randomly selects N of them, then asks what is the probability that the unselected value will fall into any of the N+1 intervals defined by the N selected values, and then seeks to provide quantile estimates that give those interval probabilities in an unbiased fashion. If that is not your situation or your optimization criterion then you may prefer, e.g., R-8. 𝕃eegrc (talk) 20:04, 22 November 2016 (UTC)
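
The situation described above (draw N+1 values, hold one out at random, and ask which of the N+1 intervals defined by the other N it falls into) can be simulated directly. This is a sketch assuming a uniform distribution, although any continuous distribution should give the same interval frequencies. The function name `interval_hits` and N = 4 are illustrative.

```python
import random
from bisect import bisect_left
from collections import Counter


def interval_hits(n, repeats=60_000, seed=3):
    """Relative frequency with which the held-out value lands in each of the n+1 intervals."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(repeats):
        values = [rng.random() for _ in range(n + 1)]
        held_out = values.pop(rng.randrange(n + 1))  # randomly chosen unselected value
        values.sort()
        # Interval index 0..n: number of selected values below the held-out one.
        hits[bisect_left(values, held_out)] += 1
    return [hits[k] / repeats for k in range(n + 1)]


N = 4
fractions = interval_hits(N)
print([round(f, 3) for f in fractions], "target:", round(1 / (N + 1), 3))
```

Each interval should be hit with frequency close to 1/(N+1), which is the property that the R-6 plotting positions h/(N+1) reproduce without bias. Other optimization criteria, as noted above, may lead to other choices such as R-8.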
