Talk:Mann–Whitney U test

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale.
This article has been rated as High-importance on the importance scale.


"All the formulae here are made more complicated in the presence of tied ranks, but if the number of these is small (and especially if there are no large tie bands) these can be ignored when doing calculations by hand. The computer statistical packages will use them as a matter of routine."

First, what does one do with ties? (A link is sufficient if it is described elsewhere.)

Second, what do computer statistical packages use routinely? The ignoring procedure, or the proper (undescribed) way to handle ties?

dfrankow (talk) 19:45, 29 December 2008 (UTC)

Yes, that was not well expressed. I have had a go at rephrasing it - does it make better sense now? The actual formula in the case of ties would be a bit of a pig to enter in Wiki code and I will leave that job for someone more fluent in the coding than I am, though it's certainly true that for completeness we ought to have it here. seglea (talk) 01:40, 30 December 2008 (UTC)
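
For reference, here is a minimal sketch of the usual treatment of ties, offered as an illustration rather than as a statement of what any particular package does. Tied observations share the average of the ranks they span (midranks), and the standard deviation used in the normal approximation is reduced by a term that depends on the size t_k of each tie band: sigma_U = sqrt(n1 n2 / 12 * ((n + 1) - sum(t_k^3 - t_k) / (n (n - 1)))), with n = n1 + n2. The function names `midranks` and `sigma_u` are hypothetical, and the formula is the standard textbook form, not something taken from the article text.

```python
import math
from collections import Counter


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the band while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def sigma_u(sample1, sample2, correct_for_ties=True):
    """Standard deviation of U under the null, optionally tie-corrected."""
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    tie_term = 0.0
    if correct_for_ties:
        band_sizes = Counter(list(sample1) + list(sample2)).values()
        tie_term = sum(t**3 - t for t in band_sizes) / (n * (n - 1))
    return math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))


a, b = [1, 3], [2, 3, 7]
print(midranks(a + b))        # ranks of the pooled sample, ties averaged
print(sigma_u(a, b, False))   # ignoring ties
print(sigma_u(a, b, True))    # with the tie correction (never larger)
```

With only a few small tie bands the correction term is tiny, which is consistent with the remark that ties can be ignored in hand calculation when they are few.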

One-tailed versus two-tailed distributions

"Note that since U1 + U2 = n1 n2, the mean n1 n2/2 used in the normal approximation is the mean of the two values of U. Therefore, you can use U and get the same result, the only difference being between a left-tailed test and a right-tailed test.

Huh? Perhaps it would be clearer to say which value goes with a left-tailed and which with a right-tailed test. dfrankow (talk) 19:45, 29 December 2008 (UTC)
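
A small sketch may make the quoted sentence concrete. It assumes the convention that U for a sample counts the pairs in which that sample's value is the larger one (ties count 0.5), and it uses the normal approximation without tie or continuity correction. The data and the names `u_statistic` and `normal_cdf` are made up for illustration. Under these assumptions the two z scores differ only in sign, so the left-tail probability computed from U1 equals the right-tail probability computed from U2.

```python
import math


def u_statistic(first, second):
    """Count pairs where the value from `first` exceeds the value from `second` (ties count 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in first for y in second)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


sample1 = [1.1, 2.4, 3.0, 4.7]
sample2 = [2.0, 3.9, 5.2, 6.3, 7.5]
n1, n2 = len(sample1), len(sample2)

u1 = u_statistic(sample1, sample2)
u2 = u_statistic(sample2, sample1)
mean_u = n1 * n2 / 2
sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no ties in this data

z1 = (u1 - mean_u) / sd_u
z2 = (u2 - mean_u) / sd_u

print(u1 + u2 == n1 * n2)                   # the two U values sum to n1*n2
print(z1, z2)                               # equal magnitude, opposite sign
print(normal_cdf(z1), 1 - normal_cdf(z2))   # left tail from U1 = right tail from U2
```

Samples this small would normally be referred to exact tables; the normal approximation is used here only to show the symmetry.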

This test can be reduced to one formula with one reference table

Really, someone should look into that... I don't have the time. COYW (talk) 22:26, 11 May 2015 (UTC)


In the general formulation, the null hypothesis of the Mann-Whitney U test is not about the equality of distributions. It is about the symmetry between two populations with respect to the probability of obtaining a larger observation. Of course, two identical distributions possess this property of symmetry, but two different distributions (for example, two normals with the same mean but different variances) can also be perfectly symmetric with respect to the probability of obtaining a larger observation.

The whole issue of correctly formulating the null hypothesis is very important when considering the power of the test. Consider again two normal distributions with the same mean and different variances. If the null hypothesis is defined as the equality of the two distributions, we are likely to fail to reject the null hypothesis even though we know a priori that it is not true. Only for two distributions with similar variance but different means (more specifically, two distributions with a location shift) will we have a fair chance (i.e. good power) of rejecting the null hypothesis. So such a formulation of the null hypothesis severely restricts the applicability of the test.

However, if we define the null hypothesis as a hypothesis of symmetry with respect to obtaining a larger observation then everything works perfectly and the power of the test does not depend on diverging variances. Indeed, the inability to reject the null hypothesis for 2 normals with the same means but different variances is not a failure of the test because we know a priori that in this case the null hypothesis is satisfied.

—Preceding unsigned comment added by Marenty (talkcontribs) 02:19, 29 July 2010 (UTC)

The article gives misleading information on the assumptions of the Mann-Whitney U test. It says:

"In a less general formulation, the Wilcoxon-Mann-Whitney two-sample test may be thought of as testing the null hypothesis that the probability of an observation from one population exceeding an observation from the second population is 0.5. This formulation requires the additional assumption that the distributions of the two populations are identical except for possibly a shift (i.e. f1(x) = f2(x + δ) )"

Testing the alternative hypothesis P(A>B) > 0.5 (where A is from population 1 and B is from population 2) does not require the restrictive assumption that both distributions are equal except for a shift in location! How come? What is the basis for this statement? The test statistic in the U test is just the proportion of pairs (A, B), with A from population 1 and B from population 2, for which A > B. The distribution of this test statistic can perhaps most easily be calculated theoretically for the special case of shifted distributions, but that does not restrict the use of the test and has nothing to do with the test's assumptions! —Preceding unsigned comment added by Marenty (talkcontribs) 22:06, 24 May 2008 (UTC)

Assumptions seem necessary, in the unequal variance case under the null hypothesis p-values are not uniformly distributed (I used two normals, same mean different variance). (talk) 23:55, 13 February 2009 (UTC)

can anyone typeset the formulae better? I am not familiar with Tex. seglea 05:39, 17 Jan 2004 (UTC)


The hypothesis stated in this article refers both to the testing of equality of central tendency, and equality of distribution. The central tendency hypothesis requires the additional assumption that the distributions of the two samples are the same except for a shift (i.e. f1(X) = f2(X+delta)). The test can also be described as a general test of equality of distribution (H0: f1=f2). In this case the shift alternative is not required; however, the test is used most often as a test of central tendency, so the original formulation (with the addition of the shift assumption) is most appropriate. I have added this assumption to the main page. —Preceding unsigned comment added by (talk) 00:20, 30 January 2008 (UTC)

"The hypothesis stated in this article refers both to the testing of equality of central tendency, and equality of distribution"

Really, this test is precisely only for testing stochastic dominance of two variables A and B, that is, of Prob(A>B) > Prob(B>A). In other words, it tests whether a randomly chosen sample from A is expected to be greater than a sample from B. Look at the test statistic: it is a function of the proportion of pairs A>B where A is from the 1st distribution and B is from the 2nd distribution. For testing stochastic dominance, no additional assumptions are needed (besides the assumption that the underlying distribution is ordinal).

It is incorrect to use the MW U test for general testing of "equality of two distributions" as asserted. Two normal distributions A and B with the same mean and different variances are different distributions, yet the test will (incorrectly) never reject the null hypothesis if we are testing for equality of the distributions. But if we are testing for stochastic dominance instead, then the test (correctly) does not reject the null hypothesis.

On the other hand, "central tendency" is a nebulous concept, but in reality testing "equality of central tendency" with the U test will be nothing more than testing of stochastic dominance. If we want to use the test for detection of a "shift", then we do need to add an assumption about distributions A and B having the same shapes. But this additional (and unnecessary) assumption follows from the very definition of the "shift" rather than from intrinsic requirements of the U test.

In summary, the test should be used in general for testing that Prob(A>B)>0.5, and as such has only one assumption that the samples are comparable (i.e. ordinal). (talk) 01:51, 8 June 2009 (UTC)

P value

I believe that one cannot interpret results from this test without understanding the P-value. As I understand it, the smaller the P value, the more different the two populations are. What I would like to know is whether there is a critical value like there is with a t-test. Thanks ADS

Inexact explanation of what this test should be used for

A significant MW test does not necessarily imply that the distributions have different medians. This is a common misconception. It is most powerful for detecting a difference in medians, which is why this is commonly misstated. The MW test tests whether the samples were taken from different distributions.

calculation of mu in the normal approx

Is this right? I would have thought it should be symmetrical in n1 and n2... 21:11, 28 December 2006 (UTC)JWD

Introduction requires clarity

These two statements are not equivalent:

  • It requires the two samples to be independent, and the observations to be ordinal or continuous measurements
  • i.e. one can at least say, of any two observations, which is the greater.

Link to rank article needs to be more specific

The 'rank' link currently points to a disambiguation page which doesn't include an article explaining what the rank of a sample is. Tim (talk) 02:31, 3 June 2008 (UTC)

Distribution of U-statistic and Table of Values

The table of values link (pdf) is broken, I am changing the address to a different document. I have been unable to find on the web some kind of explanation of the distribution of the U-statistic. This article could use at least some explanation of how the statistic is distributed, and optimally a formula or plot, if possible. I'll keep working, but if someone's got it on hand, that would be great. Lovewarcoffee (talk) 19:57, 4 August 2008 (UTC)

Just what is it that we are talking about?

The article starts:

In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test) is. . . .

Thereafter it talks of "MWW". "MWW" strikes me as an odd abbreviation for "Mann-Whitney U test." If this article is correctly titled, I suggest that the test should be abbreviated as "MW."

David J. Sheskin devotes pp 513–75 of Handbook of Parametric and Nonparametric Statistical Procedures, 4th ed. (Boca Raton: Chapman & Hall, 2007) to this one test, which he calls the "Mann–Whitney U" test. (If you're a purist, note the dash: it's not one statistician with a double-barreled name, but two separate people, Mann and Whitney.) He writes at the start:

Two versions of the test to be described under the label of the Mann–Whitney U test were independently developed by Mann and Whitney (1947) and Wilcoxon (1949). The version to be described here is commonly identified as the Mann–Whitney U test while the version developed by Wilcoxon (1949) is usually referred to as the Wilcoxon–Mann–Whitney test. Although they employ different equations and different tables, the two versions of the test yield comparable results. (513)

(Unfortunately even Sheskin's 1700+ pages don't include any further coverage of [what he calls] the Wilcoxon–Mann–Whitney test.)

And Sheskin adds in an endnote:

The test to be described in this chapter is also referred to as the Wilcoxon rank-sum test and the Mann–Whitney–Wilcoxon test. . . . (569)

This of course doesn't agree with what's written in this Wikipedia article. To follow Sheskin, it would instead say something like:

In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW) or Wilcoxon rank-sum test) is. . . . (Although very similar, the "Wilcoxon-Mann-Whitney" test is different.) . . .

What's the authority for what the article now says? Tama1988 (talk) 09:49, 17 November 2008 (UTC)

The Mann-Whitney U test and Wilcoxon two sample test were developed independently, but provide an identical test statistic. Both Snedecor and Cochran and Sokal and Rohlf retain the distinction and neither concatenates the names. Regards—G716 <T·C> 20:58, 18 November 2008 (UTC)
So how about:
In statistics, the Mann-Whitney U test — also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon two sample test, or Wilcoxon rank-sum test — is. . . . (Although very similar, the "Wilcoxon-Mann-Whitney" test is different.) . . .
? Or are you saying that when Sheskin talks of "comparable results" he means "the same results" and that Wilcoxon's 1949 test (the "Wilcoxon-Mann-Whitney test") is the same as Mann-Whitney? Tama1988 (talk) 09:14, 19 November 2008 (UTC)

Receiver operating characteristic

I have removed the following from the section on Herrnstein's rho. As it stands, it does not make sense. It may well be true, but if so it needs a lot more explanation.

"ρ is also known as the area under the receiver operating characteristic (ROC) curve."

seglea (talk) 23:14, 12 January 2009 (UTC)

This is well known so I'm unclear on why this was removed. See

Hanley, J. A. and McNeil, B. J. (1982). "The meaning and use of the area under a receiver operating characteristic (ROC) curve". Radiology 143: 29-36. Annotation: diagnosis; testing; ROC; c index.

Harrelfe (talk) 14:32, 14 February 2009 (UTC)

I just want to add a little comment here on the rho statistic discussed above. It is a bit unclear: one section says that the AUC is directly related to the U statistic, giving the equation AUC = U/(n1 n2), which is clear and good. Then the section on rho below gives exactly the same formula, rho = U/(n1 n2), and tells the reader about "this commonly used test statistic...". Would it not be better to integrate these two sections? Also, rho is such a common symbol in math/stats that it would be nice to call it Herrnstein's rho, to make it clear that he was the first to call this statistic by this name. I don't mind that in this case the AUC is called something else, since people are usually interested in AUC as a tool for classifying binary outcomes at different probability levels, where AUC values close to one indicate a "good" score for a classifier and scores approaching 0.5 are considered valueless, since they are no better than random assignment. In the case of rho, however, where the test comes from a comparison of two populations, there is no such judgement about values close to 1 being "better". ~Frieda
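
To illustrate the identity being discussed, here is a minimal sketch that computes U by direct pair counting and divides by n1 n2. The scores are invented, the function names are hypothetical, and the convention assumed is that U counts pairs in which the first group's value is the larger one, with ties counting one half.

```python
def u_statistic(positives, negatives):
    """Number of (positive, negative) pairs in which the positive scores higher; ties count 0.5."""
    return sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)


def auc_from_u(positives, negatives):
    """Estimate of P(positive > negative) + 0.5 * P(tie), i.e. U / (n1 * n2)."""
    return u_statistic(positives, negatives) / (len(positives) * len(negatives))


scores_pos = [0.9, 0.8, 0.6, 0.55]
scores_neg = [0.7, 0.5, 0.4, 0.3, 0.2]
print(auc_from_u(scores_pos, scores_neg))  # 18 of the 20 pairs are "won", so 0.9
```

The same number can be read either as an AUC for a classifier or as the rho-style estimate of the probability that an observation from one population exceeds one from the other.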

Assumptions and Formalization of Hypotheses

I will shortly change the "Formal statement of object of test" and "Assumptions" sections and put them into one section. There were several errors.

1. Previously the article stated that the MWW test does not test for differences in medians. But if you make the location shift assumptions, then it does in fact strictly test for the differences in medians.

2. Previously the article stated that one proper formulation for the MWW test is to have the null hypothesis be that P(X>Y)=0.5. In fact, this is not true. If you have two normal distributions with the same mean but different variances then the MWW test is no longer valid under that null even though P(X>Y)=.5 (see Pratt, 1964, Journal of the American Statistical Association, 665-680).

3. Previously the article stated "Without making such a strong assumption [about the location shift] (and verifying its validity) it is incorrect to use the MWW test as a test for shift in location." In fact, although we can invalidate the location shift assumption, we cannot verify that assumption with a finite amount of data. (Testing and finding no significant violation of an assumption is not the same as verifying that assumption. You might not have been able to find significance because of a small sample size.) In statistics we make assumptions all the time, so saying it is incorrect to use an assumption without verifying it seems contrary to the practice of statistics, although it may be a good idea to check the assumption if you can.

4. The following statement was made: "the Mann-Whitney U test is valid for testing of stochastic dominance under very broad conditions, without making any additional assumptions, including any additional assumptions about variances of the two samples". This is not correct, see Pratt, 1964 referenced above. The paragraph following was mostly redundant, so I deleted it. Mpf3205 (talk) 05:18, 7 March 2010 (UTC)

Mann-Whitney vs. Wilcoxon

I spent a great deal of time puzzling over the table of critical values in the second external link, wondering why it didn't match up with the first link, and more importantly, why some of the values appeared to be theoretically impossible (e.g. greater than 100 for a 10*10 test). After careful reading, I realized that the test statistic in the link was calculated differently than the one in the article (i.e. a straight R1 sum of ranks versus R1 - n1(n1+1)/2).

I have no external experience with this, but the best I can tell from the external link, while the Mann-Whitney and Wilcoxon tests are equivalent, they are not identical, in that the numeric form of the statistic differs. Whereas the Mann-Whitney includes the n1(n1+1)/2 adjustment, the Wilcoxon is a straight sum of ranks. While not changing the application or conclusions of the test, this is crucial to know when looking at critical value tables, as what works for one won't work for the other.

I altered the text for the external link so hopefully others will not be as confused, but could someone who has a better understanding of the history and situation add a clarification about the different functional forms to the article? (If you would add info about why they're equivalent, and why one form might be preferred over the other, so much the better.)

P.S. While you're at it, a discussion on how to treat identical valued items in calculating the test statistic would also be appreciated. The Auckland link discusses it for the straight Wilcoxon sum of ranks, but I'm still not sure how they are accounted for in the Mann-Whitney statistic. -- (talk) 22:30, 8 March 2010 (UTC)

Just thought I'd point to a reference which touches on the difference: Journal of the American Statistical Association, Vol. 59, No. 307 (Sep., 1964), pp. 925-934 [1] -- (talk) 22:39, 8 March 2010 (UTC)
Different implementations are using differently defined test statistics! There definitely needs to be a section that makes this clear. It would save a lot of wasted time. I have been using some R implementations, the standard one in 'stats' called via wilcox.test and the advanced version in the 'coin' package invoked via wilcox_test. The test statistic reported from wilcox_test (coin package) is equal to the sum of ranks R1 in the wikipedia example [in the R idiom, this is accessed via 'statistic(out, "linear")' where "out" is the output of wilcox_test-- without the "linear" option, you'll get a Z score]. The test statistic reported from wilcox.test (stats package) is equal to the sum of ranks R1 minus a factor that corresponds to n1(n1+1)/2 in the wikipedia notation [in the R idiom, this is invoked via out$statistic, where "out" is the output of the wilcox.test]. Neither of these implementations computes the "U" statistic as defined in the wikipedia article. Dabs (talk) 19:34, 4 December 2014 (UTC)
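
A short sketch of why a table for the rank-sum form can show values above n1 n2, as noticed for the 10*10 case earlier in this thread. It assumes ranks run from 1 to n1 + n2 with no ties; the function names are hypothetical. Which form a given table or package reports should be checked against its own documentation.

```python
def rank_sum_bounds(n1, n2):
    """Smallest and largest possible rank sum R1 for sample 1 (ranks 1..n1+n2, no ties)."""
    smallest = n1 * (n1 + 1) // 2      # sample 1 holds ranks 1..n1
    largest = smallest + n1 * n2       # sample 1 holds the top n1 ranks
    return smallest, largest


def u_from_rank_sum(r1, n1):
    """Convert the rank sum R1 into U1 = R1 - n1(n1+1)/2."""
    return r1 - n1 * (n1 + 1) / 2


n1 = n2 = 10
low, high = rank_sum_bounds(n1, n2)
print(low, high)                                             # 55 155
print(u_from_rank_sum(low, n1), u_from_rank_sum(high, n1))   # 0.0 100.0
```

The two forms differ only by the constant n1(n1+1)/2, so they lead to the same conclusions, but a critical value from one table cannot be compared with a statistic computed in the other form.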

What do you need to assume under the null hypothesis?

Under the section that describes the assumptions, I had previously added that you need to have both distributions be equal under the null. That was deleted. But I assert that you need that assumption. If you state the null hypothesis as only needing that Pr[X>Y]+ .5 Pr[X=Y] = .5, that does not give sufficient conditions for validity. Here is a counter example (see Pratt, 1964, JASA, cited in my previous notes above): if you have two normal distributions with the same mean but different variances then Pr[X>Y]+ .5 Pr[X=Y] = .5, but your type I error can be inflated (i.e., the test can reject the null hypothesis more often than the nominal significance level).

The paragraph that I deleted also had that same mistaken idea. —Preceding unsigned comment added by Mpf3205 (talkcontribs) 05:31, 27 August 2010 (UTC)
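
The counterexample can be explored with a small simulation. This is only a sketch under stated assumptions: p-values come from the normal approximation with no tie or continuity correction, the sample sizes (10 and 40), standard deviations (5 and 1), replication count and seed are arbitrary choices, and the function names are hypothetical. Both populations are normal with mean 0, so Pr[X>Y] = .5 holds in both scenarios.

```python
import math
import random


def u_statistic(first, second):
    """Count pairs where the value from `first` exceeds the value from `second` (ties count 0.5)."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in first for y in second)


def two_sided_p(first, second):
    """Two-sided p-value from the normal approximation (no tie or continuity correction)."""
    n1, n2 = len(first), len(second)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_statistic(first, second) - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(sd1, sd2, n1=10, n2=40, reps=2000, alpha=0.05, seed=0):
    """Share of simulated data sets rejected at level alpha; both means are 0."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, sd1) for _ in range(n1)]
        y = [rng.gauss(0.0, sd2) for _ in range(n2)]
        if two_sided_p(x, y) < alpha:
            rejected += 1
    return rejected / reps


equal_variances = rejection_rate(1.0, 1.0)
unequal_variances = rejection_rate(5.0, 1.0)   # the smaller sample is the more variable one
print(equal_variances, unequal_variances)
```

In this configuration the second rate should come out noticeably above the nominal 5% while the first stays near it; the exact figures depend on the seed and settings, and other combinations of sample sizes and variances can behave differently.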

Un-clarity in terms : ranks or observations ? lower rank or smaller ?

"Taking each observation in sample 1, count the number of observations in sample 2 that are smaller than it (count a half for any that are equal to it)."

If I understand correctly, the test does not require comparing observed values, only the ranks, so this sentence should be changed to use ranks.

"Choose the sample for which the ranks seem to be smaller" "count the number of hares it is beaten by (lower rank)"

is 1st the lowest rank ? For me it is the opposite. The word smaller seems less ambiguous to me.

"Arrange all the observations into a single ranked series. That is, rank all the observations without regard to which sample they are in."

This has to be done for small samples or big samples, so this sentence should precede.

I will do the modifications I propose, it is probably better so you see what I mean.

Arnaud —Preceding unsigned comment added by (talk) 16:10, 16 November 2010 (UTC)

This does not seem to have gotten fixed. I just fixed it. There are two issues here. First, the explanation as given was incorrect, as implied above. We want to count the *wins*, and counting the wins for sample 1 means counting the observations in sample 2 that are *larger* (and counting 0.5 for ties). The second issue is about whether or not ranks are needed. The answer is no, they are not needed with the simple method introduced in the "calculations" section. It is possible to calculate U correctly without assigning numeric ranks. Just do each pairwise comparison, and count the wins. Here is a numeric example: A = (1, 3) and B = (2, 3, 7). The pairwise wins for A are (3, 1.5), i.e., 1 beats 2, 3 and 7, and 3 beats 7 and ties with 3. This means that U equals 4.5. For B, the number of wins is only 1.5, because 2 beats 3, and 3 ties with 3. The sum of these two U values is 6, which is the product of 2 and 3, as expected. If you go through the second method of calculation, the sums of ranks are 1 + 3.5=4.5 for A, and 2 + 3.5 + 5 = 10.5 for B (this particular numeric example isn't ideal because in this case r1 just happens to equal 4.5, same as u1). Given that n1=2 and n2=3, the formulas for U will give you exactly the same result as for the simple method. So, to reiterate, we do not need to assign numeric ranks to do the simple method-- we only need to make binary comparisons. Dabs (talk) 21:53, 8 December 2014 (UTC)
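
The numeric example above can be checked with a few lines of code. This is only a sketch: `wins` and `midrank` are hypothetical helper names, and "win" is taken in the sense used in the comment (a smaller value beats a larger one, ties count 0.5).

```python
def wins(sample, other):
    """Pairwise 'wins' in the sense used above: a smaller value beats a larger one, ties count 0.5."""
    return sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in sample for y in other)


def midrank(value, pooled):
    """1-based rank of `value` in `pooled`, averaging over ties."""
    below = sum(1 for v in pooled if v < value)
    equal = sum(1 for v in pooled if v == value)
    return below + (equal + 1) / 2


A, B = [1, 3], [2, 3, 7]
pooled = A + B

u_a, u_b = wins(A, B), wins(B, A)
r_a = sum(midrank(v, pooled) for v in A)
r_b = sum(midrank(v, pooled) for v in B)

print(u_a, u_b)   # 4.5 1.5
print(r_a, r_b)   # 4.5 10.5
# R - n(n+1)/2 counts, for each observation, the values of the other sample
# that fall below it, so under "smaller wins" it equals the other sample's wins.
print(r_a - len(A) * (len(A) + 1) / 2, r_b - len(B) * (len(B) + 1) / 2)  # 1.5 4.5
```

So the pairwise method and the rank-sum formula produce the same pair of values, 4.5 and 1.5, which sum to n1 n2 = 6. Which of the two gets labelled U1 depends on whether a "win" means being smaller or being larger.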

Where is the non-technical summary?

I'm sorry, but Wikipedia is used by non-experts to gain an understanding of something they may have run across in a technical or semi-technical setting. As such, one of the joys of reading many WP articles is that someone has taken the time to explain, in layman's terms, just what exactly is covered in the topic. That is NOT the case here. I think it's great that so many of you can chime in here as "experts", able to contribute because you have the required background.

But when the first sentence of the article uses a phrase "have equally large values" -- it just doesn't make sense. On the face of it, why should it be difficult to determine whether two "samples" have "equally large values"? Doesn't that simply mean looking at the largest value in each sample and seeing if they are the same? Clearly not, which is precisely why a layman's version, at least in the first paragraph, should be offered.

I hope someone is willing to stoop to the level of the non-cognoscente and explain what the heck this is all about. I find that in statistics, almost more than in any other discipline, practitioners are unwilling to translate their statements into simple real-world examples and plain speaking. I often wonder if that's because they're afraid, in some way, that someone will claim the emperor has no clothes? -roricka 1/1/11 — Preceding unsigned comment added by Roricka (talkcontribs) 22:29, 1 January 2011 (UTC)

Hi Roricka. Reading your comment, I truly would like to help out but am not sure how. You wrote how the first sentence doesn't make sense. Well, indeed it doesn't make sense to anyone not familiar with the most basic notions of probability and statistics. However, you can not put them into the first sentence since it would require explaining what a random variable is and how that relates to statistical tests and hypothesis testing (notice how proper wikilinks are present in the first sentence).
Also, IMHO, I don't think all wikipedia articles can be formulated as independent modules of knowledge, easily understood without context. This particular article is a good example of that.
If, after reading more, you'd gain an insight as to how to make this article clearer - I'll be most interested to see how.
Talgalili (talk) 10:32, 2 January 2011 (UTC)

I read the Spearman coefficient page and understood it immediately. This page I just found incomprehensible in comparison. I have to echo what the OP said. I suggest using the Spearman page as an example of "how to do it right" maybe? Especially the images which were great! — Preceding unsigned comment added by (talk) 14:27, 6 June 2011 (UTC)

Would someone please clarify how the name of the test is pronounced, for those who aren't used to seeing the symbol? radcen (talk) 20:36, 11 December 2015 (UTC)

Separate pages for Mann-Whitney U test and Wilcoxon rank-sum test?

Although the Mann-Whitney U test and the Wilcoxon rank-sum test are equivalent, they are two different tests, as pointed out by In the current version, Mann–Whitney U is described, while the Wilcoxon rank-sum test is not. Since they are two different tests, shouldn't we create a new page for the Wilcoxon rank-sum test, containing the description of this method, and then say that the Mann-Whitney U test and the Wilcoxon rank-sum test are equivalent?--Gorif (talk) 23:42, 12 February 2012 (UTC)

Do Ranks start at 1 or at 0

The Wikipedia article says that ranks start at 0, because you are considering how many hares the tortoise beats, which could be 0. But everything else I've seen about Wilcoxon test says the ranks start at 1. If one has tables for looking up the implication of Wilcoxon rank, it is rather important whether ranks are from 0 or 1. Further the observation that U1+U2=n1*n2 only holds if ranks start at 0, not 1.

This issue needs clarification!!

Ian Davis — Preceding unsigned comment added by (talk) 01:52, 2 December 2014 (UTC)

Ranks begin at 1, otherwise the formulas will be off. However, this is *not* contradicted by the tortoises and hares example, because it does not use ranks. This is explained in the beginning of the "calculations" section-- "method 1" is simply to count the pairwise wins. This calculation does not require assigning numeric ranks, and it gives the correct value. Dabs (talk) 21:52, 8 December 2014 (UTC)
To follow up: It is not true that U_1 + U_2 = n_1 * n_2 only holds if ranks start at 0. Since ranks start at 1, the case of n_1 = n_2 = 1 gives U_1 and U_2 being 1-1=0 and 2-1=1, which sum to n_1 * n_2 = 1. LachlanA (talk) 01:29, 5 August 2016 (UTC)

Is the t-Test Valid for Non-Normal Distributions?

The introduction says the Wilcoxon is more efficient than the t-test when the distribution is non-normal. That's misleading. The t-test is not just inefficient (is that term meaningful for a test as opposed to an estimator?), but rather fails to be valid if the distribution is not normal. It's a parametric test. See Student's t-test. Erasmuse~enwiki (talk) 12:22, 22 June 2015 (UTC)

Can a two-tailed test show the sign of the difference?

The example of the hare and the tortoise says

significant evidence that hares tend to have lower completion times than tortoises (p < 0.05, two-tailed)

My understanding of a two-tailed test is that the alternative hypothesis is simply that the distributions are unequal, rather than one dominates the other. To test whether hares have lower completion times, don't we need a one-sided test? In particular, being significant on a two-sided test doesn't AFAICT show that the sample with the lower mean was drawn from a distribution with a lower mean with the same significance. (If it did, why would anyone use one-sided tests?) LachlanA (talk) 01:20, 5 August 2016 (UTC)

Paragraph about Wilcoxon signed-rank test

There's been some edit activity re: the paragraph where this test is compared to a Wilcoxon signed-rank test (1, 2, 3, and mine). I think I have clarified it but I don't want to start an edit war, so @TshiliM: @DisillusionedBitterAndKnackered: What do you think of the current phrasing? —Cousteau (talk) 13:34, 21 July 2017 (UTC)

Thank you for this. I did intervene initially because I was worried that it looked as if @TshiliM: had got it wrong, identifying something as repetition (and removing it) when it was not. Then, however, I thought that I had myself perhaps been somewhere along the hasty<------>ignorant spectrum in making this edit (and in my edit on the user's Talk page) so I undid it all. Because I have no understanding of this topic area I think it was unwise editing from me, and I am withdrawing from further involvement, leaving it to people who actually know what they are doing. So: apologies, thanks and best wishes DBaK (talk) 20:03, 23 July 2017 (UTC)