Talk:Multiple comparisons problem

WikiProject Statistics (Rated Start-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as Start-Class on the quality scale.
This article has been rated as High-importance on the importance scale.

Coin flip calculation

"the likelihood that a fair coin would come up heads at least 9 out of 10 times is 11 * (½) ^ 10 = 0.0107."

Can someone please explain where the 11 came from? My own understanding is that it should be a 10—Preceding unsigned comment added by (talk) 05:07, 31 December 2008 (UTC)

It says "at least nine", not "exactly nine".
And "likelihood" is the wrong word here; I've changed it to "probability". Michael Hardy (talk) 05:57, 31 December 2008 (UTC)

Can you add a reference of some kind? It may look obvious to a statistician, but not so much to me, and I'm a scientist! It would be very helpful for the general reader. —Preceding unsigned comment added by (talk) 09:48, 31 March 2009 (UTC)

Any particular outcome of tossing a fair coin 10 times has probability (1/2) ^ 10. There is only one way to get 10 heads, and there are 10 ways of getting 9 heads (one for each of the 10 throws being the tail, e.g. the first is a tail or the third is a tail), hence (1 + 10) * (1/2) ^ 10. (talk) 14:40, 24 March 2010 (UTC)
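
The counting argument above can be checked numerically. The following is only a minimal sketch in Python; the helper name prob_at_least is made up for this illustration and does not come from the article.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials.

    Hypothetical helper for illustration: sums the binomial probabilities
    for k = k_min, ..., n.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Number of equally likely sequences with 9 or 10 heads: 10 + 1 = 11.
ways = comb(10, 9) + comb(10, 10)

# Each sequence has probability (1/2)^10, so the total is 11 / 1024.
prob = prob_at_least(9, 10)

print(ways, prob)
```

The factor 11 is the count of favourable sequences: 10 sequences with exactly nine heads plus 1 sequence with ten heads.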

Multiple comparisons or multiple testing?

I have always heard of the problems explained in the article under the name multiple testing (cf. also the Benjamini & Hochberg paper cited at the end of the page), so I would be tempted to suggest moving the article to this name, but it may simply be a bias on my side. Any opinions? Schutz 21:00, 21 September 2005 (UTC)

Multiple comparisons is what I've always heard; multiple testing to me sounds like simultaneous testing of multiple null hypotheses, and that is not the same topic. So I oppose such a move. Michael Hardy 22:09, 15 October 2005 (UTC)
Multiple testing is exactly what you call multiple comparisons, but I'd be interested to know of any reference that documents the meaning that you indicated above. As I wrote above, I have read mainly papers that use the terminology multiple testing, but this may be a bias among researchers specialised in a specific area. For example, almost all the literature on the statistical analysis of DNA microarrays uses this term (I just added a sentence on this on the microarray page). Given that no one else has answered my suggestion (thanks for jumping in!), I will not move the page, but I will indicate clearly that multiple testing is (also) what we are talking about here. Schutz 00:03, 16 October 2005 (UTC)
OK, after rereading the article, it seems to me that it is very confusing in its present state! If one reads only the first sentence, multiple comparisons is indeed not the same as multiple testing: if, by multiple comparison, this article means basically "what you do after obtaining a significant ANOVA F-test", then indeed we are not talking about multiple testing in general (which covers more generally the problems of "using statistical tests repeatedly" as indicated in the intro). Do you agree? In this case, most of the discussion could go in a more general article on multiple testing, which I will be happy to start. Unfortunately, all the definitions I have been able to find so far for multiple comparisons blur the distinction with multiple testing. Schutz 00:53, 16 October 2005 (UTC)
I have to agree with Michael Hardy that the term often used is multiple comparisons. I have used it myself, and had reviewers of my own research papers claim I need to do "multiple comparisons corrections." Also, google returns 177,000 hits for the search bonferroni+"multiple comparisons", and 48,000 for bonferroni+"multiple testing". Debivort 08:09, 3 January 2006 (UTC)
Sorry, I am lost here; please see the last comment I have written above. It seemed clear to me that this article was about the ANOVA F-test multiple comparison problem; indeed, most of the procedures linked from this page are specifically about "comparing sets of means", as was the lead section. Based on this, the article has been split between multiple comparisons (this page), and multiple testing (the general problem). The (good) changes you have made are about the general problem. If the consensus is that multiple comparisons is the general problem, then the two pages should be merged, and this article should be cleaned up. But I must say that I like the split approach, and it seems logical: with the ANOVA, you are really comparing a set of means, while testing really refers to the application of multiple statistical tests, whatever they are. The Google searches do not tell us if the two terms have the exact same meaning (some of the links for multiple comparisons point towards the ANOVA question only; some talk about the more general problem). For the record, even though it is not relevant to this particular discussion, I have mostly seen the term multiple testing for the general problem, including in reviews of research papers. Hey, the only paper in the bibliography of the article that mentions either term says multiple testing, and it is about the general problem ;-). Schutz 15:30, 3 January 2006 (UTC)
Mathworld says that Bonferroni corrections address the multiple comparisons problem. They alas do not have an entry on "multiple testing". It seems like the article text (parts not by me) and all of the statistical tests linked below that I am familiar with address "multiple comparisons" as the problem is conceived by me and Mathworld. Is your conception of multiple comparisons (i.e. the ANOVA f-test) a specific example of multiple testing/my conception of multiple comparisons? I wonder if we aren't just running into a linguistic rather than content-based hurdle here. Debivort 16:54, 3 January 2006 (UTC)
Basically, I first thought it was only a linguistic question when I started this discussion a few months ago. It was only based on the comments above (it was mentioned in particular that multiple testing and multiple comparisons were not the same thing) and on the content of the article that I assumed that multiple comparisons (i.e. the ANOVA F-test) was a specific example of multiple testing. While it was not my conception, I was OK with the distinction and spun off the multiple testing article, which no one objected to. This is why I am a bit puzzled about the going back. I wonder if there may be a systematic difference in vocabulary between statisticians working in different fields; the statistical papers I have seen so far were all about multiple testing (starting with Benjamini-Hochberg, as mentioned above). This is probably why I easily believed that multiple comparisons was the special case, but it may be only a bias. In any case, if the consensus is that multiple testing == multiple comparisons (hopefully other people will say something), then the first priority would be to merge the other article, instead of rewriting it (although it may be too late). As a side note, at least some of the linked articles are indeed specifically related to ANOVA. Schutz 17:35, 3 January 2006 (UTC)
Yeah, it does seem like we need to rope in some other comments. I'll ask around if anyone has the time to comment on it. Maybe you can do the same? Debivort 05:20, 4 January 2006 (UTC)
Try asking at Wikipedia talk:WikiProject Mathematics. linas 15:08, 5 January 2006 (UTC)

Disclaimer: I don't know whether I am biased by my own concrete problem, as I am neither a statistician nor a native English speaker, but I'll try to help. According to the dictionary:

Testing n. 1. The act of testing or proving; trial; proof. [1913 Webster]

Comparison n. 1. The act of comparing; an examination of two or more objects with the view of discovering the resemblances or differences; relative estimate. [1913 Webster]

With these definitions, I think that doing _multiple tests_ means repeating a test several times. For example, suppose we want to test whether A is better than B (or equal, or whatever). After that we get C and want to test A vs C and B vs C. Then comes D and we want A-D, B-D, C-D... If we do that with a t-test or a Wilcoxon test, false positives become more likely (the coin example in the article). In this way, we could end up accepting a false hypothesis, for example saying that A has the same mean as D. For this reason, we have tests designed to avoid this: ANOVA (parametric), Friedman (nonparametric), others??...

After performing ANOVA or Friedman, we only know that for example H0: A = B = C = D is not true. Then we would probably want to know which one is different from the others. For this purpose, we can apply one of the techniques that allow us to _compare_ every pair: Tukey test, Nemenyi, Bonferroni...

The above could clearly split the article in two, but I have probably left out other ideas, such as techniques for repeating a test in order to increase power, which I do not know about. I think we should clarify which contents we want here before deciding on one or two articles. Arauzo 18:59, 20 April 2006 (UTC)

Reviewing some bibliography: in (Zar, 1999) these are chapters 10 and 11:
  • Multiple Hypotheses: the analysis of variance. This chapter introduces the problem of repeating the same test over different samples to confirm various hypotheses about them (coin example). It then explains ANOVA and its non-parametric extensions like Kruskal-Wallis, and points to chapter 14 for other techniques with more than one factor, e.g. Friedman.
  • Multiple comparisons. This chapter explains how the comparisons among pairs of the samples tested in an ANOVA should be done, and covers different tests for comparisons, like Tukey's.
At the start of chapter 11: 'The term "multiple comparisons" was introduced by D. B. Duncan in 1951', according to (David 1995).
H. A. David First (?) occurrence of common terms in mathematical statistics. Amer. Statist. 49: 121-133, 1995.
Jerrold H. Zar, Biostatistical Analysis, 4th ed. Prentice-Hall 1999, ISBN 013081542X
Arauzo 11:35, 23 April 2006 (UTC)

Strong Support. I'm late to this discussion, but I've never heard of multiple testing, until just now. I use multiple comparisons as a term all the time (especially Bonferroni and friends). Could it be a UK/US thing, or a case where SPSS has dictated the vocabulary to the world? Otherwise, I think the time for merger may be here. I'll plan to do it in a couple of days, if I don't hear from anyone else. -Scott Alberts 03:59, 6 September 2006 (UTC)

Multiple testing and multiple comparisons should remain different pages. The difference is not merely UK/US or terminology, but one of essential difference. Multiple testing, or multiple hypotheses testing, is the general problem of testing several null hypotheses while controlling the overall chance of false positive; Bonferroni correction of P-values is the most common method for doing this, but not the only one. Multiple comparisons are the special cases where the tests are comparisons between groups, typically several pairwise comparisons on a set of groups (e.g. all-against-all or one-against-all); although Bonferroni correction is still valid, it is typically not very powerful as it does not make use of the dependencies between different pairwise tests performed on the same groups. I'll see if I can make an update to the multiple testing page at some point which will make this difference more clear. --Septagon 23:43, 7 January 2007 (UTC)
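
To make the distinction above concrete, here is a rough sketch in Python of a Bonferroni adjustment applied to an all-against-all set of pairwise comparisons. The group labels A to D follow the example given earlier in this thread; the function name bonferroni_threshold and the 0.05 level are assumptions for illustration only.

```python
from itertools import combinations

def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance level under a Bonferroni correction.

    Hypothetical helper: divides the desired familywise level alpha
    by the number of tests m.
    """
    return alpha / m

# Illustrative groups, as in the A, B, C, D example earlier in the thread.
groups = ["A", "B", "C", "D"]

# All-against-all pairwise comparisons: k * (k - 1) / 2 pairs for k groups.
pairs = list(combinations(groups, 2))

# Each pairwise test is run at this stricter level.
per_test_alpha = bonferroni_threshold(0.05, len(pairs))

# Union bound: the familywise error rate is at most m * (alpha / m) = alpha,
# whether or not the tests are independent.
fwer_upper_bound = len(pairs) * per_test_alpha

print(len(pairs), per_test_alpha, fwer_upper_bound)
```

The bound holds under any dependence between the tests, which is consistent with the remark that Bonferroni stays valid for pairwise comparisons but does not exploit their dependencies and so can lose power.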

Lead section

I think that the lead section should be a little more accessible: the big picture in plain language. There is plenty of room for the subtleties of the concept further down. ike9898 01:59, 8 October 2005 (UTC)

I agree. I don't know what the sentence "The experimentwise α level increases exponentially as the number of comparisons increases." means. What is an α level, or where do I go to look it up? Not really a field I know that much about, so I look forward to a more clear article. -- Jake 07:13, 15 October 2005 (UTC)
I agree as well, and will take a crack at an edit with a more accessible intro, taking into account the current trend in multiple comparisons v testing (above). Debivort 08:10, 3 January 2006 (UTC)

Tukey's Studentized Range Test/Distribution

There is a nice summary of this by NIST at [1] which I believe is in the public domain, as NIST is a US government agency. In fact I made a template for this: NIST-PD. Btyner 18:57, 15 May 2006 (UTC)


There is a section in the middle of the article that is repeated word-for-word later in the article. Please fix. Thanx. --Cromwellt 5PM, 16 Feb 2007 (having login problems) —The preceding unsigned comment was added by (talk) 23:03, 16 February 2007 (UTC).


Recent reversions may be problematic as they restore claims about Bonferroni that are uncited and have already been called into dispute. Can some of the other editors weigh in on this? de Bivort 15:38, 13 November 2007 (UTC)

Multiple comparisons for confidence intervals and hypothesis tests

The paragraph says:

"If the inferences are hypothesis tests rather than confidence intervals, the same issue arises. With just one test performed at the 5% level, there is only a 5% chance of incorrectly rejecting the null hypothesis if the null hypothesis is true. However, for 100 tests where all null hypotheses are false, the expected number of incorrect rejections is 5. If the tests are independent, the probability of at least one incorrect rejection is 99.4%. These errors are called false positives."

Shouldn't it be "However, for 100 tests where all null hypotheses are true, the expected number of incorrect rejections is 5."?

That is, we reject the null hypothesis when we think it is false (based on the alpha level). In this case the problem arises because we reject it even though it is true. Sorry if I misunderstood; non-statistician here. Diego Diez 13:22, 23 September 2010 (UTC) —Preceding unsigned comment added by Kurai yousei (talkcontribs)
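
The two figures in the quoted paragraph can be reproduced under the reading suggested above, namely that all 100 null hypotheses are true and the tests are independent. This is only a sketch in Python; the function names are invented for the illustration.

```python
def expected_false_positives(m: int, alpha: float) -> float:
    """Expected number of incorrect rejections when all m nulls are true."""
    return m * alpha

def familywise_error_rate(m: int, alpha: float) -> float:
    """Probability of at least one incorrect rejection among m independent
    tests at level alpha, assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

m, alpha = 100, 0.05

expected = expected_false_positives(m, alpha)   # about 5 incorrect rejections
fwer = familywise_error_rate(m, alpha)          # about 0.994, i.e. 99.4%

print(expected, round(fwer, 3))
```

An incorrect rejection (false positive) can only occur for a null hypothesis that is true, so both numbers presuppose true nulls; if every null were false, no rejection would be incorrect.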

Further reading

I recommend the multiple comparison book by Tamhane & Hochberg. They have also worked with the wizards at the SAS Institute to publish a user-friendly guide (and also a workbook) on multiple comparisons. These sources should be authoritative and mainstream.  Kiefer.Wolfowitz 10:13, 24 August 2011 (UTC)

Unhelpful section: "Classification of m hypothesis tests"

I suggest a re-thinking of the purpose and goal of this section. The main problem is that nothing in this section appears elsewhere in the article, so it should be deleted if no more work is done to make it useful.

The section comes at an important point in the article and risks throwing the reader off-track. The section throws a lot of variables at the reader, forcing him or her to ponder a complicated table just to come up with some pretty intuitive and obvious ideas, such as that the number of false positives and true positives add up to the number of discoveries. Do we need all this quasi-math to know that?

The table is confusing, mostly because there is no clear relation between rows and columns. I believe "Declared significant" means "Researchers believe alternative hypothesis to be true", and "Declared non-significant" means "Researchers believe null hypothesis to be true". This relabeling might make the table clearer.

So the question comes down to: what do you want to teach the reader? Currently, the section teaches nothing worthwhile. — Preceding unsigned comment added by AndrewOram (talkcontribs) 11:36, 22 April 2014 (UTC)

Dr. Wolf's comment on this article

Dr. Wolf has reviewed this Wikipedia page, and provided us with the following comments to improve its quality:

The article is not structured well, has many holes, and even contains some wrong (or at least misleading) statements. It might be better to scrap it altogether and refer to a well-crafted review paper instead. Sorry about this negative rating but I feel the need to say it as it is.

We hope Wikipedians on this talk page can take advantage of these comments and improve the quality of the article accordingly.

We believe Dr. Wolf has expertise on the topic of this article, since he has published relevant scholarly research:

  • Reference : Joseph P. Romano & Azeem M. Shaikh & Michael Wolf, 2009. "Hypothesis testing in econometrics," IEW - Working Papers 444, Institute for Empirical Research in Economics - University of Zurich.

ExpertIdeasBot (talk) 22:42, 24 September 2016 (UTC)


My issues with this article are: 1) the controlling procedures section has been repetitive since its merge with multiple testing correction, so it needs to be condensed and cleaned up; 2) the large-scale multiple testing section is largely missing citations for its claims. Jbrowning17 (talk) 03:18, 18 October 2016 (UTC)

sources for edit

I must say it is really ridiculous that something that can be found so easily using a search engine (as I mentioned in the original edit summary) would still need to have a source provided for it, and be deleted otherwise. Well, here are the sources. You could just google "MCP conference XXX" (where XXX is the year of the conference) and get these as the first results, but I'll list those results anyway:

Orielno (talk) 17:43, 19 October 2016 (UTC)