Talk:Design of experiments

From Wikipedia, the free encyclopedia
WikiProject Statistics (Rated C-class, Top-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale.
This article has been rated as Top-importance on the importance scale.
WikiProject Mathematics (Rated C-class, High-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating:
C Class
High Importance
 Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.
This article has comments.

Dispute with AbsolutDan

I have just added again a link to with definitions related to Design of Experiments. If AbsolutDan wants to remove this link again, can he justify the removal as an editor knowledgeable on the topic of Design of Experiments? Can anyone see anything wrong with the contents of TIA has been spammed across multiple articles by several usernames and IPs which appear to be working in concert. Please see the following:

There was a heated discussion about the link here: Talk:Six Sigma where it was determined the link was not extremely helpful and is to a site that is intended to promote their services (6sigma training). If you do a WHOIS on ([1]) and ([2]), you can see they both come from the same ISP. It seems apparent that this is simply the new IP of the above contributor(s), back to try to include links to their site, in blatant violation of guidelines. --AbsolutDan (talk) 17:31, 4 August 2006 (UTC)

As always, AbsolutDan's argument totally ignores the merits of the link and the content, and instead focuses on continuing his war on perceived spam. Can anyone help me here with a professional review of the proposed link? TIA

Professional Review Requested

I would like to focus on the suitability of the link to Design of Experiments Terms in Six Sigma Glossary - Is it suitable content for inclusion? TIA

A link is not content. After your aggressive campaigning to include links to your company dies down, editors might even find consensus to cite it, though it doesn't look like a primary source. Femto 12:53, 5 August 2006 (UTC)

Opening sentence

At this time, the opening sentence says this:

Experimental design is a research design in which the researcher has control over the selection of participants in the study, and these participants are randomly assigned to treatment and control groups.

This appears to presuppose that the statistical units are "participants", a term usually used only when they are humans, as opposed to stalks of wheat, machines, apples, pizza recipes, mice, etc. That assumes too much. "Research design" is presumed to be understood by the reader who is not acquainted with this subject. That also assumes too much. That all experimental designs involve treatment and control groups also assumes too much (see in particular the concrete example I recently added). I'd re-write it if I were confident of what it ought to say instead, and maybe at some point I will, unless someone else does it first. Michael Hardy 02:12, 16 December 2006 (UTC)

comment moved from article

An unsigned comment by user: was added to the article, in parentheses:

  • Weigh each object in one pan, with the other pan empty. Call the measured weight of the i-th object X_i for i = 1, ..., 8. (This is a confusing statement. How can you measure any object in one pan, with no standard mass in other pan?)

You just have to read what it says above about the nature of this device: it's a scale that reports the difference between the two weights. Michael Hardy 23:01, 5 January 2007 (UTC)

Yikes, I had exactly the same confusion when reading the article more than a year later. I expand on my confusion in discussion further below. Briefly, you can say there is "a scale that reports the difference between the two weights", but if I cannot envision how such a scale could possibly exist, I end up going through gyrations to re-interpret what you must mean by "measuring" and so on. Or I am frustrated that you are telling me to hypothetically believe something which can't exist, does exist, and then how could this example possibly have any practical application. See further below for detailed explanation of why I, and apparently others, don't get what is being explained. doncram (talk) 05:32, 28 March 2008 (UTC)

note about the example experiment

The article states that according to modern experimental design standards, the only important element missing from the experiment with the sailors is randomization of the treatments. However, another pair of sailors who received no treatment is needed for this to be an "experiment".

Distinction between replication and repetition

A distinction needs to be made between replicates and repetitions. For example, three readings from the same experimental unit are repetitions while the readings from three separate experimental units are replicates. The error variance from the former is less than that from the latter because repeated readings only measure the variation due to errors in reading while the latter also measure unit-to-unit variation. -- Jeff Wu, Michael Hamada, Experiments: Planning, Analysis, and Parameter Design Optimization. In this article, the subsection titled replication actually talks about repetition. As far as I remember, the terms are sometimes used interchangeably in the literature, but I think a definition like this should be used. Sergivs-en 03:07, 1 May 2007 (UTC)
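The distinction drawn above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `mean_sample_variance`, the normal error model, and the standard deviations (1.0 for reading error, 3.0 for unit-to-unit variation) are made up for illustration and do not come from Wu and Hamada.

```python
import random
import statistics


def mean_sample_variance(replicated, n_readings=3, trials=5000,
                         sd_reading=1.0, sd_unit=3.0, seed=0):
    """Average the sample variance of n_readings readings over many trials.

    replicated=False: all readings come from one experimental unit (repetitions).
    replicated=True: each reading comes from a different unit (replicates).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        if replicated:
            # A fresh unit for every reading: unit-to-unit variation enters.
            readings = [rng.gauss(0.0, sd_unit) + rng.gauss(0.0, sd_reading)
                        for _ in range(n_readings)]
        else:
            # One unit read several times: only the reading error varies.
            unit_value = rng.gauss(0.0, sd_unit)
            readings = [unit_value + rng.gauss(0.0, sd_reading)
                        for _ in range(n_readings)]
        total += statistics.variance(readings)
    return total / trials


repetition_var = mean_sample_variance(replicated=False)
replication_var = mean_sample_variance(replicated=True)
print(f"repetitions: {repetition_var:.2f}  (theory: sd_reading**2 = 1)")
print(f"replicates:  {replication_var:.2f}  (theory: sd_unit**2 + sd_reading**2 = 10)")
```

Under these assumed numbers the repetitions only see the reading error, while the replicates also pick up the unit-to-unit term, which is the point of the quoted definition.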

The scale example is a little confusing

Part of the advantage of the "designed" experiment is that the scale gives the difference between two measurements for "free." A better example, I think, is to use a single pan scale. In this case you have to take the difference between two measurements and thus you have 4 random variables with variance 2*v rather than 8 random variables with variance v. The designed experiment is still better than the single factor but only by a factor of 2, not the factor of 8 the example shows. —Preceding unsigned comment added by JustinShriver (talkcontribs) 21:18, 22 March 2008 (UTC)

You're mistaken. In every case, what is reported is the difference between the two pans. That's what it says. You need to read carefully. The variance of the difference is σ². In what you call the "designed experiment" (although really, they're both designed experiments), you have a sum of eight differences, each with variance σ². The variance of the sum is 8σ². Then you divide the sum by 8, and hence its variance by 8² = 64. So the result in the article is correct. Michael Hardy (talk) 00:07, 23 March 2008 (UTC)
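The arithmetic in the reply above can be written out as one line. This is a sketch that assumes the eight readings Y_1, ..., Y_8 are independent with common variance σ², and that the estimate is a signed sum of the readings divided by 8; the sign symbols s_i are notation introduced here, not taken from the article.

```latex
\operatorname{Var}\!\left(\frac{1}{8}\sum_{i=1}^{8} s_i Y_i\right)
  = \frac{1}{8^2}\sum_{i=1}^{8} s_i^2 \operatorname{Var}(Y_i)
  = \frac{8\sigma^2}{64}
  = \frac{\sigma^2}{8},
\qquad s_i \in \{+1, -1\}.
```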
I don't find the example "simple" at all, and because it is described as being simple, and then I don't get it, then I don't like something (the article? myself? the writer of the article? statistics in general?). It rubs me the wrong way. I do appreciate that there are inefficient and surprisingly efficient approaches to designing experiments, having to do with orthogonal arrays and Latin Square designs and so on. I think another example should be given, or this example should be explained better, or it should be labelled something other than "simple". I suspect it is a clever example, one that is appreciated by some elite who are proud that they understand it thoroughly. And it is indeed impressive to get a factor of eight improvement in some efficiency measure. But I would prefer a more understandable example, and be happy to settle for much less than an 8-fold efficiency difference between good and bad design. doncram (talk) 02:49, 28 March 2008 (UTC)
Well, OK, you've succeeded in taking me by surprise. I'll see if I can rephrase it. Michael Hardy (talk) 02:52, 28 March 2008 (UTC)
Oh my, what a quick response! I think I should apologize for my tone a bit, I got a bit carried away there. It's just I was irritated that I did not understand the example, while I think I should have. Please don't take offense. I expect that if I tried to pick a simple example, and then explained it, it wouldn't be so simple after all. :) (Hmm, which would I pick?) I will try to read the example again now. doncram (talk) 03:05, 28 March 2008 (UTC)
Okay, I could now explain a couple specific ways in which I think the example could be explained better. I find it takes me some time to get this out, so I'll just explain one problem that I have:
First, it is hard for me to understand the basic setup, what the weighing system actually is. I think i am like a lot of readers will be, without a lot of familiarity with pan balances. It is actually stated that we'll use "a pan balance that measures the difference between the weight of the objects in the two pans." I can't picture how it could actually measure the difference. I can see that it can detect which side is heavier, that is all, but not to measure the difference in weight. So I tend to think you didn't mean that it measures the difference, and you will just compare which is heavier. Maybe by "measure a difference" you mean detect that there is a difference, like take a 0-1 measure of which is heavier. Then you go off into multiple weighoff comparisons, and I think you are doing some kind of clever binary search. I just can't get what you mean, because I don't see how you could measure the difference with just a two-pan scale and the objects you are trying to compare. And I am right, you can't with just a simple mechanical scale, and just the objects to be weighed. When I think about how you actually could measure a difference, I realize you could do it if you had a set of little measured gram weight thingies, and the way you would take a measure is to put a bunch of those thingies on the lighter side of the scale, until it balanced evenly, and that would take some time to get the balance right, so like there is some cost to taking a measurement. But you didn't say you had the set of little gram weight thingies. Now I realize if you had those, you could use those to take a measure of each of the individual objects, one by one, and use up eight weighing opportunities. And that is in fact measuring the difference between the weight of one object, vs. no weight on the empty pan (except for the little weighing thingies you put on). 
Or for the same number of weighing opportunities, you could make the eight other comparisons, or no I mean you could take the eight other difference measurements. So anyhow, you need to explain how the weighing system works, and that there are the little gram weight thingies available. P.S. I know you gave a wikilink for pan balance, but I didn't go off there, and I still have not. I thought I understood naturally, that what a 2-pan balance system does is compare which side is heavier, not that it measured a difference in weight. (And that is what I still think, too. But I suspect that there may exist some kind of electronic pan-balance tool which gives out a measure of differences, and maybe that's what you think we should know what that is like. But if you want to tell me about that, I think I'll get stuck on the electronic scales at the deli counter, where it gives a readout of weight, and maybe there's an issue about tare or not, but it is not measuring a difference of anything, it is just measuring the weight of something directly as far as I know.) Does this help? doncram (talk) 04:59, 28 March 2008 (UTC)
[Image: Digital kitchen scale.]
Continuing just on that, now I go see pan balance, and it turns out to have redirected me to Weighing scale with a picture of the kitchen/deli counter type scale, which does not help me at all!
[Image: Typical historical balance.]

Okay, yes later in the article it talks about other scales, too. But this pic is very explicit about showing you have the whole set of little gram weight thingies, maybe they are called "standard weights", not just the 2-pan scale. Or is this not the two-pan scale? There is a different kind of two pan scale pictured further down the article. Hmm, maybe that one has dials and things you can use to take a measure of difference. Well if that is the case, then, why, I never would have believed it could exist. Can you in fact take a measure of difference by moving dials on a mechanical 2-pan scale?! To keep it "simple", why not use the "historical" scale with the weight thingies, as in the picture here, whatever that is properly called?  :) doncram (talk) 05:14, 28 March 2008 (UTC)

Second, I was going to inform you that the stated example had a display problem that showed 1 2 3 4 5 6 7 8 all together in the first weighoff, while you mean to show 1 2 3 4 was balanced against 5 6 7 8. I was going to tell you what internet browser I was using, and so on, to explain to you that what I was seeing must be different than what you are seeing or intended. However, now I see that in the edit history, User:Mdevelle was already fixing it, as in this recent version where it is balanced properly and you are expressing frustration that Mdevelle was not understanding that it was intended to be imbalanced that way. I think that Mdevelle and my perception that this must be a mistake in your presentation (or whoever authored it first), is because we really really do not believe you are talking about a scale that takes a continuous measure of difference, rather than a binary (lighter vs. heavier) measure of difference. It is pointless to compare all the weight vs. none of the weight, you get no information from that, of course all of the weight is heavier. So your first comparison must be wrong, it must not be what you intended to convey. The problem is with the browser, or with the formatting. I think it is great that Mdevelle figured out how to fix the format! Thanks Mdevelle! However, if you do succeed in getting us to believe you have a scale that measures a difference, as in a "historical" scale balance with an auxiliary set of standard weights, then we could understand that in the first trial you are taking a measure of the total of all eight of the items. Although it still may help for you to narrate that in the first trial you are doing that, you are taking a measure of the total weight of all eight items combined. Because it certainly does look different than all the other trials. Visually, it looks like you are making a mistake, so you need to counter that by explicit narration of what you are doing. Hope this helps. 
doncram (talk) 06:05, 28 March 2008 (UTC)

It's definitely NOT a binary measure. It does not merely report which is heavier. If it were, then what would be the point of talking about the variance? Mdevelle was clearly wrong. Please look up the reference by Hermann Chernoff. Michael Hardy (talk) 16:02, 28 March 2008 (UTC)

You misunderstood me. I do now understand that the scale is to be believed to measure, on an interval measurement scale, the difference between the weights of objects in two pans. I was trying to explain to you how a reader, myself included the first time through, could easily misunderstand what was meant to be conveyed in the example. I do think the exposition of the example needs to be improved. If you read my discussion above, I think I make it clear why it would be helpful to mention having an auxiliary set of standard weights as part of the available equipment in the example. Given that no set of standard weights is provided, it is highly reasonable for a reader to cast about for an alternative explanation of something, as something is clearly wrong in the setup. It is highly reasonable, I think, for a reader to hypothesize that the example is talking about a binary measure, which measures on an ordinal measurement scale, which merely compares which is heavier. True, this hypothetical understanding of the situation does not hold up either, for example because it would not be consistent with talking about the variance, as you point out. But the fact that you want the reader to understand differently than the reader does understand is a problem with the exposition of the example, not a problem with the reader. I may try making a few edits myself to clarify the example in the article, perhaps, later. I am sorry if I upset you by first actually misunderstanding and then seeming to continue to misunderstand. doncram (talk) 17:22, 28 March 2008 (UTC)


Now I'm getting accused of "vandalism". I have a Ph.D. in statistics and I care about the subject. Very few people have more experience editing Wikipedia math articles than I do. Very few people have more experience editing Wikipedia generally than I do.

  • Please actually check the math.
  • Please tell us why it could make sense to talk about "variance" if it's only a binary measure reporting which is heavier.
  • Please go to the library and check the cited reference by Chernoff. It will back me up.
Michael Hardy (talk) 16:17, 28 March 2008 (UTC)
I am sorry if my comments on Talk:Design of experiments contributed to Hot200245 believing that Hardy's edits constituted vandalism, and then justified Hot200245 rolling back several edits by Hardy to Mdevelle's last contribution. Hardy was certainly not vandalizing, and Mdevelle's edit that Hardy reverted was clearly wrong. I am perhaps disagreeing with Hardy about the exposition of the example, in terms of what is the best way to explain it, but Mdevelle's attempt to "correct" the example is wrong. I'm sorry if my stated appreciation of Mdevelle's edit was misleading. I found it amusing in that Mdevelle showed that he/she had the same difficulty that I did in understanding the exposition, and I appreciated Mdevelle's effort. But what Mdevelle actually changed the exposition to was wrong.
I think it was understandable that Hot200245 could have believed it was helpful to roll back the edits, though, so let's not take offense, let's just talk here about the example and then improve it in the article, okay? doncram (talk) 17:10, 28 March 2008 (UTC)
It is "obvious" that Michael Hardy is correct here. Just consider the contribution of object 8 in Michael's weighing scheme. It is always in the left pan, so makes the same contribution to each of the Y's. It follows from the formula for the estimate, which has 4 plus signs and 4 negative signs, that the contribution of object 8 to the estimate for object 1 entirely cancels out, as it should. If object 8 were in the right pan on the first weighing, this cancellation would not happen.
However perhaps the description of the experiment should be clarified a little by:
  • saying that additional weights are added to make the pans balance;
  • saying explicitly that the result of the first step is an estimate of the total weight of the objects so as to emphasize that what appears is definitely what is meant.
Melcombe (talk) 17:25, 28 March 2008 (UTC)
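The cancellation argument above can be checked with a few lines of arithmetic. This is a hedged sketch: it uses the sign pattern of the object-2 formula quoted further down this page (the object-1 estimate also has four plus and four minus signs, so the same reasoning applies), and the function name `leakage` is hypothetical.

```python
# Signs with which Y_1..Y_8 enter the estimate for object 2
# (taken from the formula quoted elsewhere on this page).
ESTIMATE_SIGNS = [+1, +1, -1, -1, +1, +1, -1, -1]


def leakage(pan_signs, estimate_signs=ESTIMATE_SIGNS):
    """Coefficient with which another object's true weight enters the estimate.

    pan_signs[i] is +1 if that object is in the left pan at weighing i+1,
    and -1 if it is in the right pan.
    """
    total = sum(s * p for s, p in zip(estimate_signs, pan_signs))
    return total / len(estimate_signs)


always_left = [+1] * 8              # object 8 as described: left pan every time
right_on_first = [-1] + [+1] * 7    # hypothetical "corrected" first weighing

print(leakage(always_left))       # 0.0: object 8 cancels out of the estimate
print(leakage(right_on_first))    # -0.25: a quarter of object 8's weight leaks in
```

So moving object 8 to the right pan in the first weighing would bias the estimate by a quarter of object 8's weight, which is why the lopsided first weighing is intended.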

some answers

A rather large amount of material appears above. I'll answer them one by one but not all of them right now. I'll start with the one about how the mechanical balance can measure the difference. The point is that it doesn't matter. It's like simple arithmetic problems that say "If Johnny has 14 apples in each of his three pockets, then how many apples does he have in his pockets?". The question of whether anyone could really fit that many apples into a pocket is beside the point. More later...... Michael Hardy (talk) 03:34, 29 March 2008 (UTC)

the straightforward approach is 8 times as good as the indirect approach

Please take this lightly. As explained in the example, the straightforward approach to using the scales is to measure each of the eight objects once, directly yielding 8 measures. The indirect approach involves comparing 8 sets of objects according to a complicated schedule that yields nothing of direct use. But that schedule is cleverly arranged so that with additional calculations one can retrieve an estimate of the weight of the first object, after all. It is asserted that it provides a relatively good estimate of that one object's weight. To believe this assertion, you have to believe a goodly host of somewhat unlikely matters:

  • that the schedule is set up cleverly and correctly to accomplish what it intends
  • that the person implementing the weighings is able to adhere to the schedule, perfectly, in the series of weighings
  • that the person implementing the weighings WANTS to adhere properly to the schedule (but consider the motivations: if one of eight fishes is being weighed to be sold, wouldn't the weigher have an incentive to switch a fish or two here or there, to reach a more favorable result in the end?)
  • that there is no recording error of what the 8 weighings measured (and note, there could possibly be very odd numbers, including negative numbers, which do not meet a basic check test like "oh, here's a big one, this is what is measured, that sounds about right, yes it hefts just about that heavy" which would apply for weighing a single object)
  • that there is a computer or abacus or other adding machine available
  • that there is a computer or other dividing machine available
  • that the person operating the computer or other calculation machines can perform the calculations with no errors
  • that the person operating the calculation machines WANTS to reach the accurate answer, and is not in cahoots with the seller or the buyer
  • that the scales and the calculator have adequate firewall protection from random hackers
  • Etc.

Okay, but suppose you believe ALL of the above conditions are met, what do you get? From the first, straightforward approach, you get measurements of 8 objects. From the indirect approach, as explained, you get one (better) measurement of one object. Therefore, I submit the straightforward approach is EIGHT TIMES BETTER!

Again, please take this lightly. :) doncram (talk) 15:39, 29 March 2008 (UTC)

Yes, and are we to believe that Johnny really has 14 apples in each of his three pockets? Michael Hardy (talk) 18:26, 29 March 2008 (UTC)
Well, if I don't believe that the apples can fit in a pocket, I can reinterpret what you meant to say as, the apples are in a satchel. Or I can suppose that maybe in the writer's part of the world, a "pocket" means a satchel. And these interpretations don't interfere too much with my understanding the example. :) doncram (talk) 19:11, 29 March 2008 (UTC)
By the way, I meant to say Thank you, for editing the intro to the example so that it no longer asserts it is simple. That actually does help me with the readability of it, and I think it will help other readers to get into trying to understand it, too. Thanks. doncram (talk) 19:11, 29 March 2008 (UTC)

(unindent) I guess I expected the example would be edited by now. My point that eight items are measured in the first approach, and only one item is measured in the second approach, still is apparent in the example presented. I am supposing that perhaps there are other calculations you can do, to extract clever measures of the other seven items, given the eight weighings. But I am not sure, a formula to compute the weight of item 2 is not obvious to me. So I am tending to think that the example is unfair again, it is saying only that the complicated approach gives a good measure of item 1 weight, and compares that to a single measure of item 1's weight, when it should compare it to the info value of eight weighings of item 1. doncram (talk) 01:47, 18 April 2008 (UTC)

Trying again

The example explicitly compares one weighing of one object (with variance sigma-squared) vs. an approach involving eight weighings of eight objects (with variance for the one object at sigma-squared/eight). If you weighed one object eight times, what variance do you get, is it sigma-squared/sqrt(8)? If so, the relevant improvement is not 8 times but it is sqrt(8) instead. Anyhow, the improvement is not eightfold. You have to compare eight weighings of one object vs. the suggested schedule of eight weighings of multiple objects, or the example is ridiculous and makes no valid point at all. doncram (talk) 16:22, 6 May 2008 (UTC)

Just to clarify the comparison a bit more:
The example compares the two experiments:
a) eight weighings of one object each time
b) eight weighings of multiple (eight) objects each time
In both experiments you get the weights of each item. From the results of b) you can calculate the individual weights by different formulas; for example, the formula for item 1 is given in the article, and the formula for item 2, as asked by Doncram, would be: \widehat{\theta}_2 = \frac{Y_1 + Y_2 - Y_3 - Y_4 + Y_5 + Y_6 - Y_7 - Y_8}{8}.
The total effort of both experiments is the same (eight weighings), but the precision of b) is eight times better than that of a).
The alternative experiment of weighing each single item eight times would result in the same precision, but at the cost of an eightfold effort of 64 weighings.
TSK —Preceding unsigned comment added by (talk) 10:44, 10 July 2008 (UTC)
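The comparison above can be sketched in code. Note the assumptions: the `LEFT_PAN` schedule below is not copied from this page; it is a reconstruction chosen to be consistent with the object-2 formula just quoted and with object 8 sitting in the left pan at every weighing. The true weights are made-up values, and all function names are hypothetical.

```python
import random

# ASSUMED schedule: objects in the left pan at each weighing; the rest go right.
LEFT_PAN = [
    {1, 2, 3, 4, 5, 6, 7, 8},
    {1, 2, 3, 8},
    {1, 4, 5, 8},
    {1, 6, 7, 8},
    {2, 4, 6, 8},
    {2, 5, 7, 8},
    {3, 4, 7, 8},
    {3, 5, 6, 8},
]
N = 8
# SIGNS[i][j] = +1 if object j+1 is in the left pan at weighing i+1, else -1.
SIGNS = [[+1 if obj in left else -1 for obj in range(1, N + 1)]
         for left in LEFT_PAN]


def weigh(weights, sigma, rng):
    """Simulate the eight readings: left-minus-right difference plus noise."""
    return [sum(s * w for s, w in zip(row, weights)) + rng.gauss(0.0, sigma)
            for row in SIGNS]


def estimate(readings):
    """Estimate every weight: theta_j = (1/8) * sum_i SIGNS[i][j] * Y_i."""
    return [sum(SIGNS[i][j] * readings[i] for i in range(N)) / N
            for j in range(N)]


def columns_orthogonal():
    """True if the sign patterns of any two distinct objects have zero dot product."""
    return all(sum(SIGNS[i][a] * SIGNS[i][b] for i in range(N)) == 0
               for a in range(N) for b in range(a + 1, N))


true_weights = [10.0, 12.5, 9.0, 11.0, 8.5, 13.0, 10.5, 9.5]  # made-up values
noise_free = estimate(weigh(true_weights, 0.0, random.Random(1)))
print(columns_orthogonal())   # orthogonality is what lets all eight weights be recovered
print(noise_free)             # with sigma = 0 the true weights come back

# Each estimate sums 8 readings with coefficients of size 1/8, so its
# variance is 8 * (1/8)**2 * sigma**2 = sigma**2 / 8.
variance_factor = sum((row[0] / N) ** 2 for row in SIGNS)
print(variance_factor)
```

With this assumed schedule, the same eight readings yield an estimate for every object, each with variance σ²/8, whereas eight separate single-object weighings give each object a variance of σ².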

This is an important detail. I have expanded the article to reflect this and to clarify the point being made (I hope). Melcombe (talk) 11:09, 10 July 2008 (UTC)

Proposal to merge Study design here

That article lists some types of experimental designs, mostly from medicine. There's not much text there. Xasodfuih (talk) 08:36, 1 January 2009 (UTC)

Keep separate. I think that an article looking at experimental design from a medical starting point is worthwhile, although it would probably benefit from a modified title. It would form a reasonable part of a collection of articles on medical stats, as is being grouped by the existing nav bar. Melcombe (talk) 10:12, 2 January 2009 (UTC)
Revise Study Design to be a list of experimental designs. I don't see, in the current study design article, how it is particularly medical-study-focused. Perhaps what is needed is a list of research study designs (perhaps including experimental designs and non-experimental designs), to complement the intro article about the topic, Design of Experiments. There are lots of Wikipedia article pairings where one is an intro to a topic and the other is a corresponding list of types. The Study Design article seems to be a draft list of research designs that include experimental designs. doncram (talk) 18:40, 2 January 2009 (UTC)

I would agree that it is not clear that this article is medical-study-focused. Not only is there not a list of research study designs of experimental designs and non-experimental designs, but there is not even a complete list of experimental designs. A few are mentioned scattered throughout the article, but it could be made much more clear. I think it would be useful to list between group designs (true and quasi experiments, and factorial designs) and within group or individual designs (time series experiments, repeated measures experiments, and single subject experiments). — Preceding unsigned comment added by (talk) 18:19, 11 March 2012 (UTC)

My impression that this was/is a medically oriented article was because of the selection of articles mentioned, which don't include any of the "agricultural" design topics, the "see also"s, the nav bar for Biomedical research, the categories in which the article presently appears, and the articles in "what links here". So I suggest leaving a copy of what's here now, relabeled as "clinical study design" or "clinical trial design" (which might fit better with titles of related articles) ... and also aim for another more general article which might be a list of "experimental designs" (but I haven't looked to see if anything like that already exists). Melcombe (talk) 10:09, 5 January 2009 (UTC)

I think it would be fruitful to keep the entries for study design and the design of experiments separate for the time being because study design applies to nonexperimental observational studies as well as experimental studies. I think study design can be of potential benefit to undergraduates majoring in, or just passing through, psychology or sociology as well as members of the general public who may not be aware of the variety that exists in the area of study designs outside of the experiment. Iss246 (talk) 01:37, 16 March 2009 (UTC)

Keep separate. This article deals with mathematical qualities of designing an efficient matrix of experimental runs, such as orthogonality. The other deals with equally important qualities of a medical experiment, such as double-blind. Both important, but different topics, addressed by different people. (Needs a sub-topic on Resolution.) -MBHiii (talk) 18:12, 24 March 2009 (UTC)

I suggest the narrative 'Step-by-step procedure in the effective design of an experiment' be moved out of here. It is a description of an approach to experimentation in general, not DoE in particular. John Pons (talk) 08:32, 11 August 2009 (UTC)

I have removed the merge templates as the discussion indicates that this is not what is wanted and leaving them is likely to be distracting. Melcombe (talk) 10:12, 11 January 2010 (UTC)

Is the current article REALLY a description of DoE?

I suggest that the current article on DoE is misplaced. It is not-too-bad on the general principles of experimentation. However with regard to the specific statistical method of Design of Experiments it is inadequate: it fails to identify how the method differs from other statistical processes, it fails to identify which situations are more (less) relevant for a DoE approach, it fails to identify how the designs are set up (except in the vaguest terms), and it fails to explain how to interpret the results. The article seems to confound the two concepts of experimentation and DoE. I suppose one could argue that those two concepts are linked, and there is some truth in that. All the same, I suggest that the article needs a more thorough treatment of the statistical concept of DoE if it is to be useful to practitioners.

As a guide to what content might be included in a revision, I recommend the chapter on ‘Experimental Design (Industrial DOE)’ by Hill & Lewicki, (2006). To quote from the introduction:

‘Experimental methods are widely used in research as well as in industrial settings, however, sometimes for very different purposes. The primary goal in scientific research is usually to show the statistical significance of an effect that a particular factor exerts on the dependent variable of interest. … In industrial settings, the primary goal is usually to extract the maximum amount of unbiased information regarding the factors affecting a production process from as few (costly) observations as possible. While in the former application (in science) analysis of variance (ANOVA) techniques are used to uncover the interactive nature of reality, as manifested in higher-order interactions of factors, in industrial settings interaction effects are often regarded as a "nuisance" (they are often of no interest; they only complicate the process of identifying important factors). … These differences in purpose have a profound effect on the techniques that are used in the two settings. If you review a standard ANOVA text for the sciences, for example the classic texts by Winer (1962) or Keppel (1982), you will find that they will primarily discuss designs with up to, perhaps, five factors (designs with more than six factors are usually impractical; see the ANOVA/MANOVA chapter). The focus of these discussions is how to derive valid and robust statistical significance tests. However, if you review standard texts on experimentation in industry (Box, Hunter, and Hunter, 1978; Box and Draper, 1987; Mason, Gunst, and Hess, 1989; Taguchi, 1987) you will find that they will primarily discuss designs with many factors (e.g., 16 or 32) in which interaction effects cannot be evaluated, and the primary focus of the discussion is how to derive unbiased main effect (and, perhaps, two-way interaction) estimates with a minimum number of observations.’

That paragraph is infinitely more useful as a descriptor of DoE than the current Wikipedia article which blandly states that ‘Design of experiments, or experimental design, (DoE) is the design of all information-gathering exercises where variation is present, whether under the full control of the experimenter or not.’

So what I’m suggesting is that the current article be tagged as requiring radical editing, and that most of the material currently in the article (with the exception of the scurvy example which is great!) is superfluous to the purpose of providing an introductory explanation of DoE to practitioners.

What do you think? John Pons (talk) 09:33, 11 August 2009 (UTC)

What is there seems OK for a lead section: see WP:LEAD. What you suggest could reasonably be put in a new "Introduction" section, following the lead, where there would be room to say something useful. Melcombe (talk) 10:10, 14 August 2009 (UTC)

"Information gathering" in lead sentence includes non-experiments

The whole point of experimentation is the active manipulation/treatment. Why does the article include non-experimental "information gathering" activities, particularly when the editors soundly defeated a proposal to merge this article on design of experiments with another on "study design"? (The current bad definition has spread to the category on experimental design, etc., btw.) Kiefer.Wolfowitz (talk) 00:27, 7 January 2010 (UTC)

You may be taking too restricted a view of what is meant by "experiment". The article doesn't attempt to do this and doesn't link to experiment, but it seems from the coverage in the article "experiment" that the term has a wide meaning, certainly including "observational studies". And of course there is Natural experiment to consider.
You also say that the merger proposal was "soundly" defeated but, since the merge templates are still there in both articles, the discussion is notionally still open. I don't know that any action has actually been taken to resolve the issues raised. Melcombe (talk) 10:23, 8 January 2010 (UTC)
RE "Experiment". Introductory books on statistics (Freedman, Moore & McCabe, etc.) emphasize the distinction between observational studies and experiments. This distinction is endorsed by statistical authorities, e.g. the American Statistical Association (in its curriculum guidelines). So-called "quasi-experiments" (Cook & Campbell) and "natural experiments" are not experiments (any more than a "red herring" is a "herring"!); Campbell is quoted in Freedman's "Statistical Models" as regretting the confusion around "quasi-experiments".
  • I'll try to write a Baconian/statistical definition of experiment.
  • This paragraph would note the secondary use of "experiment" for Baconian observational studies (intelligently gathering new observations to test specified hypotheses with a specified procedure).
  • The third (mis)use of "experiment" for usually (non-Baconian) observational studies, which are not experiments but (at best) "quasi-" and "natural" experiments.
I would first post such a paragraph here for the community's comments.
Thank you for the useful comments and especially for correcting my misstatement "soundly defeated", which was (my) wishful thinking! (Maintaining the question on top of the article seems to distract readers more than facilitate revision, for more than a year, imho.)
Kiefer.Wolfowitz (talk) 16:18, 8 January 2010 (UTC)
Reviewing the article, I see almost no discussion of observational studies. Indeed, all the serious examples (remembered by me) are Baconian experiments (with manipulation). (Remark: This is the first case where Melcombe has not convinced me of my error(s) immediately!) Kiefer.Wolfowitz (talk) 22:03, 8 January 2010 (UTC)
For the lead, the question should not really be what is usual in statistics texts, or what is already in the article, but rather what a reader might expect to be the topic being considered. This must take some account of what is said in the article experiment. However, I think a quick diversion to other articles that would deal with the stuff not covered by the statistical topic of "design of experiments" would certainly be OK. Melcombe (talk) 10:03, 11 January 2010 (UTC)
This is partly just to note that, given the gap in time, I revised the lead to try to clarify the restriction to "controlled experiments", and to note for other readers that there is now a separate article on quasi-experiment to go with natural experiment. If these other topics are to be excluded here, then there seems a need for articles on "Design of natural experiments" and "Design of quasi-experiments": for the latter Quasi-experimental design may be a starting point. The article opinion poll already contains some stuff about what is effectively "Design of opinion polls" or maybe "Design of observational studies". Melcombe (talk) 10:49, 12 February 2010 (UTC)

Experiment --- the horror, the horror

I corrected the experiment#Observational studies section. I remind the editors that the ASA recommends the distinction between observational studies and randomized experiments as a key idea in a first course in statistics for non-statisticians; this suggestion is followed in the exemplary textbooks of Freedman et alia and of Moore and McCabe. Kiefer.Wolfowitz (talk) 14:29, 12 February 2010 (UTC)

Selecting the number of observations

It is insufficient to obtain one observation. Depending on the desired analysis, there are certain factors that need to be taken into consideration when deciding on the number of observations. This includes the number of trials before a subject becomes familiar with the experiment, the number of trials before fatigue sets in and the number of trials required to obtain statistically significant data.

That the "subject" will become familiar or fatigued is not very general! There are other problems here. For now I only change the word factors (bold) to points because there is another, technical use of factors in this article. --P64 (talk) 18:57, 11 April 2010 (UTC)


This and linked articles do not really explain what is a factor. --P64 (talk) 20:18, 11 April 2010 (UTC)

Factors are things that can affect the results of an experiment. They can be materials or processes or environmental conditions or whatever. You note their value and magnitude, write them all down, and resolve not to let them change until you want to run an experiment. Then, for the experiment, you take each factor of interest and arrange to test it at at least 2 different levels. The factors you don't select are held unchanged. After the experiment is done, you take the results and use mathematical calculations to determine which of the factors are the most important in affecting the results. The calculations usually compute the variability of the results around a minimum variance line, assign a percentage of the variability to each factor, and also note how much (as a percentage) of the variability is left unexplained by the test. WFPM (talk) 19:42, 20 March 2013 (UTC)
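
As a rough illustration of the calculation described above, here is a minimal Python sketch of a two-level full factorial experiment. It is only a sketch under stated assumptions: the function names (full_factorial, main_effects, percent_contribution) and the response data are hypothetical, invented for illustration, and the apportioning uses plain sums of squares for a balanced design rather than any particular package's ANOVA routine.

```python
from itertools import product


def full_factorial(k):
    """Return all 2**k runs; each factor is coded -1 (low) or +1 (high)."""
    return list(product((-1, 1), repeat=k))


def main_effects(design, y):
    """Mean response at the high level minus mean response at the low level."""
    half = len(design) / 2
    k = len(design[0])
    return [sum(run[j] * yi for run, yi in zip(design, y)) / half for j in range(k)]


def percent_contribution(design, y):
    """Percentage of the total sum of squares assigned to each factor,
    plus the percentage left unexplained by the main effects."""
    n = len(y)
    mean = sum(y) / n
    ss_total = sum((yi - mean) ** 2 for yi in y)
    if ss_total == 0:
        raise ValueError("responses do not vary; nothing to apportion")
    # For a balanced two-level design, SS(factor) = n * (effect / 2) ** 2.
    shares = [100 * n * (e / 2) ** 2 / ss_total for e in main_effects(design, y)]
    return shares, 100 - sum(shares)


# Hypothetical, noise-free data: y = 10 + 3*A + 1*B, with factor C inert.
design = full_factorial(3)
y = [10 + 3 * a + 1 * b + 0 * c for a, b, c in design]

effects = main_effects(design, y)
shares, unexplained = percent_contribution(design, y)
for name, e, s in zip("ABC", effects, shares):
    print(f"factor {name}: effect = {e:+.1f}, share of variation = {s:.1f}%")
print(f"unexplained: {unexplained:.1f}%")
```

With real, noisy data the unexplained share would be non-zero, because it absorbs both interactions and experimental error; separating those two requires replication or a model that includes interaction terms.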

Experimental designs after Fisher

How could anyone list important contributors to Experimental Design after Fisher and not mention G.E.P. Box? --Eduard Sturm (talk) 18:02, 19 April 2010 (UTC)

If Box is missing then I'll add him to the list. Kiefer.Wolfowitz (talk) 12:24, 20 April 2010 (UTC)
George E. P. Box was one of a group of writers who authored articles about the design of experiments and associated statistics for the chemical industry in the early 1950s. They also wrote two books, "The Design and Analysis of Industrial Experiments" and "Statistical Methods in Research and Production with special reference to the Chemical Industry", which were edited by Owen L. Davies, M.Sc., Ph.D., published by the Hafner Publishing Company, New York City, and evidently funded by Imperial Chemical Industries Limited, London. They call themselves a team of chemists and statisticians dealing with the application of modern statistical methods to the design and analysis of experiments in chemical and allied fields of research. The books were originally written circa 1958 and later revised. WFPM (talk) 20:07, 17 March 2013 (UTC)

Unclear introduction...

Indeed the design of experiments is a rather broad topic. As mentioned in the introduction:

"any information-gathering exercises where variation is present, whether under the full control of the experimenter or not"

However, a clear distinction should be made between experimental designs and all other types of studies. An experimental design tries to establish a causal relation between two or more variables:

if X then Y, and if NOT X then NOT Y.

To establish this relation, the design of the study is of great importance, and therefore the guidelines of Fisher should be addressed. I propose that this characteristic of an experiment be used to link the article on Fisher to the concept of experimental design. —Preceding unsigned comment added by Sjaak135 (talkcontribs) 11:52, 24 November 2010 (UTC)

James Lind and blocking


In the paragraph "Controlled experimentation on scurvy" it is unclear to me how Lind's pairing introduces blocking. If I understood blocking correctly, Lind should have grouped his subjects into two groups of six men. Blocking would mean to let the men in each group be as similar as possible. I do not see the advantage of letting men that receive the same treatment be as similar as possible.

Maybe this is just me misunderstanding blocking? Or maybe the explanation could be clearer? Jonas Wagner (talk) 16:12, 17 February 2011 (UTC)

I corrected the error. Please complain if it is not clear. The beginning of a competent textbook in statistics, e.g. Moore-McCabe or Freedman, describes the empirical failings of non-randomized assignment, discussing e.g. Student on milk supplementation.  Kiefer.Wolfowitz  (Discussion) 16:58, 17 February 2011 (UTC)
I don't think that's right, Kiefer, having checked the full text of existing ref 1. He ensured that all twelve men with scurvy were similar, i.e. had similar symptoms and lived under similar conditions, with the exception of the experimental treatments. Two men took each of six treatments; there was no non-treatment group. Lind does not mention how he chose which treatment to give to which patients. Jonas Wagner is right; there was no blocking. I don't think there's any reason to mention blocking at all in our article. Discussing the lack of randomization is enough. Qwfp (talk) 17:02, 17 February 2011 (UTC)
I would strengthen your kind statement to "I know that you are wrong Kiefer, because (unlike you) I read the reference! Shame on you!" :-)
I'll remove the blocking distraction.  Kiefer.Wolfowitz  (Discussion) 17:07, 17 February 2011 (UTC)
Done! 17:13, 17 February 2011 (UTC)

Proposed rename of category

Proposing rename to "Design of experiments" to match previous rename of this article. Melcombe (talk) 16:50, 8 March 2011 (UTC)

Most DOE articles (14-1 on the nav-box) use "design" as the noun (updated 21:21, 9 March 2011 (UTC)) without mentioning "experiment". We should rename this page "Experimental design".  Kiefer.Wolfowitz  (Discussion) 20:07, 9 March 2011 (UTC)
In the phrase "design of experiments", both "design" and "experiments" are nouns. In "experimental design", the first word is an adjective and the second is a noun. Either way, "design" is a noun. So your point is somewhat unclear. Michael Hardy (talk) 20:34, 9 March 2011 (UTC)
The DOE footer has 14 articles with terminal "designs" and only 1 article (factorial experiment) with a terminal "experiment" besides the flagship article "design of experiments" (that would be better as "experimental design").
Is this easier to understand?
In fact, searching for " "factorial design" OR "factorial experiment" " yields hits with roughly the same number of "factorial experiments" and "factorial designs" (despite the advantage of "factorial experiments" being the name of the article), at least by my fallible count.  Kiefer.Wolfowitz  (Discussion) 21:18, 9 March 2011 (UTC)
Personally I prefer 'design of experiments' for the title of this topic, as 'experimental design' is potentially ambiguous: it could mean experimenting with designs rather than designing experiments. The same issue doesn't arise for names of specific designs. Also it's usually abbreviated DoE rather than ED. Qwfp (talk) 21:25, 9 March 2011 (UTC)
"Design of experiments" for the same reasons given by Qwfp. Mathstat (talk) 03:29, 10 March 2011 (UTC)
The established "DOE" (or "DoE") abbreviation is a good point. Now I'm neutral (leaning towards mildly supportive) of a name change for the category.  Kiefer.Wolfowitz  (Discussion) 14:04, 10 March 2011 (UTC)
By coincidence, I now find myself viewing Wikipedia under an experimental design as part of a designed experiment in user interface design, demonstrating that experiments with design really do take place! --Qwfp (talk) 23:03, 11 March 2011 (UTC)
Designs improve through experimentation. First a design is proposed to solve a problem. Then its properties are studied and codified in textbooks. Then the designs are applied to other problems, whose characteristics force a re-evaluation of the original designs, and suggest improvements (which even after 40-50 years are ignored by Montgomery!).  Kiefer.Wolfowitz  (Discussion) 14:00, 12 March 2011 (UTC)
From "optimal design": Charles Sanders Peirce introduced the topic of experimental design with these words:

Logic will not undertake to inform you what kind of experiments you ought to make in order best to determine the acceleration of gravity, or the value of the Ohm; but it will tell you how to proceed to form a plan of experimentation.

[....] Unfortunately practice generally precedes theory, and it is the usual fate of mankind to get things done in some boggling way first, and find out afterward how they could have been done much more easily and perfectly.[1]

  1. ^ Peirce, C. S. (1882), "Introductory Lecture on the Study of Logic" delivered September 1882, published in Johns Hopkins University Circulars, v. 2, n. 19, pp. 11–12, November 1882, see p. 11, Google Books Eprint. Reprinted in Collected Papers v. 7, paragraphs 59–76, see 59, 63, Writings of Charles S. Peirce v. 4, pp. 378–82, see 378, 379, and The Essential Peirce v. 1, pp. 210–14, see 210-11, also lower down on 211.

As in the template, the place to discuss this and vote is at this discussion. Melcombe (talk) 09:40, 10 March 2011 (UTC)

Another editor suggested pursuing this discussion here, as Melcombe knows.  Kiefer.Wolfowitz  (Discussion) 14:08, 10 March 2011 (UTC)
I said to discuss it here if you wanted to rename this article, not to discuss renaming the category here. postdlf (talk) 14:50, 10 March 2011 (UTC)

Pazman, Fedorov, and Pukelsheim

In the optimal design of experiments, the following authors are world leaders.

  • Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press. 
  • Pázman, Andrej (1986). Foundations of optimum experimental design. Mathematics and its Applications (East European Series) 14 (Translated from the Czech ed.). Dordrecht: D. Reidel Publishing Co. pp. xvi+228. ISBN 90-277-1865-2. MR 838958.

Montgomery's textbooks and their influence are discussed by Bailey & Atkinson in Biometrika's review.  Kiefer.Wolfowitz 00:34, 20 November 2011 (UTC)

A. C. Atkinson and R. A. Bailey: One hundred years of the design of experiments on and off the pages of “Biometrika”. Biometrika 88 (2001), 53–97. [Maths Reviews 1841265 (2002b: 62001)]

Technometrics probably has a dozen reviews by Ziegel mentioning the importance of Myers's books.  Kiefer.Wolfowitz 00:40, 20 November 2011 (UTC)


I suggest that persons who have never taught a course in a statistics department stop editing this page. Persons who have never taught a course on DoE in a statistics department should also consider editing other articles.

Sincerely,  Kiefer.Wolfowitz 09:28, 3 April 2012 (UTC)

That's not how Wikipedia is meant to work. Wikipedia:About and countless other policies, guidelines and traditions emphasise that those with expert knowledge should not try and take ownership of articles in the field of expertise. It's fine to challenge individual edits that reduce the quality of an article. It is not OK to scare off all editors except specialists. Work with editors whose contributions cause you concern. Mentor them. Help them find a way of contributing to the article without introducing erroneous content. We need people who don't know much about design of experiments to help shape this article. They will make it more readable to the general public. Experts are usually very bad at explaining their subject to general audiences. --Ben (talk) 10:43, 3 April 2012 (UTC)
Have you looked at the latest edits explaining that experimental units are explanatory variables, etc.? I don't have time to defend this article against editors who don't know what they are doing. Either the Huns leave or I leave. It is that simple.
If Kirk wants to edit, fine. God bless him. Then I should be glad to mentor him, and clean up after him. I would gently discuss whether his textbook's terminology should be jettisoned in favor of mainstream statistical terminology (which is considerably more general and so applicable). People throwing in "grounded theory" or other cliches in this core article should go back to Gesellschaft where they belong.
This is supposed to be an encyclopedia---not a playground upon which everybody can "do their own thing".  Kiefer.Wolfowitz 12:31, 3 April 2012 (UTC)
Ben, you write so little (and nothing memorable here) that you really have a lot of nerve and little sense in stringing together such blue-linked clichés.  Kiefer.Wolfowitz 14:31, 3 April 2012 (UTC)
I don't know anything about what edits are happening here, but I saw your comment and it struck me as the sort of thing that should not go unchallenged. It seems you're having some problems with another editor who's adding content you consider incorrect or of low quality. The usual approach is (i) ask the other editor to stop editing the article, (ii) discuss it on the talk page, (iii) get consensus on the talk page, and then and only then (iv) edit the article to reflect the consensus. If this other editor really is refusing to compromise, then you can go to Wikipedia:Dispute resolution.
You're having problems with one or two editors. Sort it out with them. Don't tell everyone else they shouldn't edit this article. --Ben (talk) 22:27, 3 April 2012 (UTC)
Deeds, not hortatorial sermons, are needed.
Will you maintain quality on this article? (And this article's quality is pretty bad, in fact. I nearly abandoned it some months ago.)
 Kiefer.Wolfowitz 18:58, 4 April 2012 (UTC)
KW, if you no longer want to contribute/maintain this article, then simply don't. No one is required to maintain it, including you. No one is required to have "taught a course in a statistics department" in order to edit this article. Since we're not publishing original research, subject experts are not required. We're just republishing what the experts have published. You've been here long enough to know that, and you've been here long enough to know that demanding good faith editors stop editing an article is inappropriate. If there are errors in the article, then deal with the errors (or don't). If there is an editor who is consistently introducing unambiguously incorrect and unsourced information into the article, then report it at an appropriate noticeboard. Otherwise, don't deride other editors because you believe they are not as intelligent/experienced/accomplished as you believe you are. This is exactly the type of behavior that the current ANI thread about you is regarding. —SW— gab
Scottywong, SW, (formerly Snottywong):
You mis-state what I wrote. I did not demand that anybody stop editing. I asked that they consider stopping. (Please note the subjunctive mood in the seemingly harsher sentence.) Second, the recent edits were not using reliable sources, and in fact cited nothing.
I don't understand e.g. your use of "we". In the past, Scotty, you were more humble about your editing articles. Perhaps you have been republishing what experts have written, and I didn't check at your second RfA, after you failed your previous RfA because of your own behavioral problems.
I understand that Wikipedia has no requirement of competence. However, editors are free to advocate competence.
 Kiefer.Wolfowitz 20:20, 4 April 2012 (UTC)
You are attempting to intimidate editors with the intention of discouraging them from editing. This is WP:HARASSMENT, which is defined as, "a pattern of repeated offensive behavior that appears to a reasonable observer to intentionally target a specific person or persons. Usually (but not always) the purpose is to make the target feel threatened or intimidated, and the outcome may be to make editing Wikipedia unpleasant for the target, to undermine them, to frighten them, or to discourage them from editing entirely." Claiming that you are "advocating competence" would be putting a spin on your behavior that even Bill O'Reilly can't match. So, please stop. —SW— squeal 20:57, 4 April 2012 (UTC)
Please do not question my intentions, violating WP:AGF and violating WP:NPA (I think) by accusing me of harassment. Some WP policy has a definite prohibition on such frivolous claims of harassment.
I enforced WP:Lede and WP:RS. I agree that I should have written a more pleasant note, as I usually do.
Now stop your dogging me and remove your WP:BLP violation referring to O'Reilly.  Kiefer.Wolfowitz 21:08, 4 April 2012 (UTC)
P.S. A partial citation to a "Cresswell" had been removed by another editor before I reverted the edits, per WP:RS and WP:Lede. 21:12, 4 April 2012 (UTC)

Kiefer, you say "Deeds, not hortatorial sermons, are needed." I disagree. This article will be looked after much better, and Wikipedia will retain more good editors, if you stop being horrible to anyone who doesn't meet your standards. --Ben (talk) 12:17, 10 April 2012 (UTC)

Is this a Nightmare on Elm Street sequel where Benjah-bmm27 (Ben)'s nightmares are mistaken for reality?  Kiefer.Wolfowitz 13:05, 12 April 2012 (UTC)

Reply to "What's wrong with social science?"

I wrote the following on an editor's talk page, but perhaps it's better here.  Kiefer.Wolfowitz 22:02, 4 April 2012 (UTC)


The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

Social science

Hi Aroehrig!

There is nothing wrong with social science. I have been happy to credit psychology with the invention of randomized experiments (citing Stephen Stigler), in numerous articles in Wikipedia, for example. I have written substantial portions of articles in education, political science (political sociology), economics, and related articles in history and gender-studies, etc.

That said, most "research methods for social scientists" books should be burned, voluntarily of course.

The text I reverted had several infelicities. One was the statement that the design of studies had several wondrous properties, minimizing costs, maximizing information, etc. In fact, most studies (particularly in social science) are so bad that they should never have been done, according to Campbell & Cook & Shadish.

In fact, most experiments are not optimal in any sense, and in fact the most popular textbook in the US used to have a bizarre attack against optimal designs (that did maximize information or minimize costs), according to the centenary article in Biometrika by Atkinson and Bailey. Almost all books on RSM emphasize designs that are known to be inefficient---an honorable exception being the exceptional (University of Michigan) textbook by C. Jeff Wu (now a Rambling Wreck!) and Hamada. Given these facts, a statement that DoE optimizes anything (especially two distinct criteria) is false, and appears as false advertising for DoE as a field. (Experts in DoE, especially researchers, probably can be described as optimizers in some sense, of course.)

Sincerely,  Kiefer.Wolfowitz 19:44, 4 April 2012 (UTC)

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.


Removing "Statistical control"

I'm deleting the "statistical control" section that seems to be referring to statistical process control, a specialized method in industrial process management. The only article cited in this section is in the journal "Quality Engineering", which does not seem to be at the right scope of general statistical relevance for this article. VAFisher (talk) 21:53, 15 August 2013 (UTC)

I have restored the section you have deleted. Please review Common cause and special cause (statistics): Special causes must be removed from an experiment and the measurement process to allow proper analysis of the effects under study. Statistical control is a critical part of experimental design. Rlsheehan (talk) 00:08, 16 August 2013 (UTC)

Ok, I see there's some point to be made here, but I'm concerned that the word "control" is used in at least 3 different ways in this field: 'controlling' or 'adjusting' for a covariate, the 'control group' or reference group in a randomized controlled trial, and the 'statistical process control' which this section is describing. I recognize it was a breach of etiquette to simply remove the section, but I was tutoring a medical student for biostats who had found this page very confusing, because of these different uses of the same word. I'll work on a rewrite that includes some disambiguation. VAFisher (talk) 20:21, 19 August 2013 (UTC)

Use of animals

This article hardly touches on the design of experiments using animals. This involves some extremely important concepts, such as ethics, housing and animal welfare, not yet covered in the article, as well as some concepts already covered. I am prepared to include these if other editors think this is appropriate.__DrChrissy (talk) 18:08, 17 August 2013 (UTC).

I think it's worth mentioning, but perhaps not in very much detail.
There is already some mention of this topic in the article Animal rights, but in a rather jumbled and limited fashion. (Possibly elsewhere too, I haven't looked in detail.) Maybe it needs more expansion there than here. --Demiurge1000 (talk) 18:15, 17 August 2013 (UTC)
There is plenty of information out there which is more neutral than Animal rights such as The Three Rs (animals), Bateson's cube‎ and meta-analysis which are useful tools for those designing experiments using animals.__DrChrissy (talk) 20:24, 17 August 2013 (UTC)
Those articles are not in good condition either, are they? I suggest you improve all five articles! :) --Demiurge1000 (talk) 21:26, 17 August 2013 (UTC)

Animal testing covers much of your concern. This article covers the physical planning of experiments to allow valid statistical analysis and resulting decisions. These subjects are different. Rlsheehan (talk)

Aha! A sixth article! DrChrissy, at which points in this article do you feel that there should be wikilinks to the Animal testing article? --Demiurge1000 (talk) 02:08, 18 August 2013 (UTC)
There are many other experiments (studies) using animals other than "Animal testing". I am not sure about the motivations behind the negativity to my suggestion, however, I feel my editing efforts will be better received elsewhere. By the way, before suggesting other articles are not in "good condition", perhaps a good look at this article is needed - there is not even a link to Statistical power!__DrChrissy (talk) 13:57, 18 August 2013 (UTC)

Re Grey box completion and validation

“See also” “Grey box completion and validation” has been removed anonymously without explanation from this and several other topics. Following advice from Wikipedia, if there are no objections (please provide your name and reasons), I plan to reinstate the reference in a week's time.

In designing experiments, the method of analysis should be considered (and perhaps simulated to ensure feasibility). The removed reference covers techniques of potential application in the analysis of experiments, particularly when a model is involved. In particular, most models are incomplete (i.e. a grey box) and thus need completion and validation. This reference seems to be within the appropriate content of the “See also” section; see Wikipedia:Manual_of_Style/Layout#See_also_section.

BillWhiten (talk) 05:36, 22 March 2015 (UTC)

COI editing

Martijn.Berge please explain why your books should be cited in this article. thanks. Jytdog (talk) 14:07, 4 May 2015 (UTC)