# Talk:Regression toward the mean/Archive 2

## What exactly is "regression toward everything"?

Before evaluating the merit of your proposal, I would like to understand "regression towards the median", "regression towards multiple points", "regression to every point in the sample range", and anything else that is pertinent. For each of them, I would like to know (with credible sources):
• What is its precise definition?
• Is it an empirical (observed) phenomenon only (without mathematical proofs)? What are some documented observations?
• Are there relevant mathematical theorems also? What are they? (Full statements of the theorems with sources for the proofs, please)--Palaeoviatalk 02:42, 25 July 2010 (UTC)
I think you're mixing two different things. My proposal is not related to the section on regression toward everything. It's an attempt to make the article coherent. I think we all agree the current state is confused, as different people have edited with different points of view. There's no flow or organization, style is inconsistent and some parts are difficult to understand. I'm not blaming anyone, it's a case of too many cooks spoiling the stew.AaCBrown (talk) 07:48, 25 July 2010 (UTC)
The classic early papers are by Efron and Morris in the Journal of the American Statistical Association, 66 (pp. 807-815), 67 (pp. 130-139), 68 (pp. 117-130) and 70 (pp. 311-319). There is a good popular account by Everson in "A statistician reads the sports pages", Chance 2007 (20) pp. 49-56. The best general references for this topic are "Improving Efficiency by Shrinkage" by Marvin Gruber, "Theory of Preliminary Test and Stein-type Estimation with Applications" by A.K.M.E Saleh and "Bayes and Empirical Bayes Methods for Data Analysis" by Carlin and Louis. There is extensive treatment in "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman. It's also covered more briefly in most basic mathematical statistics texts, for example "Statistical Decision Theory and Bayesian Analysis" by J.O. Berger (pp. 364-369).
There isn't one precise definition. The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Rainier in feet, the time it takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.
This simple idea can be made precise and rigorous in a number of ways. Some of them require strong assumptions and state exactly how much regression to expect. For example, if you assume the underlying true values have a Normal distribution with known parameters and the errors are i.i.d. samples from another Normal distribution with known parameters, you can compute the precise distribution of the regression effect. Other treatments require few assumptions but say only that regression exists. Some of these methods go by names like shrinkage estimators, ridge regression and empirical Bayesian estimates. The sources above contain many examples with real data, but it's easy to prove for yourself. Pick any two sets of measurements, or generate some random numbers in Excel (you can put rand() in column A, A1+rand()-.5 in column B and A1+rand()-.5 in column C and regard B and C as two different noisy estimates of A). Pick any point within the first set. I'll bet you'll find that the observations lower than the point on the first set went up on the second set more than half the time.
This effect is of most practical importance when the distribution has multiple modes. The largest shrinkage tends to be toward the modes, not the mean or the median. It also can be significant when you have mixed populations or incomplete information.AaCBrown (talk) 07:48, 25 July 2010 (UTC)
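
The spreadsheet experiment described above can be reproduced outside Excel. The following is only a minimal sketch of that one setup (uniform true values, two independent uniform errors); the function names, sample size, seed and choice of quantiles are assumptions made for illustration, and the result says nothing about other error models.

```python
import random


def simulate(n=50_000, seed=0):
    """Return (a, b, c): true values and two independent noisy measurements."""
    rng = random.Random(seed)
    a = [rng.random() for _ in range(n)]       # column A: rand()
    b = [x + rng.random() - 0.5 for x in a]    # column B: A + rand() - .5
    c = [x + rng.random() - 0.5 for x in a]    # column C: A + rand() - .5
    return a, b, c


def fraction_moving_back(first, second, point):
    """Share of observations that moved back toward `point` on remeasurement.

    Returns (share of those below `point` that rose,
             share of those above `point` that fell).
    """
    below = [(x, y) for x, y in zip(first, second) if x < point]
    above = [(x, y) for x, y in zip(first, second) if x > point]
    rose = sum(y > x for x, y in below) / len(below)
    fell = sum(y < x for x, y in above) / len(above)
    return rose, fell


_, b, c = simulate()
ranked = sorted(b)
results = {}
for q in (0.10, 0.25, 0.50, 0.75, 0.90):
    point = ranked[int(q * len(ranked))]       # a point inside the range of B
    results[q] = fraction_moving_back(b, c, point)
    print(f"quantile {q:.2f}: rose={results[q][0]:.3f}, fell={results[q][1]:.3f}")
```

The point is taken at several quantiles of the first measurement so that it always lies inside the sample range, as the comment requires. The errors here are independent of the true values and of each other; that is a property of this simulation, not something the comment states.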

## Bogus citation

AaCBrown justified the ridiculous nonsense discussed in the previous section, with this claim (see section: Complete Misunderstanding and Utter Nonsense):

The pitcher with the highest batting average among pitchers by the All-Star break is more likely to regress down toward the mean batting average of all pitchers than up to the mean batting average of all players in the second half of the season. That need not be true, but it can be and happens to be (see: GEORGE, E. I. (1986). Minimax multiple shrinkage estimation. Ann. Statist. 14 188–205.)

Now, this is the abstract of the cited article:

For the canonical problem of estimating a multivariate normal mean under squared-error-loss, this article addresses the problem of selecting a minimax shrinkage estimator when vague or conflicting prior information suggests that more than one estimator from a broad class might be effective. For this situation a new class of alternative estimators, called multiple shrinkage estimators, is proposed. These estimators use the data to emulate the behavior and risk properties of the most effective estimator under consideration. Unbiased estimates of risk and sufficient conditions for minimaxity are provided. Bayesian motivations link this construction to posterior means of mixture priors. To illustrate the theory, minimax multiple shrinkage Stein estimators are constructed which can adaptively shrink the data towards any number of points or subspaces.

Can anyone with some familiarity with mathematical and statistical literature doubt that this citation is deceitful and bogus?--Palaeoviatalk 02:56, 24 July 2010 (UTC)

I'm not sure why you think this is bogus. Do you not believe the paper could contain the example I cited? If you won't read past the abstract to find out, perhaps I can point out the relevance of the example to the abstract. The multivariate mean in the example is a vector of all baseball player batting averages. If we didn't know about regression toward the mean, we might estimate each player's average individually. But RTM tells us that the highest observed average is probably an overestimate of true ability, and the lowest observed average is probably an underestimate. So we would shrink our estimate toward the mean, estimating each player's future performance as something like w*Observed_Average + (1-w)*Mean_Average_of_All_Players; with some function for w.
The "multiple shrinkage estimators" include shrinking not toward overall player mean, but mean by position (a different mean for pitchers, first basemen, shortstops and so on). Of course the paper is far more general than the example. The paper covers the question of what to do when you don't know which point the batting averages will regress toward.AaCBrown (talk) 04:04, 30 July 2010 (UTC)
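
The weighted-average form quoted in the comment above can be written out directly. This is a minimal sketch only: the batting averages, the two candidate target means and the weight `w` are made-up numbers, and the function is not the estimator from George (1986), which chooses the weights from the data.

```python
def shrink(observed, target_mean, w):
    """w * observed + (1 - w) * target_mean, with 0 <= w <= 1."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * observed + (1.0 - w) * target_mean


# Hypothetical figures for one pitcher at the All-Star break.
observed_average = 0.300
mean_all_players = 0.260     # assumed overall mean
mean_pitchers = 0.150        # assumed mean for the pitcher subgroup
w = 0.4                      # assumed weight on the observed average

toward_all_players = shrink(observed_average, mean_all_players, w)
toward_pitchers = shrink(observed_average, mean_pitchers, w)
print(f"shrunk toward all players: {toward_all_players:.3f}")
print(f"shrunk toward pitchers:    {toward_pitchers:.3f}")
```

The two outputs differ only because the target differs, which is the choice the comment says the paper addresses.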

You said: "Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error". Consider data-point 15 in the data range [2, 30]. 15 is above 10, so positive error is expected of the data-point 15. 15 is below 20, so negative error is expected of the data-point 15. Please explain.--Palaeoviatalk 08:18, 25 July 2010 (UTC)

This is an excellent question, and illustrates how confusing this stuff can be. A randomly-selected point above 10 is more likely to have a positive error than a negative one. A randomly-selected point below 20 is more likely to have a negative error than a positive one. These are incomplete information statements. If we know the subject was measured at 15, statements about randomly-selected measurements are not relevant.
Suppose, however, you know the distribution of measurement errors is Normal with mean zero and standard deviation one. Say the 15 is the 90th highest out of 100. Before you knew the value, you expected the 90th out of 100 observations to have a positive measurement error of 1.28. So you might shrink the 15 down to 13.72. However, this is another incomplete information problem; you could make a better estimate looking at all the data, not just knowing one point and its rank order. Moreover, in practical situations, you rarely have exact knowledge of the measurement error distribution and no information at all about the underlying distribution of true values.

AaCBrown (talk) 02:43, 30 July 2010 (UTC)
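
The figures 1.28 and 13.72 in the comment above match the 90th-percentile point of a standard Normal distribution. The sketch below only reproduces that arithmetic; reading 1.28 as this quantile is an assumption, and whether the quantile is the right expected error for the 90th-ranked observation is not something the code checks.

```python
from statistics import NormalDist

# Error model stated in the comment: Normal, mean zero, standard deviation one.
error_dist = NormalDist(mu=0.0, sigma=1.0)

z_90 = error_dist.inv_cdf(0.90)   # 90th-percentile point of the error distribution
observed = 15.0
adjusted = observed - z_90        # the "shrunk" value quoted in the comment

print(f"90th-percentile error: {z_90:.2f}")
print(f"adjusted value:        {adjusted:.2f}")
```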

Let X be a random variable, with any pdf, in the range [3, 20]. Let the error E be a random variable with any pdf. A "random point" refers to the random variable X+E.

• A random point below 20, i.e. a random point in [3, 20), has negative expected error.
• A random point above 3, i.e. a random point in (3, 20], has positive expected error.

Are you asserting this?

Don't you need a probability distribution before you can assert this?--Palaeoviatalk 04:37, 30 July 2010 (UTC)

Whatever you assert, we need a precise mathematical statement (I see many subtle points in formulating such a statement. Since you know the material, please provide it), a rigorous proof, and genuine sources.--Palaeoviatalk 09:13, 30 July 2010 (UTC)

I think the following mathematical fallacy (discussed above) would be an excellent challenge to gifted eighth-graders:
Take any set of data, even one that includes completely different measurements like the height of Mount Rainier in feet, the time it takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.--Palaeoviatalk 03:36, 26 July 2010 (UTC)
Here may be a path to understanding. Could you explain why this is incorrect? I agree it's counterintuitive, and puzzles many people when it is first presented. I'm pretty confident it's true, I've thought a lot about it and it is generally accepted among statisticians and probability theorists. So if you think there's an elementary fallacy here, I probably didn't explain it well. If you describe what you think the error is, I think I can satisfy you. I've taught this for a lot of years.
But I don't mean to be patronizing. I admit the possibility that you have discovered something original. In that case it's far more important that you communicate it.AaCBrown (talk) 03:39, 30 July 2010 (UTC)

## "Sources"(?) for "regression toward everything" examined

I am completely puzzled by your citing a paper on "estimating a multivariate normal mean" (by George) as central to this discussion. Such citations are apt to arouse my suspicion.--Palaeoviatalk 02:54, 25 July 2010 (UTC)
The citation was for the baseball player example included in the paper. But the topics are related since the regression toward the mean phenomenon is exploited in estimating the mean of a multivariate Normal distribution with three or more dimensions. For example, suppose all students in a class take the three SAT tests, Math, Verbal and Writing. You might try to find the multivariate average for the class by averaging the Math scores, averaging the Verbal scores and averaging the Writing scores. But you do better (under certain assumptions and definitions) to reduce whichever of your estimates was highest and increase your estimate of whichever was lowest; even if the scores are independent.AaCBrown (talk) 07:48, 25 July 2010 (UTC)

AaCBrown said of "Regression towards everything":

The classic early papers are by Efron and Morris in the Journal of the American Statistical Association, 66 (pp. 807-815), 67 (pp. 130-139), 68 (pp. 117-130) and 70 (pp. 311-319). There is a good popular account by Everson in "A statistician reads the sports pages", Chance 2007 (20) pp. 49-56. The best general references for this topic are "Improving Efficiency by Shrinkage" by Marvin Gruber, "Theory of Preliminary Test and Stein-type Estimation with Applications" by A.K.M.E Saleh and "Bayes and Empirical Bayes Methods for Data Analysis" by Carlin and Louis. There is extensive treatment in "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman. It's also covered more briefly in most basic mathematical statistics texts, for example "Statistical Decision Theory and Bayesian Analysis" by J.O. Berger (pp. 364-369).

"Improving Efficiency by Shrinkage" by Marvin Gruber is online. Please indicate the relevant pages, as they are not immediately apparent.--Palaeoviatalk 08:36, 25 July 2010 (UTC)

We're obviously miscommunicating about something basic, since the entire book is about ridge regression and shrinkage. Page 28, for example, includes a discussion about regression toward multiple points, which I think may be the phenomenon you reject (but I'm not sure, I'm not trying to provoke you by being dense).AaCBrown (talk) 03:08, 30 July 2010 (UTC)

Shrinkage estimator (and Stein's example, James–Stein estimator) appears to be a focus of the cited literature, but I fail to see any relevance.--Palaeoviatalk 08:50, 25 July 2010 (UTC)

I assume that you disagree that a shrinkage estimator is an adjustment for regression toward the mean. Is that correct? Does that mean you don't accept shrinkage? Or you have an alternative derivation? Or do you feel that adjusting for regression toward the mean is a separate topic that does not belong in this article (I agree with that to the extent that a full discussion of shrinkage, which is much more general than regression toward the mean, is out of place, it should be a link; but why not mention the concept?).AaCBrown (talk) 03:21, 30 July 2010 (UTC)

"Statistical Decision Theory and Bayesian Analysis" by J.O. Berger (pp. 364-369) is also online. Once again, in my view it is an utterly bogus citation. (Judge for yourself.) Please enlighten me. --Palaeoviatalk 01:08, 26 July 2010 (UTC)

I fear this will not help, but here is the relevant quote starting from the last sentence on page 364: "The most sensible way of selecting a minimax estimator would, therefore, seem to be to decide, a priori, where theta is felt likely to be. . . " I acknowledge this is not clear without the equations preceding it to define theta. I included it because it is explicitly non-Bayesian (I thought this might be your objection, as some people get passionate about that) and mathematical (you seem to demand a rigorous mathematical demonstration). If you continue reading to 369 as given in the citation, you will see the argument for using heuristics to select the point to regress toward.AaCBrown (talk) 03:16, 30 July 2010 (UTC)
I asked for sources for "regression towards everything" (simple enough), and this is the best that there is? Completely irrelevant. (Again, judge for yourself.)--Palaeoviatalk 05:55, 30 July 2010 (UTC)

## Quality of this article: Guarding against regression to chaos and idiocy

In response to AaCBrown's proposal:

I think the only solution is for one person to redo it. There's no need to write new material, I think we can take it that every relevant point has been written at one time by someone or other. The hypothetical new editor should just search the history and include the best version of every coherent point backed by adequate citations. If there are contradictory points, list both side-by-side for the reader to decide. The only writing should be for transitions and stylistic consistency. It's possible the topic should be broken up, say into RTM (History of genetics), RTM (Regression analysis), RTM (Experimental design) and so on. Or maybe some of those should be sections in other articles (like Genetics, Regression or Experimental Design) with only links from this article.

My views are as follows. The article's version on April 29, 2009 was appallingly useless and misleading, the section "Regression towards everything" (introduced in this edit on October 14, 2008 by AaCBrown) being particularly egregious. There has since been a major overhaul of its contents, and it is now clear and correct, on the whole.

There is no need for complicating this article with "contradictory points" (because there are none), confusing the issue again with "shrinkage" (which is an estimation technique, and completely unrelated to this article), and "regression towards the median", "regression towards every point in the sample range" (which are utter rubbish). All such proposals would only lead to regression to chaos, or reversion to maximum entropy.--Palaeoviatalk 05:36, 27 July 2010 (UTC)

The following is a most impressively authoritative-sounding extract from "Regression towards everything" that is full of crackpot, kooky ideas that are totally ridiculous:
In general, you can expect the net effect of regressions toward all points to pull an individual toward the closest mode of the distribution. If you have information about subgroups, and the subgroup means are far apart relative to the differences between individuals, you can expect individuals to be pulled toward subgroup means, even if those do not show up as modes of the distribution. For unimodal distributions, without strong subgroup effects or asymmetries, individuals will likely be pulled toward the mean, median and mode which should be close together. For bimodal and multimodal distributions, asymmetric distributions or data with strong subgroup effects, regression toward the mean should be applied with caution.
Vigilance is called for against this article's regression to idiocy.--Palaeoviatalk 11:49, 27 July 2010 (UTC)
Okay, this is more promising, perhaps I can understand your disagreement. Do you agree that if you gave an SAT test to a population of fifth graders and college seniors, that upon retest the fifth graders would regress toward the mean of fifth graders and the college seniors would regress toward the mean of college seniors? If so, then I think you agree with regression toward subgroup mean. If not, I think you're wrong, but I would be happy to hear your reasons or sources.
If you agree with that, would you agree that even without subgroup information, we would expect a bimodal distribution of scores in the experiment above, and that we would expect observations to regress toward the closest mode, assuming we have no information about the age of the subject? Again if you disagree, I would be happy to discuss it further.
Finally, if the above statements are true, doesn't it make sense to apply regression toward the overall sample mean with caution?
Or is there some other part you think is incorrect, such as that the effect doesn't matter much in unimodal distribution without strong subgroup effects or asymmetries, or that in such distributions the mean, median and mode should be close together.
If you can express your objection more clearly, I think we can resolve this easily.AaCBrown (talk) 03:32, 30 July 2010 (UTC)
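
The fifth-grader and college-senior retest question above can be explored with a small simulation. This is a sketch under invented assumptions: the group means (400 and 650), the ability spread (50), the test noise (40), the selection band (460 to 520) and the seed are all hypothetical, and errors are Normal and independent.

```python
import random
from statistics import mean

rng = random.Random(1)


def score_twice(group_mean, n):
    """Return (first, retest) score pairs for n subjects in one group."""
    pairs = []
    for _ in range(n):
        ability = rng.gauss(group_mean, 50)        # true ability
        first = ability + rng.gauss(0, 40)         # noisy first test
        retest = ability + rng.gauss(0, 40)        # independent noisy retest
        pairs.append((first, retest))
    return pairs


fifth = score_twice(400, 20_000)
seniors = score_twice(650, 20_000)

overall_mean = mean(first for first, _ in fifth + seniors)
fifth_mean = mean(first for first, _ in fifth)

# Fifth graders who scored above their group mean but below the overall mean.
selected = [(f, r) for f, r in fifth if 460 <= f <= 520]
mean_first = mean(f for f, _ in selected)
mean_retest = mean(r for _, r in selected)

print(f"overall mean:       {overall_mean:.1f}")
print(f"fifth-grade mean:   {fifth_mean:.1f}")
print(f"selected, first:    {mean_first:.1f}")
print(f"selected, retest:   {mean_retest:.1f}")
```

The selected subjects sit between the two candidate targets, so the direction in which their retest average moves shows which mean they move toward in this particular setup.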
Palaeovia, what is a reliable source that comments directly on the statements that you are disagreeing with here? I'm just wondering, because I like to look up sources on this subject. -- WeijiBaikeBianji (talk) 13:47, 27 July 2010 (UTC)

The current exchange between AaCBrown and me on this Talk page (beginning with the section "Is this utter nonsense?") relates directly to the kookiness of the quoted text. This text is part of a deleted section "Regression towards everything" written by AaCBrown, whose restoration he (she) recently argued for. I asked for and obtained from AaCBrown an explicit and unambiguous definition of what he meant by "regression towards everything". (see "I'm not sure why you're so worked up about this" above.)

The relevant portions that show why the text is full of crackpot ideas are as follows:

There isn't one precise definition. The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Rainier in feet, the time it takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.

This simple idea can be made precise and rigorous in a number of ways. Some of them require strong assumptions and state exactly how much regression to expect. For example, if you assume the underlying true values have a Normal distribution with known parameters and the errors are i.i.d. samples from another Normal distribution with known parameters, you can compute the precise distribution of the regression effect. Other treatments require few assumptions but say only that regression exists. Some of these methods go by names like shrinkage estimators, ridge regression and empirical Bayesian estimates. The sources above contain many examples with real data, but it's easy to prove for yourself. Pick any two sets of measurements, or generate some random numbers in Excel (you can put rand() in column A, A1+rand()-.5 in column B and A1+rand()-.5 in column C and regard B and C as two different noisy estimates of A). Pick any point within the first set. I'll bet you'll find that the observations lower than the point on the first set went up on the second set more than half the time.
This effect is of most practical importance when the distribution has multiple modes. The largest shrinkage tends to be toward the modes, not the mean or the median. It also can be significant when you have mixed populations or incomplete information.AaCBrown (talk) 07:48, 25 July 2010 (UTC)

You said: "Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error". Consider data-point 15 in the data range [2, 30]. 15 is above 10, so positive error is expected of the data-point 15. 15 is below 20, so negative error is expected of the data-point 15. Please explain.--Palaeoviatalk 08:18, 25 July 2010 (UTC)

I think I already answered this above. A randomly-selected point above 10 is more likely to have a positive error than a negative one. A randomly-selected point below 20 is more likely to have a negative error than a positive one. These are incomplete information statements. If we know the subject was measured at 15, statements about randomly-selected measurements are not relevant.
Suppose, however, you know the distribution of measurement errors is Normal with mean zero and standard deviation one. Say the 15 is the 90th highest out of 100. Before you knew the value, you expected the 90th out of 100 observations to have a positive measurement error of 1.28. So you might shrink the 15 down to 13.72. However, this is another incomplete information problem; you could make a better estimate looking at all the data, not just knowing one point and its rank order. Moreover, in practical situations, you rarely have exact knowledge of the measurement error distribution and no information at all about the underlying distribution of true values.

AaCBrown (talk) 02:43, 30 July 2010 (UTC)

I think the following mathematical fallacy (discussed above) would be an excellent challenge to gifted eighth-graders:
Take any set of data, even one that includes completely different measurements like the height of Mount Rainier in feet, the time it takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.--Palaeoviatalk 03:36, 26 July 2010 (UTC)

The above is an exercise in debunking patent nonsense. "Regression towards everything" is rubbish. Therefore the quoted text is impressive sounding rubbish. Since the text gives a Wikipedia editor's (not a scholar's) crackpot ideas, no sources exist to refute them. --Palaeoviatalk 15:41, 27 July 2010 (UTC)

Once again, I ask you for your reasoning. If it could be discovered by an 8th grader, and it's so patent, why won't you explain it?
I'm not trying to bait you, I honestly would like to understand your objection. Let's make it precise. There is a sample of noisy measurements, which neither one of us has seen. We pick a point within the range, say the mean, median, mode, average of min and max, 90th percentile; any formulation guaranteed to give a point within the range. We're going to add up all the measured values for the observations above the selected point, and compare that to the sum of the true values. At even payout odds, would you bet over (the measured sum is greater than the true sum) or under (the measured sum is less than the true sum)?
I think the answer is "over," but if you disagree with that, we can have a productive discussion. It's also possible you agree with that, but feel that it's unrelated to regression toward the mean and doesn't belong in the article. That's also a point I'm happy to discuss.
The more you tell me about your views, the easier it will be to come to agreement.AaCBrown (talk) 03:47, 30 July 2010 (UTC)
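
The over/under bet posed above can be tried numerically. The following sketch assumes one specific model that the comment does not specify: true values uniform on [0, 10], independent Normal errors with mean zero and standard deviation one, samples of 50, and the sample median as the selected point. The function name, trial count and seed are likewise illustrative.

```python
import random
from statistics import median


def over_share(trials=2_000, n=50, seed=2):
    """Fraction of simulated samples in which the 'over' bet wins."""
    rng = random.Random(seed)
    over = 0
    for _ in range(trials):
        true = [rng.uniform(0, 10) for _ in range(n)]
        measured = [t + rng.gauss(0, 1) for t in true]
        point = median(measured)                   # a point inside the range
        picked = [(m, t) for m, t in zip(measured, true) if m > point]
        measured_sum = sum(m for m, _ in picked)
        true_sum = sum(t for _, t in picked)
        if measured_sum > true_sum:                # 'over' wins this sample
            over += 1
    return over / trials


share = over_share()
print(f"share of samples where the measured sum exceeds the true sum: {share:.3f}")
```

Swapping `median` for another rule that returns a point inside the range, or changing the distributions, lets the same bet be checked under other assumptions.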
If the other editor's statements are based only on his own reasoning, and not published sources, they should ordinarily be deleted under the No original research policy. But I have the impression that some professional authors about statistics write quite a lot about lay errors in statistical reasoning, so generally when I see a statement labeled as nonsense in a discussion of statistics, I ask for a reference to some expert's refutation of that statement. It is generally useful while editing Wikipedia to have sources to back up opinions, especially if the opinions are true but disagreed with by some members of the public. So I will continually ask for sources on article talk pages all over Wikipedia, the better to ensure that articles follow the Wikipedia verifiability policy. -- WeijiBaikeBianji (talk) 16:00, 27 July 2010 (UTC)
The whole shebang is obviously "original research". Debunking the fallacy is easier to do than pinning down AaCBrown's bogus citations (which I have also done), though.--Palaeoviatalk 16:13, 27 July 2010 (UTC)

## I withdraw my proposal

I still think the existing article is unsatisfactory to everyone, but I have failed to avoid a flame war with Palaeovia. I'm honestly not sure why he's so worked up against shrinkage, or perhaps the relation between shrinkage and regression toward the mean. While there are a number of related ideas sometimes labeled "regression toward the mean," including some of considerable historical interest in the history of genetics and statistics, as far as I know the only modern applications are experimental design and shrinkage. In experimental design, regression toward the mean is a general concept, not a rigorous theorem. It is an important reason to use randomized controls, or other methods, to avoid inferential error. Shrinkage is the modern understanding of how to exploit regression toward the mean for estimation. There are many variants of shrinkage using different assumptions, including a Bayesian version. It is a concept that many people have trouble accepting on first presentation.

Palaeovia considers some aspect of this to be not just wrong, but offensive rubbish. I think the problem may be that shrinkage deals with only one sample. It estimates population parameters to be used to predict future samples. Classic regression toward the mean deals with two samples, and posits a relation between the second sample and the first. There's no significant mathematical difference between the two. You get the same numerical prediction for a second sample whether you do it in a classic regression toward the mean computation or a modern shrinkage one (but shrinkage is much more general and has wider application).

I'm not sure whether Palaeovia doesn't like shrinkage, accepts shrinkage but thinks it's mathematically distinct from regression toward the mean, or accepts the mathematical equivalence but feels they belong in different articles because they have distinct traditions. Frankly, while all three are valid points, they don't seem to be emotional ones, or ones that would require suppression of discussion rather than simply noting the disagreement in the article. His opinion is clear, but not his reasons.

However, I'm not strongly in favor of covering shrinkage-based understanding of regression toward the mean. There's already an article on shrinkage. I think it would be nice to at least link to it, but not at the cost of starting a fight. I got drawn into discussing this because I took Palaeovia at face value when he asked if there were objections to removing the discussion, then when he found fault with my objections and citations, I tried to explain them. Clearly I missed what was bothering him, because each answer and each citation did more harm than good.

There's no reason to let that issue stand in the way of someone cleaning up this article. I'm content to leave out everything that offends Palaeovia. There's still a lot of good stuff. The problem is that it's not communicated well. It is not organized properly, and some parts make no sense at all because of some clumsy editing by people with different views. All I'm suggesting is that one person rewrite it in a consistent style and organization. I'm clearly not the person to do this work, but I wish someone acceptable to Palaeovia would.

AaCBrown (talk) 01:41, 30 July 2010 (UTC)

What are some sources that mention some of the issues you mention in this post? -- WeijiBaikeBianji (talk) 01:59, 30 July 2010 (UTC)
Do you mean references to shrinkage? Or discussions of the relation of shrinkage to the concept of regression toward the mean? There are a number of both above (but Palaeovia disagrees); here are two more that are available on-line: http://www.economics.pomona.edu/GarySmith/BBregress/baseball.html, http://www.ncbi.nlm.nih.gov/pubmed/8109577. These are more or less random; there are literally thousands of articles on this subject.
Here is a quote from the first citation above that explains the point: "Because baseball performances are an imperfect measure of underlying abilities, batting averages and earned run averages regress toward the mean. Outstanding performances exaggerate player skills and are typically followed by more mediocre performances. The average correlation coefficient for adjacent-season performance is 0.39 for batting averages and 0.25 for earned run averages. Predictions of standardized batting averages and earned run averages can be improved consistently and substantially by using correlation coefficients estimated from earlier seasons to shrink performances toward the mean." —Preceding unsigned comment added by AaCBrown (talk • contribs) 02:56, 30 July 2010 (UTC)
However, I would prefer to move this discussion away from shrinkage. That's a minor side issue that causes flaming. I would be very happy to see this article written tightly with no mention of shrinkage or the shrinkage-based interpretation of regression toward the mean.

02:51, 30 July 2010 (UTC) —Preceding unsigned comment added by AaCBrown (talk • contribs)
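
The shrinkage step described in the quoted passage can be sketched in a few lines. This is only an illustration under stated assumptions: performances are standardized (mean 0, standard deviation 1 in both seasons), the prediction is the usual least-squares one (correlation times z-score), and the function and constant names are hypothetical rather than taken from the cited paper. Only the two correlation values come from the quote.

```python
def shrink_toward_mean(z_score: float, correlation: float) -> float:
    """Predict next season's standardized performance.

    Assumes both seasons are standardized, so the least-squares
    prediction is simply correlation * z_score, i.e. the observed
    value pulled toward the mean (0) by a factor of `correlation`.
    """
    return correlation * z_score


# Adjacent-season correlations quoted in the passage above.
R_BATTING = 0.39
R_ERA = 0.25

for z in (-2.0, 0.0, 2.0):
    print(z, shrink_toward_mean(z, R_BATTING), shrink_toward_mean(z, R_ERA))
```

Under these assumptions, a batter two standard deviations above the mean would be predicted at roughly 0.78 standard deviations above the mean the following season.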

Thanks. That gets me started toward further research. -- WeijiBaikeBianji (talk) 03:57, 30 July 2010 (UTC)

## Palaeovia's focus and position

My objections were to the deleted text ("Regression towards everything", quoted above), whose first paragraph is:

Notice that in the informal explanation given above for the phenomenon, there was nothing special about the mean. We could pick any point within the sample range and make the same argument: students who scored above this value were more likely to have been lucky than unlucky, students who scored below this value were more likely to have been unlucky than lucky. How can individuals regress toward every point in the sample range at once? The answer is each individual is pulled toward every point in the sample range, but to different degrees.

Shrinkage was not even remotely implicated in the deleted text. Therefore when I said that the text is utter and unadulterated nonsense, of course I was referring to the deleted text, and not to shrinkage. My objections to the deleted text are (quoting previous postings):

• Notice that in the informal explanation given above for the phenomenon, there was nothing special about the mean. We could pick any point within the sample range and make the same argument
Objection: There are rigorous mathematical reasons for the phenomenon of "regression towards the mean" (RTM), and they are specific to the mean, and you cannot replace "the mean" by "any point in the sample range" in the mathematical argument and still have a correct argument. Informal arguments are not proofs, and only serve to aid understanding. If the word "mean" does not appear in an informal argument, it does not follow that "the mean" plays no role in the rigorous proof.
• students who scored above this value were more likely to have been lucky than unlucky, students who scored below this value were more likely to have been unlucky than lucky. Later amplified version: There isn't one precise definition. The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.
Objection: This is a patent fallacy which I spotted the moment I read it. Consider data-point 15 in the data range [2, 30]. 15 is above 10, so positive error is expected of the data-point 15. 15 is below 20, so negative error is expected of the data-point 15.

What follows the first paragraph in the deleted text (e.g. gravity in the solar system) is pure farce.

Therefore AaCBrown's description of my views as objections to including "shrinkage" in the article is wildly inaccurate.--Palaeoviatalk 04:02, 30 July 2010 (UTC)

This is helpful. I think you accept shrinkage. I'm sorry you thought I was trying to describe your views on the subject, I was trying to ask you to describe them.
Now I think your objection is that you don't think the quoted passage has anything to do with shrinkage. Please note I intend this as a question.
I think you accept that the passage itself is factually true, that is the referenced explanation for RTM did not reference any special property of the mean. So I assume your real objection is to the referenced explanation. Again, I'm asking.
Now I'm left with two possibilities, if I'm right so far. You don't believe the original explanation is correct for RTM, or you don't believe it is supported by the citations. It's true the citations are primarily concerned with shrinkage (that is, for adjusting estimates to account for the effect of regression toward the mean) rather than RTM itself.
If you would tell me which of the above is the right one, I think we can progress. Or if I've made a false deduction, please let me know that. You keep posting the [2, 30] argument which, with all due respect, has been answered twice and in any case is your original research. AaCBrown (talk) 04:16, 30 July 2010 (UTC)

My views are as follows.

• The deleted text asserted that in exactly the same way that measurements exhibit "regression towards the mean", they exhibit "regression towards everything".
We agree.AaCBrown (talk) 14:39, 30 July 2010 (UTC)
• The deleted text is wrong.
This is clear. But what's not clear is why you think so. I claim that the general concept of regression toward the mean led to the development of shrinkage estimators to exploit the phenomenon in estimation, and that modern statistical work on the concept is done under the name of shrinkage or damped regression or ridge regression or other terms. Is this your disagreement? I then claim that the modern statistical work (which actually started with the idea of regression toward zero) places no special emphasis on the mean, that some methods shrink toward other points, and some methods shrink toward multiple points. Is that your disagreement? Or, if you agree with both of those, is it that you feel my write-up of the work did not reflect those facts?AaCBrown (talk) 14:39, 30 July 2010 (UTC)
• The deleted text is "original research".
Again, that is clear. The question is why you don't accept the sources? Are they not authoritative? Not on point? I think I've taken a lot of time to give you multiple sources of varying degrees of rigor, with specific pages and quotes when you asked for them. You insist they are all "bogus" but you don't say why.AaCBrown (talk) 14:39, 30 July 2010 (UTC)
• The deleted text has absolutely nothing to do with "shrinkage".
This suggests that your disagreement is my claim that regression toward the mean is the phenomenon exploited in shrinkage. If you mean that shrinkage is a much more general concept, I agree. The progress of statistics has often been from vague general concepts to precise technical work. That doesn't make the general concept wrong, nor does the precise work replace the general concept, which often has applicability beyond the narrow assumptions required for rigor. This is an article on a general concept, but all the modern work on the subject is done in more precise technical ways.AaCBrown (talk) 14:39, 30 July 2010 (UTC)
• An informal verbal explanation of RTM may help understanding, but there are rigorous mathematical results pertaining to RTM. If such results are specific to "the mean" (and not generalizable to "any point"), then an informal verbal explanation has no business in generalizing RTM from "the mean" to "any point".
There are results specific to the mean, and strong results applicable to the median and mode. There are rigorous results applicable to other points, and to multiple points. The generalization from the mean (in fact, the original concept was "mediocrity" rather than mean) is important for experimental design and for deeper understanding.AaCBrown (talk) 14:39, 30 July 2010 (UTC)
• "Shrinkage" is a highly technical issue concerning estimation, and has no place here.--Palaeoviatalk 05:12, 30 July 2010 (UTC)
"Highly technical" is in the eye of the beholder. Simply stated, shrinkage exploits the tendency of regression toward points to improve estimation. I agree a technical explanation of shrinkage is not appropriate for this article. But you asked for rigorous proofs of the assertions. Those proofs are done today in the context of estimation rather than simply regression. People want to know exactly how much things regress, and toward exactly what points.AaCBrown (talk) 14:39, 30 July 2010 (UTC)

A summary of the debating tactics so far: I have been trying to pin down and debunk "regression to everything", which is utter nonsense. This has been my sole, unambiguous focus throughout. AaCBrown's response has been to interpret everything that I said as referring to "shrinkage", a highly technical estimation technique. This is to create an impression that I am being unreasonable. Am I to understand that this is a distinction so subtle that it is beyond AaCBrown's grasp? Is he (she) not sufficiently familiar with mathematical and statistical discussion? A layman (laywoman)?--Palaeoviatalk 10:45, 30 July 2010 (UTC)

I don't take everything you say as referring to shrinkage. I now believe you accept shrinkage, and also that shrinkage is "toward everything." I think our disagreement is mainly that you see a night and day difference between observing that true values or subsequent observations tend toward something, and estimating the size of that tendency. Unfortunately, I know of no source that rigorously investigates what things regress toward without also considering estimation.AaCBrown (talk) 14:39, 30 July 2010 (UTC)

The fallacy of "regression toward everything", being false, is not the justification for shrinkage estimators. Stein's example provides the motivation for studying them. "Improving Efficiency by Shrinkage" by Marvin Gruber provides the mathematical foundation for them. --Palaeoviatalk 11:15, 30 July 2010 (UTC)

I'm not sure what "justification" means in this context. There are two main reasons shrinkage improves efficiency. The first is the tendency of points to regress toward something; the second is the properties of some loss functions (particularly squared-error). However, the shrinkage papers and books were cited only for their discussions of regressions and their proofs.AaCBrown (talk) 14:39, 30 July 2010 (UTC)

## Ed George's participation

It's true the discussion has become diffused, but it's not a "tactic." I just answer your questions where you post them. I noticed you do not answer my questions. I think we could clear things up much faster if you would do so. I've given you citations for all your questions, but you think they are "bogus." I've offered to get answers to any of your questions from any expert you agree to accept, but you won't specify the question. I can't reasonably ask an important statistician to read the entire thread and offer opinions. If you can put your objections into simple questions, we could solve this quickly. For example:
Is the phenomenon of regression toward the mean specific to the mean or can it be used for median, mode and other points, or even multiple points?
Would that suffice? And whom would you accept?AaCBrown (talk) 14:39, 30 July 2010 (UTC)
My questions to George have been given above.--Palaeoviatalk 19:44, 30 July 2010 (UTC)
That's great. I'll ask him. But I want to be completely clear on this. You want an answer from Ed to, "Is the phenomenon of regression toward the mean specific to the mean or can it be used for median, mode and other points, or even multiple points?" I understand the suspicions you mentioned earlier. Will it suffice if he publishes the question and answer on his home page?AaCBrown (talk) 22:57, 30 July 2010 (UTC)

No. That is not my issue at all. I repost here my previous post:

My points are as follows:
You have attributed (vaguely) the following assertion (concerning regression toward everything, an utter nonsense, and not to be confused with shrinkage) to Charles Stein ("For full disclosure purposes, Ed George was my dissertation advisor, and the material I wrote is based on some work by Ed's dissertation advisor, Charles Stein. It is not original work by me, but I admit one reason I would like it in there is to recognize Charles and Ed."):

The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.

Note: Bold phrases above show that the text plainly specifies that the context is that "the first set of observations are given and known". AaCBrown now requires that the context be changed to "before any observation is made." This is to completely change the assertion. So the unambiguous assertion quoted above, as written by AaCBrown, is a fallacy.

• You have not provided a source for this assertion.
• The assertion (as stated plainly(#), without your later qualification that the data point is yet unknown) is a fallacy. (I suspect that, even with that qualification, it is still false. But I need a precise statement of the theorem, and a source for the proof.)
(#)Your text started with "Take any set of data". This means that the data are given and known. You don't start with no data. Therefore your finessing qualification that the data is yet unknown is not allowed by the text. The assertion is a fallacy, plain and simple.
If you like, please invite Professor George to address the issue here. (Of course I would need to authenticate his identity by emailing him.) In such academic and mathematical matters, I would not accept a "verdict", but would be perfectly open to valid arguments.--Palaeoviatalk 23:09, 30 July 2010 (UTC)

## Formalizing "Regression toward everything": first step (Fallacy: Part Three)

Granting his finessing (evasive) qualification that no measurement has yet been made, I posted the following:

Are you asserting the following?

Let X be a random variable, with any pdf. Let the error E be a random variable with any pdf. A "random point" refers to the random variable X+E. Let [a,b] be the range of X+E.
• A random point below b, i.e. a random point in [a,b), has negative expected error.
• A random point above a, i.e. a random point in (a,b], has positive expected error.
(I see counter-examples.)
Not quite. The statement is about samples, not distributions. See below.AaCBrown (talk) 15:51, 30 July 2010 (UTC)

Whatever you assert, it is a non-trivial mathematical claim, for which we need a precise mathematical statement (I see many subtle points in formulating such a statement. Since you know the material, please provide it), a rigorous proof, and genuine sources. Obviously, such a statement should be free of "shrinkage", "estimator", and all such extraneous concepts. I want a precise mathematical statement of your English text (labelled "fallacy"), no more and no less.

Please discuss the issue here. I don't want to scan several threads for a single issue.--Palaeoviatalk 10:13, 30 July 2010 (UTC)

If you didn't post the same questions in several places, I wouldn't answer them in several places.

## Plain, unambiguous text is not so plain, after all (Fallacy for gifted eighth graders: Part two)

Debunking the fallacy of "regression towards everything" (written by AaCBrown) is now diffused over several sections and several discussion threads. This tactic creates the false impression that I am unable to respond to AaCBrown's postings. To counter that problem, I will address the issue centrally in this section.

The fallacy----------------------------------------------

The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.

Debunking the fallacy-----------------------------------------

Consider data-point 15 in the data range [2, 30]. 15 is above 10, so positive error is expected of the data-point 15. 15 is below 20, so negative error is expected of the data-point 15. A contradiction.
This is an issue that many people have when first encountering this subject. It is addressed rigorously in the citations. The verbal (non-rigorous) explanation is there are factors arguing for the error to be in the direction of different points. Which factors dominate depend on the specifics. Another issue is you interpret everything as a complete information problem. Many applications of regression refer to incomplete information. If all you know is that a point is above 10, that supports some inference. If you know the point is 15, those inferences need not be valid. The point is the same, but your information about it is different. Precisely how you formalize that depends on whether you take a Bayesian or frequentist approach.AaCBrown (talk) 14:54, 30 July 2010 (UTC)
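
The incomplete-information reading in the reply above ("if all you know is that a point is above 10") can be explored with a small simulation. This is a sketch, not something taken from the cited sources, and it assumes a specific setup: a hypothetical set of fixed true values, each observed with independent Gaussian error of mean zero. The names `mean_error_by_side` and `TRUE_VALUES` are illustrative only. It compares the average error of observations known only to lie above a threshold with the average error of those known only to lie below it, for the two thresholds (10 and 20) used in the objection.

```python
import random


def _mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def mean_error_by_side(true_values, threshold, sd=5.0, trials=20000, seed=0):
    """Average error of observations above vs. below a threshold.

    Assumptions (hypothetical): each observation is a uniformly chosen
    true value plus independent Gaussian error with mean 0 and
    standard deviation `sd`.
    """
    rng = random.Random(seed)
    above, below = [], []
    for _ in range(trials):
        x = rng.choice(true_values)
        e = rng.gauss(0.0, sd)
        if x + e > threshold:
            above.append(e)
        else:
            below.append(e)
    return _mean(above), _mean(below)


TRUE_VALUES = [5.0, 10.0, 15.0, 20.0, 25.0]
for c in (10.0, 20.0):
    up, down = mean_error_by_side(TRUE_VALUES, c)
    print(f"threshold {c}: mean error above = {up:.2f}, below = {down:.2f}")
```

In this setup, the group above each threshold shows a positive average error and the group below shows a negative one. The sketch says nothing about a single observation whose exact value (such as 15) is known, which is the case the objection addresses.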

AaCBrown tried to finesse his argument by stating that the assertion is true at the point before (but not after) any measurement is made. However, his text started with "Take any set of data". This means that the data are given and known. We don't start with no data. Therefore his finessing move is not allowed by his text. The assertion is a fallacy, plain and simple.

I think your confusion here is between statements about samples and statements about distributions. Perhaps my writing is not clear. However, whenever I offer a verbal explanation intended to aid understanding, you seem to take it as an attempt at rigorous proof. And when I offer citations to rigorous proofs, you object that they don't help understanding.AaCBrown (talk) 14:54, 30 July 2010 (UTC)
This is what the text says: "Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data (meaning that the sample range is known) . The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error...." Therefore either the text is so poorly phrased and so misleading that it is useless, or that you are being evasive. --Palaeoviatalk 19:57, 30 July 2010 (UTC)
I sympathize with your confusion here, this is a tricky point that baffles many people. For practical work, we have data and try to draw conclusions. But probability statements have no meaning once the data are known (to a Bayesian) or determined (to a frequentist). So we have to state rigorous statements in terms of either prior and posterior belief (Bayesian) or hypothetical infinite future repetitions (frequentist). Once the observations are known/determined, the errors are what they are, no probability statement about them is meaningful. What we can make probability statements about is what might have happened.
Specifying that a point is within the sample range does not imply the range is known. You know the median is in the sample range before the sample is drawn. Again, I understand this stuff can be confusing.AaCBrown (talk) 23:03, 30 July 2010 (UTC)

This is AaCBrown's unambiguous definition of "Regression toward everything":

The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.

Bold phrases above show that the text plainly specifies that the context is that "the first set of observations is given and known". AaCBrown now requires that the context be changed to "before any observation is made." This is to completely change the assertion. (Any well trained mathematical statistician or mathematician would know how vastly different the revised assertion would look. It would be completely different from the quoted text.) So the unambiguous assertion quoted above, as written by AaCBrown, is a fallacy.

Could Charles Stein, an eminent statistician, have been the author of such a patent fallacy (as asserted by AaCBrown)? Who is the author of this fallacy?--Palaeoviatalk 11:28, 31 July 2010 (UTC)

## Let's not forget about sources.

Thanks for the further discussion about issues related to regression toward the mean. I particularly thank participants who are referring to published sources about statistics. Please, let's all remember that Wikipedia doesn't publish original research but must be verifiable. I would be very glad to see editors ponder what reliable sources say and edit the article accordingly. -- WeijiBaikeBianji (talk) 11:59, 31 July 2010 (UTC)

## Charles Stein and Ed George

According to AaCBrown, Edward I. George, Professor of Statistics, Department of Statistics, The Wharton School, University of Pennsylvania is an authority on "regression towards everything." I do not suppose anyone would need to trouble the professor for such an obviously fraudulent claim.--Palaeoviatalk 22:03, 29 July 2010 (UTC)
Ed George is the originator of the idea of simultaneous shrinkage to multiple points. Prior to his dissertation most work on shrinkage concerned finding the correct point to shrink toward (and of course, how much to shrink). While that might be considered "regression toward everything," the phrase is mine, not his. I intended it to mean you have to consider the possibility of regression toward everything, not that estimators should necessarily be adjusted for every possible point (the idea is not absurd, just not one I have done or seen any work on).AaCBrown (talk) 19:22, 1 August 2010 (UTC)
Would you accept his verdict on the question? I'm happy to ask him, or any other authority you would prefer. I think it's best for you to phrase the precise question, because I honestly don't understand your position. Do you think shrinkage is garbage? Or that it has no relation to regression toward the mean? Or something else?AaCBrown (talk) 03:05, 30 July 2010 (UTC)
My points are as follows:
You have attributed (vaguely) the following assertion (concerning regression toward everything, an utter nonsense, and not to be confused with shrinkage) to Charles Stein ("For full disclosure purposes, Ed George was my dissertation advisor, and the material I wrote is based on some work by Ed's dissertation advisor, Charles Stein. It is not original work by me, but I admit one reason I would like it in there is to recognize Charles and Ed."):

The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.

• You have not provided a source for this assertion.
• The assertion (as stated plainly(#), without your later qualification that the data point is yet unknown) is a fallacy. (I suspect that, even with that qualification, it is still false. But I need a precise statement of the theorem, and a source for the proof.)
(#)Your text started with "Take any set of data". This means that the data are given and known. You don't start with no data. Therefore your finessing qualification that the data is yet unknown is not allowed by the text. The assertion is a fallacy, plain and simple.
If you like, please invite Professor George to address the issue here. (Of course I would need to authenticate his identity by emailing him.) In such academic and mathematical matters, I would not accept a "verdict", but would be perfectly open to valid arguments.--Palaeoviatalk 08:44, 30 July 2010 (UTC)
According to the same person, the idea of "regression towards everything" originated with Charles Stein (statistician). --Palaeoviatalk 23:21, 29 July 2010 (UTC)

## Formalization deviating from fallacy (Fallacy: Part 4.1)

AaCBrown's fallacy is as follows:

The idea is simple. Take any set of data, even one that includes completely different measurements like the height of Mount Ranier in feet, the time is takes you to run a mile and the weight in kilograms of the average car. Pick any point within the range of the data. The observations greater than that point are more likely to have been measured with positive error than negative error; the observations below that point are more likely to have been measured with negative error. That requires no assumptions about the distribution of true values or errors, just that there are errors. Therefore, if you take a new set of measurements of the same quantities, the observations that were above the point on the first set are more likely to be smaller on the second set than the first set; and the observations that were below the point on the first set are more likely to be larger on the second set than the first set.

The following is his formulation (for the special case of letting the arbitrary point be b). It requires that the error distribution has "median zero and support over the entire real line":

This is not my formulation. It is from the citation you labeled "bogus." That source has other, more general formulations, as do the other sources. This is the very meaning of RTM in modern statistics. It is in every graduate-level data analysis textbook, and taught in every graduate-level course. Not necessarily this precise formulation, but this idea. Your unfamiliarity with it makes it hard to consider you an authority on RTM.AaCBrown (talk) 18:56, 1 August 2010 (UTC)
Let X = {x_1, x_2, . . ., x_n} be any set of unknown points. Let E = {e_1, e_2, . . .,e_n} be unknown i.i.d draws from a distribution with median zero and support over the entire real line. We observe only Y = {x_1 + e_1, x_2 + e_2, . . ., x_n + e_n}. The minimum value of y is a, the maximum value is b.
Before we observe Y, or if you prefer, before E is drawn, consider i such that y_i = x_i + e_i = b (remember, b is unknown at the point we make this statement). Pr{e_i > 0} > 0.5. Consider all i such that y_i < b. Pr{Sum over i of e_i > 0} < 0.5.
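
The first probability claim in the formulation above (that the observation attaining the maximum b has positive error with probability greater than 0.5) can be checked numerically under extra assumptions. The sketch below assumes Gaussian errors, which have median zero and support over the entire real line, and a hypothetical set of true values; the function name is illustrative. It estimates only the claim about the maximum, not the second claim about the sum of the remaining errors.

```python
import random


def prob_max_has_positive_error(true_values, sd=1.0, trials=20000, seed=0):
    """Estimate Pr{e_i > 0} for the index i attaining the maximum of Y.

    Assumptions (hypothetical): errors are i.i.d. Gaussian with mean 0
    and standard deviation `sd`; `true_values` plays the role of X.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        errors = [rng.gauss(0.0, sd) for _ in true_values]
        observed = [x + e for x, e in zip(true_values, errors)]
        # Index of the largest observed value y_i = x_i + e_i.
        i_max = max(range(len(observed)), key=observed.__getitem__)
        if errors[i_max] > 0:
            hits += 1
    return hits / trials


print(prob_max_has_positive_error([1.0, 5.0, 6.0, 2.0]))
```

The estimate is made over repeated draws of E, that is, before Y is observed, which is the setting the formulation specifies.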

This means that his formulation is not a formalization of his fallacy, in two ways:

• His formulation concerns probabilities before any data is available, whereas the fallacy starts with a set of data, then asserts certain probabilities (or expectations). (See section Plain, unambiguous text is not so plain, after all (Fallacy for gifted eighth graders: Part two) for details.)
This is one of your most basic confusions. To use RTM, you need to start with some data. However, probability statements about the realization are meaningless. Therefore, we posit a random process that generated the realization, and make probability statements about it. This is perhaps the most basic intellectual underpinning of statistics and probability. It's natural to ask, "If I flip a coin ten times and get eight heads, what is the probability it is a fair coin?" But the answer to that has to be 1 or 0, it is fair or it isn't. What I can answer is "What is the probability that a fair coin will come up heads eight or more times out of ten?" You claim that's "going back prior" to the data. Yes, it is. That's what statistics does.AaCBrown (talk) 18:56, 1 August 2010 (UTC)
• His formulation requires that the error distribution has "median zero and support over the entire real line".
If you would read the citations instead of insisting they are bogus, you would find many more general proofs.AaCBrown (talk) 18:56, 1 August 2010 (UTC)
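
The coin question posed in the reply above, "What is the probability that a fair coin will come up heads eight or more times out of ten?", has a direct binomial answer. The sketch below assumes independent flips; the function name `prob_at_least` is illustrative.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# (45 + 10 + 1) / 1024 outcomes have eight or more heads.
print(prob_at_least(8, 10))  # 0.0546875
```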

This means that even if his formulation (in its general form, replacing b by an arbitrary point) can be proven, the fallacy remains a fallacy. --Palaeoviatalk 17:38, 1 August 2010 (UTC)

Here you have a point. The non-rigorous, verbal form did not mention the specific assumptions required for RTM to be valid. It just said it sometimes is and sometimes isn't. When you (not me) demand a rigorous formulation, specific assumptions have to be made. Therefore, the rigorous version will never completely support the verbal version. I think the verbal version is appropriate for Wikipedia, with references to the rigorous proofs for those who want them. AaCBrown (talk) 18:56, 1 August 2010 (UTC)

For anyone confused by AaCBrown, the distinction I make is between the following two kinds of questions:

• Given a set of data {1,5,6,2}, what is the probability that the data 6 has positive error?
• I have not seen the data set. What is the probability that the largest number in the set has positive error?

These are the kinds of questions under discussion here.

Both are meaningful questions. AaCBrown seems to believe that probability statements of the first kind are "meaningless" ("However, probability statements about the realization are meaningless").

I fear that he has not even grasped the basic distinction which I thought he introduced to salvage his fallacy. Perhaps he did not mean to introduce the distinction at all, and I helped in trying to salvage his fallacy. What a muddle! --Palaeoviatalk 23:37, 1 August 2010 (UTC)

## "Proof"(?) (Fallacy: Part five)

The proof of my statement is self-evident. I will give it verbally, but it can be put into symbols. Let x_k be the maximum of the set X (it need not be unique). There is at least 0.5 probability that e_k >= 0 (since the median of the distribution that generated E is 0). If that happens, it is guaranteed that for i such that y_i is the maximum point of Y (again, it need not be unique), e_i >= 0. Either k = i, in which case we know e_i >= 0, or x_i <= x_k (since x_k is the maximum of X). In the latter case, we know x_i + e_i >= x_k + e_k, so e_i >= (x_k - x_i) + e_k; both of the terms on the right are >= 0, so e_i >= 0.
If e_k < 0, then there is still some probability that e_i >=0. Consider any point x_j, j not equal to i. Since the distribution that generated E has support over the entire real line, there is some probability that e_j > x_i - x_j, in which case, as above e_i >=0.
0.5 plus some additional amount is > 0.5.
This is not hard stuff. I'm really baffled why you refuse to accept it.AaCBrown (talk) 16:03, 31 July 2010 (UTC)
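The claim about the largest observation can be checked numerically. The following is a minimal simulation sketch, not part of the original argument: it assumes standard normal errors (one distribution with median zero and support over the whole real line), and the true values passed in are hypothetical. The function name `prob_max_error_positive` is likewise an illustrative choice.

```python
import random


def prob_max_error_positive(x, trials=20000, seed=0):
    """Estimate Pr(e_i > 0), where i indexes the largest observed y_i = x_i + e_i."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: i.i.d. standard normal errors (median zero, full support).
        errors = [rng.gauss(0.0, 1.0) for _ in x]
        observed = [xi + ei for xi, ei in zip(x, errors)]
        # Index of the maximum observed value in this replication.
        i = max(range(len(x)), key=observed.__getitem__)
        hits += errors[i] > 0
    return hits / trials


# Hypothetical true values, used only for illustration.
estimate = prob_max_error_positive([1.0, 5.0, 6.0, 2.0])
print(estimate)
```

Under these assumptions the estimate should come out above 0.5, in line with the verbal argument; other error distributions or true values would give different magnitudes.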

Of course I could not have refused to accept your proof before I have even seen it. That is ridiculous. In fact I accept your partial proof.

I would have agreed with your first phrase, but in fact you used words including "crank," "crackpot," "bogus," and even dreamed up a biography for me without seeing it. You removed the section on the topic over my objections without even asking for it.AaCBrown (talk) 18:56, 1 August 2010 (UTC)

You have proved that: "Consider i such that y_i = x_i + e_i = b (remember, b is unknown at the point we make this statement). Pr{e_i > 0} > 0.5."

You have not proved that: "Consider all i such that y_i < b. Pr{Sum over i of e_i > 0} < 0.5". I do not accept this statement as yet.

In fact, I would like to see a proof of the assertion's general form, obtained by replacing the maximum b by a general value in [a, b].--Palaeoviatalk 21:24, 31 July 2010 (UTC)

It is in the citations, and it is as trivially obvious as the first proof. In fact, it's the same argument and I can do better, I can prove it for every point, not just the sum. However, again, I'm just reporting from the sources, none of this is my original work. Consider any point c in the range [a,b]. Consider any j. Unconditionally, e_j is equally likely to be positive or negative. If x_j > c, if e_j > 0 then Pr(y_j > c) = 1. If e_j < 0 then Pr(y_j > c) < 0.5. If x_j < c, if e_j > 0 then Pr(y_j > c) > 0 and if e_j < 0 then Pr(y_j > c) = 0. So either way, for y_j > c, Pr(e_j > 0) > Pr(e_j < 0), so Pr(e_j > 0) > 0.5.AaCBrown (talk) 19:04, 1 August 2010 (UTC)
Thank you, AaCBrown, for showing the steps of the proof. -- WeijiBaikeBianji (talk) 19:07, 1 August 2010 (UTC)
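The per-point statement in the argument above (for a fixed x_j and a fixed c, the error is more likely positive than negative given y_j > c) can be illustrated with a small simulation. This is a sketch under stated assumptions only: standard normal errors, hypothetical values of x_j and c, and an illustrative function name `prob_error_positive_given_above`.

```python
import random


def prob_error_positive_given_above(x_j, c, trials=50000, seed=1):
    """Estimate Pr(e_j > 0 | x_j + e_j > c) for a fixed true value x_j."""
    rng = random.Random(seed)
    above = 0      # replications in which the observed value exceeds c
    positive = 0   # of those, replications in which the error is positive
    for _ in range(trials):
        e = rng.gauss(0.0, 1.0)  # assumption: standard normal error
        if x_j + e > c:
            above += 1
            positive += e > 0
    if above == 0:
        raise ValueError("no replication had y_j > c; increase trials")
    return positive / above


# Hypothetical values: one true value below c, one above c.
print(prob_error_positive_given_above(1.0, 1.5))  # x_j < c
print(prob_error_positive_given_above(2.0, 1.5))  # x_j > c
```

When x_j < c, every observation above c must have a positive error, so the estimate is exactly 1. When x_j > c, the conditional probability is 0.5 divided by Pr(e_j > c - x_j), which exceeds 0.5 whenever the error distribution has support over the whole real line.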

## Palaeovia's views on shrinkage, crackpot, etc

Since this has dragged on for a while, I have formed some opinion. Perhaps what we have here is a former or current Wharton School MBA student (of finance), with no rigorous mathematical training (not being able to see through an obvious mathematical fallacy), and a superficial knowledge of "shrinkage estimators" (making some plausible sounding statements), gleaned from George's class (knowing that Stein was George's thesis advisor) at Wharton. If such is the case, he is out of his depth. I assume mathematical common sense and mathematical maturity here.

If he simply cannot see that "regression toward everything" is a fallacy, then mathematical reasoning is a closed door to him. If such a person does original "mathematical research," we have a crackpot. --Palaeoviatalk 13:15, 30 July 2010 (UTC)

My pursuit is of truth and accuracy.--Palaeoviatalk 20:30, 30 July 2010 (UTC)
A succinct summary of my views follows:
• RTM is a commonly observed phenomenon. Mathematical results explain its common occurrence (in the context of Gaussian random variables). Of course, peculiar population distributions distort or invalidate RTM. You can probably construct examples exhibiting regression to 2*mean (and not RTM). Does this prove that "regression to twice the mean" is as prevalent as RTM? Of course not. It occurs in very special cases.
Here is one point of disagreement. The Gaussian case, especially the multivariate Gaussian case, is almost never observed with real data. Moreover, since the mean, median and mode are the same for the Gaussian, it would be just as correct to claim it as proof that the true phenomenon is regression to those points. Also, it's not enough that you have bivariate Gaussian samples, you need identical population and independence among results. These assumptions are so restrictive as to be useful only as classroom exercises.
Would you tell someone to ignore RTM because her data are bimodal, or skewed or otherwise clearly from a non-Gaussian distribution? Or tell her to shut up and assume there will be regression toward the mean? Or is she not allowed to come to Wikipedia for guidance?AaCBrown (talk) 22:29, 30 July 2010 (UTC)
• What about simultaneous "regression to every point" under all conditions? Is it even meaningful? Is it an empirical, observable phenomenon? No. Then it must be a mathematical result. It seems to be, with a simple verbal statement. But, the informal statement turned out to be an obvious fallacy. Now there is an attempt to finesse the fallacy into a formal mathematical claim. Let's see.
Of course it's observable empirically. Take any set of measured data. Pick any point within the range. Remeasure and count how many new measurements moved in the direction of the fixed point and how many moved away. Repeat enough times for statistical reliability. Discover that there is regression toward the fixed point. It is also a mathematical result.AaCBrown (talk) 22:32, 30 July 2010 (UTC)
Let's have a reputable source that reports simultaneous regression towards every point under all conditions (no assumptions of population distribution, error distribution). The phenomenon in question is "regression toward everything", not "shrinkage", or any other extraneous, irrelevant things.--Palaeoviatalk 12:26, 31 July 2010 (UTC)
There is no source that says this, it's a straw man you keep bringing up. The rigorous proofs all make assumptions. That does not invalidate the general principle that you can make sound inferences about unmeasured errors based on how a subset is created. If you select the highest scorers on a test, you're more likely to have people who scored above their abilities than below. You can put assumptions around that statement to make it rigorous, but in practice you rarely know whether the assumptions are true. Therefore the general principle is valuable by itself, as are commonsense warnings about how it can be misapplied. That's all anyone has ever suggested. You're the only one who imagines the rest.
And shrinkage is not unrelated. Shrinkage is the statistical attempt to exploit RTM for improved estimation. Your refusal to accept sources that mention shrinkage makes things difficult. The basic data analysis texts discuss the two things together. The best I can do is suggest you get a copy of Mosteller and Tukey's Data Analysis and Regression, and read the section on Regression Toward the Mean. You will find everything you ask for in there, with (I believe) no mention of shrinkage, as it was written in 1977 and shrinkage did not get big until the early 1980s. You will find a discussion of adjusting estimates for the effect, however.AaCBrown (talk) 19:16, 1 August 2010 (UTC)
• If "shrinkage" is based on "regression toward every point", where is the statement and proof of "regression toward every point"?
Shrinkage is not based on regression toward every point. Shrinkage is a general name for a set of techniques for regressing data points toward some fixed point, or set of fixed points. No shrinkage estimator shrinks toward everything, but every point in the range (and some outside the range) is a potential fixed point. Constructing a shrinkage estimator means choosing the point or points to regress toward, and determining how much to regress.AaCBrown (talk) 22:36, 30 July 2010 (UTC)
• In estimation, you might want to shrink the observations towards any given point under certain conditions for the sake of producing estimates with certain statistical properties, but how does this relate to simultaneous "regression to every point" under all conditions? (Under certain conditions, shrinking to certain point(s) might produce "better" (to be rigorously defined) estimates. Any mathematical results seem to be deeply embedded in the context of "shrinkage estimation". There seems to be no theorem of "regression toward everything", independent of "shrinkage estimation".) --Palaeoviatalk 22:01, 30 July 2010 (UTC)
This is all correct. No shrinkage estimator shrinks toward everything, although I think it would be possible in theory. You would integrate over all points in the range. Perhaps you would like the section better if I had titled it, "Regression toward anything"?AaCBrown (talk) 22:39, 30 July 2010 (UTC)

## Positive statement of regression toward more than the mean

Like Palaeovia, I feel the discussion has become fragmented. I have been answering questions, mostly about a host of issues I consider marginally relevant. I don't think RTM is a precise mathematical theorem, although there are precise theorems that reflect the basic insight. I think RTM is a general principle important to experimental design. The mathematical proofs in the citations are things Palaeovia insists on, not what I consider appropriate to include in the Wikipedia article. It is the discussions in the less-technical citations, the popular explanations, that I had objected to removing.

Suppose you want to measure the effectiveness of an SAT prep service. You start with the population of all students in a school district who took the SAT for the first time on Date D1. I think we all agree that if you select the 1% lowest scorers for the service, you are likely to see an increase in average score upon retake on date D2, even if the service is worthless. The scores will regress toward the mean even absent any effect. Similarly, if you select the highest scorers for the service, you may see reduced average scores, even if the service is useful.

This is not a rigorous mathematical statement. It might not be true. For example, if students tend to improve just from the experience of taking the test, that could overwhelm the RTM effect for the top scorers. There are many other reasons, theoretical and practical, why the effect may not turn out as expected. Nevertheless, understanding RTM is essential for anyone designing or interpreting experimental results.
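The selection effect in the SAT example can be sketched with a simulation in which the prep service has no effect at all. Everything numerical here is a hypothetical assumption made for illustration: normally distributed ability (mean 500, standard deviation 100), independent normal measurement noise (standard deviation 50) on each sitting, and the function name `mean_retest_change`.

```python
import random
import statistics


def mean_retest_change(lowest=True, n=20000, fraction=0.01, seed=2):
    """Mean change from test D1 to test D2 for the extreme scorers on D1.

    No treatment effect is modelled: both scores are ability plus fresh noise.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(500.0, 100.0) for _ in range(n)]   # hypothetical scale
    first = [a + rng.gauss(0.0, 50.0) for a in ability]     # score on date D1
    second = [a + rng.gauss(0.0, 50.0) for a in ability]    # score on date D2
    k = max(1, int(n * fraction))
    # Rank students by their D1 score and keep the k most extreme ones.
    ranked = sorted(range(n), key=first.__getitem__, reverse=not lowest)
    chosen = ranked[:k]
    return statistics.mean([second[i] - first[i] for i in chosen])


print(mean_retest_change(lowest=True))   # lowest 1% on D1
print(mean_retest_change(lowest=False))  # highest 1% on D1
```

Under these assumptions the lowest scorers improve on average and the highest scorers decline on average, although nothing was done to either group. Adding a practice effect or a different score distribution to the model could change the picture, as the caveats above note.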

The problem with emphasizing the mean is that it suggests that if we take any sample of students with mean score above the population mean, we should expect their average score to decline upon retest. That may not be true. Suppose there is a subpopulation of students who don't care about their results, who leave the answer sheet blank or answer randomly; and those students bring down the average by 20 points. If we select a group of students with scores high enough to essentially guarantee they do not belong in this group, it is more reasonable to assume their scores will regress to the mean of the serious students, not to the overall mean. Or suppose the distribution is bimodal, with some high scores from students in a private school for gifted students. We would expect a student from this school to regress toward the mean of her school, not the overall population mean. If we didn't know the school, we might still expect a high scorer to regress toward the high mode rather than the population mean.

None of this is rigorous in the sense it is mathematically certain. All of it can be made rigorous by specifying assumptions about distributions and measurements. People interested in the rigor should study shrinkage and related techniques. But everyone should be aware of the general principles. It is Palaeovia who keeps insisting on rigorous statements and proofs, and who thereby brought shrinkage into the discussion. The original section in the Wikipedia article cited Mosteller and Tukey, Data Analysis and Regression, 1977, which discusses the subject in data analysis terms. This is what I think belongs in the Wikipedia article. The goal is to help people design experiments and understand data, not to teach mathematical statistics.

AaCBrown (talk) 17:00, 30 July 2010 (UTC)

I have found nothing to disagree with in the second, third and fourth paragraphs above. They belong in a general discussion of RTM. However, I am certain, after a lengthy exchange with AaCBrown, that simultaneous "regression toward everything" is original research and a patent fallacy. I oppose any mention in the article of such nonsense.
I am not convinced of the correctness of the statement: "People interested in the rigor should study shrinkage and related techniques." I need a credible source that links RTM to "shrinkage" in an organic, substantive way. You don't get such a source by googling "RTM shrinkage" and citing any paper that you find. If AaCBrown agrees not to mention shrinkage, I guess the problem does not arise.
In summary, if "regression towards everything" and "shrinkage" (including links to "shrinkage") are excluded, then I have at this point no objections to including in the article texts similar to the second, third and fourth paragraphs above. --Palaeoviatalk 05:37, 1 August 2010 (UTC)
Whew! That was a lot of work. Unfortunately, my real issue is getting someone to rewrite the entire article. Adding a section to regression toward points other than the mean, and to multiple points, wouldn't improve the overall organization and clarity. With all due respect, and I sincerely respect the work you have put into this and your passion to exclude error, I think you're getting in the way of improvement. I think you're intemperate, and create fights instead of consensus. I think at this point you're more concerned with showing you were right than learning or compromising.
There's no shame in any of that. You fought the good fight. Now I think it's time for you and me to move on. There are plenty of other articles that can use your knowledge and effort. On this one, you and I are too battle-scarred to help any more. Someone will come along, see the prior discussion, and fix things without acrimony.AaCBrown (talk) 19:16, 1 August 2010 (UTC)

## Intelligence Citations Bibliography for Articles Related to IQ Testing

You may find it helpful while reading or editing articles to look at a bibliography of Intelligence Citations, posted for the use of all Wikipedians who have occasion to edit articles on human intelligence and related issues. I happen to have circulating access to a huge academic research library at a university with an active research program in these issues (and to another library that is one of the ten largest public library systems in the United States) and have been researching these issues since 1989. You are welcome to use these citations for your own research. You can help other Wikipedians by suggesting new sources through comments on that page. It will be extremely helpful for articles on human intelligence to edit them according to the Wikipedia standards for reliable sources for medicine-related articles, as it is important to get these issues as well verified as possible. -- WeijiBaikeBianji (talk) 20:24, 2 August 2010 (UTC)

## "Regression toward everything": mathematical formalization (Fallacy: Part Four)

It's not me who's asserting it, I am summarizing other work. There are many precise mathematical statements underlying this general principle. As I've noted before, we disagree about the nature of RTM. I think it is a general principle that can be applied without precise assumptions, or can be embedded into a precise argument. You seem to believe it is a specific theorem.
Here is one formalization. Let X = {x_1, x_2, . . ., x_n} be any set of unknown points. Let E = {e_1, e_2, . . .,e_n} be unknown i.i.d draws from a distribution with median zero and support over the entire real line. We observe only Y = {x_1 + e_1, x_2 + e_2, . . ., x_n + e_n}. The minimum value of y is a, the maximum value is b.
Before we observe Y, or if you prefer, before E is drawn, consider i such that y_i = x_i + e_i = b (remember, b is unknown at the point we make this statement). Pr{e_i > 0} > 0.5. Consider all i such that y_i < b. Pr{Sum over i of e_i > 0} < 0.5.
Does that suffice? Remember, I'm just trying to answer your questions about regression in general, I don't think anything like this belongs in the article. Also, this is just one example, there are many other formalizations of this general principle.AaCBrown (talk) 15:51, 30 July 2010 (UTC)
I believe that I have repeatedly made myself perfectly clear what my objections are. I am not going over the same grounds once more. It is a waste of time. My sole focus now is the rigorous mathematical formulation of your mathematical claim.
Firstly, you have made a claim that requires a proof. It is not a shrinkage technique. This is an elementary distinction between a mathematical claim and a statistical technique that I make.
To simplify matters, let n=1. (The usual context for RTM). Let Y=X+E. To generalize, let c be any real (E has support over the Real. Y is not yet known.) The claims are:
• If Y>c, Pr(E>0) > .5
• If Y<c, Pr(E>0) < .5
Are you in fact asserting P(E>0|Y>c)>.5 and P(E>0|Y<c)<.5 ? If not, why not?
Do you agree that these are the claims? Please define your claims rigorously so that there is no wiggle room.--Palaeoviatalk 19:28, 30 July 2010 (UTC)
After settling the exact formulation of your claims, I intend to pursue the following:
• How do you prove them?
• I'll possibly disprove the claims by counter-examples.
• In the context of RTM, y is known, What is the relevance of your claims once y is known?--Palaeoviatalk 19:49, 30 July 2010 (UTC)
This is a bit frustrating. You asked for a rigorous mathematical formulation, which is not what I want to put on the page. I give you one, and you rewrite it into nonsense. You can't make n = 1. If you do there is only one observation (y_i) and a=b=c=y_1. There is no mean to regress toward. It is impossible for y_1 to be greater or less than c.
Y, X and E are sets. You can't add them or compare them to real numbers. c cannot be any real number, it has to be within the range of Y. X is not a random variable, it's a set of points. E is not a random variable, it is the realization of a random variable. I'm not being evasive here, these things matter. What you've written makes no sense at all. What I wrote was, I think, clear.AaCBrown (talk) 22:45, 30 July 2010 (UTC)

I'm sorry. I assumed the usual convention (in statistics) that X,Y,E are random vectors. I then reduced them to random variables (letting n=1). I am now re-interpreting your text.--Palaeoviatalk 23:33, 30 July 2010 (UTC)

I understand now. It is a much simpler formulation than I had assumed. Now, how do you prove it?--Palaeoviatalk 23:46, 30 July 2010 (UTC)

That does not make sense. Even if X, Y and E are vectors, despite clear contrary definitions in a four-sentence statement, you cannot use the operations on them that you did. What does it mean for a vector to be greater than a constant? And what does it mean when you say a random variable is greater than a constant (not the probability of that occurring, the naked statement)? What is the support of a vector, and if it had one, how can it be in the real numbers? Letting n=1 makes regression to the mean meaningless. I'm not trying to pick a fight here, it's that I'm genuinely baffled at what you wrote. Are you just making fun? Can you explain any possible interpretation of your statements that is not mathematical nonsense? I know you don't like to answer questions, but after writing something that bizarre, some explanation of what you were thinking or what you intended would help the discussion.AaCBrown (talk) 16:03, 31 July 2010 (UTC)

This is what I wrote:

Let Y=X+E. To generalize, let c be any real (E has support over the Real. Y is not yet known.) The claims are:
• If Y>c, Pr(E>0) > .5
• If Y<c, Pr(E>0) < .5

I intend X, E, Y to be random variables, not random vectors. With this clarification, I now address this question of yours: " And what does it mean when you say a random variable is greater than a constant (not the probability of that occurring, the naked statement)?"

This still doesn't make sense. The formulation said X, E and Y were sets of real numbers. E was created by i.i.d. draws from a distribution with support over the real line. You said you thought they were random vectors because of "the usual convention." There is no "usual convention" for the letters, they mean what the proof defines them as. And it doesn't make sense to talk about a random vector being greater or less than a real number. Now you say you thought they were random variables, by which you mean univariate variables. That explains how you can write Pr(E>0) but not what this can possibly have to do with RTM, which requires at least two observations.AaCBrown (talk) 18:56, 1 August 2010 (UTC)
The following example can be found in Random variable:
${\displaystyle X={\begin{cases}1,&{\text{if a 1 is rolled}},\\2,&{\text{if a 2 is rolled}},\\3,&{\text{if a 3 is rolled}},\\4,&{\text{if a 4 is rolled}},\\5,&{\text{if a 5 is rolled}},\\6,&{\text{if a 6 is rolled}}.\end{cases}}}$
This is one of your fundamental misunderstandings. I grant that it is an easy and common one to make, the distinction between a random variable and its realization. This is why you think I am evasive about things like when measurements are made. Before the die is rolled, we can let X represent the unknown number that will come up. X is a random variable, and we can make statements about it like Pr(X > 3) = 0.5 or EV(X) = 3.5. Once the die is rolled we have a realization of X, say 5. 5 is not a random variable, and it is not X. Pr(5 > 3) = 1 and EV(5) = 5. The statement above refers to realizations of a random variable.AaCBrown (talk) 18:56, 1 August 2010 (UTC)
• In exactly the same way that X=3 (an event) in this example is understood, X>4 (an event), being the union of the events X=5 and X=6, is understood. It is also exactly how Pr(X=3), Pr(X>4) are understood. "If Y>c" means "If the event Y>c occurs". (Note: For elucidation of this issue, please see subsection Notation of Probability Events below.)
You're wrong there. X = 3, by itself, refers to a realization of a random variable. Pr(X = 3) = 1/6 is a statement about a random variable. The two X's cannot be the same. I admit a lot of non-technical sources confuse this point, and it seldom leads to serious confusion. But for the rigorous proof you demanded, it's night and day.AaCBrown (talk) 18:56, 1 August 2010 (UTC)
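The numerical statements made about the die in this exchange can be reproduced exactly. The sketch below treats X as the outcome of one roll of a fair six-sided die; the helper name `prob` is hypothetical and used only for illustration.

```python
from fractions import Fraction

FACES = range(1, 7)        # possible outcomes of one roll
P_FACE = Fraction(1, 6)    # assumption: the die is fair


def prob(event):
    """Probability that the roll satisfies the predicate `event`."""
    return sum(P_FACE for face in FACES if event(face))


expected_value = sum(face * P_FACE for face in FACES)

print(prob(lambda x: x > 3))   # 1/2
print(prob(lambda x: x == 3))  # 1/6
print(prob(lambda x: x > 4))   # 1/3
print(expected_value)          # 7/2
```

These are all statements about the roll before it happens; once a particular face has come up, there is nothing left to compute a non-trivial probability for.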

Note: For elucidation of this issue, please see subsection Notation of Probability Events below.--Palaeoviatalk 10:58, 2 August 2010 (UTC)

(As I said, I hastily assumed that in your formulation, X,E,Y were random vectors. With this assumption, your formulation was full of errors, but I tried to interpret it, despite them. Of course I apologize for my error in interpretation.)--Palaeoviatalk 22:49, 31 July 2010 (UTC)

Now we're back to random vectors. But there's no error in my formulation if you assume this. I said X was a set, you can generate it as a random vector if you want. However, the randomness in the generation of X plays no part in RTM, only the randomness in the generation of E. I defined them as sets, but you can order them into random vectors, it makes no difference to the proof.AaCBrown (talk) 18:56, 1 August 2010 (UTC)

In case anyone needs further elucidation,

• If Y>c, Pr(E>0) > .5
• If Y<c, Pr(E>0) < .5

means the following: Let y be an observed value (mathematicians say: Let Y=y). y=x+e, where x is the true value, e is error. I don't know the values of x and e. If y>c, then the probability Pr(E>0) that e>0 is greater than .5. And I think you know the rest.

Okay, then using my original notation, you meant to say, "If y_i > c then Pr(e_i>0)>0.5". But that's exactly what I did write. If you were trying to say the same thing as me, why did you replace the realization y_i with Y, whether you thought it was a random vector or a random point, and e_i with E?AaCBrown (talk) 18:56, 1 August 2010 (UTC)

Mathematicians need a succinct language for communication. It is perfectly rigorous and precise. Of course it mystifies and frustrates the layperson. --Palaeoviatalk 02:45, 1 August 2010 (UTC)

So your claim is that your rewriting of my proof was perfectly rigorous and precise, and I am mystified and frustrated because I am a layperson? Are you serious?AaCBrown (talk) 18:56, 1 August 2010 (UTC)

If someone uses the phrase "support over the entire real line", I assume certain mathematical knowledge, and do not expect him to be baffled by what I wrote above. It is a case of mixed signals regarding his level of mathematical knowledge and maturity.--Palaeoviatalk 03:21, 1 August 2010 (UTC)

I wrote that the distribution generating the e_i's had support over the real line. A distribution can have support. A real number or a vector of real numbers cannot (and even if they could, the support of a vector could not be the real line). What could you possibly have meant by saying the realization of a vector had support over the real line? Or are you again claiming that I fail to understand only due to my lack of mathematical knowledge?AaCBrown (talk) 18:56, 1 August 2010 (UTC)

Could anyone have understood support (measure theory) without understanding what "Y>c" means? I assume that "support over the entire real line" was lifted without comprehension.--Palaeoviatalk 17:14, 1 August 2010 (UTC)

It is pointless to pursue any such arguments beyond two rounds. I have stated my case clearly, and will let others come to their own conclusions.--Palaeoviatalk 22:37, 1 August 2010 (UTC)

### Notation of Probability Events

This is simply to show that what I said about "X>4" being an event is the common usage among mathematicians.

From Event (probability theory), we have:

A note on notation

Even though events are subsets of some sample space Ω, they are often written as propositional formulas involving random variables. For example, if X is a real-valued random variable defined on the sample space Ω, the event

${\displaystyle \{\omega |u<X(\omega )\leq v\}}$

can be written more conveniently as, simply,

${\displaystyle u<X\leq v.}$

This is especially common in formulas for a probability, such as

${\displaystyle P(u<X\leq v)=F(v)-F(u).}$
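The quoted convention, in which a formula such as u < X ≤ v is shorthand for the set of outcomes ω with u < X(ω) ≤ v, can be made concrete on a finite sample space. The sketch below uses one roll of a fair die as a hypothetical example and checks the standard identity P(u < X ≤ v) = F(v) − F(u), where F is the cumulative distribution function of X. The names `OMEGA`, `X`, `P` and `F` mirror the notation in the quotation; the values of u and v are arbitrary.

```python
from fractions import Fraction

OMEGA = [1, 2, 3, 4, 5, 6]   # sample space: outcomes of one fair die roll


def X(omega):
    """The random variable, written as a function on the sample space."""
    return omega


def P(event):
    """Probability of an event, i.e. of a subset of OMEGA (equally likely outcomes)."""
    return Fraction(len(event), len(OMEGA))


def F(t):
    """Cumulative distribution function F(t) = P(X <= t)."""
    return P({w for w in OMEGA if X(w) <= t})


u, v = 2, 5
event = {w for w in OMEGA if u < X(w) <= v}   # the event written as u < X <= v
print(sorted(event), P(event), F(v) - F(u))   # [3, 4, 5] 1/2 1/2
```

Here the propositional formula and the subset of the sample space are two notations for the same object, which is the point of the quoted note.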

I don't know what weird notions, such as those advocated by AaCBrown, have been propagated, or misunderstood by him, in business schools. But this is the accepted usage in the world of mathematics.

"A little learning is a dangerous thing; drink deep, or taste not the Pierian spring: there shallow draughts intoxicate the brain, and drinking largely sobers us again."--Palaeoviatalk 10:54, 2 August 2010 (UTC)