# Talk:Regression toward the mean/Archive 1

## old comment

This topic (the article + this discussion) reads like a mad grad school breakdown. If you please, would someone who has a good grasp of Regression Toward the Mean write an explanation based on ONE COGENT EXAMPLE that reveals unambiguous data, processing steps, results. The audience is dying to know what regression means to them. What is needed is an actual dataset and walkthrough to illustrate the concept. You know, narrate Galton's height experiment, that would be wildly appropriate. Think of your readers as high schoolers stuck with a snotty textbook who want some mentoring on this subject AT THEIR LEVEL. They'll get a kick out of it if you can make it mean something to them, otherwise they'll drop out and live in shipping containers with Teener-Kibble for sustenance. This is, after all, a topic that only first-year stats students should still be grappling with, yes? And of course it is Wikipedia.--24.113.89.98 05:24, 23 January 2007 (UTC) qwiki@edwordsmith.com

## Real Data

I have added a real analysis of Francis Galton's data to better illustrate the law of regression. --Puekai (talk) 08:30, 11 July 2008 (UTC)

I'm not sure this page explains "regression to the mean" very well.

I agree; it's lousy. Michael Hardy 23:26, 2 Feb 2004 (UTC)
The first time I read it, I thought it was lousy. The second time I read it, it was closer to mediocre.

F. Galton's use of the terms "reversion" and "regression" described a certain, specific biological phenomenon, and it is connected with the stability of an autoregressive process: if there is not regression to the mean, the variance of the process increases over time. There is no reason to think that the same or a similar phenomenon occurs in, say, scores of students, and appealing to a general "principle of regression to the mean" is unwarranted.

I completely disagree with this one; there is indeed such a general principle. Michael Hardy 23:26, 2 Feb 2004 (UTC)

I guess I could be convinced of the existence of such a principle, but something more than anecdotes is needed to establish that.

Absolutely. A rationale needs to be given. Michael Hardy 23:26, 2 Feb 2004 (UTC)

Regression to the mean is just like normality of natural populations: maybe it's there, maybe it isn't; the only way to tell is to study a lot of examples.

No; it's not just empirical; there is a perfectly good rationale.

I'll revise this page in a week or two if I don't hear otherwise; the page should summarize Galton's findings,

I don't think regression toward the mean should be taken to mean only what Galton wrote about; it's far more general. I'm really surprised that someone who's edited a lot of statistics articles here does not know that there is a reason why regression toward the mean is widespread, and what the reason is. I'll return to this article within a few days. Michael Hardy 23:26, 2 Feb 2004 (UTC)

connect the biological phenomenon with autoregressive stability, and mention other (substantiated) examples. Wile E. Heresiarch 15:00, 2 Feb 2004 (UTC)

In response to Michael Hardy's comments above --

1. Perhaps I overstated the case. Yes, there is a class of distributions which show regression to the mean. (I'm not sure how big it is, but it includes the normal distribution, which counts for a lot!) However, if I'm not mistaken there are examples that don't, and these are by no means exotic.
2. There is a terminology problem here -- it's not right to speak of a "principle of r.t.t.m." as the article does, since r.t.t.m. is a demonstrated property (i.e., a theorem) of certain distributions. "Principle" suggests that it is extra-mathematical, as in "likelihood principle". Maybe we can just drop "principle".
3. I had just come over from the Galton page, & so that's why I had Galton impressed on my mind; this article should mention him but need not focus on his concept of regression, as pointed out above.

regards & happy editing, Wile E. Heresiarch 22:57, 3 Feb 2004 (UTC)

It's nothing to do with Normality - it applies to all distributions.

Johnbibby 22:11, 12 December 2006 (UTC)

--

The opening sentence "of related measurements, the second is expected to be closer to the mean than the first" is obviously wrong.Jdannan 08:17, 15 December 2005 (UTC)

Small change to the historical background note.

## Principle of Regression

I agree that the "principle" cannot hold for all distributions, but only a certain class of them, which includes the normal distributions. I think R. A. Fisher found an extension to the case where the conditional distribution is Gaussian but the joint distribution need not be. In any case, in the section on "Mathematical Derivation", it should be made clear that the specific *linear* regression form E[Y|X]=rX is valid only when Y and X are jointly Gaussian. Of course there are some other examples such as when Y and X are jointly stable but that is another can of worms. The overall question might be rephrased: given two random variables X and Y of 0 mean and the same variance, for what distributions is |E[Y|X]| < |X| almost surely?

I will make some small edits to the "mathematical derivation" section.

—Preceding unsigned comment added by Rder (talkcontribs) 04:24, 15 July 2005 (UTC)
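As a numerical aside (my own sketch, not part of the article): for jointly Gaussian X and Y with zero means, unit variances and correlation r, the least-squares slope estimated from a large sample comes out close to r, and since |r| < 1 the prediction E[Y|X] = rX is always pulled toward the mean:

```python
import numpy as np

# Sketch: sample a bivariate Gaussian with correlation r and check that
# the least-squares slope of Y on X is close to r. With |r| < 1, the
# conditional mean r * x is always closer to zero (the mean) than x is.
rng = np.random.default_rng(0)
r, n = 0.6, 200_000
cov = [[1.0, r], [r, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

slope = np.cov(x, y)[0, 1] / np.var(x)
print(round(slope, 2))  # close to r = 0.6
```

This is only the jointly Gaussian case, of course; as noted above, the linear form of E[Y|X] is not guaranteed for other joint distributions.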

## Massachusetts test scores

HenryGB has twice removed a reference supporting the paragraph that gives MCAS "improvement" scores as a good example of the regression fallacy. He cites http://groups.google.com/group/sci.stat.edu/tree/browse_frm/thread/c1086922ef405246/60bb528144835a38?rnum=21&hl=en&_done=%2Fgroup%2Fsci.sta which I haven't had a chance to review. At the very least, it is extremely inappropriate to remove the reference supporting a statement without also removing the statement.

We need to decide whether this is a clear case of something that is not regression, in which case it doesn't belong in the article; or whether it's the usual case of a somewhat murky situation involving real-world data that isn't statistically pure, in a politically charged area, where different factions put a different spin on the data. If it's the latter, then it should go back with qualifying statements showing that not everyone agrees this is an actual example of regression. As I say, I haven't read his reference yet, so I don't know yet which I think. I gotta say that when I saw the headlines in the Globe about how shocked parents in wealthy towns were that their schools had scored much lower than some troubled urban schools on these "improvement" scores, the first thing that went through my mind was "regression." Dpbsmith (talk) 12:04, 31 March 2006 (UTC)

## Poorly written

The introduction is poorly written and fairly confusing.

—Preceding unsigned comment added by Aliwalla (talkcontribs) 08:23, 29 October 2006 (UTC)

## "SAT"

Would be better with an example that means something to those of us reading outside the USA. --Newshound 16:08, 5 March 2007 (UTC)

## Sports info out of date

The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy. Melvin Mora of the Baltimore Orioles put up a season in 2003, at age 31, that was so far away from his performance in prior seasons that analysts assumed it had to be an outlier... but in 2004, Mora was even better. Mora, then, had truly established a new level of production, though he will likely regress to his more reasonable 2003 numbers in 2005.

It's now 2007, but I don't know enough about baseball to comment on Mora's performance in 2005 or afterward. I also don't know how to tag this statement as out of date without using an "as of 2004" or "as of 2005" tag (I'm not sure how one could be worked in). Can anybody help? - furrykef (Talk at me) 08:42, 4 April 2007 (UTC)

I have great difficulty understanding this article. Everything, including the math, is just a mess. It is quite remarkable that I have never heard of the phenomenon "regression to the mean", and it seems that its usage is restricted to certain groups, such as medical and social scientists.

My guess is that there are two phenomena: a) the biological property related to growth first observed in the 19th century, and b) an obvious matter. Let me explain b), the obvious matter. I have a die with possible outcomes {1, ..., 6}. Assume I threw a 6. Then the next time I throw that die, it is very likely that the outcome will be less than 6 (since there is no 7!). If one calls that 'regression to the mean', the expression is more complicated than the fact itself. Can anybody comment? Sabbah67 13:54, 13 August 2007 (UTC)
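The die version of b) simulates in a couple of lines (my own sketch): after a 6, the next throw simply reverts to the overall mean of 3.5.

```python
import random

# Sketch: the throw after a 6 is an independent throw of a fair die,
# so its expected value is the overall mean 3.5, well below 6.
random.seed(1)
throws = [random.randint(1, 6) for _ in range(100_000)]
after_six = [throws[i + 1] for i in range(len(throws) - 1) if throws[i] == 6]
mean_after_six = sum(after_six) / len(after_six)
print(round(mean_after_six, 1))  # about 3.5
```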

## "History"

I think the history section is quite good except I think the history of the regression line is a bit off topic. Only if more detail were included (such as a discussion of the implications of the fact that the regression line had a slope <1) would the typical reader see the relevance. My opinion is that the regression line discussion be deleted but I don't feel strongly enough about it to do so myself. —Preceding unsigned comment added by 128.42.98.167 (talk) 19:09, 25 September 2007 (UTC)

## Defeating regression by establishing variance

Right, so i'm measuring quantity X over a population, and looking for an effect of applying treatment A.

If i measure X for all individuals, apply A to the lowest-scoring half, and measure again, i'll see an apparent increase because of RTM, right?

If i apply A to half the population at random, or to a stratified sample, can i expect to not see RTM?

Now, my real question, i guess, if i measure X ten times over the course of a year, then apply A to the lowest-scoring half, then measure X ten more times over the next year, then calculate the mean and variance / standard deviation / standard error of the mean for each individual, and look for improvements by t-testing, would i see an effect of RTM?

If i understand it right, RTM works because the value of X is some kind of underlying true value, plus an error term. If i pick the lowest values of X, i get not only individuals who genuinely have a low true X, but also those with a middling X who happened to have a negative error term when i measured X. Assuming the error term is random, doesn't that mean that taking multiple measurements and working out the envelope of variance allows me to defeat RTM?

-- Tom Anderson 2008-02-18 1207 +0000 —Preceding unsigned comment added by 62.56.86.107 (talk) 12:07, 18 February 2008 (UTC)
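The intuition in the last paragraph can be checked with a small simulation (my own sketch; the noise level sigma and the sample sizes are invented for illustration):

```python
import numpy as np

# Sketch: X = true value + random measurement error. Selecting the
# lowest half by a single noisy baseline and remeasuring shows a
# spurious "improvement" (regression to the mean); averaging k baseline
# measurements shrinks the error term and with it most of the artefact.
rng = np.random.default_rng(42)
n, sigma = 10_000, 1.0
true = rng.normal(0.0, 1.0, n)

def apparent_gain(k):
    """Pick the lowest half by the mean of k baseline measurements,
    remeasure once, and return the mean apparent improvement."""
    baseline = true[:, None] + rng.normal(0.0, sigma, (n, k))
    score = baseline.mean(axis=1)
    low = score < np.median(score)
    followup = true[low] + rng.normal(0.0, sigma, low.sum())
    return (followup - score[low]).mean()

g1, g10 = apparent_gain(1), apparent_gain(10)
print(round(g1, 2))   # sizeable spurious gain with one measurement
print(round(g10, 2))  # much smaller with ten averaged measurements
```

So averaging does most of what Tom suggests: the selection bias scales with the residual error variance in the selection score, which falls roughly as 1/k. It shrinks the RTM artefact rather than defeating it entirely.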

## POV template

"If you choose a subset of people who score above the mean, they will be (on average) above the mean on skill and above the mean on luck." -this is not cited at all. Additionally, this provides no information about the process of choice used.

"a class of students takes a 100-item true/false test on a subject on which none of the students knows anything at all. Therefore, all students choose randomly on all questions leading to a mean score of about 50." - therefore is obviously the wrong word here.

"Real situations fall between these two extremes: scores are a combination of skill and luck." -uncited

"It is important to realize" -obvious POV

"he couldn't possibly be expected to repeat it" -again

"The trick for sports executives, then, is to determine whether or not a player's play in the previous season was indeed an outlier, or if the player has established a new level of play. However, this is not easy." -again

"the findings appear to be a case of regression to the mean." -uncited, pov

"Statistical analysts have long recognized the effect of regression to the mean in sports" - more pov

"Regression to the mean in sports performance produced the "Sports Illustrated Cover Jinx" superstition, in all probability." - you get the idea

etc etc

Last, but certainly not least, the appalling:

"Whatever you call it, though, regression to the mean is a fact of life, and also of sports." -make it stop make it stop

This thing needs a complete rewrite, from the ground up. 219.73.78.161 (talk) 15:28, 26 June 2008 (UTC)

I am not sure the POV template is the right one however. Most of your complaints are more about sourcing (or lack thereof) and prose style, specifically that it sounds like how-to and/or an essay (which WP is not). That said, I agree the article has serious deficiencies. Baccyak4H (Yak!) 14:19, 16 July 2008 (UTC)
It's difficult to source that which seems obvious, and which in general aren't statements of fact but rather mathematical tautologies. I think this article should be rewritten to explain things more clearly, but I can't agree that noting that for example, the distinction between 'progression and time' etc is 'obvious POV', any more than stating 1+1=2 is 'obvious POV', or indeed requires much citing.--Fangz (talk) 18:02, 21 July 2008 (UTC)

## Cleanup

I've done some cleanup on the commented-out section; here's the cleaned-up version. Probably it could use more work:

### Francis Galton's experiment

The data are available from [1] and [2]. They were post-processed by listing all 934 children, of whom 481 are male. Some children share the same parents, so they have the same mid-parent height. Galton assumed that marriage is independent of height differences, i.e. that mating is random.

Initial calculation suggests that the means of the two generations are the same, i.e. 69.219 inches for the mid-parents (69.316 inches for the fathers and 64.002 inches for the mothers; Galton's result of $68\frac{1}{4}$ inches seems to be a miscalculation) and 69.175 inches for the offspring. Not only the means but also the variances agree: $2.647^2$ for the fathers, $2.512^2$ for the mothers (after scaling by a factor of 1.08), $2.623^2$ for the sons and $2.5445^2$ for the daughters. But the standard deviation of the mid-parents is 1.802 inches, or

$\frac{1.802^2}{2.607^2}=0.486\approx 0.5$

of the population. This can be easily explained by the fact that

$\text{midparent} = \frac{1}{2} (\text{fathers height} + 1.08 \times \text{mothers height}),$

therefore the variances:

$\sigma^2_\text{midparent}=\frac{1}{4}\sigma^2_\text{fathers height}+\frac{1.08^2}{4}\sigma^2_\text{mothers height}\approx\frac{1}{2}\sigma^2_\text{fathers height}.$

Further investigation suggests that the correlation coefficient between mid-parent heights and offspring heights is 0.497, i.e. the relationship is only moderately linear.

If we use the least-squares method for the fit, we obtain, using the following MATLAB code:

% least-squares fit of offspring on midparent: returns [slope; intercept]
pinv([midparent(1:934) ones(934,1)])*offsprings(1:934)


and obtain

\begin{align} \text{offspring} & {} = 0.713\times \text{midparent} + 19.874 \\ & {} = 0.713\times \text{midparent} + 0.287 \times \text{population mean (inches)}. \end{align}

This is illustrated as the blue line in Figure 1.

In fact, when the least-squares method is used to estimate the slope, it equals

$r\frac{S_Y}{S_X},$

or in this case,

$0.497\times\frac{2.607}{1.807}=0.713\approx \frac{\sqrt 2}{2}\approx\frac{2}{3}.$
Figure 1 The distribution of the 934 adult children of Galton's experiment against their corresponding mid-parent heights. The blue line is the least-squares fit, while the brown one joins the medians of the 11 categories of mid-parents. The semimajor axis of the ellipse is the principal component of the variable space. It is apparent in the figure that, on average, extreme parents do not produce equally extreme offspring, and that the extreme offspring (circled stars) are born to ordinary parents.

The ellipse indicates the covariance matrix of the offspring and mid-parents. It is given by

$\left\{(x,y)|[x\; y]^T = {\rm cov}(\text{midparent},\text{offspring})\times [\cos t\; \sin t]^T,\quad t\in(0, 2\pi]\right\}.$

Numerically,

${\rm cov}(\text{midparent},\text{offspring})=\begin{bmatrix}3.248&2.317\\2.317&6.678\end{bmatrix}.$

It is also a contour of the 2-dimensional Gaussian density, because

$\frac{1}{(2\pi)^{\frac{N}{2}} | \Sigma |^{\frac{1}{2}}} e^{\frac{-1}{2} (\mathbf{x} - \mathbf{u_X})^T \Sigma^{-1} (\mathbf{x}-\mathbf{u_X})}=C$
$\Rightarrow (\mathbf{x} - \mathbf{u_X})^T \Sigma^{-1} (\mathbf{x} - \mathbf{u_X}) = -2\ln\left(C(2\pi)^{\frac{N}{2}} |\Sigma|^{\frac{1}{2}}\right)$
$\Rightarrow (\mathbf{x}-\mathbf{u_X})^T U \begin{bmatrix} \frac{1}{\sigma^2_1} &0 \\ 0 & \frac{1}{\sigma^2_2} \end{bmatrix} U^T(\mathbf{x}-\mathbf{u_X})=-2\ln\left(C(2\pi)^{\frac{N}{2}}|\Sigma|^{\frac{1}{2}}\right)$

Writing $\bar{\mathbf{x}}=(\bar{x},\bar{y})^T=U^T(\mathbf{x}-\mathbf{u_X})$ for the rotated (principal-axis) coordinates, this becomes

$\Rightarrow \frac{\bar{x}^2}{\sigma^2_1}+\frac{\bar{y}^2}{\sigma^2_2}=\bar{C}.$

So the ellipse shown is the contour where

$\bar{C}=1,\text{ i.e. }-2\ln\left(C(2\pi)^{\frac{N}{2}}|\Sigma|^{\frac{1}{2}}\right)=1.$

The slope of the semimajor axis indicates the ratio of the variances, and its length is the standard deviation of the principal component. It is worth noting the counter-intuitive phenomenon that the slope of the ellipse's axis does not align with those of the fitted lines, i.e. the least-squares and median fitted lines.

Since 0.713 is much smaller than 1.0, we may conclude that the children of parents of extreme height will not be as extreme. Galton did not use least squares to fit the line; instead he quantized the mid-parent heights into 11 categories, namely 'Below', '64.5', '65.5', '66.5', '67.5', '68.5', '69.5', '70.5', '71.5', '72.5', 'Above'. For each category, he found the median height of all the offspring in that category. He drew a line through the medians and found that its slope is about 2/3, matching the slope predicted by the least-squares method. He therefore concluded that the offspring are not as extreme as their parents, which he termed the law of regression. But this conclusion is easily misunderstood, as one may further conclude from it that the height variance decreases steadily over the generations. In fact, the variance ($2.5842^2$) of the 934 offspring is almost the same as that of the fathers and the scaled mothers, and twice that of the mid-parents.

The reason the offspring heights appear to 'regress toward the mean' is that Galton analyzed only the medians and ignored the fact that, for each category of mid-parent height, e.g. '72.5', the variance of the offspring is not the same as that of '71.5' or 'Above 72.5'. For mid-parents near the population mean, the offspring heights are much more dispersed than those of the offspring of extreme mid-parents.

For example, for the mid-parent category 'Above 72.5', the offspring concentrate at '72.2' and '73.2', whereas for the category '72.5' the offspring span from '68.2' to 'Above 73.2'. As a result, the next generation as a whole still has the same variance as the fathers and mothers, but for the extremely tall or short offspring it is more likely that their mid-parents are not as extreme as they are.

—Preceding unsigned comment added by Michael Hardy (talkcontribs) 19:25, 21 July 2008 (UTC)
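The slope formula $r\,S_Y/S_X$ from the section above can be recovered from simulated data with Galton-like moments (my own sketch; the parameters are rounded from the figures above, and this is simulated data, not Galton's actual records):

```python
import numpy as np

# Sketch: simulate midparent/offspring pairs with roughly Galton-like
# moments (r = 0.5, SD_midparent = 1.8, SD_offspring = 2.6, mean 69.2
# inches) and check that the least-squares slope is r * s_y / s_x.
rng = np.random.default_rng(7)
r, s_x, s_y, mu = 0.5, 1.8, 2.6, 69.2
cov = [[s_x**2, r * s_x * s_y], [r * s_x * s_y, s_y**2]]
midparent, offspring = rng.multivariate_normal([mu, mu], cov, 100_000).T

slope = np.cov(midparent, offspring)[0, 1] / np.var(midparent)
print(round(slope, 2))  # near r * s_y / s_x, i.e. about 0.72
```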

## Commented out section

I probably oughtta explain myself a bit here. I realise that various editors have put a lot of work into Galton's example, which I commented out. But I mean, my main problem with that section is that I can't really see what it is adding to the article - to even realise it is an example of regression toward the mean, for example, would require some understanding of the Linear Model, and the relation with regression lines, and specifically how the slope of the line being under 1 is what is important here. The specifics of stuff like the principal component stuff just isn't very relevant to the article, as far as I can tell. Maybe a clarified image with explanation, plus some short comments on the source of the regression towards the mean, would be better. Similarly, with mathematical derivation, I'd like to replace it with something a bit more general, since for example the root 2 issue isn't very important in general.--Fangz (talk) 19:37, 21 July 2008 (UTC)

Possibly it should be made a separate article to explain the history of the idea. Michael Hardy (talk) 19:40, 21 July 2008 (UTC)
Actually, on second thoughts, the problem is a bit broader than the section - I myself am a bit confused about what the linear regression stuff is doing in e.g. the section on mathematical derivation. I suspect the main problem is that we are trying to overgeneralise in the 'ubiquity' section. It seems to be much simpler in the identically distributed but correlated regime, compared to whatever-the-heck 'ubiquity' is aiming at, which seems to try to relax assumptions but ends up assuming not merely normality but also a linear model without obvious explanation. Or maybe it's not simpler, argh. I get the distinct feeling we are overcomplicating something very obvious.--Fangz (talk) 21:54, 21 July 2008 (UTC)

## My Understanding

If one were to test a group of students, the results would fit on a bell curve. Likewise, if one were to test the same student 100 times (assuming no improvement), the results would fit on a smaller bell curve. Because of this, the top 50% is going to have had better luck, in general, than the bottom 50%. Everyone, on average, is going to have average luck next time, and that would be worse for the top 50% and better for the lower 50%.

For example: If you look at the top 15 MLB teams (or 50%) on July 1st 2007 they had won 56.2% of their games. Over the rest of the season they won 51.3% of games. This is because most of the teams are average and just had flukes the first half. They still won the majority of the 2nd half games because there are a couple good teams.

In general most of a group are average, and the top half consist mostly of average people with a good day. They will do worse on average the next time because they have average luck usually. 72.42.134.253 (talk) 02:11, 7 August 2008 (UTC)
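This skill-plus-luck story simulates easily (my own sketch; the spread of true team skill is an assumption, and the team count is inflated to 300 for a stable average):

```python
import numpy as np

# Sketch: each simulated team has a true win probability near .500.
# The top half at mid-season owes part of its record to luck, so its
# second-half winning percentage falls back toward its true level,
# while staying above .500 because some teams really are good.
rng = np.random.default_rng(3)
teams = 300
true_p = rng.normal(0.500, 0.04, teams)   # assumed spread of true skill
first = rng.binomial(81, true_p) / 81     # first-half win percentage
second = rng.binomial(81, true_p) / 81    # second-half win percentage
top = first >= np.median(first)
print(round(first[top].mean(), 3))   # top half's first-half record
print(round(second[top].mean(), 3))  # lower, but still above .500
```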

## I'd like to take a crack at a rewrite

But I'm pretty new to editing and I don't want to do it wrong. Here is my plan, please let me know if it offends you.

Summary, needs to be expanded a bit and made clear

Example, needs to be made clearer

History, it's important to mention Galton because he named the effect, and also because it explains the name of regression analysis. But this is of interest to history of statistics, not to regression to the mean. His is not a particularly good example, and his explanation of it uses obscure language and conflates biology and mathematics. The discussion of regression lines is irrelevant, the reader should be referred to the article on regression after being told this is the source of the name.

Ubiquity, I would rename and rewrite this section to distinguish among different effects that are sometimes referred to as regression toward the mean. There is a biological principle, and engineering principle and a mixing principle, in addition to the statistical principle under discussion. The idea is also related to shrinkage, which is important, and also important to distinguish.

Mathematical derivation, I don't want to insult the author, but the steps are trivial algebra that don't start at a natural place nor lead to any insight. Also the use of rho for the regression coefficient usually labeled beta is confusing (since rho is usually the correlation coefficient that the author labels r). The text is confusing. I don't think any math is necessary. I would think about putting in a short theoretical section.

Regression fallacies, I think this is excellent, and an important part of the article. I would shorten the "In Sports" and "In road safety policy" sections and combine them with this. I would number or bullet the list. —Preceding unsigned comment added by AaCBrown (talkcontribs) 20:29, 13 October 2008 (UTC)

## I don't like the latest set of edits by thesoxlost

It took out common words and substituted technical ones making the article harder to understand for people not trained in statistics or mathematics. I don't think it added clarity. The section on Galton removed a referenced, even-handed treatment, and substituted an unreferenced one-sided one. It also took out a reference and substituted a later reference that conflicts with the reference section.

I'd like to revert it. Any other opinions?

response

The changes I made weren't to substitute technical terms for common terms; they were to substitute technically correct terms for technically incorrect and misleading ones. In statistics, it is important. There were a number of them, so perhaps we can come to a consensus on each:

• pair is at least as common as set, and more appropriate
• population is correct in this case, not sample. They are equally common, and this page will be referenced by people who are learning statistics and need to respect the difference between populations and samples.
• If you prefer individual to sample, that's fine. I objected to it because it implies that the data points are taken from individual people, which is often not the case. Someone who does not share this bias would find that term confusing.
• regarding random variance: this concept is the foundation of regression to the mean. Without an understanding of random variance, you cannot understand why regression to the mean is a mathematical necessity, nor why Galton's original use of the term was incorrect.

Most important, a reversion would violate editorial policy. The previous version was incomplete; the current version is still incomplete, but includes content that improves the article according to wiki guidelines. If you want to argue that the article is not understandable or is less clear as a result of my edits, I invite you to make changes that attempt to achieve a synthesis, instead of reverting.

The history section included a quote from an article that was not available online. I replaced it with a reference that has the same content and is easily accessible online (e.g., www.galton.org). If you prefer the previous quote, that's fine.

I feel that it is important, in a section on the history of regression toward the mean, to point out that the theory has changed over time because the original concept was not a statistical one, but a genetic or sociological one. Further, given the importance of the term (in its original usage) in supporting theories like eugenics, it is important to not associate the mathematical necessity of regression toward the mean with controversial topics such as eugenics.

--Thesoxlost (talk) 18:26, 8 November 2008 (UTC)

Okay. You convinced me.

AaCBrown (talk) 21:24, 23 November 2008 (UTC)

## Doubts on the "Alternative, provable formulation"

Alternative formulation

The alternative formulation is unsourced, and looks like original research. (WP:OR)

It does not capture the commonly accepted notion of Regression toward the Mean (RTM).

Suppose the mean (of X and Y) is 70, and X=80 (first test score, say). RTM should assert that, given X=80, Y (the second test score) is expected to be in the range [70, 80).

With this alternative formulation, there is an undefined c (I suppose we would choose c<=80, in this example), and E[X|X>=c] can be any value >=c. Suppose we choose c=75, and let E[X|X>=75]=90. Then, this formulation says that we have E[Y|X>=75]<=90. Thus, E[Y|X>=75] may in fact be greater than X (=80), yet we say that RTM exists. This is not an acceptable formulation.

Proof?

In the "proof", we have

But since the marginal distributions are equal,

$P[X \ge c] = P[Y \ge c]$ whence
$P[X \ge c \land Y < c] = P[X < c \land Y \ge c]$

How does "the marginal distributions are equal" imply

$P[X \ge c \land Y < c] = P[X < c \land Y \ge c]$ ?

Equal marginal distribution guarantees nothing about the joint distribution.

It seems to me that the proof is therefore wrong.--Palaeoviatalk 00:29, 30 May 2009 (UTC)

Responses:
• Yes, I admit it's currently WP:OR. The reason I put it in is that I can't believe that nobody has come up with this before, although I don't know when and where. But if it's just flagged as {{cn}} then with any luck somebody will tell us.
• The "the commonly accepted notion" of RTM is rather diffuse. If you look at the way it gets used in real cases, I think the formulation I've given supports the idea rather better than the usual formula. The usual formula focusses on a single sample value, where anything might happen, whereas for useful application of the principle we need to look at samples of more than 1. Taking your example, if we get a whole lot of test scores in the range 60 to 80, and take c = 70, the Y-expectation of the higher X-scorers goes down.
• With the conventional formulation "we say that RTM exists" is wrong. Consider a population of clocks in a clock shop. All are going at the right speed, but with random settings. The chosen variable is the number of minutes after 12 o'clock displayed, so a uniform distribution in the range (0, 720). We have identical marginal distributions as required. We take our X reading, then our Y reading an hour later. The usual RTM formulation is then usually contradicted, whereas the alternative one continues to be valid.
• The proof is correct. Please read it a bit more intelligently. Where I say "whence" of course you need to take the preceding three lines together.
SamuelTheGhost (talk) 07:54, 30 May 2009 (UTC)
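For what it's worth, the clock-shop example above simulates cleanly (my own sketch): the marginals of X and Y are identical, the usual "closer to the mean" claim fails about half the time, yet E[Y | X >= c] <= E[X | X >= c] holds.

```python
import numpy as np

# Sketch: X is uniform on (0, 720) minutes; Y is the same clock read an
# hour later, so Y = (X + 60) mod 720 and the marginals are identical.
rng = np.random.default_rng(5)
x = rng.uniform(0.0, 720.0, 200_000)
y = (x + 60.0) % 720.0
c, mean = 360.0, 360.0

high = x >= c
print(round(x[high].mean(), 1))  # about 540
print(round(y[high].mean(), 1))  # about 480: no higher, as altRTM claims
closer = (np.abs(y - mean) < np.abs(x - mean)).mean()
print(round(closer, 2))  # about half the clocks move *away* from the mean
```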
$P[X \ge c \land Y < c] = P[X < c \land Y \ge c]$
--Palaeoviatalk 08:09, 30 May 2009 (UTC)
From the top:
First we look at some probabilities. By elementary laws:
$P[X \ge c] = P[X \ge c \land Y \ge c] + P[X \ge c \land Y < c]$ and
$P[Y \ge c] = P[X \ge c \land Y \ge c] + P[X < c \land Y \ge c]$
I presume you're happy with those statements?
But since the marginal distributions are equal,
$P[X \ge c] = P[Y \ge c]$
Taking these three equations together we get
$P[X \ge c \land Y < c] = P[X < c \land Y \ge c]$
SamuelTheGhost (talk) 08:26, 30 May 2009 (UTC)

I assume that you are not asserting the following:
Since A=B+C, D=E+F, and A=D, therefore B=E.

I am saying that your reasoning follows exactly this schema.--Palaeoviatalk 08:38, 30 May 2009 (UTC)

No, I am saying

Since A=B+C, D=B+F, and A=D, therefore C=F.


SamuelTheGhost (talk) 08:59, 30 May 2009 (UTC)

Thank you. I apologize for my embarrassing blind spot in reading your proof.
Your random clock example in fact shows why your definition of RTM does not accord with common usage. The mean is 360. For all X, such that 660>=X>=360, Y=X+60 >X (Y being the reading of the clock 60 minutes later). For all X>660, Y=X+60-720.
In the first case, the second reading Y is certain to be further away from the mean (by a value of 60) than the first reading X. In the second case, whereas X is near the maximum, Y is near the minimum. I would argue that such behavior should not be called RTM.
There is no reason to define RTM so that it becomes universal.--Palaeoviatalk 10:34, 30 May 2009 (UTC)
There certainly needs to be more work on the presentation. Let's refer to "commonRTM" and "altRTM" for the two versions. We need to do perhaps the following:
• Make quite clear in the article that commonRTM is not universal, perhaps using the clock example
• Make quite clear in the article that altRTM does not necessarily imply literal reversion to the mean, perhaps using the clock example
• Invent a good name for the altRTM result, or discover it if it has already been published elsewhere
• If possible, give conditions under which commonRTM is valid, perhaps by deriving it from altRTM with extra constraints (but preferably not assuming a binormal distribution, which is in reality so rarely shown to exist)
• Explore the application of both versions in real-life cases
SamuelTheGhost (talk) 11:07, 30 May 2009 (UTC)
I am agreeable to your suggestion. However, in view of WP:OR, I doubt that your proposal is the best course to take. You would get a more appreciative and knowledgeable readership by publishing altRTM (if original) in a journal. I suspect this article attracts mainly students in the social sciences, and few statisticians. --Palaeoviatalk 11:25, 30 May 2009 (UTC)

The proof in the "alternative, provable formulation" appears to be correct. I'm not sure that it's the same thing as regression toward the mean, as that is usually understood. It certainly shouldn't be the centerpiece of the whole article, as it becomes when it immediately follows a terse introductory paragraph. The statement that

$P[X \ge c \land Y < c] = P[X < c \land Y \ge c] \,$

based only on equality of the marginal distributions, despite possible asymmetries in the joint distribution, surprised me, despite the completely elementary nature of its proof. Michael Hardy (talk) 14:18, 30 May 2009 (UTC)
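For anyone who shares that surprise, the identity is easy to check mechanically. The following sketch (my own toy construction, not from the cited proof) uses a deliberately asymmetric joint distribution whose two marginals are nonetheless equal:

```python
from fractions import Fraction

# X uniform on {0, 1, 2} and Y = (X + 1) mod 3: the joint distribution
# P(X=i, Y=j) is far from symmetric in i and j, but each marginal is
# the same uniform distribution.
support = [0, 1, 2]
joint = {(x, (x + 1) % 3): Fraction(1, 3) for x in support}

def p(event):
    """Total probability of the pairs (x, y) satisfying `event`."""
    return sum(pr for (x, y), pr in joint.items() if event(x, y))

# P[X >= c and Y < c] == P[X < c and Y >= c] at every threshold c,
# purely because the marginal distributions of X and Y coincide.
for c in support:
    assert p(lambda x, y: x >= c and y < c) == p(lambda x, y: x < c and y >= c)
```

The same enumeration works for any finite joint distribution with equal marginals, which is all the elementary proof uses.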

## new tools available, new explorations?

70.212.163.46 (talk) 05:43, 7 July 2009 (UTC)

Start with a good database, one with the heights of adult men. Next, compare the heights of their sons to the fathers. Sounds easy.

The sum of regression-to-mean events must be counteracted by an equal set of regression-away-from-mean events, if the shape of the distribution stays exactly the same.

Modern regression-to-mean studies should start with a much larger data set. If Galton's parent-and-child generational study is to be expanded, it would have to be huge to accommodate lots of correction factors: socioeconomic status, perhaps religion, local environmental conditions, pollution, etc. Lots of stuff should be controlled. What about remarriages, multiple births, hanky-panky, etc.?

Galton's hypothesis supposes that for parents with high z-scores, the children's z-scores will be nearer the mean. Simple enough.

[Was Sir Galton sexist, speaking only of fathers and sons? I wonder if he liked Turgenev's book? Perhaps it was just the bias of the old statistics book I first read concerning regression-to-the-mean topics. Galton's data seems to have measured most of the mothers and fathers, with some exceptions, perhaps for mothers or fathers who had passed on before the experimental data was collected.]

Let one of us re-run a hopefully better experiment to see what really happens. Were the 900-some children of the families all biologically whom they seemed? Were Victorian social taboos controlled? One effect I don't often see accounted for is infantile and fetal morbidity. How large are the effects of feast-and-famine economic conditions compared to the sum of differences between parents' and children's metrics?

Nevertheless I suspect Galton's observation will prove insightful and accurate.

With modern database programming languages, the intra-generational differences could be associated with many possible causes. The subset of observations wherein the child's z-score exceeds the parent's might be very interesting.

If the experimental technique were applied to some measure of batting prowess, any anti-Galton-like effects might be attributed to excellence in coaching, or perhaps a sporting phenomenon defined by Dennis Jenkinson as "tiger". Is it appropriate to think of one at-bat being the son of a prior at-bat? One season's performance as the child of the last? Several laps at the 1957 German Grand Prix WERE the offspring of the previous lap; ask Juan Fangio!

Using the above experimental outline, as the children's distribution fills, "z-space" will become scarce. Tall (or short) parents with more moderately sized children will push the children of moderately sized parents into z-spaces differing from those that might be observed with a less comprehensive experimental design.

With the results of these kinds of experiments in hand, new kinds of intra-generational effects will be defined. Perhaps these kinds of statistical experiments are being conducted in microbiological labs, but the kind of experiment I envision has proper names attached to the individuals in the experiments.

8JUL09 - Perhaps aerodynamics has a more complete analysis done. Consider the metrics of particles flowing over a wing at the leading edge as parents. Consider the metrics of those particles a few inches later, over the midpoint of the wing, as "children." Then the metrics of those particles over the trailing edge might be thought of as grandchildren. The science of turbulence may offer us useful insights.

For each "regression towards mean" event there must be some kind of change in the total distribution. Perhaps there are several kinds of changes. The answer lies in tracking all the individuals in the distribution.

The sum of regression-to-mean events in a one-generation analysis must be counteracted by an equal set of regression-away-from-mean events, if the shape of the distribution is presumed to stay exactly the same. Perhaps there is a children's generation that is exactly the same except for a change in the variation. Pre-mechanization parents might be compared with picking-machine children.

Monday 13jul09 So now that we can track exactly each individual in a "normal-like" distribution, when comparing parents' and children's distributions, the relevant results are improbable movements due to identifiable causes, shifting of the children's distribution due to squirming of the other children of that generation, AND resulting distortions to the "normal-like" distribution.

Stephen Jay Gould saw all this as "improvement in the game"; see his "Full House." His study needs to be redone with adjustments for "quantity effects." Variability of batting averages goes down with increases in the number of plate appearances and/or "at bats." In 1941 the season was much shorter than the 162 games of today. There might be a .400 hitter at the same number of games at which Ted Williams ended his famous season.

When there was more variation in abilities, the teams had different strategic options; perhaps the game was better then? I'm going to listen to more old games. I listen to most games now.

Statistically speaking, the study of snake-like distributions that summarize all the changes in two related mound-like distributions may become popular.

Perhaps a clearer and shorter statement of the idea:

The Snake-like Distribution that Ate Hank Aaron's Statistics. 27SEP09, by CactusMitch

Measuring the outcomes of sporting experiments involving pitchers and batters is a sometimes pleasant, oft times frustrating pastime. The interested parties, the pitchers, the batters, the managers, the fans, and perhaps even the umpires, all change their behavior in an attempt to influence the outcome of the next experiment. Since this is true, let us once and for all dismiss all the discussion based on binomial probabilities as being misguided.

The use of the batting average to compare batters' prowess is shaky. Hits were once recorded only in daylight, from a pitcher throwing from a mound about six inches higher than the current mound. At Bats have been decreased by two kinds of sacrifice non-"Hits": if a batter flies out or bunts out to advance a runner, his At Bat statistics are not increased.

Nevertheless, there is another phenomenon, previously undocumented at least to my knowledge, perhaps related to large-number phenomena, that may shed light on the demise of the .400 hitter.

The Cactus conjecture: As observations are added to a constrained, mound-shaped distribution, the z-score needed to achieve observations comparable to an extreme observation made early in the experiment actually INCREASES!

Comment: Stephen Jay Gould was wrong. The .400 hitter succumbed to more baseball, not necessarily “better” baseball.

—Preceding unsigned comment added by 207.246.9.229 (talk) 18:23, 13 July 2009 (UTC)

## Restructuring for maths formulations

I have restructured the article as discussed above, under "badly written", leaving a hopefully non-maths lede and room to add extra details of the various maths approaches to the idea. I have made a new section here to make it easier to find discussion of what is needed from this point on. Melcombe (talk) 10:22, 30 July 2009 (UTC)

Thanks to Melcombe for a promising start. "Samuels' formalization" should of course be hers, and not SamuelTheGhost's. Will you be taking care of that?--Palaeoviatalk 10:37, 30 July 2009 (UTC)
I have already included citation to Samuels for this point, but it could do with nicer phrasing. Melcombe (talk) 11:49, 30 July 2009 (UTC)

## Empirical statistical laws

I have nominated Empirical statistical laws for deletion on the grounds stated. Please add verifiable sources. I recognize the statistical concepts listed are often loosely referred to as 'laws', but find the term highly confusing. Equivocation about the term 'law' is expressed but I think it is very confusing to a layperson as it is currently written. I am very aware of the historical origin of regression to the mean, but it was mistakenly taken to be a law of heredity and in any case, I am questioning the prominence of the actual name. Again, please include sources for any discussion. Holon (talk) 12:49, 30 July 2009 (UTC)

Well it is an empirical statistical law and it is essentially for that reason that it is important to the lay person, for the reason stated in the supposedly offending sentence. Melcombe (talk) 14:39, 30 July 2009 (UTC)
We're talking at cross-purposes: the concept is important, but it is not necessary to use the (obscure) term "empirical statistical law" to convey the concept. The term statistical phenomenon is used first, then (sic) "law". The word 'important' is also unnecessary and I'll edit it out when we reach consensus on the rest. Holon (talk) 14:49, 30 July 2009 (UTC)
"Empirical" is a word that precisely describes the type of law that it is. The sentence uses "importance" not "important", and importance is one of things the lede is meant to describe. Melcombe (talk) 15:18, 30 July 2009 (UTC)
If something is important, that should be self-evident from the content. Holon (talk) 14:15, 31 July 2009 (UTC)
To a general reader, "law" conveys the notion of laws in the physical sciences, such as Newton's Laws, Hooke's Law. These are precise, falsifiable mathematical statements of natural phenomena. RTM is, in its common usage, an imprecise, unfalsifiable observation. Using the word "law" is unnecessary, and leads to confusion. The danger here is to mislead readers into thinking that RTM must operate on every occasion, under all circumstances.
"Empirical statistical law" lends an undeserved aura of unchallengeable authority to the phenomenon. It is not an accepted term in general usage among mathematicians. It should be avoided here.--Palaeoviatalk 20:16, 30 July 2009 (UTC)
I could not agree more, Palaeovia; that is precisely my concern. Michael, I don't take the above to mean it is formally vague; of course it is precisely stated in terms of the theorem. My concern is that we make it clear that it is not one of the laws of science. The term 'empirical statistical law' implies it is an empirical law in the sense of a law of science. It simply is not. Holon (talk) 04:42, 31 July 2009 (UTC)
This is just argumentative nonsense. What we have here are statistical laws that were/are originally postulated on the basis of empirical evidence, in exactly the same way that Newton's law was originally based on empirical evidence. An empirical statistical law is true, false or unproven, or true or false under stated circumstances. Do you think that Newton's law should not be called a law just because it is false? Melcombe (talk) 09:47, 31 July 2009 (UTC)
Your tone clearly speaks for itself but a request to keep it civil. Newton's law? (which?) All scientific laws express specific relations between specific dimensions and the term law is conventionally used where an empirical causal relation is inferred. A dimension is a kind of quantity, which is a property of a substance, body or phenomenon (see BIPM, VIM [3]). The term regression to the mean does not refer to one specific relation between specific dimensions. Empirically speaking, it's a phenomenon that exists wherever two sets of measurements are correlated. It does not refer to particular dimensions with particular standard units (as do all relations conventionally referred to as scientific laws). To ascribe a source of the correlation is to commit the regression fallacy. However, this line of debate is really moot: it all depends how you wish to define law. I'm not particularly fond of the word law in science, but neither of our dispositions is relevant. Wikipedia is not the place for using obscure terms, for all kinds of reasons explicitly stated in policy. It's just not referred to as an empirical statistical law and doing so just undermines the credibility of the articles and Wikipedia. Holon (talk) 14:15, 31 July 2009 (UTC)
Common usage has nothing to do with it. I've given a precise statement of a mathematical theorem here on this discussion page. It's not a "vague unfalsifiable" thing. Michael Hardy (talk) 20:19, 30 July 2009 (UTC)
I meant that the phenomenon described thus is vague and unfalsifiable:
Regression toward the mean, in statistics, is the phenomenon whereby members of a population with extreme values on a given measure on one occasion of observation will, for purely statistical reasons, probably give less extreme values on the measure on other occasions of observation.--Palaeoviatalk 20:44, 30 July 2009 (UTC)
Indeed, that is vague. Anything can be made vague if written by an idiot, but that doesn't mean we necessarily always have to do it that way. Michael Hardy (talk) 00:19, 31 July 2009 (UTC)
I take it that this is how RTM is defined in both The Oxford and Cambridge Dictionaries of Statistics, as cited. It therefore means that it is commonly used as a vague concept (perhaps in the social sciences).--Palaeoviatalk 00:48, 31 July 2009 (UTC)
The Cambridge Dictionary includes this: "The term is now generally used to label the phenomena that a variable that is extreme on its first measurement will tend to be closer to the centre of the distribution for a later measurement." The Oxford Dictionary starts: "An expression of the observation that, when we take pairs of related measurements, the more extreme values of one variable will, on average, be paired with less extreme values of the other variable". Neither article mentions either regression (as in linear regression) or conditional expectation, but then both are only 8-12 lines long. The Cambridge Dictionary does have something quoted from Galton that is slightly more specific, but still vague, immediately before the line quoted above. Melcombe (talk) 09:37, 31 July 2009 (UTC)
Does it really say "phenomena" instead of "phenomenon"? Michael Hardy (talk) 21:39, 1 August 2009 (UTC)
Yes, the version I have (2nd Edition) does say "phenomena". I don't have access to a later edition. Melcombe (talk) 10:07, 10 August 2009 (UTC)

The Law of Large Numbers is an excellent example of an empirical observation of a statistical phenomenon, first called a "law", subsequently rigorously formulated and proved. I have not encountered the phrase "The law of regression toward the mean". Sources are needed to show that statisticians have declared RTM a "law".--Palaeoviatalk 10:10, 31 July 2009 (UTC)

## Proposed reversion

I think this needs to be reverted back to the version identified by the time stamp 10:50, 30 July 2009. All the recent edits seem founded on the misunderstanding that the regression-based interpretation is in some way "not general". There may be some thought that it requires an assumption of normality, but of course it does not. It is also nonsense to think that one of the conditional expectation interpretations is "more general" than the other. Specifically they are all equally general, but different. The behaviour ("law") defined in each different way may either apply to all distributions, or not. This is an important logical point. Melcombe (talk) 14:45, 30 July 2009 (UTC)

There is no sourced rigorous definition of "regression-based" RTM. I'm only guessing at what this might be. The question of whether defining RTM in the context of linear regression is more restrictive is therefore impossible to resolve now. (I suppose that linear regression is only meaningfully applied to some, but not all, bivariate distributions, hence more restrictive.)
Let's remove ourselves from all irrelevant details.
Take any bivariate distribution. (The marginal distributions need not be equal. They can be shifted and scaled to mean 0, standard deviation 1.) Define a term T (RTM in this case) in two ways, resulting in two concepts (two predicates of bivariate distributions) T1 and T2. We say that T1 is more general than T2 if the set of bivariate distributions of which T2 is true is a strict subset of the set of bivariate distributions of which T1 is true.
Equivalently, T1 is more general than T2 if the following statements are true:
a. If T2 is true of a given bivariate distribution, then T1 is true of it.
b. There exists a bivariate distribution of which T1 is true and T2 is false.
The use of "law" is confusing in this discussion. We are not defining a law (and we cannot). We are defining a concept. I am beginning to see the danger of using "law" to describe RTM in that it causes confusion. The so called "empirical law" is just a vague empirical observation. There is no "law" when we proceed to formalize the phenomenon.--Palaeoviatalk 19:48, 30 July 2009 (UTC)
It would have been possible to add a "sourced rigorous definition of regression-based" RTM into the article had you not destroyed the logical structure once again. You destroy neutral text on the basis that it does not give citations to a view you don't want included, while making new incorrect statements without any attempt to provide citations for this stuff. Your concept of what things are more general than others is incorrect in this context: we have here three (of many possible) definitions of what might be meant by regression to the mean, all of which are potentially equally general. One could note that the conditional expectation versions as stated are less general than the regression one because these statements of definition do not cover the case of negative dependence, but this could be easily remedied (although this may not be in the literature). And, in addition, you do not seem to have recognised that Samuels' definition is for "reversion to the mean and beyond" not just "reversion to the mean". Do you think the "and beyond" bit is irrelevant? Melcombe (talk) 10:04, 31 July 2009 (UTC)
There is a section you allocated for the definition of regression-based RTM. It must be filled before meaningful discussion can proceed.
Whether to describe a definition as restrictive or not is a minor issue. In fact, I consider the restrictive definition as superior to the general one proposed by Samuels. Therefore I am strictly descriptive in using "restrictive" and "general", without implying that "general" is better.
"Reversion to the mean and beyond" is a subset of the quoted definition, which is for "reversion to the mean".--Palaeoviatalk 10:48, 31 July 2009 (UTC)

## Intelligence

Linda Gottfredson points out that 40% of mothers having IQ of 75 or less also have children whose IQ is under 75 - as opposed to 7% of normal or bright mothers.

Fortunately, because of regression to the mean, their children will tend to be brighter than they are, but 4 in 10 still have IQs below 75. (Why g matters, page 40)

What do we know about IQ or g and regression toward the mean? Elabro 18:55, 5 December 2005 (UTC)

Your question seems to contain its own answer. Taking everything at face value, and brushing aside all the arguments (whether g exists, whether it means anything, whether Spearman's methodology was sound, whether imprecise measurements of g should be used to make decisions about people's lives, etc.) what the numbers you cite mean is simply that IQ measurements are mixtures of something that is inherited and something that is not inherited.
Intelligence, as measured by IQ score, is just about 50% heritable.
Regression doesn't have to do with the child, in this case, it has to do with the mother. The lower the mother's IQ measurement, the further away from the mean it is. The further away from the mean it is, the more likely that this was not the result of something inherited but of some other factor, one which won't be passed on to the child, who will therefore be expected to have higher intelligence than the mother.
This isn't obvious at first glance but it is just plain statistics. Our article on regression doesn't have any diagrams, and one is needed here. Dpbsmith (talk) 20:26, 5 December 2005 (UTC)
Thanks for explaining that. It's clear to me now, and I hope we can also make it clear to the reader.
By the way, I'm studying "inheritance" and "heritage" and looking for factors (such as genes) that one cannot control, as well as factors (such as parenting techniques, choice of neighborhood and school) that one can control - and how these factors affect the academic achievement of children. This is because I'm interested in Educational reform, a topic that Wikipedia has long neglected. Elabro 22:10, 5 December 2005 (UTC)

I am having a difficult time believing the regression to mean effect in certain circumstances. For example, if 2 parents of equal IQ, say an IQ of 130, have children, if heritability is .7 that is the nature component of IQ right? The other .3 is often stated as being the mean of the population at large but that does not make sense to me. Wouldn't it be nurture, the food they consume and environmental stimulation they receive?

Does anyone have any statistics on high IQ (specific IQ values), well off parents who have children, and the children's IQ scores? I have not been able to find any and it is making it very difficult for me to believe that this effect is real if the parents specifically choose each other for their IQ.

Regression to mean as it is often used could imply that evolution into different species is not possible. I remember reading about insects in an underground cave that were recently discovered in Israel. There was no light in the cave. The insects in there had evolved no eyes. Regression to mean would imply no matter how much smaller or diminished a group of insects eyes were, their offspring would regress to the mean and have normal eyes yet over time the offspring evolved and evolved and ended up with no eyes. 72.209.12.250 (talk) 04:18, 13 June 2008 (UTC)

To answer this question, it must be understood that regression to the mean is an expression of the reduction of variance for the sum of random variables. IQ is the result (sum) of many factors, and even the parents' contributions are the sum of smaller contributions. Regression to the mean thus states that their offspring are more likely to be closer to the nominal of all these factors. Thus, if environmental factors change in such a way that they encourage greater intelligence (e.g., improved diet, removal of toxins), then you would actually expect the next generation to be more intelligent because of the shift in the environmental mean. The genetic component(s) for IQ would not be immediately altered, however, until natural selection comes into play. By this mechanism regression to the mean not only does not prevent evolution, but may in fact promote it. 65.200.157.177 (talk) 14:05, 22 July 2010 (UTC)
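The mechanism described above can be illustrated with a small simulation. This is my own toy parameterization, not anything from the IQ literature: an observed score is the sum of a heritable part and a non-heritable part, the child inherits the heritable part but redraws the other, so E[child | parent] = h²·parent with h² = 0.5 here.

```python
import random
import statistics

# Minimal additive sketch: score = genetic + environmental, with the
# genetic component passed on intact (a simplification) and the
# environmental component drawn fresh for the child.
random.seed(42)
N = 100_000
parents, children = [], []
for _ in range(N):
    g = random.gauss(0, 1)             # heritable component
    parent = g + random.gauss(0, 1)    # parent's observed score
    child = g + random.gauss(0, 1)     # same genes, fresh environment
    parents.append(parent)
    children.append(child)

# Children of extreme-scoring parents (score > 2, well above the
# population mean of 0) regress toward, but not all the way to, the mean.
extreme = [(p, c) for p, c in zip(parents, children) if p > 2]
parent_mean = statistics.mean(p for p, _ in extreme)
child_mean = statistics.mean(c for _, c in extreme)
assert 0 < child_mean < parent_mean
```

The same logic run with parents far below the mean shows children expected to score higher than their mothers, which is the pattern Dpbsmith describes.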

## Complete Misunderstanding and Utter Nonsense

The article looks impressive, but is utterly useless, totally incoherent and misleading. As an emergency remedy to this disaster, the succinct explanation in WolframMathWorld powerfully dispels the dense fog of ignorance and error enveloping this article.

The section "Regression towards everything" is utter rubbish, and should be deleted forthwith. Please voice any objection.

I object. This is an important point, that there's nothing special about the mean.
• The pitcher with the highest batting average among pitchers by the All-Star break is more likely to regress down toward the mean batting average of all pitchers than up to the mean batting average of all players in the second half of the season. That need not be true, but it can be and happens to be (see: GEORGE, E. I. (1986). Minimax multiple shrinkage estimation. Ann. Statist. 14 188–205.)
• Say we have a test with negatively-skewed results due to a few students with very low scores. A student who scored above the mean but below the median is more likely to see his score decline rather than increase on a retest. This also need not be true, but it can be. Suppose there are 100 true-false questions. Ten students are guessing randomly and 90 students answer each question correctly with independent 90% probability. We expect a mean score of 86 and a median of 89.2. A student with an 88 on the first test has a 70% chance of improving (away from the mean) and only a 20% chance of doing worse (toward the mean). On the other hand, a student who scored 55 on the first exam has an 82% chance of doing worse on a retake (away from the mean) and only a 14% chance of improving (toward the mean).
• A student who scores below the 90th percentile on an exam is more likely to have had bad than good luck (that has to be true since luck averages out to zero and students above the 90th percentile, or any point, are more likely to have had good luck than bad), and therefore based on that information alone is more likely to improve, to get closer to the 90th percentile, than to do worse, moving farther from the 90th percentile.
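The figures in the true/false-test example above can be reproduced exactly from the stated mixture (10 guessers at p = 0.5, 90 students at p = 0.9). The Bayesian handling of the unknown student type is my own reading of the example, sketched here:

```python
from math import comb

def pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def retake_probs(score, n=100):
    """P(retake > score) and P(retake < score), given the first score,
    averaging over the posterior on the student's type."""
    w_guess = 10 * pmf(score, n, 0.5)   # posterior weight: guesser
    w_good = 90 * pmf(score, n, 0.9)    # posterior weight: 90% student
    total = w_guess + w_good
    up = down = 0.0
    for p, w in ((0.5, w_guess), (0.9, w_good)):
        up += w / total * sum(pmf(k, n, p) for k in range(score + 1, n + 1))
        down += w / total * sum(pmf(k, n, p) for k in range(score))
    return up, down

mean = 0.1 * 50 + 0.9 * 90              # population mean score: 86.0

# 88 (above the mean): more likely to IMPROVE, i.e. move away from the mean.
up88, down88 = retake_probs(88)
assert up88 > down88

# 55 (below the mean): more likely to get WORSE, again away from the mean.
up55, down55 = retake_probs(55)
assert down55 > up55
```

The computed probabilities land close to the ~70%/20% and ~82%/14% figures quoted in the bullet above.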
I suspect you are thinking primarily of i.i.d. observations from a symmetric, unimodal distribution, with complete information and no prior knowledge. A lot of things don't fit this description and the concept of regression is still useful. More important, the concept of regression toward the mean does not depend in any sense on the mean, so any explanation that emphasizes the mean misses the core idea.
The original section was well-documented with citations to top academic journals and textbooks. If you have sources for your belief that it's utter rubbish, you should add them to the article as this would be a controversy at the heart of the topic. I am not aware of any such sources, and you don't give enough details for me to research.
I vote to leave this section in. If you decide to refute it, please include your contrary point of view with sources. Leaving it out entirely is a major hole.
AaCBrown (talk) 17:05, 23 July 2010 (UTC)

The "Mathematics" section is irrelevant and unclear, and should be replaced with the WolframMathWorld material. --Palaeoviatalk 22:39, 30 April 2009 (UTC)

Wow, I hadn't looked at this for a while, what a mess. No objection to your characterization of "Regression towards everything" other than you understated the issue.
I am not sure what the usage situation would be with MathWorld content; certainly it is a good source (read: reliable, not infallible), but I doubt we can just copy large sections verbatim (sorry if that's not what you were proposing). Certainly some tags are in order to warn the reader that the page needs work. I'll have a look and stick some appropriate ones on. Baccyak4H (Yak!) 03:39, 1 May 2009 (UTC)
I certainly was not suggesting plagiarism, but rather adopting the substance of the material in question.
However, I now realize that WolframMathWorld's definition of Regression to the Mean (RTM) is probably different from the commonly accepted one.
We need to exclude half-baked ideas from the article, and rely on peer-reviewed academic articles and reputable sources. However, apparently even such sources can fall prey to fallacies. Vigilance is called for here.
To provide some pointers, I found the following online articles helpful, though imperfect:
For a gentle introduction to RTM, see Regression to the mean: what it is and how to deal with it, International Journal of Epidemiology.
For a mathematical treatment, see Regression Toward the Mean and the Study of Change, Psychological Bulletin.--Palaeoviatalk 12:08, 1 May 2009 (UTC)

Well, you win some, you lose some. I'm pleased that Michael Hardy finds that my proof appears to be correct. As for his judgement that the article is badly written, perhaps I could draw his attention to some extracts from the article's history, namely
• SamuelTheGhost - never touched it until 29 May 2009, then inserted one section (perhaps in the wrong place)
• Palaeovia - never touched it until 1 May 2009, then made some edits, mostly deletions
• Michael Hardy - intermittent edits since 21 April 2003
So yes, we can agree that the article as a whole is badly written. Indeed badly written. Gosh, yes, badly written. Perhaps Michael Hardy would like to assist in doing something about it. SamuelTheGhost (talk) 15:23, 30 May 2009 (UTC)
The article is poorly written, because multiple editors with erroneous notions have been freely putting in half-baked or wacky ideas, such as the section "Regression toward everything" in a previous version. No editor with a rigorous understanding of the concept of RTM has stepped in to tidy up the mess.
What is "wacky" or "half-baked" about regression toward everything? Do you disagree with the cited sources, or feel that the write-up doesn't summarize them properly?AaCBrown (talk) 17:05, 23 July 2010 (UTC)
I am not familiar with the research literature on RTM. My impression is that statisticians pay little attention to RTM. Most RTM research is done by social scientists, not always with an adequate grasp of the mathematics. I suspect this article is the cumulative fruit of many graduate students in the social sciences, and some baseball fans.
The modern mathematical statistical work on regression toward the mean is more likely to use topic words like "shrinkage," "experimental design" or "empirical bayes". It is an important concept with a lot of published work. Regression toward the mean is also an important historical concept. I think this article should stick to those two aspects, modern mathematical understanding and historical (understanding and misunderstanding, both are important).AaCBrown (talk) 17:05, 23 July 2010 (UTC)
Michael Hardy believes that RTM is a theorem! That shows the pervasiveness of confusion. If RTM is a theorem, why is there no statement of the "RTM Theorem" in the article? It should be easy to source and quote the RTM Theorem.

RTM, according to the refereed article:
Myra L. Samuels, Statistical Reversion toward the Mean, The American Statistician, Vol 45, No 4 (November 1991), pp 344–346
is a statistical concept, just like "random variable", that needs a rigorous mathematical definition. One can then prove theorems concerning RTM.
I think the article lead should define the concept, perhaps with words amplifying the definition for the mathematically challenged.
Correct me if I am wrong, but I believe that RTM has absolutely nothing to do with Regression analysis or Linear regression. Some editors think otherwise, as shown by previous talks here, and by the fact that this article has "Category:Regression analysis".--Palaeoviatalk 17:20, 30 May 2009 (UTC)

Certainly it has something to do with linear regression. Galton famously identified the phenomenon in the course of examining a linear regression problem.

If one were to speak simply of a bivariate normally distributed random vector whose components are X and Y, the correlation between which is ρ > 0, then one can state a theorem: the conditional expected value of Y, given that X is u standard deviations above the expected value of X, is ρu standard deviations above the expected value of Y. With other joint distributions one would also have results. Michael Hardy (talk) 17:40, 30 May 2009 (UTC)
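A quick Monte Carlo sketch of this theorem (my own toy check, with an assumed ρ = 0.6): with both variables standardized, the least-squares slope of Y on X recovers ρ, so a one-standard-deviation excursion in X predicts only a ρ-standard-deviation excursion in Y.

```python
import random

# Simulate a standardized bivariate normal pair with correlation rho.
random.seed(1)
rho, n = 0.6, 200_000
xs, ys = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    xs.append(z1)
    ys.append(rho * z1 + (1 - rho**2) ** 0.5 * z2)   # corr(X, Y) = rho

# Least-squares slope of Y on X: cov(X, Y) / var(X).
mx = sum(xs) / n
my = sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
var = sum((x - mx) ** 2 for x in xs) / n
slope = cov / var            # estimates E[Y | X = x] / x

assert abs(slope - rho) < 0.02   # regression coefficient ~ rho < 1
```

Since 0 < ρ < 1, the fitted conditional expectation pulls every prediction toward the mean, which is exactly the link between linear regression and RTM discussed here.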

I'm puzzled by this assertion that regression toward the mean has nothing to do with linear regression. Regression toward the mean is about the conditional expected value of one random variable given another. That's what linear regression estimates. The person who removed the material about Galton also said these have nothing to do with each other. I wonder if those who think that can explain their position?

Looking closely at the material that Palaeovia added near the top, it actually is stated as a definition. I think one could state a general theorem, but I'd have to think about the details. Michael Hardy (talk) 18:53, 30 May 2009 (UTC)

RTM might be historically linked to Regression Analysis via Galton, but I see no intrinsic connection between the two. Though both deal with independent and dependent variables, Regression Analysis focuses on deriving a regression equation (with a "best fit"), whereas RTM says absolutely nothing about the regression equation. The connection between the two concepts seems very tenuous to me.--Palaeoviatalk 23:30, 30 May 2009 (UTC)
It seems that some editors who brought in "linear regression" here are talking about the special case of RTM where E[Y|X = c] = tc, with c > 0 and 0 ≤ t < 1, assuming mean 0. (Is there an alternative definition of RTM that requires linearity?) This behavior is called "linear regression". However, I had been thinking of "linear regression" as a subfield of "regression analysis".--Palaeoviatalk 00:52, 31 May 2009 (UTC)
Of course there is a general theorem. For "regression towards the mean" to make any sense at all, the situation must be such that the X and Y variables represent essentially the same population, or subsets of the same population. For example, in the father/son case there is an assumption that there is no general drift in characteristics across generations. Essentially all that is needed is for the standard deviations of the two variables to be the same, in which case the regression coefficient is the same as the correlation coefficient, which then implies that the regression coefficient is less than one in absolute value. If you add in the assumption that the means of the populations are the same then you get a full RTM. Obviously this does not rely on any particular statistical model holding, such as linearity of conditional expectation, or constancy of conditional variance, or marginal or conditional distributions. A linear regression can be constructed between any variables regardless of such characteristics ... it is simply a best linear predictor and does not deny the possibility that there might be a better predictor. Melcombe (talk) 09:50, 28 July 2009 (UTC)
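Melcombe's observation that equal standard deviations make the regression coefficient equal to the correlation coefficient can be illustrated numerically. A minimal sketch with invented father/son height parameters (equal means and spreads across generations, correlation 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
r, mu, sd = 0.5, 175.0, 7.0  # invented: correlation, mean height (cm), spread

# Construct sons so that mean and standard deviation match the fathers'.
father = rng.normal(mu, sd, n)
son = mu + r * (father - mu) + rng.normal(0, sd * np.sqrt(1 - r**2), n)

# Regression slope of son on father, and the correlation coefficient.
slope = np.cov(father, son, ddof=1)[0, 1] / np.var(father, ddof=1)
corr = np.corrcoef(father, son)[0, 1]
print(slope, corr)  # nearly equal, both about 0.5, and below 1 in absolute value
```

Since the slope is below one, a tall father's predicted son is tall, but less tall in standard-deviation terms, which is Melcombe's point.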
Melcombe says "Of course there is a general theorem". Could he or she please state it? SamuelTheGhost (talk) 11:45, 28 July 2009 (UTC)
Melcombe's comment makes absolutely no sense to me, but must seem intimidatingly authoritative to many. Let's have a rigorous statement of the theorem, and a source for a proof.--Palaeoviatalk 12:20, 28 July 2009 (UTC)
See Section 26.9 of Kendall MG & Stuart A (1973), The Advanced Theory of Statistics, Volume 2: Inference and Relationship, Griffin. ISBN 0-85264-215-6
I don't have a copy of Kendall & Stuart immediately to hand, so it would be more helpful to me and other editors if Melcombe could do as I asked before and actually state the theorem. Looking at his/her explanation above, it seems to me that what is really being asserted there is that the value predicted by linear regression exhibits RTM. This is very different from asserting RTM itself. SamuelTheGhost (talk) 14:57, 28 July 2009 (UTC)
Yes, I mean by regression towards the mean, the usual meaning. Not the version that has been hijacked into the start of this present version of the article. I see that previous proofs relating to the more basic meaning have been deleted from the article. You have made the mistake of taking one highly specialised meaning of the phrase and presented it as if it is the only meaning. Recognition is needed that the definition now given has been jury-rigged so as to be something that can lead to a "proof" of something that approximates someone's version of one interpretation of what the phrase "regression towards the mean" should mean. Melcombe (talk) 15:44, 28 July 2009 (UTC)
After all that, you still haven't told us what the "general theorem" is, nor what you consider the "usual meaning" of RTM to be. SamuelTheGhost (talk) 16:11, 28 July 2009 (UTC)
As an example of the kind of mathematics thankfully deleted from the current version, in this version of the article, we find the following definition of RTM:
Let $x_1, x_2, \ldots, x_n$ be the first set of measurements and $y_1, y_2, \ldots, y_n$ be the second set. Regression toward the mean tells us that for all i, the expected value of $y_i$ is closer to $\overline{x}$ (the mean of the $x_i$'s) than $x_i$ is. We can write this as:
$E(|y_i - \overline{x}|) < |x_i - \overline{x}|$
where $E(\cdot)$ denotes the expectation operator.
Is this the definition Melcombe defends?
Take the example of two sets of test scores. The above definition is not rigorous. It probably intends to let x1, x2, . . .,xn be the first set of scores, and let y1, y2, . . .,yn be the second set of scores. It tries, and fails completely, to capture the same idea as the current definition. You need conditional expectation to properly capture the notion of RTM.
The "proof" that followed is therefore useless, based on a faulty definition of RTM.
This so-called "definition" is wrong, besides being clearly original research (i.e., someone's half-baked idea).--Palaeoviatalk 17:10, 28 July 2009 (UTC)
In this version of the article, there is a rigorous statement of a relevant theorem:
If one assumes that the two variables X and Y follow a bivariate normal distribution with mean 0, common variance 1 and correlation coefficient r, then the expected value of Y, given that the value of X was measured to be x, is equal to rx, which is closer to the mean 0 than x since |r| < 1. If the variances of the two variables X and Y are different, and one measures the variables in "normalized units" of standard deviations, then the principle of regression toward the mean also holds true.
This example illustrates a general fact: regression toward the mean is more pronounced the weaker the correlation between the two variables, i.e. the smaller |r| is.
If a proper source can be cited, this theorem (requiring the assumption of a bivariate normal distribution) should be restored to the article.--Palaeoviatalk 17:40, 28 July 2009 (UTC)
This is an informal statement, without proof, of the theorem.--Palaeoviatalk 19:11, 28 July 2009 (UTC)

The above is largely irrelevant. The term "regression towards the mean" has been around a long time and, with the regression-based interpretation, has been discussed and justified in both statistical and general literature at various levels of technicality. The article certainly needs to reflect this interpretation and usage of the term. It may also need to discuss other variations of the same general idea, as possibly by replacing mean with median, so that the idea would be "reversion towards a norm" rather than "mean" in its strict stats meaning. It could then go on to outline the difficulties in turning the idea into a mathematical formulation that can be carried forward to provide a proof. This could well make a distinction between the regression-based interpretation and a more strict requirement on the conditional expectation. It could also note that conditional expectation is not the only other possible formal interpretation of what is informally called "regression toward the mean". You may be correctly copying a definition from the Samuels paper, but you should have noted that she specifically noted that this was just one interpretation of the term "regression towards the mean" and that she deliberately did not say that this definition was "regression towards the mean". The Samuels paper has some useful references on the difficulty of giving a conditional expectation interpretation to the "regression towards the mean" idea, and something useful might be extracted from these. Overall there is no justification for hijacking a reasonably reader-friendly article, which at least gave a reasonable general discussion of the idea as it exists in the real world, by imposing one of many possible mathematical interpretations at the outset. Melcombe (talk) 10:13, 29 July 2009 (UTC)

If researchers (especially social scientists) have been using the term RTM mostly as a vague concept, with no mathematical content, then the article can make this clear. I would support your proposal to include the historical development of the concept of RTM, and the current state of affairs in its mathematical formulation.
There's a third way here as well. Some people use RTM as a vague concept, some have a specific model in mind. But it is also a general principle, precise and mathematical, but not specific. I'd call this the "data analysis" or "empirical Bayesian" interpretation. Rather than arguing over which of these is most important, or worse trashing the ones we don't like, the article should clearly delineate them and explain each (with references). Let the reader decide which ones are stupid.
• RTM is important in the history of social science research, this should be covered in detail
• RTM is a general name for specific theorems in regression analysis and related work, these make a lot of specific assumptions that quickly take us beyond what is appropriate for this article, they should be mentioned and linked to specific articles
• RTM is a phenomenon observed in practical data work that requires no specific assumptions to exploit, this should be discussed in detail
• RTM is a general name for specific topics in modern mathematical statistics, including shrinkage, empirical Bayesianism and distribution-free estimation. Again, these quickly get beyond the level of this article and should be mentioned and linked to specific articles.
AaCBrown (talk) 17:05, 23 July 2010 (UTC)
Any reasonable definition of RTM is better than none in the article. A reasonable definition of RTM in the research literature is better than a half-baked, faulty definition that confuses everyone.
I disagree. RTM is a phrase with several related definitions, each important to different fields. Insisting on a single precise definition will shortchange either history, practice or theory. People come to this article having run across the phrase in different contexts, it would be a great help to distinguish among them. Even the ones you think are foolish should be mentioned if they appear in common sources.AaCBrown (talk) 17:05, 23 July 2010 (UTC)
As long as mathematical claims concerning RTM are proved in the research literature, and included in the article, they must be stated rigorously, with all assumptions clearly spelled out.
I don't know which previous versions of the article were so superior in your view. The version that preceded the inclusion of the Samuels definition was a total disaster in my view. (See discussions above.)
There is no consensus in how to mathematically define RTM. But that is no excuse to have an article full of pseudo-mathematics that confuses the mathematically literate, and crackpot ideas that confuse everyone.--Palaeoviatalk 11:28, 29 July 2009 (UTC)
My suggestion for a way forward is: (i) remove the maths from the present lede to a new section towards the end and after the more general discussion; (ii) remove the rest of the present lede and replace it with the 1-sentence lede and "example" section from the start of the version identified as by Michael Hardy at 19:24, 21 July 2008.
I have access to 2 stats dictionaries that can be used as references for a general (non-maths) definition of the term. I don't have access to the two references in the present lede numbered 1 and 2 that might be expected to be good sources for the meaning of the term, so unless someone pipes up soon, these would go. We could then extend the lede to properly summarise the article's contents, including mention of the possibilities for mathematical formalisation that would be included later.
Let's aim, as the next step after that, to have a regression-based formalisation and a conditional-expectation formalisation ... both presented in adequate mathematical detail towards the end. But perhaps the ideas would need to be introduced earlier if the present large section on "regression fallacies" is to be made relevant to anything.
If the main part of what has been deleted is only maths-stuff rather than relating to the general idea of regression towards the mean, then it is probably better to forget it and to only add what is really needed. It is better to have a citation for a proof not explicitly included, rather than a supposed explicit proof given without citation. I am not clear whether the chunk of maths at the end of the present article is supposed to be a proof of something relevant, but if a citation can't be given it is technically original research and so should be deleted.
Melcombe (talk) 13:38, 29 July 2009 (UTC)
This all looks constructive. Just a few points arising:
• I've now checked the Kendall & Stuart reference. It is as you say, but just mentioned as a historical curiosity. That interpretation of RTM is very weak, since it is a statement about the value predicted by the regression equation. The regression equation itself has no particular validity except where the population is bivariate normal, and this is very seldom shown to be the case.
• I've got a copy of the Raiffa and Schlaifer book given as a reference in the present lede, but unfortunately I can't find the relevant bit. RTM is not mentioned in the contents page, and there is no index. I'll keep looking, but RTM is certainly not made much of.
• The "chunk of maths at the end of the present article" is indeed "supposed to be a proof of something relevant". I'd be grateful if you could look at it properly; if you do you'll see that it presents a result which could be viewed as yet another variant of RTM, although not identical to either of the other versions we're discussing. Its strength is that it holds under very wide conditions. I developed it as my attempt to prove an RTM principle, but was slightly surprised how it turned out. I still have a hope that the Samuels version could be derived from it (basically by restating as an equation then differentiating with respect to c), given a few extra assumptions. The accusation of WP:OR is of course currently correct. I'm just astonished that nobody else has ever come up with it before. If anyone reading this knows of a previous development, I'd then argue strongly for its inclusion with that attribution. Otherwise I have to admit that eventually it must go.
SamuelTheGhost (talk) 16:14, 29 July 2009 (UTC)
Postscript (written later). Having now read the visible part of Samuels' paper properly, I'd like to re-express the last four sentences above as follows.
I'm grateful to Palaeovia for adding the url of Samuels' paper, and drawing attention to its contents. I'm pleased to note that Samuels' "reversion toward the mean" is identical in formulation to my WP:OR version, so we can now include it with a clean conscience. Samuels' proof is not visible from the displayed page, but can hardly be much different from the one I gave, since it's so simple. I still have a hope that the other version, Samuels' "regression toward the mean" could be derived from it (basically by restating as an equation then differentiating with respect to c), given a few extra assumptions. SamuelTheGhost (talk) 15:44, 30 July 2009 (UTC)
The Samuels proof looks shorter, possibly only due to using indicator variables. Since you indicate that you don't have access to the paper, you might like to know that Samuels refers to an earlier proof dating back to 1983, which shows that the general idea of this type is not new. For info: McDonald et al. (1983) How much of the placebo effect is really statistical regression. Statistics in Medicine, 2, 417-427. Melcombe (talk) 15:54, 30 July 2009 (UTC)
My interpretation of the Kendall & Stuart ref is that they assume that "regression to the mean" only considers what happens to the simple regression between two variables. This is a view shared by the Bland & Altman reference. It seems that the maths you contributed relates directly to the variant discussed in the Samuels paper, as noted below by Palaeovia. I have altered part of the text to reflect this. It would be worth comparing your stuff with what is in the Samuels paper to see whether improvements or shortenings can be made. Melcombe (talk) 10:39, 30 July 2009 (UTC)
I agree with Melcombe's plan.
What an improbable coincidence: The point of the Samuels paper that provided the current definition of RTM is to propose an alternative notion of reversion toward the mean, and prove its universality. And reversion toward the mean is what SamuelTheGhost (note Samuel, Samuels) formulated. The proof is different, though.--Palaeoviatalk 17:18, 29 July 2009 (UTC)

SamuelTheGhost gives us an entertaining bit of sarcasm (the part where he uses the word gosh—and that in mixed company). Although I haven't attended to this particular article lately, I think I can fairly say I've contributed quite a lot towards Wikipedia's articles on statistics and many of its other articles.

Let me state a precise theorem in the bivariate normal context: Suppose the pair (X, Y) has a bivariate normal distribution with correlation r. Then the conditional expected value of Y, given that X is t standard deviations above its mean (and that includes the case where it's below its mean, when t < 0), is rt standard deviations above the mean of Y. Since |r| < 1 (excluding the degenerate case of perfect correlation), that means it's closer to the mean, as measured in the number of standard deviations.

Of course, one ideally should treat more general situations than the multivariate normal.

Notice that we spoke of the correlation between two random variables, and the conditional expected value of one of them given the other. Estimating that conditional expected value is what linear regression does. So it's weird to say it has nothing to do with linear regression. Michael Hardy (talk) 02:18, 30 July 2009 (UTC)

## Is deleted text "Regression toward everything" utter nonsense?

To facilitate discussion (initiated above by AaCBrown) of whether the following deleted text (written by AaCBrown) is utter nonsense, I am starting a new section.

== Regression toward everything ==
Notice that in the informal explanation given above for the phenomenon, there was nothing special about the mean. We could pick any point within the sample range and make the same argument: students who scored above this value were more likely to have been lucky than unlucky, students who scored below this value were more likely to have been unlucky than lucky. How can individuals regress toward every point in the sample range at once? The answer is each individual is pulled toward every point in the sample range, but to different degrees.
For a physical analogy, every mass in the solar system is pulled toward every other mass by gravitation, but the net effect for planets is to be pulled toward the center of mass of the entire solar system. This illustrates an important point. Individuals on Earth at noon are pulled toward the Earth, away from the Sun and the center of mass of the solar system. Similarly, an individual in a sample might be pulled toward a subgroup mean more strongly than to the sample mean, and even pulled away from the sample mean. Consider, for example, the pitcher with the highest batting average in the National League by the All-Star break, and assume his batting average is below the average for all National League players. His batting average over the second half of the season will regress up toward the mean of all players, and down toward the mean of all pitchers. For that matter, if he is left-handed he is pulled toward the mean of all left-handers, if he is a rookie he is pulled to the mean of all rookies, and so on. Which of these effects dominates depends on the data under consideration.
The concept does not apply, however, to supersets. While the pitcher above may be pulled to the mean of all humans, or the mean of all things made of matter, our sample does not give us estimates of those means.
In general, you can expect the net effect of regressions toward all points to pull an individual toward the closest mode of the distribution. If you have information about subgroups, and the subgroup means are far apart relative to the differences between individuals, you can expect individuals to be pulled toward subgroup means, even if those do not show up as modes of the distribution. For unimodal distributions, without strong subgroup effects or asymmetries, individuals will likely be pulled toward the mean, median and mode which should be close together. For bimodal and multimodal distributions, asymmetric distributions or data with strong subgroup effects, regression toward the mean should be applied with caution.

This text cited no sources. (AaCBrown's statement that "the original section was well-documented with citations to top academic journals and textbooks." is false.) Now, what is the defence for this text?--Palaeoviatalk 18:20, 23 July 2010 (UTC)

Before expending precious effort to show that this text is patent rubbish, I would like to establish that it is not AaCBrown's original research. (Wikipedia:No original research) It should be from a reputable source.--Palaeoviatalk 18:39, 23 July 2010 (UTC)

For the benefit of those who need help in seeing why the text is utter and unadulterated rubbish, let me explain. The following opening sentences are wrong.

Notice that in the informal explanation given above for the phenomenon, there was nothing special about the mean. We could pick any point within the sample range and make the same argument: students who scored above this value were more likely to have been lucky than unlucky, students who scored below this value were more likely to have been unlucky than lucky.

There are rigorous mathematical reasons for the phenomenon of "regression towards the mean" (RTM), and they are specific to the mean, and you cannot replace "the mean" by "any point in the sample range" in the mathematical argument and still have a correct argument.

Informal arguments are not proofs, and only serve to aid understanding. If the word "mean" does not appear in an informal argument, it does not follow that "the mean" plays no role in the rigorous proof.

"A little learning is a dangerous thing; drink deep, or taste not the Pierian spring: there shallow draughts intoxicate the brain, and drinking largely sobers us again."--Palaeoviatalk 23:30, 23 July 2010 (UTC)

The bogus citation discussed in the next section suggests that a crackpot or a prankster might be about.--Palaeoviatalk 03:09, 24 July 2010 (UTC)

It's not exactly a controversial topic, and you asked if there were objections. Did you not want any answers? I don't want to get in a flame war, and I'm not going to insist on including this material. I don't think you're stupid or insincere, I respect your points. We approach this topic from different perspectives, but there's no reason we can't agree on the article. For the record, I will answer your questions, but the issue here is not whether I am a fraud or an idiot, it's whether this material belongs in the article.

Going in reverse order. If you read past the abstract above, you will find the example I was referencing. The "multiple points in a subspace" are the batting averages of players by position. The article develops a general technique for regressing an individual observation (in the example, the batting average of a pitcher at the All-Star break) toward a point other than the mean of all observations (in the example, the batting average of all players). For full disclosure purposes, Ed George was my dissertation advisor, and the material I wrote is based on some work by Ed's dissertation advisor, Charles Stein. It is not original work by me, but I admit one reason I would like it in there is to recognize Charles and Ed. I understand that's not a valid argument for inclusion.
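The subgroup effect described here (the best-hitting pitcher at the break regressing toward the pitcher mean rather than the all-player mean) can be reproduced in a small simulation. All batting numbers below are invented for illustration; the sketch selects the best pitcher by first-half average in each simulated season and looks at his second-half average:

```python
import numpy as np

rng = np.random.default_rng(2)
sims = 20_000

# Invented numbers: pitchers bat worse on average than players overall.
pitch_mu, all_mu = 0.15, 0.26
ability = rng.normal(pitch_mu, 0.02, (sims, 50))  # 50 pitchers per season
noise = lambda: rng.normal(0, 0.03, (sims, 50))   # half-season sampling noise
first = ability + noise()
second = ability + noise()

# Best-hitting pitcher at the break, per simulated season.
best = first.argmax(axis=1)
rows = np.arange(sims)
f = first[rows, best]
s = second[rows, best]

print(f.mean())  # well above the pitcher mean 0.15
print(s.mean())  # pulled back toward 0.15, still below the all-player mean 0.26
```

Under these assumptions the selected pitcher's second-half average falls back toward the pitcher mean, not the all-player mean, which is the net-effect point the deleted text was making.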

You write that there are rigorous mathematical arguments for regression toward the mean. I agree. I further agree that you cannot replace the mean with any other point in the range in those arguments. However there are also rigorous arguments for regression toward the median. One difference is the argument for the mean refers to the expected deviation from the mean on the second observation, while the argument for the median refers to the probability of being closer to or farther away from the median. Another difference is the argument for the mean requires much stronger assumptions (such as that the distribution has a mean). This is just the difference between parametric and nonparametric statistics.

As you weaken assumptions further you get more rigorous mathematical arguments that imply regression toward other points, or multiple points. I think those arguments are beyond the level required in this article, but there is no reason to exclude the fact of their existence.

I think our biggest point of departure is you think of RTM as one specific mathematical result, whereas I think of it as a general principle in data analysis, one that does not require specifying a distribution or complete information, much less assuming a symmetric unimodal finite-mean distribution. I don't say you're wrong, after all the phrase has the word "mean" in it. I do think it makes a more logical explanation, given modern statistical understanding, if explained in the more general way. Also, the practical utility of the concept is mainly in experimental design, and that application requires no strong assumptions.

You say it's wrong that the "explanation given above" does not depend on the mean. You don't quote the explanation, but if you look it up you will see the statement is true. I think you read the word "informal" as "non-rigorous." I did not mean that. The informal explanation (which I also wrote) was entirely rigorous. It was "informal" because it did not use mathematical notation, and thus was not precise. I think that's appropriate for the introduction as it gives the intuition without scaring off people who don't like complicated notation. The rigorous version was given in a separate section.

Finally, I'm not sure why the text you quoted has the citations removed. When I originally wrote it, it cited Stein and George. In any case, if you agree to put it back in, I'll be happy to restore the citations.

More important, I think everyone will agree this article is bad. There's no need to assign blame, collectively the people who have worked on this (including but not limited to you and me) have created something unworthy of Wikipedia and the topic, mainly because we can't get along.

I think the only solution is for one person to redo it. There's no need to write new material, I think we can take it that every relevant point has been written at one time by someone or other. The hypothetical new editor should just search the history and include the best version of every coherent point backed by adequate citations. If there are contradictory points, list both side-by-side for the reader to decide. The only writing should be for transitions and stylistic consistency. It's possible the topic should be broken up, say into RTM (History of genetics), RTM (Regression analysis), RTM (Experimental design) and so on. Or maybe some of those should be sections in other articles (like Genetics, Regression or Experimental Design) with only links from this article.

I know it's a big job. I'm willing to do the work, but that is probably unwise as it might just provoke a new round of unfriendly edits. Perhaps someone will volunteer who knows the subject well and has not been involved in previous controversies.

AaCBrown (talk) 01:02, 25 July 2010 (UTC)