User talk:William M. Connolley/Trend estimation

Such a distribution will be normal (is this right; or at least approximately; or tend to normality with increasing number of trials?)

Yes, it should be normal. The expression for the trend ( ss_xy / ss_xx ) is linear in the y_i. (In this least-squares formulation, the x_i are just constants and the y_i are normally distributed variables.) A linear combination of normally distributed variables is itself normally distributed.

--Trainspotter 15:42, 6 Jan 2004 (UTC)

There are also pages on Statistical hypothesis testing and Least squares - they need work, but worth being aware of. --Trainspotter 15:54, 6 Jan 2004 (UTC)

Thanks (for reminding me to get on and finish my writing, apart from anything else). I knew of LS - as it says, in need of work - but not SHT (WMC).
w.r.t. normal: I think I was asking a different question, which has nothing to do with linearity. "trend" is a new variable, not linearly related to other things, so its distribution needs to be found. I suspect it's normal, though, for "normal" data. For unusual data (e.g., quantised) it clearly won't be normal.

We may be slightly at cross-purposes. All I meant was to confirm what you suspect, namely that for "normal" data the trend will be normal.

I wasn't trying to consider the case of non-normal data. In your example, each of the trend values is a function of the 100 y_i values in the series which generated it. It so happens that the form of that function (as given by the ss_xy/ss_xx expression) is linear in the y_i, which is useful because the distribution of any linear combination of normal variables is itself normal.

Hold on, no. ss_xy/ss_xx is indeed linear in scaling *all* y_i by a constant, but not (clearly) in scaling individual y_i's. Hence "linear combination of normal variables is itself normal" is irrelevant. Is that correct? I think so.
The denominator ss_xx is a constant, so for simplicity just look at ss_xy. This is defined as sum((x-xbar)(y-ybar)), which equals sum((x-xbar)y) because sum((x-xbar)ybar) = ybar*sum(x-xbar) = 0. So it is just a linear combination of the y_i. Of course the coefficient of y_i (i.e. x_i-xbar) does not generally equal the coefficient of y_j (as expected: the points nearest the middle of the timeseries have the least effect on the trend). But that does not matter. From the following two well-established results:
  • Given normal variate A and constant x, xA is a normal variate.
  • Given independent normal variates B and C, B+C is a normal variate.
it follows that any linear combination of independent normal variates is normal, even if the coefficients differ. (See below for discussion of independence.)
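As a rough numerical check of the argument above, the following sketch computes the trend both as ss_xy/ss_xx and as an explicit weighted sum of the y_i, with weights (x_i-xbar)/ss_xx. The function names, the seed and the choice of 100 unit-variance normal points are assumptions made for illustration only.

```python
import random


def slope_direct(x, y):
    """Least-squares trend computed as ss_xy / ss_xx."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    return ss_xy / ss_xx


def slope_weights(x):
    """Coefficients w_i such that the trend equals sum(w_i * y_i)."""
    n = len(x)
    xbar = sum(x) / n
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    return [(xi - xbar) / ss_xx for xi in x]


rng = random.Random(0)                      # fixed seed, illustrative only
x = list(range(100))                        # 100 equally spaced time points
y = [rng.gauss(0.0, 1.0) for _ in x]        # independent normal point values

w = slope_weights(x)
slope_linear = sum(wi * yi for wi, yi in zip(w, y))

# The two expressions agree up to floating-point rounding.
print(slope_direct(x, y), slope_linear)

# The weights are smallest in magnitude near the middle of the series.
print(abs(w[0]), abs(w[49]), abs(w[99]))
```

Because the weights depend only on the x_i, the trend is a fixed linear combination of the y_i, and for independent y_i with common variance sigma^2 its variance is sigma^2 * sum(w_i^2) = sigma^2 / ss_xx.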
Hmm, I think I am obliged to agree with you and your interpretation. I'm not used to thinking of trends in this way. Interesting.

But let's go further and consider the case of non-normal data (e.g. an index of ENSO). As you also allude to, the Central Limit Theorem says that the distribution of a linear combination of many independent non-normal variables (with finite variance, and with no single term dominating) will approach normality. The relevant number here for the CLT is the number of points in the series, and in your example this is 100, which should be plenty to assume approximate normality of the trend.
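A small simulation can illustrate this. The sketch below draws point values from a uniform distribution (chosen here only as a convenient non-normal example, not taken from the article), computes the trend of many 100-point series, and looks at the skewness and excess kurtosis of the resulting trends, both of which are zero for a normal distribution. The trial count, seed and helper names are assumptions.

```python
import math
import random


def trend(x, y):
    """Least-squares trend ss_xy / ss_xx, using sum((x - xbar) * y) for ss_xy."""
    n = len(x)
    xbar = sum(x) / n
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_xy = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    return ss_xy / ss_xx


def moments(values):
    """Return mean, standard deviation, skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


rng = random.Random(1)       # fixed seed, illustrative only
n_points = 100               # points per series (the number relevant to the CLT)
n_trials = 5000              # arbitrary; only controls how smooth the histogram is
x = list(range(n_points))

# Each trial: a series of independent uniform(0, 1) values with no true trend.
trends = [trend(x, [rng.random() for _ in x]) for _ in range(n_trials)]
mean, sd, skew, excess_kurtosis = moments(trends)

# Theoretical standard deviation of the trend: sqrt(var(y) / ss_xx), var(y) = 1/12.
xbar = sum(x) / n_points
ss_xx = sum((xi - xbar) ** 2 for xi in x)
expected_sd = math.sqrt((1.0 / 12.0) / ss_xx)

print("sd of trend:", sd, "expected:", expected_sd)
print("skewness:", skew, "excess kurtosis:", excess_kurtosis)
```

The uniform distribution itself has excess kurtosis -1.2, so values near zero for the trend suggest that 100 points are already enough for the trend to look close to normal in this setting.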

I think that your mention of "increasing number of trials" may be confusing, as the large number of trials (100,000 in your example) is arbitrary and is just a way of describing what is meant by the trend having a particular distribution.

As regards the 100,000 points being arbitrary, what I mean is this: of course if you have a normally distributed variable, take a sample of 100,000 trials and plot a histogram, it will not look exactly like a normal distribution. As you have more points it will look closer to a normal distribution. But this is just what is meant by the variable having the specified distribution (that the histogram resembles the underlying distribution/PDF as the sample size becomes very large). There is nothing special about a normal distribution in this regard; the same would apply for an exponential distribution or uniform distribution or whatever... What is special about a normal distribution is that if the variable (e.g. the trend) is itself calculated as a linear combination of a large number of other (independent) random variables (e.g. the point values) then it will have close to a normal distribution regardless of the form of the distributions of those variables, e.g. the trend in El Nino index.
Yes, 100,000 is arbitrary. I hoped that was obvious from the text... wiki is slow now, I'll check it later.
((Note added later -- Just spotted from the last para of the article that William has already noted the issues raised in the following 3 paras of my comment. I'll leave them here, but with that caveat.))
Now I said that the above result depends on independent point values. Basically that is the "white noise" null hypothesis (no correlation between temperatures in successive years). But in some ways that's a rather weak null hypothesis; rejecting it is fairly easy but maybe not particularly interesting.
The interesting null hypothesis is one which allows for low-frequency temperature variability but no overall trend. In that case it is still possible to assume a normal distribution for the trend under the null hypothesis, but with considerably greater variance than in the white-noise case, because the effective number of degrees of freedom in the timeseries is smaller than the number of time points. That null hypothesis is therefore harder to reject. Of course the tricky question is by how much the degrees of freedom should be reduced.
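To give a feel for the size of this effect, the sketch below compares the spread of trends from white-noise series with the spread from series generated by a first-order autoregressive (AR(1)) process. AR(1) is used here only as one simple stand-in for "low-frequency variability", and the coefficient phi = 0.6, the trial count and the function names are all assumptions, not something taken from the article.

```python
import math
import random


def trend(x, y):
    """Least-squares trend ss_xy / ss_xx."""
    n = len(x)
    xbar = sum(x) / n
    ss_xx = sum((xi - xbar) ** 2 for xi in x)
    ss_xy = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    return ss_xy / ss_xx


def white_series(rng, n):
    """Independent normal values with unit variance."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def ar1_series(rng, n, phi):
    """AR(1) values y_t = phi * y_(t-1) + noise, scaled to unit marginal variance."""
    scale = math.sqrt(1.0 - phi * phi)
    y = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + scale * rng.gauss(0.0, 1.0))
    return y


def sd(values):
    """Population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


rng = random.Random(2)       # fixed seed, illustrative only
n_points = 100
n_trials = 3000
phi = 0.6                    # assumed lag-1 autocorrelation
x = list(range(n_points))

white_trends = [trend(x, white_series(rng, n_points)) for _ in range(n_trials)]
red_trends = [trend(x, ar1_series(rng, n_points, phi)) for _ in range(n_trials)]

sd_white = sd(white_trends)
sd_red = sd(red_trends)
ratio = sd_red / sd_white

print("sd of trend, white noise:", sd_white)
print("sd of trend, AR(1) noise:", sd_red)
print("ratio:", ratio)
```

Both kinds of series have the same point-by-point variance and no true trend, yet the trends from the correlated series are spread more widely. For a long AR(1) series the standard deviation of the trend is inflated by a factor of roughly sqrt((1+phi)/(1-phi)), which is 2 for phi = 0.6; this is the same as saying the effective number of independent points is reduced by about (1-phi)/(1+phi).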
Of course it is mathematically possible to conceive of some dependence between the y_i of a really pathological form (rather than simply low-frequency variability) which breaks normal statistics completely rather than just reducing the DOF, but I think that needn't overly concern us for practical purposes.

--Trainspotter 17:53, 6 Jan 2004 (UTC)

OK, I think we're in agreement now. Good.

Cool. I don't think I'll log in for a while as I've been spending too long with it recently. Best wishes + good luck with what should be a very useful article. --Trainspotter 17:11, 7 Jan 2004 (UTC)

Confidence intervals

Are you planning to discuss the calculation of prediction confidence intervals? —James S. 23:36, 1 January 2006 (UTC)

Probably not in the near future. You'd be better off posting on the talk of the wiki page, though: this is only my own personal page I used for working up the article. William M. Connolley 11:20, 2 January 2006 (UTC).
Is this not the talk page? Which one are you pointing me to? —James S. 23:34, 2 January 2006 (UTC)
See the bit in red at the top of my version. The real version is now Trend estimation. William M. Connolley 11:18, 3 January 2006 (UTC).