Talk:Unbiased estimation of standard deviation

From Wikipedia, the free encyclopedia


Added section on autocorrelated data

I have a pretty good background in applied stat, and lots of reference books, and I'm a member of ASA and can do online ASA journal searches, but with all of that I have never seen these bias equations anywhere other than in Law and Kelton. To derive them from Anderson isn't that tough, but finding which expressions in Anderson to use isn't so simple (which is why I put the equation numbers in the references).

What I'm not clear about, and deliberately slid over in this effort, is what is the effect of taking the square root of these bias expressions. Is the resulting PDF still chi? I've been doing some sims lately in support of some ANSI/IEEE nuclear standards development, and there is still a bit of bias that these expressions don't take out. I was already aware of the chi PDF and the small-N correction, but I could use some help in seeing how to apply that sort of transformation to the autocorr case. Any info would be appreciated, and of course should be added to this article.

Given that intro texts don't deal at all with autocorr data, and that such data is common, there needs to be some treatment of the subject somewhere in Wikipedia. Rb88guy (talk) 20:46, 12 January 2009 (UTC)

There are a number of points here:
  • This material might eventually be better located in a separate article that would be more obviously relevant to time series, particularly as the problem starts at the stage of estimating the variance, not the standard deviation.
  • The material presently here does not rely on the assumption of a normal distribution, and none is stated. If a chi-squared dist were appropriate to the autocorrelated case this would be a necessary requirement, so care would be needed in specifying assumptions. However, I believe the chi-squared dist does not hold, even if the true correlations were known, and certainly not if they are estimated from the data. It looks possible to get a formula for the variance of the estimated variance (at least if a normal distribution is assumed), which would be some guide to whether a chi-squared works.
  • There are other ways of estimating the variance of the sample mean which don't start with the ordinary sample variance; see, for example, Moran, P. A. P. (1975), "The estimation of standard errors in Monte Carlo simulation experiments", Biometrika, 62: 1–4. Also, I think an estimate can be found via spectral analysis, by estimating the spectral density at a frequency of zero.
Melcombe (talk) 16:16, 14 January 2009 (UTC)
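The point that the bias already appears at the variance stage can be checked with a quick simulation. This is my own sketch, not anything from the article or from Anderson; the AR(1) model, the parameter values, and the function name are all illustrative choices:

```python
import random
import statistics

def ar1_series(n, phi, rng):
    """AR(1): x_t = phi*x_{t-1} + e_t with e_t ~ N(0, 1), started from the
    stationary distribution, whose variance is 1/(1 - phi**2)."""
    x = rng.gauss(0.0, 1.0) / (1.0 - phi * phi) ** 0.5
    series = []
    for _ in range(n):
        series.append(x)
        x = phi * x + rng.gauss(0.0, 1.0)
    return series

rng = random.Random(42)
n, phi, reps = 10, 0.7, 20_000
true_var = 1.0 / (1.0 - phi * phi)          # stationary variance, ~1.96
# Ordinary sample variance (n-1 divisor), averaged over many replicates:
mean_s2 = sum(statistics.variance(ar1_series(n, phi, rng))
              for _ in range(reps)) / reps
# With positive autocorrelation, the average s^2 falls well short of the
# true variance -- before any square root is taken.
print(f"true variance {true_var:.3f}, average s^2 {mean_s2:.3f}")
```

With these (arbitrary) settings the ordinary s² underestimates the stationary variance by roughly a third, which is why the standard-deviation question cannot be settled without first treating the variance.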

A few minor tweaks

Sorry, forgot to mark a couple of these as minor. Adjusted equation spacing, indents. Changed my "N" to "n" as used in previous material. Rb88guy (talk) 15:49, 13 January 2009 (UTC)

Added plot of c4 vs n

And added a caption on earlier graph. Rb88guy (talk) 18:23, 13 January 2009 (UTC)

Added calcs for variance of mean

Using the observed (sample) variance. Thanks to Melcombe for catching my error in the Var[x-bar] expression, and fixing it. I have some nice R graphics that show how this stuff works only in the mean, that is, that the expected-behavior curve(s) pass through the mean values of many thousands of replicates of the various calculations. (There's a lot of scatter.) But it takes a lot of words to describe what's going on in the graphs, so I didn't think that was appropriate. On the other hand, when I do the same sims using the std dev, not the variance, there is still a bit of bias left. As a practical matter for someone trying to calibrate an instrument, removing almost all the bias in the std dev is presumably better than being off by a factor of two... Anyway, that part of this needs more work, and when something sensible is available it should be added here, to complete this section.

Also, the only thing about moving this to a TSA section is that lots of folks who need to be aware of this autocorr bias problem wouldn't think to look at TSA. They might think "How is calibrating this instrument a time series problem? It's just a pile of numbers." Assuming they even know what TSA is or what it can do for (or to) them. Rb88guy (talk) 20:24, 16 January 2009 (UTC)

This is somewhat in danger of becoming original research, which is not allowed here. However, you have put in refs for the results quoted, so that should be OK. To go further, one would need to consider that estimating the standard deviation unbiasedly is not central to the usual run of statistical theory, and that there may well be good reason for this. Would a better way of treating your "trying to calibrate an instrument" example be to say that what is wanted is a good interval estimate for the mean (i.e. a confidence interval, or whatever terminology is appropriate)? Looking directly at how to define the limits for the CI would combine the idea of getting the "right" estimate for the variance or standard deviation with the "adjustment for sample size" entailed in the use of limits derived from the Student-t distribution. Indeed, in the uncorrelated case, the use of the Student-t distribution, instead of the normal distribution, might itself be thought of as making a correction for bias in the estimated standard deviation. If the real use of the standard deviation is to construct such CIs, you may be better off aiming your simulations at the properties of the CIs rather than the estimated standard deviations. If the CI is the real use, then a better home for this stuff might be in an article about CIs for the mean. Don't forget it is possible to put in links from several other articles to point to the right place, wherever it is. Melcombe (talk) 11:34, 19 January 2009 (UTC)
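The t-versus-normal point above can be made concrete with a small sketch. The sample values are invented, and the two critical values are the standard table entries for the 97.5% quantiles (normal, and Student-t with 9 degrees of freedom), hardcoded here rather than computed:

```python
import math
import statistics

# Hypothetical calibration readings (n = 10), purely for illustration:
sample = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 4.9, 5.1, 5.0]
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)          # ordinary (n-1 divisor) std dev

z_975 = 1.960       # standard normal 97.5% quantile
t_975_df9 = 2.262   # Student-t 97.5% quantile, 9 degrees of freedom

half_z = z_975 * s / math.sqrt(n)     # half-width if sigma were known
half_t = t_975_df9 * s / math.sqrt(n) # half-width using estimated s
print(f"normal half-width {half_z:.4f}, t half-width {half_t:.4f}")
```

The t interval is about 15% wider at n = 10, which is the "adjustment for sample size" compensating for the uncertainty (and downward bias) in s.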
I agree about the research thing; I was hoping someone would know of a reference that takes this from variance to std dev, in the presence of autocorr. If that doesn't exist, I have to say that deriving it is, most likely, beyond my capabilities in stat, but if I did come up with something, then certainly that would be publishable (in a journal, not here). The measurement context that I'm thinking of is calibration of monitoring instruments, particularly for detection limits. There, the std dev of a mean isn't the issue, it's the std dev of the population of filtered, hence autocorrelated, measurements themselves. That std dev is used in Min Detectable Conc calcs. If autocorr isn't accounted for, then the calculated MDC will appear to be way smaller (better) than it really is. So, the remaining issue is that E[s] is not equal to SQRT( E[s^2] ), otherwise we could just take the square root of the "s^2" (and "Var[x-bar]") expressions in the article and everyone could live happily ever after...;) Rb88guy (talk) 16:17, 19 January 2009 (UTC)

Added material on estimating std devs

Well, I just felt that something needed to be added to bring this stuff back to the std dev from the variance, and it also ties back into the first (original) section (c4). Yes, I suppose some of this is "OR", but it must exist somewhere in the stat literature; this cannot possibly be novel. I'm hoping someone will know a reference for what I called here, or maybe, if it actually doesn't already exist, someone will research it and publish it, and then that can be referenced here. In other words I won't struggle over the exact stuff I put here, but I do think there needs to be some discussion of the issue (that E[s] ≠ sqrt(E[s^2])). Consider my addition a strawman... Rb88guy (talk) 20:57, 30 January 2009 (UTC)
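The E[s] ≠ sqrt(E[s^2]) point is easy to verify by simulation, at least in the plain i.i.d. normal case. A sketch of my own (seed, sample size, and replicate count are arbitrary): s² averages to σ² as it should, yet the average of s lands near c4(5)·σ ≈ 0.94σ, because the square root is concave (Jensen's inequality).

```python
import math
import random
import statistics

rng = random.Random(123)
n, sigma, reps = 5, 1.0, 100_000

sum_s, sum_s2 = 0.0, 0.0
for _ in range(reps):
    sample = [rng.gauss(0.0, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # n-1 divisor: unbiased for sigma^2
    sum_s += math.sqrt(s2)
    sum_s2 += s2

mean_s, mean_s2 = sum_s / reps, sum_s2 / reps
# mean_s2 comes out near 1.0, but mean_s near 0.94 -- the c4 gap:
print(f"E[s^2] ~ {mean_s2:.4f},  E[s] ~ {mean_s:.4f}")
```

So even a perfectly unbiased variance expression leaves a residual bias in the std dev after the square root, which matches the leftover bias seen in the autocorrelated sims described above.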

Small stuff changed, however

I think the tone of the intro is too negative: why not just delete the entire article that, apparently, is full of stuff no one even uses? I could see that, maybe, for the small c4 (and c2, which should also be put in here, along with some material on the Helmert PDF) correction, but the autocorr correction is significant. Anyway, as time permits I may try to come up with a more hopeful intro that might even encourage someone to read this article... PS: I've been dabbling with an article on where the c2, c4 factors come from; you might want to take a peek at the raw, far-from-finished stuff I have in my sandbox. Rb88guy (talk) 20:33, 25 February 2009 (UTC)

Your sandbox article looks great, though I would suggest that you make it explicitly clear that (using your notation) , rather than having the reader have to deduce this. In fact why not do away with the subscript entirely and just use ? Btyner (talk) 01:17, 26 February 2009 (UTC)
Thanks, yep, that needs fixin' along with lots of other stuff. I was thinking of making this an article with a title including in some manner "Helmert's distribution of s", following Deming. Incidentally, there is a TON of useful stuff in that book! I don't usually do anything with sampling (as in survey sampling) so I hadn't even looked at it until recently. Rb88guy (talk) 02:22, 26 February 2009 (UTC)
You may be right about it being too negative, but it does say that it is an important theoretical problem, which makes it of interest to a moderately large group of individuals. However, if you can find some useful citations to real applications, then do include them later in the article, with a brief mention in the intro, remembering that it is meant to be short and readable. In your sandbox you are citing the first edition of Johnson & Kotz ... have you seen the second edition, as referenced in this article presently, as it may contain material you haven't seen. Also, regarding your sandbox, you may want to make use of the existing article chi distribution (not presently mentioned), both because it is related and because you may be able to abbreviate some of what you want to say. Melcombe (talk) 10:22, 26 February 2009 (UTC)
I remember I noticed the bias of the standard deviation for small samples a couple of years ago and was quite disappointed not to find anything on Wikipedia. This article, which was added not long after, would have prevented me from wasting my time reinventing the wheel ;-). So I agree with the intro being too negative. This is true for the autocorrelation stuff, but c2 and c4 are relevant as well. I was processing test results with sample sizes ranging from 2 to 8, so the correction factors were significantly different from 1. Also, not to sound too skinflint, but sometimes even a 1-percentage-point margin can represent a lot of money in some industries. -- Ryk V (talk) 00:23, 18 October 2009 (UTC)

Added a table of values for c4

This is one of my first edits, so don't hesitate to modify the table in any way you see fit. I just thought adding it was relevant, because calculating c4 with the main formula is not straightforward (you need to go to Particular values of the Gamma function to get a correct value, and you can't go very far). Adding some external sources from the web could also be a good idea, but I don't know which ones are acceptable. -- Ryk V (talk) 00:28, 18 October 2009 (UTC)
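Since the exact Gamma-function values run out quickly, the table entries can also be generated numerically from the main formula c4(n) = sqrt(2/(n−1))·Γ(n/2)/Γ((n−1)/2). A sketch (the function name and the particular n values printed are my own choices, not from the table):

```python
import math

def c4(n):
    """Bias-correction factor: E[s] = c4(n) * sigma for an i.i.d. normal
    sample of size n. Working with lgamma (log of the Gamma function)
    keeps this stable even for large n, where Gamma itself overflows."""
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2))

for n in (2, 3, 5, 10, 30, 100):
    print(f"n = {n:3d}   c4 = {c4(n):.6f}")
```

The small-n entries can be cross-checked against the closed forms, e.g. c4(2) = sqrt(2/π) ≈ 0.797885 and c4(3) = sqrt(π)/2 ≈ 0.886227, and c4(n) approaches 1 from below as n grows.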