Wikipedia:Peer review/Normal distribution/archive1

Normal distribution

This is an old and very comprehensive article on an important topic that could benefit from the input of the larger community. For example, does it become clear quickly what the normal distribution is and why it is important? How interesting is it for a general audience? Would it be better with more examples? Or with less discussion of its applications to IQ testing? Does it need illustrations of definite integrals familiar from textbooks, showing e.g. the area under a standard normal pdf between −2 and +2? Thanks, everyone. --MarkSweep 05:55, 22 Mar 2005 (UTC)

Things I don't like/think needs improvement/comments/whatever/etc.:

  • I did some copyediting work to squelch some concerns I had and to save me from putting them here.
  • The Normal distribution#Occurence section could be trimmed down.
  • No mention of complex normal distribution
  • No mention of AWGN
  • No mention/comparison to its common role in white noise (combine with bit about AWGN)
  • Perhaps chi-squared derivation
  • Seems as though a few more graphs could be useful

That's all I got at the moment. Cburnett 08:47, 22 Mar 2005 (UTC)

First thing that occurs to me reading the article is that the section about photon counting seems a little bit unclear. 'Light intensity from a single source varies with time' - why? Is a Bose-Einstein distribution the same as a Poisson distribution? Why does thermal emission behave differently to laser emission? I think an expansion and clarification of this section would be helpful. Worldtraveller 12:32, 22 Mar 2005 (UTC)
Light intensity varies because of thermal fluctuations, at the very least.
The Bose-Einstein distribution is not Poisson, it is exponential.
Laser light is a coherent phenomenon, and lasers are far away from thermal equilibrium.
Those are good questions, I just don't know that a discussion of these details would not be distracting. — Miguel 08:31, 2005 Apr 18 (UTC)
There's a pair of series approximations for the distribution on the mathworld.wolfram.com page. Unless I'm mistaken I don't see them listed. — RJH 18:49, 23 Mar 2005 (UTC)

It becomes clear that the norm dist is a prob dist, but chances are if you know what a prob dist is, you'll know what a norm dist is. Perhaps there is no interest in explaining it to a less-informed general audience (I don't always do it), but as it is now, I'll be surprised if anyone who hasn't taken at least an intro to stats course will understand anything, even the most basic sections. There is no explanation about the shape of the curve, or how the scores are distributed around the mean, or anything along those lines that could help someone understand. What are the axes in your graphs, especially the probability? For instance, if the IQ standard curve has a mean of 100 and a stddev of 15, does that mean a newborn has a 50% chance of having (or developing, when he becomes adult) an IQ between 90 and 110, or does it just mean that 50% of people who've had their IQ tested scored between 90 and 110? Considering that Bell curve redirects to here, there should be something simpler, because it's not uncommon to hear that term in early high school. I think the IQ section is long, but it explains the topic very well. Unless someone wants to spin off a new article, I wouldn't touch it. Re: the length of appendages in biological organisms, what is the sample? Is it from the same individual or across a population? The lengths of my fingernails or my 5 o'clock shadow don't seem like they would fit a normal distribution. The blood pressure example is a bit weird. The previous paragraph describes a lognormal distribution, then the BP is normal, and back again to lognormal. If I didn't know any better, I'd assume there was a misprint and that the BP was lognormal. The figures should be named and referred to by their number (i.e. Fig. 3). Things like "plot to the right/left/above/below" are really bad. Abbreviations used should be defined somewhere, such as 'pdf' and 'cdf'. It doesn't take a Harvard education to figure out what they are but it should still be done.
Hope this helps. --jag123 10:45, 24 Mar 2005 (UTC)

We are not talking about the distribution of the nail lengths of your 10 fingers, but the lengths of the nails of the same finger across a population. — Miguel 08:31, 2005 Apr 18 (UTC)
Thanks, jag123. Good points. If you feel strongly about the IQ discussion, could you visit Talk:Normal distribution#IQ discussion and comment on it? --MarkSweep 18:36, 24 Mar 2005 (UTC)

There is a common but nevertheless serious error in the estimate of the variance. When the variance of a population needs to be estimated using only a sample of the entire population then one should not estimate the variance as

 {1 \over n}\sum_{i=1}^n(x_i-\overline{x})^2

since this equation underestimates the true variance. An unbiased estimate for the variance is

 {1 \over {n-1}}\sum_{i=1}^n(x_i-\overline{x})^2

The proof that is given of the former equation is wrong. You cannot set \mu=\overline{x} in the derivation since it is only an estimate. I haven't been able to find an alternative proof on the web and my statistics book is at a different location. See http://www.mathpages.com/home/kmath497.htm http://mathworld.wolfram.com/Variance.html http://www.pitt.edu/~wpilib/statfaq/95varqn.html for additional info. Jan van Male 17:50, 24 Mar 2005 (UTC)

There is a common but nevertheless serious error in the estimate of the variance. When the variance of a population needs to be estimated using only a sample of the entire population then one should not estimate the variance as
 {1 \over n}\sum_{i=1}^n(x_i-\overline{x})^2
since this equation underestimates the true variance.
That is nonsense. It is true that on average it underestimates the population variance, but to call it a serious error is nonsense: sometimes biased estimators perform better -- indeed in some cases far better -- than unbiased ones. This one in particular has a smaller mean square error than the unbiased estimator has. Your statement quoted above is the "common but serious error". Michael Hardy 22:50, 7 February 2006 (UTC)
It's not really an error, because the discussion is clearly about maximum likelihood estimation, and the first estimate is the maximum likelihood estimate. However, the connection to sample variance and unbiased estimates could be made clearer. --MarkSweep 18:36, 24 Mar 2005 (UTC)
Although biased, the maximum-likelihood estimator is consistent and it has smaller variance than the unbiased version, so it is sometimes preferred. — Miguel 08:31, 2005 Apr 18 (UTC)
I did not know of the branch of statistics called maximum likelihood. Now that I have read about it, I can see my mistake. Presenting a biased estimate rather than an unbiased one does seem counterintuitive to me. Jan van Male 19:49, 24 Mar 2005 (UTC)
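
The two positions in this exchange can be checked numerically. The following is a minimal sketch, not anyone's posted analysis: it assumes normally distributed data, and the sample size, number of trials, seed and function names are arbitrary choices made for illustration. It averages both estimators over many small samples and also records their mean squared error.

```python
import random

def variance_estimates(sample):
    """Return (estimate with divisor n, estimate with divisor n - 1)."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)

def simulate(n=5, trials=20000, sigma=1.0, seed=0):
    """Average each estimator and its squared error over many normal samples."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_var = sigma ** 2
    sums = {"ml": 0.0, "unbiased": 0.0}
    sq_errs = {"ml": 0.0, "unbiased": 0.0}
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        ml, unbiased = variance_estimates(sample)
        for name, est in (("ml", ml), ("unbiased", unbiased)):
            sums[name] += est
            sq_errs[name] += (est - true_var) ** 2
    means = {k: v / trials for k, v in sums.items()}
    mses = {k: v / trials for k, v in sq_errs.items()}
    return means, mses

means, mses = simulate()
print("average estimate:  ", means)
print("mean squared error:", mses)
```

With a true variance of 1 and n = 5, the divisor-n estimator should average near (n-1)/n = 0.8 while the divisor-(n-1) estimator averages near 1, yet the divisor-n estimator should show the smaller mean squared error, which is the sense in which both comments above are right.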

The article does not presently answer one key question: Why do so many phenomena result in normal distributions? Why this particular equation? The closest that the article appears to come to addressing this is, "While the underlying causes of these phenomena are often unknown, the use of the normal distribution can be theoretically justified in situations where many small effects are added together into a score or variable that can be observed." Does anyone know?--J-Wiki 13:14, 26 Mar 2005 (UTC)

It's the part about "many ... effects are added together" that sometimes justifies the normal distribution. If there is reason to believe that many factors contribute to a complex phenomenon and those factors are mostly independent and their cumulative effect is the sum of the individual effects (as opposed to their product, or some other relationship), then we would expect to see an empirical distribution that resembles a normal distribution. You're absolutely right that there should be a better and more detailed explanation in the article. --MarkSweep 19:13, 26 Mar 2005 (UTC)
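
The "many small effects added together" argument can be illustrated with a short simulation. This is only a sketch under stated assumptions: each effect is modelled as an independent uniform(-1, 1) draw, and the count of 30 effects, the sample size and the seed are arbitrary. It compares how much of the data falls within one standard deviation of the mean for a single effect and for a sum of many.

```python
import random
from statistics import mean, pstdev

def summed_effects(n_effects, rng):
    """One observation: the sum of n_effects independent uniform(-1, 1) effects."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_effects))

def fraction_within_one_sd(values):
    """Share of the values lying within one standard deviation of their mean."""
    m, s = mean(values), pstdev(values)
    return sum(abs(v - m) <= s for v in values) / len(values)

rng = random.Random(1)  # fixed seed so the run is repeatable
single = [summed_effects(1, rng) for _ in range(20000)]
many = [summed_effects(30, rng) for _ in range(20000)]

# A normal distribution has about 68.3% of its mass within one SD of the mean;
# a single uniform effect has about 57.7% there.
frac_single = fraction_within_one_sd(single)
frac_many = fraction_within_one_sd(many)
print(frac_single, frac_many)
```

Replacing the sum with a product of positive factors would instead push the logarithm of the result toward normality, which is the lognormal case raised later in this review.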

Body size distributions

I looked for an article on this topic for a quick review of applicability to body size distributions (ht, wt, BMI, etc.; see the CDC growth curves) and found this article disappointing as an overview of the issue. For example, I was looking for the rough conversions of SD to percentiles and found no info on this fairly widespread and common practical application of this concept. Second, there is an unclear suggestion that biological measurements usually do not follow a normal distribution, but many aspects of medical practice use this concept. An explanation of the discrepancy should be included in that section, or perhaps this part of the article is simply wrong-- is this an example of the distribution not meeting the Platonic ideal of a statistician yet being so close that it is useful for clinical work? I found much better and clearer examples of what I wanted with a quick Google search elsewhere. alteripse 01:18, 2 Apr 2005 (UTC)
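
The rough SD-to-percentile conversions asked for here follow directly from the standard normal cumulative distribution function. A minimal sketch, assuming the measurement really is normally distributed (which the replies below dispute for some body measurements); the helper names are made up for illustration, while `NormalDist` is the standard-library class available from Python 3.8.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def z_to_percentile(z):
    """Percentile rank of a value lying z standard deviations from the mean."""
    return 100.0 * standard_normal.cdf(z)

def percentile_to_z(percentile):
    """Number of standard deviations from the mean for a given percentile rank."""
    return standard_normal.inv_cdf(percentile / 100.0)

for z in (-2, -1, 0, 1, 2):
    print(f"{z:+d} SD -> percentile {z_to_percentile(z):5.1f}")

# Area under the standard normal pdf between -2 and +2 SD.
area = standard_normal.cdf(2) - standard_normal.cdf(-2)
print(f"area between -2 and +2 SD: {area:.4f}")
```

The same two functions answer the question in the opening request about the area under a standard normal pdf between −2 and +2.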

About biological specimens, the classic reference is
Huxley, Julian: Problems of Relative Growth (1932)
The overwhelming biological evidence supports the hypothesis that growth processes proceed by multiplicative increments, and that therefore body size should follow a lognormal rather than normal distribution. The size of plants and animals is approximately lognormal.
Also, if you assume height is normally distributed, then weight will not be (normality is not preserved by powers) and conversely. They can both be lognormally distributed, though. — Miguel 08:31, 2005 Apr 18 (UTC)
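
The remark about powers can be written out as a short worked derivation. It is a sketch that assumes, purely for illustration, that weight W scales as a power of height H, W = c H^k; the constants c and k and the symbols m, s, Z are not taken from the discussion.

```latex
% Assumption for illustration: weight scales as a power of height, W = c H^k.
\ln H \sim \mathcal{N}(\mu, \sigma^2)
\;\Longrightarrow\;
\ln W = \ln c + k \ln H \sim \mathcal{N}(\ln c + k\mu,\; k^2\sigma^2)

% By contrast, if H = m + sZ with Z standard normal, then already
H^2 = m^2 + 2ms\,Z + s^2 Z^2
% contains the skewed chi-squared term Z^2, so H^2 is not normal.
```

So lognormality survives taking powers because a power becomes a linear map on the log scale, whereas normality does not.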

Sorry I am dense (or statistically naive) but I don't understand your explanation at all, even enough to argue about it. Is it possible to provide a clearer explanation for the article? I suspect something is wrong with your argument but don't have the statistical knowledge to recognize the problem. alteripse 14:40, 18 Apr 2005 (UTC)

Honestly, if you can't state your question I can't answer it, but somehow I don't think statistics is the problem - I think the problem is geometrical. All I have to say is, check out the book I mention from a library, read the introduction and look at the diagrams. You might also want to google the title and/or author: there are lots of references to it. There is also a wealth of modern paleontological work in which the logarithm of sizes of bones is taken before any further analysis. That is, the working assumption is lognormality. — Miguel 17:44, 2005 Apr 18 (UTC)

All right, my question could be made clearer, but don't be condescending-- if you don't understand what I am describing it may be your lacuna, not mine. Here are some examples.

  • First, this [1] is a copy of a growth chart that shows ht expressed in standard deviations and percentiles, implying that hts at a given age approximate a normal distribution. If you look at the wt distribution it is clearly skewed and it would not seem to be valid to interconvert percentiles and SDs. Do SDs have any validity if the distribution is not "normal"?
Percentile ranks are always more meaningful than number of standard deviations from the mean (i.e. z values), which is just a change of measurement scale. The standard deviation itself is always meaningful.
The height data are normal, but not so the weight data. Even if a series of data is lognormal, it will be very close to normal if the SD is sufficiently small relative to the mean. Notice that the 97th percentile for weight is twice the 3rd percentile, but that in the case of height the 97th percentile is just about 16% larger than the 3rd. That is a huge difference as far as the lognormal is concerned.
I'll do a goodness-of-fit analysis for a lognormal distribution on both sets of data and report back. — Miguel 14:44, 2005 Apr 19 (UTC)
The endpoint of the height data fits a lognormal with log-standard-deviation between 0.0389 and 0.0408. What I did was 1) visually estimate the values from the graph to within 0.5 cm; 2) divide all values by the median height so the result has by construction median equal to 1 (log-mean equal to 0); 3) compare the resulting ratios (with errors) with the quantiles of the lognormal using R. When I figure out how to wiki-code tables I'll post the details. — Miguel 19:11, 2005 Apr 21 (UTC)
The weight distribution does not seem to fit a lognormal, though. — Miguel 19:43, 2005 Apr 21 (UTC)
The Cauchy distribution is an important distribution without a mean or standard deviation. There is a whole theory of large deviations for so-called fat-tailed distributions. Note that when a Cauchy distribution is involved, it is wrong to estimate a standard deviation from a sample and then discard any outliers. It is even wrong to estimate a sample mean, for that matter. — Miguel 08:39, 2005 Apr 21 (UTC)
  • Second, this statistics website [2] provides several examples of biological variables in a normal distribution, including ht, suggesting at least some statisticians think many measurement variables do follow a normal distribution.
Yes, many people who use statistics (who are, by the way, mostly not statisticians) think so. On the other hand, if you look at the article's talk page you'll see that we tried and failed to find a single statistics textbook where the statement that biological variables are normal is backed by a reference that we could check. Most people contributing to the article actually have training in probability and statistics, too, but also know full well that sometimes statistical methods based on normality are used because of mathematical convenience (or because they are available off-the-shelf) more than anything else. — Miguel 14:44, 2005 Apr 19 (UTC)
  • That is more or less an exercise and is basically making an incorrect assumption for the simplicity of the assignment and getting across a point about normal distributions. - Taxman 14:05, Apr 19, 2005 (UTC)
  • Third, this [3] is the first example I could quickly find to illustrate common use of the assumption that hts can be expressed as z-scores. Look at the methods section.
I am not disputing common use of the assumption that hts can be expressed as z-scores, I am disputing the soundness of the assumption. The article you reference uses z scores for all of height, weight, and BMI. We know from the growth chart data you reference that weight data cannot be expressed as z scores without loss of information. That is a flaw (quite likely inconsequential, I would admit) in their method. — Miguel 14:44, 2005 Apr 19 (UTC)
  • Fourth, this website [4] is an example of explaining the relationship of percentiles, z-scores, and SDs that is quite useful in many disciplines but is missing from our article.
the website says
It can be shown that many characteristics of interest, such as IQ, height and weight of people, etc., have a normal population distribution.
well, the data you provided actually show that weight is not normally distributed, and IQ hardly counts as evidence because it is normally distributed by construction (see the discussion in the article where it is made clear that the normality of IQ is the result of taking raw test data which are not normally distributed, calculating percentiles, then z values, then translating the z values into a normal with mean 100 and SD 15). The website does not bother to give a reference where this has been shown, and it is exactly that kind of unsubstantiated statement that is often found in statistics textbooks. — Miguel 14:44, 2005 Apr 19 (UTC)
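
The construction described for IQ (raw scores, then percentiles, then z values, then a scale with mean 100 and SD 15) can be sketched as follows. This is an illustration only: the raw scores are made-up skewed data, ties are ignored, and the mid-rank percentile formula is one common convention rather than the procedure of any particular test.

```python
import random
from statistics import NormalDist, mean, pstdev

def normalize_scores(raw_scores, target_mean=100.0, target_sd=15.0):
    """Map raw scores onto a normal scale through their percentile ranks."""
    n = len(raw_scores)
    order = sorted(range(n), key=lambda i: raw_scores[i])
    std_normal = NormalDist()
    scaled = [0.0] * n
    for rank, i in enumerate(order):
        percentile = (rank + 0.5) / n        # mid-rank, strictly inside (0, 1)
        z = std_normal.inv_cdf(percentile)   # percentile -> z value
        scaled[i] = target_mean + target_sd * z
    return scaled

rng = random.Random(2)                              # fixed seed, repeatable run
raw = [rng.expovariate(1.0) for _ in range(5000)]   # hypothetical skewed raw scores
iq_like = normalize_scores(raw)
print(round(mean(iq_like), 1), round(pstdev(iq_like), 1))
```

Whatever the shape of the raw scores, the output is close to normal with the chosen mean and SD, which is why such a scale cannot serve as evidence that the underlying trait is normally distributed.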

So, most of the world uses SDs, z-scores, and percentiles to express height distribution and I am having difficulty reconciling this with your assertion that ht and many other biological variables do not follow a normal distribution. Again, are you simply claiming that the distribution is close but not exactly normal, (like an astronomer arguing that the earth is not spherical, just really close)? If so, I think you are nitpicking or being deliberately obtuse. I usually assume if I can't explain something to someone it is likely because I don't understand it thoroughly enough myself. Can you explain your assertions to me? Do you still not understand this issue? To me, this is an enormous hole in this article, which I suspect is largely unintelligible to 99.9% of college-educated adults. I think it should be explicitly addressed in our article. alteripse 12:53, 19 Apr 2005 (UTC)

I don't think he was being condescending at all, just simply saying he can't answer a question if he doesn't know what you are asking. In any case, you can use all of those tools you refer to without the distribution being normal. Just because the normal distribution is common and simple doesn't mean everything has to follow it. SDs and percentiles all apply to lognormal and other distributions, and z-scores can too, to an extent. Read the article on the lognormal distribution or the Poisson distribution, and you'll see those have very different shapes, but still have many of the same attributes such as SDs and percentiles. This article is about the normal distribution; it should not cover everything about those topics which are general to all probability distribution functions. You see below I agree this article is too technical, but this specifically is not a problem with the article, but with your understanding of statistics. - Taxman 13:52, Apr 19, 2005 (UTC)
Most of the world also uses linear approximations to nonlinear phenomena, often for no better reason than that we have no idea how to solve nonlinear equations in general.
The earth is not spherical for many practical purposes nowadays, given the accuracy of modern navigation systems. GPS even uses general relativity corrections. As far as my daily life is concerned, the Earth might as well be flat. That has nothing to do with what I know to be the case, and I would be nuts to demand that a manufacturer of street maps use a method that allows for sphericity. But that is not the point.
I may be nitpicking in the case of the height of girls. However, I am not in the case of many other biological variables usually claimed to be normal. On the other hand, there is a substantial difference between the normal and lognormal models of height, and that is the treatment of growth rates. On the second page of the growth charts there is a chart of growth rate in cm per year. If the height is lognormal, the appropriate measure of growth is the relative rate: cm of growth per year, per cm of height. Now, this is exactly what the book by Huxley is all about: growth rates. And he takes logarithms. On page 11, he says
In passing, it is worth noting that the logarithmic method of plotting brings into true relief an important point that is entirely obscured by the usual method of plotting on the absolute scale—namely that growth is concerned essentially with the multiplication of living substance
Replace "logarithmic method of plotting" with "a lognormal model" and "plotting on the absolute scale" with "a normal model" and that is basically what our article is trying to say. As usual, the original says it much better even if (or probably because) it is over 70 years old.
Now, if the height of girls is the result of a long and slow process of growth, and if the growth rate is affected by a multitude of genetic and environmental factors which we model as random, and if the growth is assumed to be multiplicative as biologists know it should be and Huxley supports with data, then we must expect height to be lognormal and the default method of analysis should not be mean-standard deviation. The mean would be replaced by the geometric mean and the standard deviation... well, that's the problem, that there is no simpler way to describe what needs to be done to the data in that case other than to say "the exponential of the standard deviation of the logarithm of height", and I would forgive doctors for not wanting to do that when they measure the height of girls. — Miguel 14:44, 2005 Apr 19 (UTC)
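
The lognormal summaries described above are short to compute. The sketch below assumes a lognormal model throughout; the function names and the sample heights are hypothetical, and the percentile ratios (97th about 16% above the 3rd for height, about twice the 3rd for weight) are the approximate figures quoted earlier in this thread, not re-read from the chart.

```python
import math
from statistics import NormalDist, mean, stdev

def geometric_summary(values):
    """Geometric mean and geometric SD: exp of the mean and of the SD of the logs."""
    logs = [math.log(v) for v in values]
    return math.exp(mean(logs)), math.exp(stdev(logs))

def log_sd_from_percentile_ratio(ratio, upper=0.97, lower=0.03):
    """Log-SD of a lognormal whose upper/lower percentile values have this ratio."""
    z_upper = NormalDist().inv_cdf(upper)
    z_lower = NormalDist().inv_cdf(lower)
    # Under a lognormal, q_p = exp(mu + sigma * z_p), so ratio = exp(sigma * (z_u - z_l)).
    return math.log(ratio) / (z_upper - z_lower)

height_log_sd = log_sd_from_percentile_ratio(1.16)  # 97th about 16% above 3rd
weight_log_sd = log_sd_from_percentile_ratio(2.0)   # 97th about twice the 3rd
print(height_log_sd, weight_log_sd)

heights_cm = [150.0, 156.0, 160.0, 163.0, 166.0, 170.0, 176.0]  # hypothetical sample
geo_mean, geo_sd = geometric_summary(heights_cm)
print(geo_mean, geo_sd)
```

The height figure comes out near 0.039, in line with the 0.0389 to 0.0408 range reported above for the height data. The weight figure is only what a lognormal would imply, and the same comment reports that the weight data do not fit a lognormal.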

Thanks for the above. I won't argue that my statistical expertise is rudimentary on a good day although I took an intro course many years ago and wrote a spreadsheet program to do SD and SEM computations for lab data in the days before VisiCalc and Lotus. The problem with these articles is that they appear to be concise aides de memoire for people who already understand the subject matter, so that they are better suited to a Handbook of Statistics than an encyclopedia. For example, it would be nice if the lognormal article had an illustration of the difference between a normal and a lognormal distribution. It might have saved all these words. I didn't know we had a lognormal article until you pointed it out, but sadly I am still little more knowledgeable after reading it. These articles do serve the purpose of making me wonder if some of the articles I have contributed suffer from the same flaw of being a nice synopsis for those who already know the material but insufficiently clear and explanatory for a reader who doesn't. We might all learn from this example of what an encyclopedia isn't. alteripse 14:39, 19 Apr 2005 (UTC)

Well sure, it is much tougher to write an article that is accessible to someone that doesn't already know the subject. Because of that many articles simply state the facts and features about the subject in a technical way. The issue of course is that people that know the subject well will write it in terms they are used to and work with every day. But we'll get to a great article eventually. Being aware of the issue and having people that can point out where the article is not helpful is very important toward reaching the goal of an effective article that is useful both to someone who does not know the subject and to someone that does. All of the subject will never be fully accessible to someone that does not know the subject because some facets of the topic simply require background knowledge that cannot fit in one article. Example: kurtosis. But I do believe the negative effect of that can be minimized by the method I've outlined below. I will see what I can do. - Taxman 15:38, Apr 19, 2005 (UTC)

Too technical

  • Well there is great material in this article, but it also has a long way to go before it can be an FA. Overview comments: 1) The lead is too short and still too technical. It should ease in a reader who doesn't already know the subject. It is the one part of the article that really needs to focus on that, while the rest of the article can go into more specifics and require a bit more knowledge. One way to get there is to minimize unfamiliar terms and leave details that aren't the most important parts of the subject for later. 2) Part of 1) is that overview sections are deprecated. Anything an overview section would do is what a great lead section should already have done. 3) The whole article would be very difficult for anyone that does not already know the subject. That is a problem given that this topic is not all that hard and any student of the social and physical sciences will have to encounter it. The problem is that way too much technical and difficult material is way too early in the article. I propose progressing the material from simple and apparent to steadily more difficult later in the article. That way the readability is dramatically improved and everybody gets what they need out of it. Specifically, the table under the graphs at the top is way too much and not terribly helpful for many people. That could be moved into the Specifications and/or properties section (which themselves are too technical) and moved down in the page. Now don't get me wrong, detail is good, and we want to be accurate, but that can be done while still giving all (or most of) the needed context and explaining all necessary concepts inline. Following my proposed progression would make that easier. I don't recall the original provenance of the quote, but I believe Hawking references the idea that a single equation in a book would cut out half the readers. We don't need to be that extreme, but keeping it in mind would help a lot. 4) Is that exp() notation really the standard? I've never seen any books that haven't used the e^x type notation. Is it possible to link to exponential function directly in the equation instead of making someone have to see the explanation under it to understand? For example unexplained functions are used in the table at the top of the article, including exp, erf. That's all for now, other detailed things I'll try just to work on myself. - Taxman 21:37, Apr 18, 2005 (UTC)