Talk:Sample (statistics)

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale and as High-importance on the importance scale.

WikiProject Mathematics (Rated C-class, High-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating: C-Class, High-importance. Field: Probability and statistics

I think there is enough information here so I can remove the stub tag. There are links to the many other pages that talk about the specifics of samples and sampling. Steve Simon 16:31, 27 September 2006 (UTC)

I've changed the introduction and highlighted some topics to be expanded on (e.g., stratified sample) at a later date. Steve Simon 02:53, 27 September 2006 (UTC)

Mathematical description

The "Mathematical description" and the "Empirical description" describe different things; this is not just a matter of different ways of looking at the same thing.

If X is a random variable, then the n-tuple (X_1, ..., X_n), in which the X_i are n i.i.d. clones of X, is itself a (multivariate) random variable. This is what is called "a sample" in the section Mathematical description. A single experimentally observed outcome of this multivariate random variable is what is called "a simple random sample" in the section Empirical description.

I know that the use of the term "sample" to describe outcomes obtained by a sampling process is quite common. Is the "mathematical" meaning in which a sample is a multivariate random variable also common? If so, shouldn't we make it clear that we have two different meanings here (and give citations of sources for the "mathematical" meaning)? If not, the wording of the section on "Mathematical description" should be changed.

(Can someone explain why the first section is called "Empirical description"? It is not as if we observed lots of samples in the field and are now trying to describe, based on our observations, what we saw.)

 --Lambiam 19:19, 13 December 2007 (UTC)
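The distinction raised above, between the sample as a random vector and a single observed outcome of it, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the fair-die distribution, the function name `draw_sample`, and the fixed seed are not taken from the article.

```python
import random

def draw_sample(n, rng):
    """Return one realization of (X_1, ..., X_n): n i.i.d. rolls of a fair die."""
    return tuple(rng.randint(1, 6) for _ in range(n))

# Fixed seed only so that the run is reproducible (an arbitrary choice).
rng = random.Random(0)

# The "mathematical" sample corresponds to the random mechanism (draw_sample);
# the "empirical" sample is one observed outcome, which can be written down.
observed = draw_sample(5, rng)
another = draw_sample(5, rng)

print(observed)
print(another)
```

Each call yields a tuple of numbers that could be printed in a table, whereas the mechanism that produces them is what carries the probability distribution.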

Yes, you can sample from a multivariate distribution. I suppose one can interpret a sample of length n (= n iid random variables) as a sample of length one from an n-dimensional variable if that is convenient for some purpose, but it is not exactly standard. The definition I gave is in virtually any book on mathematical statistics or probability. I have added a citation to a book I happened to have on my desk. The last paragraph in the section is a restatement of what a random variable is: a mapping that assigns values to possible outcomes.
That is not what I mean. The sample itself, whose outcomes are vectors of which the length is the sample size, has a (boring) multivariate distribution. Google books won't let me read Wilks' definition, but I trust that is OK, which answers my first question.  --Lambiam 23:40, 13 December 2007 (UTC)
In probability and statistics, the word "empirical" means "from observation"; see empirical probability. For the concept of a sample in the sense of mathematical statistics, it is fundamental that one can, in principle, repeat the sampling experiment and that the results would come from some probability distribution. So even if you do not have a lot of samples, you could, again in principle. Jmath666 (talk) 21:27, 13 December 2007 (UTC)
So it is not the description that is empirical, as the section title suggests, but the context of use: this describes the meaning of the term in empirical statistics.  --Lambiam 23:40, 13 December 2007 (UTC)
Actually, I think these are two descriptions of the same thing, one from the practitioner's view and one from the theoretical view. Otherwise one could not do anything analytically or mathematically meaningful with the practitioner's samples, such as hypothesis testing; only descriptive statistics would remain. Jmath666 (talk) 01:02, 14 December 2007 (UTC)
I think it is desirable to be able to make a distinction between a random variable, and an experimentally obtained (observed) outcome. The latter, assuming it is in numerical form, one can write on a piece of paper, put it in a table in an article, and so on. One cannot put the random variable itself in the article. Currently, a collection of such observations is also called a sample. If you look, for example, at our Mode (statistics) article, it defines the mode of a sample, where this clearly applies to a collection of observations, and not to a set of i.i.d. r.v.'s. Is it abuse of the term "sample" when authors write: "we obtained a sample of 183 specimens from ..."? Is there a better term for "a data set that has been collected by a sampling process"?  --Lambiam 12:50, 14 December 2007 (UTC)
Perhaps this second part should be moved to Random sample, which looks pretty sad now. Sample (probability) should also link there. Then there would be no need for the "empirical description" heading. Jmath666 (talk) 22:15, 13 December 2007 (UTC)
I support that, and, moreover, I think "simple random sample" should be merged into and then redirect to "random sample"; the different meanings of "random sample" should be cleared up better (I think a "simple random sample" is also a "random sample" in the empirical sense, even though the text suggests it is not). Is there a counterpart of the mathematical treatment for non-simple sampling methods (e.g. with no replacement)?  --Lambiam 23:40, 13 December 2007 (UTC)
I had never heard "simple random sample" before. It may be used incorrectly here, or it may be something that someone made up in some source.
To the second question: if you sample from a population without replacement, what you get are not values of independent random variables (since something was not replaced, the distributions are not identical and they are not independent), so it is not a random sample. But if you draw n such sequences of length m, each starting from the beginning, that would be a random sample of size n from a multivariate distribution of dimension m... Oh, the confusion when people use the same word for different things and it gets perpetuated in undergraduate textbooks... Jmath666 (talk) 01:02, 14 December 2007 (UTC)
Googling ["simple random sample" OR "simple random sampling"] gives more than a few hits; the terminology is sufficiently widespread that we cannot ignore it. See e.g. here. This page discusses many forms of "random sampling".  --Lambiam 12:50, 14 December 2007 (UTC)
Indeed, a simple random sample is something other than a random sample (which is defined as iid); it is not a special case of it, as the name might suggest... The article simple random sample also says so, correctly. The draws are not iid, only approximately so. I am sure statistics textbooks have plenty of information about such things. One authoritative source is Kendall's Advanced Theory of Statistics. I'll look at Wilks when I get to that office again. This whole collection of sample-related articles would benefit from some coordination by someone who knows what he is doing (which in statistics would not be me, at least not yet). Jmath666 (talk) 19:50, 14 December 2007 (UTC)
...a random sample of length n (where n may be any of 1,2,3,...) is a set of n independent, identically distributed (iid) random variables with distribution F.
This definition seems excessively narrow. It excludes sampling without replacement from a finite population (violates the 'independence' condition) and it would also exclude things like stratified sampling schemes. I don't have sources handy - does anybody have a better, citable definition that could be substituted? --GenericBob (talk) 10:00, 9 December 2011 (UTC)
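The point that sampling without replacement from a finite population breaks the independence condition can be checked exactly on a toy case. The sketch below is only an illustration under stated assumptions: the four-element population, the equal-probability draws, and the helper name `exact_covariance` are all hypothetical. It enumerates every ordered pair of draws and computes the covariance between the first and second draw with exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product

population = [1, 2, 3, 4]  # hypothetical finite population

def exact_covariance(pairs):
    """Exact Cov(X1, X2) when every ordered pair in `pairs` is equally likely."""
    pairs = list(pairs)
    k = len(pairs)
    mean1 = Fraction(sum(a for a, _ in pairs), k)
    mean2 = Fraction(sum(b for _, b in pairs), k)
    mean12 = Fraction(sum(a * b for a, b in pairs), k)
    return mean12 - mean1 * mean2

# With replacement: all 16 ordered pairs, the two draws are independent.
with_replacement = exact_covariance(product(population, repeat=2))

# Without replacement: the 12 ordered pairs of distinct units.
without_replacement = exact_covariance(permutations(population, 2))

# How often each value turns up as the second draw without replacement.
second_draw_counts = Counter(b for _, b in permutations(population, 2))

print(with_replacement)     # 0
print(without_replacement)  # -5/12
```

A nonzero covariance rules out independence, so the two draws without replacement cannot satisfy the iid definition quoted above. In this equal-probability setup the second draw, taken on its own, is still uniform over the population (each value occurs in 3 of the 12 pairs), so it is the independence part of "iid" that fails here.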

First sentence flawed?

The first sentence of the article is flawed. With the exception of a population that is all-inclusive, the population will always be a subset of another population (that is, a sample), and so on -- making the definition circular.

Another problem is with the operationalization of "taking" a sample of some population -- this is never discussed in the definition.

Without some kind of rigorous methodological approach, samples drawn from the same population could have wildly different statistical characteristics. This shows that sampling methodology is an essential statistical element -- but one which the definition neglects.

In other words, there is no way to characterize a discrete sample without also discussing how that sample was created. But this is just what the first sentence attempts to do.

Gsmcghee (talk) 00:10, 17 December 2009 (UTC)Gsmcghee (talk) 00:14, 17 December 2009 (UTC)Gsmcghee (talk) 00:16, 17 December 2009 (UTC)
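The claim that the selection method shapes the statistical characteristics of a sample can be seen in a small deterministic sketch. The population (the integers 1 to 100) and both selection rules are assumptions chosen purely for illustration.

```python
from statistics import mean

# Hypothetical population: the integers 1..100, stored in increasing order.
population = list(range(1, 101))

# Two different rules for "taking" ten units from the same population.
first_ten = population[:10]      # take whatever comes first: 1, 2, ..., 10
every_tenth = population[9::10]  # take every tenth unit: 10, 20, ..., 100

population_mean = mean(population)    # equals 50.5
first_ten_mean = mean(first_ten)      # equals 5.5
every_tenth_mean = mean(every_tenth)  # equals 55

print(population_mean, first_ten_mean, every_tenth_mean)
```

Both subsets have ten members of the same population, yet their means differ by a factor of ten, which is why a subset alone says little until the rule that produced it is known.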

Every set is a subset of itself anyway. (talk) 11:25, 5 July 2011 (UTC)

Infinite / finite samples?

There is no mention that a sample must be finite. But I can't imagine what an infinite sample is. It would be nice if someone could write something about this, preferably with an example.

Velle (talk) 08:56, 9 December 2011 (UTC)

It's possible to define a method for selecting an infinite sample. For instance, suppose your population is {x_i}, where i = 1, 2, 3, ... without end. Then you could toss a coin: on heads you select all the even-indexed members, on tails you select all the odd-indexed members.
But I can't think of a situation where it'd be useful. The usual point of a statistical sample is as a pragmatic way to derive information about the population/superpopulation, which means you're doing some calculation based on the sampled units. That's not going to be possible if the sample is infinite. If you have enough information to determine the statistical properties of the infinite sample, and you have reason to believe the sample is representative of the whole, then you probably already know the statistical properties of the population.
There is certainly plenty of discussion of how various properties, estimators etc. behave as the sample size tends to infinity, but "tends to infinity" and "infinity" are not the same concept. --GenericBob (talk) 09:59, 9 December 2011 (UTC)
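The coin-toss selection rule described above can be written down as a lazy sequence, which also shows why nothing can be computed from the whole infinite sample at once. This is a minimal sketch; the function name `infinite_sample` and the seed are hypothetical, and only the indices i of the selected members x_i are generated.

```python
import random
from itertools import count, islice

def infinite_sample(rng):
    """Toss a coin once; select every even index on heads, every odd index on tails."""
    heads = rng.random() < 0.5
    start = 2 if heads else 1
    # count() never terminates; it yields start, start + 2, start + 4, ...
    return count(start, 2)

rng = random.Random(2011)  # arbitrary seed, only for reproducibility
selected = infinite_sample(rng)

# Only a finite prefix can ever be inspected or used in a calculation.
first_five = list(islice(selected, 5))
print(first_five)
```

Any statistic that is actually computed uses a finite prefix like `first_five`, which matches the remark that results about sample size tending to infinity concern limits of finite samples.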