# Talk:Sample size determination

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale.
This article has been rated as High-importance on the importance scale.
WikiProject Mathematics (Rated C-class, High-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.

## Rule of Thumb in an article about statistics!!

Under "Estimating proportions", a paragraph begins "The rule of thumb for (a maximum or 'conservative')"

I think Deming is rolling over in his grave! 70.22.59.134 (talk) 20:33, 8 April 2008 (UTC)

That section definitely needs more attention. --Riddell  21:40, 4 December 2009 (UTC)

## Maximum Error

I see someone has changed the inequalities around again. The reason they were set the way they were is that if we want a certain maximum error epsilon, then we require the half-width of the CI to be at most epsilon, i.e. B <= epsilon. We therefore obtain a minimum required sample size, rather than the maximum that the inequalities in their current state would suggest. I can see no interpretation that would lead us to set these inequalities the other way around, particularly one that would give us a maximum value of n rather than a minimum! HyDeckar 00:48, 22 March 2007 (UTC)

Look, I'm really quite confident about this - if anyone feels like responding, I'm more than willing to figure out what is right, but I'll change it for now (but come back at me here rather than start a possible 'edit war') HyDeckar 14:22, 23 March 2007 (UTC)

This is, I'm afraid, a fundamental error of interpretation on your part. You want to be able to say that the sampling error from your procedure is no larger than B, i.e. < B. The sample size is still the minimum required to assure that. As n increases, B and, consequently, the error decrease. -MBHiii 12:41, 30 March 2007 (UTC)

I still disagree - I am claiming that the sampling error is no larger than epsilon (as *epsilon*, not B, is the required maximum error; B is simply a working variable which is the half-width of the CI). Therefore, we derive an inequality which must be satisfied by n for this to occur. It is clear that this inequality _must_ be of the form n >= some value, as small n leads to larger error. HyDeckar 12:32, 6 April 2007 (UTC)
Now I see what you're doing. The confusion for me was in using epsilon for a fixed quantity. I contend epsilon's usually used for a random variable denoting error. Calling B a variable is also non-standard (B is for "Bound"), and it's a simple function of parameters, not variables. By calling B a variable, you require the concoction of a new fixed boundary, your epsilon. The fewer the steps, the better. I still contend you should use epsilon for sampling error, a variable, and B for the bound on that error. I see what you want: a small error pushes up the required n. My formulation turns the emphasis around: allowing large error pushes down the required n. Both are correct, but I contend mine is more standard and focuses on limiting the random error (epsilon) using a value of B that's determined before starting. MBHiii 20:59, 6 April 2007 (UTC)
Ok, that seems fair, but I contend that it is hardly clear to someone unfamiliar with the subject what exactly is going on. Therefore, I've changed these inequalities to approximations, as that at least is unambiguous in meaning. HyDeckar 12:44, 7 April 2007 (UTC)
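
For anyone following this thread later, the two readings can be reconciled in one line. This is only a sketch, assuming the simplest case of estimating a mean with a known standard deviation σ and a normal critical value z (both are assumptions for illustration, not quoted from the article text under discussion):

```latex
B = \frac{z\,\sigma}{\sqrt{n}} \le \varepsilon
\quad\Longleftrightarrow\quad
n \ge \left(\frac{z\,\sigma}{\varepsilon}\right)^{2}
```

Here ε is the fixed maximum error in HyDeckar's notation. In MBHiii's notation the fixed bound is B itself and the condition reads n ≥ (zσ/B)². Either way the result is a lower bound on n: tightening the allowed error raises the minimum sample size, and loosening it lowers that minimum.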

## Question

I am unsure about the statement

Note, if the mean is to be estimated using P parameters that must first be estimated themselves from the same sample, then sample size should be n+P.

If we mean to increase the sample size by P so as to maintain enough degrees of freedom, then we would arguably need to consider the use of degrees of freedom throughout, i.e. t distributions etc. Also, such a situation would typically involve covariates, which may allow us greater accuracy than the CI given here. Given all of this, I'm just not sure that this comment is a useful one to have here.

An article on sample size is a good place to introduce the concept of degrees of freedom. Suggest:

Note, if the mean is to be estimated using P parameters that must first be estimated themselves from the same sample, then to preserve sufficient "degrees of freedom", sample size should be at least n+P.

Looks good HyDeckar 00:48, 22 March 2007 (UTC)

There seems to be some information missing in the Required Sample Sizes for Hypothesis testing section - the formulae seem to be missing, unless I'm reading it incorrectly or it isn't displayed on my screen. The proof doesn't seem to be appearing! Vickie —Preceding unsigned comment added by 163.156.240.17 (talk) 10:11, 13 February 2008 (UTC)

## Old Stuff

Can someone please stop User:Lgallindo from vandalizing this page! I see he has problems with someone else vandalizing a page on sampling, but this has nothing to do with that. -- Mbhiii 18:23, 23 October 2006 (UTC)

## New Version

New version now online HyDeckar 16:36, 20 March 2007 (UTC)

I have written a tentative new version of this page; it is available for comment on my user page. Unless I get a huge negative response, I'll load it up in a couple of days. HyDeckar 15:10, 19 March 2007 (UTC)

It seems like a big improvement to me. I say go for it. -- Avenue 03:06, 20 March 2007 (UTC)

What you wrote is good with improved notation and generally, BUT you DELETED the useful "rule of thumb" and its derivation - a bad move on your part. I'm restoring it. --63.98.135.196 19:47, 20 March 2007 (UTC)

Sorry about that, accidental 'friendly fire' - I've rehashed the "rule of thumb" (not under that name) to line up with the rest of the article stylewise. HyDeckar 08:28, 21 March 2007 (UTC)

## Clarification

Please do not remove the confusing tag until the article is much clearer. Piuro 19:12, 24 October 2006 (UTC)

Hi, I hope this works. The main ideas I'm trying to answer are (1) just what does "sample size" mean, (2) what are its effects, and (3) how can you estimate it? I think the first paragraph answers (1), the first and second answer (2), and the third and fourth answer (3). --Mbhiii 16:23, 25 October 2006 (UTC)

Hello Khatru2, thanks for bolding "Sample size", but you should know I wrote every word of Sample size and take responsibility for it. That's what my (or anyone else's) signature means, so please leave it. It provides a quick link to my contact information for further, detailed or ancillary discussion. --Mbhiii 12:40, 26 October 2006 (UTC)

Whoops, as per Wikipedia:Ownership of articles, no signature. --Mbhiii 17:16, 26 October 2006 (UTC)

## Notation

There is a misconception that the sample size is denoted N, but all serious sampling texts (e.g. Cochran, or Särndal et al.) use n for the sample size and N for the population size. I accordingly changed N to n throughout the article. However, 68.221.1.30 (talk) changed it back to N. I see this as a very retrograde step, which is likely to confuse our readers. It was only slightly mitigated by the addition of an end note that N usually denotes the population size. Another note was added to say that N is being used for legibility.

Why should we use the wrong notation? Legibility is not a good reason; lower case italics are the symbols used most often in mathematics, so anyone who will understand the formulae should be very used to reading this sort of text. A note saying that we are doing this intentionally does not solve the problem; it just makes us look foolish rather than ignorant. -- Avenue 01:06, 4 February 2007 (UTC)

Fixed. Hopefully, it'll stay that way. — Xaonon (Talk) 18:33, 19 February 2007 (UTC)
I'll leave it. My eyes are old, so legibility is a top priority. It'd be nice if someone made the small n larger. --mbhiii 15:16, 14 March 2007 (UTC)
Did you know that you can increase the size of displayed text in most browsers by pressing the Control and "+" keys simultaneously? (See the links at Computer_accessibility#Web_browser_accessibility_features for a lot more information.) While I agree it's important to keep accessibility in mind, I don't think changing the size of individual elements such as the small n is a good idea. -- Avenue 21:58, 14 March 2007 (UTC)

## Commercial Software

I have deleted the links to commercial software. Wikipedia is not a place to promote commercial items. Rlsheehan (talk) 13:36, 21 July 2009 (UTC)

## example needed

An example in the section titled "Required sample sizes for hypothesis tests" would be nice for dullards like me. --Preceding unsigned comment added by Bjarthur (talk · contribs) 18:03, 21 October 2009 (UTC)
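
One possible worked example for that section, offered as a sketch rather than as the article's own formulation: the sample size a two-sided one-sample z-test needs in order to detect a shift `delta` in a mean when the standard deviation `sigma` is known. The function name and the numbers (sigma = 15, delta = 5, 5% significance, 80% power) are assumptions chosen purely for illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_z_test(sigma, delta, alpha=0.05, power=0.80):
    """Smallest n for a two-sided one-sample z-test (known sigma) to detect
    a true shift of size delta with the requested power.

    Uses the usual normal-approximation formula
        n >= ((z_{1 - alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the desired power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


# Hypothetical inputs: sigma = 15, smallest shift worth detecting = 5.
n_required = sample_size_for_z_test(sigma=15, delta=5)
print(n_required)
```

Halving `delta` roughly quadruples the required n, and asking for more power also raises it, which is the behaviour an example in the article would probably want to highlight.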

Definitely agree a prominent example would be good. What about something easy to grasp - say, measuring average population height by sampling, for 10 people, 100 people, 1000 people. 203.217.150.69 (talk) 05:52, 13 May 2010 (UTC)
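
A rough sketch of what such a height example could look like, assuming (hypothetically) that adult heights have a standard deviation of 10 cm and that a normal-theory 95% interval is appropriate. The constant `SIGMA_CM` and the function name are made up for this illustration.

```python
import math
from statistics import NormalDist

SIGMA_CM = 10.0  # assumed population standard deviation of height, in cm


def ci_half_width(sigma, n, confidence=0.95):
    """Half-width of a normal-theory confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)


half_widths = {n: ci_half_width(SIGMA_CM, n) for n in (10, 100, 1000)}
for n, hw in half_widths.items():
    print(f"n = {n:>4}: mean height pinned down to about ±{hw:.2f} cm")
```

Under these assumptions, each tenfold increase in the sample narrows the interval only by a factor of √10 (about 3.2), so going from 10 to 1000 people improves precision tenfold, not a hundredfold.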
The current text seems to start off with the assumption that p = 0.5. Shouldn't the article note that if it can be anticipated a priori (from experience) that the proportion is certainly less than some number (such as 0.01, for defective components), then this would affect the minimum sample size computation (prediction)?
By the way, is the cited NIST reference correct in adding the z_beta term to the z_alpha/2 term???
—DIV (138.194.12.32 (talk) 02:26, 2 July 2010 (UTC))
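
To illustrate the first point, here is a minimal sketch of how an a priori upper bound on the proportion changes the calculation, assuming the usual normal-approximation interval for a proportion. The function name, the margin of 0.005 and the bound of 0.01 are illustrative assumptions, and the normal approximation itself becomes unreliable for very small proportions.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, p_max=0.5, confidence=0.95):
    """Smallest n so that the normal-approximation CI for a proportion has
    half-width at most `margin`, assuming the true proportion is <= p_max.

    p * (1 - p) increases on [0, 0.5], so p_max is the worst case there.
    """
    if not 0 < p_max <= 0.5:
        raise ValueError("this sketch only handles an upper bound in (0, 0.5]")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p_max * (1 - p_max) / margin ** 2)


conservative = sample_size_for_proportion(margin=0.005)            # p = 0.5
with_prior = sample_size_for_proportion(margin=0.005, p_max=0.01)  # p <= 0.01
print(conservative, with_prior)
```

With these hypothetical inputs, the conservative p = 0.5 calculation asks for roughly 25 times as many observations as the one that uses the prior bound, so the effect is far from negligible.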

## bit of a coatrack, no?

This article, which should cover the fairly simple notion of a sample size, seems to be branching out into statistical sampling more generally. Do we really need a separate article for this? I'd suggest we merge and redirect this into statistics or statistical sampling. --Ludwigs2 21:25, 9 November 2010 (UTC)

## Clarity

What does the following sentence, currently found in the 5th paragraph of the article, mean?

Typically B is generated in such a way that the range of values of that are within a distance B of the estimated parameter value will be a 95% confidence interval, at least in an approximate sense.

Can the intended meaning be better expressed? Kind regards, —Encephalon 04:17, 28 November 2010 (UTC)

## Ambiguous Audience

This is a concern for all WP articles of a technical nature. A good rule of thumb should be to start out with the general, easily used, and comprehensible. Follow up with the more detailed, exacting, and technical. -Trift (talk) 18:32, 17 May 2011 (UTC)

## Stratified sample size

The formulae here bear little resemblance to anything in the cited pages by Kish. Some are clearly wrong. Can anyone clear this up? Melcombe (talk) 15:11, 7 July 2011 (UTC)

## Estimation of means

The sentence "For example, if we are interested in estimating the amount by which a drug lowers a subject's blood pressure with a confidence interval that is six units wide, and we know that the standard deviation of blood pressure in the population is 15, then the required sample size is 100."

doesn't make sense! Standard deviation of blood pressure is 15??? 15 what? Apples? Oranges? Torr? mbar? Hectopascal? A value without a unit is useless! -- Preceding unsigned comment added by Fspiegel (talk · contribs) 12:45, 22 December 2011 (UTC)
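
For reference, the quoted figure can be checked with a short sketch. It assumes the interval has the form mean ± z·σ/√n, and that the width and the standard deviation are expressed in the same unit (mmHg would be the natural guess, but that is an assumption; the quoted sentence does not say). The function name is hypothetical.

```python
import math


def sample_size_for_ci_width(sigma, width, z):
    """Smallest n for which an interval mean ± z*sigma/sqrt(n)
    is no wider than `width` (full width, not half-width)."""
    return math.ceil((2 * z * sigma / width) ** 2)


# sigma = 15 and width = 6 come from the quoted sentence; z is the assumption.
n_rounded_z = sample_size_for_ci_width(sigma=15, width=6, z=2)     # z rounded to 2
n_exact_z = sample_size_for_ci_width(sigma=15, width=6, z=1.96)    # 95% value
print(n_rounded_z, n_exact_z)
```

The quoted answer of 100 corresponds to rounding z to 2, while z = 1.96 gives 97. Only the ratio σ/width enters the formula, so the result is the same in any unit as long as both quantities share it, though the article should still state which unit is meant.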

## do I have any idea what I'm talking about?

I'm hardly an expert, but the striking thing I remember from statistics is how small samples have to be to achieve significance. I came here to read up on it and saw the opposite.

In some situations, the increase in accuracy for larger sample sizes is minimal, or even non-existent. This can result from the presence of systematic errors or strong dependence in the data, or if the data follow a heavy-tailed distribution.

In many situations, the increase in accuracy for a larger sample size is minimal, close to non-existent, because there is strong independence in the data, and the base sample size from which you are measuring your increase is already large enough for a process with a normal distribution given the number of degrees of freedom. Or am I remembering this wrong? 68.174.97.122 (talk) 01:14, 1 December 2012 (UTC)