Talk:Selection bias: Difference between revisions

Revision as of 08:49, 9 September 2008

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on Wikipedia's content assessment scale.
This article has not yet received a rating on the importance scale.

Removing Selection Bias?

Might be handy if this page included some information about removing selection bias - but I don't have the knowledge. Cached 11:42, 5 December 2005 (UTC)

I just started a new section on this. Tobacman 20:08, 11 January 2006 (UTC)

"Selection bias" vs "sampling bias"

I believe that much of this article as written so far pertains to sampling bias, not selection bias, and hence the article clamors for substantial revision. Properly, selection bias refers to bias in the estimation of the causal effect of a treatment because of heterogeneous selection into the group that receives the treatment. Sampling bias refers to biases in the collection of data. (See also the article on bias (statistics).) Tobacman 20:08, 11 January 2006 (UTC)

The stub on censored regression models also helps make this distinction, and given that this is one strategy used to overcome these kinds of biases, it should probably be linked into this article. Albertivan 13:31, 30 March 2007 (UTC)

Self-selection Bias in Online Polling

I am interested in testing the validity of online polling, under the assumption that the participants in a poll are people who view the website for other reasons and who take part in the poll because they happen to discover it there.

I am especially interested in determining the validity of such a sample when it is affected by self-selection bias, particularly when a subset of participants with a predetermined answer to the poll is self-recruited - that is, arrives by means other than discovering the poll while visiting the website, such as mutual e-mail notification.

To what degree can such participants bias the results of the poll, in comparison to their relative participation?

Any thoughts on approach?
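One rough way to frame the question above is as a back-of-the-envelope calculation. The symbols n, m, p and f are assumptions introduced here for illustration; they do not come from any actual poll. Suppose n respondents find the poll on their own and a share p of them give a particular answer, while m further respondents are recruited and all give that same answer. The observed share is then:

```latex
\hat{p} \;=\; \frac{np + m}{n + m} \;=\; p + f\,(1 - p),
\qquad f = \frac{m}{n + m}
```

Under these simplifying assumptions the bias is f(1 - p): it grows in proportion to the recruited share f of all respondents, and it is larger when the answer is less popular among ordinary visitors. For example, with p = 0.5 and f = 0.2 the poll would report 0.6 rather than 0.5.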

Nothing apart from appreciating

Heckman is smart, yes. Why doesn't he make his model clear for public consumption? It's too mathematical.

Stephen mpumwire, Fortfortal

From "invalid" to "wrong".

I am changing one word in the first paragraph again. Conclusions drawn from statistical analysis are inductive, not deductive. As such, inductive arguments exhibit a valence of strength. The language of logic dictates that deductive arguments are either valid or invalid, and either sound or unsound. Since only deductive arguments have validity, it does not make sense to refer to statistical inductive arguments as valid. It makes sense to refer to statistical arguments that suffer from a selection bias as weak. Weak arguments are inductive in nature and are not likely to preserve truth. Kanodin 09:35, 9 July 2007 (UTC)

I put invalid back. I'll explain later, when I have time.--Boffob 11:14, 9 July 2007 (UTC)

OK, I'm back. The issue here is not the inductive nature of statistical arguments; it's the validity of the underlying assumptions of the statistical analysis. Statistical arguments are inductive in nature but they should tend to the truth as the sample size increases (particularly, if one could have the entire population as a sample and were able to observe all quantities of interest, then the truth would be known with probability 1). Ignoring selection bias means relying on plainly wrong assumptions, namely that the data are a random sample from the target population (when they are not). The statistics and probabilities computed under such invalid assumptions will hence not merely be weak; they will not represent reality, in the sense that you will not obtain consistent estimators for the quantities wanted (those that relate to the target population). The big issue with selection bias, as with any sampling bias, is that merely increasing sample size without correcting the sampling technique will not alleviate inferential problems.--Boffob 14:45, 9 July 2007 (UTC)
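The point about sample size can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population is standard normal (true mean 0), and the response probabilities 0.8 and 0.2 are hypothetical numbers chosen for illustration, as is the function name `simulate`.

```python
import random


def simulate(n, rng):
    """Return (random-sample mean, self-selected-sample mean) for n units each.

    Assumed setup (illustrative only): outcomes are Normal(0, 1), so the
    true population mean is 0. In the self-selected sample, units with a
    positive outcome respond with probability 0.8, the rest with 0.2.
    """
    # Simple random sample: every unit is equally likely to be observed.
    random_sample = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Self-selected sample: the chance of responding depends on the outcome.
    selected = []
    while len(selected) < n:
        y = rng.gauss(0.0, 1.0)
        p_respond = 0.8 if y > 0 else 0.2
        if rng.random() < p_respond:
            selected.append(y)

    return sum(random_sample) / n, sum(selected) / n


rng = random.Random(0)
for n in (100, 10_000, 100_000):
    random_mean, selected_mean = simulate(n, rng)
    print(f"n={n:>7}  random sample: {random_mean:+.3f}  "
          f"self-selected: {selected_mean:+.3f}")
```

With these assumed numbers, the random-sample mean should settle near 0 as n grows, while the self-selected mean settles near 0.48 instead: a larger sample only makes the wrong answer more precise.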

I see your point that you want to show that statistics that fail to account for selection bias are likely to be wrong, but it is not enough to call an argument invalid simply because it relies on plainly wrong assumptions. My central point is that using 'validity' to describe an inductive argument is wrong. "Weak" captures the spirit of what we are saying. It is only appropriate to use validity to describe deductive arguments that meet the criteria of validity. If we were to use validity to describe an argument, we would be saying that it is impossible for the premises to be true and the conclusion false. Statistical arguments, by definition, make no such assertion--oftentimes statisticians provide a 'p' value, which describes the likelihood that the results are a statistical artifact. In fact, all statistical arguments that sample less than the entire population are invalid, not merely the defective ones. Most statistical arguments are inductive. The inductive analogue for validity is strength. A strong argument is such that if its premises are true, then the conclusion is probably true. Weak arguments are the negation: If an argument is weak, it is not the case that if the premises are true, then the conclusion is probably true. In the verbiage of logic, I see no problem with describing defective inductive arguments as weak, but I can understand why you would want more strength behind the evaluation in the article. Perhaps it would be acceptable to word it like this: "Statistical arguments that do not take selection bias into account are inductively weak, and their conclusions are likely to be wrong because the tainted sample does not represent the population." I think it would be fine to use stronger language, but using 'invalid' in a place where it does not belong hurts the credibility of the article. Kanodin 19:18, 11 July 2007 (UTC)

It seems our quarrel is about whose technical lingo we should use. I put "invalid" instead of "weak", under a layman's definition (after all, some of the premises are known to be false in the presence of selection bias), because it is concise and conveys the importance of the possible distortions caused by selection bias better than "may be weak". Your argument is that, strictly using the logic definition of "validity", this is not appropriate because statistical arguments boil down to the expression of probability of a conclusion given a set of premises in one form or another (you give the example of p-values which are the probability of observing the data, or more extreme data, given that the null hypothesis is true). I have to say that reading the article in terms of logic verbiage appears to be déformation professionnelle (my own déformation would read "may be weak" as "may converge weakly"). While I appreciate the compromise you offer, I don't see why one should burden the article through strict adherence to the vocabulary of formal logic.--Boffob 23:13, 11 July 2007 (UTC)

There is a good possibility that someone may read this article and assume that invalid is used from a logician's standpoint. While I am averse to using invalid in the way the article prints it, it is also not essential that the article uses the word 'weak'. So, maybe another compromise would be to use neither 'weak' nor 'invalid'. The sentence could then be simplified as: "Statistical arguments that do not take selection bias into account often produce wrong conclusions." Kanodin 23:37, 12 July 2007 (UTC)

Now we are getting somewhere. How about just replacing "invalid" with "wrong"? Here, the expression "statistical arguments" is, again, sort of a logician's vocabulary. More often, in natural and social sciences, we're talking about the statistical analysis of the results of an experiment or data collected from a study; the term "argument" is not in common use as far as I know. And then the qualifier "often" may not be entirely correct either, as the amount of bias induced may be relatively small in many cases (possibly more often than not, who knows), but we still want to express the fact that selection bias could completely skew the conclusions to the opposite of what they should be (which "may be weak" does not convey, unless one is a logician).--Boffob 01:18, 13 July 2007 (UTC)

ETA: I made the change. I think it reads OK, though that doesn't leave out the possibility of a more elegant rephrase.--Boffob 01:23, 13 July 2007 (UTC)

Looks good. Kanodin 08:38, 13 July 2007 (UTC)

@@ Line 24: / Line 24: @@
 Stephen mpumwire, Fortfortal
-:I agree that the difference between sample and selection bias is not clearly stated. A trusted reference when describing this difference is greatly appreciated. -- steffen (09/09/08)
 == From "invalid" to "wrong". ==