Great essay. I'd been thinking about running this kind of large random-sample test of articles myself - but I'm also happy that someone else did the work and I can just enjoy the result :)
I agree with your carefully considered conclusions. Sure, we have a tremendous coverage of popular culture - but we've got everything else down pretty well too. The same can be seen from Wikipedia:WikiProject_Missing_encyclopedia_articles - we're well on our way towards covering every subject traditional encyclopedias have got.
The quality issue is, of course, a much more difficult question - but the coverage issue is hardly even debatable anymore.
Cheers! - Haukur Þorgeirsson 17:23, 11 November 2005 (UTC)
There was also a bot at one time that used a machine-learning algorithm to detect articles that were labelled stubs that were not actually likely to be stubs: I don't know what all it considered, but it included things like article length and content. Unfortunately I can't remember who ran it. It may not have even been a bot; it may have just been a program to run on a database dump and identify articles for a human to look at and fix. Anyway, I thought you might find it interesting. If you happen to know or figure out who did this, please let me know so I can remember. Jdavidb (talk • contribs) 17:27, 15 November 2005 (UTC)
Here's the table from the article, with the half-widths of a 95% confidence interval added. In plain English, that means we can be 95% confident that the true number of a given type is within "+/- #" of "Estimated # in WP". For example, we can be 95% confident that the true number of full articles is within 33,600 of 342,768. Put another way, the true number of full articles is between about 309,200 and 376,400. Similarly, the percentage of full articles is somewhere between 40% and 48.8%. This method of analysis only works if the sample percentage is greater than about 1%, so I didn't do the calculations for two of the article types, since the numbers wouldn't be particularly valuable.
|Type of article||Estimated % of WP||Estimated # in WP||+/- %||+/- #|
|Articles from public sources||1.4%||10,808||1.03%||7,950|
|Requiring substantial cleanup||2%||15,440||1.23%||9,500|
|Should be redirected||0.6%||4,632||—||—|
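For anyone who wants to check the "+/-" columns, here is a rough Python sketch of the usual normal-approximation half-width for a proportion, 1.96 * sqrt(p(1-p)/n). Two inputs are my assumptions, not figures stated in this thread: a sample size of 500 and a total of 772,000 articles, both back-calculated from the table (for example 10,808 / 1.4%). The function and constant names are made up for illustration, and this may not be exactly how the numbers above were produced.

```python
import math

Z_95 = 1.96               # two-sided 95% quantile of the standard normal
SAMPLE_SIZE = 500         # ASSUMPTION: inferred from the table's half-widths
TOTAL_ARTICLES = 772_000  # ASSUMPTION: inferred from 10,808 / 1.4%


def half_width(p, n, z=Z_95):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def approximation_ok(p, n, min_expected=5):
    """Common rule of thumb: expect at least ~5 sampled items of the given type."""
    return n * p >= min_expected


rows = [
    ("Full articles", 342_768 / TOTAL_ARTICLES),  # share derived from the thread's figures
    ("Articles from public sources", 0.014),
    ("Requiring substantial cleanup", 0.02),
    ("Should be redirected", 0.006),
]

for label, p in rows:
    if not approximation_ok(p, SAMPLE_SIZE):
        # Too few expected hits in the sample for the approximation to be useful.
        print(f"{label}: {p:.1%} (interval not computed)")
        continue
    hw = half_width(p, SAMPLE_SIZE)
    print(f"{label}: {p:.1%} +/- {hw:.2%}, "
          f"about {p * TOTAL_ARTICLES:,.0f} +/- {hw * TOTAL_ARTICLES:,.0f} articles")
```

With a sample of 500, the rule of thumb in `approximation_ok` puts the cutoff at 1%, which lines up with the threshold mentioned above; a different sample size would move that cutoff.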