User:R. fiend/How many articles does Wikipedia really have?
While the official count is somewhere around 750,000, having looked through quite a few of these articles, I've noticed that a number of them would be hard to call "articles". Some are a sentence long, some are disambiguation pages, some are more of a chart or a list than an article (not to say they aren't necessarily useful, but the term "article" might be a bit of a stretch). So I decided to explore what these 750,000 articles contain. To aid me in this I have used the random article option. Now, it's been a while since I've taken a class in probability and the like, but it is my understanding that, if truly random, a pretty small sample can yield pretty accurate results. This is assuming the random article button is truly random (and I think it is). I have categorized the results into several categories:
- Full articles: though I cannot judge their overall quality very well, these are your basic decent or very good articles in wikipedia, with a few exceptions that are covered elsewhere
- Stubs: sort of tricky, as I disregarded if they were labelled as such, but I basically went with anything that was a brief or medium sized paragraph. But any of 2 sentences or less are:
- Sub-stubs: Very brief articles (they really cannot be considered articles at all)
- Disambiguation pages: we all know what they are
- Articles completely from other sources: these are basically the same as "full articles", but I thought they deserved special categorization, as they are not the creation of wikipedia or wikipedians. In its own category are
- Rambot articles: We all know these; quite a vexed part of wikipedia. Those that have substantial human written sections (not sure what qualifies as "substantial" yet) will be included in full articles.
- Articles that are mostly a chart: Articles that have little text that is not a template or chart. This includes album stubs that are a sentence (artist and year) plus a template and a tracklist, as well as often a member lineup. Those which have at least a few sentences beyond this can be considered full articles. A similar but separate group is:
- Lists: We all know these; another highly debated facet.
- Articles needing cleanup: for this I included pretty substantial cleanup. It's a bit of a judgment call, I know, but these are ones that can probably be generally called "pretty bad" and hence do not take their place (yet) with the full articles, but are not
- Deletables: either speedy or AfD. Again a bit of a judgment call here, particularly since I have a slightly higher bar than many. But this project is partially for the purpose of people who are skeptical of wikipedia, those who are used to and prefer more standard encyclopedias, and their standards may well be considerably above mine.
- Dubious articles: Ones that may be disguised adverts, dubious, potentially unverifiable claims, hoaxes, copyvios, and the like, but ones I'm not going to AfD.
- Redirects: these are articles that should have been redirects to more complete articles, and I have since redirected them. This is sort of like a delete, as it removes one article from the total number. Redirects, I believe, are counted in neither the article count nor the random article feature, so they are disregarded when they exist already.
Note this is more about types of articles than article quality. In my notes I originally had "full articles" down as "good articles", but decided to change it. Since I won't be reading anything but the stubs and substubs in their entirety, I cannot judge the accuracy or quality of them (even if I read them completely, I still couldn't do so without substantial research, which would obviously slow this project down immensely). There are also articles that, while full and complete, I wouldn't necessarily call "good". I didn't want to have to make a separate category for fancruft or anything, as that would obviously be severely subjective. At some later point, I think I may undertake another such project in which I categorize random full articles by subject, paying particular attention to the amount of fiction. One criticism (not entirely unfounded) that Wikipedia often garners is its level of detail in TV shows, sci-fi, anime, etc., potentially at the expense of other subjects (though whether this is really at any expense is clearly debatable).
The number of random articles I explored was 500. This should give me a pretty accurate reading. I'll have to look into calculating the margin of error (if any math folks want to help, I'd be grateful).
If anyone knows of a similar project by another wikipedian, I'd be very curious to see it. Any feedback on this I'd love to hear. Leave it on my talk page.
Following the collection of data resulting from 500 random article searches, I have come up with the following statistics.
|Type of article||Estimated % of WP||Estimated # in WP|
|Articles from public sources||1.4%||10,808|
|Requiring substantial cleanup||2%||15,440|
|Should be redirected||0.6%||4,632|
The estimated number of each group in Wikipedia is based on an approximate number of 772,000 as being the number of stated articles in Wikipedia on Oct 14, 2005 (rounded up slightly).
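The scaling from sample to whole encyclopedia is simple enough to sketch in a few lines. This is only an illustration: the function name is made up, and the hit counts (7, 10 and 3 out of 500) are inferred from the percentages in the table rather than taken from my notes.

```python
# Scale a share observed in the random sample up to all of Wikipedia.
SAMPLE_SIZE = 500          # random articles examined (from the text)
TOTAL_ARTICLES = 772_000   # approximate stated article count, Oct 14, 2005


def estimated_count(sample_hits: int) -> int:
    """Estimated number of articles of one type, given its hits in the sample."""
    return round(sample_hits * TOTAL_ARTICLES / SAMPLE_SIZE)


# Hit counts are inferred from the table's percentages (e.g. 1.4% of 500 = 7).
rows = [
    ("Articles from public sources", 7),
    ("Requiring substantial cleanup", 10),
    ("Should be redirected", 3),
]

for label, hits in rows:
    share = hits / SAMPLE_SIZE
    print(f"{label}: {share:.1%} of sample -> about {estimated_count(hits):,} articles")
```

Each article in the sample of 500 stands in for 1,544 articles in the full count, which is why the estimates move in such large steps.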
So how many articles does Wikipedia have?
Well, that depends on what one wants to call an "article." Obviously the articles in the "Full article" category are not by any means the only articles. Even though I was quite generous in listing articles labeled "stub" as full articles, I think even those particularly short ones can still meet the technical definition of an article. Likewise Rambot and other articles from external sources are real articles, though I did feel they deserved separate mention.
Substubs and disambiguation pages I really cannot call articles; they lack too much real information to qualify. Charts and lists are questionable. I feel a certain amount of prose is required to technically be an article. This is not to say they are not useful (nor does it say all of them are). The deletables, dubious articles, and redirected ones should probably not be counted either. The cleanups, I suppose, could go either way.
So, using the classifications above (those in the 1st paragraph are articles, the others are not or questionably so), it seems Wikipedia has an estimated 545,032 articles. A pretty large number, really. Though again, beyond the terrible deletables and the clear cleanups, this says nothing about quality. Of course, there is the consistent accusation that half of Wikipedia's articles are on Star Trek. This is not true, obviously, but we certainly do have an extreme number of articles on very minor elements of fiction ("fancruft" some call it). There are various schools of thought on this. Some say they damage the reputation of Wikipedia, making it more of a specialty than a general purpose encyclopedia, and one that is difficult to take seriously in a more academic way; others say that extreme coverage in these areas is not at the expense of more traditionally "encyclopedic" topics. It's hard to say. If we look at Britannica, according to our article, they have 120,000 articles. If we are competing with them, it is not in the area of Star Trek and the like; we have them beat there. No, the competition is on their turf. By the calculations above, we have about 4.5 times the articles they do. Even if half of these can be dismissed as "fancruft" (and though I didn't keep track, I'm sure it was not nearly that many in my sample) we still have twice as many articles as they do on a very wide range of pretty unquestionably encyclopedic topics. The issue of quality, however, is not one that can easily be answered.
How accurate is this?
Overall, it appears to be pretty accurate. Pollsters have been using pretty small random samplings to predict elections (reflecting millions, not thousands), and have shown a pretty high degree of accuracy. I think the only specific number I have for comparison is the count from Kate's tool for Rambot. It says Rambot has edited 53,826 articles. My calculations have 54,040. Pretty close. Pretty damn close, now that I really look at it. Off by merely 0.4%, I believe.
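For anyone who wants to check the "off by 0.4%" figure, here is a minimal sketch of the arithmetic. The function name is just for illustration, and I am treating the tool's count as the reference value.

```python
def relative_difference(estimate: float, reference: float) -> float:
    """Absolute difference between the two values, as a fraction of the reference."""
    return abs(estimate - reference) / reference


sample_estimate = 54_040   # my estimate of Rambot articles from the sample
tool_count = 53_826        # count reported by the edit-counting tool

diff = relative_difference(sample_estimate, tool_count)
print(f"Difference: {sample_estimate - tool_count} articles ({diff:.1%})")
```

One agreeing number does not prove the whole sample is this accurate, of course; it is a sanity check on a single category.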
Further statistical analysis
Here's the table in the article, but with the half-widths of a 95% confidence interval. In English, that means that there's a 95% chance that the true number of a given type is within "+/- #" of "Estimated # in WP". For example, there's a 95% chance that the true number of full articles is within 33,600 of 342,768. Put another way, the correct number of full articles is between 309,200 and 376,400. Similarly, the percentage of full articles is somewhere between 40% and 48.8%. This method of analysis only works if the sample percentage is greater than about 1%, so I didn't do the calculations for two of the article types, since the numbers wouldn't be particularly valuable.
|Type of article||Estimated % of WP||Estimated # in WP||+/- %||+/- #|
|Articles from public sources||1.4%||10,808||1.03%||7950|
|Requiring substantial cleanup||2%||15,440||1.23%||9500|
|Should be redirected||0.6%||4,632||—||—|
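The "+/-" columns can be reproduced with the usual normal-approximation formula for a proportion, 1.96 × sqrt(p(1 − p)/n). The page does not state which formula was used, so treat that as an assumption; it does match the figures in the table to rounding. The 44.4% share for full articles is inferred from 342,768 out of 772,000, and the names in the sketch are illustrative.

```python
import math

SAMPLE_SIZE = 500
TOTAL_ARTICLES = 772_000
Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval


def half_width(share: float, n: int = SAMPLE_SIZE) -> float:
    """Half-width of a normal-approximation 95% interval for a sample proportion."""
    return Z_95 * math.sqrt(share * (1 - share) / n)


# Shares taken from (or inferred from) the table above.
rows = [
    ("Full articles", 0.444),
    ("Articles from public sources", 0.014),
    ("Requiring substantial cleanup", 0.02),
]

for label, share in rows:
    hw = half_width(share)
    print(
        f"{label}: {share:.1%} +/- {hw:.2%}, "
        f"{share * TOTAL_ARTICLES:,.0f} +/- {hw * TOTAL_ARTICLES:,.0f}"
    )
```

The approximation gets rough when only a handful of sampled articles fall in a category, which is the reason the smallest categories are left blank.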
I have included comparisons to shorter studies of 100 articles by two other users. User:Pureblade did one based on the same organization as mine, while User:Carnildo did one for a different reason and with different categorization; see User:Carnildo/The 100.
|Type of article||R. fiend's % (based on 500)||Pureblade's % (based on 100)||Carnildo's % (based on 100)|
|Articles from public sources||1.4%||3%||2%|
|Requiring substantial cleanup||2%||8%||1%|
|Should be redirected||0.6%||0%||2%|
*Carnildo has stated that he counted charts as substubs. Combining those two in my sample yields 17.4%, reasonably close to Carnildo's 16%.
My results, compared with Pureblade's, are quite similar, particularly in the first 2 categories. The biggest discrepancy comes from the cleanups, where subjectivity is somewhat high. Carnildo's results differ from mine greatly in some categories, and (from a quick look at a few of the articles in his study) I think much of it comes down to how one defines a stub. As I stated, I didn't take a stub label (or lack thereof) into account. Much of what he calls a stub I call a short article (still in the "Full articles" section). This does not mean there isn't plenty of room for expansion. I also had to go through his comments and the like to make adjustments in categories (some he had as substubs became deletables, as they have been deleted, and so on).
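Sample size matters here too: a study of 100 articles has a much wider margin than one of 500. As a rough sketch, assuming the same normal-approximation interval as in the analysis above (an assumption, and a crude one at counts this small), here are the 95% ranges for the cleanup category in each study. The function name is hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% interval under the normal approximation


def interval(share: float, n: int) -> tuple[float, float]:
    """Approximate 95% range for a proportion, clamped to the 0..1 range."""
    hw = Z_95 * math.sqrt(share * (1 - share) / n)
    return max(0.0, share - hw), min(1.0, share + hw)


# "Requiring substantial cleanup" share and sample size in each study.
studies = {
    "R. fiend": (0.02, 500),
    "Pureblade": (0.08, 100),
    "Carnildo": (0.01, 100),
}

for name, (share, n) in studies.items():
    low, high = interval(share, n)
    print(f"{name}: {share:.0%} of {n} -> roughly {low:.1%} to {high:.1%}")
```

The ranges for the 100-article studies are wide enough that part of the gap may be plain sampling noise, on top of the subjectivity in deciding what needs cleanup.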
If anyone else wishes to do a similar test to compare, please go ahead and leave the results on this talk page. I urge you to use categories and criteria similar to mine for easiest comparison (stubs are a paragraph or less, substubs are 2 or fewer sentences, separate rambot articles and EB 1911s, anything AFDed or otherwise deleted goes in the deletables category, etc).