While the official count is somewhere around 750,000, having looked through quite a few of these articles, I've noticed that a number of them would be hard to call "articles". Some are a sentence long, some are disambiguation pages, some are more of a chart or a list than an article (not to say they aren't necessarily useful, but the term "article" might be a bit of a stretch). So I decided to explore what these 750,000 articles contain. To aid me in this I have used the random article option. Now, it's been a while since I've taken a class in probability and the like, but it is my understanding that, if the sample is truly random, a pretty small one can yield pretty accurate results. This is assuming the random article button is truly random (and I think it is). I have categorized the results into several categories:
- Full articles: Although I cannot judge their overall quality very well, these are your basic decent or very good articles in Wikipedia, with a few exceptions that are covered elsewhere
- Stubs: Sort of tricky, as I disregarded whether they were labelled as such; I basically went with anything that was a brief or medium-sized paragraph. Anything of two sentences or fewer falls under:
- Sub-stubs: Very brief articles (they really cannot be considered articles at all)
- Disambiguation pages: We all know what these are
- Articles completely from other sources: These are basically the same as "full articles", but I thought they deserved special categorization, as they are not the creation of Wikipedia or Wikipedians. In its own category are
- Rambot articles: We all know these; quite a vexed part of Wikipedia. Those that have substantial human written sections (not sure what qualifies as "substantial" yet) will be included in full articles.
- Articles that are mostly a chart: Articles that have little text that is not a template or chart. This includes album stubs that are a sentence (artist and year) plus a template and a tracklist, and often a member lineup as well. Those with at least a few sentences beyond this can be considered full articles. A similar but separate group is:
- Lists: We all know these; another highly debated facet.
- Articles needing cleanup: For this, I counted only articles needing pretty substantial cleanup. It's a bit of a judgment call, I know, but these are ones that can probably be generally called "pretty bad" and hence do not take their place (yet) with the full articles, but are not
- Deletables: Candidates for either speedy or AfD. Again a bit of a judgment call here, particularly since I have a slightly higher bar than many. But this project is partially for the purpose of people who are skeptical of Wikipedia, those who are used to and prefer more standard encyclopedias, and their standards may well be considerably above mine.
- Dubious articles: Ones that may be disguised adverts, contain dubious or potentially unverifiable claims, or be hoaxes, copyvios, and the like, but that I'm not going to AfD.
- Redirects: Articles that should have been redirects to more complete articles, and which I have since redirected. This is sort of like a delete, as it removes one article from the total number. Existing redirects, I believe, are counted neither in the article count nor by the random article feature, so they are disregarded.
Note this is more about types of articles than article quality. In my notes I originally had "full articles" down as "good articles", but decided to change it. Since I won't be reading anything but the stubs and substubs in their entirety, I cannot judge the accuracy or quality of the rest (even if I read them completely, I still couldn't do so without substantial research, which would obviously slow this project down immensely). There are also articles that, while full and complete, I wouldn't necessarily call "good". I didn't want to have to make a separate category for fancruft or anything, as that would obviously be severely subjective. At some later point, I think I may undertake another such project in which I categorize random full articles by subject, paying particular attention to the amount of fiction. One criticism (not entirely unfounded) that Wikipedia often garners is its level of detail on TV shows, sci-fi, anime, etc., potentially at the expense of other subjects (though whether this is really at any expense is clearly debatable).
The number of random articles I explored was 500. This should give me a pretty accurate reading. I'll have to look into calculating the margin of error (if any math folks want to help, I'd be grateful).
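For what it's worth, a rough margin of error can be sketched with the standard normal approximation for a sample proportion. This is only a sketch under assumptions not stated above: that the random article feature samples uniformly, and that a 95% confidence level (z = 1.96) is the one wanted. The function name `margin_of_error` is my own.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    p: observed share in the sample (between 0 and 1)
    n: sample size
    z: critical value; 1.96 is roughly 95% confidence (an assumption)
    """
    return z * math.sqrt(p * (1 - p) / n)

n = 500  # number of random articles sampled
# p = 0.5 gives the widest interval, so it serves as a worst case.
worst = margin_of_error(0.5, n)
print(f"worst-case margin for n={n}: +/-{worst * 100:.1f} percentage points")
```

Under these assumptions, no category's share should be off by much more than about 4.4 percentage points. For very small categories the normal approximation is rough, so those margins should be read loosely.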
If anyone knows of a similar project by another Wikipedian, I'd be very curious to see it. Any feedback on this I'd love to hear. Leave it on my talk page.
Following the collection of data resulting from 500 random article searches, I have come up with the following statistics.
|Type of article||Estimated % of WP||Estimated # in WP|
|Articles from public sources||1.4%||10,808|
|Requiring substantial cleanup||2%||15,440|
|Should be redirected||0.6%||4,632|
The estimated number of each group in Wikipedia is based on an approximate number of 772,000 as being the number of stated articles in Wikipedia on October 14, 2005 (rounded up slightly).
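The scaling from sample to site-wide estimate can be sketched as below. The sample counts (7, 10, and 3 out of 500) are not given in the table; they are back-calculated from the listed percentages, and the names `TOTAL_ARTICLES`, `SAMPLE_SIZE`, and `estimate_count` are illustrative only.

```python
TOTAL_ARTICLES = 772_000  # approximate stated article count on October 14, 2005
SAMPLE_SIZE = 500         # random articles examined

def estimate_count(sample_hits, sample_size=SAMPLE_SIZE, total=TOTAL_ARTICLES):
    """Scale a count seen in the sample up to the whole of Wikipedia."""
    return round(sample_hits * total / sample_size)

# Sample counts inferred from the percentages in the table (an assumption).
rows = [
    ("Articles from public sources", 7),
    ("Requiring substantial cleanup", 10),
    ("Should be redirected", 3),
]
for label, hits in rows:
    share = hits / SAMPLE_SIZE
    print(f"{label}: {share:.1%} -> {estimate_count(hits):,}")
```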
So how many articles does Wikipedia have?
Well, that depends on what one wants to call an "article." Obviously, those in the "Full article" category are not by any means the only articles. Even though I was quite generous in listing articles labelled "stub" as full articles, I think even those particularly short ones can still meet the technical definition of an article. Likewise, Rambot and other articles from external sources are real articles, though I did feel they deserved separate mention.
Substubs and disambiguation pages I really cannot call articles; they lack too much real information to qualify. Charts and lists are questionable. I feel a certain amount of prose is required to technically be an article. This is not to say they are not useful (nor does it say all of them are). The deletables, dubious articles, and redirected ones should probably not be counted either. The cleanups, I suppose, could go either way.
So, using the classifications above (those in the first paragraph are articles, the others are not or questionably so), it seems Wikipedia has an estimated 545,032 articles. A pretty large number, really. Though again, beyond the terrible deletables and the clear cleanups, this says nothing about quality. Of course, there is the consistent accusation that half of Wikipedia's articles are on Star Trek. This is not true, obviously, but we certainly do have an extreme number of articles on very minor elements of fiction ("fancruft" some call it). There are various schools of thought on this. Some say they damage the reputation of Wikipedia, making it more of a specialty than a general-purpose encyclopedia and difficult to take seriously in a more academic way; others say that extreme coverage in these areas is not at the expense of more traditionally "encyclopedic" topics. It's hard to say. If we look at Britannica, according to our article, they have 120,000 articles. If we are competing with them, it is not in the area of Star Trek and the like; we have them beat there. No, the competition is on their turf. By the calculations above, we have about 4.5 times the articles they do. Even if half of these can be dismissed as "fancruft" (and though I didn't keep track, I'm sure it was not nearly that many in my sample) we still have more than twice as many articles as they do on a very wide range of pretty unquestionably encyclopedic topics. The issue of quality, however, is not one that can easily be answered.
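The arithmetic behind the Britannica comparison can be checked with a few lines. All three input figures come from the text; the variable names are just for illustration.

```python
estimated_articles = 545_032   # estimated "real" articles, from the sample
britannica_articles = 120_000  # figure cited from the Wikipedia article on Britannica
stated_total = 772_000         # approximate stated article count

ratio = estimated_articles / britannica_articles
half_ratio = (estimated_articles / 2) / britannica_articles
implied_share = estimated_articles / stated_total

print(f"ratio to Britannica: {ratio:.2f}")
print(f"ratio if half were dismissed: {half_ratio:.2f}")
print(f"share of stated total counted as articles: {implied_share:.1%}")
```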
How accurate is this?
I don't know for sure. I haven't calculated the margin of error (I'm not entirely sure how, actually). I will say that pollsters have been using pretty small random samples to predict elections (reflecting millions of voters, not thousands), and have shown a pretty high degree of accuracy. I think the only specific number I have for comparison is the Kate's tool count for Rambot. It says Rambot has edited 53,826 articles. My calculations give 54,040. Pretty close. Pretty damn close, now that I really look at it. Off by merely 0.4%, I believe.
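As a rough check on the Rambot comparison, the sketch below computes the relative difference between the two counts and a 95% interval around the sample-based estimate. It assumes a uniformly random sample and the usual normal approximation (z = 1.96), neither of which is established above; the variable names are my own.

```python
import math

TOTAL = 772_000       # approximate stated article count
N = 500               # sample size
estimate = 54_040     # Rambot articles estimated from the sample
kate_count = 53_826   # Rambot articles according to Kate's tool

# Relative difference between the estimate and the tool's count.
rel_diff = abs(estimate - kate_count) / kate_count

# Normal-approximation 95% interval for the estimated count (an assumption).
p = estimate / TOTAL
moe = 1.96 * math.sqrt(p * (1 - p) / N)
low, high = (p - moe) * TOTAL, (p + moe) * TOTAL

print(f"relative difference: {rel_diff:.1%}")
print(f"95% interval: {low:,.0f} to {high:,.0f}")
print("Kate's count inside interval:", low < kate_count < high)
```

If these assumptions hold, the sample alone would only pin the Rambot count down to within roughly 17,000 either way, so a match this close is partly good fortune rather than a measure of the method's precision.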