User:Opabinia regalis/Article statistics
This user subpage is currently inactive and is retained for historical reference. If you want to revive discussion regarding the subject, you might try contacting the user in question or seeking broader input via a forum such as the village pump.
Note
The datasets below are old (2006-7), tiny, and not useful except as a historical reference.
Random article survey
I was bored waiting for my very slow program to run, so I clicked "random article" 250 times and kept track of what kinds of articles popped up. 48 articles (19.2%) were stubs or had at least one cleanup tag. (I tried to count "citation needed" as a cleanup tag but may have missed a few.) The results as of 11 Nov 2006:
Type of article | Number | Percent of sample |
---|---|---|
Biography | 60 | 24% |
Places/geographical locations | 34 | 13.6% |
TV shows/movies | 17 | 6.8% |
Disambiguation | 15 | 6% |
Music/bands/albums | 14 | 5.6% |
Company/product/service | 13 | 5.2% |
History/war | 12 | 4.8% |
Politics/government | 9 | 3.6% |
Sports | 8 | 3.2% |
Organisms | 8 | 3.2% |
Definitions/common phrases/common objects | 7 | 2.8% |
Architecture/buildings | 7 | 2.8% |
Mythology/religion | 5 | 2% |
Astronomy/physics/space science | 5 | 2% |
Software/computing | 5 | 2% |
Games (including video) | 4 | 1.6% |
Literature/publications | 4 | 1.6% |
Biology/medicine | 3 | 1.2% |
Food/drink | 3 | 1.2% |
Schools | 3 | 1.2% |
Math | 2 | 0.8% |
Nonsense/unclassifiable | 2 | 0.8% |
Visual arts | 2 | 0.8% |
Philosophy/ethics | 2 | 0.8% |
Linguistics/languages | 2 | 0.8% |
Charities/nonprofit organizations | 2 | 0.8% |
Economics/finance | 1 | 0.4% |
Deleted and protected | 1 | 0.4% |
"Biography" is probably a bit overinflated because I classified everything about an individual real person as a biography, including historical figures. Articles about fictional characters went in the category of the corresponding fiction (TV, myth, etc.)
Obviously this is a lousy way to determine Wikipedia coverage: 250 articles is a tiny sample. But the advantage over, say, counting category populations is that it avoids double-counting articles that sit in multiple categories and can find articles that are un- or miscategorized. Special:Random also (as far as I know) excludes recently created articles that haven't yet been indexed, which filters out lots of nonsense speedy-deletion candidates. I don't think Special:Random excludes deletion candidates, but none of these had prod or AfD templates.
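To put a rough number on "tiny sample", here is a minimal sketch of the approximate 95% margin of error for a few rows of the table. It assumes the 250 clicks behave like a simple random sample of articles (which Special:Random only approximates) and uses the plain normal approximation, which is known to be unreliable for very small counts such as the three school articles. The function name `margin_of_error` and the choice of rows are mine, for illustration only.

```python
import math

SAMPLE_SIZE = 250  # number of random-article clicks in the survey


def margin_of_error(count, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion.

    Uses the normal (Wald) approximation; treat the result as a rough guide,
    especially when count is small.
    """
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)


# A few rows copied from the table above: label -> number of articles.
rows = {
    "Biography": 60,
    "Places/geographical locations": 34,
    "Schools": 3,
}

for label, count in rows.items():
    share = count / SAMPLE_SIZE
    moe = margin_of_error(count, SAMPLE_SIZE)
    print(f"{label}: {share:.1%} +/- {moe:.1%}")
```

Under these assumptions the biography share is uncertain by roughly five percentage points either way, and for categories with only a handful of articles the margin is about as large as the estimate itself.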
First-glance observations:
- I didn't find a single chemistry article. Biology and medicine had one clinical feature, one cell biology article, and one disease, so not even any biochemistry showed up. Physics as such was also missing; the articles in that category were almost entirely about NASA missions and observations.
- Similarly, nothing I'd classify as sociology or psychology.
- The literature and publications category contains a comic book, a newspaper, and two contemporary novels. No classic/canon literature.
- I admit I'm a bit surprised at the low volume of school articles, which, judging from AfD, are infesting the place like weeds.
- Somehow I don't think that 14% of the sum of all human knowledge is TV, movies, games, and bands (17 + 4 + 14 = 35 of the 250 articles). Still, the percentage of video game cruft was lower than I expected. The music articles were almost exclusively contemporary popular bands or their albums (with reasonably diverse geographical coverage); there was nothing about music theory and nothing about classical music.
Recent mainspace changes survey
Inspired by Wikipedia:Wikipedia is failing and User:Worldtraveller/Wikipedia is failing (NB: leaving the redlink, in case further moves occur), I looked at a sample of 250 mainspace edits covering a time span of 04:43 to 04:46 UTC on 18 Feb 2007. (It would be interesting to gather these statistics again at a time when US schools are in session.) In this sample there were 159 edits by registered users, 89 edits by anonymous users, and 2 edits to a subsequently deleted image description page. Thus the percentages below take 248 edits as the total sample.
Change type | Percent of total sample (n = 248) | Registered edits of this type, as percent of total sample (n = 248) | Anonymous edits of this type, as percent of total sample (n = 248) | Percent of all registered edits (n = 159) | Percent of all anonymous edits (n = 89) |
---|---|---|---|---|---|
Substantial content changes | 5.2% | 4.0% | 1.2% | 6.3% | 3.4% |
Minor content changes | 28.6% | 17.3% | 11.3% | 27.0% | 31.5% |
Copyediting/formatting/wikilinking | 40.7% | 27.4% | 13.3% | 42.8% | 37.1% |
Tagging/maintenance | 8.5% | 6.5% | 2.0% | 10.1% | 5.6% |
Vandalism reversion | 8.9% | 7.3% | 1.6% | 11.3% | 4.5% |
Vandalism | 8.1% | 1.6% | 6.5% | 2.5% | 18.0% |
Other than determining whether an edit was vandalism, I did not make any value judgments. Thus, 'minor content changes' contains considerable amounts of unsourced material and original research that will certainly be reverted.
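The table mixes two normalizations: columns 2 to 4 divide by the whole sample (248), while the last two columns divide by each group's own total (159 or 89). A minimal sketch of how each row can be derived follows. Note that the raw per-type counts below are not published on this page; they are back-calculated from the percentages and should be read as a reconstruction. Only the group totals (159 and 89), the ten substantial registered edits, and the four registered vandalism edits are stated in the text.

```python
# (registered, anonymous) edit counts per change type.
# ASSUMPTION: reconstructed from the published percentages, not original data.
counts = {
    "Substantial content changes": (10, 3),
    "Minor content changes": (43, 28),
    "Copyediting/formatting/wikilinking": (68, 33),
    "Tagging/maintenance": (16, 5),
    "Vandalism reversion": (18, 4),
    "Vandalism": (4, 16),
}

total_reg = sum(reg for reg, _ in counts.values())     # registered edits
total_anon = sum(anon for _, anon in counts.values())  # anonymous edits
total = total_reg + total_anon                         # whole sample


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


for label, (reg, anon) in counts.items():
    row = (
        pct(reg + anon, total),  # share of the whole sample
        pct(reg, total),         # registered edits, share of whole sample
        pct(anon, total),        # anonymous edits, share of whole sample
        pct(reg, total_reg),     # share of all registered edits
        pct(anon, total_anon),   # share of all anonymous edits
    )
    print(label, *row, sep=" | ")
```

The two views answer different questions: vandalism by anonymous editors is only 6.5% of the whole sample, but 18.0% of what anonymous editors did.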
Other observations:
- I saw two ongoing edit wars and one addition of an inappropriate unfree image.
- Of the ten examples of substantial content changes by registered users, five were new-page creations. The single largest content change was on a Digimon article.
- One of the four examples of vandalism by registered users involved the creation of a nonsense page.
- I excluded bot-flagged edits (the default). The registered-editor set contains two edits by a known-bot account without a bot flag.
- The percentage of copyediting and formatting done by registered editors is probably inflated by AWB users.
General thoughts:
- I suppose it's a good sign that the rate of vandalism and the rate of vandalism reversion are about the same. However, that could be a function of the time of day.
- Substantial content addition occurs at quite a low rate. It's possible that this is due to editing patterns: if an editor uses many 'progressive saves', no single change will register as substantial in this sort of survey, and if an editor uses a single save for a large change, that editor's edit rate will be low and the change will be unlikely to appear in such a small sample. I didn't see much evidence of the first pattern, in that no series of edits to the same article by the same person occurred except to manipulate formatting; however, successive content-creating edits are likely to be separated by more than the three minutes this sample covers.