Wikipedia talk:Counter-Vandalism Unit/Vandalism studies/Study1

From Wikipedia, the free encyclopedia
This project page is within the scope of the Counter-Vandalism Unit, a WikiProject dedicated to combating vandalism on Wikipedia. You can help the CVU by watching the recent changes and undoing unconstructive edits; for more information, see the CVU's home page or cleaning up vandalism.
This project page is related to the Vandalism Studies project.

Finishing the project

I am thrilled that we have finished the last data point! I say we all go back and double-check the entries to make sure they are correct, and then publish the results out to the whole community. YEAH!!!! Remember 06:04, 25 February 2007 (UTC)

I'm stoked too, good job to all. I'll have a little extra time after work tomorrow and I'll try to pretty the project pages up a little bit, in case people feel warm enough to stay if they like the study. It is important that this all looks professional and orderly. Once we feel ready to post the results (we'll have to field some questions, I gather), I suggest we put them in these places:
  • counter-vandalism projects talk pages
  • statistic projects talk pages
  • community notice board
  • village pump (with a diff showing Rem's first post suggesting this project's creation)
  • Admin notice board (i think)
  • signpost (done last, after others have had some time to gauge reaction)
  • antispam project? (WT:WPSPAM)
  • people in the project's participant list for vandal studies
  • wikiEN-l listserv and wiki-research-l listserv.
I'd also like to get a proposed next study worked into the wording of our results post, something like: 'Our initial study showed X and Y from our sample. Further work is needed to add more power to these results, and study 2, where we will Z, is looking for more participants should you be interested.' In this way we can inform people and turn interest into more user power. What do you say for study 2: more data points with similar criteria for vandalism definitions? JoeSmack Talk 05:15, 27 February 2007 (UTC)
I totally agree with Joe. We should check our results, make them look nice and presentable and draw up conclusions, and then present them to the community along with a note that we will be gathering volunteers for our next study. As for study number 2, I would be fine doing the same study in a different month with more data points or we could try the study that Jack suggested. We should probably make an area on the talk pages of the vandalism studies project to discuss this more formally. Remember 11:39, 27 February 2007 (UTC)
I did some general cleanup on the study page. One thing I'm concerned about is using linkspam as a vandalism category - it was added partway through the study after being encountered, so all earlier edits should be double-checked to be sure no more belong in this category. Also, I noticed stuff like this: [1]. We need to check our math, keep the number of significant figures uniform (2? 3? how many?) and remove all ???? from the math part of this study. JoeSmack Talk 13:57, 28 February 2007 (UTC)
Hey guys, sorry for the late call.
Anyway, when did the linkspam category get added? Before which data point?
I'll be going through the math tonight.
One thing I think could come in handy is listing the types of articles and the sort of vandalism they get. For instance, I came upon a video game article that was vandalized around the time of its release. JackSparrow Ninja 16:32, 28 February 2007 (UTC)
It was added by Remember about halfway through, here at data point 21. Up to that point it was only me, and I didn't encounter anything like that. I think we'd need more data before inferring anything about the types of articles that get vandalized. What I'd also like to do is make sure the definitions for the categories of vandalism match what you (both you and Remember) used and how you were marking vandalized edits. That part is very important too. JoeSmack Talk 17:02, 28 February 2007 (UTC)
I'm sorry I haven't done much about it lately. I haven't really had the time to spend longer times at one thing here on Wiki. I should be able to again later this week. Could someone perhaps update me? What's next to do? JackSparrow Ninja 19:14, 6 March 2007 (UTC)
Just start double-checking all of our data. I have double-checked the first 20. Please feel free to check all of them again, because I want to make sure we are right before we finalize our conclusions and send them out to everyone. Remember 19:48, 6 March 2007 (UTC)
Okido. You'll hear from me when I'm done. I'm just gonna do it one big batch. JackSparrow Ninja 18:26, 7 March 2007 (UTC)
Just a little update. It's taking me a little longer than expected, but I've worked through to Data Point 70 now. Seems good so far, with the updates from you guys. JackSparrow Ninja 03:46, 14 March 2007 (UTC)

significant figures

I'd propose keeping the math to 2 significant figures when showing percentages; we're yet to deal with a million billion data points, so anything more seems like overkill. How's that sound to you guys? JoeSmack Talk 17:20, 28 February 2007 (UTC)

Sounds good to me. Remember 17:37, 28 February 2007 (UTC)
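For what it's worth, "2 significant figures" and the two-decimal-place percentages the study's tallies actually use (e.g. 4.64%) are not quite the same thing. A small Python sketch showing the difference; the `sig_figs` helper is my own illustration, not anything from the study:

```python
from math import floor, log10

def sig_figs(x, figs=2):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0.0
    return round(x, figs - 1 - floor(log10(abs(x))))

# 31 vandalism edits out of 668 total edits is 4.6407...%
sig_figs(4.6407)   # 4.6  -- two significant figures, as proposed here
round(4.6407, 2)   # 4.64 -- two decimal places, as the tallies actually show
```

Either convention is fine for numbers in this range; the main thing is picking one and applying it uniformly.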

we may need a numbers guy

I think we should request a consult/assistance from Wikipedia_talk:WikiProject_Mathematics before we start presenting this, just to be safe. They have a pretty active group over there. JoeSmack Talk 17:38, 28 February 2007 (UTC)

I was thinking the same thing. Please go ahead and ask. Remember 17:39, 28 February 2007 (UTC)
Done. JoeSmack Talk 17:57, 28 February 2007 (UTC)
Hi! I read JoeSmack's message on the math talk page, so I came over here to take a look at what you're doing. You're certainly a dedicated bunch. Gathering the statistics by reviewing the edit histories one at a time must have taken a lot of effort. Anyway, here are my comments.
  • I didn't check all your arithmetic, but I did notice one arithmetical error, in the grand totals for all 100 articles. For November, 2005, 15/273 = 5.49%, not 4.49%.
Fixed, good eye! JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • To do simple statistical analyses like the one you're after, spreadsheet software is appropriate. Has anyone entered the data into a spreadsheet program, like Excel, or Open Office? If not, I'll try to get it done in the next couple of days.
A fantastic idea, it would really cut down on some of this error. I'm not especially adept at using Excel, so if you have the time it'd be great if you could. It'd also be a whole lot easier to see the data in a page or two as opposed to the many pages of wiki format. Good thinking. JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • Using the "random article" button to select articles for analysis makes a lot of sense. Selecting only the month of November for analysis makes less sense. Ideally, you might want to generate a random integer 1 through 12 (using a pseudorandom number generator) for each article selected for analysis, then analyze the edits for that month for that particular article. The problem with the procedure you used is that it may have introduced an unintentional systematic bias. Human behavioral patterns vary with the seasons, so it may be that you got an exceptionally high reading (or an unusually low reading) because people are grumpy (or benevolent?) in November, on average. Not that it's a big deal. Call it an opportunity for improvement.
A confound I hadn't thought of; we originally made it one month, I think, so that we weren't being inaccurate in saying around 5% of edits in a month are vandalism - what if the month has 28 days and not 31? Ack! It also adds another layer of complexity to the study, which isn't bad if the trade-off is worth it. I've moved a lot of these comments to study 2's talk page as background for the next round, so maybe we can incorporate that there. JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • There are quite a few 'bot authors on Wikipedia. The process of extracting the raw data, at least, might be automated somehow. For example, a 'bot might select the random articles, select a month at random, then extract only the edit history records you're interested in and dump the whole thing into one page somewhere, where you guys could study the data without doing so much data collection. Just a thought.
Very true, very true. I see this as a more further down the road solution, but it can be done. I know a few botworkers too, so I might be able to rustle up some help from Eagle 101, Betacommand, Heligoland or anyone else who hangs out in #wikipedia-spam-t where i spend a lot of time too. JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • When you write your report, you might want to present the numbers two ways – both with and without the randomly selected articles that you discarded because no edits occurred in November. I'm only suggesting this because if you include that number of articles you can compute the likelihood that a randomly selected article is going to be edited in November, at least once in three years. OK, you'd probably want to dress it up a little and present it as the probability that a randomly selected article gets edited within one month. Anyway, it would just be good statistical practice to report how many articles you bypassed because there were no edits in November. It's part of full disclosure.
We included them at the top, so I think this can be managed without too much trouble. Good point. JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • Your overall result (roughly 5% of all edits during the period studied attributed to vandalism) is fairly interesting. The distribution of times it took to revert the damage (30 instances, roughly 15 hours, on average) might be of more interest than the average itself. I'll collate those 30 data points, at least, and write something more about it in a little while.
Thanks for all your hard work! DavidCBryant 23:17, 28 February 2007 (UTC)
And thank you for yours! :D JoeSmack Talk 18:33, 1 March 2007 (UTC)
Hi! It's me, again. I've collated the data for the 30 (or is it 31?) instances of vandalism.
  • 12 articles were vandalized, out of 100. 6 were vandalized once, 3 were vandalized twice, 2 were vandalized three times, and 1 (#78) was vandalized twelve times.
  • Data point #78 is confusing. You say it was vandalized 6 times in November, 2006, but the events are labeled "#1", "#2 & #3", "#3 & #4 & #5", and "#6". So I'm not sure if it was six instances, or seven. Somebody might want to double-check this data point and fix the labels. Or report 31 instances of vandalism, whichever is right.
I don't think I did data points 70-79. Remember or Jack, did either of you do this and remember what was up? JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • As I suspected, the distribution of time elapsed before vandalism gets fixed is very interesting. I see 3 cases at 0 minutes (should be < 1 minute); 2 cases for each of 1, 4, 8, and 11 minutes; and 3 cases of 13 minutes. The rest of the times (sorted) look like this: 14, 18, 23, 29, 51, 104, 222, 452, 490, 895, 898, 1903, 2561, 4188, 6186, and 7991. The mean (average) is 491 minutes. But the median is 14 minutes -- half of the observed instances of vandalism were repaired within 14 minutes, and half took longer than 14 minutes. 80% of all the cases are "average" or better, and only 20% are "worse than average".
Hmm, interesting. We should include something about this in our summary report. JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • Even more interesting is the distribution of repair times for articles that were vandalized more than once within a month. Those sequences look like this: 6816, 7991; 11, 11; 23, 4188; 1, 1, 0; 104, 222, 898, 895, 1903; 51, 13, 13, 8, 8, 0; and 452, 29, 4. I'm probably reading too much into this, but the sequences {1, 1, 0}, {51, 13, 13, 8, 8, 0}, and {452, 29, 4} are very suggestive. It's almost as if the defenders of Wikipedia are (sometimes) gathering around recent victims of vandalism and repairing the damage more quickly when an article is the subject of repeated attacks, sort of like the human body's immune response.
I like the picture you drew here with an immune system; and I agree, articles that get slammed with vandalism get babysat a bit after so reverting is often quicker. JoeSmack Talk 18:33, 1 March 2007 (UTC)
  • I have a technical question. You've divided the edits into two classes, "vandalism" and "not vandalism". I think three classes might be more appropriate: "vandalism", "revert vandalism", and "not related to vandalism". I think the distinction is meaningful, and probably not too hard to make. Anyway, I'm not sure how you counted reverts in your raw data, but maybe I didn't read the report closely enough.
What do you mean by 'revert vandalism'? Vandalism that is reverted later, or people who've vandalized once then revert the reverter? JoeSmack Talk 18:33, 1 March 2007 (UTC)
That's it for now. I've got to go shovel snow off the sidewalk! DavidCBryant 00:35, 1 March 2007 (UTC)
Good luck! I've got to catch a bus, CMummert, I'll comment on yours a bit later today. Thanks David for all the hard work, and I invite you to look on over study 2 as well! :) JoeSmack Talk 18:33, 1 March 2007 (UTC)
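DavidCBryant's suggestion above, picking the month at random per article instead of always auditing November, could be sketched like this (a hypothetical helper, not part of the study's actual procedure):

```python
import random

# All twelve months; the idea is to draw one per randomly selected article
# instead of fixing November for every article.
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def pick_study_month(rng=random):
    """Pick the month to audit for one randomly selected article."""
    return rng.choice(MONTHS)
```

Seeding the generator (e.g. `random.Random(42)`) would also make the month selection reproducible, so anyone double-checking the study could regenerate the same sample.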

comment by CMummert

The study is very interesting and informative, but the small sample size (100) makes the final numbers subject to a large margin of error. With the sample of 100, the estimate of 5% of edits being vandalism has a margin of error of about 4% at 95% confidence; so I conclude from your numbers that there is a very high chance the real vandalism rate is less than 9%. In order to have a 2% margin of error with 95% confidence, if the real percentage of vandalism edits is 5%, you need to sample about 475 articles. Fortunately, the total number of WP articles doesn't matter, only the number that you sample.

A second, more interesting, problem is that you are measuring the average percentage of edits per article that are vandalism. But there is another statistic that is equally valid: the average percentage of total WP edits that are vandalism. To see the difference, think about the extreme case where only one article on WP is ever vandalized, but it received 100% vandalism edits. Then your survey would show 0% average vandalism unless you were lucky enough to find that one article with the random article button. To measure overall averages, you would need to take a random sample of 1000 edits (I have no good idea how to do that without using a database dump) and determine how many of them are vandalism.

Nevertheless, your survey is very interesting. No study is ever perfect, and you seem to be planning more. This sort of work is the only way to dispel the random speculation about vandalism. CMummert · talk 03:00, 1 March 2007 (UTC)
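CMummert's figures can be reproduced with the standard normal-approximation formulas for a proportion, taking z ≈ 2 for 95% confidence. This is a sketch of the textbook calculation, not code from the study:

```python
import math

def margin_of_error(p, n, z=2.0):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def sample_size(p, margin, z=2.0):
    """Smallest sample size whose margin of error is at or below the target."""
    n = 1
    while margin_of_error(p, n, z) > margin:
        n += 1
    return n

margin_of_error(0.05, 100)  # about 0.044, i.e. the roughly 4% margin quoted above
sample_size(0.05, 0.02)     # about 475, matching the sample size quoted above
```

With z = 2 (a common rule-of-thumb stand-in for 1.96), a true rate of 5% and a 2% target margin gives (2/0.02)² × 0.05 × 0.95 = 475, which is where the figure in the comment appears to come from.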

Some good points. Margin of error and significance are definitely something I want to add to this study or the next; in reality this is a statistics domain and should be treated like one. I know when we started it seemed like a drive to get the data; I was thinking people could play around with it later. I'm glad you're thinking of more ways to interpret this stuff, and of the limitations it has as well.
You're right about sample size; it needs to get bigger. One thing to note is that if this study has a sturdy enough procedure, we can use our old data in the new study, starting off with 100 data points. We'll see how things unfold over at study 2, this may very well be the case. More is better, true true true.
I see the distinction you make between an article's edits being vandalism and wikipedia's edits being vandalism. One thing to note: vandalism doesn't just happen in mainspace; it happens on talk pages, user pages, wikipedia pages - you name it. People seem to care the most about mainspace because it is where people go to get the straight dope, and finding crap there instead is worrisome when it comes to credibility, reliability, etc. The distinction also matters in the sense that some articles are vandalized more than others. AIDS, for instance, which I helped foster over the years and which is probably one of the two wikipedia articles that got me started on wikipedia at all, receives a LOT of vandalism, probably more than the average article. I know JackSparrow got a taste of this when he hit data point 99, a Zelda video game (The_Legend_of_Zelda:_The_Minish_Cap) with a nearby release date. Do these data points reflect an artificially inflated amount of vandalism? Do we throw up our hands and hope enough randomization accounts for the occasional frequent-target article?
How do you guys feel about this, is there a better way to approach this than how it is currently being done? Is there a way that study 2 can be conducted that combats this in any way? JoeSmack Talk 03:31, 2 March 2007 (UTC)
Oh, and another thing; what if we hit an article with semi-protection/protection? Do we ignore it? JoeSmack Talk 06:12, 2 March 2007 (UTC)

Begun double checking

I have begun to double check all of our entries for the study. I will write here my progress and any issues I find. Remember 18:05, 28 February 2007 (UTC)

Found a mistake for data point 5, Crescent City Connection: it says that it had 3 edits in 2005, but the history shows that it had none. Unfortunately this was an early mistake that will affect all our math. I will correct it later unless someone wants to do it sooner. Remember 18:21, 28 February 2007 (UTC)
Phew, corrected all the affected data. It HAD to be an early data point! ;) I also found in this [2] that the final total somehow got inflated, for both the 2005 total and the final total. As painstaking as it is, the numbers need to be double-checked in both the cumulative results sums and the percentages (which should be shaved down to 2 significant figures if they aren't already, or bumped up too, I suppose). JoeSmack Talk 19:16, 28 February 2007 (UTC)

I have now double-checked all of our data and calculations for data points 1-25. Slowly but surely we will finish this thing. Remember 03:29, 2 March 2007 (UTC)
Mistakes number two and three: Anthology (Bruce Dickinson DVD) and Gillian Spencer do have edits in November. I guess we should add these as points 101 and 102. Remember 03:45, 2 March 2007 (UTC)

I presume this was from the list of articles with no November edits? If they do actually have edits, it'd be remiss not to add them; otherwise our study wouldn't be random (we'd be discriminating against some articles). We need to find a way to increase reliability. I'm starting to have second thoughts about spreading results from study 1 when it has these kinds of speed bumps that can be flattened in study 2. JoeSmack Talk 04:09, 2 March 2007 (UTC)
I'm still for publishing our data once we have double-checked everything just to raise awareness so we can conduct study 2 quicker and with more participants. Remember 12:53, 2 March 2007 (UTC)
Ok, then let's do this; if you're in, I'm in. Still, you might want to attach this little disclaimer for now: "Measuring is easy. What's hard is knowing what it is you're measuring. This is a preliminary study; if you have any suggestions for improvement, please join us for study 2." JoeSmack Talk 19:27, 3 March 2007 (UTC)
Whoops, I spoke too soon. Those articles (Anthology (Bruce Dickinson DVD) and Gillian Spencer) were incorrectly listed as not having data points, but they were also listed as being the 23rd and 27th data points, so I have just taken them out of the early section and it shouldn't affect our math. Remember 18:35, 7 March 2007 (UTC)

Administrator help

We may want to enlist the aid of an administrator who can help us lock the page before this study goes public, because I have a feeling that some people would enjoy the irony of vandalizing a project that studies vandalism. Remember 18:06, 28 February 2007 (UTC)

If this becomes a serious issue (I'll keep an eye out when the time comes), I'll take care of getting it semi-protected if need be. JoeSmack Talk 03:33, 2 March 2007 (UTC)

Current status

I've now double-checked everything for the first 50 data points. I found two new errors. The first is with Kyle Vanden Bosch: it should be four edits in November 2005 instead of three. The second is Jean Théodore Delacour: there are 2 edits in 2005, not 1. I need to change these entries and double-check all the math up to this point. Any help double-checking is welcome. Remember 16:46, 10 March 2007 (UTC)

I'll give a hand this weekend, sure thing. JoeSmack Talk 19:20, 10 March 2007 (UTC)
I have added the new data to the Kyle and Jean data points, but I have now found a larger error in the Shadow of the Colossus data point. For November 2006 it says there are only 4 edits, but there are actually 20! I'll keep double-checking this stuff, but we need to get this data accurate before our study can mean anything. Remember 22:28, 10 March 2007 (UTC).

I have now gone through all data points 1-60, but I haven't rechecked the math. Remember 22:52, 10 March 2007 (UTC)
I have now gone through 1-70; I found some more errors and corrected them. Now all the data points from 1-70 should be correct, but all the math is wrong. If someone wants to help, they can verify the last 30 data points or redo the math for all the sections. Remember 23:17, 10 March 2007 (UTC)
I have now gone through 75 points. Remember 20:39, 13 March 2007 (UTC)

Alright, I've gotten through all the math from 1-70. Yowzah, study 2 needs to use a spreadsheet instead of this manual method. One question I have: Data Point 68 here is marked as vandalism (no category given on the data point; in the tallies it is noted as 'obvious vandalism'). To be fair, I have no idea if this was vandalism or not (I don't speak/read Arabic). But, also to be fair, the IP's contribution in question was reverted by a user, and the IP has a history of vandalism. How should we go about this? JoeSmack Talk 00:35, 12 March 2007 (UTC)
Combining the revert and the IP's history, it seems like vandalism. But perhaps we should ask whoever reverted it. He probably knows why, and thus whether it was vandalism. JackSparrow Ninja 00:53, 12 March 2007 (UTC)
I was the one who originally categorized this as vandalism. I left a message asking the person why he reverted this language, but I have gotten no response. I also tried doing a Google search for the Arabic that was put in the place of the original text, and it did not seem to come up with the right person once I translated the page, but I will admit that this is a very crude method. We should find someone who reads Arabic. Remember 11:44, 12 March 2007 (UTC)

I have now double-checked the first 90 data points. Only 10 more to go!!!! Remember 20:50, 19 March 2007 (UTC)


Now all we need to do is check all of the calculations for the page and write the conclusions! We are almost finished. Please help out if you can. Remember 22:56, 19 March 2007 (UTC)

I have recalculated up to data point 80. I've gotta end it for today, so should anyone want to continue before I get back, here are the changes that still have to be made from 81 onwards:

Edits in 2004 -> +2
Edits in 2006 -> -2
Vandalism edits in 2006 -> +1
Total vandalism edits -> +1
Time before reverting, total -> 4615 becomes 4599
Reverts by anonymous editors -> +2

JackSparrow Ninja 07:18, 21 March 2007 (UTC)

I'm seeing systemic math whoopsies in the reverting numbers/averages. We're going to have to go through this once more afterwards; they keep popping up and we want to do this right. I'm still going, but after I'm done with 80-100 we'll have checked the math twice. JoeSmack Talk 16:46, 24 March 2007 (UTC)
I've finished checking the math. I found it kind of funny that there were exactly 666 edits; good to end on a laugh. It just needs to be triple-checked for surface and aesthetic math: make sure numbers are rounded correctly (anything that is 6.365 on a calculator should look like 6.37 on the wiki, and 3.4's should become 3.40's), and make sure all the numbers add up to their totals. This last time around is gonna be way less nerve-wracking (where the hell did these 17 edits come from!? How far back does this go?! Ahhh!!!). The light at the end of the tunnel is near! :D JoeSmack Talk 17:53, 24 March 2007 (UTC)
Awww, it was 668 in the end, darn. ;) Alright, let's get cracking on what to do next in the following section. JoeSmack Talk 19:25, 24 March 2007 (UTC)


I have just triple-checked all the data points and made the appropriate corrections. I typed in all the data and did it in an Excel spreadsheet, so there should be no mistakes at all. All we need to do now is write the conclusion and publish!!!!!!! Congratulations to all!!!!!!!!!!! - Can someone get a statistician or someone else to help with the results? Remember 19:07, 24 March 2007 (UTC)

AWESOME! Ok! Quick thank you to Rem for busting out Excel for the triple check - we have the benefit of human and computer accuracy now. Here's what to do next now that the study is complete:
* Double-check with an Arabic WikiProject or a fluent speaker (found via Babel userboxes) that Data Point 68 here is in fact vandalism (here's the diff).
I've messaged User:Mohammed_Khalil about this, a frequent contributor and native Arabic speaker (I found him using Babel userboxes). The link to the request is here. Let's hope he'll help! JoeSmack Talk 23:45, 24 March 2007 (UTC)
I've asked #wikipedia and #wikipedia-ar with no luck. No one in #wikipedia speaks Arabic, and no one in #wikipedia-ar spoke English. I'd suggest trying again later in the day (it's 11pm PST here). JoeSmack Talk 06:28, 25 March 2007 (UTC)
Google, the great linguist that it is, finds a discernible difference between the two versions of the article (namely the two Arabic words). The before-diff Google search of the Arabic word gives 47,600 results; the after-diff search gives 45 results, all Wikipedia mirrors. The interesting thing is that, using Google's translator, the before-diff results all link to sites about a man named 'Osama bin savior' who looks to have been born the same year as 'Usamah ibn Munqidh', the article's subject. Thus, they may be the same person, and the revert the editor made may have been in error. What do others think, do we need to revise the math? JoeSmack Talk 07:01, 25 March 2007 (UTC)
I'm hesitant to revise now because of all the math and revisions required. For some reason the person who reverted it thought it was incorrect (but I couldn't get him to tell me why). I say we try to get someone to tell us what it means in the next couple of days and then reassess at that point. Remember 13:48, 25 March 2007 (UTC)
* I think instead of tapping a statistician on the shoulder we should ask the couple of helpful people who commented before from Wikipedia_talk:WikiProject Mathematics, User:DavidCBryant and User:CMummert. I'm especially interested right now in the wide range of times to revert vandalism; the current total mean is 12.94 hours, which I think is skewed.
WikiProject Mathematics has been asked here. Both CMummert and DavidCBryant have also been asked on their respective talk pages. JoeSmack Talk 00:00, 25 March 2007 (UTC)
* Through consensus, work out what to put in the conclusions section here, then move it over to the study.
  • Work out where we want to present the findings, and in certain cases (like the community notice bulletin board) the phrasing that should be used.
Great job to all involved, it feels good to have the data and math finally finished! Hurray! :-D JoeSmack Talk 19:23, 24 March 2007 (UTC)
Oh, and we also need to report in the conclusions how many articles were pulled, including the ones with no edits in 2004/2005/2006 - you know, the ones at the top of each set of 10 data points. JoeSmack Talk 00:05, 25 March 2007 (UTC)
Remarkable results. For the next study, an interesting descriptive stat would be the percentage of non-vandalism (and non-revert) edits done by registered and unregistered users. I'd guess that registered users would do significantly more than the 75 percent of total reverts done by registered users, because they might tend to follow an article more closely to begin with. This study gives a good baseline for determining how relatively bad the vandalism problem is for a given article. --Thomasmeeks 00:43, 25 March 2007 (UTC)
Good thoughts. Study 2 is already being formed. Hop on over, we'd love to hear more of your thoughts, ideas and ways we can improve. JoeSmack Talk 05:10, 25 March 2007 (UTC)

Congratulations. This is a very interesting study, and I hope that it leads to a good conversation about vandalism. I don't see any glaring problems with the summary, assuming that the calculations are correct. The average time to revert seems accurate here; many of the articles seem to be reverted very slowly. The reason it seems high is likely that this survey didn't happen to find any of the relatively few articles that are the subject of lots of vandalism. You might want to point this out in your conclusion: it is possible that there are a few articles that get a lot of vandalism and many articles that get little vandalism. In that case, the average amount of vandalism per article will not reflect the true condition of any article. CMummert · talk 11:35, 25 March 2007 (UTC)

I concur. Good job on this, it's interesting and well-done. Paul Haymon 04:26, 30 March 2007 (UTC)

Draft conclusion

Here is the current draft of the conclusions for the article. Feel free to edit it as you see fit. Remember 22:58, 24 March 2007 (UTC)


The current study analyzed a sample pool of 100 random articles. Within these 100 articles there were a total of 668 edits during the months of November 2004, 2005, and 2006. Of those 668 edits, 31 (or 4.64%) were vandalism of some type. The study suggests that approximately 5% of edits are vandalism, and that roughly 97% of that vandalism is done by anonymous editors. The vast majority of vandalism is obvious. From the data gathered within this study it was also found that roughly 25% of vandalism reverting is done by anonymous editors and roughly 75% by wikipedians with user accounts. The mean time to revert vandalism was 758.35 minutes (12.64 hours), a figure that may be skewed by outliers. The median time to revert was 14 minutes.

Detailed conclusions

This study compiled 174 random articles from the 'Random article' sidebar tool. Of the 174 articles compiled, 100 had at least one edit during the months of November 2004, 2005, or 2006 (the other 74 articles had no edits in those months and thus were omitted from the sample pool of data points). There were a total of 668 edits during November 2004, 2005, and 2006 across those 100 data points. Of those 668 edits, 31 (or 4.64%) were vandalism of some type.

There did not seem to be any trend of vandalism growing or shrinking as a percentage of total edits; rather, vandalism seemed stable at between 3% and 6% of edits over each of the periods reviewed. In Nov. 2004 vandalism represented 3.49% of edits (3 cases of vandalism out of 86 edits), in Nov. 2005 it was 5.04% (14 cases out of 278 edits), and in Nov. 2006 it was 4.61% (14 cases out of 304 edits).

As for the type of vandalism, by far the most common was obvious vandalism, which accounted for 83.87% (26 out of 31 cases). Deletion vandalism was next at 9.68% (3 out of 31), with linkspam the least common at 6.45% (2 out of 31 cases).

Those who vandalized pages were overwhelmingly anonymous editors, who accounted for 96.77% of all vandalism edits (30 out of 31). While anonymous editors did far more vandalizing, they also contributed to reverting vandalism 25.81% of the time (8 out of 31 reverts), while all other reverts were done by wikipedians with user accounts (74.19%, 23 out of 31 reverts).

The average time it took to revert a vandalism edit was 758.35 minutes, or 12.64 hours. This was much higher than the figure suggested by an earlier IBM study (See pdf here). Nevertheless, the median time for reverting was 14 minutes (0, 0, 0, 1, 1, 3, 4, 4, 6, 7, 8, 10, 11, 11, 13, 14, 18, 23, 29, 51, 104, 222, 452, 490, 895, 898, 963, 1903, 2561, 6816, 7991).
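As a quick arithmetic check of the mean and median quoted above, the listed revert times can be fed to Python's statistics module. This is just a sketch; the list is copied verbatim from the study data:

```python
import statistics

# Revert times in minutes for the 31 vandalism edits, sorted (from the study data)
revert_times = [0, 0, 0, 1, 1, 3, 4, 4, 6, 7, 8, 10, 11, 11, 13, 14,
                18, 23, 29, 51, 104, 222, 452, 490, 895, 898, 963,
                1903, 2561, 6816, 7991]

mean_minutes = statistics.mean(revert_times)      # arithmetic mean
median_minutes = statistics.median(revert_times)  # middle (16th) value

print(round(mean_minutes, 2))       # 758.35 minutes
print(round(mean_minutes / 60, 2))  # 12.64 hours
print(median_minutes)               # 14 minutes
```

Both figures in the text check out, and the gap between them is exactly the skew the outlier caveat refers to.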

Summary of results[edit]

Current cumulative tally

Total edits 2004, 2005, 2006 = 668
Total vandalism edits 2004, 2005, 2006 = 31
Percentage of vandalism to total edits = (31/668)= 4.64%

November 2004

Total edits in November 2004 = 86
Total vandalism edits in 2004 = 3
Percentage of vandalism to total edits = (3/86) = 3.49%

November 2005

Total edits in November 2005 = 278
Total vandalism edits in 2005 = 14
Percentage of vandalism to total edits = (14/278) = 5.04%

November 2006

Total edits in November 2006 = 304
Total vandalism edits in 2006 = 14
Percentage of vandalism to total edits = (14/304) = 4.61%

Percentage of overall vandalism that was

Obvious vandalism = (26/31) = 83.87%
Inaccurate vandalism = (0/31) = 0%
POV vandalism = (0/31) = 0%
Deletion vandalism = (3/31) = 9.68%
Linkspam = (2/31) = 6.45%

Percentage of overall vandalism that was done by

Anonymous editors = (30/31) = 96.77%
Editors with accounts = (1/31) = 3.23%
Bots = (0/31) = 0%


Average time before reverting = ((7991+14+6816+18+2561+4+11+11+0+13+963+23+1+0+0+104+222+490+898+895+1903+10+51+7+6+8+3+1+452+29+4)/31) = 758.3548 minutes = 12.64 hours.
Percentage of reverting done by
Anonymous editors = (8/31) = 25.81 %
Editors with accounts = (23/31) = 74.19 %
Bots = (0/31) = 0 %

Note -

Some articles compiled had no data and thus were omitted from the sample pool of data points. The sum total of these articles that had no edits for November 2004, 2005, 2006:

14+12+6+2+3+6+5+5+11+10 = 74 articles

Data point 68 mystery solved[edit]

Allllright, I did a lot of digging and I've got the mystery beat. The short answer is that technically, neither the IP who edited the Arabic name in this dif nor the user who reverted it was wrong. Both are the same words, i.e. the name of the man of the article, Usamah ibn Munqidh. The IP changed it to a 'basic form' unicode encoding that is Arabic (Arabic block, U+0600 to U+06FF). The user reverted this change back to the previous encoding, that of a 'presentation form' (Arabic presentation forms block, U+FE70 to U+FEFE). From our nifty little Unicode guide it says (pg 10 of your pages, page 194 of the PDF guide):

Optional Features: Many other ligatures and contextual forms are optional - depending on the font and application. Some of these presentation forms are encoded in the range FB50..FDFB and FE70..FEFE. However, these forms should _not_ be used in general interchange. Moreover, it is not expected that every Arabic font will contain all of these forms, nor that these forms will include all presentation forms used by every font.

So our well-wishing IP tried to make the name more compatible by changing the unicode that represents the Arabic characters to a more basic/compatible form, and our well-wishing user reverted it back to the presentation form, perhaps thinking the IP was full of it. As far as I know it is still in this 'presentation form' in the article right now. One could also note that the 'basic form' returns 47,000 hits from google while the 'presentation form' returns 45 hits, all wikipedia mirrors. This was all conducted with help from 'alnokta', a fluent speaker of Arabic in #wikipedia-ar, and 'TimStarling', a tech wiz over in #wikimedia-tech. I think the right thing to do would be to retract this as a vandalism edit and decrease the numbers one tick from data point 68 on down. I know it isn't the most favorable thing to do, but we'd be remiss in ignoring it. JoeSmack Talk 16:37, 25 March 2007 (UTC)

OK, I'm going to try to revise all the data. Without the 68 vandalism. Remember 17:21, 25 March 2007 (UTC)
OK I have updated all the data and will update the draft conclusion. Remember 17:49, 25 March 2007 (UTC)
Ok I have updated everything now so everything should be correct. Feel free to double check if you want. Remember 17:59, 25 March 2007 (UTC)
I've given it a quick look and everything seems to be well. Thank you for revising! Let's start working on letting people know! JoeSmack Talk 20:24, 25 March 2007 (UTC)
Wait, is it fixed? Data point 68 still has it listed as vandalism, 'Vandalism 2'... JoeSmack Talk 20:33, 25 March 2007 (UTC)
I think what happened is you revised all the math but forgot to remove the false positive. I've done so, everything else looks ok... JoeSmack Talk 20:35, 25 March 2007 (UTC)
I believe you're right. Remember 15:12, 26 March 2007 (UTC)

final loose ends ?[edit]

any final loose ends before we start releasing conclusions? is the draft conclusion satisfactory, can we move it over to the study data page? anything anyone can think of that they still want to address before we wrap this up? JoeSmack Talk 20:27, 25 March 2007 (UTC)

It looks all good to me. I've been thinking about the draft, but I think it's great as it is now. JackSparrow Ninja 20:40, 25 March 2007 (UTC)
I think this looks good. I would like to get a stats person to see if they could give us more information on our margin for error, but I say we publish soon.Remember 00:56, 26 March 2007 (UTC)
Okee dokee. Try here: [3] a list of 'what links here' to a 'I'm a statistician' userbox. Ask one, hopefully one that is very active (for a quick reply) to give it an eyeball. :-) JoeSmack Talk 01:05, 26 March 2007 (UTC)

Form language of conclusions to notify community of our results[edit]

I was thinking that we should draft some form language that we could post to all of the sites we mentioned to get the word out about our study. This would be a concise summary of the conclusion along with a call to participate in the vandalism study project and the next study. Below is the proposed language. Please feel free to revise. Remember 15:30, 26 March 2007 (UTC)

The WikiProject on Vandalism studies recently finished its first study and has published its conclusions (a full and detailed copy of the conclusions can be found here).
The first study analyzed a randomly sampled pool of 100 articles. Within these 100 articles there were a total of 668 edits during the months of November 2004, 2005, and 2006. Of those 668 edits, 31 (or 4.64%) were vandalism of some type. The study's salient findings suggest that in a given month approximately 5% of edits are vandalism and 97% of that vandalism is done by anonymous editors. Obvious vandalism was by far the most common type. From the data gathered in this study it was also found that roughly 25% of vandalism reverting is done by anonymous editors and roughly 75% by wikipedians with user accounts. The mean time before vandalism is reverted is 758.35 minutes (12.64 hours), a figure that may be skewed by outliers. The median time before vandalism is reverted is 14 minutes.
Currently the project is working on a related study, Wikipedia:WikiProject Vandalism studies/Obama article study, and is also beginning to draft up the parameters of our second major study (see Study 2). If you are interested in our work, please participate in our efforts to help us create a solid understanding of vandalism and information on wikipedia. Thanks.
Looking good, i'll give this more eye as the day goes along. We don't want things to be too impersonal, so some should be different depending on where they are left. I wrote up a good wording for the community bulletin board for instance, but that's in a text file at home (i'm at work). JoeSmack Talk 16:31, 26 March 2007 (UTC)
P.S. I hate spam, and i don't want this to feel too spammy. Let's report on the Community Bulletin, the Village Pump and the Counter-vandalism unit wikiproject talk page. After these three, we'll worry about other stuff; they may pick it up on their own. JoeSmack Talk 17:09, 26 March 2007 (UTC)
I'd also like to put it on the tip line of the Signpost. I think this might make a good article. Remember 17:16, 26 March 2007 (UTC)

Some suggestions[edit]

  1. You might add a disclaimer that you possibly missed some very cleverly disguised "inaccurate vandalism".
In retrospect I agree, some could be interpreted as "inaccurate vandalism"; we need to make our definitions between the different types of vandalism more rigid for the next study. JoeSmack Talk 00:50, 27 March 2007 (UTC)
  2. In the presentation, the division in blocks serves no purpose that I could discern.
Really it makes it easier to scroll down the data from the table of contents at the top, and that's about it. It also helped divvy up the work between three of us (I can do about 10 data points tonight, or oh, i can double check data points 30-50). JoeSmack Talk 00:50, 27 March 2007 (UTC)
  3. The presentation can be made more concise and more easily surveyable by presenting the data in tabular form, with a separate section listing the vandalism edits found (wikilinked to from the table).
Would be great. The pith of Wikipedia is prose, so I think it should be presented first, but tables rock and help really examine data from another angle. If you can tighten up the wording of the presentation (See Official Conclusions above, or on the project space) that'd be great. I tried to cut it down as much as possible, but i've been working on this study for months and everything seems precious to me. ;) JoeSmack Talk 00:50, 27 March 2007 (UTC)
  4. For a very skewed and possibly heavy-tailed distribution like that of time-before-reverting, the sample arithmetic mean is not a meaningful measure for the average. (Consider the case where 999 of 1000 vandalism edits are reverted within 1 minute, while 1 vandalism edit was never reverted; isn't that better than if all 1000 of 1000 vandalism edits are reverted in between six months and a year?) Instead, a graph showing the percentage p(t) that was reverted after t minutes is more informative; because t would range from 0 to a whopping 7991, you should use a modified log scale for the horizontal axis, where the horizontal offset is proportional to log(t+1). The vertical scale would simply run from 0% to p(7991m) = 100%. I've sketched how this would look in ASCII:
100% +-------------------------------------------------------------------------
     |                                                            ----------   
 90% +                                                          ---            
     |                                                    -------              
 80% +                                                    |                    
     |                                               ------                    
 70% +                                   -------------                         
     |                              ------                                     
 60% +                        -------                                          
     |                      ---                                                
 50% +                    ---                                                  
     |                  ---                                                    
 40% +                  |                                                      
     |                ---                                                      
 30% +              ---                                                        
     |            ---                                                          
 20% +          ---                                                            
     |    -------                                                              
 10% +-----                                                                    
  0% +-------------------------------------------------------------------------
     |    |   |    |       |    |    |    |     |  |    |     |    |    |   |  
     0m   1m  2m   5m     15m  30m   1h   2h    4h 6h  12h    1d   2d   4d  6d  

 --LambiamTalk 17:27, 26 March 2007 (UTC)
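The p(t) curve sketched above can be computed directly from the study's revert times. Here is a minimal stdlib-only sketch that prints the suggested log(t+1) offsets alongside the percentages rather than plotting them:

```python
import math

# Revert times in minutes (from the study data above)
revert_times = [0, 0, 0, 1, 1, 3, 4, 4, 6, 7, 8, 10, 11, 11, 13, 14,
                18, 23, 29, 51, 104, 222, 452, 490, 895, 898, 963,
                1903, 2561, 6816, 7991]

def pct_reverted_within(t, times=revert_times):
    """Percentage of vandalism edits reverted within t minutes, i.e. p(t)."""
    return 100.0 * sum(1 for x in times if x <= t) / len(times)

# Modified log scale, as suggested: horizontal offset proportional to log(t + 1)
for t in [0, 1, 5, 15, 60, 720, 1440, 7991]:
    offset = math.log(t + 1)
    print(f"t={t:>5}m  log-offset={offset:5.2f}  reverted={pct_reverted_within(t):5.1f}%")
```

Consistent with the median of 14 minutes, just over half of the vandalism (16 of 31 edits) is reverted within 15 minutes, while the curve does not reach 100% until the 7991-minute outlier.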

comments from pdbailey[edit]


I made up the probability function you should probably include in a much prettier form (with nice labels like the above). I'll add more later, but this is what I have now:

  • you weren't calculating the same thing as the IBM paper, so your numbers aren't comparable.
  • The idea of a random page is interesting (and a great first start) but probably not exactly the most interesting question to ask. It would be much more interesting to look at
Pr(vandalism active | page viewed).
That is the probability of vandalism conditional on the page being viewed. You might just want to shoot for this using a weighting function (and, yes others could help you with this) that weights based on the number of edits, or length of the article in question.
  • there is something driving the length of time to reversion, I would guess it is coincident with one of the above (length, views, number of edits in the month). You should look for that. I'll see if I can piece it together with the data provided later.
I think there are a few different behaviors that lead to reversion time being what it is. The first is articles that no one watches, where it takes forever for someone to revert, no matter how many vandalisms. The second is articles that everyone watches, or that a few people watch all day, where it takes very little time for someone to revert. The third is a page that no one watches that is then hit by a persistent vandal - suddenly someone picks up on this and then a few do (maybe via WP:ANI, recent changes patrol, whatever), and ironically the amount of time it takes to revert goes from long to short. It'd be interesting to study this phenomenon specifically - (borrowed metaphor) a sort of body's immune system reaction to a virus, from nothing to a strong response in a short amount of time that defeats the virus's attempts. JoeSmack Talk 00:30, 27 March 2007 (UTC)
  • You should probably present confidence intervals. The idea is that you estimated some values based on the population of all wikipedia pages, and your results can be generalized to all pages on the site.

Again, more later.Pdbailey 22:59, 26 March 2007 (UTC)

Wow, this is over my head, but thanks a ton for the help. Please keep it up so that we can present some legitimate results to the whole group. Remember 23:13, 26 March 2007 (UTC)
This is above and beyond what I thought we'd be able to get out of this data! Thanks so much for graphing this stuff; i don't have access to statistical software anymore because i finished my degree, so this is great. There's other stuff i might suggest we do with the data now that we've got a couple of red-blooded statisticians on board, like: is there a significant difference between the years' vandalism rates? what's the margin of error? all the stuff from advanced data analysis that i need to brush up on again anyways. :)
What i'd propose to do is present the surface conclusions now and leave another spot for 'further discussion' on the project which can be incubated here. I suspect once this study gets more eyes lots of people will want to play around with the data and see what it has to say, but the significant stuff we should have laid out in 'Official Conclusions' by now, right? Does this sound ok? JoeSmack Talk 00:41, 27 March 2007 (UTC)
Sounds good to me. Feel free to publish the conclusions. — Preceding unsigned comment added by Remember (talkcontribs)

Okay, so the first thing to do is a sanity check. Can you estimate the number of edits that occurred in November of these years? The estimator is simple. These are taken from Cochran (isbn:0-471-16240-x), pp 21-23. I'm assuming an SRS (simple random sample); you should be sure "random page" does this (I don't know).

[Number of edits in November] = [Number of articles] * [Observed number of edits in sampled articles] / [number of articles sampled]

This is an unbiased estimator. Its variance is given by

Var = [Number of articles]^2 / [number of articles sampled] * S^2

Where S^2 is the traditional estimate of variance, sum((y - mean(y))^2)/(n-1), and I'm assuming that the sampled articles do not represent a large fraction of the total number of wikipedia articles. You can then compare this to [4]. If someone makes me a list similar to the time-to-revert list that's on the page now (just comma separated) I will calculate this. The point is that if you get this right, it's a good reason to believe that you were doing something right... sorry, I just don't have time to compile this myself.
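The expansion estimator and its variance quoted above can be sketched in a few lines of Python. The numbers in the example are hypothetical placeholders, not the study's data:

```python
import statistics

def estimate_total_edits(total_articles, sampled_edit_counts):
    """Expansion estimator for the total number of edits (per Cochran).

    total_articles: N, the number of articles on all of wikipedia
    sampled_edit_counts: observed November edit counts y_i for each sampled article
    Returns (estimated total, variance of the estimate), assuming an SRS
    and a small sampling fraction.
    """
    n = len(sampled_edit_counts)
    # [Number of edits] = N * [observed edits in sample] / [articles sampled]
    estimate = total_articles * sum(sampled_edit_counts) / n
    # Var = N^2 / n * S^2, with S^2 = sum((y - mean(y))^2) / (n - 1)
    s_squared = statistics.variance(sampled_edit_counts)
    variance = total_articles ** 2 / n * s_squared
    return estimate, variance

# Hypothetical example: 1,000,000 articles, 5 sampled November edit counts
est, var = estimate_total_edits(1_000_000, [0, 2, 1, 0, 4])
print(est)         # 1,400,000 estimated edits
print(var ** 0.5)  # standard error of the estimate
```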

What would be best would be to have a spreadsheet that had columns (article (name or just number), number of vandal edits, total edits, year) and then one row for each year/article (total of 174*3 rows). Then, another sheet with (article (name or just number so long as it's the same number as before), revert time) and an entry for each reversion. Short of this... well, we can negotiate. This would be great. Pdbailey 01:21, 28 March 2007 (UTC)

It occurs to me Perl is the tool for doing this, but I'm not quite up on it; could anybody write a parser? Pdbailey 01:25, 28 March 2007 (UTC)

Second, I intend to help you (that's you as a group) calculate confidence intervals and significance levels. I highly recommend that you wait to publicize your results widely until you have something that resembles an analysis (which you aren't far from now). Pdbailey 01:21, 28 March 2007 (UTC)

sanity check[edit]

Okay, here are the estimates. In 2004, the estimated number of edits is 371 thousand with a 95% confidence interval of [153, 590] thousand. The true value is 762 thousand. The associated t-statistic is 3.51 with an associated two-tailed p-value of 0.0004. This means that the sample is not random (more on this later). In 2005, the estimated number of edits is 2.4 million with a 95% confidence interval of [320 thousand, 4.4 million]. The true value is 1.9 million. In 2006, the estimate is 4.2 million with a CI of [2.8, 5.6] million and the true value is 3.8 million. Here I used the latest numbers (October 2006) as a stand-in for November 2006.
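The 2004 t-statistic above can be reconstructed from the quoted estimate, true value, and confidence interval. This sketch assumes the interval was formed as a normal interval (estimate ± 1.96 standard errors); that assumption is mine, not stated in the thread:

```python
# Sanity-check arithmetic for the 2004 figures above (values in thousands of edits)
estimate = 371.0                # estimated November 2004 edits
true_value = 762.0              # actual November 2004 edits
ci_low, ci_high = 153.0, 590.0  # quoted 95% confidence interval

# Assuming a symmetric normal interval: half-width = 1.96 * stderr
stderr = (ci_high - ci_low) / 2 / 1.96

t_stat = (true_value - estimate) / stderr
print(round(t_stat, 2))  # 3.51, matching the value quoted above
```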

The latter two are well within the confidence intervals. The first one (2004) is not. I think this is because you did not take a random sample of 2004 articles; you took a random sample of 2006 articles. This implies that the 2004 numbers should be viewed with a jaundiced eye. Perhaps the structure of the wikipedia changed during this time? i.e. articles went from those multi-link things to real articles?

Hmm, I also just realized that I'm assuming that your sample of 100 is the right sample to be looking at... can someone explain the 174 vs 100 issue? Why are there only 100 entries in the spreadsheet? If only 14 had no edits in any November, shouldn't the other 174-14 be in the calculation?Pdbailey 03:36, 28 March 2007 (UTC)

at this point the cat is sort of out of the bag; a lot of new faces around here are due to a community bulletin post about some of the initial findings. there's nothing wrong with driving in deeper though, and you definitely know how to make the numbers talk.
speaking of which (pun, ha), i'm trying to get a clear picture of some of what you are saying: when we selected 'randomly', were the Nov. 2004 articles selected from a 2004/2005/2006 bucket and not just a 2004 bucket, 2005 from a 2005/2006 bucket, and 2006 didn't have to worry about it at all? JoeSmack Talk 04:51, 28 March 2007 (UTC)
To answer your question, you selected randomly from a bucket with just 2006 articles in it. The best case would have been to select randomly in November of each of these years, but it's fine that this isn't what happened. Pdbailey 02:01, 29 March 2007 (UTC)

second crack analysis[edit]

year samp. art. samp. edits Est. edits (M) std. err (M) Act. edits (M) t-statistic
2004 163 86 0.214 0.07 0.762 7.55
2005 168 278 1.38 0.62 1.9 0.81
2006 174 304 2.4 0.43 3.8 1.15

notes: year is the year. samp. art. is the number of articles sampled; this is the number of the 100 that existed at the time, plus 74*[fraction that existed in 2004] (a guess). samp. edits is the number of edits observed. Est. edits (M) is the estimated number of edits in November of that year, in millions. std. err (M) is the standard error of the estimate. Act. edits (M) is the actual number of edits in millions. t-statistic is the T statistic.

Okay, so now I think that the 2004 value just suffers from a poorly estimated variance (and possibly is adversely affected by a poorly estimated sampled articles number). The point is that the results are within the realm of believability and you didn't waste your time (yay!). Now, I can move on to calculating confidence intervals. Pdbailey 01:58, 29 March 2007 (UTC)

Estimates and CIs[edit]

Now, you didn't take a simple random sample (SRS) of the edits, so estimating the fraction of edits that are vandalism is a lot more complicated (this means you can't do the total vandals / total edits technique and get an unbiased estimate). There are a couple of ways of dealing with this. The primary issue that you have to realize is that your values are clustered (into articles) so the variance will be higher than you might otherwise expect and the average will be a little more complicated to calculate. Had you taken the FIRST edit after November 1st for each year, you would have an SRS of edits (not that this is a better idea, just answering a reasonable question and explaining). So you can estimate the number of vandals per edit on an article by article basis and take the mean. Problem: some estimates will be incalculable (0/0). There isn't a great way of dealing with this in the world of sampling that I know of, so I'm just pretending they were not part of the sample (I think I could argue this point too). Then you take the average of these vandals/edit on a per article basis. This yields:

year samp. art. mean(vand./edit) Std. Err Confidence Interval
all 100 0.030 0.017 [0.0,0.084]
2004 25 0.034 0.023 [0.0,0.090]
2005 42 0.050 0.027 [0.014,0.14]
2006 78 0.046 0.022 [0.0,0.065]

Okay, I found eqn. 2.46 regarding ratio estimates and found a good answer. Now I have the numbers. Pdbailey 02:44, 29 March 2007 (UTC)
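The per-article approach described above (average the vandalism-per-edit ratio over articles, dropping the incalculable 0/0 cases) can be sketched as follows. The sample data here are hypothetical placeholders, not the study's:

```python
import statistics

def mean_vandal_ratio(articles):
    """Mean vandalism-per-edit ratio across articles, with its standard error.

    articles: list of (vandal_edits, total_edits) pairs, one per sampled article.
    Articles with zero edits (0/0 ratios) are dropped, as described above.
    """
    ratios = [v / e for v, e in articles if e > 0]
    n = len(ratios)
    mean = statistics.mean(ratios)
    stderr = statistics.stdev(ratios) / n ** 0.5  # standard error of the mean
    return mean, stderr, n

# Hypothetical sample: four articles, one with no edits in the window
sample = [(1, 10), (0, 5), (0, 0), (2, 20)]
mean, stderr, n = mean_vandal_ratio(sample)
print(n)     # 3 articles enter the calculation; the (0, 0) article is dropped
print(mean)  # (0.1 + 0.0 + 0.1) / 3
```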

Could someone with more intelligence than myself incorporate these calculations into our official conclusion? I think they would be very valuable, but I'm afraid I would screw it up. Remember 15:14, 29 March 2007 (UTC)

I added confidence intervals based on bootstrapping, but I'm not sure that this is a valid technique. Any of the statisticians have an opinion? Pdbailey 20:05, 31 March 2007 (UTC)
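For readers unfamiliar with the technique, a bootstrap confidence interval of the sort mentioned above can be sketched as follows. This is a generic percentile bootstrap; I am not asserting it is exactly what was done for the page, and the data are hypothetical:

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic.

    Resamples the data with replacement n_boot times and takes the
    alpha/2 and 1-alpha/2 quantiles of the resampled statistics.
    """
    rng = random.Random(seed)
    boots = sorted(stat(rng.choices(data, k=len(data))) for _ in range(n_boot))
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Example: per-article vandalism/edit ratios (hypothetical values)
ratios = [0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 0.05, 0.0, 0.0, 0.1]
low, high = bootstrap_ci(ratios)
print(low, high)  # an interval bracketing the sample mean of 0.045
```

One known caveat of the percentile bootstrap is that it can be too narrow for small, highly skewed samples, which is exactly the situation here, so the concern raised above is reasonable.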

proposed conclusions section[edit]

The current study analyzed a sample pool of 174 random articles for edits during November of 2004, 2005, and 2006. Of these articles, 100 contained an edit. A total of 668 edits were observed, of which 31 (or 4.6%) were vandalism of some type (defined below). Because articles, not edits, were randomly sampled, a ratio estimate must be used to calculate the fraction of edits that are vandalism. The fraction of edits that were vandalism is 3.0% with a standard error of 1.7%. There is no discernible time trend in the data, as can be seen in the table below.

year sampled articles mean(vandals/edit) Std. Err.
All 100 0.030 0.017
2004 25 0.034 0.023
2005 42 0.050 0.027
2006 78 0.046 0.022

In addition, 97% of the vandalism observed was done by anonymous editors. Obvious vandalism was by far the most common type. Roughly 25% of vandalism reverting was done by anonymous editors and roughly 75% by wikipedians with user accounts. The mean time before vandalism is reverted is 758.35 minutes (12.64 hours), a figure that is skewed by outliers. The median time before vandalism is reverted is 14 minutes.

Pdbailey 17:08, 29 March 2007 (UTC)

Can we add in a touch of what exactly standard error means in this context? JoeSmack Talk 19:57, 30 March 2007 (UTC)
I've been thinking about that. I think confidence intervals are easier to understand and nobody mentions developing them. I think it's impossible to do exactly. But either a bootstrap or overestimation is possible. Pdbailey 13:27, 31 March 2007 (UTC)


Does anyone know how good our random page function is? I haven't seen any discussion/proof regarding its "randomness". Does anyone know how it is generated? --Spangineerws (háblame) 21:19, 27 March 2007 (UTC)

AFAIK, it uses a field in the page table, but I'm not sure about the details. Tim Starling is the one who designed the function as it stands now, and he will have the best knowledge about it. Titoxd(?!? - cool stuff) 22:08, 27 March 2007 (UTC)
We talked to Tim the other day in ##mediawiki-tech when we were figuring out data point 68 with a fluent Arabic speaker and having Unicode problems (see Data_point_68_mystery_solved above). You'd prolly catch his ear there the best. JoeSmack Talk 04:54, 28 March 2007 (UTC)
Tim says the generator is NOT truly random - see here Ttguy 12:18, 24 May 2007 (UTC)


I have uploaded the spreadsheet file I made to calculate the results of our study. I don't really know how to upload these types of files to wikipedia so I just did it as a photo upload. You can find the file here File:Vandalism study.xls. Remember 02:33, 28 March 2007 (UTC)

Great! I can't make sense of the discontinuity in columns A-K from 80-100 or the lack of discontinuity in columns I and on. I think I can make sense of the first K columns though. Pdbailey 02:55, 28 March 2007 (UTC)
It should be noted that the spreadsheet was just put together to calculate the specific data that we had set out to calculate. Thus the vandalism part of the data tables (the part that categorizes what type of vandalism, who did the vandalism, who did the reverting, and the reverting time) does not necessarily correlate to the specific article whose row it is in, but it is all in the correct groups of ten. That is why there is discontinuity; I had to add more space so I could add more details about the vandalism in the edits from 70-80. Remember 12:27, 28 March 2007 (UTC)
Okay, as I noted above, my bigger concern is why, if you sampled 174 articles, there are only 100 entries in the spreadsheet. If you want to present a random sample, you will need all 174. Pdbailey 22:21, 28 March 2007 (UTC)
The study only analyzed edits that were done in the months of November 2004, November 2005 or November 2006. This month was chosen at random and we limited the time period to one month so that we would not have to analyze all of the edits in an article's history. We also hoped to detect any trends that might occur by analyzing the data from different years. We looked at 174 articles total, but 74 of them did not have any edits in November 2004, November 2005, or November 2006 so there was no data to record. That is why there are not 174 data entries. Remember 22:37, 28 March 2007 (UTC)
So all 74 had zero edits in 2004, zero edits in 2005, zero edits in 2006? If so, this appears to contradict, "Articles with no edits for November 2004, 2005, 2006: Marcos Riccardi, Just a Poet with a Soul, Kodály körönd, FWG, ISO 3166-2:DZ, Wainuiomata River, Dyuni, List of people associated with San Francisco, Alexis Caswell, Communication disorder, Clerk of the Ordnance, Australian Professional Footballers' Association, Booroola Merino / Total: 14 articles with no edits for November 2004, 2005, 2006." Help me understand. Pdbailey 23:40, 28 March 2007 (UTC)
The data was broken up into sets of ten to help coordinate our investigations. The articles you cited were the articles that had no edits when I was collecting the first 10 data points (i.e., the first 10 articles that did have edits in either Nov. 2004, 2005 or 2006). Remember 00:42, 29 March 2007 (UTC)
Got it, thanks. Pdbailey 01:25, 29 March 2007 (UTC)

Based on your great spreadsheet, I've given you all I can above. Thanks for posting it. I'd suggest updating the results to reflect the more appropriate estimates and their standard errors (see Estimates and CIs above). There is more that can be done, but I think it requires another drag through the data: I would need to be able to link vandalism to articles. If you are interested, we can talk about it, but I think it might be too much work for now to make up a sheet for this. I think I'll just make a nice version of the CDF (plot above) and wish you well. Pdbailey 02:53, 29 March 2007 (UTC)

Thank you so much for all of your hard work. Could you please suggest some language to add to our official conclusions because I think I would screw it up if I tried. Remember 15:16, 29 March 2007 (UTC)

Missing figures[edit]

Maybe I missed seeing where you stated these numbers, but:

1. What is the percentage of edits by anonymous editors that is considered by your criteria as vandalism? Is it greater or less than the 5% average?
2. What are the total numbers of anonymous edits & edits by logged-in users -- or the ratio between the two?

Knowing these numbers would help evaluate the impact of your findings. -- llywrch 17:21, 28 March 2007 (UTC)

We didn't collect that information, but it would be useful. It is also possible to go through our data and add that information to our study if you wanted to. Remember 17:30, 28 March 2007 (UTC)

Issues with conclusions[edit]

Please feel free to reply under each problem separately.

  • Problem 1) Lack of use of standard deviation to eliminate outliers in mean calculations
    • Outliers could be eliminated via standard deviation tests and a meaningful average could be calculated. Such calculations are part of any introductory statistics course, and that they were not done here is problematic. The mean-time-to-revert should be recalculated after outliers outside 2 standard deviations (95% confidence interval) are eliminated. --Jayron32|talk|contribs 17:54, 28 March 2007 (UTC)
Feel free to do this calculation and add it to our analysis. I don't know how to do it, otherwise I would. Remember 18:47, 28 March 2007 (UTC)
    • For a statistic as skewed as this one, the mean is essentially meaningless. Outliers SHOULD NOT be removed; they are significant. If you want to use common statistical methods, which assume that data are more or less normally distributed, apply them on the logarithms of the vandalism durations. _R_ 23:51, 28 March 2007 (UTC)
I think we resorted to a plain old mean because the three of us who worked most of the math out for Study 1 didn't have a huge amount of statistics background. If this way is better it should be done, but I'm afraid I don't know how to do it myself. JoeSmack Talk 01:54, 29 March 2007 (UTC)
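The two suggestions above can be sketched side by side: trimming values beyond 2 standard deviations (the first suggestion) versus working on logarithms (the second). Since log of zero is undefined and the data contain zeros, the sketch uses log(t + 1); that offset is my choice, not something proposed in the thread:

```python
import math
import statistics

# Revert times in minutes, from the study data
times = [0, 0, 0, 1, 1, 3, 4, 4, 6, 7, 8, 10, 11, 11, 13, 14,
         18, 23, 29, 51, 104, 222, 452, 490, 895, 898, 963,
         1903, 2561, 6816, 7991]

raw_mean = statistics.mean(times)

# Suggestion 1: drop values more than 2 sample standard deviations from the mean
mu, sd = raw_mean, statistics.stdev(times)
trimmed = [t for t in times if abs(t - mu) <= 2 * sd]
trimmed_mean = statistics.mean(trimmed)

# Suggestion 2: average the logarithms (log(t + 1) to handle the zeros),
# then transform back - a geometric-style mean
log_mean = statistics.mean(math.log(t + 1) for t in times)
geo_mean = math.exp(log_mean) - 1

print(round(raw_mean, 1), round(trimmed_mean, 1), round(geo_mean, 1))
```

Both alternatives come out far below the raw mean of 758 minutes, which illustrates why the distribution's skew matters so much to how the "average" revert time is reported.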
  • Problem 2) Conclusion, especially the bolded part, seems to be an inadequate conclusion without further information.
    • The conclusion that seems to be begged from this study is NOT how much vandalism is caused by anons, it's how many anon edits are vandalism. The distinction is not moot. For example, we have no data on the TOTAL edits by anons in the conclusion. If anons account for 95% of vandalism, but also account for 95% of total edits, then there is no anon vandalism problem. Without that data to compare to, the conclusion could lead to an unnecessary alarmist attitude against anon editing. The data we REALLY need to see is what % of anon edits are: Good edits, Test edits, and Vandalism edits, and then to compare said results to the registered users. The fact that X% of vandalism edits are done by anons is a statistic of dubious importance, and may reasonably lead people to incorrect further conclusions. The statistic implies that anon editing is, on the balance, bad; however, though it leads people to make that conclusion, it doesn't actually say that. If a statistic leads people to a wrong or incomplete conclusion, it isn't a great statistic. --Jayron32|talk|contribs 17:54, 28 March 2007 (UTC)
That would be some useful information to have. Perhaps we can get some editors to go back through our data to gather that information. Remember 18:47, 28 March 2007 (UTC)
I wholeheartedly agree - the lack of this missing information/conclusion immediately struck me as I read the study. To me, it is the most important potential finding. Adding to what Jayron32 pointed out, imagine if it turned out that not only was 97% of vandalism done by anon editing, but that this vandalism represented 50% of all anon editing. Such a finding could have a big impact on policy debates about anon editing. Does anyone know if followup analysis has been done? Wormcast (talk) 18:24, 13 February 2009 (UTC)
No, I don't think there has been any followup, it's been a couple of years. Keep in mind this info was gathered that long ago too, but given that flagged revisions is now a serious discussion, this kind of work becomes very topical... JoeSmack Talk 16:59, 14 February 2009 (UTC)
    • How many anon edits are vandalism isn't the only quantity of interest, knowing how much content and how many good edits come from anons would be equally interesting. _R_ 23:51, 28 March 2007 (UTC)
I think that is the biggest next question yet to come from this study: how many unregistered users' edits are vandalism compared to how many aren't vandalism. Study 2 looks like it's heading towards a semi-protection efficacy bent which is also interesting (or perhaps a 'recent changes' approach to counting vandalism), but I don't see anyone's name on Study 3! I'd LOVE to see a study that produces some potential answers to this question. JoeSmack Talk 01:57, 29 March 2007 (UTC)
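To make Jayron32's point concrete, here is a small sketch with entirely hypothetical tallies (none of these numbers come from the study): a group can commit the overwhelming share of vandalism while only a modest fraction of its own edits are vandalism, which is why the two statistics must be reported together.

```python
# All tallies below are hypothetical, for illustration only.

def vandalism_rate(vandal_edits, total_edits):
    """Fraction of a group's edits that are vandalism."""
    return vandal_edits / total_edits

anon_total, anon_vandal = 1000, 160   # hypothetical: 16% of anon edits are vandalism
reg_total, reg_vandal = 4000, 20      # hypothetical: 0.5% of registered edits

share_by_anons = anon_vandal / (anon_vandal + reg_vandal)
print(f"anons commit {share_by_anons:.0%} of all vandalism,")          # prints 89%
print(f"but only {vandalism_rate(anon_vandal, anon_total):.0%} "
      f"of anon edits are vandalism,")                                 # prints 16%
print(f"vs {vandalism_rate(reg_vandal, reg_total):.1%} "
      f"of registered-user edits")                                     # prints 0.5%
```

So "anons commit 89% of vandalism" and "84% of anon edits are fine" can both be true at once; quoting only the first invites the alarmist reading Jayron32 warns about.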


How are you defining "vandalism"? Is it anything that gets reverted? What about accidental inaccuracies, test edits, or well meaning but poor quality content? What about edits that degraded the quality of an article that were never reverted? How are you determining whether the addition of a link was spam or not? This needs to be explained as part of the study write up.

Please see Wikipedia:WikiProject Vandalism studies/Study1#Description for a description of the various types of vandalism. In general we judged vandalism on a case by case basis. It was not based on whether the vandalism got reverted or not, but all the vandalism we found did get reverted. We also viewed the edits, in general, in the best light so that any good faith editing was assumed not to be vandalism. You can actually see what the vandal did if you scroll down to look at each incident of vandalism that we recorded (for example see Wikipedia:WikiProject Vandalism studies/Study1#Data Point 31). As far as linkspam, again we pretty much just categorized the ones that were obviously non-relevant (or so tangentially related as to obviously not be needed) links that were added to the article. Again, our data are all on our study page. Feel free to review the data. Remember 20:52, 28 March 2007 (UTC)
There's a serious problem here: for Data point 99, I fail to understand why [5] wasn't classified as spam, and I wouldn't, at first sight, have considered [6] vandalism (indeed, it definitely isn't, since The Legend of Zelda: Four Swords subseries characters was at the time a redir to Vaati - IMHO, it's even the revert that ought to be considered vandalism). That it was reverted then shows that users have a bias against IPs, and that it's considered vandalism now shows that this study is biased as well. Further studies would really benefit from a "double-blind" methodology that labels vandalism without referring at all to the status of the editor or to the reactions of other editors. _R_ 00:58, 29 March 2007 (UTC)
I believe that datapoint was done by User:JackSparrow Ninja, so I don't know for sure until he comments, but the two sides to this are that it didn't qualify as 'linkspam vandalism' via our description (which you could think of as 'bad faith linkspamming', i.e. putting dick pill links on the Man article) or that it was a mistake and it was missed. We're certainly not beyond mistakes, and this was a first study so it doesn't totally surprise me that one could have been made. I also don't think it shows that users have a bias against editors who have or don't have an account, but it might suggest it. This is also only one instance mind you.
I actually double-checked this edit. I didn't classify it as linkspam vandalism even though the reverting editor said it was linkspam. To me, it appeared to be what it said it was, which was a link to pictures of the game described. I did not understand why this would be vandalism and not a good faith effort to add information to the article. I know that wikipedia doesn't usually like to add external links to certain sites. But I was wary of classifying this as linkspam vandalism, when I thought it was a good faith effort to add information to the article. This is one of the inherent problems in trying to study vandalism: one man's vandalism is another man's good faith effort. But noting that inherent problem, we tried to do the best we could. Remember 15:21, 29 March 2007 (UTC)
A double blind method would rock. I am really interested in getting someone who does bot work onboard so that we might make a little edit aggregator that filters out the username/IP for each edit, so we don't have to dig through tabs and tabs of histories, which is more prone to slips somewhere (triple checking is one heck of a time taker). JoeSmack Talk 02:08, 29 March 2007 (UTC)

Also, I agree strongly with Jayron32 that the conclusion is meaningless without knowing what percent of edits by unregistered users were not vandalism.

Well, the purpose of the article was not to lay blame on non-registered users, but was generally a fact-finding mission (and it was our first so please forgive us for any oversights). We hope that our results will not be misused by others. We also intend to conduct further studies to flesh out any other facts that people may be interested in. If you are interested in joining our efforts please go to Wikipedia talk:WikiProject Vandalism studies/Study2. Remember 20:52, 28 March 2007 (UTC)
I agree with you and Jayron32 also, this is info we need to know. I might note though that Study 2 looks to be going towards a semi-protection vandalism before/after study or a recent changes vandal counting approach. However, Study 3 would be just as good as a place to start and you'd have my support and by the looks of it plenty of others interested in helping. What'dya think? JoeSmack Talk 02:11, 29 March 2007 (UTC)

Finally, please correct or explain your use of the term "anonymous". The majority of logged in users are anonymous in that you don't know who they are in real life. I know that on Wikipedia the word "anonymous" is used to mean a user who is not logged in, but this study may reach those outside of Wikipedia who will be misled by this misnomer. Logged in users may be more anonymous because you don't know their IP address which can tell you their location or even the name of their school or workplace. Angela. 20:32, 28 March 2007 (UTC)

Anonymous was meant to refer to non-registered users, and not to people who have registered accounts. Remember 20:52, 28 March 2007 (UTC)
I know it was, but that will be extremely confusing for the outside world. I suggest it be renamed "unregistered users". Angela. 00:17, 29 March 2007 (UTC)
Sounds good. Let's rename it unregistered users. But I don't know which "it" you are referring to. Just the reference to anonymous users in the conclusions? If so, feel free to change. Remember 00:45, 29 March 2007 (UTC)
I think the term is wrong in general, so changing it everywhere would be good. Something else to check - are you sure [7] was vandalism? The double name is used like that on IMDb. I'm not saying it's correct, but the edit may have been made in good faith if they used that as a source. Angela. 01:34, 29 March 2007 (UTC)
I think you're right about using the term 'anonymous editors' - I can see this taking thinking in the wrong direction. I myself will start using 'registered users' and 'non-registered users' now. Good catch on Data point 27. That very well could have been a non-vandal edit; one limitation of the study is we didn't (and probably couldn't, time wise) investigate every edit deep down to make ultra-mega-sure it is a vandal edit or not in these types of cases. It was reverted, and I think that was a flag for the person who did this data point. JoeSmack Talk 02:18, 29 March 2007 (UTC)

Press coverage of studies

FYI- someone Dugg this study. [8]. Remember 15:37, 29 March 2007 (UTC)

Also blogs about the study WikiAngela, Original Research blog, Valuewiki Blog. Remember 03:24, 8 April 2007 (UTC)
Also this made it into Wikizine, Year: 2007 Week: 15 Number: 67 [9]
Also made the Wikipedia signpost: [10]
Our project was discussed on the most recent Wikipedia Weekly podcast Wikipedia:WikiProject WikipediaWeekly/Episode19. They start talking about it at minute 17.

IP users

A few notes of my own:

Looking at the same pages and months as the study, I found that IP users made 187 edits.
If 30 of these edits are vandalism, then 16% of IP user edits are vandalism. (On the other hand, 84% are not vandalism.)

See User:Kevinkor2/Vandalism Study 1 where I recorded IP edits for article×month combinations. (You can click on a cell to go directly to the history for a given month.) --Kevinkor2 06:03, 6 April 2007 (UTC)
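Kevinkor2's percentages above check out; as a quick sanity check of the arithmetic (using only the two figures already stated, 30 vandalism edits out of 187 IP edits):

```python
# Sanity check of the figures above: 30 vandalism edits out of 187 IP edits.
ip_edits = 187
vandal_edits = 30

vandal_pct = vandal_edits / ip_edits * 100
print(f"vandalism:     {vandal_pct:.0f}%")        # prints 16%
print(f"non-vandalism: {100 - vandal_pct:.0f}%")  # prints 84%
```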

I think this information would be worth adding to our study. Thanks a lot! Remember 11:58, 6 April 2007 (UTC)
I think so too, anyone object? This information was requested by a few people, and I think it is relevant. JoeSmack Talk 02:07, 9 April 2007 (UTC)
Unless anyone objects, someone please add this information to our data and to our conclusions. Thanks Remember 12:57, 9 April 2007 (UTC)