User:Manning Bartlett/Moni3 ANI analysis
OK, this is a page for discussion of Moni3's proposed AN/I analysis.
- 1 Background discussions (copied & edited for relevance)
- 2 Prototype development
- 3 Experimenting with a totally different approach
- 4 Coding the first few items from ANI729
Background discussions (copied & edited for relevance)
Note: The following collapsed discussions provide all the background to this project. They have been edited (by me) for relevance - please see the linked version for the complete discussions.
|please open to read the various discussions that led to starting this project|
Excerpted from AN_talk discussion
Data collection to assess how effective ANI is at responding to complaints
I apologize right up front here. I will be unable to do this because I simply don't have the time, but I'm hoping this idea will spur on someone else who does.
To be better informed about how effective ANI is at responding to editor complaints, those who are participating in this discussion should be aware of how ANI operates on a daily basis. If I had the time, I would chart the success of each thread at ANI going back at least three months. I would rate each thread on a scale of 1 to 5, much like the DEFCON ratings:
A: "Sidetrack" means any instance in which someone inserts a comment irrelevant to the original complaint, including attempts at humor, comments about responding admin(s), accusations that one or more responding admins should not be participating because they are involved, or anything else that does not address a solution to the stated problem.
For whoever may take this on, and I really hope someone will, my DEFCON rating system here is based on my experiences at ANI. However, if you see fit to tweak or change the rating method, any changes should still rate each thread's success on:
You may have to ping my talk page to get me to respond here if you have questions. Please consider taking this on, and again, I apologize for not being able to do this myself. --Moni3 (talk) 22:43, 10 February 2012 (UTC)
Excerpted from Moni3's talk page
I finished the basic framework yesterday. The schema allows for diffs to be mapped to a user, to a specific thread and to an archive. These individual diffs can then be scored, and these scores can be rolled up to the thread level. I've also made room to expand the system to consider other noticeboards apart from AN/I, but I'm not planning to use that functionality in the short term.
Thread data includes who started the thread, how many comments it attracted, the open and close date, the general category (user complaint, sock, vandalism, page prot, etc) and whether it was resolved. User data includes whether they are an admin or an editor. (There are numerous other editor attributes we could include, but this was enough to get started). Diff data includes which user started it, the full text of the diff, the edit summary, and some analytic parameters.
The diff analytics consists of a set of 1-10 scores on various parameters. To get started I chose the following parameters: "Tone", "Relevance", "Sarcasm", "Hostility", "Constructiveness". I also have a flag for "Side Topic", and a measure for "Side topic relevance" - part of an attempt to analyze thread drift.
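As a sketch only, the five 1-10 parameters plus the side-topic fields described above could be modelled as a small record type with range validation. The field names here are my guesses at what the Access columns might be called, not the prototype's actual names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiffScores:
    """One diff's analytic scores: five 1-10 parameters plus side-topic fields."""
    tone: int
    relevance: int
    sarcasm: int
    hostility: int
    constructiveness: int
    side_topic: bool = False                      # flag for thread-drift analysis
    side_topic_relevance: Optional[int] = None    # only meaningful when side_topic is set

    def __post_init__(self):
        # Enforce the 1-10 range on every scored parameter.
        for name in ("tone", "relevance", "sarcasm", "hostility", "constructiveness"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be 1-10, got {value}")
```

A record type like this makes the "these parameters need refinement" step cheap: adding or dropping a parameter is one field, with the range check applied automatically.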
Note - It is already apparent to me that these parameters need refinement, but whatever, that's why we have prototypes.
I have loaded about 1000 users into the system (based on a scraping of who edited AN/I) and the thread headers from AN/I Archives 729 and 730. I've loaded about 50 diffs as well, but I want to get to 1000 before sending it over to you for review (1000 diffs is only about 3 days' worth). It takes about a minute to enter a single diff (this will get faster), so there are quite a few hours of work ahead. I hope to have something to you by the end of the week.
Do you have MS Access available to use? I chose Access because it is great for "quick and dirty" development, as there are (no doubt) numerous fundamental changes yet to be made before we have a system we're all happy with. If you don't have Access I can dump the results out to a Google spreadsheet or similar resource. If the system seems useful we can look into making it more permanent in nature, but that's a long way off. Manning (talk) 03:47, 13 February 2012 (UTC)
Kim - I'm just using Access for prototyping of a data entry system - once I load all the diffs, we'll need to go through and score them and I can't use SPSS for that. Ultimately the best system would be a web front end over a MySql database, but I'm using Access for now because it is ideal for "quick and dirty". If we started with a web-based system then every time we change our mind about the params we'd have to recode the user interface (which gets really tedious). Once we have the data loaded and scored then SPSS would be ideal for analysing it (or Cognos which I work with).
I'm not real stressed about the params just yet, I think when you actually have diff data to work with your thinking might change anyway (it did for me). Once I get something out to you with diff data loaded you'll be in a much better position to make design choices. Cheers Manning (talk) 11:25, 13 February 2012 (UTC)
Manning, et al, when I said I was short on time...that wasn't an exaggeration. This is the earliest I've been able to reply to Manning's original post on my talk page, and the rest of my week(s) are going to go the same. I feel like a dingus suggesting stuff and then having others do it, but it hadn't been suggested and it seems Manning and company know what they're doing--far more than I would.
See Also - this discussion with CBM (a tool server developer).
OK, the first phase of the Access DB prototype is complete. It's got about 500 diffs in it (chiefly related to AN/I Archive 729). Each diff is linked to a parent thread (stored in the thread table) and parent threads are linked to an archive (stored in the archive table). Anyone who wants a copy of this database in Access 97 format please send me an email and I'll return the DB to you in a zip format.
The next step is to decide on the methodology to assess these. I've actually removed all the assessment parameters from the prototype I will send you, as I'd like you guys to look at it without my preconceptions attached.
Things I have learned so far:
- My original idea of "copying the diff into a text field for later evaluation" is not workable. For example there is no way to copy a deletion or rollback. Hence it is not possible to create an extract file which "contains all the diffs".
- The process is not quite as simple as "evaluating a diff". Some diffs are mere copyedits (or link/indent/formatting fixes) and don't need to be assessed if the result has no influence on the final version. Some are redactions (toning down) which alter how the final version appears.
- A great deal of "tweaking" goes on. Some editors (not naming any names) revisit their posts a large number of times.
- Thanks for keeping everyone up to date. I mailed you for a copy. I'll look at it as soon as I can, and see if I have anything sensible to say about it. Cheers. I'm probably a "tweaker", by the way :-) Begoon talk 02:45, 15 February 2012 (UTC)
CBM's Extract and first database
CBM just came through with a kick-ass extract. I've got 15000 URLs for diffs, plus the edit summary and user details. From here we'd need to build an interface which sucks in the diff and allows us to score them. Anyway, while I've brought the extracted diffs into the prototype, I haven't done anything else with them (other than format the revision number into a URL).
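For anyone curious how a bare revision number becomes a clickable diff: MediaWiki's index.php accepts diff/oldid parameters, so the formatting step is a one-liner. The base URL below assumes English Wikipedia.

```python
BASE = "https://en.wikipedia.org/w/index.php"


def diff_url(rev_id):
    """Build a URL showing the change a revision made, relative to the previous revision."""
    return f"{BASE}?diff=prev&oldid={rev_id}"
```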
OK, the prototype database is getting sent out now. There is one form there which has example parameters (although they are disabled at the moment). That's the kind of thing that can be built for the prototype. Once we get a better idea of what we actually want, I can build a new form, and once we are happy with that, we can turn it into a web-based tool. Manning (talk) 11:10, 15 February 2012 (UTC)
- Got the mail - thanks enormously. 5 minutes looking at that and every comment you made above makes sense.
- One thing that strikes me is that every initial thought I had about how you might exclude certain diffs from processing automatically starts to fall apart because it filters out vandalism too, which we need to see... And anyway, if people fiddling with prose is so prevalent it could be causing edit conflicts, that would be a relevant finding - so I'm back in my thinking to scoring everything, even if it needs to be scored as "prose fiddling". Of course, depending on sequence, people adjusting their own comments can mean other things too...
- As you say, it's all about the "scoring" methodology. I'll spend some proper playtime on it in the next few days and see what happens. I need to seriously score some diffs in a session before I can say anything more useful than that right now, other than thanks again for putting this together. Begoon talk 12:05, 15 February 2012 (UTC)
I'm no data analyst and I'm more than happy to leave that to those who are. However I'm not entirely clear on how it's intended to score/categorise/whatever the threads and what the end result will be. So far I think we have Moni's initial DEFCON score and Manning's list ("Tone", "Relevance", "Sarcasm", "Hostility" etc). Assuming we're aiming for an end result that tells us some basic stats like the ratio of resolved to unresolved threads, and then gives a more detailed analysis of what made for a successful or unsuccessful resolution, do we need an agreed set of scoring criteria and scores? EyeSerenetalk 13:27, 15 February 2012 (UTC)
- Reposting a bit of what I said in the hatted section above: "I was looking at the suggested variables of "Tone", "Relevance", "Sarcasm", "Hostility", "Constructiveness". Will you have anchor text to describe each end of the 10-point scale? Also, I suspect there'll be a lot of correlation between some, which will make some redundant - eg I suspect tone, hostility and constructiveness will all correlate. We could get rid of one or two of these, and maybe have a further variable on something like "Use of policy" - ie the degree to which WP policy is invoked, quoted or linked to in the diff? Does the thread data need a bit more detail on the resolution as well? Not just whether it was resolved but how - block, page protected, editor warned, complaint dismissed etc etc... "
- I think the key thing is that our selection of variables needs to be driven by a hypothesis. Eg, we might have a hypothesis that incivility in a thread is associated with higher levels of off-topic chat; or that threads closed quickly are more likely to be rated as having a clear, agreed ending. In which case we need to record variables for levels of incivility; levels of off-topic chat; speed of closure; and clarity/consensus of ending. What are our hypotheses here? Kim Dent-Brown (Talk) 16:09, 15 February 2012 (UTC)
- I agree that the key to selecting the variables will be having some idea of what we think the end product ought to tell us. Regarding the scale, subjectivity is going to be almost impossible to avoid but I think we might minimise it with a smaller range rather than a larger one. For example, a three-point scale (eg good/neutral/poor) is easier to score and more likely to get the same scores from a range of different people than a ten-point one. It inevitably sacrifices some of the nuance, but given that scoring is subjective would the difference between, say, a '2' and a '4' on a ten-point scale be meaningful anyway? EyeSerenetalk 16:29, 15 February 2012 (UTC)
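To make the three-point-scale point concrete: collapsing a 10-point score into coarse bands, and checking raw agreement between two scorers, is only a few lines. The cut points below are an arbitrary illustration, not an agreed convention.

```python
def collapse(score):
    """Map a 1-10 score onto a coarse good/neutral/poor band (illustrative cut points)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be 1-10")
    if score <= 3:
        return "poor"
    if score <= 7:
        return "neutral"
    return "good"


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same band."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)
```

On real data a chance-corrected statistic such as Cohen's kappa would be more defensible than raw percent agreement, since with only three bands raters agree by chance fairly often.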
Thread and diff table design
The key tables in the DB are the thread and diff tables. The real value of this system will be in the thread analysis obviously. The thread table contains the following:
- The archive the thread belongs to
- The thread name
- The thread number (based on the TOC on the archive page)
- User starting the thread
- Thread category
This last column currently has ten choices: Block/Unblock Review / User Conduct Complaint / Content Dispute / Page Prot Request / Current Event response / Sock related / Admin conduct complaint / Vandalism in progress / Legal Threat / Policy Interpretation.
There may be other categories that need to be created. Based on the discussion above it is clear that the thread table needs more criteria. Some initial ideas could be:
- Resolved status
- Total comments
- Total "on-topic" comments
- Total users
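As a sketch only: the thread table described above, with the proposed extra criteria folded in, translates to something like the following in SQLite. The column names are my guesses, not the Access prototype's actual field names.

```python
import sqlite3

SCHEMA = """
CREATE TABLE archive (
    archive_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL             -- e.g. 'IncidentArchive729'
);
CREATE TABLE thread (
    thread_id         INTEGER PRIMARY KEY,
    archive_id        INTEGER NOT NULL REFERENCES archive(archive_id),
    name              TEXT NOT NULL,
    thread_number     INTEGER,           -- position in the archive page's TOC
    started_by        TEXT,
    category          TEXT,              -- one of the ten categories listed above
    resolved          INTEGER,           -- proposed: 1 if resolved, 0 if not
    total_comments    INTEGER,           -- proposed
    on_topic_comments INTEGER,           -- proposed
    total_users       INTEGER            -- proposed
);
"""

conn = sqlite3.connect(":memory:")   # throwaway prototype, in the spirit of the Access DB
conn.executescript(SCHEMA)
conn.execute("INSERT INTO archive (name) VALUES (?)", ("IncidentArchive729",))
conn.execute(
    "INSERT INTO thread (archive_id, name, thread_number, category) VALUES (1, ?, 1, ?)",
    ("AWB usage suspended for Waacstats", "User Conduct Complaint"),
)
```

Keeping the proposed criteria as plain nullable columns makes them cheap to add now and cheap to drop later, which matches the capture-everything-then-prune approach.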
- That looks great :) I've got two initial questions:
- Are we filtering out 'illegitimate' threads (vandalism, wrong board etc) or including them in the analysis? I guess there's a case for removing them, but their handling (prompt removal, correct redirection etc) could form part of the analysis too.
- I'm slightly wary about personalising the analysis, so do we need to know the user who started the thread? I'm not sure what that would tell us beyond contributing towards a primary key, though possibly knowing whether the respondents are admins/non admins might be useful in analysing the diffs.
- EyeSerenetalk 08:37, 16 February 2012 (UTC)
- - currently not filtering out anything. I think a thread that really belongs at AIV/WQA is still "resolved" in a sense. Vandalism threads get erased pretty quickly and don't appear as threads in the archives. The diffs involved have been recorded chiefly because it is best to capture everything at first and then prune (much harder to work in the other direction).
- - Including that was just "data analyst paranoia". As with the diffs, it's much easier to remove things than try to shoehorn them into the data model later on. If we decide we don't need it then it's no hassle to excise it later on.
- I got your email, sending DB as requested :) Manning (talk) 09:38, 16 February 2012 (UTC)
Experimenting with a totally different approach
As the purpose of this analysis is to review the effectiveness of threads, I've tried a new approach for collecting raw data (again using the randomly chosen Archive 729). This time I've simply scraped the entire text contents of the Archive page, and then mapped it to users (OK that part's still in progress).
The drawbacks of this approach are that we can't directly link back to the diff involved (as the archives are posted by MiszaBot II), and that linking to users is a bit trickier. I've also had to replace all the return characters with (BR) tags and I stripped out unicode. (Also, the dates and times all end up being Australian EST, due to my user settings.) The advantage is that it makes looking at the thread as a whole easier, and there is no uncertainty about which comment goes with which thread. I'll have a prototype version out fairly soon for anyone interested. Manning (talk) 23:42, 15 February 2012 (UTC)
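A rough sketch of the scrape-and-map step, assuming we're working from the archive page's wikitext: thread boundaries are level-2 headings, and users can be pulled from signature links. Both regexes are illustrative and would need hardening against real-world custom signatures.

```python
import re

# Thread headings in an archive's wikitext look like "== Title ==".
HEADING_RE = re.compile(r"^==\s*(.*?)\s*==\s*$", re.MULTILINE)
# A signature link such as [[User:Example|Example]] or [[User talk:Example|talk]].
SIG_RE = re.compile(r"\[\[User(?: talk)?:([^|\]/]+)")


def split_threads(wikitext):
    """Return a list of (title, body) pairs from an archive page's wikitext."""
    parts = HEADING_RE.split(wikitext)
    # re.split with a capturing group yields [preamble, title1, body1, title2, body2, ...]
    return list(zip(parts[1::2], parts[2::2]))


def users_in(body):
    """Extract usernames from signature links in one thread body, in order."""
    return [name.strip() for name in SIG_RE.findall(body)]
```

This keeps every comment unambiguously attached to its thread, which is exactly the advantage of working from the whole page rather than individual diffs.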
Coding the first few items from ANI729
Hello Manning and fellow proponents of ANI improvement. Here's a table with my analysis of the first eight sections from Wikipedia:Administrators' noticeboard/IncidentArchive729. I've tried to assign Moni's DEFCON code numbers. She also gives the meaning of her codes in the green collapsed section near the top of this page, labeled 'please open to read the various discussions..'.
This group of cases is boringly straightforward and most of them seem to be well handled. (These are threads 1-11, neglecting some repeats). They all get code 5, except for one that is code 4 (since no answer was provided). I invite anybody to scan through the remainder of ANI729 to look for cases of Moni's 'bad' codes. If all codes are good, then ANI does not need improving. Perhaps the table should be extended to just include the 'bad' cases.
| Name | Start date | Description | Filer | Subject of the complaint | Outcome | Comment on appropriateness | Moni's DEFCON code |
|---|---|---|---|---|---|---|---|
| AWB usage suspended for Waacstats | 26 November 2011 | Lack of care and excessively rapid edits when using AWB | PBS | Waacstats | Mistakes were acknowledged by the editor. His access to AWB was restored. | ANI was the right place for this. Filer and subject both acted properly. | 5 |
| Admin attention requested: continuing unruly page moves | 28 November 2011 | Continuing unruly page moves regarding 5 O'clock (song) | Noetica | Jab7842 and others | No response | ANI was the right place for this. Filer acted properly. A response would have been good. | 4 (since thread was not answered by any admin or experienced person) |
| User:Bonowatcher | 26 November 2011 | Repeated attacks on the subject of a BLP article at Chaz Bono | 188.8.131.52 | Bonowatcher | Bonowatcher already blocked as a sock | ANI was the right place for this. Filer acted properly. This was an obvious BLP violation and the sock was also obvious (at least to a checkuser). | 5 |
| Ducky socks, page protection | 26 November 2011 | Edit warring at Korean language by apparent socks | Crossmr | KoreanResearchCenter | Socks blocked | Either ANI or SPI would have been OK for this report. Filer acted properly. | 5 |
| Persistent spammer | 27 November 2011 | Spamming by User:Niel Mokerjee | Heiro | Niel Mokerjee | User was blocked one month for spamming | ANI was the right place for this, though AIV might also have worked. Obvious spam. | 5 |
| User:Swat0120 | 28 November 2011 | Vandalism-only account | Rangoon11 | Swat0210 | Level three warning given | This report should have been filed at AIV instead of ANI. Otherwise, no problem. | 5 |
| Personal attack by Hipocrite | 28 November 2011 | Personal attack | TParis | Hipocrite | Editor withdrew the alleged personal attack | ANI was the right place for this. The subject of the complaint continued to make unusual remarks in the ANI thread. An admin boxed the discussion to keep it from continuing. | 5 |
| Athletics page vandalism | 28 November 2011 | Personal attack | Trackinfo | 184.108.40.206 | IP blocked three hours | This report could have been filed at AIV instead of ANI. Otherwise, no problem. | 5 |