Wikipedia:Community health initiative on English Wikipedia/Research blog
- 1 Analyzing the Harassment Reporting and AN/I Survey 2/9/2018
- 2 Wikihounding Research with the Wikimedia Research Team 11/10/2017
- 3 AN/I Analysis and Follow Up 09/12/2017
- 4 Apologies for the Delay! And follow ups from Wikimania 08-30-2017
- 5 Quantitative Research on AN/I 07-18-2017
- 6 Welcome to Caroline's Research Blog 07-12-2017
Analyzing the Harassment Reporting and AN/I Survey 2/9/2018
Hi all, Apologies first for the delay in writing. I got a bit swamped but that's no excuse. I'm actually going to start posting, however big or small, every two weeks. Anywho, onwards to this post.
During the month of December, the WMF's Support and Safety team (SuSa) and the Anti-Harassment Tools team ran a survey targeted at admins about English Wikipedia's Administrators' Noticeboard/Incidents (AN/I) and how reports of harassment and conflict are handled. Around 136 people responded and filled out the survey. The survey has 23 questions, including 6 write-in answers and 2 questions that ask respondents to rate multiple items (so each has multiple selections). Those two questions are along the lines of "on a scale of 1 to 5, how well are specific types of problems dealt with at AN/I?" The items to rate are behaviors like sockpuppeting, personal attacks, copyright violations, etc.
The qualitative answers are taking a bit longer to sort. Those answers stem from questions like "what do you think AN/I does well," "if you could change one thing about AN/I, what would it be", etc. Some of the answers we are sorting already reveal specific groups of feedback. For example, administrators noted that what AN/I needs most in order to improve is more structure. That feedback can be divided into two categories: technical structure and formalist structure. Technical structure includes suggestions for more ‘structured’ data, which could mean more consistent reporting or specific forms or templates for reports. Formalist structure includes requests for more clearly defined rules, better rules for how cases are tried, clerking/moderation, and a mixture of private and public reporting.
For the month of January, SuSa and the Anti-Harassment Tools team have been analyzing the quantitative and qualitative data from this survey. Our timeline towards publishing a write-up of the survey is:
- February 16th- rough draft with feedback from SuSa and Anti-Harassment team members
- February 21st- final Draft with edits
- March 1st- release report and publish data (which SuSa has been diligently compiling) from the survey on wiki
We are incredibly excited to share our findings with the community. I, personally, am incredibly excited because it's the first large collaborative research project and published report that I've done while at the WMF. I will definitely be linking to it from here :)
Wikihounding Research with the Wikimedia Research Team 11/10/2017
Along with the research team, I am focusing on creating a machine learning model of ‘wikihounding.’ A few weeks ago, I posted about the work my team is doing on looking at cases on AN/I and labeling specific cases as harassment or conflict. We were interested in seeing how many cases there were on AN/I and how many were resolved, which I summarized here.
Diego Saez-Trumper, Aaron Halfaker, and Jonathan Morgan, all of the research team, and I are thinking of different ways to study wikihounding. To kick off this research, I created a large document summarizing what I thought could be a good basis for defining wikihounding, drawn from reading about 30-40 cases that mentioned wikihounding or involved wikihounding accusations.
Wikihounding is both incredibly qualitative and quantitative: it has context as well as statistical or mathematical aspects. What do I mean by that? Well, everything inside of a digital space has a quantitative aspect to it. Our interactions are saved as data: our pasts, what we like, when we post things, how long those posts are (e.g. byte counts), when, to whom, and in response to what we post. That's all analytical data; each is a data point. That's the quantitative aspect. The qualitative part is then reading the cases and determining if one editor was engaging in harassment or was trying to antagonize another editor. We don't want to create a model that lumps wiki-friending or mentoring in with wikihounding, which is why this is both a qualitative and quantitative project. We are focusing on AN/I and on wikihounding cases labeled in that space for ‘high precision’, meaning AN/I cases that the community has judged and labeled as instances of wikihounding. AN/I is not a well-structured dataset, but it's open and accessible for qualitative analysis.
The definition of wikihounding, from the EN:WP Harassment page: “Wikihounding is the singling out of one or more editors, joining discussions on multiple pages or topics they may edit or multiple debates where they contribute, to repeatedly confront or inhibit their work. This is with an apparent aim of creating irritation, annoyance or distress to the other editor. Wikihounding usually involves following the target from place to place on Wikipedia.”
^The above definition, while lacking quantitative specifics, is a great one for a basis of understanding the qualitative aspects of wikihounding. It’s the context, e.g. following, repeatedly, an editor around, and the intention to inhibit, irritate, annoy and distress.
What is wikihounding? A policy definition is reverting more than 3 times, whether on the same page or on different pages.
My methodology for starting to think about quantitative parameters around wikihounding, and what those parameters are: for a case to be argued as ‘real’ wikihounding in a space like AN/I or ArbCom, and to be recognized by administrators as ‘wikihounding’, it's a mixture of frequency + length of time + locations.
The methodology for frequency + length of time + location: a frequency of around 5 but under 10 interactions, over a few days, across more than 3 pages, is definitely wikihounding. But a frequency of 5-10 interactions over the course of two years, across only three pages, would just be ‘regular’ editor interaction, even though it's an interaction that does cause some editors discomfort. For something to truly be wikihounding, it's the combination of frequency, time and locations. A smaller frequency should come with a shorter length of time. The longer the harassment continues, the more the frequency of revisions and edits must go up exponentially.
Hounding examples:
- Frequency: 5+ interactions; length of time: longer than 24 hours but less than a month; locations: more than 3 pages (this includes an article page and an article talk page)
- Frequency: 10+ interactions; length of time: longer than a month; locations: 3+ pages
Not hounding:
- Frequency: 3-8 interactions; length of time: 1+ year; locations: any number of pages
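To make the examples above concrete, here is a minimal sketch of how those rough thresholds could be written down as a rule. It is only an illustration, not the model the team is building: the `InteractionPattern` structure, the `looks_like_hounding` function, and the choice to treat "a month" as 30 days are all assumptions made for this example, and the qualitative reading of each case would still be needed.

```python
from dataclasses import dataclass


@dataclass
class InteractionPattern:
    """Hypothetical summary of one editor's interactions with another editor."""
    interactions: int     # number of interactions (edits, reverts, replies)
    duration_days: float  # time between the first and the last interaction
    pages: int            # distinct pages involved (article and talk pages count separately)


def looks_like_hounding(p: InteractionPattern) -> bool:
    """Apply the rough thresholds from the examples above (assumes 1 month = 30 days)."""
    # First example: 5+ interactions, longer than 24 hours but under a month, more than 3 pages.
    short_burst = p.interactions >= 5 and 1 < p.duration_days < 30 and p.pages > 3
    # Second example: 10+ interactions, longer than a month, 3 or more pages.
    sustained = p.interactions >= 10 and p.duration_days >= 30 and p.pages >= 3
    return short_burst or sustained


if __name__ == "__main__":
    print(looks_like_hounding(InteractionPattern(6, 4, 4)))    # short burst across 4 pages
    print(looks_like_hounding(InteractionPattern(12, 90, 3)))  # sustained over three months
    print(looks_like_hounding(InteractionPattern(5, 400, 2)))  # a few interactions over a year
```

A rule like this only captures the quantitative side; it cannot tell mentoring or wiki-friending apart from hounding, which is why the hand-labeled AN/I cases matter.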
The 3RR (three-revert) rule gets brought up a lot in AN/I cases from other harassment domains, like when someone is accused of personal attacks, to show that wikihounding is also occurring. A 3RR violation can be argued as wikihounding, but it's rarely agreed upon as wikihounding since usually the other editor involved in the case has responded. Instead, what it looks like is two editors fighting, with revisions involved.
It's been difficult to find a lot of wikihounding cases that are also canonical cases. Wikihounding can look like a lot of things, such as various forms of edit warring. There are a lot of cases that would technically be called hounding but occurred within the larger context of another case, e.g. a user being a sock puppet or engaging in more rampant harassment or bad behavior. Ultimately those cases are labeled as something else, not hounding.
What we plan to do: start by looking at archived AN/I cases labeled as wikihounding. Our reasoning is that this is a good basis of community-labeled and community-decided wikihounding cases. From here, we may start to create a model by reading some cases by hand, and then plot out all of the wikihounding cases using machine learning. By plotting, I mean looking for how many cases have a similar frequency of interaction, a similar length of time, or similar demographics of editors (experienced, inexperienced, etc).
- Looking at labeled, archived wikihounding cases on AN/I
- Looking at similarities by hand on a few cases
- Running machine learning algorithms over the entirety of cases and comparing the results to what we saw as ‘labeled similarities’ by hand: are there patterns, and where? Do they differ at all from human-labeled patterns? (There will be some differences.)
- Plotting those cases across a graph?
- Looking at time, frequency, location
- Then: contextual aspects of hounding, such as incivility, antagonizing content, toxicity
- ^These terms (incivility, antagonizing content, toxicity) will need labels and examples
Our questions and concerns: What about recall? What is adjacent wikihounding, i.e. things that are similar to wikihounding but not quite hounding? What are the false positives and false observations, such as reverse wikihounding?
AN/I is unstructured data, so someone needs to query those cases and then parse them. Are there kinds of wikihounding that are not well represented in that dataset?
We want feedback, and suggestions, and criticism. So please leave your thoughts!
AN/I Analysis and Follow Up 09/12/2017
For the past month, the Anti-Harassment Tools Team and the Support and Safety team have been collaborating on a study specifically of AN/I. We were interested in exploring what ‘kinds’ of harassment, abuse, and conflict exist on Wikipedia. This concept of differentiating between ‘conflict’, ‘harassment’ and ‘abuse’ is also explored in a talk I gave at Wikimania ‘17. The talk explores why conflict can be necessary for Wikipedia: debate is at the heart of open knowledge, and there will be conflict within that. However, it's important to see what makes up conflict, when conflict becomes harassment, and what kinds of actions persist from conflict to harassment and then to abuse. The talk also covers why it's important to separate actions into different buckets of ‘severity’, because that severity informs what kind of action needs to be taken. Mitigating wikihounding is different from responding to a death threat, so it's important to create buckets, or a taxonomy, that separate actions by the kind of response they call for.
So this is where AN/I comes in. One of our Q1 goals states: "Assist Support & Safety in preparing a qualitative research methodology for measuring the expectations and experiences of people using our main noticeboards for user disputes." To kickstart achieving this, our senior developer, Dayllan Maza, and I decided to try to analyze two months of AN/I data as a representational slice of what potential conflict could look like on Wikipedia. This analysis is the start of analyzing a two-pronged problem. ‘Try’ here is the key word. AN/I is incredibly hard to analyze, and our results were not quite what we had hoped for. The reason AN/I is hard to analyze is how unstructured the data is: there is no form or set way for editors and admins to ‘report’ or ‘file’ a specific request on AN/I. There are 1,000 ways to write a request for help, and users will find 100,000 ways to write those same requests. Think of all the ways you could write “This editor is engaging in COI”: it could be anywhere from three paragraphs to just three words. What are the keywords to pull out here? Is it the mention of “COI”? If we analyze just mentions of keywords, we could end up with thousands of cases, a fair amount of which are not actually legitimate COI cases.
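As a small illustration of why keyword matching alone is unreliable here, the sketch below runs a naive "COI" pattern over three made-up report snippets. The snippets, the pattern, and the `mentions_coi` helper are all invented for this example; they are not taken from real AN/I cases or from the team's tooling.

```python
import re

# Naive keyword pattern (an assumption for illustration): the abbreviation or the full phrase.
COI_PATTERN = re.compile(r"\bCOI\b|conflict of interest", re.IGNORECASE)

# Invented report snippets, not real AN/I text.
reports = [
    "This editor is engaging in COI.",
    "User keeps adding glowing praise to their employer's article and removing criticism.",
    "I have no COI here, but this user keeps reverting me.",
]


def mentions_coi(text: str) -> bool:
    """Return True if the text contains the keyword, regardless of what the case is about."""
    return bool(COI_PATTERN.search(text))


if __name__ == "__main__":
    for text in reports:
        print(mentions_coi(text), "|", text)
```

The second snippet describes what reads like a conflict of interest without ever naming it, so the keyword search misses it, while the third mentions the keyword only to deny it and gets flagged anyway. Both kinds of error are why in-depth reading of cases is still needed.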
This is how we had hoped to structure the research:
- Amount of cases from April to May 2017
- Breakdown of case characteristics: reporter; alleged perpetrator; alleged victim (if not the reporter); involved users (friends/supporters of perp or victim, enemies of perp or victims, users who witnessed the incivility; this may be hard to prove or show); non-admin active ANI participants (users who regularly check ANI and participate in dispute resolution, although not admins); admins
- How cases were closed/resolved?
- How many cases were archived but not formally closed?
- The frequency of admins responding to cases
- What is the breakdown of reasons for filing cases?
- Check for keywords
- Cases that link to policy pages
- Users (editors and admins) weighing in on multiple cases
I ended up not being able to do a lot of the analysis by hand, simply because it took too long. So much of AN/I requires in-depth reading to see if a case is what it is presented as. A lot of editors know that people will file false cases, or will be in the wrong when bringing another user to AN/I. Also, some editors would describe what was happening but not give it a label or a word, which is why reading individual cases was so important. We are hoping to collaborate with a university to separate and tag this information by hand.
What we have done recently is look at all of the formally closed cases, as well as all cases in which admins link to policy. From here, we plan to further analyze and group them, potentially by hand but most likely by keyword, to see which policy is being linked and in which cases. What we can learn here is in which kinds of cases policy works well. Additionally, we've looked at which admins respond to the majority of cases, to analyze correlations between kinds of cases and amount of responses.
What we found was interesting. Out of 533 total cases, 315 were resolved, over 1008 users were involved or weighed in, and the most popular day to start a case was Sunday. Out of 533 total cases, 514 linked to policy pages. How helpful was policy in recognizing behaviors or supporting arguments? With this kind of analysis, we can now drill into specific posts and arguments to see how certain pages were used. In terms of policy and linking, the top five linked policy pages were (in order of popularity):
- What Wikipedia is Not (linked to 59 times)
- Edit Warring (linked to 48 times)
- Here to Build an Encyclopedia (linked to 46 times)
- Assume Good Faith (linked to 44 times)
- No Original Research (linked to 43 times)
We also had 632 users, both editors and admins, participating in cases. Participation is defined as anyone who comments on a case in AN/I.
For the months of April and May 2017, the top five users weighed in on cases between 30 and 51 times each:
- One editor weighed in 51 times
- One editor weighed in 38 times
- One admin weighed in 34 times
- One admin weighed in 32 times
- One admin weighed in 30 times
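For readers curious how a "top participants" tally like the one above can be produced once cases have been parsed, here is a minimal sketch. The `(case, user)` records and the `cases_per_user` function are hypothetical, and counting each user at most once per case is an assumption; the actual analysis may have counted individual comments instead.

```python
from collections import Counter

# Hypothetical parsed records: one (case id, username) pair per comment.
comments = [
    ("case-1", "EditorA"),
    ("case-1", "AdminB"),
    ("case-1", "EditorA"),  # a second comment in the same case
    ("case-2", "EditorA"),
    ("case-2", "AdminC"),
    ("case-3", "AdminB"),
    ("case-3", "EditorA"),
]


def cases_per_user(records):
    """Count how many distinct cases each user commented on."""
    unique_pairs = set(records)  # collapse repeat comments within the same case
    return Counter(user for _case, user in unique_pairs)


if __name__ == "__main__":
    counts = cases_per_user(comments)
    print(len(counts), "participants")
    for user, n in counts.most_common(2):
        print(user, "weighed in on", n, "cases")
```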
The challenge here is that form and structure are needed in an unstructured space. An idea, simply an idea, could be to create a form with form fields for editors and admins to write in the different kinds of issues they are facing, who's involved, over what pages, etc. This would make parsing AN/I cases so much easier and faster, and would return much-needed, usable data.
Part of SuSa's goals and the Anti-Harassment Tools team's goals, and so my research goals, is to build out a larger qualitative and quantitative series of research plans to analyze the confidence of admins and editors, as well as their experiences reporting and mitigating harassment using the tools the community currently has at its disposal. Being able to do small experiments for a few weeks or a few months to create this missing data is integral to our work. Moving forward, we are partnering with the research team to focus on what makes up wikihounding, while still going through cases by hand to continue our analysis. Breaking up our research into smaller chunks will make it easier to analyze and solve.
[tags: AN/I, harassment, conflict, abuse, analysis, wikihounding, stats]
Apologies for the Delay! And follow ups from Wikimania 08-30-2017
Hi all! Sorry for the delay- prepping for Wikimania, then being at Wikimania, then catching up with work post Wikimania, and then chewing on what we discussed at Wikimania- clearly, there's been a lot!
Thoughts on Wikimania: I didn't get to see as *many* talks as I would've liked. It was my first Wikimania, and I was balancing giving a talk of my own with facilitating a round table on the work the Anti-Harassment Tools Team is doing. But I was really glad to catch Funcrunch's talk on online harassment, called "Facing Defacement", which was on forms of harassment (vandalism on his talk page) and an edit filter he made with Leon from the Comm Tech team (Leon is one of my coworkers). What I really enjoyed about this talk is the distinction Funcrunch makes about a user talk page, which has a separate purpose and meaning from an article page. What do I mean by this? A user talk page is *technically* not part of the encyclopedia, though it exists on Wikipedia. It's considered a separate thing: it's not an article page, but a page to chat, gather, and talk. Thus what can be edited, added, and implemented on a talk page is quite different from adding content to an article page, right? Yes, for sure. Talk pages can be quite silly; it's part of their design. Talk pages are important, and they are also personal. Funcrunch gives this well-worded analogy: "Vandalizing a user talk page is like spray painting graffiti on their front door." Which I agree with. Funcrunch credited edit filters that don't allow anonymous or unconfirmed users to edit their talk page. What I love about this filter is that it's a filter, and not a site-wide feature. It creates user agency, because a user can choose to enable it or not. Being able to offer small, nuanced choices like this, so editors can have a better experience on their personal talk pages and can mitigate their own interactions, is key. What makes Wikipedia such a great space is this ability for editors/admins/users/anywho to have a hand in the tools, interfaces, filters, etc. that are created and implemented on the site.
What is key to mitigating harassment is thinking of different kinds of tools, be it a tool, a UI/UX setting, or settings for notifications or privacy (a variety of things), that can help users tailor the kinds of experiences they want to have without changing the identity of the site. Being able to be somewhat private in some instances can be helpful; thinking of interactions as more fluid, as opposed to a binary of public or private, is really important. I will be posting a blog post in a week that gets way more into this, but user agency is key! And allowing more kinds of profile setting choices, like muting, is key in harassment mitigation.
We also just rolled out a new feature, and I am super proud of the work our team has done. We've made more robust mute features: https://en.wikipedia.org/wiki/Wikipedia:Community_health_initiative_on_English_Wikipedia/User_Mute_features. Trevor Bolliger, the product manager of my team, the Anti-Harassment Tools Team, made this great chart to help elucidate what kinds of brainstorming and studying we did around the new Mute features.
We would love to hear feedback about the Mute feature; we'll be doing more things with it soon.
Quantitative Research on AN/I 07-18-2017
One of our Q1 goals states: "Assist Support & Safety in preparing a qualitative research methodology for measuring the expectations and experiences of people using our main noticeboards for user disputes." To kickstart achieving this, I want to analyze two months of AN/I data as a representational slice of the conflict within Wikipedia. This analysis is the start of analyzing a two-pronged problem. This is just the beginning of our research under the Craig Newmark grant, so full disclosure: it's a way to start seeing what the gaps in the data are, what is needed to better understand harassment, and then to use that data to come up with solutions to mitigate harassment. But I also want to know what editors and admins think of our findings: what gaps do you see? What do you think of the analysis?
Two Month Analysis of AN/I (from April to May 2017)
- What are the amount of cases in this 2 month period?
- Breakdown of case characteristics:
- Alleged perpetrator
- Alleged victim (if not the reporter)
- Involved users (friends/supporters of perp or victim, enemies of perp or victims, users who witnessed the incivility.) (this may be hard to prove or show)
- Non-admin active ANI participants (users who regularly check ANI and participate in dispute resolution, although not admins.)
- How cases were closed/resolved?
- this involves coming up with a definition of resolution. The definition I have is: a solution is proposed, generally agreed upon, enforced, and then the discussion ends.
- For example: “let’s i-ban someone for 24 hours,” some debate back and forth, then someone saying “I i-banned them for 24 hours,” and then the case peters off.
- How many cases were archived but not formally closed?
- This will then be compared to cases that are resolved and archived, versus all cases that are archived to get a better understanding of dispute resolution within AN/I
- The frequency of admins responding to cases (did one admin respond to many cases, or did a handful of admins respond to many cases? What is the frequency of admin response, and how many admins respond to multiple cases? Can we look at this as a range, such as X admins responding to 1-5 cases and Y admins responding to 10-20 cases, etc.?)
- What is the breakdown of reasons for filing cases?
- I will be doing this by hand, so there will be concrete numbers on out of X number of total AN/I cases for April- May 2017, Y numbers were wikihounding, Z numbers were contesting edits, etc.
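The "range" idea for admin responses in the list above could be sketched roughly as follows. The bin edges, the admin names, and the per-admin counts are all made up for illustration; the plan only gives "1-5" and "10-20" as examples, so the other ranges here are assumptions.

```python
from collections import Counter

# Hypothetical ranges as (label, lower bound, upper bound); None means no upper bound.
BINS = [("1-5", 1, 5), ("6-10", 6, 10), ("11-20", 11, 20), ("21+", 21, None)]


def bin_label(n):
    """Return the label of the range that a per-admin case count falls into."""
    for label, low, high in BINS:
        if n >= low and (high is None or n <= high):
            return label
    raise ValueError("case count must be at least 1")


def admins_per_range(cases_by_admin):
    """Map each range label to the number of admins whose case count falls in it."""
    tally = Counter(bin_label(n) for n in cases_by_admin.values())
    return {label: tally[label] for label, _low, _high in BINS}


if __name__ == "__main__":
    # Invented counts of cases responded to, per admin.
    example = {"AdminA": 2, "AdminB": 4, "AdminC": 12, "AdminD": 34}
    print(admins_per_range(example))
```

Grouping by range like this would show at a glance whether a handful of admins handle most of the noticeboard or whether responses are spread widely.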
What I will be doing with some of our developers:
- Check for keywords
- Which cases are linked to policies (e.g. Wikipedia:Harassment)
- Creating graphical breakdowns of the above
- Bar graphs of kinds of cases, and types
- Show overlap of cases
- Frequency of data types:
- Kinds of cases
- Users who file a case
- Users involved in a case
- Admins who look over multiple cases
- Cases archived
- Cases considered ‘resolved’
- Analyzing who started cases, who are tagged in cases, who resolved/weighed in on cases and seeing if there’s any overlap between cases
- 2 month timeline: date/time frequency (are Fridays a popular day to start a case? Maybe this isn’t important but it’s important to see a timeline)
My goal is to work over the next two to three weeks to gather and analyze data to present at Wikimania, and then release that data set to the community.
One of the reasons there is a push for this kind of quantitative research now is that it's incredibly helpful to have numbers, and to share these numbers with the community. What is missing here, in the research? From there, a collaborative plan can be formed to research/assess necessary missing data points. It's important to see which cases are getting resolved and then figure out why other cases aren't. Is it that they are too complicated or too messy, that there's not enough information, that it's hard to suss out fault, etc.? This is important data to have.
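The day-of-week question in the timeline item could be answered with a tally like the one sketched below. The start dates are invented for the example (they are not real AN/I case dates), and `cases_by_weekday` is a hypothetical helper name.

```python
from collections import Counter
from datetime import date

# Fixed English day names, so the output does not depend on the system locale.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Invented case start dates within the April-May 2017 window.
start_dates = [date(2017, 4, 2), date(2017, 4, 7), date(2017, 4, 9), date(2017, 5, 14)]


def cases_by_weekday(dates):
    """Count how many cases were started on each day of the week."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return Counter(DAYS[d.weekday()] for d in dates)


if __name__ == "__main__":
    for day, n in cases_by_weekday(start_dates).most_common():
        print(day, n)
```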
Additionally, this is the precursor of a small qualitative survey that will be rolled out (post August) to admins specifically on AN/I about their thoughts on dispute resolution, on how AN/I works, if it works and suggestions they have to improve dispute resolution.
I’m trying to be super mindful that not all data is good, that there is data saturation, and survey fatigue. But right now is a kick off to trying to gather this much needed analysis, numbers wise, of the frequency of different kinds of conflict on En:Wikipedia, and when and how that conflict becomes abuse.
Part of SuSa’s goals and the Anti-Harassment Tools team goals, so my research goals, are to build out a larger qualitative and quantitative series of research plans to analyze confidence of admins and editors as well as experiences reporting and mitigating harassment using the tools the community currently has at their disposal.
I know there are holes (lots of holes, and questions that will create more questions and eventually, hopefully, generate answers), but it's about figuring out how big those holes are and what data is needed to start fixing them.
(feel free to leave comments/questions on the talk page)
Welcome to Caroline's Research Blog 07-12-2017
Hi everyone! I'm Caroline Sinders, the product analyst and lead researcher on the Anti-Harassment Tools team. I decided to start a blog as a place to put my in-progress ideas, research, links, and thoughts to better share with the community. As a researcher, I'm incredibly invested in creating open source research, which means showing the research in progress and iterating upon that research. So this will be a space to see ideas in progress, and to see how those ideas change and evolve over time.
One thing I've been excited to dig into today and yesterday has been the Pew Research Center's Online Harassment 2017 study.
A lot of findings weren't that new or eye-opening to me; I've been studying online harassment and protest for almost five years now. But what is helpful is having year-to-year statistics that provide data around this subject. For example, a major takeaway was: "Four-in-ten U.S. adults have personally experienced harassing or abusive behavior online; 18% have been the target of severe behaviors such as physical threats, sexual harassment. Around four-in-ten Americans (41%) have been personally subjected to at least one type of online harassment – which this report defines as offensive name-calling online (27% of Americans say this has happened to them), intentional efforts to embarrass someone (22%), physical threats (10%), stalking (7%), harassment over a sustained period of time (7%) or sexual harassment (6%). This 41% total includes 18% of U.S. adults who say they have experienced particularly severe forms of harassment (which includes stalking, physical threats, sexual harassment or harassment over a sustained period of time)." This takeaway is pulled, verbatim, from the report.
The four-in-ten U.S. adults experiencing harassment (which is around 41% in this report) is up from 35% from when the report was last conducted in 2014.
Generally, what's important about this report is looking at it from a context-specific perspective. The term 'online harassment' is a wide umbrella term running the gamut from low-level harassment such as 'name calling' to very serious and abusive interactions such as physical threats, sustained harassment, sexual harassment or stalking. How users view the seriousness of harassment, or when there should be more policy intervention or police involvement, often comes from the kind of harassment a user has received. What I mean by this is that if the main kind of harassment you, the user, have received is something like name calling, you are more likely to view harassment as not so serious of a thing. But if the harassment you've sustained or witnessed involves physical threats, persistent harassment, stalking, etc., then you view harassment as a much more serious thing that involves more policy and more intervention. I found this specific takeaway to be illuminating of this problem:
"Men (30%) are modestly more likely than women (23%) to have been called offensive names online or to have received physical threats (12% vs. 8%).
By contrast, women – and especially young women – encounter sexualized forms of abuse at much higher rates than men. Some 21% of women ages 18 to 29 report being sexually harassed online, a figure that is more than double the share among men in the same age group (9%). In addition, roughly half (53%) of young women ages 18 to 29 say that someone has sent them explicit images they did not ask for. For many women, online harassment leaves a strong impression: 35% of women who have experienced any type of online harassment describe their most recent incident as either extremely or very upsetting, about twice the share among men (16%).
More broadly, men and women differ sharply in their attitudes toward the relative importance of online harassment as an issue. For instance, women (63%) are much more likely than men (43%) to say people should be able to feel welcome and safe in online spaces, while men are much more likely than women to say that people should be able to speak their minds freely online (56% of men vs. 36% of women). Similarly, half of women say offensive content online is too often excused as not being a big deal, whereas 64% of men – and 73% of young men ages 18 to 29 – say that many people take offensive content online too seriously. Further, 70% of women – and 83% of young women ages 18 to 29 – view online harassment as a major problem, while 54% of men and 55% of young men share this concern...
But the 18% of Americans who have experienced more severe forms of harassment – such as physical threats, sustained harassment, sexual harassment and/or stalking – differ dramatically in their personal reactions and broader attitudes toward online harassment." Those who have experienced more severe harassment have more stress, are more likely to change up their privacy settings, delete profiles, change their usernames, cease going to specific online or offline places, and may contact law enforcement.
It's important to bring this up because it helps provide parameters of responses for a researcher. Understanding how users will respond to different kinds of stressors, from low to high abuse and from one-off to persistent, can help define what kinds of protocols, methodologies and responses to create. Not every situation will have the same solution, so how do we think about all of the different kinds of conflict inside of Wikipedia, and then think about when that conflict turns into harassment and then abuse? I think that conflict is necessary on a site like Wikipedia; we need to engage and debate about the kinds of information we are creating on this site. Conflict is not a bad thing. Healthy and even some heated debates are good; they are necessary for knowledge creation. Not all conflict is abuse, not all conflict is harassment, and it's really important to highlight that. It's also important to know when conflict turns into harassment and when harassment turns into abuse, and then to understand the responses that users will have TO conflict, TO harassment, and TO abuse. If harassment is the middle ground, a middle ground that causes emotional strife, it's important to understand that and to outline the different kinds of behavior that exist within the bucket of conflict, the bucket of harassment, and the bucket of abuse.
tl;dr- read this report! I always find them to be an illuminating experience. Also, this takeaway was interesting: "What people actually consider to be “online harassment” is highly contextual and varies from person to person. Among the 41% of U.S. adults who have experienced one or more of the six behaviors that this survey uses to define online harassment, 36% feel their most recent experience does indeed qualify as “online harassment.”"
If you have any questions or want to talk about the survey, feel free to reach out to me on the talk page :)