
Wikipedia talk:Wikipedia Signpost/2020-05-31/Recent research: Difference between revisions

From Wikipedia, the free encyclopedia
c re accuracy metrics
suggesting a wiki page to subject to examination both manually and automatically
* Re: "Among the most effective features was 'the percentage of edits made by a user that are less than 10 bytes'": every so often I chip away at articles containing "the The" (usually a typo, but not easily automated because of things like "...for the The Wall Tour", "Congo, Democratic Republic of the. The World Factbook", "...assessment criteria in the THE rankings" and "They were christened by the media as the "The" bands"). So, like many who look for typos, I have a large percentage of edits made by a user that are less than 10 bytes. --[[User:Guy Macon|Guy Macon]] ([[User talk:Guy Macon|talk]]) 18:05, 1 June 2020 (UTC)
:*Guy, some of your edits might score high on that axis, but I don't think you would be selected by the algorithm due to the additional features outlined in '''4.2 User-based features'''. Both of these would score low: {{tq|average time between two consecutive edits made by the same user}} in the same article and {{tq|the percentage of edits made by a user that are less than 10 bytes}}. Unclear if the last thing is scored across the suspicious article or across the account's lifetime, but either way. ☆ [[User:Bri|Bri]] ([[User talk:Bri|talk]]) 19:03, 1 June 2020 (UTC)

* I'm not sure that this is the correct place to suggest a wiki page to subject to examination; I think the article on Connections Academy reads like a sales pitch, and it would benefit the community to subject it to scrutiny. Thanks! [[Special:Contributions/146.14.46.201|146.14.46.201]] ([[User talk:146.14.46.201|talk]]) 02:12, 6 June 2020 (UTC)

Revision as of 02:12, 6 June 2020

Discuss this story

AUROC of 0.983 and average precision of 0.913: that sounds pretty good to me! Looking forward to seeing the results of this once it's applied in a more practical setting. Sdkb (talk) 11:57, 1 June 2020 (UTC)

Looking at it, I suspect it might help to filter out company Paidcoi SPAs, which is certainly worthwhile, but the "professionals" can tweak their behaviour fairly easily to heavily minimise their appearance without that much more work (e.g. in edits used to gain AC). Nosebagbear (talk) 12:45, 1 June 2020 (UTC)
Goodhart's law applies. Nemo 14:16, 1 June 2020 (UTC)
I've been working in this arena for a while, and in fact have a credit in the paper for contributing labeled data that was used to train the model. We aren't sure how sophisticated some of these operations are, but my feeling is there's a distinct break between the activities of the outfits catering to well-funded Global North entities (in particular corporations and their executives, entertainers/entertainment companies, and politicians and political groups) – probably what you mean by the "professionals" – and the rest. I wouldn't be surprised if the former are highly aware of the investigative techniques used on-Wiki, and adapt to whatever metrics and techniques we apply, but the latter are unable to, at least quickly. But the greatest volume of stuff that has to be dealt with is due to the less sophisticated group, and it would still be useful to have tools that winnow that away so human effort can be focused on the remainder. ☆ Bri (talk) 16:30, 1 June 2020 (UTC)
Yes, or in other words it's easy to focus on the least consequential cases, while large-scale manipulations by well-funded enemies of the neutral point of view will be left untouched. Nemo 18:00, 1 June 2020 (UTC)
That's kind of the opposite of what I said. Enhanced tools can help identify the least consequential cases; and dogged and talented experts can detect large-scale manipulations by well-funded enemies of the neutral point of view. You should drop in at WP:COIN and see how it works. ☆ Bri (talk) 02:47, 2 June 2020 (UTC)
@Sdkb: I put together the dataset that the authors used and generated a lot of the features. When I last checked on unseen articles, it was classifying 50% as UPE... so clearly not much help in practice. Admittedly I wasn't aware they'd published this and they might have improved, but the metrics they were getting back then were pretty similar. I still think this is possible, but it requires a lot more work to generate the training data. SmartSE (talk) 17:16, 3 June 2020 (UTC)
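For readers unfamiliar with the two metrics quoted at the top of this thread, here is a minimal sketch of how AUROC and average precision are commonly computed from a classifier's scores. It is not the paper's evaluation code; the labels and scores are made up for illustration, and the function names `auroc` and `average_precision` are hypothetical.

```python
# Illustrative only: toy labels (1 = paid editing, 0 = not) and made-up scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive item.

    Assumes no tied scores; ties would need a tie-handling rule.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


print(round(auroc(labels, scores), 3))
print(round(average_precision(labels, scores), 3))
```

Both numbers are computed on whatever labelled set is supplied, so they say nothing by themselves about how a model behaves on articles unlike the training data, which is the concern raised above.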
  • Re: "Among the most effective features was 'the percentage of edits made by a user that are less than 10 bytes'": every so often I chip away at articles containing "the The" (usually a typo, but not easily automated because of things like "...for the The Wall Tour", "Congo, Democratic Republic of the. The World Factbook", "...assessment criteria in the THE rankings" and "They were christened by the media as the "The" bands"). So, like many who look for typos, I have a large percentage of edits made by a user that are less than 10 bytes. --Guy Macon (talk) 18:05, 1 June 2020 (UTC)
  • Guy, some of your edits might score high on that axis, but I don't think you would be selected by the algorithm due to the additional features outlined in 4.2 User-based features. Both of these would score low: average time between two consecutive edits made by the same user in the same article and the percentage of edits made by a user that are less than 10 bytes. Unclear if the last thing is scored across the suspicious article or across the account's lifetime, but either way. ☆ Bri (talk) 19:03, 1 June 2020 (UTC)
  • I'm not sure that this is the correct place to suggest a wiki page to subject to examination; I think the article on Connections Academy reads like a sales pitch, and it would benefit the community to subject it to scrutiny. Thanks! 146.14.46.201 (talk) 02:12, 6 June 2020 (UTC)
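As a rough illustration of the two user-based features quoted in the exchange between Guy Macon and Bri, the sketch below computes them from a toy edit history. Everything here is an assumption for illustration: the record layout, the function names, the reading of "less than 10 bytes" as the absolute size change of an edit, and the choice to score the small-edit percentage over all of a user's edits in the list (the thread notes it is unclear whether the paper scores it per article or per account lifetime).

```python
from datetime import datetime
from statistics import mean

# Hypothetical edit records: (user, article, timestamp, size change in bytes).
edits = [
    ("A", "Foo", datetime(2020, 6, 1, 12, 0), 4),
    ("A", "Foo", datetime(2020, 6, 1, 12, 10), -3),
    ("A", "Bar", datetime(2020, 6, 1, 13, 0), 250),
    ("B", "Foo", datetime(2020, 6, 1, 9, 0), 1200),
    ("B", "Foo", datetime(2020, 6, 1, 11, 0), 800),
]


def pct_small_edits(edits, user, threshold=10):
    """Percentage of a user's edits whose absolute size change is below threshold."""
    sizes = [abs(delta) for u, _, _, delta in edits if u == user]
    if not sizes:
        return 0.0
    return 100.0 * sum(size < threshold for size in sizes) / len(sizes)


def avg_gap_seconds(edits, user, article):
    """Average time in seconds between consecutive edits by one user to one article.

    Returns None when the user has fewer than two edits to the article.
    """
    times = sorted(t for u, a, t, _ in edits if u == user and a == article)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    return mean(gaps) if gaps else None


print(pct_small_edits(edits, "A"))
print(avg_gap_seconds(edits, "A", "Foo"))
print(avg_gap_seconds(edits, "B", "Foo"))
```

A typo-fixer like hypothetical user "A" scores high on the small-edit percentage, which is the situation Guy Macon describes; whether that alone would flag an account depends on how a model combines it with the other features.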