Wikipedia:Bots/Requests for approval/Cyberbot II 4: Difference between revisions

*'''Comment''' - You may want to change the edit summary from "Tagging page with Spam-links" to "Tagging page with [[Template:Spam-links]]" to make it clear that the bot isn't adding links, but adding a template. Giving users a link to the template documentation may also help reduce the number of comments you get on your bot's talk page. [[User:GoingBatty|GoingBatty]] ([[User talk:GoingBatty|talk]]) 14:42, 26 August 2013 (UTC)
*:Good idea.—[[User:C678|<span style="color:green;font-family:Neuropol">cyberpower</span>]] [[User talk:C678|<sup style="color:olive;font-family:arnprior">Chat</sup>]]<sub style="margin-left:-4.4ex;color:olive;font-family:arnprior">Online</sub> 15:01, 26 August 2013 (UTC)
*'''Comment''' Frankly, these spam-link tags are massive, too invasive, and even disturbing for readers. [http://en.wikipedia.org/w/index.php?title=Texas_gubernatorial_election,_2014&diff=570208810&oldid=568846931] They should have a more modest and reasonable size and position. [[User:Cavarrone|'''C'''avarrone]] 11:10, 28 August 2013 (UTC)

Revision as of 11:10, 28 August 2013

Operator: Cyberpower678 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 02:04, Thursday June 27, 2013 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): PHP

Source code available: No

Function overview: Tag all pages containing blacklisted links from the MediaWiki:Spam-blacklist and the meta:Spam blacklist with {{Spam-links}}

Links to relevant discussions (where appropriate): Wikipedia:Bot_requests#Unreliable_source_bot

Edit period(s): Daily

Estimated number of pages affected: Unknown. Probably hundreds or thousands at first

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: This bot scans the above-mentioned lists and tags any article-namespace page containing a blacklisted link with {{Spam-links}}.
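The flow those function details describe can be sketched roughly as follows. Only buildSafeRegexes is taken from the code review further down this page; the other helper and page-method names are assumptions, since the bot's source is not public.

```php
<?php
// Sketch only: fetchBlacklist() and the Peachy-style page methods are
// illustrative assumptions, not the bot's actual API.
$blacklistRegexes = buildSafeRegexes(array_merge(
    fetchBlacklist('MediaWiki:Spam-blacklist'),
    fetchBlacklist('meta:Spam blacklist')
));

foreach ($articlePages as $page) {
    $found = array();
    foreach ($page->get_external_links() as $link) {
        foreach ($blacklistRegexes as $regex) {
            if (preg_match($regex, $link)) {
                $found[] = $link;  // record the blacklisted link
                break;             // one match per link is enough
            }
        }
    }
    if (!empty($found)) {
        // Tag the top of the page, listing the matched links.
        $page->edit('{{Spam-links|' . implode('|', $found) . "}}\n" . $page->get_text());
    }
}
```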

Discussion

Since the sites on those lists have been determined to be spam, would it be better to simply remove those links? Would your bot only consider external links, or also references? Thanks! GoingBatty (talk) 02:33, 27 June 2013 (UTC)[reply]

I believe it would be better to simply tag them instead of removing them; removing them might end up breaking something. I can have my bot remove them instead if that is what is preferred, or if the MediaWiki software turns out to inhibit the bot. As for your questions, it would handle any link matched in article space.—cyberpower ChatOnline 02:40, 27 June 2013 (UTC)[reply]
External links are not {{unreliable source}}, they're external links. Also, you should probably skip links listed at MediaWiki:Spam-whitelist. And note there are also links on the blacklist that aren't there because anything using that link is unreliable, e.g. any url shortener is there because the target should be linked directly rather than via a shortener. Anomie 11:24, 27 June 2013 (UTC)[reply]
Thanks for the input. I could remove external links while tagging refs with {{unreliable source}}.—cyberpower ChatOffline 12:32, 27 June 2013 (UTC)[reply]

Wait, you're tagging external links that are listed on the spam blacklist with {{unreliable source}}? Unless I'm missing something here this won't work. When the bot tries to save the page, it will hit the blacklist and won't save. --Chris 13:55, 27 June 2013 (UTC)[reply]

I have considered that possibility, which is why my alternative is to simply remove the link and refs altogether.—cyberpower ChatOnline 14:51, 27 June 2013 (UTC)[reply]
In that case, I think this is something better dealt with by a human. Simply removing external links will probably lead to a bit of "brokenness" in the article where the link was, and would need human intervention to clean up after the bot. Also, if the article does have blacklisted links it in, chances are it probably has other problems (e.g. the entire article could be spam), so it would be preferable to have a human view the article and take action. I think if you want to continue with this task, the best thing to do would be for the bot to create a list of pages that contain blacklisted links, and post that for users to manually review. --Chris 15:18, 27 June 2013 (UTC)[reply]
I'm not certain the software will block the bot's edits if the spam link is already there. I was thinking more along the lines that the tags place the page in a category that humans can then review. If it can't tag it next to the link, maybe it can tag the page instead and place it in the same category. What do you think?—cyberpower ChatOnline 15:30, 27 June 2013 (UTC)[reply]
As I understand it, if the spam link is already on the page, the software will block the edit anyway. --Chris 16:00, 27 June 2013 (UTC)[reply]
Hmmm. I'm looking at the extension that is responsible. If the software blocks any edit to a page that already contains the link, that would likely cause a lot of problems on-wiki. I'll have more info later tonight.—cyberpower ChatOffline 16:57, 27 June 2013 (UTC)[reply]
{{BAGAssistanceNeeded}} I have tested the spam filter extensively on the peachy wiki. Tagging blacklisted links will not trip the filter, nor will removing them or re-adding a link that already exists on the page. Modifying a link, or adding one to a page where it is not yet present, will trip the filter.—cyberpower ChatOffline
Ok, I stand corrected. I'd like to review the source code for this bot. --Chris 12:40, 3 July 2013 (UTC)[reply]
Also can you give a bit more detail on exactly how the bot will operate? Will it only be tagging references, or will it remove external links as mentioned above? How will the bot deal with any false positives? Will it skip links listed on MediaWiki:Spam-whitelist? Will it be possible to whitelist other links (e.g. url shorteners as mentioned by Anomie), that shouldn't be tagged as unreliable? --Chris 12:47, 3 July 2013 (UTC)[reply]
The bot code is not yet fully completed as of this writing. I seem to be hitting resource barriers: because it processes an enormous number of external links, I am working on conserving memory, and the regex scan is quite a resource hog whose efficiency I am also trying to improve. Yes, it will obey the whitelist. Because there is a risk of breaking things when removing a link, and tagging references can lead to false positives, I thought about placing a tag at the top of the page listing the links it found. False positives can be reported to me, or to an admin, who will modify a .js page in my userspace to add or remove an exception, which the bot will read before it edits pages.—cyberpower ChatOnline 14:00, 3 July 2013 (UTC)[reply]
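The exceptions mechanism described above might look something like this. The userspace page name and the line format are assumptions; the discussion does not specify either.

```php
<?php
// Sketch of the exceptions list: a .js page in the operator's userspace,
// one exception per line. Page name and entry syntax are hypothetical.
function loadExceptions($wiki) {
    $raw = $wiki->initPage('User:Cyberpower678/Spam-link exceptions.js')->get_text();
    $exceptions = array();
    foreach (explode("\n", $raw) as $line) {
        $line = trim($line);
        if ($line !== '' && $line[0] !== '#') {  // skip blanks and comments
            $exceptions[] = $line;
        }
    }
    return $exceptions;
}

// Before editing, skip any page an admin has listed as an exception.
function isException($title, $exceptions) {
    return in_array('page=' . $title, $exceptions);
}
```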
  • The script is now finished. Chris G. has the code and is reviewing it. The task will seek out blacklisted external links and tag the pages containing them. Exceptions can be added for specific cases and it reads the whitelist too.—cyberpower ChatOnline 15:33, 24 July 2013 (UTC)[reply]
Although this page states that blacklisted external links will be tagged with {{unreliable source}}, Wikipedia:Village_pump_(miscellaneous)#New_Bot states that they will be tagged with {{spam-links}}. Could you please clarify? Thanks! GoingBatty (talk) 23:38, 24 July 2013 (UTC)[reply]
Some changes were made since the filing of this BRFA. I have now amended the above.—cyberpower ChatOffline 05:43, 25 July 2013 (UTC)[reply]

Review:

  • Try to avoid using gotos wherever possible. They make code hard to read and often lead to strange bugs. E.g. at line 86, instead of:
    if( empty($blacklistregexarray) ) goto theeasystuff;
    else $blacklistregex = buildSafeRegexes($blacklistregexarray);

You could have written:

    if( !empty($blacklistregexarray) ) {
           $blacklistregex = buildSafeRegexes($blacklistregexarray);
           <LINES 89 - 112>
    }

 Done all labels removed.—cyberpower ChatOnline 12:24, 25 July 2013 (UTC)[reply]

  • Line 13 - Why the while loop? Unless there is a continue I am missing somewhere, it seems to just run once and break at line #156

 Already done The break command was a remnant from the debugging period. It's removed now.—cyberpower ChatOnline 12:24, 25 July 2013 (UTC)[reply]

  • Line 36 - While str_replace should work 99% of the time, it would be best practice to use substr instead, e.g.:
substr($exception[0],strlen("page="))

 Done Missed this one.—cyberpower ChatOnline 12:44, 25 July 2013 (UTC)[reply]
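The corner case behind that review point: str_replace removes every occurrence of the prefix, while substr strips only the leading one. A minimal illustration (the example value is made up):

```php
<?php
$exception = array("page=Talk:page= vandalism cleanup");  // hypothetical entry

// str_replace removes *all* occurrences of "page=", mangling the title:
echo str_replace("page=", "", $exception[0]); // Talk: vandalism cleanup

// substr strips only the leading prefix, preserving the rest verbatim:
echo substr($exception[0], strlen("page=")); // Talk:page= vandalism cleanup
```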

  • Lines 127 - 131, you seem to be checking that the API hasn't returned a blank page? This should really be done at the framework level, not in the bot code. Basically, you should check that the HTTP code == "200"; if it doesn't match, sleep for 1 second and try again. If it fails again, sleep for 2 seconds, and so on. But this should be done at the framework level, so you don't have to worry about it each time you use "$pageobject->get_text();" (in fact, it should be checked on all API queries)

 Already done You reminded me that I programmed that safeguard into the Peachy framework already. :p—cyberpower ChatOnline 12:24, 25 July 2013 (UTC)[reply]
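The framework-level safeguard being discussed amounts to retrying a request with growing sleeps until the server answers with HTTP 200. A minimal sketch; the function name is illustrative, not Peachy's actual API:

```php
<?php
// Retry an API request, backing off 1s, 2s, 3s... between attempts.
function apiGetWithRetry($url, $maxAttempts = 5) {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $context = stream_context_create(array('http' => array('ignore_errors' => true)));
        $body = @file_get_contents($url, false, $context);
        // file_get_contents() populates $http_response_header; element 0
        // is the status line, e.g. "HTTP/1.1 200 OK".
        if ($body !== false && isset($http_response_header[0]) &&
            strpos($http_response_header[0], ' 200 ') !== false) {
            return $body;
        }
        sleep($attempt);  // back off before the next attempt
    }
    return false;  // let the caller decide how to handle persistent failure
}
```

Putting this in the framework means every `$pageobject->get_text();` call inherits the retry behavior instead of each bot re-implementing it.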

  • Bug at line 165 - "else return true;" I think you want "return true;" after the foreach loop. Otherwise it only checks one of the whitelisted links.
        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) return false; 
                else return true;
            }
        }

vs.

        if( preg_match($regex, $link) ) {
            foreach( $whitelistregex as $wregex ) {
                if( preg_match($wregex, $link) ) 
                     return false; 
            }
            return true;
        }

 Fixed—cyberpower ChatOnline 12:24, 25 July 2013 (UTC)[reply]

  • General comment. Considering how many edits your bot is going to make, you should put a sleep(); somewhere in the code to make sure you don't hammer the servers. At the very least after each edit, if not every http request.

 Already done Framework has throttle.—cyberpower ChatOnline 12:24, 25 July 2013 (UTC)[reply]

  • lines 145ish - is it possible to get the page id in the same API request as you get the transclusions? That way instead of making 165,000+ API calls (for each page), you only make about 33 calls.

 Done—cyberpower ChatOnline 12:24, 25 July 2013 (UTC) --Chris 09:29, 25 July 2013 (UTC)[reply]
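The batching suggested above works because list=embeddedin already returns a pageid with every result row, and a flagged bot can request up to 5,000 rows per query — roughly 33 requests for 165,000 pages instead of one per page. A sketch; the apiGet() helper is illustrative, the parameter names mirror the MediaWiki API, and continuation is shown in the modern "continue" style:

```php
<?php
// Fetch the pageids of every page transcluding the template, in bulk.
$params = array(
    'action'  => 'query',
    'list'    => 'embeddedin',
    'eititle' => 'Template:Spam-links',
    'eilimit' => 'max',     // up to 5,000 per request with apihighlimits
    'format'  => 'json',
);
$pageIds = array();
do {
    $result = apiGet($params);  // illustrative HTTP helper
    foreach ($result['query']['embeddedin'] as $row) {
        $pageIds[] = $row['pageid'];  // pageid comes with each row
    }
    // Follow the continuation token until the list is exhausted.
    $cont = isset($result['continue']) ? $result['continue'] : null;
    if ($cont) $params = array_merge($params, $cont);
} while ($cont);
```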

  • AHHH. How did I not see that regex scan bug? D: Thanks for the input. I'll make the appropriate modifications now. I completely forgot that the framework was already designed to handle errors. :D
    Modifications finished.—cyberpower ChatOnline 12:44, 25 July 2013 (UTC)[reply]

Trial

Ok, we'll start with a small trial to make sure everything runs smoothly, and then we can move onto a much wider trial. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. --Chris 10:57, 2 August 2013 (UTC)[reply]

It started out OK, but then something went horribly wrong and it started tagging pages with empty tags. I have stopped the bot for the moment and will be looking into what caused the problems.—cyberpower ChatOnline 12:46, 10 August 2013 (UTC)[reply]
Bug found. Bot restarted.—cyberpower ChatOnline 19:58, 10 August 2013 (UTC)[reply]
Trial complete. I haven't looked at the edits yet as it's currently the middle of the night here.—cyberpower ChatOffline 00:32, 12 August 2013 (UTC)[reply]

Even after the restart, 2 pages had blank tags added (1, 2). Also, maybe non-article pages should be skipped (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17), unless there is some reason that these should have the links removed. Jay8g [VTE] 01:18, 13 August 2013 (UTC)[reply]

Thank you. I am already looking into the bug and am working on excluding namespaces.
The bugs have been fixed. The exceptions list now supports entire namespaces.—cyberpower ChatOnline 12:14, 15 August 2013 (UTC)[reply]

Approved for extended trial (1000 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Although I would ask that you do them in batches (maybe 100 or 200 edits at a time) --Chris 10:16, 24 August 2013 (UTC)[reply]