Wikipedia:Bots/Requests for approval/Coreva-Bot 2: Difference between revisions

Revision as of 02:41, 31 October 2009

Operator: Excirial (Contact me, Contribs) 18:42, 7 January 2009 (UTC)

Automatic or Manually Assisted: Fully automatic, with the possibility to manually override the bot's behavior if desired.

Programming Language(s): VB.NET

Function Summary:

  • Query the Wikipedia API every X minutes (idea: roughly 5-10 minutes) for new pages.
  • If the bot is cold started, fetch the new page list with the last X (idea: 25-50) pages. (See: Note 1)
  • If the bot is already running, only fetch the list of pages created since the last visit.
  • If the bot has found any new pages, load each page's content and start parsing it.
  • The bot parses the content to determine whether any maintenance tags have to be placed.
  • If a maintenance tag is needed, add the tag to the article and continue with the next article.
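
The cold-start versus warm-run selection in the list above can be sketched as follows. This is an illustration in Python, not Coreva's VB.NET code: the function name `pages_to_process`, the `(title, created)` tuple layout and the default limit of 25 (taken from the low end of the "25-50" idea) are all assumptions, and the new page list is passed in as plain data rather than fetched from the API.

```python
from datetime import datetime, timedelta

# Assumed default, taken from the low end of the "25-50" idea above.
COLD_START_LIMIT = 25


def pages_to_process(new_page_list, last_visit=None,
                     cold_start_limit=COLD_START_LIMIT):
    """Pick the pages that one polling cycle should parse.

    new_page_list: list of (title, created) tuples, newest first,
                   standing in for the result of the new-pages query.
    last_visit:    datetime of the previous poll, or None on a cold start.
    """
    if last_visit is None:
        # Cold start: only look at the most recent N pages.
        return new_page_list[:cold_start_limit]
    # Warm run: only pages created since the last visit.
    return [(title, created) for (title, created) in new_page_list
            if created > last_visit]
```

On a warm run the caller would store the time of the current poll and pass it as `last_visit` on the next cycle.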

Edit period(s) (e.g. Continuous, daily, one time run): Continuous

Edit rate requested: At most 1 edit per new page. (Estimated 10 edits a minute; this is currently a test setting that can be lowered.)

Already has a bot flag (Y/N): (Not applicable, new bot)

Function Details:
Note: Coreva-Bot had a previous bot request, located here. Prototypes of the previous idea behind Coreva showed that it would be virtually useless. This request is for a functionally completely different bot (but with an identical name).

Coreva's main task is placing maintenance tags on new pages that require them, similar to the way most new page patrollers work their beat. Coreva will regularly (every 5-10 minutes) check the new page list for new articles, fetch each new article's content, parse the content (See: Parser Table) and finally update the article, adding the required maintenance tags.

Just like the previous Coreva, this one should also be quite light on server resources. The bot queries the server's new page list every 5-10 minutes, and (so far) each article requires two server queries (one to get the article's content, and one to check whether the article is an orphan). Category counts, link counts et cetera are handled internally by the bot. Additionally, the bot will require one database write to add the templates (in case this is required). The estimated edit rate for the bot will be 2 edits per minute on average. (See: Note 2)

Coreva is not a miracle, and will never replace a living new page patroller. Coreva cannot patrol for WP:CSD and does not understand hoaxes, advertising or vandalism. However, a lot of articles slip off the new page list without having any form of maintenance tags. About half the pages on the new page list show as not being patrolled, and even though this is a very rough guess, this equals more than 2,000 pages a day. (See: Note 3) Since adding maintenance tags is thoroughly boring work, I think Coreva could spare quite a few patrollers a bit of boredom :). (Unlike CSD tags, which require at least some form of using your brain, maintenance tags require nothing more than checking 20 indicators, most of them nothing more than: Present/Not present.)

Finally, just like the old Coreva, it's still pretty much a work in progress, which is only done in spare time. While the progress on this Coreva is much faster than on the previous one, I assume it will still take a few months before it is capable of being a fully automated bot. Even if it were technically capable of doing so, it will not be a fully automatic bot until I have tested it thoroughly (a few weeks, I guess) in assist mode, which means Coreva would only give me feedback on what tag it would place on every page it checks. This way any annoying mistakes in the parser should be ironed out, while at the same time it allows me to improve the parser code.

Parser Table

This table gives an overview of the templates Coreva will be placing on articles, along with the current criteria configuration for doing so. Note that this is still pretty much in beta stage; templates may be added and removed depending on tests. Also, the criteria are still based on very simple algorithms. Coreva's tests are conducted on a very small and varied set of locally stored articles, so the criteria are still general. In their current form they should, however, produce very few false positives (but would likely have quite a few false negatives). So all in all: Work in progress! (See: Note 4)

| Tag | Criteria | Comment |
|---|---|---|
| Wikify | No internal links | Amount of internal links = 0 |
| Uncat | No categories in the article | Amount of categories = 0 |
| Unreferenced | No references in the article | No Ref tags or references/notes header detected. |
| Footnotes | Article contains a standard "Notes" or "References" header, but no Ref tags | - |
| Internal Links | Article contains fewer than (amount of words / amount of links) internal links | Percentage not yet set. |
| Orphan | Article is linked by (0-2) articles | - |
| Stub | Article size is smaller than X | Suggestion: <1kb / 100 words / 1000 characters (inc. spaces) |
| Sections | Article contains too few sections for readability's sake (Note: section equals a linebreak) | < 6 sections counter && (Amountofsections * 2500) > articlesize |
| Too many links | To be determined | For this I still need to analyze guidelines, and the appropriate category. |
| Too many categories | Amount of categories > X | X: 10? 20? 30? Depends quite a bit on the article size. Perhaps a base of, say, 10, and another cat for every x words. (For example, World War II has 42 cats, but it's a huge article.) |
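
As an illustration of how the simplest rows of the table could be checked, here is a hedged Python sketch covering only Wikify, Uncat, Unreferenced and Footnotes. The regular expressions and the function name `suggest_tags` are assumptions made for this example; they are not the regexes Coreva uses, and they ignore edge cases such as links inside comments or nowiki blocks.

```python
import re

# Assumed patterns; simplified for illustration.
CATEGORY_RE = re.compile(r"\[\[\s*Category\s*:[^\]]+\]\]", re.IGNORECASE)
LINK_RE = re.compile(
    r"\[\[(?!\s*(?:Category|File|Image)\s*:)[^\]]+\]\]", re.IGNORECASE)
REF_TAG_RE = re.compile(r"<ref[\s>/]", re.IGNORECASE)
REF_HEADER_RE = re.compile(
    r"^=+\s*(?:Notes|References)\s*=+\s*$", re.IGNORECASE | re.MULTILINE)


def suggest_tags(wikitext):
    """Return the tags the four simplest table rows would suggest."""
    tags = []
    if not LINK_RE.search(wikitext):
        tags.append("Wikify")        # amount of internal links = 0
    if not CATEGORY_RE.search(wikitext):
        tags.append("Uncat")         # amount of categories = 0
    has_ref_tags = bool(REF_TAG_RE.search(wikitext))
    has_ref_header = bool(REF_HEADER_RE.search(wikitext))
    if not has_ref_tags:
        # A Notes/References header without Ref tags suggests Footnotes;
        # neither header nor Ref tags suggests Unreferenced.
        tags.append("Footnotes" if has_ref_header else "Unreferenced")
    return tags
```

The remaining rows (Orphan, Stub, Sections and the "too many" checks) depend on thresholds the table leaves open, so they are left out of the sketch.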

Notes

  • Note 1: A second idea is to let the bot store its last query time permanently, and query all new (non-patrolled) pages created since the bot went offline. These pages could then be processed at a lower priority, meaning that they would only be processed once the bot runs out of new pages to process, with a limit on the number of pages processed each minute. (So, if the bot limited itself to 5-10 edits a minute tops, it would mean that 3-8 old low-priority pages could be processed a minute.) Being in the CET timezone, this would translate to a queue of 250 or so pages generated overnight that could be processed during the remainder of the day.
  • Note 2: I am currently in doubt whether the bot should notify the user with a template when maintenance tags are placed, encouraging the article creator to recheck the page while it is still "warm". This would double the bot's database writes, and at the same time I cannot predict whether users are averse to being templated, or whether anyone would change an article (or ask for help). On the other hand: if a user created a page on the basis of web sites, warning them that no sources were added could save other users a hell of a lot of time otherwise wasted verifying all the article's content through web searches.
  • Note 3: This is based on statistics from May 2006. During that time Wikipedia got 3600 new articles a day. Nowadays the number is almost certainly quite a bit higher, but due to the difference between peak and normal hours, it's quite hard to make a guess based on special:newpages. :)
  • Note 4: It's rather obvious, but since I didn't mention it: Coreva does not add templates to pages marked for CSD, and does not add templates that already exist.
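
The two-priority idea in Note 1 can be sketched as a per-minute budget: fresh pages are served first, and whatever budget is left goes to the backlog built up while the bot was offline. This is a Python illustration under assumed names (`plan_minute`, `fresh`, `backlog`), not a description of how Coreva schedules its work.

```python
from collections import deque


def plan_minute(fresh, backlog, max_edits_per_minute):
    """Choose the pages to process in one minute.

    fresh:   deque of newly created pages (high priority).
    backlog: deque of pages created while the bot was offline (low priority).
    Pages are removed from the deques as they are scheduled.
    """
    batch = []
    # Fresh pages always go first.
    while fresh and len(batch) < max_edits_per_minute:
        batch.append(fresh.popleft())
    # Any budget left over is spent on the offline backlog.
    while backlog and len(batch) < max_edits_per_minute:
        batch.append(backlog.popleft())
    return batch
```

With a cap of 5 edits a minute and 2 fresh pages waiting, 3 backlog pages would be processed in that minute, which matches the low end of the "3-8" estimate in Note 1.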

Discussion

Reopening request

Over the past two or so months the amount of time I could spend on Wikipedia was drastically reduced due to other duties, causing a certain lapse in Coreva's development. Another issue halting development progress was caused by an old programmer's trap: building a patched-together prototype that should have been thrown away once I had a proof of concept that it actually worked, and instead keeping the prototype and resuming work on it, which eventually led to a horrible code mess and a completely incomprehensible program. In the past month I finally found the time and willpower to use a step-through debugger throughout the entire program to decipher and salvage the mess as much as possible, before rewriting Coreva from scratch, save for a few salvaged functions that actually worked.

The actual workings of the bot have changed very little from the table I added above. I dropped the STUB, TOMANYCATS and TOMANYLINKS tags due to them being prone to false positives. I am currently testing a module that can detect peacock pages (based upon statistical analysis, weighted word lists and some basic calculations). So far it works fine when comparing featured articles versus peacock articles (1 false positive on 270 correct tags), but the calculation algorithm makes too many mistakes on small articles, so it's disabled for now.
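
The peacock module itself is not described in detail, so the following is only a guess at what a weighted word list check might look like, written in Python. The word list, the weights, the minimum word count and the function name `peacock_score` are all hypothetical. Returning no score for short texts mirrors the remark that small articles produce too many mistakes.

```python
import re

# Hypothetical weights; the real word lists are not given in this request.
PEACOCK_WEIGHTS = {
    "legendary": 3.0,
    "world-class": 3.0,
    "renowned": 2.0,
    "acclaimed": 2.0,
    "famous": 1.0,
}
MIN_WORDS = 50  # hypothetical cut-off below which no score is produced


def peacock_score(text, weights=PEACOCK_WEIGHTS, min_words=MIN_WORDS):
    """Return the weighted peacock-term density, or None for short texts."""
    # Lower-case word tokens; hyphenated words count as one token.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    if len(words) < min_words:
        return None  # too little text to judge reliably
    total = sum(weights.get(word, 0.0) for word in words)
    return total / len(words)
```

A caller would compare the score against a threshold tuned on known-good and known-peacock articles; no such threshold is stated here, so none is assumed.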

Et Cetera

  • The bot language switched from C#.net to VB.net to force recoding, rather than copy-pasting while rewriting it.
  • Per suggestion, the bot dates every tag, and uses articleissues over single templates when more than 2 tags are placed.
  • The bot won't check redirects, pages marked as disambiguation and pages marked for CSD, nor any pages that have already been removed.
  • The bot won't double-tag. It can detect already placed {{ArticleIssues}} tags; similarly it can detect single-level templates, along with every listed alias of those templates.
  • Since the bot won't be running all day, it remembers the article it last tagged before shutting down. When started again it will proceed where it left off. (This can be manually reset after vacations, etc.)
  • By default the bot checks the API for new pages every 30 minutes. The bot keeps those stored in a database, forming a buffer of articles to tag, with older articles having priority. Tests showed it is extremely rare for the bot to tag articles younger than 5 minutes. I hope this covers the "Pages take time" argument.
  • Edit rates are currently locked at 1 edit every 6 seconds. I am open to any advice regarding this rate.
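
Two of the points above, dated tags with a combined template for more than 2 tags, and an oldest-first buffer, can be sketched in Python as follows. The function names, the exact parameter layout of the combined template and the `min_age` parameter are assumptions for illustration; the list only says that older articles have priority, not that a minimum age is enforced.

```python
from datetime import datetime, timedelta


def render_tags(tags, date_label):
    """Render dated maintenance tags as wikitext.

    More than 2 tags are merged into one combined template; the parameter
    layout (tag name as key, date as value) is an assumption.
    """
    if not tags:
        return ""
    if len(tags) > 2:
        params = "|".join(f"{tag.lower()}={date_label}" for tag in tags)
        return "{{ArticleIssues|" + params + "}}"
    # One or two tags: separate templates, each with a lowercase date parameter.
    return "\n".join("{{" + tag + "|date=" + date_label + "}}" for tag in tags)


def next_to_tag(buffer, now, min_age=timedelta(minutes=5)):
    """Return the oldest buffered (title, created) entry, or None.

    min_age is a hypothetical safeguard: entries younger than it are skipped.
    """
    eligible = [entry for entry in buffer if now - entry[1] >= min_age]
    if not eligible:
        return None
    return min(eligible, key=lambda entry: entry[1])
```

A fixed pause of 6 seconds between saves (for example `time.sleep(6)` after each edit) would give the edit rate mentioned in the last bullet.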

Excirial (Contact me, Contribs) 21:05, 11 June 2009 (UTC)

Bots that add tags to articles tend to be controversial. See Wikipedia:Bots/Requests for approval/Erik9bot 9, where a proposal to add {{unreferenced}} to all unreferenced articles had to be modified in order to overcome objection from a number of editors. Some people feel that visible cleanup templates on articles detract from the reader's experience, and should not be used. – Quadell (talk) 14:46, 14 June 2009 (UTC)
Well, if there is consensus I will add categories instead of visible templates, but I find myself surprised by the discussion at Erik9bot 9. The majority of the tags added to new articles these days are placed using Friendly, and those are all visible tags. Similarly, the 500k or so articles in WP:BACKLOG all use visible tags; why should a bot that does exactly the same work be subject to an issue called "Ugly Templates"? If this should be changed I would much rather see an RFC that changes this Wikipedia-wide, rather than through individual judgement that will only create a mass of different styles instead of uniformity. For now I won't pre-emptively change this, as I see no consensus on it. Excirial (Contact me, Contribs) 07:23, 16 June 2009 (UTC)
"if there is consensus i will add categories instead of visible template": I oppose adding hidden categories without tags unless the bot regularly runs a review of edited articles to remove the categories when the parameters are no longer met. BirgitteSB 14:27, 17 June 2009 (UTC)
It will not. Coreva will only tag an article once it is created. Technically it could easily iterate through every Wikipedia article, but that would be inefficient to say the least. The intent of the bot is to give new articles some basic improvement advice and traceability. New editors can see what they should improve, and it prevents articles from fading into the great unknown because they are not linked to or from anything else. I assume that was the basis of the original criticism: the other bot would scan a mass of (long-standing) articles and add visible templates to them. Excirial (Contact me, Contribs) 14:47, 17 June 2009 (UTC)
I can't say whether that was the basis of the original criticism or not. I do not object to adding visible tags. But I disagree with this idea that because people object to that it will be OK to just add hidden categories. Without any sort of visible prompt, people are not going to remove these categories as the article matures. The categories will be filled with false positives by the time the backlogs are worked through to these months. BirgitteSB 18:37, 17 June 2009 (UTC)

Approved for trial (10 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. This is a very long RfBA, and the specs have changed throughout and are difficult to follow. I think the best way for all parties to understand what this bot would do is to give it a very small trial. – Quadell (talk) 13:12, 18 June 2009 (UTC)

It seems that Coreva needs a slight correction. On the first two edits I made a manual mistake, setting the bot to "Show only tags" mode, which caused it to blank a page. On the next two pages I noticed that I had made a slight error in the saving code: Coreva accidentally saved the article it was currently checking to the page it was processing, causing an overwrite with the wrong page contents.
My fault for assuming that a not thoroughly tested function would just work! It should not take too long to fix this though. I'll run Coreva in diagnostics to test it, and after that I will resume the test run. Excirial (Contact me, Contribs) 17:42, 18 June 2009 (UTC)
{{BotTrialComplete}} - I took the liberty of making a new set of 10 edits after I fixed the above issue. It proved to be a minor issue where I confused two functions, one used for working on the NEXT page, and one that was used for the current page (causing it to mix up two pages). The new edits are marked 0 to 9. The error on tag 9 - the incorrect addition of a sections template - has already been fixed. Excirial (Contact me, Contribs) 18:24, 18 June 2009 (UTC)
Another issue, the incorrect dating of maintenance tags (it didn't include "Date="), has also been solved now. Excirial (Contact me, Contribs) 07:54, 19 June 2009 (UTC)

Approved for trial (20 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Okay, let's have another go. – Quadell (talk) 22:38, 22 June 2009 (UTC)

Trial complete. Again a few slight bugs, which have been fixed. In retrospect it might have been wiser to develop Coreva for a single template and add additional functionality later on - at least it would have prevented the need for repeated trials.
* Bug 1: Incorrect tagging with Unref templates. - Not really a bug. I optimized the regex and missed a character in the process, causing it to always fail.
  • Umm, of course it's not a bug if you foresaw the false positives, but not very considerate to leave it to others to fix them! Why don't you manually review flagged articles first? Sparafucil (talk) 00:33, 29 June 2009 (UTC)
  • Of course I did not foresee the errors; if I had, I would have fixed them before running Coreva, wouldn't you think :)? What I meant with "Not really a bug" is that I already had the correct code, but that I made a small copy-paste error in it, causing it to malfunction. As for cleaning this one up, I went after those myself and reverted them. Excirial (Contact me, Contribs) 06:53, 29 June 2009 (UTC)
Ah, we're perhaps talking about different things then, and I'm sorry for the sarcasm. What got my temper up was [edit] in the article space, labeled as a trial edit. Why should checking for references be done by a bot? The article clearly is based on the Encyclopedia article given at the end. Sparafucil (talk) 23:01, 29 June 2009 (UTC)
* Bug 2: Tagging of a disambig page. - Caused by the use of a specialized disambiguation template ({{Hndis}}). Coreva now checks for these, and for every other template listed in that template's "See Also" section. Never even knew those existed :).
* Bug 3: Dating categories incorrectly. - Apparently an uppercase "Date" is not accepted as a parameter, so I use "date" now, which is accepted.
Excirial (Contact me, Contribs) 20:37, 28 June 2009 (UTC)
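
The fix for Bug 2 amounts to checking a page's templates against a list of disambiguation templates. A minimal Python sketch of that idea follows; the set of names is a small illustrative subset rather than the list Coreva checks, and `is_disambiguation` is a hypothetical name.

```python
import re

# Illustrative subset only; the full list of disambiguation templates is longer.
DISAMBIG_TEMPLATES = {"disambig", "disambiguation", "hndis", "geodis"}

# Captures the template name that follows "{{", up to the first "|" or "}}".
TEMPLATE_NAME_RE = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def is_disambiguation(wikitext, names=DISAMBIG_TEMPLATES):
    """Return True if the page transcludes a known disambiguation template."""
    return any(match.group(1).strip().lower() in names
               for match in TEMPLATE_NAME_RE.finditer(wikitext))
```

Comparing names in lower case keeps the check insensitive to capitalization, so `{{Hndis}}` and `{{hndis}}` are treated alike.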
If you haven't put one in (I couldn't see a mention from skimming through this page), then you should add a limit to how soon after creation the bot is "allowed" to tag, because it's possible (although highly unlikely, since it doesn't run very often) that it would get in the way of deletion taggers etc. It should only mark articles which have been around for a few hours or so. - Kingpin13 (talk) 19:33, 2 July 2009 (UTC)
* Bug 4: Adding requests for inline references to one- or two-sentence articles. This makes it hard to read the article, particularly after SmackBot moved the huge banner to the center of the article. Footnotes are useful for the reader, but the banner has more text than the article, and with so little text, footnoting is not so urgent. Please look at the articles and watch-list them while your bot is in its trial phases. If the banner overpowers the article it should not be there. Consider that very short stubs (about 32 words of text) with banners with the same number of words may not need the banner and a couple of bots modifying the article. --69.226.103.13 (talk) 07:11, 4 July 2009 (UTC)
{{OperatorAssistanceNeeded}} Any news on bug four or the status of this request? MBisanz talk 22:04, 18 July 2009 (UTC)

Re-Opened (Yet again *Sigh*)

Due to some unforeseen circumstances I have been almost completely inactive for the last 3 or so months, causing this bot request to expire yet again. Having finally found some spare time to work on this bot again, I would like to reopen this RFBA.

As for the current status: Bug number 4 is now solved; Coreva will only add the footnotes template to pages of substantial length. It also now converts ampersands and other reserved HTML characters correctly before saving the page, and I have updated the regexes used to determine whether a template should be placed, thus reducing the number of false positives. Excirial (Contact me, Contribs) 22:11, 30 October 2009 (UTC)
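
What counts as "substantial length" is not stated. As a sketch only, the Bug 4 fix could be a word-count gate applied to the suggested tags; in the Python below the threshold of 150 words and the name `filter_tags` are hypothetical, and words are counted by naive whitespace splitting of the raw wikitext.

```python
# Hypothetical threshold; the actual cut-off is not given in this request.
MIN_WORDS_FOR_FOOTNOTES = 150


def filter_tags(tags, wikitext, min_words=MIN_WORDS_FOR_FOOTNOTES):
    """Drop the Footnotes tag when the page is too short to warrant it."""
    word_count = len(wikitext.split())
    if word_count >= min_words:
        return list(tags)
    # Short page: keep every suggested tag except Footnotes.
    return [tag for tag in tags if tag != "Footnotes"]
```

The same gate could be widened to other banner-style tags if short stubs turn out to be overwhelmed by them as well.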

The intention is to tag 6-minute-old articles with maintenance templates? Why? Is there community consensus that an editor should have only 5 minutes to write before a bot tags the article? What might be missing, imo, is a few more minutes to write an article.
Personally, I'd let the bot finish it if my editing was interfered with in this manner. It takes hours to write an article. Sometimes I post a stub first. I'd like to see the community consensus for these tasks, for the templates to be added by a bot, and for the amount of time before adding the templates. It seems hostile if I understand the time frame correctly.
Also, how many templates will it add? It seems to say it will only add one, but which one of the many? Or will it add more? --69.226.106.109 (talk) 02:41, 31 October 2009 (UTC)