Wikipedia:Bots/Requests for approval/ContentCreationBOT

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by 69.225.3.119 (talk) at 01:04, 24 September 2009 (→‎Trial?: wt). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Operator: ThaddeusB

Automatic or Manually assisted: automatic, unsupervised

Programming language(s): Perl

Source code available: here

Function overview: fill in tables with data on prehistoric creatures

Edit period(s): one time run

Estimated number of pages affected: 13

Exclusion compliant (Y/N): N/A - this only applies to user & talk pages, correct?

Already has a bot flag (Y/N): N

Function details: Using a large database of information downloaded from http://paleodb.org and http://strata.geology.wisc.edu/jack/ the first function of this bot will be to fill in the tables found in various "list of" articles. A sample entry has been filled in here. Any data that is missing from the database will simply be left blank.

Only a tiny number of pages will be affected, but the amount of bot filled in content will be immense. As such, I am suggesting the bot trial be something along the lines of "the first 10 entries on each page" rather than a number of edits.

A copy of the database is available here (425k). The database is organized as follows:
Genus--Valid?--Naming Scientist--Year named--Time period it lived during--Approx dates lived--locations

  • A "1" in the valid column means it is currently listed as a valid genus, "NoData" means it couldn't be determined - most likely because there are two genera with the same name, and "No-{explanation}" means it is not currently listed as a valid genus.
  • Data preceded by a "*" means it was derived from Sepkoski's data, using the dates found here (compiled by User:Abyssal). All other data came from paleodb, using their fossil collection data for more precise dates (when available).
  • Spot checking of my data is encouraged, although I'm confident no novel errors have been introduced. If anyone knows of additional sources to derive similar data, let me know and I'll incorporate those sources into the database.
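The record layout described above can be illustrated with a short parsing sketch. This is not the bot's code (the bot is written in Perl); it is a minimal Python illustration that assumes each record is one line with exactly seven fields separated by "--", and that the "*" marker is a prefix on individual field values. The function names, the field names, and the sample record are hypothetical.

```python
from dataclasses import dataclass

# Field order as given in the database description (names are our own labels).
FIELDS = ("genus", "valid", "author", "year", "period", "dates", "locations")


@dataclass
class Status:
    label: str             # "valid", "invalid", or "unknown"
    explanation: str = ""  # only filled in for invalid genera


def parse_status(raw: str) -> Status:
    """Interpret the 'Valid?' column: "1", "NoData", or "No-{explanation}"."""
    raw = raw.strip()
    if raw == "1":
        return Status("valid")
    if raw == "NoData":
        return Status("unknown")
    if raw.startswith("No-"):
        return Status("invalid", raw[len("No-"):])
    raise ValueError(f"unrecognised validity value: {raw!r}")


def split_source(raw: str):
    """Strip a leading "*" and report which source the value came from."""
    raw = raw.strip()
    if not raw:
        return "", None  # missing data is simply left blank
    if raw.startswith("*"):
        return raw[1:].strip(), "Sepkoski"
    return raw, "paleodb"


def parse_record(line: str) -> dict:
    """Split one "--"-delimited record into named fields (assumed layout)."""
    parts = line.rstrip("\n").split("--")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["valid"] = parse_status(record["valid"])
    for name in ("period", "dates"):
        value, source = split_source(record[name])
        record[name] = value
        record[name + "_source"] = source
    return record


# Invented sample record, for illustration only.
sample = "Examplegenus--1--Hess--1955--*Late Bajocian--*170-165--Switzerland, Poland"
record = parse_record(sample)
```

Rejecting any line that does not split into exactly seven fields is a deliberate choice in this sketch: an explanation containing "--" would otherwise silently shift every later column.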

List of pages to be affected: (might be expanded slightly if others are found)

Discussion

Source code will be published shortly, although the code itself is a quite simple "read from db publish to Wikipedia" operation. --ThaddeusB (talk) 01:48, 3 September 2009 (UTC)

Now available here. --ThaddeusB (talk) 13:16, 8 September 2009 (UTC)

I have spammed asked the relevant projects for input: [1] --ThaddeusB (talk) 02:08, 3 September 2009 (UTC)

Bullets?

Maybe the countries should be a bulleted list to save horizontal space. IE:

  • Poland
  • Switzerland

as opposed to "Switzerland, Poland". Abyssal (talk) 15:30, 3 September 2009 (UTC)

Which is preferable:

Option 1
Genus | Authors | Year | Status | Age | Location(s) | Notes
Advenaster | Hess | 1955 | Valid | Late Bajocian to Late Callovian | Switzerland, Poland |
SampleEntry | Hess | 1955 | Valid | Early Cretaceous to present | Switzerland, Poland, United States, France |
Bad Genus | | | Invalid | | | rank changed to subgenus, name to Genus (Sub genus)

Option 2
Genus | Authors | Year | Status | Age | Location(s) | Notes
Advenaster | Hess | 1955 | Valid | Late Bajocian to Late Callovian | Switzerland |
 | | | | | Poland |
SampleEntry | Hess | 1955 | Valid | Early Cretaceous to present | Switzerland |
 | | | | | Poland |
 | | | | | United States |
 | | | | | France |
Bad Genus | | | Invalid | | | rank changed to subgenus, name to Genus (Sub genus)

or Option 3
Genus | Authors | Year | Status | Age | Location(s) | Notes
Advenaster | Hess | 1955 | Valid | Late Bajocian to Late Callovian | • Switzerland |
 | | | | | • Poland |
SampleEntry | Hess | 1955 | Valid | Early Cretaceous to present | • Switzerland |
 | | | | | • Poland |
 | | | | | • United States |
 | | | | | • France |
Bad Genus | | | Invalid | | | rank changed to subgenus, name to Genus (Sub genus)

Any is fine by me. --ThaddeusB (talk) 18:49, 3 September 2009 (UTC)
Definitely 2 or 3, although I don't care which. I love that you're using the sort template. <3 Also, could you have the year link to the "year in paleontology" article? And link to the countries? Abyssal (talk) 04:35, 4 September 2009 (UTC)
No problem, I will link the date & countries. --ThaddeusB (talk) 13:45, 4 September 2009 (UTC)
Hmm, WP:Context and all that - re countries. Surprise links to "year in" are not that great either. Rich Farmbrough, 09:51, 6 September 2009 (UTC).

Invalid genera should be in a separate table, because it is better for the general public. If there is any reason to have them together, then that could also be OK. Such tables will also be useful. --Snek01 (talk) 11:48, 4 September 2009 (UTC)

I'm neutral on linking the countries, although I will point out that such links would allow someone to easily figure out where in the world the fossil was found. I view the "year in" links as completely appropriate as naming of new genus is something that is/should be covered in those articles. --ThaddeusB (talk) 13:16, 8 September 2009 (UTC)
The bot is only going by the entries that already exist in the tables & it looks like there are only about 5 invalid entries across the dozen or so pages. As such, I'll leave it up to regular editors to pull the entries out of the main table rather than trying to write a function to do it.

Improper age sorting (fixed)

A sample update has been performed here. --ThaddeusB (talk) 13:16, 8 September 2009 (UTC)

I haven't had time to look closely at it, but I notice the age sorting isn't working right, and I can't figure out why. Any idea? Abyssal (talk) 18:31, 8 September 2009 (UTC)
Hmm, could you be more specific as it seems to work for me. The first click put it from most recent to oldest and the second from oldest to most recent. --ThaddeusB (talk) 00:57, 9 September 2009 (UTC)
The first 20% or so is alright, but after the Miocene (when viewed in ascending order) it starts listing Jurassic stages, then it goes back to Pleistocene, and for some reason Cretaceous ages are listed as if they were the oldest. It's looked this way both at home and at school. The browser I used is Firefox. Abyssal (talk) 19:09, 9 September 2009 (UTC)
OK, I figured out the problem. Apparently the {{sort}} template doesn't work properly with numbers, so everything is being sorted "alphabetically" - that is 1 < 10 < 2 etc. It was only coincidence that the first 20% is still correct (and I didn't look down far enough to realize the error). I'll add a fix for this tonight & re-upload the sample page. --ThaddeusB (talk) 20:05, 9 September 2009 (UTC)
Good sleuthing! Abyssal (talk) 22:28, 9 September 2009 (UTC)

I believe the latest upload fixes the issues. --ThaddeusB (talk) 21:19, 10 September 2009 (UTC)

How did you fix the problem? (I'll need to use the same method for other articles.) Abyssal (talk) 02:28, 11 September 2009 (UTC)
By adding enough zeros in front of the numbers to make them correctly sort as strings. That is, by converting "1.23" to "001.23", "45" to "045", etc. --ThaddeusB (talk) 20:41, 12 September 2009 (UTC)
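The zero-padding fix described above can be sketched as follows. This is an illustrative Python version, not the bot's Perl code; the function name and the default width of three integer digits are assumptions chosen to match the "1.23" to "001.23" and "45" to "045" examples.

```python
def pad_sort_key(value: str, int_width: int = 3) -> str:
    """Left-pad the integer part with zeros so string order matches numeric order."""
    whole, dot, frac = value.partition(".")
    return whole.zfill(int_width) + dot + frac


ages = ["10", "2", "1.23", "45"]

# Plain string sorting compares character by character, so "10" sorts before "2".
as_strings = sorted(ages)

# Sorting on the padded keys restores numeric order.
as_padded = sorted(ages, key=pad_sort_key)
```

The width has to be at least as large as the longest integer part in the column; a value with more digits than `int_width` would sort incorrectly again.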

Need for outside input

Just a thought, but in light of the Anybot debacle - it might be a good idea to put a call out to the WikiProjects and recruit some marine biologists/fossil guys/crustacean guys/etc. to come take a look at your trial edits and check them over with a fine tooth comb before the bot is given final approval. If Anybot taught us anything, it's how simple errors in interpreting database content can lead to masses of incorrect information going live to the 'pedia and remaining there for months, unnoticed. --Kurt Shaped Box (talk) 22:28, 10 September 2009 (UTC)

I certainly want several people to look over the data and have notified the 6 most relevant WikiProjects. So far, those notices don't seem to have attracted many people. :( --ThaddeusB (talk) 22:45, 10 September 2009 (UTC)
Can you try asking specific editors by the type of articles you intend to produce? Asking editors if they will check the data? --69.225.12.99 (talk) 02:34, 11 September 2009 (UTC)

13 pages? Why do we need to approve a bot for that? Run your program. Dump the output to a screen. Post it by hand. Preview. Save. A bot will save you a few minutes whilst vastly increasing the risk to the project. Hesperian 23:55, 10 September 2009 (UTC)

First off, let's not be ridiculous - me running a program locally & manually uploading the data is no less of a risk than me running a program locally & it automatically uploading the data. Now, there are several reasons I am requesting approval rather than just uploading the data:
  1. Yes it will only edit a few pages, but the amount of data it will import is immense, as we are talking about the automated filling of several thousand table entries. The amount of info that is being auto generated and added deserves community consent, IMO.
  2. I want as many people as possible to look it over to make sure the bot isn't adding inaccurate info. If I just uploaded it all in my name, it wouldn't get the same scrutiny.
  3. There is a planned second part of this task (automated creation of stubs) that will edit thousands of articles. This will be a separate BRFA, but the idea here is to get any bugs/inaccurate input data fixed on a relatively non-controversial task before moving onto a possibly controversial one. If the bot can handle accurately adding content to existing articles, then there is concrete evidence that it should be able to create stubs with prose based on the same information.
I hope that clears it up. --ThaddeusB (talk) 00:20, 11 September 2009 (UTC)
I obviously agree with Thad. This sort of tedious point-by-point extraction of information from a database is what Wikipedia bots are made for. As someone who has filled similar tables out manually, I can vouch that using a bot for this purpose is the most effective way to accomplish it. Abyssal (talk) 02:18, 11 September 2009 (UTC)
I agree with that, Abyssal. I run scripts like that myself. But you don't need a bot account to run a script against an external data source. You only need a bot account to post the formatted results to Wikipedia. Personally I prefer to run my scripts, examine the results, tweak the scripts and run them again if necessary, iterate, eventually load the results into an edit window, preview, tweak, and finally save. It is a lesser risk to do it this way. The risk is only the same if you are going to copy-paste-save without examining the results, just like a bot would do. And in that case, I cannot comprehend why a bot is necessary. Once you've generated the data, posting 13 pages by hand will take you 6½ minutes. Thaddeus, I sure hope you'll be spending more time than that implementing and testing your bot. So where is the benefit? I also dispute the scrutiny argument. People don't scrutinise bots more; they largely ignore them until they screw up royally. And no, you don't need to obtain community consent before you edit Wikipedia, even if you are posting big pages. Hesperian 05:42, 16 September 2009 (UTC)
I have already examined the input data, tested the bot, and reviewed its results fairly extensively. I've probably put around 40 hours into it in fact. I am well aware that I didn't need approval to do this task. I merely feel it is better to do it with approval than without. This vetting process has already led to some subtle improvements that likely would have never happened if I'd only reviewed the output on my own. --ThaddeusB (talk) 13:09, 16 September 2009 (UTC)

Discussion regarding some objections

In my opinion, I don't think this bot should go forward without proactive community support for the bot. This means more than no one disapproves or shows negative interest. It requires editors from relevant projects to get on board for vetting uploaded data. Without a group of editors to check data, it is my opinion the potential for another AnyBot type mess exists. Yes, this is the type of work that bots should be used for, in my opinion. But it requires a human editorial community to accompany the creation of articles. I'm also not thrilled with the sideways answers to some of my questions about this bot, before the RFBA. A single straight-forward answer to a question, when first asked, is more in keeping with the kind of communication that should be done when running bots that create articles, in my opinion. --69.225.12.99 (talk) 02:33, 11 September 2009 (UTC)
To be fair, no one asked me a single question. An IP (possibly you) asked Abyssal some questions, but they seemed to be directed towards his editing activities & not this bot. Additionally, he obviously didn't know the exact details of how the bot would operate since he wasn't programming it... and frankly I didn't know the exact details either since it wasn't complete yet. I have explained what the bot plans to do in this BRFA (which is the correct place to do so), released the source code, and released the database. I personally have manually checked dozens of entries and I'm pretty sure Abyssal has as well. If you are offering your help in spot checking, then please check however many entries you want from the database, against whatever data source you can. If you find any problems, by all means please tell me so I can make adjustments to the database.
Beyond that, there really isn't anything I can do. I can't force people to spot check the data. Nor can I, or anyone else, check more than a trivial % of the total. I have to rely on the accuracy of the data I obtained from reliable sources. I can't verify every piece of data by hand, but only confirm the general integrity of the sources from which the data came. --ThaddeusB (talk) 03:06, 11 September 2009 (UTC)
And BTW, the bot isn't going to be creating any articles at this time. --ThaddeusB (talk) 03:08, 11 September 2009 (UTC)
If the bot is going to be creating tables of data on wikipedia and there is not a single editor outside of the creators interested in the data, there appears to be no desire for the bot on wikipedia. If editors aren't willing to check the data and are not interested enough to comment on the bot, who wants the bot?
According to bot policy, to gain approval, a bot must :
  • be harmless
  • be useful
  • not consume resources unnecessarily
  • perform only tasks for which there is consensus
  • carefully adhere to relevant policies and guidelines
  • use informative messages, appropriately worded, in any edit summaries or messages left for users
Consensus involves discussion and other editors. No other editors = no discussion = no consensus for the task to be done, much less by a bot. --69.225.12.99 (talk) 06:57, 11 September 2009 (UTC)
You are mistaken about the way Wikipedia works. Discussion is not needed to take action, discussion is only needed if the action taken is met with objection. Being bold is a core principle. If we had to discuss every action first, very little would actually get done. The task clearly falls under policy so there is already consensus to do the task. This discussion is to establish that my bot can do the task accurately and efficiently.
Second, you mistake low input for lack of interest. Just because few people have commented here, doesn't mean no one is interested in the data. We are talking about scientific data on prehistoric creatures here, not Britney Spears. The audience interested in this material is limited, of course, but nearly everyone would agree that this sort of information is at least as important to have on Wikipedia as pop culture, even though far fewer people are interested in it.
Third, you misunderstand what this request is for. The request is to fill in the existing tables, not create new ones. Technically (as pointed out by Hesperian) I do not even need BOT approval to do the task. The request was created in part to solicit additional eyes to make sure the data is accurate. Again, the information is 100% from reliable sources and a bot will copy the information far more accurately than a human ever could. So, I ask again are you willing to look over the data? Or are you just trying to block the task from happening? --ThaddeusB (talk) 15:28, 11 September 2009 (UTC)
No, I'm not mistaken about how wikipedia works, and I'm not mistaken about your insulting me rather than addressing the issue. The use of bots does not function by the, be bold, run a bot, make 10,000 entries, then decide if the community wants it theory. Please read the bots policy in its entirety before requesting approval for a bot. It is your responsibility as a bot operator to adhere to bot policy. You can't do that if you don't know it.
I think it's time to close this bot request for approval until there is community consensus for this task to be done. The lack of editors monitoring AnyBot was problematic enough, but that bot at least had some community consensus. This bot has absolutely none, and its operator denies there is any need for community consensus for its task. This is a bad start. Coupled with the bot operator's combative nature and inability and/or unwillingness to address issues, I can't see anything but disaster and another mess of 1000s of entries for someone else to clean up. --69.225.12.99 (talk) 16:32, 11 September 2009 (UTC)
As the creator, designer, and near sole contributor to Wikipedia's lists of prehistoric invertebrates, and also the user who "proactively" sought out someone capable of programming a bot to perform the task at hand, I am curious as to who you expect us to seek consensus from. Should I make sock puppets and then ask them if they approve? I created every single one of the lists ContentCreationBOT will be contributing to, and somewhere around 23-26 of the 28 lists of prehistoric invertebrates in total on Wikipedia. Yes, I gave a range of pages there, as in "I created so many of them that I've lost count." The community-of-people-who-contribute-to-Wikipedia's-lists-of-prehistoric-invertebrates consists nearly entirely of myself, and there is strong consensus between myself and I that this task should move ahead. It's also nice of you to dishonestly claim that we're creating stubs here. We've worked diligently for months preparing this bot and you're willing to shut us down without even having read the description of the task we're requesting approval for. And then when we don't just hump up and take it, you throw a hissy fit and demand that the discussion be closed. Wow. Abyssal (talk) 17:20, 11 September 2009 (UTC)
I didn't use the word stub, until now. I think that this type of personal attack of people ("you throw a hissy fit") who have questions and concerns about the bot, once more, bodes poorly for the use of this bot to create any type of content. --69.225.12.99 (talk) 03:39, 12 September 2009 (UTC)
I took "create 10,000 entries" as you implying the bot would engage in stub generation, if that's not what you meant, sorry. Assuming you were referring to the data-adding task the bot isn't "creating" the entries, it's filling in blank entries in tables that already exist. Abyssal (talk) 18:15, 12 September 2009 (UTC)
1) I didn't insult you, and I am sorry you took it that way. I merely stated that you don't appear to understand what consensus means (at least as it applies to bot tasks).
2) "Perform only tasks for which there is consensus" means perform a task for which there is generally consensus. It doesn't mean we need 20 editors to come comment on every bot and say "yep, there is consensus for this task." There is already implicit consensus for adding this sort of information to articles as it has been done hundreds of times by many different editors with no objections. The bot can just do it faster and more accurately.
3) I have 3 approved bots and understand bot policy thoroughly.
4) You are arguing over semantics, not substance. I most certainly didn't claim the bot doesn't need consensus.
5) I can't "address the issue" as you have not outlined any actual issue with either the bot or the RS data it is using. You have only stated your personal opinion that you think more people should look at the data.
6) Do you have any actual policy based objection to the task of filling in existing tables with reliable source data? Or any objections to the data/code to put it on Wikipedia?
7) If the bot is rejected I will just manually upload the exact same data - as is my right as an editor - and the community will lose the benefit of explicitly knowing it was extracted from a database by a bot. Again, I didn't even have to ask for approval of this task as there is no actual need to automate the uploading. I did so for the community's benefit. --ThaddeusB (talk) 16:56, 11 September 2009 (UTC)

I would also like to add that the data has been looked over by accredited scientists or it wouldn't have been on Paleodb.org to begin with. To expect me or anyone else to manually check every entry is completely unreasonable and, in my opinion, would be more likely to introduce novel errors than improve overall accuracy even if it was possible. --ThaddeusB (talk) 17:02, 11 September 2009 (UTC)

I stand by my original "hissy fit." --69.225.12.99 (talk) 03:39, 12 September 2009 (UTC)
Does that mean you also stand by your refusal to provide any concrete objections? --ThaddeusB (talk) 17:38, 12 September 2009 (UTC)

Question for ThaddeusB and Abyssal

As a matter of interest - and this may help to assuage "the IP algae guy"'s (as I think of him, based on our previous work together cleaning up after Anybot and in lieu of a better name) doubts and concerns, how familiar are you guys with the taxonomy of prehistoric invertebrates in a non-WP context? Is this your chosen field, area of scholarly interest or hobby? IOW - do you *know* these ex-critters, or are you strictly data-processing here?

The reason I ask is so that it may be established how likely it would be that a subtle misunderstanding/misinterpretation of the data presented at Paleodb (perhaps due to incorrect assumptions being made from incomplete knowledge) could occur, go unnoticed during the transfer because no-one knows what to look for - and thus result in massive factual errors being introduced to the wiki. Going back to Anybot, one of the reasons that it failed so hard was that the BotOp didn't really 'know' algae to any great extent - but he did earnestly believe that he was capable of extracting the data automatically and formatting it into encyclopedia articles. That's all well and good, if it works - but as well as misunderstanding some fundamental algae-related terms, he incorrectly assumed (as far as I am aware) that 'number of taxa/species listed at AlgaeBase = number of taxa/species known to science' and ran with it. Then, as there was no-one else around at the time who knew better (or perhaps because no-one else even looked at the resultant articles once they'd gone live), the assumption was made that the bot's output was correct. IIRC, the systematic errors were only uncovered when "IP algae guy"'s students started handing in coursework containing WP-sourced nonsense.

This scenario may not be exactly applicable to the Paleodb dataset - but before this goes any further, I would like to gauge the likelihood of the same thought process being applied again and leading to a different, but equally-borked end result. --Kurt Shaped Box (talk) 09:09, 12 September 2009 (UTC)

I believe the main problem with Anybot was programming, not lack of knowledge - although the second obviously contributed. The programmer made several fundamental errors like not resetting variables & making it runnable from a remote location without a password. These problems were not caught because 1) the code wasn't published and 2) no one who knew what they were doing looked over the sample stubs from the trial run. My code has been published, as has the data, as has a sample page. The code is not runnable remotely.
I do not personally have any knowledge of the subject. I was solicited as a capable bot op, with (what I believe to be) a reputation for carefully checking my bots' output & correcting errors. Abyssal is the one with the knowledge of the subject and the idea for the bot. He was doing the task manually for some time, but some users contacted him to say they thought a bot would do the task faster and more accurately, which is true. A human copy and pasting data will make an occasional error despite their understanding of the subject that a bot wouldn't make.
The reason I have been asking User:Abyssal about the bot is because there are problems with the fish stubs he/she created en masse. I am still waiting for him/her to respond to a question about the fish stubs I put on the user's talk page on May 30th.
While Abyssal claims to be the only working invertebrate paleontologist on wikipedia that is incorrect. I edit invertebrate paleo articles as do a number of my colleagues. Ultimately I will be more concerned about the stubs, but, thank you Kurt Shaped Box for reminding me how I met Abyssal: correcting problems with stubs. Oh, by the way, I'm not really algae guy, as I've said before, I'm marine invertebrate paleo guy. --69.225.12.99 (talk) 09:29, 12 September 2009 (UTC)
You are putting words in Abyssal's mouth. He didn't claim to be the "only working invertebrate paleontologist." He claimed to be the only one working on the specific articles for which this bot will provide data. Additionally, the error you found is precisely why this task should be done by a bot. No human is going to be able to copy gobs of data without introducing novel errors. A well made bot, won't introduce novel errors, although obviously it won't correct any errors in the original data either. (However, by Wikipedia policy we really should be going by what the source says anyway, not using our own knowledge.) If you have any source to cross-check the paleodb data against, I'd love to hear it. Otherwise, I say that this is the best available data and that there is absolutely nothing wrong with reproducing it.
Again, why are you trying to block this task? You say you have knowledge of the subject, yet you refuse to offer your help looking over the data. You demand I find people willing to do this, yet you yourself are a prime candidate to help & refuse. Why? --ThaddeusB (talk) 17:38, 12 September 2009 (UTC)
If anyone has issues with the stubs I made before, all they have to do is ask. If they've asked and I've forgotten to respond, all they have to do is ask again and remind me. That's much nicer than asking, then waiting for months before bringing it up as an act of passive aggression. Also, the substubs I created are not only irrelevant to the discussion on face, but they also bear little resemblance to the fuller, more complete stubs that ContentCreationBOT may create in the future and serve as a poor analogy for such. Abyssal (talk) 18:30, 12 September 2009 (UTC)
Okay - 'Marine Invertebrate Paleo Guy' it is, then... :) By the way, is this the user talkpage post you're talking about? If so, Abyssal did reply to you. --Kurt Shaped Box (talk) 09:41, 12 September 2009 (UTC)
No, in fact he/she didn't. The last question I posed has been ignored since it was posted--reread the post about the two articles with similar names. He/she only responded to the first part, agreeing I had corrected his data for the one article, and thanking me for doing so, but not for the question of whether both articles for what appear to be a single organism should be on wikipedia. This last is precisely the type of mistake that needs reviewed and corrected by humans with content creation bots. This bot's owner and assistant have resorted to bullying. Bullying by the bot operators coupled with failure to act = giant mess on wikipedia that someone else has to clean up. It took months and probably a dozen wikipedia editors to clean up the AnyBot mess. In my opinion it's time to put an end to this RfBA as a place for User:ThaddeusB to post his personal attacks, since he doesn't have what is necessary for running a bot of this nature, and is focused on attacking me rather than getting the bot together. --69.225.12.99 (talk) 18:17, 12 September 2009 (UTC)
The only one attempting to bully people here is you with your whole I don't agree with this so let's shut down discussion right now attitude. You have repeatedly demanded this not take place, but still have yet to offer a single constructive comment. You say I am focused on "attacking you rather than getting the bot together." Um, the bot is together. There is nothing to "get together." Again, you have yet to offer a single actionable complaint with the actual bot that I can address.
You claim to be interested in making sure the bot doesn't make any errors, yet you refuse to help. I think it is pretty clear that your objection is either philosophical against this sort of task ever being done, or is motivated by personal dislike for me and/or Abyssal.
I have not made a single personal attack against you. I merely comments on your comments, just as you have commented on mine. Somehow it is perfectly acceptable for you to distort others comments and say whatever crap you want about them, but if they dare mention you in a reply they are personally attacking you?
Finally "this is the kind of mistake that needs to be reviewed by humans" is an irrelevant comment because this mistake was made by a human, not a bot. In fact, this example is proof why the task should be done by a bot - humans will always make some mistakes when copying large amounts of data. --ThaddeusB (talk) 19:05, 12 September 2009 (UTC)[reply]
I'm male, no need to use the double pronoun thing. As for bots adding errors, the articles won't be set in stone after creation, they will be subjected to the same scrutiny and incremental revisions and fact-checking that all other Wikipedia articles are. It's almost certain that bot generated content will introduce some errors, however, our human editors do substantial amounts of that as well. If a human added 99% good information and 1% inaccurate information, we would think of them as doing a good job. It's illogical to demand more from an automated contributor than from a flesh-and-blood one, but you seem to be expressing that double standard anyway. "Failure to act"? We've already run succesful demonstrations and made the full code public! What would it take to please you? As for us being bullies, well, an old proverb comes to mind. Abyssal (talk) 18:51, 12 September 2009 (UTC)[reply]
I understand that all bots can hiccup and make mistakes from time to time. Unless they start crapflooding, blanking or overwriting a huge number of articles, it's a pretty matter to put right. I'm more concerned about systematic errors that could result in non-apparent-except-to-experts factual inaccuracies across the majority of the bot-generated content. How confident are you with the subject matter at hand and the interpretation of the database content that this may be avoided - or if it did occur, that you'd be able to spot it quickly? I don't want it to come across as though I'm picking on you and ThaddeusB here - but Anybot has left me wary of bots that autogenerate content in this manner, wary enough at least to be thorough in asking questions. --Kurt Shaped Box (talk) 20:17, 12 September 2009 (UTC)[reply]
I think that thorough testing and a bit of preliminary fact-checking will demonstrate whether or not ContentCreationBOT can succesfully utilize the database to generate new content for Wikpedia. If it does prove successful in drawing from the database, then any errors will be on the database's side and thus out of our control. However, since the database was compiled and is operated by scientists, I have confidence that there will be no major problems. At this point it's really just a matter of testing. Abyssal (talk) 15:22, 14 September 2009 (UTC)[reply]
One of the reasons the anybot mess wound up so spectacularly bad was poor communication on the part of the bot operator and unwillingness to respond to concerns. These two users see expressions of concerns about their bot as an opportunity to attack someone for expressing concerns. This will make communication hard to impossible. Poor communication means it won't matter how a mistake is made, because the response will be to attack those who raise issues. And keep attacking and attacking them. Then come back and attack them some more. In my opinion, it simply doesn't matter, once an attitude of this manner has taken hold of the bot operator, there will be no means for issues of concern about the bot to be raised, no means for problems with the data to be pointed out. All such actions will get is an attack. And another attack, and an attack from a different angle, and a new attack. --69.225.12.99 (talk) 06:43, 13 September 2009 (UTC)[reply]
For the dozenth time, do you have any actual objection to express or are you just trying to block the bot? Also, for the dozenth time I can't address "your concerns" until you actually express something concrete. And no that isn't a personal attack despite what you seem to think. --ThaddeusB (talk) 13:06, 13 September 2009 (UTC)[reply]
Oh yeah, I see what you mean. Abyssal - could you check to see whether Graphiuricthys and Graphiurichthys (with an extra 'h') are supposed to be two separate articles? --Kurt Shaped Box (talk) 20:23, 12 September 2009 (UTC)[reply]
They're the same animal, so far as I can tell, but both have been used in the technical literature. I'm not sure which one is correct. Abyssal (talk) 15:16, 14 September 2009 (UTC)[reply]
What to do, then? Pick the most commonly used name and redirect the other article to it (Google Scholar would suggest that 'Graphiurichthys' is the way to go)? I know that these two were human-created articles - but this is exactly the sort of thing that the bot must not be permitted to do, if it starts creating stubs. --Kurt Shaped Box (talk) 19:47, 14 September 2009 (UTC)[reply]
Sepkoski spelled it wrong. It seems I added both the correct and incorrect spellings, forgot about it, and accidentally created an article for both. The problem seems to be pure human error on my part, and therefore unlikely to be duplicated by a bot. Abyssal (talk) 13:03, 15 September 2009 (UTC)[reply]

The problem with anybot was not necessarily the database. The data at algaeBase are fine, and the means of gathering data are identified. Part of what led to a huge mess, that caused the deletion of over 4000 articles and a couple of thousand redirects, was the lack of understanding of the data by the human coder and no community involvement in checking and verifying the articles.

Add to this an operator who would not deal with the problem articles as they were pointed out and you get a couple dozen other editors having to sort out the mess and delete the content.

In spite of the accusation of my being "passive-aggressive" two problem articles generated by Abyssal have been on wikipedia for a long time. He's willing to throw accusations at me, but still hasn't risen to the occasion of correcting the error. If he's going to leave articles that need deleted or corrected up, and these are just two, maybe he's expecting that someone else will clean up after the bot.

These are just two articles with little information in them, and one article is wrong and needs to be either a redirect or deleted. If this bot contributes 10,000 data items, who's going to check for accuracy? If there is any inaccuracy who's going to clean it up?

It seems ThaddeusB is going to blame me for not cleaning up the articles he wants to create-no, I'll do my own volunteer work on wikipedia, not yours, ThaddeusB. Let me know when you're going to start creating the articles I want. And Abyssal is going to throw accusations at the reporters of errors, but not going to correct errors.

Anybot generated errors due to human mistakes. Bots are subject to human errors. An unwillingness to address or correct errors is not an indicator for responsible bot running. --69.225.3.119 (talk) 05:18, 17 September 2009 (UTC)[reply]

Yet again do you have any actual objection I can address? Anybot's code was never checked & it seems was riddled with errors. I am sorry that happened, but its operator's problems are not a reflection on me. This bot's code has been checked & verified that it will copy the data exactly as planned. Further, there is no special knowledge required to copy, for example, the naming scientist from a database to a table.
I most certainly will listen to complaints if you have any that I can actually address. So far your complaints consist of 1) I won't manually check every entry and 2) I allegedly won't respond to complaints. The first is an unreasonable demand that would defeat the point of the bot. The second is merely speculation on your part, and runs contrary to my actual history on Wikipedia.
Then there is your constant stretching the truth/drawing unreasonable conclusions. E.g., "two problem articles generated by Abyssal" somehow equates to this bot screwing up massively. Wow, a human that made two errors in over 2000 articles. (OK, he probably actually made a few more, but clearly the error rate is very low.) One of which was copying Sepkoski's non-standard spelling, which is hardly a serious error. That is hardly reason to shut down this bot. And again, it is completely disingenuous to compare stub creation to filling in a table - the two are hardly the same thing.
Finally, I am not asking you to check 10,000 items; I am just asking you to be reasonable and not expect me to check every item either. I have checked about 100 items and found no errors. That is a reasonable spot check. Others have checked some items as well. If you are unwilling to check even one item, then you have no right to complain that others haven't checked enough. --ThaddeusB (talk) 12:23, 17 September 2009 (UTC)[reply]
I'd feel a little less concerned about all this if Abyssal had fixed those duplicate articles already. It's been a few days now since they were pointed out to him. Now, if the bot buggers up the current task, fixing it will be a simple matter of a few one-click reverts. However, if something goes wrong if/when the bot starts creating stubs and we end up with another Anybot-type mess, I'd hope that A. would be much more enthusiastic in trying to put it right (being the guy with the subject knowledge) than he seems to be WRT the above. --Kurt Shaped Box (talk) 23:17, 17 September 2009 (UTC)[reply]
I went ahead and redirected it myself. Rest assured that if the stub creation (which obviously isn't being approved in this task) were to go awry I wouldn't hesitate to "delete all" first and then go and find the problem before starting over from scratch. --ThaddeusB (talk) 01:59, 18 September 2009 (UTC)[reply]


My impression from following this discussion (and participating slightly) is that the bot operator is reasonably careful and conservative, and appreciates the concerns being raised in the aftermath of the AnyBot debacle. There is no reason to tar Thaddeus with that brush. So long as Thaddeus continues to bear in mind the concerns raised here, and works slowly, and works closely with Abyssal or someone else who has a solid grasp of the field, and is willing to put the brakes on if and when problems and issues are raised by others, then I am not opposed to this going ahead. Hesperian 23:52, 17 September 2009 (UTC)[reply]

I will certainly be cautious with this. I view myself as directly responsible for every edit my bots make & always proceed with caution. I always comb my bots' contributions and try to stamp out even the tiniest errors before releasing them on a larger scale. I assure everyone reading this that I most certainly will take any and all complaints about data integrity seriously.
Furthermore, I am well aware the reputation of bots that produce content has been severely tarnished by Anybot. This is part of the reason I brought this minor task here to begin with. Sure, I could have just uploaded the tables manually and no one would have ever questioned it. However, I want accurate data and I want to start re-building the community's trust that bots can build content. Thus I came here. --ThaddeusB (talk) 01:52, 18 September 2009 (UTC)[reply]


Thank you. I still think we should run some more trials before going ahead, just to be safe. Abyssal (talk) 01:04, 18 September 2009 (UTC)[reply]
I disagree with you, Hesperian. The attitude by ThaddeusB and Abyssal is: they are not responsible for the mistakes the bot makes. I was accused of being "passive aggressive" for failing to parent Abyssal through a correction of an article mistake he made. I don't think this is a team that will clean up after themselves.
The correct response to the problem with the two articles, to show good faith effort toward dealing with future problems with bots, would have been for one of them to correct the articles immediately. But, no, it was more important to call someone names ("passive aggressive") than to make the encyclopedia accurate.
There is no community support for this bot. ThaddeusB is weirdly trying to bully me into being the bot's monitor. If he can't get anyone to check the bot, and he is not able to, and Abyssal won't, and the community isn't interested, why should this bot go forward?
The way to deal with someone who disagrees with something you want and to gain their support is to address their concerns, stay rigorously on target on the issue, and don't tell them they are "passive-aggressive," "throwing hissy fits," "mistaken about how wikipedia operates." All of these comments are personal issues about me. If they are more important than the data, maybe the data aren't that valuable or useful to the encyclopedia.
ThaddeusB and Abyssal have established how they will act already: They will make personal accusations against people raising issues about the bot.
This bot is a disaster in the making because of its operating team. That's my passive aggressive, mistaken-user, can't-raise-substantive-issues, hissy-fitting opinion. --69.225.3.119 (talk) 05:09, 18 September 2009 (UTC)[reply]
If you don't have any concrete objections (and you have yet to offer any), and no actual evidence of how I'll address complaints, then this is just your personal opinion and nothing more. And if you look at my actual record with my actual bots, you will see that I do address actual complaints in a timely manner.
No one is trying to force you to do anything, but you posted here and said you think the bot will screw up but offered no evidence. Of course I am going to respond to that by telling you to check the data if you think it'll mess up. I have personally already checked it & found it to be accurate, but that isn't good enough for you.
Yet again, I can't respond to some theoretical eventual complaint until one actually surfaces. Yet again, do you have an actual complaint with the bot or is this just a philosophical objection and/or personal vendetta?
P.S. Saying someone is mistaken about something isn't a "personal issue" and I take no responsibility for the other two comments, which I didn't make. --ThaddeusB (talk) 13:18, 18 September 2009 (UTC)[reply]
The correct response to the problem with the two articles would have been to tell you to shut up and stay on topic, but I tried to be more diplomatic. I said part of the reason your problems with edits I made (that are irrelevant to ContentCreationBOT's approval) were not addressed is because I get busy and sometimes forget about messages left on my talk page. Further, I said, if you had problems with me not addressing those issues, all you had to do was remind me about them on my talk page. Instead you waited weeks and weeks and only brought the subject up when you could use it to beat me over the head in an unrelated discussion, namely, this one. "Name-calling" or not, I stand by my description of your actions as "passive aggressive."
I have little reason to believe that you raised the issue out of legitimate concern because even after you expressed the complaint you were not very helpful in the matter of getting your own problems addressed. No progress was made towards resolving your own issues until Kurt Shaped Box stepped in. Not that any of this matters, because this is the ContentCreationBOT request for approval discussion page, not the "whine about Abyssal making an error that wasn't even entirely his own fault in a tiny article on an obscure genus of prehistoric fish" discussion page.
There is no community support for this bot? What? Who should we be asking? The guy who started the List of graptolites? That was me. What about its chief contributor? Me, again. List of prehistoric starfish? That's me as well. List of prehistoric barnacles? Another one by me. Crap! List of crinoid genera? Uh oh, it looks like a pattern is emerging. Turns out I'm both the creator and sole major contributor to every single page that the bot is slated to edit. Every. single. one. If you can find another major contributor to the articles, please do invite them to see if we can form a consensus.
Thad trying to bully you into monitoring the bot? Come on. You claim to be an invertebrate paleontologist and you're on a website based around volunteering to edit encyclopedia articles. So, when we come here with a plan to add a lot of information to encyclopedia articles on prehistoric invertebrates, and then you oppose the addition of information to articles in your field and complain that no one would be monitoring the data, it's only natural that we stare at you in disbelief. Regarding my willingness to fact-check, considering that I've explicitly called for more fact checking and testing before we proceed with the bot, even though we've both performed successful tests and received tentative approval from another member, your claim that I'm unwilling to check the data rings very hollow.
I'd love to "rigorously" stay on topic, but someone keeps raising issues about stub creation and something to do with a typo in the title of an article I created about a prehistoric fish no one has ever heard of. I'd love to address your very serious objections, but for the life of me I can't remember them. I remember a complaint about a lack of consensus, but when I pointed out that I was the only one who contributed any meaningful content to the articles the bot would edit, and that I both supported and actively solicited the creation of the bot, you ignored me. Other than that, all I remember is a long series of complaints that we weren't taking your complaints seriously enough.
Congratulations. You've cast a dark shadow over the topic and single-handedly discolored the entire discussion. Sadly, the useful input given by Kurt Shaped Box, Anomie, and Hesperian has gotten somewhat lost in the resulting din. Abyssal (talk)
--69.225.3.119 (talk) 21:23, 18 September 2009 (UTC)[reply]

I see no issues with this bot's proposed work, and all the legitimate issues raised have been addressed. Unless the anon IP wishes to raise a useful objection that has to do with this specific bot approval request, I don't see any further issues which need to be addressed. I'm in favor of approving the bot as it currently stands for this test run on the 13 lists given above (and perhaps the additional ones listed below if they are similar enough and the source database contains information which could be added to them). ···日本穣? · 投稿 · Talk to Nihonjoe 14:18, 23 September 2009 (UTC)[reply]

I've raised issues, and ThaddeusB and Abyssal have played word games and delivered personal insults and criticisms against me as a person. If this is the response to issues before it's running, this is, imo, how they'll respond when it's running: insult the person who raises the issue (personal attacks), play word games (wikilawyering), insult the level of wikipedia knowledge of the person raising the issue (biting the newbie--although I'm not new), and demand that if someone has a problem with the data they should devote their wiki career to monitoring the bot's input.
No. That's my opinion. --69.225.3.119 (talk) 22:45, 23 September 2009 (UTC)[reply]
You have absolutely refused to make any concrete objection that can be addressed. Instead you merely repeat the same line over and over about this bot will obviously screw up because Abyssal and I are bad people.
According to your own words, you have the ability to provide expert advice on the material. Your advice on the data would be appreciated, but apparently all you want to do is criticize others and offer nothing. It's a shame that you want to play petty games ("they tried to bully me into helping, so I won't help") rather than helping to improve Wikipedia. --ThaddeusB (talk) 23:39, 23 September 2009 (UTC)[reply]
I see that despite your fervent insistence that this proposal not go forward, Mr. IP, that my challenge for you to remind us of just one of your many very informed and serious objections continues to go unanswered. Abyssal (talk) 00:11, 24 September 2009 (UTC)[reply]

Code review

ThaddeusB asked me for a code review, so here it is. Not much to mention, really:

  • The error checking could use some work. You properly check for HTTP errors, but not for API errors in the initial page query or for json decoding errors (i.e. a truncated response) (never mind on the decoding errors, from_json just dies on error).
  • $timestamp2 will not have a value unless you run into a maxlag error when querying the edit token (check rvprop in the first query). That would probably give an error in the action=edit request.
  • Will it output a period such as "Mid Ashgill to Mid Ashgill"? If so, wouldn't that be better as just "Mid Ashgill"?
  • It looks like it will screw up the location field if the last entry is not a one-word location name; that may be left over from changing from plain text to a bulleted list. Should the <br /> and the substr($line, 0, -4); line just be removed?
  • I note that the bot will wipe out whatever content is currently in the tables, even those marked "NoData". This may not matter, as I don't know whether there is any such content currently in the tables for those entries. It also seems that it will die if any of the tables contains a genus not in the database, which is an appropriately safe failure mode.

Anomie 16:41, 13 September 2009 (UTC)[reply]
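
To make the first point concrete, here is a minimal sketch of checking for an API-level error after the HTTP request itself has succeeded. It is written in Python for illustration and is not the bot's Perl source. The names `ApiError` and `parse_api_response` are hypothetical, and the sketch assumes the API reports failures (such as maxlag) as a top-level `error` object with `code` and `info` fields.

```python
import json


class ApiError(Exception):
    """Raised when the API body reports an error despite HTTP success."""


def parse_api_response(body: str) -> dict:
    # A truncated or malformed body raises json.JSONDecodeError here,
    # which is the "just dies on error" behaviour mentioned above.
    data = json.loads(body)

    # Assumption: API-level failures arrive as {"error": {"code": ..., "info": ...}}.
    if "error" in data:
        err = data["error"]
        raise ApiError(f"{err.get('code', 'unknown')}: {err.get('info', '')}")

    return data
```

A caller would stop and make no edit on either exception, which matches the safe failure mode described in the last bullet.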

Thanks for the help. I fixed all the errors. The "Mid Ashgill to Mid Ashgill" thing was something I meant to correct, but apparently forgot to do. The location thing was indeed left over from changing to a bulleted list. All of the tables are currently blank, so overwriting them isn't an issue. If I needed to rerun the task for some unforeseen reason I'd change the code to be more cautious at that time. --ThaddeusB (talk) 02:55, 15 September 2009 (UTC)[reply]
Not a programmer, so I can't say much, but thanks for reviewing the code. Also, the "Ashgill to Ashgill" thing has been bothering me, too. Is it possible to remove the duplicate? Abyssal (talk) 15:14, 14 September 2009 (UTC)[reply]
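
For readers who don't follow the code, the two formatting fixes discussed above could look roughly like the sketch below. It is a Python illustration under stated assumptions, not the bot's actual Perl: `format_period` and `format_locations` are hypothetical names, and the sketch assumes locations are emitted as a wikitext bulleted list, one `* ` line per entry.

```python
def format_period(start: str, end: str) -> str:
    """Collapse a range whose endpoints match into a single period name."""
    start, end = start.strip(), end.strip()
    if not start:
        return end
    if not end or start == end:
        return start
    return f"{start} to {end}"


def format_locations(locations: list[str]) -> str:
    """Build the bulleted list by joining entries, so nothing needs to be
    trimmed off the end afterwards and multi-word names stay intact."""
    return "\n".join(f"* {loc.strip()}" for loc in locations if loc.strip())
```

Joining the entries avoids the fixed-length trim of the last separator, which is the step that could damage a multi-word final location.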


Additional pages

The bot may also be useful on the following pages:

These pages would work well with the bot if put into the table format:

Abyssal (talk) 16:37, 17 September 2009 (UTC)[reply]

Trial?

Is this ready for a trial? Mr.Z-man 00:43, 24 September 2009 (UTC)[reply]

So, this is a 100% dismissal of all objections to the bot? Why? --69.225.3.119 (talk) 01:04, 24 September 2009 (UTC)[reply]