User talk:Smith609/DOI bot

From Wikipedia, the free encyclopedia

DOIs that Wikipedia can't render

I don't know if you've encountered this, but I wanted to warn you if you haven't. A small number of publishers have taken to issuing DOIs such as "10.1130/0091-7613(1995)023<0004:RDDTLD>2.3.CO;2" that can't be rendered correctly by Mediawiki because the angle brackets "<", ">" are interpreted as HTML delimiters. Special care needs to be taken when handling cases like that so that the brackets are url encoded as "%3C" and "%3E" respectively. Dragons flight (talk) 16:37, 30 April 2008 (UTC)
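The encoding described above can be sketched in a few lines. This is only an illustration, not the bot's actual code: the function name `encode_doi` and the choice of which characters to leave unescaped are assumptions made for the example.

```python
from urllib.parse import quote


def encode_doi(doi):
    """Percent-encode a DOI so that it is safe to embed in wikitext or a URL.

    Assumption for this sketch: "/", ":", ";", "(" and ")" are left as-is,
    while angle brackets (and any other reserved characters) are escaped.
    """
    return quote(doi, safe="/:;()")


encoded = encode_doi("10.1130/0091-7613(1995)023<0004:RDDTLD>2.3.CO;2")
print(encoded)  # 10.1130/0091-7613(1995)023%3C0004:RDDTLD%3E2.3.CO;2
```

With the brackets escaped, MediaWiki no longer sees anything that looks like an HTML tag inside the identifier.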

DOIbot adding duplicate page numbers

In at least this edit to planet, DOIbot is adding the "pages" field to the cite journal template even though the pages are already there. Otherwise, thanks for the bot; it's doing great work. ASHill (talk) 16:53, 30 April 2008 (UTC)

Thanks for your note. I hadn't set the bot to recognise page numbers starting with letters; I've re-coded it accordingly. Hopefully that'll fix everything; I'll keep an eye on it, but please let me know if it continues to misbehave! Thanks again for the feedback. Smith609 Talk 17:39, 30 April 2008 (UTC)
Check for volume numbers starting (or ending) with letters, too. Whosasking (talk) 17:09, 1 May 2008 (UTC)
 Done Smith609 Talk 17:15, 1 May 2008 (UTC)

Another DOI bug—this one worse

On this edit to star, the DOIbot added an incorrect volume to a citation that already had a (correct) volume listed. See line 260, the edit to Sackmann et al. ASHill (talk) 18:12, 30 April 2008 (UTC)

How bizarre. I'll pause the bot while I figure this one out. It's meant to leave things well alone when data's been added manually! Thanks a lot for your vigilance. Smith609 Talk 18:20, 30 April 2008 (UTC)
In that particular case, there were two pages fields that had been inserted manually (one correct, one incorrect), but only one volume field (correct). The correct doi was also already there. I have fixed the entry. ASHill (talk) 18:27, 30 April 2008 (UTC)
Thanks. Turns out the bot had got the right volume number, but the wrong citation - the things are horribly slippery when you start changing them on the fly! They were meant for "Distant future of the ...", as the fixed bot should be just about to tell the article! Smith609 Talk 18:34, 30 April 2008 (UTC)
The latest edit looks good. Thanks for the quick fix. ASHill (talk) 18:43, 30 April 2008 (UTC)

Minor DOI bot errors

Hi. I noticed that some DOIs were included here twice and that existing empty “doi= ” parameters are not removed or replaced by the bot. --Leyo 13:16, 1 May 2008 (UTC)

Hi, thanks for pointing this out. The case where a DOI was added twice is because of an "error" in the article itself: the same reference was cited twice in full. You should instead refer to the one reference by name multiple times, using the syntax
<ref name=example2008>{{cite journal|details}}</ref> and <ref name=example2008/>
The empty | doi = tags do not cause any damage, but would be very difficult to weasel out. I will add a "citation tidyup" function to the bot when I get round to it.
Smith609 Talk 13:40, 1 May 2008 (UTC)
One of the tasks for a citation tidyup would be to transform id = PMID 12345 to pmid=12345, and perhaps to identify when duplicate information (multiple citations of the same reference, multiple specifications of the same parameter) is included. Whosasking (talk) 13:58, 1 May 2008 (UTC)

 Done
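The "id = PMID 12345" to "pmid=12345" tidy-up suggested above could look roughly like this. It is a minimal sketch, not the bot's implementation: the name `tidy_pmid` and the regular expression are assumptions, and the pattern deliberately leaves alone any id parameter that holds more than a bare PMID.

```python
import re

# Hypothetical pattern: an "id" parameter whose whole value is "PMID <digits>",
# followed directly by the next "|" or the closing "}}".
ID_PMID_RE = re.compile(r"\|\s*id\s*=\s*PMID\s*(\d+)(?=\s*(?:\||\}\}))")


def tidy_pmid(wikitext):
    """Rewrite '|id = PMID 12345' as '|pmid=12345'; leave mixed ids untouched."""
    return ID_PMID_RE.sub(r"|pmid=\1", wikitext)


print(tidy_pmid("{{cite journal |title=X |id = PMID 12345}}"))
# {{cite journal |title=X |pmid=12345}}
```

An id value that also carries something else (an ISBN, say) does not match, so it would be left for a human editor.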

Cleanup of citation: page = 'only one number' ?

Hi Smith,

In this edit, your bot did not recognize the page range and formatting in a reference using a citation template. It read like | pages = pp.&nbsp;12&ndash;14 |, to which was added another field with | pages = 12 |. Is this a side-effect of the bot, or was this intentional, to comply with a standard that I'm not yet aware of? Wim van Dorst (talk) 12:38, 4 May 2008 (UTC).

Hi, this was a bug, now hopefully fixed. Thanks for pointing it out! Smith609 Talk

I've removed your DOI here:

Article: Delayed sleep phase syndrome

cite journal |author=Nicholson AN, Turner C, Stone BM, Robson PJ |title=Effect of Delta-9-tetrahydrocannabinol and cannabidiol on nocturnal sleep and early-morning behavior in young adults |journal=J Clin Psychopharmacol |volume=24 |issue=3 |pages=305-13 |year=2004 |pmid=15118485 |url= http://www.cantodiem.org/PDF/Nicholson_CBME_Sleep.pdf |format=PDF: full text

I've been going through the refs in this article, checking the PMIDs and DOIs as well as changing |format=PDF to |format=PDF: full text when that applies.

Your bot added a doi to the above ref. That link went to a page with no information other than title, and did not respect my "back" button, such that I had to find Wikipedia again and find the article again. I hate sites like that and do not want links to them. Thanks. --Hordaland (talk) 12:02, 3 May 2008 (UTC)

Hi, firstly, it sounds like you are doing a very useful job! Regarding your comment, if you wish to request the removal of DOIs from the cite journal template, please do that at Template talk:Cite journal. In brief, a DOI provides a permanent link to an article, so having one is preferable to not. Many universities and libraries also offer browser plugins to direct their users to full-text versions which they have access to. The website has clearly upset you on this occasion, but please do remember that other users find it useful to have this link, and as the PDF URL remained intact, you were not inconvenienced, so there is little case for removing this useful data.
For future reference, most modern browsers have a drop down list of the last few pages you have visited by the back button, so when you do end up on one of those annoying sites you can't go "back" from, you can find Wikipedia there. Smith609 Talk 12:20, 3 May 2008 (UTC)
Thanks for very quick response. I do not want all DOIs removed, but I will continue to selectively remove bothersome ones. I've just removed another:
In the article Pineal gland, ref: Arendt J, Skene DJ (2005). "Melatonin as a chronobiotic". Sleep Med Rev 9 (1): 25–39, I'm removing the doi because it gave only: Error - DOI Not Found. --Hordaland (talk) 12:33, 3 May 2008 (UTC)
Just note that doing so will cause future readers inconvenience when the URLs provided change and go dead. This talk page is probably not the best place to discuss the removal of DOIs from articles, as not many people visit. Smith609 Talk 14:06, 3 May 2008 (UTC)

bot adding duplicate pages entry

See [1]. - Merzbow (talk) 19:46, 1 May 2008 (UTC)

Thanks. Should be fixed... Smith609 Talk 19:52, 1 May 2008 (UTC)

Please turn bot off until issues are resolved.

While I certainly appreciate the usefulness of DOIs, this bot is undoing the work of human editors. Bots should exist to make the life of editors easier, not harder. In particular:

  • The bot is removing URLs, which unlinks the title. Although this should be fixed (most likely) in template:cite journal, until that is done (and because it may not be done, discussion is still on-going) the bot should not remove any URLs, even redundant ones. It can always remove the redundant ones later, once this is changed.
  • The bot is removing final page numbers. If the editor wants them, they should be kept. This is not the type of decision that should be made by a bot.
  • Please give an example where the bot does this, so I can fix it. I've never noticed this bug.
  • See the section above. Your reply was "Unfortunately, The bot cannot find the issue number or final page number." In this case, if the first number agrees, it should leave the final page number.
  • The example above was not of the bot removing a final page number; it was of the bot inserting only the leading page number, on a page where the citation style is to give both initial and final page numbers. Eubulides (talk) 21:17, 1 May 2008 (UTC)
  • Partial information is better than none at all. I wish DOI bot could find out what page articles ended on - but it cannot! I will work on a way, but this will take time. Smith609 Talk 07:57, 2 May 2008 (UTC)
  • Redundant DOIs. If an article is cited several times, the DOIs get repeated. This is clearly wrong and needs to get fixed, even if it is tricky as explained above. The argument that a tricky bot is needed to do something right is really an argument that a bot is not the correct solution for this task.
  • The article should be cited repeatedly using the syntax above.
  • The argument that editors should do things a different way is not the job of a bot. If this error is present, the bot just made it worse, by changing an article with one problem (not immediately obvious) into an article with *two* problems, one very obvious. If the correct response is to combine the references, then the bot should do that, or leave it alone, not make it worse. (Perhaps it could leave a message?). And the bot needs to check that the cites are identical - if someone uses this mechanism to call out individual parts of a large article (not the recommended use, to be sure) then combining them erases careful human work.
  • An editor complained about preferring "date=" rather than "year=". The reply was "It's not clear what difference this makes.". Just because it's not clear to the bot does not mean it's not clear to the editor. The bot again has to respect the human editor.
  • It's not clear to me, either, and I've inspected the template code. Could you elaborate the difference?
  • Sure. I agree there may be no difference to the code, or the rendering. But if the editor has a preference among two equally good ways of doing something, it's not the job of the bot to change it. Maybe the editor has been typing this way for years, maybe the field has a particular terminology, or any other reason. Bots should not, IMO, change user typed words to others that are "just as good". Unless there is a specific reason to change it, the author's wishes should be respected.
  • I don't think the bot changed anything. If there is a date parameter specified, the bot should not add a year. As with all bugs, I can't fix it unless I have a specific example of the edit in question, which I can't spot. Could you give me an edit and line number, please?
  • There are troubles with DOIs that contain odd characters. It's not clear if this has been fixed...
  • Again, I've not seen any problems. Please give examples so I can fix these troubles.
  • See the above section "DOIs that Wikipedia can't render". Is this fixed?
  • This was never broken. I think I replied on the user's talk page.
  • It's still removing useful URLs. See the section It's still removing useful URLs above, which states: Even with the above changes it's still removing useful URLs. IMO, the default should be reversed - unless the bot can *conclusively* show they are the same, the URL should remain.

So while DOIs are nice, the work of human editors is even nicer. So I think the bot should be turned off until all known issues are fixed. Then turn it on, and when the first problem occurs turn it off again. Once discussion has determined the right options, turn it back on. At the first problem, turn it off and repeat the process. This way the bot will do little harm (unlike now) and its contributions will be appreciated, and not cursed. What do others think? LouScheffer (talk) 18:58, 1 May 2008 (UTC)

Replied above. Smith609 Talk 19:51, 1 May 2008 (UTC)
Replies to replies above. LouScheffer (talk) 20:16, 1 May 2008 (UTC)
Actually, I find this bot immensely helpful. Why not turn-off the "commit edits" checkbox and force the user to check? This way, you turn the responsibility over to the user who is using the bot. --Rifleman 82 (talk) 19:07, 1 May 2008 (UTC)
Just to clarify, the bot's currently running through all 45,000 articles which use the cite journal template. Smith609 Talk 20:02, 1 May 2008 (UTC)
I have absolutely no problem with a user running the bot on page(s) of their choice. What I don't think is good is the bot visiting (and changing) pages all by itself, when it may undo the work of human editors, who worked hard to get the references the way they want them. LouScheffer (talk) 19:20, 1 May 2008 (UTC)

I'd have to agree with LouScheffer; there's no hurry to fix all 45,000 pages--you should turn the bot off until you've fixed all the known bugs. Whosasking (talk) 20:47, 1 May 2008 (UTC)

If the bot's top speed is 6 per minute, and it achieves half of that, then it can visit all 45000 pages in just 10 days. So there's no rush to get it started - let's get it right first. LouScheffer (talk) 21:07, 1 May 2008 (UTC)
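The estimate above can be checked with a short calculation. The function name `days_to_visit` is made up for this sketch; the figures (45,000 pages, a top speed of 6 edits per minute, half of which is actually achieved) are the ones quoted in the comment.

```python
def days_to_visit(pages, pages_per_minute):
    """Return how many days a full pass takes at a constant rate."""
    minutes = pages / pages_per_minute
    return minutes / (60 * 24)


achieved_rate = 6 / 2  # half of the quoted top speed, in pages per minute
days = days_to_visit(45_000, achieved_rate)
print(round(days, 1))  # 10.4
```

That is 15,000 minutes, or 250 hours, which agrees with the "just 10 days" figure to within rounding.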
I just wanted to say I like the idea of this bot, and my criticisms are in no way intended to say "let's not do this". Thanks for doing the work to get citations improved! It's just that these little details, which may seem niggling, actually mean a reasonable amount of work to careful human editors, so it's important to get them right, or at least very close to right. Thanks. Eubulides (talk) 21:17, 1 May 2008 (UTC)

I also think that this bot should not be run on pages in general. If a human editor wants to take responsibility for the changes it makes on a specific page, fine, but you should either turn off the part that rewrites non-DOI URLs into DOIs, or turn the whole thing off. See concerns about privacy and feature creep I posted under separate heading. Zodon (talk) 02:22, 3 May 2008 (UTC)

DOI bot added additional "pages" parameter

At R (programming language), DOI changed this...

  • Ihaka, R. (1996). "R: A language for data analysis and graphics". Journal of Computational and Graphical Statistics. 5 (3): pp. 299-314. doi:10.2307/1390807. {{cite journal}}: |pages= has extra text (help); Unknown parameter |coauthors= ignored (|author= suggested) (help)

To this...

  • Ihaka, R. (1996). "R: A language for data analysis and graphics". Journal of Computational and Graphical Statistics. 5 (3): 299. doi:10.2307/1390807. {{cite journal}}: Unknown parameter |coauthors= ignored (|author= suggested) (help)

by adding an extra "pages" parameter. (The old one was left in place.) I expect this is because the original "pages" parameter was incorrectly formatted, but the addition of another pages parameter was not the correct fix, and also it "lost" the ending page number. --Northernhenge (talk) 09:29, 3 May 2008 (UTC)

Ooh, thanks. The bot didn't spot the pages because they were prefixed with pp. - I thought I'd fixed the bot to spot these; I'll loosen its conditions some more. Smith609 Talk 09:31, 3 May 2008 (UTC)
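The cases reported on this page (a "pp." prefix, HTML entities such as &nbsp; and &ndash;, and page numbers that start or end with a letter) can all be covered by one fairly loose pattern. The following is only a sketch of that idea; `PAGE_RE` and `parse_pages` are hypothetical names, and the pattern is an assumption about what should count as a filled-in pages value, not the bot's real test.

```python
import html
import re

PAGE_RE = re.compile(
    r"^\s*(?:pp?\.?\s*)?"                                  # optional p / p. / pp / pp.
    r"([A-Za-z]?\d+[A-Za-z]?)"                             # first page, may carry a letter
    r"(?:\s*[-\u2013\u2014]\s*([A-Za-z]?\d+[A-Za-z]?))?"   # optional last page
    r"\s*$"
)


def parse_pages(value):
    """Return (first, last) for a recognisable pages value, else None.

    `last` is None when only a single page is given.
    """
    # Turn entities such as &nbsp; and &ndash; into the characters they stand for.
    text = html.unescape(value).replace("\xa0", " ")
    match = PAGE_RE.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_pages("pp.&nbsp;12&ndash;14"))  # ('12', '14')
print(parse_pages("pp. 299-314"))           # ('299', '314')
print(parse_pages("A12"))                   # ('A12', None)
```

A bot using a check like this would treat any value that parses as "already supplied" and leave the parameter alone, rather than appending a second pages field.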

similar issue here

See [2], it adds an extra "pages" parameter instead of using the existing empty one or deleting the empty one before creating the new one --Enric Naval (talk) 17:33, 3 May 2008 (UTC)

It's duplicating both the "pages" and the "doi" parameter here, on two different citations. --Enric Naval (talk) 19:24, 3 May 2008 (UTC)

If it encounters a "page" parameter, it duplicates it into "pages"[3] (first citation on the diff) --Enric Naval (talk) 20:59, 3 May 2008 (UTC)

Empty parameters will be duplicated, because under the current coding the bot has no way of working out which citation to remove the blank entities from. This will be fixed in version 2.0 - i.e. the next major re-write - but as it doesn't cause a problem in the rendering of citations, it is deemed acceptable for now. The "page=" parameter does not exist, nor does it render in the template. Smith609 Talk 23:16, 3 May 2008 (UTC)
OK, thanks for answering --Enric Naval (talk) 12:23, 4 May 2008 (UTC)


Odd bot edits

I don't understand what happened with this or this edit. Were the URLs really supposed to be removed? --EncycloPetey (talk) 12:57, 30 April 2008 (UTC)

Yes I agree that perfectly valid URLs should not be removed by the DOI bot, and I have been regularly restoring the URLs where I find them. To me there's no reason why a URL and doi can't co-exist, and some of the URLs provide free access to papers that are otherwise restricted. So please stop removing the URLs. Thanks.—RJH (talk) 16:24, 30 April 2008 (UTC)
I disagree. When (and only when) the doi leads to a URL that is identical to the URL in the cite template, the URL is redundant and ought to be removed because the doi is more permanent. However, if the URL points to a different link (such as the Astrophysics Data System entry for astronomy articles), it ought to stay (as I believe the bot currently does). ASHill (talk) 16:59, 30 April 2008 (UTC)
Just to clarify, if the URL is not the same page as the DOI points to, the URL parameter is left intact. (Very similar pages are treated as identical for this purpose.) The URL is ONLY removed if it is identical to the DOI destination. If you find any specific cases where this doesn't hold, please let me know ASAP so I can fix the bot. Smith609 Talk 17:35, 30 April 2008 (UTC)
Well then I stand corrected: I was unaware of this nuance. Sorry for the bother.—RJH (talk) 17:56, 30 April 2008 (UTC)
Not at all - thanks for checking! Smith609 Talk 18:38, 30 April 2008 (UTC)
But URL and DOI are not equivalent, even if they point to the exact same page. URL creates a link from the title, and DOI does not. To someone who knows and understands DOIs, there may be no difference, but to the average user there is a lot of difference. (And we should cater to the non-expert user, IMO). On the other hand, if this can be fixed in Template: cite journal such that it creates a title link from a DOI if there is no URL, then this would be OK too. Until then, we should not delete URLs, in my opinion.

DOI bot removing existing URLs?

Hi! I'd think removing the URL when it already points to a DOI is not helpful to the reader. The URL creates a link from the title, but DOI does not. From a user interface point of view, everyone knows what a linked title does, but only a small fraction of readers know about DOIs, and a lot of people are (rightly) hesitant to click on links they do not understand. (Especially a link that looks like a few random characters with no obvious correspondence to the title. This is often the form for spam and phishing links.) So if the URL is already a DOI, I think you should leave it. Adding the extra DOI field is fine. I know that this results in a duplicate link, but we are trying first and foremost to make a good experience for readers. LouScheffer (talk) 20:47, 30 April 2008 (UTC)

Upon further thought, this would be my preference. If the URL is of the form "dx.doi.org/...", then leave it. If the DOI appears elsewhere in the URL, change it to the dx.doi.org/ form (in case it's part of a search string, for example.). Finally, if there is no URL, and you do find a DOI, then *add* the URL with the dx.doi.org/ form. This would create as many linked titles as possible - the best form for the user, IMO. LouScheffer (talk) 20:55, 30 April 2008 (UTC)
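The three-way rule proposed above can be written down compactly. This is a sketch of the proposal only, not of what the bot does; `preferred_url` is a hypothetical name, and the plain substring tests are a simplifying assumption.

```python
def preferred_url(url, doi):
    """Apply the proposed rule for choosing the url parameter.

    1. No DOI known: keep whatever URL is there.
    2. No URL: add one in the dx.doi.org form.
    3. URL already in the dx.doi.org form: leave it.
    4. DOI appears elsewhere in the URL: switch to the dx.doi.org form.
    5. Anything else: leave the URL untouched.
    """
    if not doi:
        return url
    resolver = "http://dx.doi.org/" + doi
    if not url:
        return resolver
    if "dx.doi.org/" in url:
        return url
    if doi in url:
        return resolver
    return url


print(preferred_url(None, "10.2307/1390807"))
# http://dx.doi.org/10.2307/1390807
```

Under this rule a URL is never dropped outright, so the title always stays linked whenever either a URL or a DOI is available.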
I think this is better done automatically by the cite template (as Template:citation already does) and should be discussed at Template talk:cite journal. ASHill (talk) 21:06, 30 April 2008 (UTC)
I find the abundance of blue a bit overwhelming, but take your point and have requested your edit at {{cite journal}}. Smith609 Talk 22:18, 30 April 2008 (UTC)
I hear your point about the abundance of blue. My absolute personal preference is the behavior of the citation template (see my comment at Template talk:cite journal on 2008 February 8), which uses the doi to form a link from the title without separately listing the doi if no url parameter is present, but there was clearly no consensus to do that on the basis that dois should remain visible. Therefore, I think this is the best chance for a consensus option; I agree with Lou that a link from the title is most intuitive to a user who doesn't know about dois. ASHill (talk) 23:56, 30 April 2008 (UTC)
I agree. As usual, I seem to have started a larger discussion than I'd intended to! Feel free to continue this one over at Template talk:Cite journal. Smith609 Talk 07:55, 1 May 2008 (UTC)

Please don't remove existing URLs. For example, this change to Vaccine removed a useful URL; when users click on the DOI they have to click again to see the entire referenced article, but when they click on the URL they see the entire article immediately. I edited that citation to restore the URL (plus improve some other things); I hope the bot doesn't go back and remove that URL again. Eubulides (talk) 22:17, 30 April 2008 (UTC)

The main problem that URL-removal is trying to avoid is stability: nejm, and many other major publishers, are wont to change the organisation of their website or availability of their content without notice, which would take a great deal of work for editors to fix; however, DOI links will always work. The link to full text may be more useful to you (with, I assume, full access to this publication); however, (if I recall correctly how these sites work), somebody without institutional access would receive an "access denied" page, and be one equally irritating click away from the abstract that was the best page for them. I imagine that the publishers have made a conscious choice about where they want their DOIs to point.
Please feel free to correct me if I've recalled that wrongly.
Another thought is that Wikipedia is not meant to be a library of links. I accept that many can be useful, but with articles hosted at a number of sites, where do we draw the line? Ought we include links to JSTOR for every article there, for people whose subscriptions are limited to this resource? What about articles hosted at sepmonline and geoscienceworld? With a DOI, most people's institutional software can provide an "on campus" link to the full text they can access; casual readers are more likely to just want a quick, fast-to-load abstract.
Now, the bot currently has a few rules about which URLs are similar enough to the DOI to be considered equivalent; one involves equating abstracts to full texts, and can be easily removed if you do decide that the URL parameter should always point to a full article text, unavailable to the majority of Wikipedia's readers.
Let me know what you think! Smith609 Talk 22:28, 30 April 2008 (UTC)
A common practice used in medical articles like Vaccine is to use full URLs only to articles that are freely available to all readers. That is, all citations have PMIDs and many have DOIs, even for non-free articles; URLs are used only for freely-readable citations. The particular article in question (Offit 2007, PMID 17898096, doi:10.1056/NEJMp078187) is freely readable to all; it is not accessible only to institutions. The NEJM does this occasionally with articles it feels are of wide importance to the public. The URLs for these articles are stable in practice, and are more useful than the DOI because you get to the article in 1 click, not 2. For citations like these it is OK to have both a URL and a DOI (and a PMID too, since that's standard in medicine), but it is not OK to remove a URL that is better than the DOI or the PMID. Eubulides (talk) 22:59, 30 April 2008 (UTC)
Similarly, edit[4] to Syphilis removed a URL that linked to the full article in favour of a DOI that only gives an abstract - by all means have DOI bot add missing DOIs, but please stop removing valid free-to-access full URLs. If the bot can't distinguish a URL that is a DOI link from a full direct page access then the bot needs to be stopped and a rethink be had - this is being disruptive. Hint: in the example given, the URL contains the directory structure "content/full/". David Ruben Talk 00:27, 1 May 2008 (UTC)
Even the presence of a DOI in the URL may not be distinctive enough to make it a duplicate. For example, http://adsabs.harvard.edu/abs/1979QJRAS..20...29S is the link to the abstract, and http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1979QJRAS..20...29S&data_type=PDF_HIGH&whole_paper=YES&type=PRINTER&filetype=.pdf is the link to the full article. Although this uses their own IDs, the same could easily be true of DOIs - the DOI for the abstract could appear in the URL for the whole article. On the other hand, if the URL ends with .pdf it's almost surely a link to the full article and should not be replaced. LouScheffer (talk) 01:27, 1 May 2008 (UTC)
Given the issues that vary from publisher to publisher, I think the URL should only be deleted if it is exactly the same as the URL the doi resolves to. Rules for a bot to make decisions based on similar (but not identical) URLs are just too difficult—there's always going to be an exception. ASHill (talk) 01:33, 1 May 2008 (UTC)
Thank you all for your input. I've removed the equivalence of "/fulltext/" and "/abstract", so the bot's list of "identical pages" now equates the following:
  • "jsessionid=[Blah]" : optional tracking parameter, presence or absence does not affect viewed page
  • "%2F" = "/" : URL encoding of parameter does not alter destination page
  • "/full?cookieSet=1" : Full text is seen without this ending, where I've seen it
  • "/citation/" = "/abstract/" : Since URLs should only have /fulltext/ (from my newly gained understanding of the parameter) this shouldn't cause an issue
  • "/extract/" = "/abstract/" : From what I recall, these pages have equivalent content.
More of these can be removed if there is a valid reason for doing so - just let me know. Thanks again for all your feedback! Smith609 Talk 07:43, 1 May 2008 (UTC)
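One way to picture the list above is as a normalisation step applied to both URLs before they are compared. The sketch below is an interpretation, not the bot's code: the names `canonical` and `same_page` are invented, the jsessionid handling only covers the ";jsessionid=..." path form, and reading the "/full?cookieSet=1" rule as "strip that ending" is an assumption.

```python
import re


def canonical(url):
    """Reduce a URL to a form in which 'identical pages' compare equal."""
    u = url.replace("%2F", "/").replace("%2f", "/")
    # Drop an optional ";jsessionid=..." tracking segment.
    u = re.sub(r";?jsessionid=[^?&#/]*", "", u, flags=re.IGNORECASE)
    # Assumed reading of the third rule: the page is the same without this ending.
    if u.endswith("/full?cookieSet=1"):
        u = u[: -len("/full?cookieSet=1")]
    # Citation and extract pages are treated as equivalent to the abstract.
    u = u.replace("/citation/", "/abstract/").replace("/extract/", "/abstract/")
    return u


def same_page(url_a, url_b):
    """True when two URLs are 'identical' under the rules listed above."""
    return canonical(url_a) == canonical(url_b)


print(same_page("http://example.org/cgi/content/extract/1/2/3",
                "http://example.org/cgi/content/abstract/1/2/3"))  # True
print(same_page("http://example.org/cgi/content/full/1/2/3",
                "http://example.org/cgi/content/abstract/1/2/3"))  # False
```

Because "/full/" is no longer mapped onto "/abstract/", a URL pointing at full text compares as different from a DOI destination that is only an abstract, and so would be kept.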

It's still removing useful URLs

Even with the above changes it's still removing useful URLs. This change to MMR vaccine removed the URL to http://www.bmj.com/cgi/content/full/323/7303/32, which is a freely readable article. Eubulides (talk) 17:41, 1 May 2008 (UTC)


PMIDs are great

As DOI bot proves that it is stable, could you consider expanding it to look up PMIDs? Like DOIs, not all articles have them, but they are exceedingly useful (more so than many URLs) when available. Interconverting between DOI and PMID is not simple, since the Library of Medicine doesn't always know the DOI, and you'd have to look up the PMID for each reference with the reference, not the DOI. PMIDs are stable and provide links to both abstracts and often to free copies of the full text--I'd love to see a DOI+PMIDbot.

One complication, though, is that each time the bot goes through a page it will query all the articles from 1832 for their DOIs and PMIDs. It would be nice if this eventual behavior, repeatedly and fruitlessly looking up the same references, could be avoided. Whosasking (talk) 13:58, 1 May 2008 (UTC)

PMIDs may be more difficult, or at least slower (according to the PMID API T&Cs). I might as well give it a go when I get the chance, though! Smith609 Talk 14:04, 1 May 2008 (UTC)

One DOI is enough

Tetrahydrocannabinol was recently edited by the DOIbot, and in the end the DOI appeared several times in the reference:

<ref name="pmid12648025">{{cite journal |author=Grotenhermen F |title=Pharmacokinetics and pharmacodynamics of cannabinoids |journal=Clin Pharmacokinet |volume=42 |issue=4 |pages=327–60 |year=2003 |pmid=12648025 |doi= | doi = 10.2165/00003088-200342040-00003 | doi = 10.2165/00003088-200342040-00003 | doi = 10.2165/00003088-200342040-00003 | doi = 10.2165/00003088-200342040-00003 | doi = 10.2165/00003088-200342040-00003}}</ref>

Can you somehow fix it? -- Panoramix303 (talk) 16:21, 1 May 2008 (UTC)

The problem lay with the article, which possessed five identical references. I'll try to get the bot to work around such errors when I get the time; it may be tricky. Smith609 Talk 17:03, 1 May 2008 (UTC)
There are several articles where one reference appears several times. I really appreciate this bot but every entry has to be checked by a user again which is quite tedious.

-- Panoramix303 (talk) 22:15, 1 May 2008 (UTC)
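For a citation like the one quoted in this section, collapsing the repeated parameters is straightforward once the template has been split into name/value pairs. The sketch below is deliberately naive and is not the bot's code: `dedupe_params` is a hypothetical name, the split on "|" assumes named parameters only, with no nested templates or piped wikilinks inside the citation, and when two different non-empty values collide it simply keeps the first (a real tool should probably flag that for a human instead).

```python
def dedupe_params(template):
    """Collapse repeated parameters in a single {{...}} template.

    An empty value is replaced by a later non-empty one; otherwise the
    first value seen for each parameter name is kept.
    """
    inner = template.strip()
    if not (inner.startswith("{{") and inner.endswith("}}")):
        raise ValueError("expected a single {{...}} template")
    name, *raw_params = inner[2:-2].split("|")
    values = {}  # plain dicts keep first-seen order
    for raw in raw_params:
        key, _, value = raw.partition("=")
        key, value = key.strip(), value.strip()
        if key not in values or not values[key]:
            values[key] = value
    parts = [name.strip()] + [f"{k}={v}" for k, v in values.items()]
    return "{{" + " |".join(parts) + "}}"


messy = ("{{cite journal |author=Grotenhermen F |pmid=12648025 |doi= "
         "| doi = 10.2165/00003088-200342040-00003 "
         "| doi = 10.2165/00003088-200342040-00003}}")
print(dedupe_params(messy))
# {{cite journal |author=Grotenhermen F |pmid=12648025 |doi=10.2165/00003088-200342040-00003}}
```

This only tidies one citation at a time; it does nothing about the underlying cause, which is the same reference being written out in full several times in the article.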

DOI bot problem with issue=, pages=, date=

This change by DOI bot to Asperger syndrome had some problems:

The DOI points to http://apt.rcpsych.org/cgi/content/full/7/4/310, so the URL is redundant. If you'd like the template amended to display this link in the title, please add to the discussion at Template talk:Cite journal.
I want the title with a link if a pointer to free full text is available, and I don't want it linked otherwise. Unless I'm missing something, this will be beyond the capability of a simple change to the template. This may take a while to implement on the template side. For now, let's please leave the URLs alone, at least for URLs to free full text. Eubulides (talk) 21:17, 1 May 2008 (UTC)
  • For doi:10.1007/s10803-007-0442-9 it did not update the year from 2007 to 2008. This article was originally cited as an online-first article, before it was given a volume number and page information. The online-first publication was in 2007, but the formal publication (which DOI bot acted on; see the next point) was in 2008. When this happens, it's incorrect to list the date of informal publication along with the volume and page info; the date needs to be consistent with the other information.
The bot does not alter any information that has been manually entered, on the assumption that thinking editors can always do a better job than automated software.
Actually, the bot does alter manually-entered URLs, no? That is the point of the previous bullet. As for the dates, it is frustrating for the human editor when the bot takes a perfectly valid citation to a preprint and transforms it into an invalid citation to the final article (invalid because the year is wrong). I'd rather have a valid citation to a preprint than an invalid citation to the final article. Eubulides (talk) 21:17, 1 May 2008 (UTC)
  • For doi:10.1007/s10803-007-0442-9 it inserted "| volume = 38 | pages = 748"; this omits the issue number and the final page number. I changed this to "|volume=38 |issue=4 |pages=748–58", the usual style for that article. Can the DOI bot be fixed to follow the usual style for an article?
Unfortunately, the bot cannot find the issue number or final page number.
  • Asperger syndrome uses a style like " |name=value" for values that do not contain internal spaces, and " |name= value" (with a space after the "=") for values that have internal spaces. Can DOI bot please be fixed to use this style, or to recognize the style in a particular citation and conform to it?
 Done
  • One other minor annoyance; Asperger syndrome prefers "date=2008" to "year=2008"; can DOI bot be changed to recognize the preferred style of an article?
It's not clear what difference this makes.
It makes no difference to the reader of the article, but it makes a difference to the editor. I often search for "date=" while editing and this sort of search is foiled by a bot that inserts "year=". Eubulides (talk) 21:17, 1 May 2008 (UTC)

I fixed the problems by hand in Asperger syndrome. The URL munging is the most important problem, of course; the off-by-1 date is the 2nd most important. The others are just annoyances, but they add up when one has a ton of citations. Eubulides (talk) 17:28, 1 May 2008 (UTC)

Just came over to see if this was followed up on. SandyGeorgia (Talk) 17:45, 1 May 2008 (UTC)
See comments above. Smith609 Talk 18:09, 1 May 2008 (UTC)


DOI bot conclusion

Hi,

I'm afraid I've run out of time to dedicate to DOI bot. A couple of closing comments:

  • If one reference is cited more than once in an article, it should be cited as follows:
The first time it appears: <ref name=example2008>{{cite journal|details}}</ref>
and subsequent times: <ref name=example2008/>
I'm pretty sure you'll find this quoted somewhere in the MOS, as it makes for a better reader experience.
The additional DOIs may look a mess in the source code, but they at least highlight to editors that the citation has been incorrectly included multiple times, and most importantly this makes no difference to the reader.
  • I'd love to fix petty details such as formatting, and duplicating blank parameters, but it's going to be a good while until I get the chance. Since these do not affect readers, and cause editors minimum inconvenience, they do not require an urgent fix; when I've coded up for them the bot can do another quick run and tidy up such citations.
Please list such requests at User:DOI bot/bugs#Feature requests.
  • If there are any other bugs - i.e. where DOI bot is:
  1. making a reference display incorrectly
  2. Inserting inaccurate DOIs (or other erroneous data)
  3. Removing a URL parameter that is not identical to the page you get when you click the DOI
  4. Removing or replacing any other parameter
then please let me know immediately by posting a message here and I will fix it.

Thanks, Smith609 Talk 07:54, 2 May 2008 (UTC)

DOI Bot Concerns - Privacy and Single Point of Failure

Replacing URLs with DOIs that link to the same place is not a neutral action. Introducing another layer of indirection has implications for information gathering and reliability.

  • The DOI dereferencing system is a huge information gathering tool, as with Google gathering search strings or an employer monitoring what web sites you visit. By pushing pages to use the DOI system the bot is increasing this information gathering. The privacy policy on the DOI website is inadequate.
"Our logs collect and store only domain names or IP addresses, dates and times of visits, and the pages visited. Data from the logs may be used to measure the number of visitors to the site."
No indication of what they might be used for in the future, no indication of what they will not be used for, no data retention times, etc. It is as if they never heard of the Code of Fair Information practice.
  • Control of DOI dereferencing controls access to information. By introducing another layer of indirection it may make accesses slower or less reliable, or block access in some areas (where some material is filtered and others not), or block access when and if the access policies change (DRM, charging for access, etc.). Examples have been mentioned of when URLs may change/be reorganized, but DOIs can be changed/reorganized/limited also.

To the extent that the bot tidies up pre-existing DOI references, fine. But replacing non-DOI URLs with DOIs (even if they currently go to the same place) may have major implications for privacy and access, and is certainly not a neutral activity. (Can you imagine the uproar if Microsoft or Google had a bot that went around the web and changed URLs on websites to links that always went through their servers?) Where are these considerations documented and addressed in relation to the activities of the DOI bot? Thanks. Zodon (talk) 19:18, 2 May 2008 (UTC)

  • The entire intent of DOIs is to create more reliable references, by creating ways to access articles that do not change, ever. Regular web references cannot be used as any sort of scholarly reference precisely since the keeper might re-organize them, or delete them. But for DOIs, the keeper commits to keep them on-line, and if they re-organize, they will fix the DOI mechanism to point to the same articles. This way, at least in theory, a scholarly article can refer to a web resource without fear of it disappearing. LouScheffer (talk) 13:53, 2 May 2008 (UTC)
Intent - sure, but intents can change, systems get subverted, companies bought, databases repurposed (i.e. feature creep happens). Consider the problems of academia being increasingly held hostage by publishing companies controlling the journals. Consider the Westlaw dispute, Westlaw#Legal_disputes, where a company tried to charge for access to public law records by claiming copyright to the indexing system. In a day to day sense, DOIs add another set of servers which may or may not be working (or blocked by proxies, or corrupted by attackers, etc.) In a longer term view, the database and numbering system would represent increasingly valuable intellectual property, which could be exploited in various ways. Which doesn't mean don't use it, but users should be aware of (and try to minimize) the risks. Since this project is pushing the use of DOIs, it should offer a clear balanced view of the downsides as well as the upsides, and it should offer a clear, easy way for editors to opt-out. Zodon (talk) 01:25, 3 May 2008 (UTC)
Well, we're talking about a template. If the DOI resolver we're currently using gets subverted, bought, or repurposed, we can change the template in one place to use another resolver and the problem is fixed everywhere. This isn't an encyclopedia article subject to neutral point of view requirements, although it certainly is subject to a good decision about the risks for readers. However, a user's talk page is not the place to discuss something like this as it won't be widely seen. Perhaps WP:Village pump? ASHill (talk) 02:35, 3 May 2008 (UTC)
Is there provision for alternative independent resolvers? How are they updated? How is the data and update path protected? Those are the sort of issues that I was asking about, saying we can use an alternative resolver is okay if we know they will exist.
The neutral I meant was information/functionally neutral (benign, equivalent, etc.), not the Wikipedia POV neutral. (By removing URLs and inserting DOIs, the bot is obviously pushing a POV that DOIs are better than URLs.)
My apologies if this is not an appropriate place for this question. I am still fairly new to Wikipedia. The first I heard of DOIs was when the DOI bot made changes to a page on my watch list. I read the Wikipedia entry on DOIs and did some reading on the DOI and handle websites and tried some searches for more information. Aside from general mention of concerns about privacy and control on the Wikipedia entry, I didn't find much to indicate where these issues are considered, discussed, etc. The material about the bot directed me to this page for questions/etc. So far I haven't found much about it in the village pump, but I will keep looking. Thanks. Zodon (talk) 20:29, 3 May 2008 (UTC)
This is the place for a discussion about the bot, but if you're talking about a Wikipedia-wide privacy policy concern, starting a discussion at the Village Pump is probably a better place to develop a broad consensus. (There's no existing discussion that I'm aware of, but I haven't looked much.) It looks like User:MCB will start such a discussion soon. ASHill (talk | contribs) 21:14, 4 May 2008 (UTC)

I share Zodon's concerns, and the arguments mentioned here are largely responsible for my block of the bot, and the resulting discussion. Please see WP:AN#DOI bot blocked for policy reconsideration. --MCB (talk) 23:47, 4 May 2008 (UTC)

Replacing URLs with additional DOI parameters

This edit to Omega-6 fatty acid replaced DOI URLs with DOI parameters, even though the links already had DOI parameters. Why would a bot not check for parameters before adding them? Why would URLs be removed at all? Yes, both the URL and DOI pointed to the same place—this is so users can click the article title and get something useful, while bots can consistently find the DOI of an article. —Werson (talk) 07:06, 4 May 2008 (UTC)

If the URL is set to a DOI, then it's not displayed in the citation. Smith609 Talk 07:55, 4 May 2008 (UTC)
A younger me would try to convince you that that isn't true, but I've come to find and accept that the vast majority of Wikipedia users really are completely batshit insane and won't take objective reality as an argument. I see the few users with any sense at all have already tried to get through to you on this discussion page, but it looks completely hopeless. I guess I'll just leave you alone and continue on my way. —Werson (talk) 12:03, 4 May 2008 (UTC)
Many apologies - I completely misunderstood your post. Fixing this bug is tricky due to the way the bot is coded. However, I've got a free hour, so I'll give it a go.
And as I understand the concerns of other users listed here, I believe (hope?!) that most of them will be resolved with the next update of {{cite journal}}. Smith609 Talk 14:57, 4 May 2008 (UTC)


The first letter

In this diff the bot added journal tags to the refs. However, the refs already had journal names, but with a wrongly capitalised first letter. Can the bot check the first letter before adding anything? Ruslik (talk) 14:56, 4 May 2008 (UTC)

I'm afraid I can't make the bot correct all types of human error. However, I might be able to catch this one - I'll see what I can do. Smith609 Talk 15:03, 4 May 2008 (UTC)  Done 17:40, 4 May 2008 (UTC)
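
A minimal sketch of the kind of check requested above, assuming a citation is handled as a flat wikitext string. The function names `find_param_name` and `normalise_param_name` are hypothetical, and this is not DOI bot's actual code (the bot is written in PHP); it only illustrates matching a parameter name case-insensitively before deciding to add a new one.

```python
import re


def find_param_name(citation, name):
    """Return the parameter name as spelled in the citation (ignoring case), or None."""
    match = re.search(r"\|\s*(" + re.escape(name) + r")\s*=", citation, re.IGNORECASE)
    return match.group(1) if match else None


def normalise_param_name(citation, name):
    """Rewrite a mis-capitalised parameter name (e.g. "Journal") to the given spelling.

    Assumption: parameter names are introduced by "|" and followed by "=".
    """
    pattern = re.compile(r"(\|\s*)" + re.escape(name) + r"(\s*=)", re.IGNORECASE)
    # A function replacement avoids any backslash-escape surprises in the name.
    return pattern.sub(lambda m: m.group(1) + name + m.group(2), citation)


citation = "{{cite journal|Journal=Nature|year=2008}}"
existing = find_param_name(citation, "journal")        # found despite the capital "J"
fixed = normalise_param_name(citation, "journal")
```

With a check like this, a citation that already carries "Journal=" would be corrected in place rather than gaining a second "journal=" parameter.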

DOI bot patch

Hi,

I've just spent the afternoon coding a patch for DOI bot, so it now leaves citations in a tidier state. If you know of a page that was left with multiple instances of a parameter in one citation, you can feed it to DOI bot using the links on its user page; it'll take a second look and try to tidy up.

I've tested the patch quite thoroughly but it may have unintended side effects; please notify me of any here as usual and I will fix them. Smith609 Talk

DOI bot blocked for reconsideration of policy

I have blocked DOI bot for 24 hours and intend to propose a permanent ban pending a full policy discussion. The first problem is that it appears to be broken, resulting in edits like this, which broke the reference and left a mess, but more importantly because it is implementing a major policy change in the way Wikipedia makes web references, without large-scale community consensus and buy-in. That is not something a bot should be doing. (I'm aware that the bot received approval, but I don't believe that its full impact was understood.) Basically, the bot appears to be editing URLs in citation templates and replacing them with a DOI scheme that relies on an external private organization (doi.org). In some cases the URL is left alone, but a DOI is added, and what is rendered in the article is a DOI that if clicked on, will take the user to the link indirectly via the doi.org site.

This raises all sorts of issues, and in my view, violates WP:EL because it promotes an external organization (doi.org), and drives huge amounts of traffic to that site, by Wikipedia readers who think they are going to a particular source, and then are taken to doi.org. Regardless of the noble aims or promises of the organization, that is inappropriate. In addition, routing the traffic through a private site allows that site to collect the IP addresses and search terms of all the traffic, a very serious privacy and data collection issue. Furthermore, it is a single point of failure for potentially all the cited sources in Wikipedia. If doi.org goes away in the future (lack of funding, lack of interest, factionalism, who knows what), Wikipedia would suffer immeasurable harm. If doi.org is taken over by a group with different aims and values, Wikipedia would suffer immeasurable harm.

This is a very, very, very bad idea, and should not be implemented without a very strong consensus after a full review of the issues. I am copying this to WP:AN for further discussion. --MCB (talk) 20:36, 4 May 2008 (UTC)

Response left at WP:AN. Smith609 Talk 08:01, 5 May 2008 (UTC)


DOI bot, adding doi parameter when already present but empty

See this edit: the bot correctly recognises that the 2 references are missing doi data, but rather than fill in the empty parameter at the end of the citation template (i.e. the existing "|doi=}}"), it adds a new doi parameter (i.e. "|doi=|doi=.....}}"). Whilst this works as far as the eventually shown footnote in the reference section, it is messy and I suspect risks the bot needlessly targeting the citation in future for having an empty "doi" parameter. This search (very imperfectly, as it locates other hits too) helps show multiple examples of "|doi=|doi=...." creation. David Ruben Talk 02:24, 6 May 2008 (UTC)

As noted above, this has now been fixed. Thanks for pointing it out. Smith609 Talk 07:34, 6 May 2008 (UTC)
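
For illustration only, here is a rough sketch of the behaviour described in this thread: fill an existing empty "|doi=" slot instead of appending a second one. The helper `set_param` is hypothetical and is not taken from the bot's source; it assumes a single, un-nested citation template held as a string.

```python
import re


def set_param(citation, name, value):
    """Fill an empty |name= slot if one exists, otherwise append the parameter.

    The citation is returned unchanged if the parameter already has a value.
    Assumption: parameter values contain no "|" or "}" characters.
    """
    pattern = re.compile(r"(\|\s*" + re.escape(name) + r"\s*=)([^|}]*)", re.IGNORECASE)
    matches = list(pattern.finditer(citation))

    # Never overwrite data that an editor has already supplied.
    if any(m.group(2).strip() for m in matches):
        return citation

    if matches:
        # Insert the value directly after the "=" of the first empty slot.
        pos = matches[0].end(1)
        return citation[:pos] + value + citation[pos:]

    # Parameter absent: append it just before the closing braces.
    head, sep, tail = citation.rpartition("}}")
    if not sep:
        return citation  # not a complete template; leave it alone
    return head + "|" + name + "=" + value + "}}" + tail


filled = set_param("{{cite journal|title=X|doi=}}", "doi", "10.1000/example")
```

Here "10.1000/example" is a made-up DOI used purely as a placeholder.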

Software engineering suggestion

Since the bot can modify 45,000 pages, it makes sense to make sure it's as good as possible before it starts. I'd suggest starting a test page, where everyone is free to add each troublesome case they have found. (same cite journal twice, URL is completely different, URL is mildly different, has page number, has no page numbers, has a range of page numbers, has mis-capitalized keywords, etc.). Then you can try the bot on this page, and only go to automated operation when there are no known bugs. Whenever the bot is updated, you can try it again on the page, to make sure the changes had no unintended consequences. LouScheffer (talk) 15:48, 6 May 2008 (UTC)

Thanks - as advertised at User:DOI bot, you can find this "test" page at User:DOI bot/Sandbox. Because there is such a range of tiny quirks, not all of them can be pre-empted. For example, I didn't anticipate citations having capitalised "Journal" parameters, because any editor who checked their edits before saving them would notice that it hadn't worked and correct it.
Therefore, once all known bugs are corrected, the only way to find more is to run the bot on real pages. I do this in small batches, gradually increasing in size as my confidence in the bot increases, and review each batch immediately to ensure that the fix has not had any unintended consequences (and spot any novel bugs).
After I have done this to my satisfaction, I let the bot loose, checking occasional edits and stopping it as soon as further bugs are reported.
Smith609 Talk 15:59, 6 May 2008 (UTC)
Where is this advertised on the User:DOI bot page? I've read the page several times, but have not seen it... LouScheffer (talk) 22:00, 6 May 2008 (UTC)
Oh, sorry, you had to follow the link under "bugs". The test page remains User:DOI bot/Sandbox. Smith609 Talk 09:26, 7 May 2008 (UTC)

OK, I added a bunch of examples to that page, and removed the generic text. LouScheffer (talk) 22:36, 8 May 2008 (UTC)

Thanks a lot - that's a very diverse range of structurally interesting references, and will come in incredibly useful when I start work on the overhaul! The bot's currently tuned to only add DOIs when it's as good as certain of a match, which explains why it didn't find a couple that you've listed - I'll relish the challenge of improving its techniques to reduce the false negatives! Smith609 Talk 10:14, 9 May 2008 (UTC)
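
The test-page idea discussed in this thread could be sketched roughly as follows: keep a list of troublesome citations with their expected results, and re-run the whole list after every change. Everything here is an assumption for illustration; `tidy_citation` is a hypothetical stand-in that handles one known case, not the bot's real logic, and the sample DOI is invented.

```python
import re


def tidy_citation(citation):
    """Hypothetical stand-in for the bot: drop an empty parameter that is
    immediately followed by the same parameter (e.g. "|doi=|doi=...")."""
    return re.sub(r"\|\s*(\w+)\s*=\s*(?=\|\s*\1\s*=)", "", citation)


# Each case pairs a troublesome input with the output we expect to see.
CASES = [
    # Empty parameter duplicated by an earlier run.
    ("{{cite journal|title=X|doi=|doi=10.1000/example}}",
     "{{cite journal|title=X|doi=10.1000/example}}"),
    # Page numbers starting with letters must be left alone.
    ("{{cite journal|title=X|pages=A1-A9}}",
     "{{cite journal|title=X|pages=A1-A9}}"),
]


def run_cases(func, cases):
    """Return (input, expected, actual) for every case that func gets wrong."""
    failures = []
    for source, expected in cases:
        actual = func(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


failures = run_cases(tidy_citation, CASES)
```

An empty `failures` list after each update would suggest that the change has not reintroduced any of the previously reported problems, though it cannot rule out novel ones.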

Categorisation of cite doi/subpages

No problem. :) --WoohookittyWoohoo! 10:09, 8 May 2008 (UTC)


DOI bot gives syntax error

Hi, I tried to use the DOI bot manually to look up some DOIs, but it gives a syntax error when I try.

  • Parse error: syntax error, unexpected '}' in /home/verisimilus/public_html/Bot/DOI_bot/doibot.php on line 648

LouScheffer (talk) 18:06, 9 May 2008 (UTC)

Hi, it's been disabled for bureaucratic reasons. I hadn't realised it was still useful in its crippled state! It's back running again now. Smith609 Talk 20:14, 9 May 2008 (UTC)

Bot vacation?

Is the DOI bot on vacation? All I get is this message: "Sorry, the DOI bot is temporarily unavilable [sic] while bugs are fixed. Please try back later." When will it be working again?  —Chris Capoccia TC 15:41, 2 October 2008 (UTC)

See User talk:Citation bot -- I just need to make sure some bugs are fixed; I hope to have the time to do this at the weekend. Martin (Smith609 – Talk) 18:46, 2 October 2008 (UTC)

Google Scholar Universal Reference Formatter doesn't timeout

Hi, I'm just dropping a note about your URF not doing its timeout. Way back when, I asked for a function to condense these citations, but since the citation is always loading, it won't transform it into a condensed citation. Thanks for your work. II | (t - c) 16:47, 23 April 2009 (UTC)