User:Drilnoth/codefixer.js/doc

CodeFixer is a user script that lets you quickly and easily fix common mistakes in HTML and WikiText. Please note that this script is still new, so there may still be bugs, and not all of its planned functionality has been implemented. Additionally, while the script is being worked on, errors may be introduced which cause it (and possibly other user scripts which you are using) to stop functioning for a short time, although such errors should usually be fixed within a few minutes.

This script is based upon MECU's BR fixer, Plastikspork's script, and Formatter, and combines various elements from them with new code. Each script functions slightly differently, however, so you can choose whichever one suits your personal tastes.

Use

When CodeFixer is installed (see #Installation), you should see two new tabs at the top of each page, both in edit mode and when viewing the page normally: "fix code" and "fix code (+)". Clicking either of these tabs will cause the script to automatically edit the page and fix common errors, or simply clean up code even where it wasn't an error. The script itself contains a list of everything that it can do (although it is formatted in JavaScript, it should be fairly clear to anyone what the fixed problems are). The range of what is fixed is always expanding.

The two tabs make slightly different fixes. The first tab runs the "standard" version of CodeFixer. Although it has some false positives (see below), almost all of its fixes shouldn't be problematic. The "fix code (+)" tab, however, starts CodeFixerPlus, which performs more advanced cleanup in addition to the "basic" fixes of the normal version. CodeFixerPlus's edits are best treated as a head start on cleaning up code: while working on advanced things like converting HTML tables to WikiText tables, the script will introduce errors that need to be cleaned up by a human before the edit is saved. CodeFixerPlus, then, serves as a quick way to "help" clean up such code, but it doesn't do the whole job, and you need to go over the result, previewing and editing further, to make sure that it all looks good.

Note that you should always check the diff of any edit made using this script before saving, to make sure that there weren't any false positives (the "show changes" button is automatically clicked when you use this script, so you don't need to do it manually). If there are any, please fix them before saving and report them at #Bugs and to-do list, so that I am aware of the issue.

CodeFixer works with Formatter, although it contains many of the same fixes. Someday CodeFixer may incorporate all of Formatter's current fixes, except maybe for whitespace fixing.

False positives

Because of the simplicity of this script's RegEx, it will find some false positives, such as those below. These can hopefully be fixed at some point; if you find any other false positives, please add them here so that I can try to figure out how to fix them. CodeFixerPlus is known to have false positives which need further human cleanup before saving; this list contains only those caused by the standard CodeFixer.

  • The script's ISBN fixing automatically looks for "ISBN:", "ISBN-13:", "ISBN-13", "ISBN-10:", and "ISBN-10" in articles and replaces them with the correct "ISBN" syntax to allow the MediaWiki software to automatically link the ISBN, but it will also change ones which shouldn't, such as many in the actual International Standard Book Number article.
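
To make the false positive above concrete, here is a minimal sketch of what such a replacement might look like. The function name and the exact expression are assumptions for illustration; CodeFixer's real ISBN expression may differ.

```javascript
// Illustrative only: not the script's actual code.
function fixIsbnLabels(text) {
  // Match "ISBN:", "ISBN-10", "ISBN-10:", "ISBN-13" and "ISBN-13:" plus any
  // following spaces, and replace them with the bare "ISBN " form.
  return text.replace(/ISBN(?:-1[03]:?|:)[ ]*/g, 'ISBN ');
}

// Intended fix:
console.log(fixIsbnLabels('ISBN-13: 978-0-306-40615-7'));
// ISBN 978-0-306-40615-7

// False positive: prose that is *about* the ISBN-10 format is changed too.
console.log(fixIsbnLabels('The ISBN-10 format has ten digits.'));
// The ISBN format has ten digits.
```

A plain expression like this cannot tell a book reference from a sentence discussing the standard, which is why the diff should be checked before saving.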

Installation

To install CodeFixer for your own use, simply add the following to your monobook.js page (if you're using a non-standard skin, you probably know what to do).

importScript('User:Drilnoth/codefixer.js'); //See [[User:Drilnoth/codefixer.js/doc]] for details

After adding this, just purge your cache and the script should begin working.

Configuration

By default, CodeFixer marks edits you make using it as minor. If you'd prefer that they not be, you can add the following right under the "importScript" text on your monobook.js page:

codefixerMinor = false;

Please note that setting this to false will actually mark the edit as major even if it had been marked as minor before you used the script, due to a technical restriction. You can, however, still manually check and uncheck the "mark this edit as minor" box, depending on what other edits you make.

Bugs and to-do list

If you have found a bug: Please leave a descriptive entry here and a note on my talk page about the issue. Thanks!

  • Add more errors to the script so that they can be fixed. –Drilnoth (TC) 17:12, 26 March 2009 (UTC)
    • Currently focusing on: Merging Formatter's various functions into this script (without whitespace perhaps, since it's so minor and makes the diffs too large!) –Drilnoth (T • C • L) 22:02, 19 April 2009 (UTC)
  • CodeFixer+ mode will currently make changes inside working links (including internal, external, and interwiki links). While this can sometimes be beneficial, it usually is not. It's probably best not to do such changes automatically. Gavia immer (talk) 20:30, 22 April 2009 (UTC)
    • (copied from my talk page, since I saw this there first): Could you give me a link for CodeFixerPlus editing inside links? It shouldn't be doing anything right now other than what the standard CodeFixer does and converting HTML tables to wikitables. Were both versions causing the problem, or just CodeFixerPlus? Thanks. That having been said, CodeFixer does not have any way to separate text in links from text outside of links. In my hundreds of edits using it, however, I've only encountered a few cases where it would have broken an external link (the changes which caused that have now been fixed, however), and I've never encountered a problem with internal links. None of the edits that it makes really change the kinds of things that are in most links other than to unicodify elements (which doesn't harm internal links, but can cause trouble for some external links. However, the most common change which would harm external links—replacing &amp; with just &—has been deactivated) and remove unicode control characters (which so far has had no effect on internal or external links whatsoever). –Drilnoth (T • C • L) 21:23, 22 April 2009 (UTC)
      • As noted on the talk page, it appears it was WP:FORMATTER that caused the issue, not this script; sorry for the false alarm. Gavia immer (talk) 21:29, 22 April 2009 (UTC)
  • The script tried to change "i.e." to "ie." and I looked through the script and couldn't find this change, so I assume it's a bug. According to i.e., "i.e." (with two periods) is the correct form and it shouldn't be changing that (however, correcting other forms to the right one might be a nice addition). --Yarnalgo talk to me 05:59, 24 April 2009 (UTC)

Requested fixes

If you would like to request that a new fix be added to the script, please edit this section and add your request here. If you request an addition here, I'll do that as soon as possible if it is a logical fix for this script. Once the fix is added, I will remove the request from this section and notify you on your talk page. If the requested fix cannot be implemented due to technical restrictions, or because it is an inappropriate task for this script, I will say such in the edit history and leave a note on your talk page.

<br> tags

(NOTE: Previous discussion on this topic removed because page was getting too large. See [3] for earlier parts of this discussion. –Drilnoth (T • C • L) 16:22, 21 April 2009 (UTC))

Probably better to paste everything to User talk:Drilnoth/codefixer.js/doc and have this section point folks there; then archive the talk normally. Cheers, Jack Merridew 06:07, 24 April 2009 (UTC)
Good idea. I hadn't originally been expecting sections here to grow this large. –Drilnoth (T • C • L) 13:24, 24 April 2009 (UTC)

You should probably note that the period in '<.BR>' matches all characters, hence all the particular matches you have after that are redundant. This is probably fine, unless you are worried about matching something like '<abr>' or '<ubr>', although I don't think there are any three-letter tags that end in br. If you really want to match a literal period, you have to backslash it. The three lines above could be reduced to the following two lines:

    txt.value = txt.value.replace(/(?:<[\\\/\.]+BR[\\\/\. ]*>|<[\\\/\. ]*BR[ ]*[\\\/\.]+[ ]*>)/gi, '<br />');
    txt.value = txt.value.replace(/<[ ]*BR[ ]*>/gi, '<br>');

The second line will perform one no-op, which is replace '<br>' with '<br>', but it's not a big deal in my opinion since it reduces the complexity of the match. By the way, I maintain my own script, which does some cleanup but it's mostly orthogonal to what you have here. Although there is some minor overlap. Thanks for your contributions! Plastikspork (talk) 00:32, 20 April 2009 (UTC)
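
As a quick check of the points above, the following sketch first shows the difference between an unescaped and an escaped period, then runs the two suggested replacements on a sample string. The helper name normalizeBr and the sample input are made up for illustration; the real script works on txt.value directly.

```javascript
// An unescaped period matches any character; an escaped one matches only '.'.
console.log(/<.BR>/i.test('<abr>'));   // true
console.log(/<\.BR>/i.test('<abr>'));  // false
console.log(/<\.BR>/i.test('<.br>'));  // true

// The two suggested replacements, wrapped in a hypothetical helper so they
// can be tried on a plain string.
function normalizeBr(text) {
  // Tags with a slash, backslash or period before or after "BR"
  text = text.replace(/(?:<[\\\/\.]+BR[\\\/\. ]*>|<[\\\/\. ]*BR[ ]*[\\\/\.]+[ ]*>)/gi, '<br />');
  // Tags with no slashes at all (includes the harmless '<br>' -> '<br>' no-op)
  text = text.replace(/<[ ]*BR[ ]*>/gi, '<br>');
  return text;
}

console.log(normalizeBr('a</br>b<BR/>c< br >d<br.>e'));
// a<br />b<br />c<br>d<br />e
```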

Okay; thanks! That code looks a lot more confusing, but I think that I can trust you. :) I wasn't really aware of how the periods worked (as I said, still kind of new to RegEx), so thanks for telling me (I'd seen it before, but was looking at a lot of the different RegEx symbols at the same time, so I got a tad confused. :) That looks like a useful script... would it be OK with you if I used quite a bit of the RegEx in yours to help expand CodeFixer? Most of the things that your script does are kind of on my long-term to-do list, but why develop it separately? (don't worry; you'd get credit both in the intro for this page and in the code itself once I added your RegEx to my script). It's completely up to you; I'll understand if you'd rather that I not do that. –Drilnoth (T • C • L) 00:55, 20 April 2009 (UTC)
If you want something a bit easier to read, try the following
    txt.value = txt.value.replace(/<[\\\/\.]+BR[\\\/\. ]*>/gi, '<br />'); // Tag starts with a slash or period
    txt.value = txt.value.replace(/<[\\\/\. ]*BR[ ]*[\\\/\.]+[ ]*>/gi, '<br />'); // Tag ends with a slash or period
    txt.value = txt.value.replace(/<[ ]*BR[ ]*>/gi, '<br>'); // Tag contains no slashes
The (?:foo|bar) matches either 'foo' or 'bar', but does not save the pattern, so it cannot be referenced with a $1.
The [\\\/\.] matches either '\' or '/' or '.'
The '+' modifier indicates it must match one or more times, while the '*' modifier is zero or more.
You are certainly free to copy regular expressions from my script. That's the great thing about WP is that people generally share stuff. It would be even better if there was a way we could both link to common code rather than just copying and pasting. One way to do this would be to create a set of smaller helper functions and put those in a common script file. For example, Wikipedia:WikiProject_User_scripts/Scripts/Formatter seems to be well written in that regard. If that script had split out the 'addOnloadHook' and 'format' functions, then I could just include it and re-use the subfunctions. The only other problem with that script is the possibility of namespace collisions since the function names are very common. Plastikspork (talk) 03:59, 20 April 2009 (UTC)
That does look a bit more readable; I'll probably use it. Thanks! Anyway, I share your feelings on the code use... I hope to eventually merge Formatter into this script, too, but that's on my long-term to-do list. Thanks! –Drilnoth (T • C • L) 13:15, 20 April 2009 (UTC)
@David Göthberg (talk) 14:58, 19 April 2009 (UTC) (mostly)
I can see your reasons for not wanting to propagate the xml form en masse, so I'm fine with this script not converting-up in a blanket manner. And I'll certainly rethink the notion of whitespace around br-elements as I do prefer more readable forms.
That said, I still feel that the '/' is a legit form and that we should not be automatically removing them either. The various expressions looking for malformed code should make a good faith attempt at seeking whatever was intended.
I did not know that MediaWiki was emitting PDF somewhere; people browsing with printers? The future will offer many new things.
Ignore most of my initial comments re {{br}}, which redirects to {{-}}; the edit request I made (Template talk:-#Time to make this bad idea right) was about getting rid of clear="all" in favour of style="clear: both;" — and note the xml form in use there ;)
Getting the assorted tools working on the same page would be good; there's little sense in the scripts and bots thrashing the database with automated edit warring.
Thanks for the blog link; something I'll have to read more of. Now off to read the spork script.
Cheers, Jack Merridew 06:23, 20 April 2009 (UTC)
p.s. most of this br-discussion would apply to hr-elements, too.
There should be a "PDF version" link in the toolbox whenever you're looking at an article (it's relatively new). Also, I'm not familiar with hr elements... what do they do, and when are they used? Thanks. –Drilnoth (T • C • L) 13:15, 20 April 2009 (UTC)
I had not noticed the PDF version link; tons of junk at the end. Sure you've used them ;) horizontal-rules aka ---- They don't have a closing tag — no </hr>, so the xhtml form is <hr />. Same would apply to img-elements if they weren't hoovered-out of the code. Raw hr-elements would be fairly uncommon, unlike the breaks. Cheers, Jack Merridew 13:46, 20 April 2009 (UTC)
Maybe I should just replace various forms of hr tags with the four wikitext dashes? –Drilnoth (T • C • L) 14:12, 20 April 2009 (UTC)
Probably reasonable; I would expect that the primary reason someone would hard-code an hr-element would be to get some specific look with a style attribute; a basic S&R would leave those alone. Back some years, most tables were done w/raw html; pasted from Mozilla Composer and the like. Cheers, Jack Merridew 14:35, 20 April 2009 (UTC)
Okay; I'll add this. –Drilnoth (T • C • L) 16:23, 21 April 2009 (UTC)

I don't think your code is completely safe in that if the <hr> is preceded by a pipe, it will look like a |- for a table. For example, try it on the following (check the wikisource to see exactly what I am talking about):


For this reason, I suggest the following code instead, which will add a newline if the 'hr' is not at the start of the line:

    txt.value = txt.value.replace(/([\r\n])[\t ]*<[\\\/\. ]*HR[\\\/\. ]*>/gi, '$1----');
    txt.value = txt.value.replace(/(.)<[\\\/\. ]*HR[\\\/\. ]*>/gi, '$1\n----');

Let me know if there is anything else I can do to help. Plastikspork (talk) 23:03, 21 April 2009 (UTC)

Okay; thanks for the code! –Drilnoth (T • C • L) 23:30, 21 April 2009 (UTC)
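
The pipe problem and the suggested fix can be sketched as follows. The naive version is only an assumption about what a one-line replacement would look like, and the helper names and sample input are invented for illustration.

```javascript
// Naive approach (assumed, shown for contrast): replace the tag in place.
function naiveHr(text) {
  return text.replace(/<[\\\/\. ]*HR[\\\/\. ]*>/gi, '----');
}

// The two suggested lines, wrapped in a hypothetical helper.
function safeHr(text) {
  // Tag already at the start of a line: replace it where it stands
  text = text.replace(/([\r\n])[\t ]*<[\\\/\. ]*HR[\\\/\. ]*>/gi, '$1----');
  // Tag preceded by any other character: push the dashes onto a new line
  text = text.replace(/(.)<[\\\/\. ]*HR[\\\/\. ]*>/gi, '$1\n----');
  return text;
}

var sample = '| cell |<hr />'; // made-up input with a pipe before the tag
console.log(JSON.stringify(naiveHr(sample))); // "| cell |----"
console.log(JSON.stringify(safeHr(sample)));  // "| cell |\n----"
```

In the naive result the pipe is immediately followed by dashes, which reads like a table row marker. One thing to be aware of: as written, both suggested expressions need a character before the tag, so an <hr> that is the very first thing in the text appears to be left unchanged.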

unbalancing tags

I noticed an issue; sometimes this script offers to convert mundane elements like the b-element to wiki-markup, but gets confused by any attributes present.

  • [[User:Rlevse|<b style="color:#060;"><i>R</i>levse</b>]], for example, would become:
  • [[User:Rlevse|<b style="color:#060;">''R''levse''']]

This does still work, but is definitely not something we want happening — to anything, not just sigs. An expression for this would be rather messy as there's a lot that could be going on in the middle.

fyi, Jack Merridew 06:37, 20 April 2009 (UTC)

Thanks for letting me know about this. At this point I don't know how to fix it, but I'll keep it in mind. It shouldn't really come up that much though... I've only ever seen styled b and i elements on Wikipedia in signatures, which this script shouldn't be used to change anyway. –Drilnoth (T • C • L) 13:17, 20 April 2009 (UTC)
If you want to make sure you only unwrap nested tags, then match both the start and the end tag at the same time:
  txt.value = txt.value.replace(/<(B|STRONG)[ ]*>([^<>]*)<\/\1[ ]*>/gi,  "'''$2'''"); // Wikify <B> and <STRONG>
  txt.value = txt.value.replace(/<(I|EM)[ ]*>([^<>]*)<\/\1[ ]*>/gi,  "''$2''");       // Wikify <I> and <EM>
In the first expression, '\/\1' will resolve to either '/B' or '/STRONG' depending on the outcome of the first match. Of course, this won't match unclosed tags, or bold tags around other nested tags, but this would be safer. You could repeat these two lines twice if you wanted to add the possibility of nested versions. Although rare, it might be useful to add something before this which would fix nesting problems. For example <b><i>foo</b></i>. I could write something up if you are interested. Plastikspork (talk) 13:46, 20 April 2009 (UTC)
Thanks for the code. I'd much appreciate it if you could write the RegEx for the nesting problem. Thanks! –Drilnoth (T • C • L) 14:09, 20 April 2009 (UTC)
Catching all of them would be more involved, and I am not sure if it ever comes up in practice. It would be useful to have a real world example where this was spotted in WP. As I said, I believe it is rare. If it is very rare, then we might want to wait until it comes up in practice. If you want something that matches simple singly entangled tags, I could write that when I have some spare time. Today has been very busy. Thanks! Plastikspork (talk) 23:03, 21 April 2009 (UTC)
Well, maybe just wait on it until we come across any instances of it. –Drilnoth (T • C • L) 23:29, 21 April 2009 (UTC)
This would be rare; the example I gave, Rlevse, was the only instance I have seen. I coded that version of his sig for him and it is using html — the i-element, too — because there were odd things happening to his sig when there was a higher level of apostrophe-markup wrapping it; things like people quoting him in italics. Generally, I use tools like this as an fyi, and pick and choose the bits to actually use.
— Cheers, User:Jack Merridew aka david 05:22, 22 April 2009 (UTC)
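
For reference, here is how the paired-tag expressions suggested in this thread behave on the signature example and on a nested case. The helper name wikifyBoldItalic is hypothetical; the two expressions are the ones quoted above.

```javascript
// The two paired-tag replacements, wrapped in a hypothetical helper.
function wikifyBoldItalic(text) {
  text = text.replace(/<(B|STRONG)[ ]*>([^<>]*)<\/\1[ ]*>/gi, "'''$2'''"); // <B>, <STRONG>
  text = text.replace(/<(I|EM)[ ]*>([^<>]*)<\/\1[ ]*>/gi, "''$2''");       // <I>, <EM>
  return text;
}

// A styled <b> has attributes, so the opening tag never matches and the
// element stays balanced; only the plain <i> pair is converted.
var sig = '[[User:Rlevse|<b style="color:#060;"><i>R</i>levse</b>]]';
console.log(wikifyBoldItalic(sig));
// [[User:Rlevse|<b style="color:#060;">''R''levse</b>]]

// Nested plain tags need a second pass, as mentioned above.
var nested = '<b><i>foo</i></b>';
console.log(wikifyBoldItalic(nested));                   // <b>''foo''</b>
console.log(wikifyBoldItalic(wikifyBoldItalic(nested))); // '''''foo'''''
```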

Adding spaces

I noticed someone made an edit that was attributed to this script, and that edit added lots of incorrect spaces: [4]. There's no need to replace entities with unicode in the first place, but if the script does do the replacement it should not add spaces. — Carl (CBM · talk) 17:11, 22 April 2009 (UTC)

As far as I can tell, that user must have manually added the spaces before/after the script had finished, or used another script to add the space. If you search for '&mdash;' and '&ndash;' in the source, it does not pad the substitutions with spaces. 128.174.237.100 (talk) 17:20, 22 April 2009 (UTC)
For anyone wondering (and future reference), MOS:EMDASH is pretty clear that em-dashes shouldn't be spaced. I'll drop the user a note on their talk page. ダイノガイ?!」(Dinoguy1000) 17:22, 22 April 2009 (UTC)
OK, thanks for the quick answer. — Carl (CBM · talk) 17:39, 22 April 2009 (UTC)
The script does not add in extra space by default; it simply replaces the entities with unicode. Gavia immer must have added them in on their own or with another script in that edit. –Drilnoth (T • C • L) 19:54, 22 April 2009 (UTC)
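
A minimal sketch of an unpadded entity replacement of the kind described in this thread. The function name is made up, and the script's actual lines may be written differently; the point is only that the replacement strings contain the dash character and nothing else.

```javascript
// Illustrative only: replace dash entities with the bare Unicode character.
function unicodifyDashes(text) {
  return text
    .replace(/&mdash;/g, '\u2014')  // em dash, no surrounding spaces
    .replace(/&ndash;/g, '\u2013'); // en dash, no surrounding spaces
}

console.log(unicodifyDashes('1999&ndash;2009') === '1999\u20132009'); // true
console.log(unicodifyDashes('word&mdash;word').length);               // 9
```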

H1 - H6 tags

I notice that the code replaces '<H1>' with '=' without checking to see if it's preceded by a newline. Note that while HTML is pretty much newline-insensitive, Wikipedia text is not, especially in this case. A simple solution, which appears to be completely safe, is to match the start and end tags together and make sure there are no problems with newlines. For example this is<h1>a heading</h1>with problems would not render the same as this is=a heading=with problems. You could start by adding newlines where needed:

  txt.value = txt.value.replace(/([^\r\n ])[\t ]*(<H[1-6][^<>]*>)/gim, '$1\n$2');   // Make sure <H1>, ..., <H6> is after a newline
  txt.value = txt.value.replace(/(<\/H[1-6][^<>]*>)[\t ]*([^\r\n ])/gim, '$1\n$2'); // Make sure </H1>, ..., </H6> is before a newline

and then follow that by a match which replaces only those which can be safely changed

  txt.value = txt.value.replace(/(^|[\r\n])[\t ]*<H1[^<>]*>([^\r\n]*?)<\/H1[\r\n\t ]*>[\t ]*([\r\n]|$)/gim, '$1=$2=$3');
  txt.value = txt.value.replace(/(^|[\r\n])[\t ]*<H2[^<>]*>([^\r\n]*?)<\/H2[\r\n\t ]*>[\t ]*([\r\n]|$)/gim, '$1==$2==$3');
  txt.value = txt.value.replace(/(^|[\r\n])[\t ]*<H3[^<>]*>([^\r\n]*?)<\/H3[\r\n\t ]*>[\t ]*([\r\n]|$)/gim, '$1===$2===$3');
  txt.value = txt.value.replace(/(^|[\r\n])[\t ]*<H4[^<>]*>([^\r\n]*?)<\/H4[\r\n\t ]*>[\t ]*([\r\n]|$)/gim, '$1====$2====$3');
  txt.value = txt.value.replace(/(^|[\r\n])[\t ]*<H5[^<>]*>([^\r\n]*?)<\/H5[\r\n\t ]*>[\t ]*([\r\n]|$)/gim, '$1=====$2=====$3');
  txt.value = txt.value.replace(/(^|[\r\n])[\t ]*<H6[^<>]*>([^\r\n]*?)<\/H6[\r\n\t ]*>[\t ]*([\r\n]|$)/gim, '$1======$2======$3');

Note that if there is a newline inside of the headline tags, then the substitution is not safe, which is why my suggestion guards against newlines inside. A solution to this problem would be to remove these newlines before performing the substitution. If you want that code, search for 'remove newlines from inside' in my script. You can copy the for loop and change 'str' to 'txt.value' and it should work. Plastikspork (talk) 02:15, 25 April 2009 (UTC)

Good point; thanks. I don't have the time tonight to do this, but will implement it tomorrow. Thanks! –Drilnoth (T • C • L) 02:24, 25 April 2009 (UTC)
 Done; let me know if there are any problems. Thanks again! –Drilnoth (T • C • L) 20:56, 25 April 2009 (UTC)
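
The steps discussed in this thread can be tried out with the sketch below. It keeps the two newline-insertion lines as suggested, but folds the six per-level lines into a single expression with a replacer function; that folding is a variation for illustration, not what was suggested or necessarily what the script does. The helper name convertHeadings is hypothetical.

```javascript
function convertHeadings(text) {
  // Steps 1 and 2: make sure each heading starts and ends on its own line.
  text = text.replace(/([^\r\n ])[\t ]*(<H[1-6][^<>]*>)/gim, '$1\n$2');
  text = text.replace(/(<\/H[1-6][^<>]*>)[\t ]*([^\r\n ])/gim, '$1\n$2');
  // Step 3: convert only headings that sit alone on one line. The \2
  // backreference forces the closing tag to have the same level as the
  // opening tag, and [^\r\n]*? refuses headings that contain a newline.
  return text.replace(
    /(^|[\r\n])[\t ]*<H([1-6])[^<>]*>([^\r\n]*?)<\/H\2[\r\n\t ]*>[\t ]*([\r\n]|$)/gim,
    function (match, before, level, title, after) {
      var marks = '='.repeat(Number(level));
      return before + marks + title + marks + after;
    }
  );
}

console.log(JSON.stringify(convertHeadings('this is<h1>a heading</h1>with problems')));
// "this is\n=a heading=\nwith problems"
console.log(convertHeadings('<h3 class="x">Title</h3>'));
// ===Title===
console.log(JSON.stringify(convertHeadings('<h2>a\nb</h2>')));
// "<h2>a\nb</h2>"  (left alone because of the inner newline)
```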

Linkfixes

Hi. I use a Norwegianised version of your script along with the formatter-script, and appreciate it very much. However, the latter can add and remove whitespaces in ways that are not always desirable. I was wondering if it was possible for you to add a fix similar to the link simplifier (simplifies some links e.g. [[Dog|dog]] to [[dog]], [[Dog|dogs]] to [[dog]]s and [[Dog|canine]]s to [[Dog|canines]]) to your script, eliminating my need to use both scripts? Thanks, and keep up the good work. -Helt (talk) 14:39, 27 April 2009

Happy to hear that it's helpful for you! Having a linkfixer is high on my to-do list for this script; I'll leave you a note on your Norwegian talk page when I add it. –Drilnoth (T • C • L) 14:52, 27 April 2009 (UTC)
By the way, I am always making various updates to this script; e.g., it's now possible to start CodeFixer when viewing or editing a page. You may wish to check in for changes like that if they seem helpful. –Drilnoth (T • C • L) 14:54, 27 April 2009 (UTC)
 Done; I actually just copied the RegExp from Formatter (along with the section header name fixes). You can copy it over if you'd like; let me know if there are any errors. –Drilnoth (T • C • L) 15:30, 27 April 2009 (UTC)
Note that what you copied uses 'TurnFirstToLower' from Formatter, so if someone hasn't included that, it probably won't work. A safe option would be to copy that function, then call it something else, just in case someone includes both. See, for example, User:Plastikspork/tools.js. Plastikspork (talk) 16:30, 27 April 2009 (UTC)
Oh, duh. I didn't read the whole code, and it just worked for me because I already have formatter. Thanks for mentioning that; I'll add that function in. –Drilnoth (T • C • L) 19:39, 27 April 2009 (UTC)
FYI, checking no:Bruker:Helt/codefixer.js it appears there are some bugs in that version which were fixed in this version. We should come up with a way for you to include the functions from this code in your code so you don't have to worry about following all the bug fixes and can regionalize it for your own purposes. Plastikspork (talk) 17:59, 27 April 2009 (UTC)
Discussing on my talk page. :) –Drilnoth (T • C • L) 19:39, 27 April 2009 (UTC)
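
For readers unfamiliar with the link simplifier requested at the top of this thread, here is a rough, self-contained sketch of the three example transformations. It is not Formatter's code: the expressions are written from scratch for illustration, and lowerFirst is a made-up stand-in for a helper like 'TurnFirstToLower', given its own name to avoid the collision problem mentioned above.

```javascript
// Hypothetical helper: lower-case the first character only.
function lowerFirst(s) {
  return s.charAt(0).toLowerCase() + s.slice(1);
}

function simplifyLinks(text) {
  // [[Dog|canine]]s -> [[Dog|canines]]: pull trailing letters into the label
  text = text.replace(/\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]([a-z]+)/g, '[[$1|$2$3]]');
  // [[Dog|dog]] -> [[dog]] and [[Dog|dogs]] -> [[dog]]s
  text = text.replace(/\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]/g, function (match, target, label) {
    // The label must start with the target, ignoring case of the first letter
    if (lowerFirst(label).indexOf(lowerFirst(target)) !== 0) return match;
    var suffix = label.slice(target.length);
    // Only plain lower-case letters can live outside the brackets
    if (!/^[a-z]*$/.test(suffix)) return match;
    return '[[' + label.slice(0, target.length) + ']]' + suffix;
  });
  return text;
}

console.log(simplifyLinks('[[Dog|dog]]'));     // [[dog]]
console.log(simplifyLinks('[[Dog|dogs]]'));    // [[dog]]s
console.log(simplifyLinks('[[Dog|canine]]s')); // [[Dog|canines]]
```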

I really appreciate both of your efforts. I do pop in now and again to copy some updates, but I guess I need to copy the whole page from time to time and translate the lot again, as I might miss the odd bugfix when I sew it together bit by bit. I don't know any java, but a few things seem quite intuitive, so I'm able to weed out the bits that don't fit the Norwegian standards. Adding my own fixes, now that requires more than my little brain can handle, so I'll have to ask you. Like the new error from Check Wikipedia, the indented list. Would it be much of a problem changing the :* ::* to ** and *** and so on? -Helt (talk) 18:52, 28 April 2009 (UTC)

That's on my to-do list to add to the script, but first I plan to get WP:AutoEd functioning... it seems kind of counterproductive to keep adding to this and then need to rework it all. –Drilnoth (T • C • L) 19:05, 28 April 2009 (UTC)
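
One possible shape for the indented-list fix requested above, as a hedged sketch only. The function name is invented, and this is not code from CodeFixer or AutoEd; it simply turns each leading colon in front of a bullet into an extra asterisk.

```javascript
// Illustrative only: ":*" -> "**", "::*" -> "***", and so on.
function fixIndentedLists(text) {
  return text.replace(/^(:+)(\*+)/gm, function (match, colons, stars) {
    return '*'.repeat(colons.length) + stars;
  });
}

console.log(fixIndentedLists(':* item'));   // ** item
console.log(fixIndentedLists('::* item'));  // *** item
console.log(fixIndentedLists(': reply'));   // : reply  (plain indents are left alone)
```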

Here's an interesting case you may want to look into: [[text#text|text]] (where all three "text"s are identical). This type of link may occur when an inexperienced user tries to bypass a redirect to a section link by copying from the URL (e.g. if Foo redirects to Bar#Foo, when you click on a link to Foo, the URL becomes http://en.wikipedia.org/wiki/Foo#Foo - the user then copies "Foo#Foo" from the URL and pastes it into the link in the original article, perhaps following the lead from other piped links on that article; thus [[Foo]] becomes [[Foo#Foo|Foo]]). Probably, the best option per WP:R2D is just to replace it with a non-section, non-piped link. For an example, see this edit which fixed one such occurrence - ADV Films redirects to A.D. Vision#ADV Films. ダイノガイ?!」(Dinoguy1000) 19:59, 28 April 2009 (UTC)
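
A sketch of how the case described above might be matched, assuming all three parts are spelled identically. The function name is hypothetical and this is not code from the script.

```javascript
// Illustrative only: [[Foo#Foo|Foo]] -> [[Foo]] when all three parts match.
function fixSelfSectionLinks(text) {
  // \1 requires the section name and the label to repeat the page name exactly
  return text.replace(/\[\[([^\[\]|#]+)#\1\|\1\]\]/g, '[[$1]]');
}

console.log(fixSelfSectionLinks('[[Foo#Foo|Foo]]'));
// [[Foo]]
console.log(fixSelfSectionLinks('[[A.D. Vision#ADV Films|ADV Films]]'));
// [[A.D. Vision#ADV Films|ADV Films]]  (different page name, left alone)
```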

That shouldn't be too hard; I'll try to create it once WP:AutoEd is running. –Drilnoth (T • C • L) 20:07, 28 April 2009 (UTC)
I had a look through your WP:AutoEd-plans, and though I don't understand the tech talk, I do think it sounds like a great idea. I tried to copy the latest version of your script to my codefixer page, and it worked like a charm until I tried to translate it. And I can't, with my previously mentioned small brain, understand where I go wrong, but the 'fix code' tabs just disappear when I purge after translating. So I'll eagerly be awaiting your and Plastikspork's new adventures in Javaland ;) -Helt (talk) 20:28, 28 April 2009 (UTC)
Hmm... interesting. Well, AutoEd should be working in a few days (it actually works now, it just isn't quite ready), at which point I'll also have written some documentation on how to use/customize it. –Drilnoth (T • C • L) 21:05, 28 April 2009 (UTC)

See also

  • FullWidth replacer, another script that uses CodeFixer's framework to replace "fullwidth" characters with normal letters, numbers, and symbols.