User:Dispenser/Automatic linking problems

From Wikipedia, the free encyclopedia
Title URL limitations
  • Control characters (0x00-0x1F) are forbidden
  • Spaces ( , 0x20) are not allowed
  • Double quote (", 0x22) needs to be percent-encoded
  • Number sign (#, 0x23) is not allowed
  • Percent sign (%, 0x25) must be followed by hexadecimal characters, so a title cannot end with it
  • Angle brackets (<>, 0x3C 0x3E) are not allowed
  • Question mark (?, 0x3F) is used as the query string separator, so it needs to be percent-encoded and should have a corresponding redirect
  • Square brackets ([], 0x5B 0x5D) are not allowed
  • Trailing underscores (_, 0x5F) are stripped
  • Braces ({}, 0x7B 0x7D) are not allowed
  • Vertical bar/pipe (|, 0x7C) is not allowed
  • Non-ASCII characters (0x80 and higher) are normally percent-encoded. They are sometimes decoded (e.g. on ruwiki) for ease of use. We will focus on non-Unicode systems.
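
The limitations above can be expressed as a small checker. This is only a minimal Python sketch of the listed rules, not MediaWiki's own title validation; the function name `title_url_problems` and the wording of its messages are made up for illustration, and non-ASCII handling is left out.

```python
import re

# Characters the list above marks as "not allowed" in a title URL.
FORBIDDEN = set(" #<>[]{}|")

def title_url_problems(title):
    """Return a list of problems for `title`, following the rules listed above."""
    problems = []
    if any(ord(ch) < 0x20 for ch in title):
        problems.append("contains a control character")
    bad = sorted(FORBIDDEN & set(title))
    if bad:
        problems.append("contains forbidden characters: " + " ".join(repr(c) for c in bad))
    # A percent sign must be followed by two hexadecimal digits.
    if re.search(r"%(?![0-9A-Fa-f]{2})", title):
        problems.append("percent sign not followed by two hex digits")
    if title.endswith("_"):
        problems.append("trailing underscore will be stripped")
    if '"' in title or "?" in title:
        problems.append("double quote or question mark needs percent-encoding")
    return problems
```

For example, `title_url_problems("100%")` reports the dangling percent sign, while `title_url_problems("Main_Page")` returns an empty list.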

Assuming letters, numbers, underscore, and forward slash (A-Z, a-z, 0-9, _, /) are safe, these are the characters we need to test:

! $ % & ' ( ) * + , - . : ; = ? @ \ ^ ` ~
  1. [0-9A-Za-z\-.:_] are not escaped
  2. [;:@$!*(),/] are converted back in GlobalFunctions.php (as of whenever I wrote this)
! $ ( ) * , - . : ; @ ~
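
The two rules can be tried out against the test characters. This is a rough Python sketch that uses `urllib.parse.quote` as a stand-in for the PHP code in GlobalFunctions.php, so it is an approximation under that assumption and not the actual MediaWiki implementation. The names `TEST_CHARS`, `CONVERTED_BACK`, and `encode_title` are illustrative only. Note that `quote` leaves `~` unescaped only on Python 3.7 and later.

```python
from urllib.parse import quote

# The characters listed above as needing a test.
TEST_CHARS = "!$%&'()*+,-.:;=?@\\^`~"
# Rule 2: characters that are converted back after encoding.
CONVERTED_BACK = ";:@$!*(),/"

def encode_title(title):
    # quote() never escapes letters, digits, or "_.-~"; `safe` adds the rest.
    return quote(title, safe=CONVERTED_BACK)

# Test characters that come through encoding unchanged.
unescaped = [ch for ch in TEST_CHARS if encode_title(ch) == ch]
print(" ".join(unescaped))
```

Under these assumptions the surviving characters are the same ones listed above; everything else, such as `&` or `?`, is percent-encoded.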


-- 2> /dev/null; date; echo '
/* Trailing characters survey
 *
 * License: Public Domain
 * Run time: 20 minutes
 */
SELECT
  RIGHT(page_title, 1) AS "Trailing",
  SUM(page_is_redirect  = 0) AS "Articles",
  SUM(EXISTS (SELECT 1
    FROM page AS rd
    JOIN redirect ON rd.page_id = rd_from
    WHERE rd.page_namespace = 0 AND rd_namespace = 0
    AND rd_title = page.page_title
    AND rd.page_title = LEFT(rd_title, CHAR_LENGTH(rd_title) - 1)
  )) AS "Fixed",
  SUM(page_is_redirect != 0) AS "Redirects",
  CONCAT("[[",REPLACE(page_title,"_"," "),"]]") AS "Example"
FROM page
WHERE page.page_namespace = 0
  AND page.page_title REGEXP ".[!$-.:;=?@\\\\^\\`~]$"
GROUP BY 1
;-- ' | sql enwiki_p > ~/public_html/trails.txt; date
Article + Redirect suffixes usage
Trailing  Articles   Fixed  Redirects  Example
!             5186    1389       7200  !!
$               19       0        148  $$
%               24       5        152  %$CLS%
&                6       1         10  &&
'             2329     177       3255  "Flipnote Studio 3D''
(                0       0         42  )'(
)           595329     643     897217  !!! (Chk Chk Chk)
*               41       6        209  (Z/nZ)*
+              222       8        759  (NH4)+
,                5       4        354  "This Is Our Punk-Rock", Thee Rusted Satellites Gather + Sing,
-               93      16        762  ' or 1==1--
.            17475    1661      57474  "Buster" Collier, Jr.
:                7       1        179  (-:
;                1       0         30  (;
=                4       0         28  !=
?             2549     800       3590  !?
@               16       1         34  $@
\                2       0         41  +/'\
^                1       0         17  A^
`               26       1         63  A`
d           151541      91     218950  !Xam Khomani Heartland
s           577484   10866     841081  !Alarma! Records
~               20       0        452  "Yume" ~Mugen no Kanata~
  • Highlight: Default URLs won't end in these characters.
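
What the survey query counts can be shown on a handful of pages. This is a minimal Python sketch over invented toy data (the `PAGES` dictionary and the `survey` function are hypothetical and are not taken from enwiki): "Articles" and "Redirects" are tallied per trailing character, and a page counts as "Fixed" when its title minus the last character is a redirect pointing back at it.

```python
from collections import defaultdict

# Hypothetical toy data: title -> redirect target (None means an article).
PAGES = {
    "Help!": None,
    "Help": "Help!",   # redirect that covers the stripped title
    "Yahoo!": None,
    "Wham!": "Wham",
    "Wham": None,
    "E.T.": None,
}
SURVEYED = set("!$%&'()*+,-.:;=?@\\^`~")

def survey(pages):
    """Count articles, fixed pages, and redirects per trailing character."""
    rows = defaultdict(lambda: {"Articles": 0, "Fixed": 0, "Redirects": 0})
    for title, target in pages.items():
        # Like the REGEXP: at least one character before the trailing one.
        if len(title) < 2 or title[-1] not in SURVEYED:
            continue
        row = rows[title[-1]]
        if target is None:
            row["Articles"] += 1
        else:
            row["Redirects"] += 1
        # "Fixed": the title minus its last character redirects to this page.
        if pages.get(title[:-1]) == title:
            row["Fixed"] += 1
    return dict(rows)
```

With this toy data, `survey(PAGES)["!"]` gives two articles, one fixed page, and one redirect.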
Recommendations
  • Extend the append implementation from ) and s to also include the ! . ? characters
  • Add a new test that removes the last character picked up by greedy linking
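
Both recommendations can be sketched together. This minimal Python sketch rests on an assumption about what they mean: that "append" means retrying a missing title with one extra trailing character, and that the new test retries with the last character removed. The names `APPEND_CANDIDATES` and `repair_title` and the sample titles are hypothetical.

```python
# Suffixes to try appending: the existing ) and s, plus the proposed ! . ?
APPEND_CANDIDATES = [")", "s", "!", ".", "?"]

def repair_title(title, existing):
    """Return a title from `existing` that `title` most likely meant, or None."""
    if title in existing:
        return title
    # Recommendation 1: the link may have lost its final character.
    for suffix in APPEND_CANDIDATES:
        if title + suffix in existing:
            return title + suffix
    # Recommendation 2: greedy linking may have swallowed one character too many.
    if len(title) > 1 and title[:-1] in existing:
        return title[:-1]
    return None
```

For instance, with `existing = {"Help!", "Mercury (planet)", "Cat"}`, the title `"Help"` resolves to `"Help!"` and `"Cat."` resolves to `"Cat"`. Trying the append step first is a design choice of this sketch, not something the recommendations specify.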
Previous work
/* Stripped punctuation leading to different page
 *
 * License: Public domain
 * Run time:
 */
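SELECT page.page_title AS article_title, rd.page_title AS stripped_title
FROM page
-- rd: an existing redirect whose title is the article title minus its last character
JOIN page AS rd ON rd.page_namespace=0 AND rd.page_is_redirect=1
               AND rd.page_title = LEFT(page.page_title, CHAR_LENGTH(page.page_title)-1)
-- Does that redirect already point at the article?
LEFT JOIN redirect ON rd_from=rd.page_id AND rd_namespace=0 AND rd_title=page.page_title
WHERE page.page_namespace=0
AND page.page_is_redirect=0
AND page.page_title REGEXP ".[!$()*,-.:;?@~]$"
-- Keep only the cases where it points somewhere else
AND rd_from IS NULL
LIMIT 100;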
Resources