User:Dispenser/Automatic linking problems
- Title URL limitations
  - Control characters (0x00 - 0x1F) are forbidden
  - Spaces (0x20)
  - Double quote (", 0x22) needs to be percent encoded
  - Number sign (#, 0x23) is not allowed
  - Percent sign (%, 0x25) needs hexadecimal characters to follow, so a title cannot end with it
  - Angle brackets (<>, 0x3C 0x3E) are not allowed
  - Question mark (?, 0x3F), used as the query string separator, needs to be percent encoded and should have a corresponding redirect
  - Square brackets ([], 0x5B 0x5D) are not allowed
  - Trailing underscores (_, 0x5F) are stripped
  - Braces ({}, 0x7B 0x7D) are not allowed
  - Vertical bar/pipe (|, 0x7C) is not allowed
  - Non-ASCII characters (0x80 and higher) are normally percent encoded. They are sometimes decoded (e.g. on ruwiki) for ease of use. We will focus on non-Unicode systems.
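The limitations above can be sketched as a small checker. This is only an illustration of the rules as listed here, not MediaWiki's own title validation; the function name `title_url_problems` and the message strings are made up for this example, and "two hexadecimal digits" is assumed as the meaning of a valid percent escape.

```python
import re

# Characters the list above describes as "not allowed" in a title.
NOT_ALLOWED = "#<>[]{}|"
# Characters the list above says must be percent-encoded in the URL.
NEEDS_ENCODING = '"?'
# A percent sign that is not followed by two hexadecimal digits (assumption).
BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def title_url_problems(title):
    """Return a note for each listed limitation that the title runs into."""
    problems = []
    if any(ord(ch) <= 0x1F for ch in title):
        problems.append("contains a control character (forbidden)")
    for ch in NOT_ALLOWED:
        if ch in title:
            problems.append(f"contains {ch!r} (not allowed)")
    for ch in NEEDS_ENCODING:
        if ch in title:
            problems.append(f"contains {ch!r} (must be percent-encoded)")
    if BAD_PERCENT.search(title):
        problems.append("percent sign not followed by two hex digits")
    if title.endswith("_"):
        problems.append("trailing underscore (will be stripped)")
    if any(ord(ch) >= 0x80 for ch in title):
        problems.append("non-ASCII character (normally percent-encoded)")
    return problems
```

For instance, `title_url_problems("100%")` reports the dangling percent sign, while `title_url_problems("Main_Page")` returns an empty list.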
Assuming letters, numbers, underscore, and forward slash (A-Z, a-z, 0-9, _, /) are safe, these are the characters we need to test:
! $ % & ' ( ) * + , - . : ; = ? @ \ ^ ` ~
- [0-9A-Za-z\-.:_] are not escaped
- [;:@$!*(),/] are converted back in GlobalFunctions.php (whenever I wrote this)
! $ ( ) * , - . : ; @ ~
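The narrowing from the candidate characters to the shorter list can be sketched by applying the two character classes quoted in the bullets. This is only an illustration: the constant and function names are made up here, and the classes are copied from this page, not from the current GlobalFunctions.php. The result does not contain ~, which is in the list above but in neither of the two classes.

```python
import re

# Candidate characters from the "need to test" list.
CANDIDATES = "! $ % & ' ( ) * + , - . : ; = ? @ \\ ^ ` ~".split()

# The two character classes quoted in the bullets (assumed verbatim).
NOT_ESCAPED = re.compile(r"[0-9A-Za-z\-.:_]")
CONVERTED_BACK = re.compile(r"[;:@$!*(),/]")


def survives_unencoded(ch):
    """True if the character would appear literally in the URL."""
    return bool(NOT_ESCAPED.fullmatch(ch) or CONVERTED_BACK.fullmatch(ch))


survivors = [ch for ch in CANDIDATES if survives_unencoded(ch)]
print(" ".join(survivors))
```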
-- 2> /dev/null; date; echo '
/* Trailing characters survey
*
* License: Public Domain
* Run time: 20 minutes
*/
SELECT
  RIGHT(page_title, 1) AS "Trailing",
  SUM(page_is_redirect = 0) AS "Articles",
  SUM(EXISTS (SELECT 1
    FROM page AS rd
    JOIN redirect ON rd.page_id = rd_from
    WHERE rd.page_namespace = 0 AND rd_namespace = 0
    AND rd_title = page.page_title
    AND rd.page_title = LEFT(rd_title, CHAR_LENGTH(rd_title) - 1)
  )) AS "Fixed",
  SUM(page_is_redirect != 0) AS "Redirects",
  CONCAT("[[",REPLACE(page_title,"_"," "),"]]") AS "Example"
FROM page
WHERE page.page_namespace = 0
  AND page.page_title REGEXP ".[!$-.:;=?@\\\\^\\`~]$"
GROUP BY 1
;-- ' | sql enwiki_p > ~/public_html/trails.txt; date
Trailing | Articles | Fixed | Redirects | Example |
---|---|---|---|---|
! | 5186 | 1389 | 7200 | !! |
$ | 19 | 0 | 148 | $$ |
% | 24 | 5 | 152 | %$CLS% |
& | 6 | 1 | 10 | && |
' | 2329 | 177 | 3255 | "Flipnote Studio 3D'' |
( | 0 | 0 | 42 | )'( |
) | 595329 | 643 | 897217 | !!! (Chk Chk Chk) |
* | 41 | 6 | 209 | (Z/nZ)* |
+ | 222 | 8 | 759 | (NH4)+ |
, | 5 | 4 | 354 | "This Is Our Punk-Rock", Thee Rusted Satellites Gather + Sing, |
- | 93 | 16 | 762 | ' or 1==1-- |
. | 17475 | 1661 | 57474 | "Buster" Collier, Jr. |
: | 7 | 1 | 179 | (-: |
; | 1 | 0 | 30 | (; |
= | 4 | 0 | 28 | != |
? | 2549 | 800 | 3590 | !? |
@ | 16 | 1 | 34 | $@ |
\ | 2 | 0 | 41 | +/'\ |
^ | 1 | 0 | 17 | A^ |
` | 26 | 1 | 63 | A` |
d | 151541 | 91 | 218950 | !Xam Khomani Heartland |
s | 577484 | 10866 | 841081 | !Alarma! Records |
~ | 20 | 0 | 452 | "Yume" ~Mugen no Kanata~ |
- Highlight: The default URL won't end in these characters.
- Recommendations
  - Extend the append implementation from ) and s to include the ! . ? characters
  - Add a new test which removes the last character from greedy linking
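One possible reading of the two recommendations is a candidate generator: when a linked title does not resolve, try appending one of the common trailing characters, then try dropping the last character. The sketch below assumes that reading. `candidate_titles`, `resolve`, and the `exists` lookup are hypothetical names for illustration, not part of any existing tool.

```python
# Characters to try appending: ")" and "s" are described as already handled;
# "!", "." and "?" are the recommended additions.
APPEND_CHARS = [")", "s", "!", ".", "?"]


def candidate_titles(title):
    """Yield alternative titles to test when `title` itself does not resolve."""
    # Recommendation 1: the link may have lost a trailing character.
    for ch in APPEND_CHARS:
        yield title + ch
    # Recommendation 2: greedy linking may have swallowed one character too many.
    if len(title) > 1:
        yield title[:-1]


def resolve(title, exists):
    """Return the first existing title, or None.

    `exists` is a hypothetical callable that reports whether a page exists.
    """
    if exists(title):
        return title
    for candidate in candidate_titles(title):
        if exists(candidate):
            return candidate
    return None
```

With a set of known titles, `resolve("Example,", pages.__contains__)` would fall back to `"Example"` if only that page exists. Appending is tried before trimming here, but the order is a design choice that the recommendations do not specify.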
- Previous work
/* Stripped punctuation leading to different page
*
* License: Public domain
* Run time:
*/
SELECT page.page_title, rd.page_title
FROM page
JOIN page AS rd ON rd.page_namespace=0 AND rd.page_is_redirect=1
  AND rd.page_title = LEFT(page.page_title, CHAR_LENGTH(page.page_title)-1)
LEFT JOIN redirect ON rd_from=rd.page_id AND rd_namespace=0 AND rd_title=page.page_title
WHERE page.page_namespace=0
  AND page.page_is_redirect=0
  AND page.page_title REGEXP ".[!$()*,-.:;?@~]$"
  AND rd_from IS NULL
LIMIT 100;
- Resources