Talk:Percent-encoding

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has not yet received a rating on the project's quality scale or importance scale.

Interpretation of percent-encoded octets

If I have an encoded sequence %7e%7e, how does a decoder know whether it is ~~ or a Unicode character with hex value 7e7e? Xah Lee 03:42, September 10, 2005 (UTC)

That's like asking what the letter "I" represents. It could represent the letter "I", the byte 0x49, the personal pronoun denoting individuality, the Roman numeral one…
At the lexical level, if it is in a URI, then %7E%7E and ~~ mean the same thing: two instances of the tilde character (U+007E) and, simultaneously, two instances of the byte 0x7E. These characters and values, however, may represent almost anything, depending on where in the URI they appear, and why. Maybe each tilde represents byte value 7E, and maybe the pair together is significant. Maybe not. So, you don't know, really. In order to figure it out, you need to know more about the context. How and why was the sequence produced? Is it HTML form data? Was the producer of this sequence following the guidelines of some URI scheme? — mjb 04:56, 10 September 2005 (UTC)
In URLs, UTF-8 is used to encode non-ASCII characters. The binary representation of %7e%7e would be: 01111110 01111110. Both bytes start with a zero, which indicates that an ordinary ASCII character is encoded here. If the sequence represented a multi-byte Unicode character in UTF-8, each byte would have to start with a 1. Since that is not the case, it is interpreted as two single-byte characters.
In short: %7e%7e is not a valid two-byte UTF-8 encoding. -- JonnyJD 23:45, 24 May 2007 (UTC)
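The bit-level argument above can be checked with a short sketch. It assumes Python 3 and its standard urllib.parse module, neither of which is mentioned in this thread; the variable names are illustrative only.

```python
from urllib.parse import unquote, unquote_to_bytes

# Percent-decoding first yields raw bytes, one per %XX triplet.
raw = unquote_to_bytes("%7e%7e")

# Binary form of each byte, to compare with the bit patterns quoted above.
bits = [format(b, "08b") for b in raw]

# A byte below 0x80 has a leading 0 bit, so under UTF-8 it is a complete
# single-byte (ASCII) character and cannot be part of a multi-byte sequence.
all_ascii = all(b < 0x80 for b in raw)

# unquote() decodes the resulting bytes as UTF-8 by default.
text = unquote("%7e%7e")

print(bits, all_ascii, text)
```

Whether UTF-8 is the right interpretation for a given URI is a separate question, as the replies below point out; the sketch only shows what happens if UTF-8 is assumed.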
I thought the answer is a simple "because there is a % in between the two 7e's" because that is the difference between %7e%7e and %7e7e. Daveoh 11:08, 29 July 2007 (UTC)

%7e7e is not a complete encoding. Only %7e is an encoded octet; the second 7e is ordinary text. So %7e7e == ~7e

Even in UTF-8, the bytes get encoded one by one. -- JonnyJD 11:54, 30 July 2007 (UTC)
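A small sketch of both points, assuming Python 3's standard urllib.parse module (an assumption made for illustration; the euro sign is just a convenient three-byte example):

```python
from urllib.parse import quote, unquote

# Only "%7e" is an escape here; the trailing "7e" is literal text.
partial = unquote("%7e7e")

# Two escapes decode to two tildes.
full = unquote("%7e%7e")

# A multi-byte UTF-8 character gets one %XX triplet per byte.
# U+20AC (euro sign) is the three bytes E2 82 AC in UTF-8.
euro = quote("\u20ac")

print(partial, full, euro)
```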

URIs are not generally UTF-8 encoded. Quoting the article: "The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values." IMHO this is wrong. RFC 3986 mandates this only for the host part. The encoding of a URI is generally transparent: each application that generates a URI can interpret its own URIs correctly (that is, apply encoding and decoding consistently). The only interesting point is the behaviour of externally generated URIs, e.g. from browser forms and external applications (such as the Google parameter ie=UTF-8). This behaviour may differ between browsers. --Jenswilke (talk) 09:31, 28 February 2008 (UTC)

From RFC 3986 sec. 2: When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
The paragraph you're taking issue with is paraphrasing that in almost exactly the same terms, so I don't see what the problem is.
The only thing that might need emphasis is that it's new URI schemes that need to do this; the ones that are most ubiquitous (http, mailto, and the application/x-www-form-urlencoded MIME type as defined by XHTML1/HTML4 and lower) aren't affected and thus use arbitrary encodings that vary from app to app. —mjb (talk) 19:27, 28 February 2008 (UTC)
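The three examples in the RFC passage quoted above can be reproduced with a short sketch. It assumes Python 3's urllib.parse.quote, whose default encoding happens to be UTF-8; that is a property of this library, not something the RFC requires of existing schemes.

```python
from urllib.parse import quote

# The three characters named in the RFC 3986 passage.
latin_a = quote("A")              # unreserved, so left as-is
a_grave = quote("\u00c0")         # LATIN CAPITAL LETTER A WITH GRAVE
katakana_a = quote("\u30a2")      # KATAKANA LETTER A

print(latin_a, a_grave, katakana_a)
```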

Added a reference table

Hey all, I just came to this article looking for how to encode a % sign in a URL string... I noticed that the article couldn't tell me, just how to find out (which is arguably more encyclopedic), so I went ahead and added a table. It possibly needs a bit of rewording if it's decided to keep the table; otherwise, if you don't like it, feel free to remove it =) Themania 15:12, 1 March 2007 (UTC)

Isn't the % symbol itself supposed to be mentioned somewhere? Abhijitpai (talk) 13:18, 2 January 2011 (UTC)
It is, in the aforementioned table: the one titled Common characters after percent-encoding. —mjb (talk) 03:29, 3 January 2011 (UTC)
"The characters allowed in a URI are either reserved or unreserved...", "No other characters are allowed in a URI." And I don't see % in either table. Abhijitpai (talk) 12:45, 3 January 2011 (UTC)
It's right there:
"The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding)." This probably could be better phrased, but still.
And the "Common characters after percent-encoding" table is in the character data section.—mjb (talk) 19:01, 4 January 2011 (UTC)

Whitespace

Shouldn't the space character (%20) be listed as well? It is percent-encoded in practice and appears in examples several times in RFC 3986; just read all the paragraphs containing "%20".

I also think the line "No other characters are allowed in an URI." deserves a citation. Daveoh 12:05, 29 July 2007 (UTC)

I agree, I just came to this page to check the value and was surprised it wasn't here. I've added it to the tables. Not sure if there is a better way of representing a blank character?! --Bleveret (talk) 13:17, 22 January 2008 (UTC)
Reverted. Space is not a "reserved" character. As the article and specs explain, a reserved character has a special purpose (usually as some kind of component delimiter) in a URI, so when it appears literally in a URI, it must be used for that special purpose. If it's not used for that purpose, then it has to be represented by a percent-encoded octet (8-bit byte value).
Space is also not an "unreserved" character. That is, it's not one of the very small number of characters that can simply appear in a URI literally. Unreserved characters can optionally be represented by percent-encoded octets.
Not being in either of those special sets, space is one of the very large range of characters that must always be percent-encoded. There's no need to single it out; there are over 100,000 other characters that must also be percent-encoded for the same reason.
As for the "no other characters" ... that's covered in the spec, in several places where it's said that a URI must match the syntax rules. I don't mind adding specific citations, but it's not really up for debate, is it? —mjb (talk) 21:23, 22 January 2008 (UTC)
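The three categories described above (unreserved, reserved, everything else) can be illustrated with a small sketch. It assumes Python 3.7 or later, where urllib.parse.quote never escapes letters, digits, and the characters - . _ ~; the example address is hypothetical.

```python
import string
from urllib.parse import quote

# The RFC 3986 "unreserved" set: ASCII letters, digits, and - . _ ~
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# safe="" exempts nothing beyond quote()'s built-in never-quoted set,
# which coincides with the unreserved set (assuming Python 3.7+).
unchanged = quote(UNRESERVED, safe="")

# Space is in neither set, so it is always percent-encoded.
space = quote(" ", safe="")

# "@" is reserved: encoded when it is plain data...
at_as_data = quote("@", safe="")

# ...but left literal when the caller wants it to act as a delimiter.
at_as_delimiter = quote("user@example.com", safe="@")

print(unchanged == UNRESERVED, space, at_as_data, at_as_delimiter)
```

Whether a reserved character is data or a delimiter is the caller's decision; the library cannot infer it, which is why the safe parameter exists.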
OK, fair enough; however, it still seems odd to me that information regarding %20 (which is probably the most commonly percent-encoded character) barely gets a mention. Would it be OK if I added a section with a helpful reference table of commonly used characters? --Bleveret (talk) 11:12, 8 February 2008 (UTC)
Maybe. It's original research, maybe best sought elsewhere and linked to, but as long as it's correct I'm not going to argue about it. My fear is that people will keep adding to it, or they won't understand that just because the reserved character "@", for example, is listed in the table, that it always has to be percent-encoded as "%40". So you have to make it clear that that's not the case. And it has to be clear that your table assumes an ASCII-based encoding for the non-'reserved', non-'unreserved' characters.
See, in theory, characters that aren't in the reserved or unreserved sets get percent-encoded after being converted to bytes according to any encoding (i.e., whichever one you need to use; the spec doesn't mandate one, although, going forward from Jan 2005, new specs are supposed to stick to UTF-8). In practice, the encoding is almost always a superset of ASCII, like UTF-8 or ISO-8859-1, but not UTF-16LE or UTF-16BE or UTF-32. So, space almost always becomes %20, and the other ASCII-range characters (U+0000 to U+007F) likewise become one percent-encoded byte per character. Meanwhile, characters in the U+0080 to U+10FFFF range (the upper limit of Unicode) get percent-encoded as one or more bytes per character, depending on which encoding was used as the basis: ISO-8859-1 uses a single byte for characters up to U+00FF, whereas UTF-8 needs two to four bytes for anything above U+007F.
Also, since percent-encoding is used in different contexts by different applications, there are differences of opinion over whether and how certain characters are percent-encoded. The table would need to make it clear that it's normal for "+" to be used instead of "%20" in application/x-www-form-urlencoded data, for example, although you may see different behaviors for this in different browsers. mjb (talk) 23:17, 8 February 2008 (UTC)
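Both effects described above, the dependence on the underlying character encoding and the "+" convention for form data, can be seen in a short sketch. It assumes Python 3's urllib.parse; the query key "q" is a made-up example, and browser behavior may differ from what this library does.

```python
from urllib.parse import quote, quote_plus, urlencode

# Same character, different byte encodings, different percent-encoded forms.
utf8_form = quote("\u00e9")                           # e-acute as UTF-8 bytes C3 A9
latin1_form = quote("\u00e9", encoding="iso-8859-1")  # single byte E9

# Generic percent-encoding turns a space into %20.
space_generic = quote("a b")

# Form-style encoding (application/x-www-form-urlencoded) uses "+" instead,
# so a literal plus sign has to be escaped to stay distinguishable.
space_form = quote_plus("a b")
plus_form = quote_plus("a+b")
query = urlencode({"q": "a b"})

print(utf8_form, latin1_form, space_generic, space_form, plus_form, query)
```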

Terminology

From a standards perspective (HTML, JavaScript and ECMAScript) and from the standpoint of programming languages that assist in encoding strings for use as a URL, the proper term (one that is "standard") is URL encoding. Since verifiability is a requirement for entry on Wikipedia, this standard term should be the preferred term as an article entry. The term "Percent-encoding" is not nearly as likely to appear in a search result.

Accepting, as a counterargument to this opinion, search results that link to a location associating these two terms seems irrelevant unless there is also evidence that the association is made in usage. Kernel.package (talk) 20:58, 29 July 2010 (UTC)

Regarding searches, if you type "URL encoding" in the search box on Wikipedia, you go straight to this article. If you type "URL encoding" in the search box on Google, this article is #4 (#6 if you search for "URL encode"), probably only because the higher-ranked ones are more heavily linked-to and/or have the search term in their page titles. So I'm not sure what you mean by "percent-encoding" is not nearly as likely to appear in a search result.
The term "URL encoding" does have a lot of informal traction, but aside from the CGI spec, it actually isn't used much outside of references to the x-www-form-"urlencoded" MIME type. Variations of "escape" (not "percent escape" or "URL escape"...just "escape") are far more common. I think we have a responsibility to use the most correct and accurate terms as defined by the standard most applicable to the subject of the article, STD 66/RFC 3986. If we lend any additional traction to "URL encoding", it's just going to make things more confusing for people, IMHO. Also the 2nd sentence of the lead makes it clear that the "URL" term is common but imprecise on multiple levels. If the article title were changed, that sentence would read as more of a contradiction.
Regarding standards, the first and only standard which actually refers to "URL encoded" is the (IMHO poorly written) CGI spec, which was originally just documentation for people using the NCSA Mosaic browser, where forms were first implemented as an unofficial add-on to the original draft of HTML. The subsequent HTML+ proposal said certain characters needed to be "escaped" but otherwise didn't use the CGI spec's terminology. HTML 2.0 followed with an adaptation of the CGI methodology, again saying "escaped" but also saying the MIME type x-www-form-urlencoded needed to be used (in the transmission to the web server). RFC 1867 and HTML 3.2 both made reference to the MIME type but didn't talk about the procedure at all. Same with RFC 1738 (the current spec's grandparent), although it does say "escape" in the BNF grammar. HTML 4.01 refers to the MIME type and "escaping" as per RFC 1738. RFC 1738's successor, RFC 2396 says "escaping" and "escape sequences" but wasn't referenced by any HTML spec. STD 66, a.k.a. RFC 3986, RFC 2396's successor, calls it "percent-encoding". This is the most current and relevant source, even though it's not referenced by HTML 5, which doesn't use any term other than yet another passing reference to the MIME type, for compatibility. ECMAScript has the functions encodeURI() and escape(), but in both cases only refers to "escaping" and "escape sequences". XHTML 1.0 is in the same boat as HTML 4.01. XHTML 1.1's Forms module doesn't address the issue at all. XForms 1.0 refers to the MIME type and extends XHTML 1.0 by referring to RFC 2396 and giving some guidance for "encoding" troublesome characters. XForms 1.1 has a "urlencoded-post" form submission method written from scratch, but which still only refers to the MIME type.
So, I don't support renaming the article to be any different than the correct, current terminology defined by STD 66/RFC 3986. However I do think we could (even should) have a section in the article which mentions all these other terms (incl. the MIME type) as used by standards as well as informally, with examples of where they're found, if not also some historical context. —mjb (talk) 07:46, 30 July 2010 (UTC)

Hyphen

Why is there a hyphen in the article title? --Kvng (talk) 15:23, 30 September 2010 (UTC)

Because that is how it is officially spelled. See http://tools.ietf.org/html/rfc3986#section-2.1
If it were written without the hyphen, it would be ambiguous as to whether it were the "percent" method of encoding (the correct interpretation) or a method of encoding "percent". —mjb (talk) 03:22, 3 January 2011 (UTC)

Percent Sign

The corresponding German article about URL encoding mentions the percent sign as a reserved character as well. But here I don't see the percent sign in the list. If one reads RFC 3986, the percent sign is indeed not reserved. So it's more of a problem with the German article.

Janburse (talk) 15:33, 24 July 2011 (UTC)

Coverage of "extended" characters needed please

I think that, in the character data section, there should be a more thorough explanation of the significance of UTF-8 percent-encoding; for instance, there are no examples of "higher" characters. How do you encode, say, a diacritic or a kanji? At this point, I don't know, and the article as written will not give me that knowledge either... A.R. — Preceding unsigned comment added by 205.151.118.180 (talk) 20:34, 1 December 2011 (UTC)
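As a rough answer to the question above, here is a minimal hand-written sketch of the procedure: convert the text to bytes (UTF-8 is assumed here, as RFC 3986 recommends for new schemes), then write every byte that is not an unreserved ASCII character as %XX. The function name percent_encode and the sample strings are hypothetical, chosen only for illustration.

```python
# Unreserved characters per RFC 3986: letters, digits, and - . _ ~
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(text: str, encoding: str = "utf-8") -> str:
    """Percent-encode every byte of `text` that is not unreserved."""
    out = []
    for byte in text.encode(encoding):   # step 1: characters -> bytes
        ch = chr(byte)
        if ch in UNRESERVED:             # step 2: unreserved bytes stay literal
            out.append(ch)
        else:                            # step 3: everything else becomes %XX
            out.append(f"%{byte:02X}")
    return "".join(out)


print(percent_encode("\u00e9"))            # e with acute accent (a diacritic)
print(percent_encode("\u6f22"))            # a kanji
print(percent_encode("na\u00efve caf\u00e9"))
```

Note that this sketch escapes reserved characters too, so it treats the whole input as data; a real application would need to decide which delimiters to leave alone.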

Percent encoding ambiguity

The text in the article repeats an ambiguous statement from RFC3986:

"Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI."

The grammar of the sentence makes it unclear whether "that octet" refers back to "%25", or to an octet in "percent-encoded octets", or to "it" or furthest back of all to '("%")'.

I'm pretty certain that the sentence means:

"For a percent ("%") character to appear as stand-alone data (octet value 0x25) in a URI, it must avoid being interpreted as the first character of a percent-encoding. Thus a stand-alone "%" must be encoded as "%25"."

If someone has a reference which backs this up, perhaps we could use this clearer sentence, or some better one? Gwideman (talk) 10:23, 9 February 2012 (UTC)
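The reading proposed above matches how at least one common implementation behaves. The sketch below assumes Python 3's urllib.parse and is offered as an observation about that library, not as the reference the question asks for.

```python
from urllib.parse import quote, unquote

# "%" used as data is written as "%25" so it cannot start an escape.
encoded = quote("100% sure")
decoded = unquote(encoded)

# "%2541" is the data "%41": the "%25" decodes to "%", and "41" is literal.
literal = unquote("%2541")

# Decoding a second time misreads that data as an escape for "A",
# which is why a bare "%" is never allowed to stand for itself.
twice = unquote(unquote("%2541"))

print(encoded, decoded, literal, twice)
```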

It is not a character encoding

It encodes bytes. A code unit like “%C2” represents a byte, not a character. Revert (or do not make) silly edits like [1][2], please. Incnis Mrsi (talk) 16:10, 27 May 2013 (UTC)
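A short sketch of the distinction, assuming Python 3's urllib.parse (the choice of "%C2%A9" as the two-byte example is mine, not from the edits referred to above):

```python
from urllib.parse import unquote, unquote_to_bytes

# Percent-decoding alone yields bytes; "%C2" is one byte, not a character.
one_byte = unquote_to_bytes("%C2")
two_bytes = unquote_to_bytes("%C2%A9")

# Which characters those bytes mean depends on a separate character encoding.
as_utf8 = two_bytes.decode("utf-8")       # one character, U+00A9
as_latin1 = two_bytes.decode("latin-1")   # two characters, U+00C2 U+00A9

# A lone C2 byte is not valid UTF-8, so unquote(), which decodes as UTF-8
# with errors="replace" by default, returns the replacement character.
lone = unquote("%C2")

print(one_byte, two_bytes, as_utf8, as_latin1, repr(lone))
```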