
Shift JIS: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
See talk page re: Removing "UTF-8 is recommended"
Description: "can be used for HTML" is nonsense - anything containing these characters can be, and HTML should ALWAYS be in context of a known char set; the point is that the browser may not really need to know the difference between ASCII and SJIS
Line 14: Line 14:
== Description ==


Shift JIS is based on character sets defined within [[Japanese Industrial Standards|JIS]] standards [[JIS X 0201]]:1997 (for the single-byte characters) and [[JIS X 0208]]:1997 (for the double byte characters). The lead bytes for the double byte characters are "shifted" around the 64 halfwidth [[katakana]] characters in the single-byte range [[JIS X 0201#Encoded Katakana|0xA1 to 0xDF]]. The single-byte characters [[Hexadecimal|0x]]00 to 0x7F match the [[ASCII]] encoding, except for a [[Japanese yen|yen]] sign (U+00A5) at 0x5C and an [[overline]] (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201. Shift JIS can be and is used for HTML since the important start and end of HTML tags and fields, <, >, /, " appear as themselves only, not as a part of a two byte sequence.
Shift JIS is based on character sets defined within [[Japanese Industrial Standards|JIS]] standards [[JIS X 0201]]:1997 (for the single-byte characters) and [[JIS X 0208]]:1997 (for the double byte characters). The lead bytes for the double byte characters are "shifted" around the 64 halfwidth [[katakana]] characters in the single-byte range [[JIS X 0201#Encoded Katakana|0xA1 to 0xDF]]. The single-byte characters [[Hexadecimal|0x]]00 to 0x7F match the [[ASCII]] encoding, except for a [[Japanese yen|yen]] sign (U+00A5) at 0x5C and an [[overline]] (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201. HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, since the important start and end of HTML tags and fields, <, >, /, " are coded by the same single bytes as in ASCII, not as two-byte sequences.


Shift JIS requires an [[8-bit clean]] medium for transmission. It is fully [[backward compatibility|backwards compatible]] with the legacy [[JIS X 0201]] [[single-byte encoding]], meaning it supports [[half-width katakana]] and that any valid JIS X 0201 string is also a valid Shift JIS string. For two-byte characters, however, Shift JIS only guarantees that the first byte will be high bit set (0x80–0xFF); the value of the second byte can be either high or low. Appearance of byte values 0x40–0x7E as second bytes of [[code word]]s makes reliable Shift JIS detection difficult, because same codes are used for ASCII characters. On the other hand, the competing 8-bit format [[Extended Unix Code#EUC-JP|EUC-JP]], which does not support single-byte halfwidth katakana, allows for a much cleaner and direct conversion to and from JIS X 0208 [[code point]]s, as all high bit set bytes are parts of a double-byte character and all codes from ASCII range represent single-byte characters.
Revision as of 21:39, 5 December 2013

Shift JIS
MIME / IANA: Shift_JIS
Language(s): Japanese
Standard: JIS X 0208 Appendix 1

Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.

Description

Shift JIS is based on character sets defined within the JIS standards JIS X 0201:1997 (for the single-byte characters) and JIS X 0208:1997 (for the double-byte characters). The lead bytes for the double-byte characters are "shifted" around the 63 half-width katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201. HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, since the characters that delimit HTML tags and attributes (<, >, /, ") are coded by the same single bytes as in ASCII and never appear as part of a two-byte sequence.
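The single-byte half of the encoding described above can be sketched in a few lines of Python. The helper name `decode_single_byte` is hypothetical, and the mapping of 0xA1–0xDF onto U+FF61–U+FF9F relies on Unicode listing the half-width katakana in the same order as JIS X 0201; this is an illustration, not a replacement for a real codec.

```python
def decode_single_byte(b: int):
    """Return the character for one single-byte Shift JIS code,
    or None if the byte is not a complete single-byte character."""
    if b == 0x5C:
        return "\u00a5"  # yen sign where ASCII has the backslash
    if b == 0x7E:
        return "\u203e"  # overline where ASCII has the tilde
    if b <= 0x7F:
        return chr(b)    # identical to ASCII
    if 0xA1 <= b <= 0xDF:
        # Half-width katakana block, U+FF61..U+FF9F in Unicode.
        return chr(b - 0xA1 + 0xFF61)
    return None          # lead byte of a double-byte character, or unused


# The HTML delimiters keep their ASCII meaning.
html_delimiters_unchanged = all(
    decode_single_byte(ord(ch)) == ch for ch in '<>/"'
)
```

Note that the sketch follows the JIS X 0201 reading of 0x5C and 0x7E; many real-world decoders treat these two bytes as plain ASCII instead.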

Shift JIS requires an 8-bit clean medium for transmission. It is fully backwards compatible with the legacy JIS X 0201 single-byte encoding, meaning that it supports half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS string. For two-byte characters, however, Shift JIS only guarantees that the first byte has its high bit set (0x80–0xFF); the second byte can have its high bit either set or clear. The appearance of byte values 0x40–0x7E as second bytes of code words makes reliable Shift JIS detection difficult, because the same values are used for ASCII characters. By contrast, the competing 8-bit format EUC-JP, which does not support single-byte half-width katakana, allows a much cleaner and more direct conversion to and from JIS X 0208 code points, as all bytes with the high bit set are parts of a double-byte character and all bytes in the ASCII range represent single-byte characters.
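The practical consequence of second bytes falling in 0x40–0x7E is that a plain byte search can match the middle of a double-byte character. The following Python sketch illustrates this; the helper names are hypothetical, and the lead-byte ranges 0x81–0x9F and 0xE0–0xEF are assumed to be those of unextended Shift JIS (vendor extensions use further lead bytes).

```python
def is_lead_byte(b: int) -> bool:
    """True if b starts a double-byte character in unextended Shift JIS."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF


def find_single_byte(data: bytes, target: int) -> int:
    """Index of the first occurrence of `target` as a single-byte
    character, or -1 if it only occurs inside double-byte characters."""
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]):
            i += 2           # skip the lead byte and its second byte
            continue
        if data[i] == target:
            return i
        i += 1
    return -1


# 0x83 0x5C is a double-byte katakana character whose second byte equals
# the single-byte code 0x5C; a real single-byte 0x5C follows it.
data = b"\x83\x5c" + b"\x5c"
naive = data.find(b"\x5c")             # matches inside the double-byte character
aware = find_single_byte(data, 0x5C)   # matches the real single-byte 0x5C
```

Here `naive` is 1 while `aware` is 2. In EUC-JP the same search would need no such scan, because no byte of a double-byte character falls in the ASCII range.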

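For a double-byte JIS sequence j1 j2,[1] the transformation to the corresponding Shift JIS bytes s1 s2 is:

$$
s_1 = \begin{cases}
\left\lfloor \dfrac{j_1 + 1}{2} \right\rfloor + 112 & \text{if } 33 \le j_1 \le 94 \\[1ex]
\left\lfloor \dfrac{j_1 + 1}{2} \right\rfloor + 176 & \text{if } 95 \le j_1 \le 126
\end{cases}
$$

$$
s_2 = \begin{cases}
j_2 + 31 + \left\lfloor \dfrac{j_2}{96} \right\rfloor & \text{if } j_1 \text{ is odd} \\[1ex]
j_2 + 126 & \text{if } j_1 \text{ is even}
\end{cases}
$$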
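As a sketch, the JIS to Shift JIS mapping can be written as integer arithmetic in Python. The function names are hypothetical, and the constants (112, 176, 31, 126) are the commonly documented ones for unextended Shift JIS rather than values taken from this page, so treat the code as an illustration under that assumption. The inverse function is included to show that the mapping is reversible.

```python
def jis_to_sjis(j1: int, j2: int):
    """Map a JIS X 0208 byte pair (each 33..126) to Shift JIS bytes."""
    if not (33 <= j1 <= 126 and 33 <= j2 <= 126):
        raise ValueError("j1 and j2 must each be in the range 33..126")
    # Two JIS rows share one lead byte; rows from 95 upward jump past
    # the half-width katakana range.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2 == 1:
        s2 = j2 + 31 + j2 // 96   # the j2 // 96 term skips 0x7F
    else:
        s2 = j2 + 126
    return s1, s2


def sjis_to_jis(s1: int, s2: int):
    """Inverse of jis_to_sjis for valid double-byte Shift JIS input."""
    half = s1 - (112 if s1 <= 0x9F else 176)   # equals (j1 + 1) // 2
    if s2 >= 0x9F:                             # second byte of an even row
        return half * 2, s2 - 126
    return half * 2 - 1, s2 - 31 - (1 if s2 >= 0x80 else 0)
```

For example, the hiragana letter A is JIS 0x24 0x22, and `jis_to_sjis(0x24, 0x22)` gives `(0x82, 0xA0)`, which matches its Shift JIS encoding.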

Multiple versions

Many different versions of Shift JIS exist.

There are two areas for expansion. Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters there; these are really extensions to JIS X 0208 rather than to Shift JIS itself. The most popular such extension is Windows-31J, otherwise known as Code page 932, popularized by Microsoft, although Microsoft itself does not recognize the Windows-31J name and instead calls that variation "shift_jis". Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208, and this space can be and is used for yet more characters. For example, the space with lead bytes 0xF5 to 0xF9 is used by Japanese mobile phone operators for pictographs for use in e-mail. (KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4.)
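The difference between plain Shift JIS and the Code page 932 extension is visible in software that ships both as separate codecs. The sketch below assumes CPython's bundled `shift_jis` and `cp932` codecs and uses the circled digit one (U+2460) as a sample character that lies in the space JIS X 0208 leaves unassigned; other runtimes may label these variants differently.

```python
text = "\u2460"  # CIRCLED DIGIT ONE, an extension character in Code page 932

# Code page 932 assigns the character a double-byte code.
encoded = text.encode("cp932")

# The plain Shift JIS codec has no code for it.
try:
    text.encode("shift_jis")
    in_plain_shift_jis = True
except UnicodeEncodeError:
    in_plain_shift_jis = False
```

Under these assumptions `encoded` is the two bytes 0x87 0x40 and `in_plain_shift_jis` is `False`, which is why data labelled with one variant but produced with another can fail to round-trip.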

Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used. Microsoft Code Page 932 is registered separately from Shift JIS.

IBM CCSID 943 has the same extensions as Code Page 932. As with most code pages and encodings, it is recommended by Microsoft, Apple, the Unicode Consortium and most major operating system makers that Unicode be used instead.

Shift JIS byte map

The chart below gives the detailed meaning of each byte in a Shift JIS encoded stream.

First byte
0x00–0x1F, 0x7F: Non-printable ASCII character
0x20–0x5B, 0x5D–0x7D: Unaltered ASCII character
0x5C, 0x7E: Modified ASCII character (yen sign and overline)
0xA1–0xDF: Single-byte half-width katakana
0x81–0x9F, 0xE0–0xEF: First byte of a double-byte JIS X 0208 character
0x80, 0xA0, 0xF0–0xFF: Unused as first byte of a JIS X 0208 character

Second byte
0x40–0x7E, 0x80–0x9E: Second byte of a double-byte JIS X 0208 character whose first half of the JIS sequence was odd
0x9F–0xFC: Second byte of a double-byte JIS X 0208 character whose first half of the JIS sequence was even
0x00–0x3F, 0x7F, 0xFD–0xFF: Unused as second byte of a JIS X 0208 character
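The categories in the byte map can be turned into a structural check of a byte stream. The Python sketch below uses a hypothetical function name and assumes the byte ranges of unextended Shift JIS (first bytes 0x81–0x9F and 0xE0–0xEF, second bytes 0x40–0x7E and 0x80–0xFC). It only checks that each byte falls in a permitted category, not that every double-byte code is actually assigned a character.

```python
def is_structurally_valid_sjis(data: bytes) -> bool:
    """Check that data is a sequence of well-formed Shift JIS characters."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F or 0xA1 <= b <= 0xDF:
            i += 1                      # ASCII-range byte or half-width katakana
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            if i + 1 >= len(data):
                return False            # stream ends after a first byte
            t = data[i + 1]
            if not (0x40 <= t <= 0x7E or 0x80 <= t <= 0xFC):
                return False            # not a permitted second byte
            i += 2
        else:
            return False                # unused as first byte
    return True
```

A stream that uses the vendor extensions described earlier (for example lead bytes 0xF5 to 0xF9) is rejected by this check, which is intentional for the unextended encoding but would need widening for those variants.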

References

  1. j1 and j2 are each in the range 33 to 126 inclusive (i.e., 7-bit character values excluding the control characters (0–31 and 127) and space).