Talk:Standard Compression Scheme for Unicode

"1 byte per character"

When it says 1 byte per character (plus overhead) for many text files, that's exactly what it means. As long as they don't use obscure control characters, all text files in ASCII or ISO-8859-{1,5-9,11} use 1 byte per character plus a couple of bytes at the start to set the mode. Not ~1 byte, 1 byte. --Prosfilaes 06:22, 24 March 2006 (UTC)

You're right; I shouldn't have added the tilde. --Steve Summit (talk) 15:45, 24 March 2006 (UTC)
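For anyone who wants to check the "1 byte per character plus a little overhead" claim, here is a minimal sketch in Python. It is not a full SCSU encoder: it only handles text that fits in one default dynamic window plus pass-through ASCII, and it raises an error on anything else. The function name `encode_single_window` is hypothetical, and the window table and tag values are written from my reading of UTS #6, so verify them against the specification before relying on them.

```python
# Sketch only: covers the trivial single-window case, nothing more.

# Control characters that single-byte mode passes through unchanged.
PASS_THROUGH_CONTROLS = {0x00, 0x09, 0x0A, 0x0D}

# Default dynamic window offsets (index = window number), as recalled from UTS #6.
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]


def encode_single_window(text, window=0):
    """Encode text using one default dynamic window and no other SCSU features."""
    offset = DEFAULT_WINDOWS[window]
    out = bytearray()
    if window != 0:
        # Window 0 is active at the start; any other window needs one SCn tag byte.
        out.append(0x10 + window)
    for ch in text:
        cp = ord(ch)
        if 0x20 <= cp <= 0x7F or cp in PASS_THROUGH_CONTROLS:
            out.append(cp)                    # ASCII passes through as itself
        elif offset <= cp < offset + 0x80:
            out.append(0x80 + (cp - offset))  # one byte inside the active window
        else:
            raise ValueError("U+%04X needs features this sketch omits" % cp)
    return bytes(out)
```

Under these assumptions, Latin-1 text comes out byte-for-byte identical to its ISO-8859-1 form with no overhead at all, and purely Cyrillic text costs one tag byte plus one byte per character.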

You are correct about ISO-8859-1, but other common encodings, such as windows-125x and the other parts of ISO-8859, do not map to contiguous Unicode code points. Plugwash 23:16, 29 July 2006 (UTC)

ISO-8859-{1,5-9,11} map to contiguous Unicode code points. I think that's many text files. --Prosfilaes 17:52, 30 July 2006 (UTC)

The upper half of ISO-8859-5 mostly maps to U+0400-U+045F, but a number of positions don't, in particular positions A0, AD, F0 and FD. Plugwash 18:12, 30 July 2006 (UTC)
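The exceptions listed above can be checked mechanically. The following is a small sketch using Python's built-in `iso8859_5` codec; the function name `iso8859_5_exceptions` is hypothetical, and the constant offset is simply taken from position A1 mapping to U+0401.

```python
def iso8859_5_exceptions():
    """Return upper-half byte positions that break the constant-offset mapping."""
    offset = 0x0401 - 0xA1  # byte 0xA1 maps to U+0401 in ISO-8859-5
    exceptions = []
    for b in range(0xA0, 0x100):
        cp = ord(bytes([b]).decode("iso8859_5"))
        if cp != b + offset:
            exceptions.append((b, cp))
    return exceptions


for position, code_point in iso8859_5_exceptions():
    print("0x%02X -> U+%04X" % (position, code_point))
```

The four positions it reports decode to U+00A0, U+00AD, U+2116 and U+00A7 respectively, so three of them sit in the Latin-1 range and one lies outside both blocks.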

External links

It's not entirely clear to me what benefit the link to SIL adds -- it's not to a relevant page, and indeed I can't find anything on SIL's site directly relating to SCSU. Remove the link? --Jkew 00:34, 22 September 2006 (UTC)

Compression vs. Delta-Compression

When multiple compression algorithms are applied, what matters is the total compression, that is, Compression1 + Compression2, not Compression2 alone. Comparing the compression gain on BOCU/SCSU with the gain on UTF-8/UTF-16 is unfair, because an SCSU/BOCU stream is already compressed.[1] Please correct me if I'm wrong. Thanks. 175.157.246.242 (talk) 18:48, 21 October 2015 (UTC)
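One way to look at final sizes rather than gains is to measure the end-to-end output directly. The sketch below is only an illustration of that measurement: Python's standard library has no SCSU or BOCU-1 codec, so it uses UTF-8 and UTF-16 as stand-in encodings, with `zlib` (DEFLATE, the LZ77-based method gzip uses) and `bz2` (Burrows-Wheeler) as the second stage. The function name `compressed_sizes` is hypothetical, and the numbers depend entirely on the sample text.

```python
import bz2
import zlib


def compressed_sizes(text):
    """Return raw and compressed byte counts for each encoding of the text."""
    results = {}
    for encoding in ("utf-8", "utf-16-le"):
        raw = text.encode(encoding)
        results[encoding] = {
            "raw": len(raw),                      # size after stage 1 only
            "zlib": len(zlib.compress(raw, 9)),   # stage 1 + LZ77-style stage 2
            "bz2": len(bz2.compress(raw, 9)),     # stage 1 + block-sorting stage 2
        }
    return results
```

Comparing the "zlib" and "bz2" entries across encodings gives the total size in each case, which is the figure the comment above argues should be compared.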

References

[1] Technical Note 14, Unicode: A survey of Unicode compression, January 30, 2004: "Using bzip2, a compressor which employs the Burrows-Wheeler algorithm, even absurdly inefficient formats, such as representing each character by its full Unicode name (e.g. LATIN CAPITAL LETTER A WITH CIRCUMFLEX), could be reduced to almost the same size as more compact formats. Atkin and Stansifer demonstrate that block-sorting compression techniques eliminate most of the redundancy of the encoding format. The supporting data compares the compressibility of each encoding format with bzip2 and declares a “winner,” which is somewhat misleading since the paper attempts to show that all formats are about equally compressible. (In some cases, the “winner” was only 0.01% smaller than another format!) In contrast, the gzip compression tool, which uses LZ77, generally performed 15% to 25% better on natural-language, small-alphabet text encoded in SCSU or BOCU-1 than on the same text encoded in UTF-16. The authors claim that these differences are not significant, but they can hardly be considered negligible."
