Talk:Prediction by partial matching
D. Scott's infamous bijective nonsense
Some PPM algorithms have the useful property of being able to interpret any collection of bytes as valid compressed input. An algorithm with this property is said to be bijective. Bijectiveness has positive implications for compression efficiency and security because there is no way to distinguish random data from valid output.
This is not terminology in general usage; it is the terminology used (misused?) by one particular person, David A. Scott, who has for years been obsessively describing the importance of "bijective compression" and his work on it on comp.compression.* and sci.crypt.*. Bijective compression works, but it is at best mildly useful. It is not, in the real world, so valuable for encryption that the NSA deliberately suppressed it. When he describes it, he nearly always mentions that it is "more efficient" than 'non-bijective' methods, but he does not mention that the savings usually amount to less than a few bytes. Nor does he acknowledge the major drawback of his method: his "savings" come from eliminating the internal termination of the data stream. Yet the information about where the message stops, where the decoder must stop reading, obviously needs to come from somewhere. Scott has actually argued that every file worth dealing with is located in a file system that keeps track of this information anyway, a claim which is manifestly untrue.
Obviously, the claims for bijectivity do not go so far in this article, but they do overestimate its usefulness with no mentions of its disadvantages. -- Antaeus Feldspar 19:07, 24 Sep 2004 (UTC)
- How widely known are these claims by David A. Scott? Are they published in a real journal (rather than a newsgroup)? If they are not published, they can be removed under the Wikipedia:No original research rule. Andris 20:38, Sep 24, 2004 (UTC)
- Well, they're widely known on comp.compression, only because he won't stop promoting it there. I know that he managed to get an article printed in one of the general-audience programmer magazines, where he claimed to be describing the state of the art and described bijectivity as what all those in the know were doing. As for getting it into any peer-reviewed journal? I'd be flabbergasted. -- Antaeus Feldspar 01:36, 25 Sep 2004 (UTC)
I've pulled the content and the NPOV marker. m.e. 10:35, 8 Oct 2004 (UTC)
See "A Bijective String Sorting Transform" whitepaper July 7, 2009. The intent and purpose of bijectiveness is not to increase compression. Notwithstanding this fact, it does so, and beyond savings accounted for by the elimination of an index pointer. It is the result of Lyndon factorization. The attacks above are entirely unwarranted. Mitchell forman (talk) 22:34, 8 January 2010 (UTC)
Some people think it's a help; if you don't, too bad.
It has appeared in Dr. Dobb's Journal. 22.214.171.124 03:54, 12 October 2005 (UTC) D. Scott
David A. Scott: "Optimal EOF Handling With Arithmetic Compression", Dr. Dobb's Journal, January 2002.
BICOM by Matt Timmermans 
In defense of bijective compression
While i don't think this article/talk page is the best place to discuss bijective compression, i must try and clear up some of the statements made above:
- Some PPM algorithms have the useful property of being able to interpret any collection of bytes as valid compressed input. An algorithm with this property is said to be bijective.
- (I'm only noting this one for completeness.) This statement is misleading, of course: the bijectivity in question doesn't have anything to do with the entropy model (PPM), but is a property of the entropy coder that the model is linked to.
- Bijective compression works, but it is at best mildly useful. It is not, in the real world, so valuable for encryption [...]
- Bijectivity is probably not interesting to the average user, but i think it's more than just "mildly useful" in the context of cryptography. The fact is that bijective compression is the only kind of compression that can be used with encryption without introducing a weakness. (With compressed plaintext, an attacker can always eliminate all keys with ill-formed decryptions from consideration ("ill-formed" meaning "doesn't decompress and recompress back to itself"). Bijective compressors, though, will decompress and recompress anything back to itself (by definition), which means that they keep the entire keyspace as valid and well-formed as it was before; they don't introduce anything for an attacker to leverage.)
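- As a sketch of the round-trip property described above: the functions below are a toy stand-in, not taken from any of the programs discussed here, with the "compressor" deliberately trivial so that the property is easy to verify.

```python
import os

def toy_compress(data: bytes) -> bytes:
    """A trivially bijective 'compressor': add 1 to every byte, mod 256."""
    return bytes((b + 1) % 256 for b in data)

def toy_decompress(data: bytes) -> bytes:
    """The inverse mapping: subtract 1 from every byte, mod 256."""
    return bytes((b - 1) % 256 for b in data)

# Key property: EVERY byte string is a valid "compressed" input, and
# decompressing then recompressing returns it unchanged. An attacker
# therefore cannot reject a candidate key on the grounds that the
# decrypted data fails to decompress back to itself.
for _ in range(1000):
    blob = os.urandom(32)  # stands in for an arbitrary trial decryption
    assert toy_compress(toy_decompress(blob)) == blob
```

- A real bijective coder (BICOM, say) does actual compression while keeping this same round-trip guarantee; the toy mapping above only keeps the guarantee.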
- Of course, in practice, all of that only matters to the extent that you actually want to compress your plaintext in the first place. This is a more complicated issue, i think, but the sci.crypt FAQ has this to say (8.5 "How do I use compression with encryption?"):
- Compression aids encryption by reducing the redundancy of the plaintext. This increases the amount of ciphertext you can send encrypted under a given number of key bits. (See "unicity distance")
- Compression is generally of value, however, because it removes other known plaintext in the middle of the file being encrypted. In general, the lower the redundancy of the plaintext being fed an encryption algorithm, the more difficult the cryptanalysis of that algorithm.
- In addition, compression shortens the input file, shortening the output file and reducing the amount of CPU required to do the encryption algorithm, so even if there were no enhancement of security, compression before encryption would be worthwhile.
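- The FAQ's unicity-distance point can be made concrete with Shannon's formula U = H(K) / D, where H(K) is the key entropy in bits and D is the plaintext redundancy in bits per character. The numbers below are standard textbook estimates, not measurements:

```python
import math

key_bits = 128                  # e.g. a 128-bit cipher key
log2_26 = math.log2(26)         # about 4.70 bits/char if letters were uniform
h_english = 1.5                 # estimated entropy of English, bits/char

d_plain = log2_26 - h_english   # redundancy of raw English text
u_plain = key_bits / d_plain    # ciphertext needed to pin down the key

# Compression drives the redundancy D toward zero, so the unicity
# distance grows: more ciphertext can be sent under the same key before
# the key is, in principle, uniquely determined.
d_compressed = 0.1              # assumed residual redundancy after compression
u_compressed = key_bits / d_compressed

print(round(u_plain), round(u_compressed))  # roughly 40 vs 1280 characters
```
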
- Nor does he acknowledge the major drawback of his method: his "savings" comes from eliminating the internal termination of the data stream. Yet the information of where the message stops, where the decoder must stop reading, obviously needs to come from somewhere.
- I'm not familiar with "his method", but this is definitely not true of bijective encoding in general. For an example, take a look at Matt Timmermans's bijective implementation of arithmetic coding, which is self-terminating (very efficiently so).
- --Piet Delport 06:48, 11 March 2006 (UTC)

ZipSMS now uses PPM to compress SMS messages on modern mobile phones.
In the article it says: "...among the best-performing lossless compression programs for English text." Certainly the algorithm does not care what language the text it compresses is in. :) It should be changed to something more general, but I'm not sure which definition would fit exactly here. If it compresses English text well, then it is only logical that it would compress any other European language well too. But I cannot be sure it is equally efficient at compressing complex-script languages like the Chinese languages. It might be the case, but since I don't know the exact details, and cannot be sure whether there are some other language peculiarities that would make PPM behave differently, I am unsure what to write there instead of the existing sentence. Any suggestions from someone more knowledgeable of this matter? --Arny 15:59, 21 March 2006 (UTC)
- PPM should be equally efficient on any natural language text, regardless of the script, because they all exhibit the kind of contextual redundancy that PPM was designed to exploit. IOW, PPM doesn't really care what the "symbols" actually are, as long as the same symbols tend to occur in localized patterns.
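- The symbol-agnostic context modelling described above can be sketched as follows. This is only the counting half of PPM (a real coder also escapes to shorter contexts and feeds the counts to an arithmetic coder), and the example string is made up:

```python
from collections import defaultdict, Counter

def build_model(data: bytes, order: int = 2):
    """Count symbol frequencies per preceding context of length `order`.

    Nothing here depends on what the symbols are: bytes of English,
    Chinese, or C# source are all handled identically, which is why
    PPM is language-agnostic.
    """
    model = defaultdict(Counter)
    for i in range(order, len(data)):
        context = data[i - order:i]
        model[context][data[i]] += 1
    return model

def predict(model, context: bytes):
    """Return the most frequent symbol seen after `context`, or None."""
    counts = model.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

model = build_model(b"the cat sat on the mat, the cat sat")
# After the context "th", this text has only ever continued with "e":
assert predict(model, b"th") == ord("e")
```
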
- I've changed "English" to "natural language". --Piet Delport 15:48, 22 March 2006 (UTC)
- Actually, I've tested PPMD with computer languages. Today, I compressed a project consisting of C# code, XSD (XML Schema Definition), metadata, some PNGs and icons. It compressed 5% better than LZMA when doing a solid archive, and 10% better when doing a solid archive per file type (so one dictionary would be used for *.cs files, another for *.xsd files, etc.). LZMA in both cases was done with a solid archive. I did another test with PPMD on Windows 2003 server logs, and I can't remember the *exact* amount, but it was shockingly smaller. These tests were done using the program 7-Zip. I think PPMD excels at MOST kinds of repetitive text, regardless of kind, compared to LZMA.--Cnadolski 17:39, 21 December 2006 (UTC)