Talk:JBIG2

The link to "JBIG2 primer" seems to be dead; should it be replaced by http://web.archive.org/web/20080109065554/http://jbig2.com/index.html ?

Major flaw of article: no talk about file formats

It is just briefly noted that PDF files "may contain JBIG2 compressed data". But other than that, the reader of this article is left clueless as to what other file formats support JBIG2 and whether there is something like a native file format for JBIG2. And, on the contrary, if PDF is currently the only file format that can be used to store JBIG2-compressed data, then it would be very helpful for the reader to mention that fact as well. --boarders paradise (talk) 13:28, 6 September 2009 (UTC)[reply]

Removed "Disadvantages" section

I removed the "Disadvantages" section. While it was well written and had a citation, it does not belong on this page. The description of the edit was "Algorithm can cause corruption of text", which is true of all lossy compression algorithms. But the section about how such corruption could cause incorrect dosages of medicine to be delivered sounded like a bad TV news headline. One could say the same about any kind of data corruption; it could also cause nuclear plants to melt down. If someone applied OCR to a lossy compressed image and then used that to give a dosage of medicine, that would be a worthy news headline, but not a worthy Wikipedia article.

Last comment -- Just because you read it on Slashdot doesn't mean you should put it on Wikipedia. — Preceding unsigned comment added by 72.37.171.52 (talk) 13:48, 6 August 2013 (UTC)[reply]

Your edit is vandalism section-blanking and has deleted valid information. JBIG2 is not the same as "all lossy compression algorithms". JPEG blurs things; it doesn't put characters through OCR and inadvertently replace a few of them with different characters to alter the meaning of text. Key distinction, and all properly sourced to WP:RS. Furthermore, you attempted to remove the entire section, not just the one mention of transmitting medical information. Please do not remove valid, sourced information from articles. K7L (talk) 13:57, 6 August 2013 (UTC)[reply]
Do not accuse people acting in good faith of "vandalism"; see WP:NOTVAND. Clearly the OP here had good intentions; (s)he even bothered to write up a rationale for the removal.
I agree that this fact should be kept, but I decided to rewrite it. I clarified that this only happens in JBIG2's "lossy" compression mode and removed the synthesis about "structures being built not to specification or incorrect dosage". Do you think this is better? -- intgr [talk] 15:46, 6 August 2013 (UTC)[reply]
Yes, much better. The blueprints with the wrong numbers weren't WP:OR or synthesis, as that is in the Beeb piece, but the key point is that OCR can replace one textual character with a different character, altering the meaning of documents. That is a serious limitation of using such an aggressive approach to data compression and needs to be in the article. K7L (talk) 15:51, 6 August 2013 (UTC)[reply]
I did keep this in the last sentence: "where numbers written on blueprints were altered". I think the original statement "consequences such as structures being built not to specification" is a bit much -- in this case that didn't happen, as the corruption was discovered.
Do you want to suggest another way to phrase it? -- intgr [talk] 16:04, 6 August 2013 (UTC)[reply]
This looks good. K7L (talk) 16:08, 6 August 2013 (UTC)[reply]
I was the anon who removed it. The new one is well written and addresses my concerns. Sorry if I was overzealous. — Preceding unsigned comment added by 96.244.74.190 (talk) 14:37, 10 August 2013 (UTC)[reply]
Unfortunately, the text was confused: David Kriesel clearly says that it was noticed in a construction document and then found to occur elsewhere, but there is no specific correlation to document types, other than that all the examples shown involve number substitution. Specifically, “the issue gets even more dangerous if life-important documents are scanned, like medical prescriptions” does not imply that copying of such documents has been tested, only that it is believed they could be altered as described, with speculation about the effects.
Regarding consequences of incorrect blueprints – I think it is likely that errors will be detected, although when is uncertain, and there is potential for extra materials and/or transport costs to be incurred. Also, what happens if the errors occur such that the resulting document is internally consistent? Dsalt (talk) 14:43, 11 August 2013 (UTC)[reply]
I agree, I would remove the speculation about medical documents entirely, but somehow it always finds its way back into the article. -- intgr [talk] 07:34, 12 August 2013 (UTC)[reply]
External links modified

Hello fellow Wikipedians,

I have just modified 3 external links on JBIG2. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FAQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 20:07, 18 November 2017 (UTC)[reply]

Background: Xerox vs JBIG2

Some people assume the Xerox implementation problem is an inherent characteristic of JBIG2, but it's not.

In practically all compression algorithms and formats – for general data, images, audio, or video – the specifications don't strictly dictate or enforce how you compress or encode. Specs generally define the constraints the compressed output data must satisfy, and how a decoder should interpret that data.

Consequently, different encoders implementing the same format can achieve better or worse results in terms of compression efficiency or quality. For examples, see comparisons of encoders for general data in ZIP, PNG images, JPEG images, MP3 audio, and H.264 video.
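
To make that point concrete, here is a minimal Python sketch using DEFLATE (the algorithm behind ZIP) via the standard zlib module; the sample data is made up, but the pattern holds for any input: different encoder settings produce different sizes, yet one decoder reconstructs the identical original from all of them.

    import zlib

    # Made-up sample input; any data will do.
    data = b"JBIG2 stores each symbol's bitmap only once. " * 1000

    for level in (1, 6, 9):
        compressed = zlib.compress(data, level)
        # Sizes differ with encoder effort, but decoding is bit-identical:
        assert zlib.decompress(compressed) == data
        print(f"level {level}: {len(compressed)} bytes")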

In JBIG2's more efficient mode for text, the encoder's job is to recognize symbols that look the same or similar enough, and to store each symbol's graphic representation only once (in practice this can be helped by an OCR engine, of which not all are equally accurate, but that's entirely optional). It's up to the encoder to decide which symbols look similar enough; if it decides they're dissimilar, they can be treated as separate symbols. Alternatively, the encoder can switch to a lossless mode, where each instance of a symbol can include extra "refinement" data that is combined with the stored-once generic symbol to reproduce the source exactly. Each encoder implementation is free to use different criteria and logic to make these decisions.
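
For illustration, here's a rough Python sketch of the pattern-matching-and-substitution idea; the XOR pixel-count metric and the threshold are my own illustrative assumptions, not anything the spec mandates:

    def xor_distance(a, b):
        """Number of differing pixels between two equal-sized 1-bit bitmaps,
        each given as a list of rows of 0/1 values."""
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

    def classify_symbols(symbols, threshold):
        """Group symbol bitmaps into classes; each class's bitmap is stored once.
        A looser (higher) threshold compresses better but risks merging
        genuinely different characters."""
        classes = []      # one representative bitmap per class
        assignments = []  # class index chosen for each input symbol
        for bitmap in symbols:
            for i, rep in enumerate(classes):
                if xor_distance(bitmap, rep) <= threshold:
                    assignments.append(i)  # "similar enough": reuse class i
                    break
            else:
                classes.append(bitmap)     # no match: store a new class
                assignments.append(len(classes) - 1)
        return classes, assignments

A lossless encoder would effectively use a threshold of 0 here, or attach refinement data to patch up the remaining per-instance differences.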

Xerox is neither the creator of the JBIG2 format nor its custodian. The "white paper" they released is simply a loosely worded PR explainer. They don't even claim that it's JBIG2's fault, but some people seem to interpret it as such. The problem in Xerox's scanners is just a problem in their specific implementation. galenIgh 16:21, 4 March 2024 (UTC)[reply]

Before we begin, I feel I must say this...
Character munging may not be an inherent risk to the JBIG2 format per se but based on what I've read -- including the very citation you added -- it *is* an inherent risk to JBIG2's lossy "Pattern Matching & Substitution" mode. Xerox may have implemented PM&S poorly but it is a part of the JBIG2 standard, and due to the nature of how it works, PM&S has a dangerous failure mode compared to other compression schemes. It is thus at least partially JBIG2's fault. Or more accurately & acutely, lossy JBIG2 compression is to blame.

I'm glad you added a citation to back up these claims (which is unconditionally a good reference for the article), but I still have some complaints with the edit:
  1. I suggest rewording the relevant paragraph's first sentence to "Some implementations of JBIG2 using lossy compression can potentially alter the characters in documents that are scanned to PDF."
    • The phrasing "Faulty implementations" implies that the problem arose due to an incorrect implementation -- but a compliant implementation of a spec is a compliant implementation, no matter how bad.
    • The phrasing "were reported to alter" as opposed to "can potentially alter" is also troublesome, as "reported" can be misinterpreted as "allegedly".
  2. As far as I can tell, "Mosquito Noise" is specific to video compression, and therefore warrants no mention in relation to image compression.
  3. You quote Kriesel in the article as saying "The error cause is not JBIG2 itself. Because of a software bug, loss of information was introduced where none should have been" (emphasis added), but both the "white paper" and Kriesel make reference to a Xerox compression "Normal" setting and "Higher" setting, wherein choosing "Higher" eliminated occurrences of erroneous character substitution in practice, but according to Xerox PM&S was seemingly still in use on "Higher" and even "Highest".[1] What is still not clear to me from reading those citations is whether the use of lossy PM&S in these other modes was unintentional (an actual bug) or not (simply bad design). What I'm trying to say is, I'm not sure whether Kriesel was being literal or somewhat hyperbolic with the quote "Because of a software bug". So I would maybe remove that quote for now.
I should hope suggestions 1 and 2 are agreeable (I ask here first only to avoid even the appearance of an edit war), but obviously 3 needs to be discussed more.
So Done With Software,
99.146.242.37 (talk) 19:26, 4 March 2024 (UTC)[reply]
Yeah, good to discuss. Thanks.
I wouldn't call it an "inherent risk to JBIG2's lossy mode". That's a bit like saying computers inherently distort information or humans inherently misinterpret characters. Or maybe more specifically, like saying wrong characters are inherent to OCR, or low quality is inherent to lossy codecs. Perhaps a better summary: PM&S, when not implemented well, makes the problem possible. But maybe that sounds alarmist, like one should expect JBIG2 to be implemented badly by default.
  1. Perhaps "poor" instead of "faulty"? And true, "reported" might be misinterpreted, so "can [or might?] potentially alter" is indeed better.
  2. Mosquito noise also happens in still images. Supposedly an artifact of DCT.
  3. I don't know what Xerox's implementation does in general or per quality setting. Someone would have to analyze the PDF or JBIG2 data. I'm guessing it also includes lowering the resolution, and small characters at 150 dpi could be difficult for poor algorithms to identify properly. It might also switch to CCITT G4 (less compression but still pretty good). But I don't think it really matters. Xerox's solution, for time or cost reasons, was the quick and dirty one: just dodge the problem altogether at the expense of compression. That's their prerogative but it doesn't reflect on JBIG2, which is the main subject here.

    If you feel it's better to minimize, though Kriesel's quote seems pretty good considering he's an authority of sorts here, how about removing the "software bug" part and leaving only "the error cause is not JBIG2 itself"?

  4. By the way, while JBIG2's major feature hinges on deciding which symbols look effectively the same, the mechanism for that decision isn't a part of JBIG2. Even if there was a reference implementation from the original authors (no idea), it would be just as JBIG2-compliant to classify symbols by any other means: classical techniques, or a human sitting and doing it manually, or sending each symbol to CAPTCHAs on the internet. And nowadays modern machine learning can be just as good as humans or better (try Google's image translate on a line of text where only the top or bottom 1/3 is visible).
galenIgh 21:20, 4 March 2024 (UTC)[reply]
You've misunderstood my statement that Character munging [...] is an inherent risk to JBIG2's lossy "Pattern Matching & Substitution" mode:
I said that it's a risk inherent to "JBIG2's lossy "Pattern Matching & Substitution" mode", emphasis on the PM&S. Not that it's a risk inherent to JBIG2 lossy mode. Apparently the less dangerous SPM compression is also used for lossy encodings (I previously thought it was only for lossless encodings), but that doesn't change how PM&S works.
And it is not like saying "computers inherently distort data"... it is closer to saying "lossy compression inherently distorts data". Which is objectively true: lossy codecs alter & distort data to compress it better. Usually the data is visually/audibly identical, or at least semantically equivalent, and when it's not, the compression artifacts make the excessive distortion obvious. Which is why JBIG2's PM&S is problematic, as it can semantically alter the data without any obvious trace of error.
Most importantly, you are completely ignoring the keyword "risk": I did not say that character munging is inherent to JBIG2 using PM&S and will necessarily happen if you use it; I said that it is a risk inherent to using it, i.e. a particularly bad failure mode of JBIG2 PM&S (which is unique, or unusual, among compression methods).
It is not alarmist to say that PM&S allows character munging to happen, and that this is a risk inherent to it, any more than it is alarmist to say that the C language allows memory mismanagement (mismanagement allowing various exploits), and that this is a risk inherent to managing memory yourself.
Analogous to how C programmers should avoid invoking undefined behavior, users of scanners that use JBIG2 should be careful that their documents are not munged (one simple solution being to not use JBIG2), and implementers of JBIG2 should ensure they minimize incorrect pattern substitutions (which Xerox, apparently, did not).
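
To put a number on that failure mode, a toy demonstration in Python (the 5x5 bitmaps are made up; real symbols are much larger): two glyphs that differ in a single pixel collapse into one class under any nonzero matching threshold, so every instance of one character gets rendered with the other's bitmap, with no visible artifact to flag the substitution.

    # Two made-up 5x5 glyphs differing in exactly one pixel.
    glyph_a = [[0, 1, 1, 1, 0],
               [1, 0, 0, 0, 0],
               [1, 1, 1, 1, 0],
               [1, 0, 0, 0, 1],
               [0, 1, 1, 1, 0]]
    glyph_b = [row[:] for row in glyph_a]
    glyph_b[1][4] = 1  # the lone differing pixel

    diff = sum(pa != pb for ra, rb in zip(glyph_a, glyph_b)
                        for pa, pb in zip(ra, rb))
    print(diff)  # 1 -- any matching threshold >= 1 treats these as the same
                 # symbol, so one glyph silently replaces the other everywhere.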

Anyway, to address the content phrasing:
  1. I still feel "some" is more neutral than "poor" but "poor" is better than "faulty". I'm just going to make the edit with "some" and you can change that to "poor" if you like.
  2. Yeah, I see now you're right about Mosquito noise. Not to mention I forgot about next-gen video-codec-based image formats such as AVIF.
  3. Just "the error cause is not JBIG2 itself" by itself reads rather oddly, so I'm almost tempted to say leave the whole quote as is, but since I'm still doubtful about whether it's actually a "bug" or not, I'll still say it's better to minimize the quote. I'll leave that choice and actual edit up to you, though.
    • Also, after learning that SPM is also used in practice for lossy encodings, I think it may be worth investigating which Xerox setting modes used SPM vs. PM&S. For that matter, Xerox's patch only disabled PM&S, and not SPM, right? Or was it just disabling an internal method used to match the patterns used in PM&S and SPM? Ugh, this is all confusing... You were right, that "white paper" is loosey-goosey PR gunk. A suitable level of detail for CBS News perhaps, but not for Wikipedia.
Xerox makes me wanna take a Xanax,
99.146.242.37 (talk), 06:17, 5 March 2024 (UTC)[reply]
Well, I minimized the quote. Don't know if it seems odd, but better than leave people with the wrong impression. Anyone who wants to know more can follow the ref.
The impression I had was that SPM is lossless. Maybe you can mix modes inside a single page, not sure.
If you have a Xerox scanner and really want to explore what it does, there are PDF dissectors around (or even qpdf) but analyzing JBIG2 streams will be more difficult. For what it's worth, I encountered wrong characters in low-DPI documents from a non-Xerox device. galenIgh 13:53, 5 March 2024 (UTC)[reply]
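
If anyone wants to do that kind of triage without a full dissector, the first step can be scripted; this sketch assumes the third-party pikepdf library and a placeholder filename (my suggestion, not something from the sources above):

    import pikepdf  # third-party; pip install pikepdf

    with pikepdf.open("scan.pdf") as pdf:  # placeholder path
        for pageno, page in enumerate(pdf.pages, start=1):
            for name, image in page.images.items():
                filters = image.get("/Filter")
                # /Filter may be a single name or an array of names.
                if filters is not None and "JBIG2Decode" in str(filters):
                    print(f"page {pageno}: image {name} uses JBIG2Decode")

This only finds which images are JBIG2-encoded; inspecting the symbol dictionaries inside the streams themselves would still need a dedicated JBIG2 tool.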

References

  1. ^ Scanning and Compression White Paper by Xerox: "Errors most often occur when scanning stress documents, and when “Normal Quality” is selected. It is possible, however unlikely, for errors to occur in the other modes."