Wikipedia:Reference desk/Archives/Computing/2024 June 6

Computing desk
Welcome to the Wikipedia Computing Reference Desk Archives
The page you are currently viewing is a transcluded archive page. While you can leave answers for any questions shown below, please ask new questions on one of the current reference desk pages.


June 6

Pdf

Ho! If I shrink a PDF with Acrobat, I can get it down by, say, 60%, but if I then want to OCR it, the size goes up to be even more massive than it was before. Is there a way to avoid this, keeping it smallish but also with text recognition? Thank you 2.28.124.7 (talk) 10:40, 6 June 2024 (UTC)

I am not familiar with the use of Acrobat and am not sure what you mean by "shrinking" a pdf with Acrobat.
Some apps, such as PDFpen, can OCR a bitmap and turn it into a searchable PDF.[1] The output is not much larger than the input – the blow-up in size occurs in the other direction, when a PDF produced by a word processing app is converted into a bitmap. PDFpen is not free; I do not know if there are free apps for this. --Lambiam 19:35, 6 June 2024 (UTC)
A scanned PDF is, in essence, a PDF container holding a series of high-resolution bitmaps (typically JPEGs), one per page. A typical OCR-annotation program extracts each image, runs optical character recognition on it, and then adds PDF text objects behind the image (so the text is selectable and copyable, but not visible). Those text additions are trivial - typically a few KB per page at most.
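One way to check whether a given file already carries such a hidden text layer is to try extracting text from it. The following is only a rough sketch: it assumes Poppler's `pdftotext` is installed (that tool is not mentioned in this thread), and the helper names `count_visible_chars`, `has_text_layer` and `describe_pdf` are made up for illustration.

```sh
#!/bin/sh
# Sketch: tell whether a PDF carries a (possibly invisible) text layer.
# Assumes Poppler's pdftotext is on the PATH; helper names are illustrative.

# Count the non-whitespace characters arriving on stdin.
count_visible_chars() {
    tr -d '[:space:]' | wc -c | tr -d ' '
}

# usage: has_text_layer file.pdf
# Exit status 0 if any text could be extracted, non-zero otherwise.
has_text_layer() {
    n=$(pdftotext "$1" - 2>/dev/null | count_visible_chars)
    [ "${n:-0}" -gt 0 ]
}

# usage: describe_pdf file.pdf
describe_pdf() {
    if has_text_layer "$1"; then
        echo "$1: text layer present (searchable)"
    else
        echo "$1: no extractable text (image-only scan, or unreadable)"
    fi
}
```

Running `describe_pdf` on a file before and after OCR should show the difference. Poppler's `pdfimages -list file.pdf` can likewise be used to list the embedded page images with their encoding and resolution.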
Your problem is twofold - you want to a) downscale the JPEGs and b) add the OCR annotations. These are effectively orthogonal tasks. I've no idea how you're getting such poor results with the Acrobat workflow. But I can do what you want with Ghostscript and then ocrmypdf (which uses Tesseract). All of this is free software. For me, in Linux, it's as easy as:
QUALITY=/ebook  # use one one of /screen /ebook /printer /prepress /default  # /screen is very low resolution, /prepress is the highest

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=$QUALITY -dNOPAUSE -dBATCH  -dQUIET -sOutputFile=scaled.pdf test.pdf

ocrmypdf scaled.pdf ocred.pdf
For me, this takes a 2.5 MB scanned PDF, test.pdf, and the Ghostscript (gs) line scales it down to 178 KB. The ocrmypdf command takes that and produces a 181 KB file (a modest addition consistent with the text on that page).
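The two commands above could be wrapped into one small script along these lines. This is a sketch only, assuming a POSIX-style shell with `gs`, `ocrmypdf` and `mktemp -d` available; the function names `valid_preset`, `percent_saved` and `shrink_and_ocr` are illustrative and not part of either tool.

```sh
#!/bin/sh
# Sketch: downscale a scanned PDF with Ghostscript, then add an OCR text layer.
# Function names are illustrative, not part of gs or ocrmypdf.

# Succeeds only for the -dPDFSETTINGS presets listed above.
valid_preset() {
    case "$1" in
        /screen|/ebook|/printer|/prepress|/default) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the whole-number percentage saved going from size $1 to size $2
# (same unit for both; $1 must be non-zero).
percent_saved() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

# usage: shrink_and_ocr input.pdf output.pdf [preset]
shrink_and_ocr() {
    src=$1
    dst=$2
    preset=${3:-/ebook}
    valid_preset "$preset" || { echo "unknown preset: $preset" >&2; return 2; }
    tmpdir=$(mktemp -d) || return 1
    # Step 1: recompress the page images; step 2: add the hidden text layer.
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 "-dPDFSETTINGS=$preset" \
       -dNOPAUSE -dBATCH -dQUIET "-sOutputFile=$tmpdir/scaled.pdf" "$src" &&
        ocrmypdf "$tmpdir/scaled.pdf" "$dst"
    status=$?
    rm -rf "$tmpdir"
    if [ "$status" -eq 0 ]; then
        before=$(( $(wc -c < "$src") ))
        after=$(( $(wc -c < "$dst") ))
        echo "saved $(percent_saved "$before" "$after")% ($before -> $after bytes)"
    fi
    return "$status"
}
```

After sourcing the file, `shrink_and_ocr test.pdf ocred.pdf /ebook` would run both steps and report the saving. For the sizes quoted above (roughly 2500 KB down to 178 KB), `percent_saved 2500 178` prints 92.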
I've no idea how to do any of this with Acrobat. -- Finlay McWalter··–·Talk 20:20, 6 June 2024 (UTC)
@Finlay McWalter: Cheers! When I said shrink, yeah, I meant 'compress'. I'll try copying what you've put up here into GS for Win and then stare blankly when, of course, nothing happens except little squares appearing. H'mmm. Your code above, are there meant to be 2X 'one' in the first line? Thanks again! --Serial Number 54129 13:26, 10 June 2024 (UTC)
No, that was a typo; there should be only one 'one'. I've no idea about Ghostscript on Windows, I'm afraid. -- Finlay McWalter··–·Talk 14:53, 10 June 2024 (UTC)
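One detail that may help on Windows: the Windows builds of Ghostscript name the console executable `gswin64c` (or `gswin32c` for 32-bit) rather than `gs`. Below is a minimal sketch for picking whichever name is installed. It assumes a POSIX-style shell on Windows such as Git Bash (not something mentioned in this thread), and the helper name `first_available` is made up for illustration.

```sh
#!/bin/sh
# Sketch: print the first of the given command names that exists on the PATH.
# The helper name is illustrative; it is not part of Ghostscript.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}
```

With that defined, `GS=$(first_available gs gswin64c gswin32c)` would select the executable, and `"$GS"` can then be used in place of `gs` in the command line given earlier in the thread. In the plain Windows command prompt the `$QUALITY` variable syntax will not work, so the preset would need to be written directly, for example `-dPDFSETTINGS=/ebook`.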