pdftotext

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Jaleks (talk | contribs) at 21:54, 27 September 2016 (fixed link formatting). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

pdftotext is an open-source command-line utility for converting PDF files to plain text files, i.e. extracting text data from PDF-encapsulated files. It is freely available and included by default with many Linux distributions, and is also available for Windows as part of the Xpdf Windows port. Such text extraction is complicated because PDF files are internally built on page drawing primitives, so the boundaries between words and paragraphs must often be inferred from their position on the page.

$ pdftotext file.pdf

This usage produces a text file with the same base name as the input file and the suffix .txt. Wildcards (*), for example $ pdftotext *pdf, cannot be used to convert multiple files, because pdftotext expects only one input file name per invocation. However, a Bash script can loop over the PDFs and convert each one to text. The script should be run in the directory where the PDFs are located.
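The loop described above could be sketched roughly as follows. This is a minimal illustration, not an official script: the function name convert_all and the optional directory argument are assumptions made for this example, and it relies only on the default behaviour noted above (one input file per call, output written next to the input with a .txt suffix).

```shell
#!/usr/bin/env bash
# Hypothetical helper (name chosen for this sketch): run pdftotext once per
# PDF found in the given directory, defaulting to the current directory.
convert_all() {
    local dir="${1:-.}"
    local pdf
    for pdf in "$dir"/*.pdf; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        # One input file per invocation; output is <basename>.txt by default.
        pdftotext "$pdf" || echo "conversion failed: $pdf" >&2
    done
}

# Convert the PDFs in the directory given as the first argument, or in the
# current directory when no argument is supplied.
convert_all "${1:-.}"
```

Saved under a name of one's choosing, made executable with chmod +x, and run from the directory holding the PDFs, a script like this would leave one .txt file beside each PDF. Quoting "$pdf" keeps file names containing spaces intact.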

pdftotext is part of the Xpdf software suite. Poppler, which is derived from Xpdf, also includes an implementation of pdftotext. On most Linux distributions, pdftotext is included as part of the poppler-utils package.[1]

References

  1. "poppler-utils".