
Hyphenation algorithm: Difference between revisions

From Wikipedia, the free encyclopedia
Updated links from CPAN to MetaCPAN
Line 39: Line 39:
*{{cite web | title= Hyphenation in Python, using Frank Liang's algorithm | url=http://www.nedbatchelder.com/code/modules/hyphenate.py | accessdate=July 10, 2007 }}
*{{cite web | title= Hyphenator.js-Hyphenation in JavaScript, using Frank Liang's algorithm | url=http://code.google.com/p/hyphenator/ | accessdate=January 3, 2008 }}
*{{cite web | title= TeX::Hyphen - Perl implementation of TeX82 Hyphenation rules | url=http://search.cpan.org/dist/TeX-Hyphen/ }}
*{{cite web | title= TeX::Hyphen - Perl implementation of TeX82 Hyphenation rules | url=https://metacpan.org/module/TeX::Hyphen }}
*{{cite web | title= phpSyllable - PHP implementation of Frank Liang's algorithm | url=https://github.com/vanderlee/phpSyllable/ }}



Revision as of 21:50, 16 September 2013

A hyphenation algorithm is a set of rules (especially one codified for implementation in a computer program) that decides at which points a word can be broken over two lines with a hyphen. For example, a hyphenation algorithm might decide that impeachment can be broken as impeach-ment or im-peachment, but not, say, as impe-achment.

One reason the rules of word-breaking are complex is that different 'dialects' of English tend to differ on them: American English tends to break words by sound, while British English tends to look first to the origins of the word and then to sound. There are also a large number of exceptions, which further complicate matters.

Some rules of thumb can be found in the reference "On Hyphenation – Anarchy of Pedantry". Among algorithmic approaches to hyphenation, the one implemented in the TeX typesetting system is widely used. It is thoroughly documented in the first two volumes of Computers and Typesetting and in Frank Liang's dissertation.[1] Contrary to the belief that TeX relies on a large dictionary of exceptions, the point of Liang's work was to get the algorithm as accurate as he practically could and keep any exception dictionary small. In TeX's original hyphenation patterns for US English, the exception list contains fourteen words.[2]

Hyphenation in TeX

Ports of the TeX hyphenation algorithm are available as libraries for several programming languages, including Perl, Ruby, Haskell, Python, and PostScript. TeX itself can be made to show hyphens in the log with the \showhyphens command. Note, however, that TeX does not set out to find all hyphenation points of a word and is therefore unsuitable for applications such as associating lyrics with musical notes.
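Liang's method rates possible break points with patterns in which digits sit between letters: an odd value permits a break, an even value forbids one, and the highest value at each position wins. The following Python sketch illustrates the idea under stated assumptions. The pattern list is a tiny illustrative set, just enough to handle the word "hyphenation"; the exception dictionary is hypothetical; and the names parse_pattern, hyphenate, left_min and right_min are invented for this example. It is not how TeX or any particular port stores or searches its patterns.

```python
import re

# Illustrative patterns only (assumed for this sketch): just enough to
# hyphenate the word "hyphenation"; real pattern files hold thousands.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

# Hypothetical exception dictionary, written with "-" at allowed breaks.
EXCEPTIONS = {"ergonomic": "er-go-no-mic"}


def parse_pattern(pattern):
    """Split e.g. 'hy3ph' into ('hyph', [0, 0, 3, 0, 0])."""
    letters = re.sub(r"\d", "", pattern)
    values = [0] * (len(letters) + 1)
    position = 0
    for char in pattern:
        if char.isdigit():
            values[position] = int(char)  # digit rates the gap before the next letter
        else:
            position += 1
    return letters, values


TABLE = dict(parse_pattern(p) for p in PATTERNS)


def hyphenate(word, left_min=2, right_min=3):
    """Return the pieces of `word` between permitted break points."""
    word = word.lower()
    if word in EXCEPTIONS:  # exceptions override the patterns
        return EXCEPTIONS[word].split("-")
    padded = "." + word + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)  # scores[k] rates the gap before padded[k]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            values = TABLE.get(padded[start:end])
            if values is not None:
                for offset, value in enumerate(values):
                    scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    # Only odd scores permit a break; keep minimum fragment lengths at both ends.
    for i in range(left_min, len(word) - right_min + 1):
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return pieces


print(hyphenate("hyphenation"))  # ['hy', 'phen', 'ation']
print(hyphenate("ergonomic"))    # ['er', 'go', 'no', 'mic'] (from the exception entry)
```

With so few patterns, most other words come back unbroken. The minimum fragment lengths of two letters before a break and three after it are assumed defaults for this sketch.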

In LaTeX, users can add hyphenation corrections with:

\hyphenation{words}

The \hyphenation command declares allowed hyphenation points, where words is a list of words, separated by spaces, in which each hyphenation point is indicated by a - character. For example,

\hyphenation{fortran er-go-no-mic}

declares that in the current job "fortran" should not be hyphenated and that "ergonomic", if it must be hyphenated, should be broken only at the indicated points.[3]
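The argument format described above is simple enough to read mechanically. As a minimal sketch, the following Python function turns such a space-separated word list into a table of allowed break offsets. The function name parse_hyphenation_list and the offset representation are assumptions made for this illustration, not part of TeX or LaTeX.

```python
def parse_hyphenation_list(words):
    """Map each word to the letter offsets at which a break is allowed."""
    table = {}
    for entry in words.split():
        parts = entry.lower().split("-")
        offsets, position = [], 0
        for part in parts[:-1]:  # a break follows every part except the last
            position += len(part)
            offsets.append(position)
        table["".join(parts)] = offsets
    return table


exceptions = parse_hyphenation_list("fortran er-go-no-mic")
print(exceptions)  # {'fortran': [], 'ergonomic': [2, 4, 6]}
```

An empty list, as for "fortran", records that the word must not be hyphenated at all.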

However, there are several severe limitations. First, only one \hyphenation command can be given in a (La)TeX file, so one cannot, for example, use a style file that predefines \hyphenation data and then add further exceptions in each individual LaTeX document. Worse, the \hyphenation command accepts only ASCII letters by default, so it cannot be used to correct hyphenation for words with non-ASCII characters (especially ä, ö, ü, é, è, à, ç, ...), which are very common in almost all languages except English. Workarounds might exist, though.[4]

References

  1. ^ Liang, Franklin Mark. "Word Hy-phen-a-tion by Com-pu-ter". PhD dissertation, Stanford University Department of Computer Science. Report number STAN-CS-83-977, August 1983.
  2. ^ "The Plain TeX hyphenation tables". Retrieved June 23, 2009.
  3. ^ \hyphenation on Hypertext Help with LaTeX
  4. ^ TeX FAQ: "Accented words aren’t hyphenated"; ibid., "How does hyphenation work in TeX?" and references therein