Line wrap and word wrap: Difference between revisions

Content deleted Content added

Inline

Revision as of 06:35, 6 March 2007

Word wrap refers to a feature supported by most text editors that allows them to insert soft returns (or hard returns for some text editors) at the right-side margins of a document. As the full view of the text would show excessively long text strings, word wrapping confines text to the viewable window, allowing text to be edited or read from top to bottom without any left-to-right scrolling.

In word processors, word wrapping usually does not cause actual soft returns to be inserted into the actual document, but while editing the viewed text will be displayed as if soft returns had been inserted. If the user changes the margins, the editor will either automatically reposition the soft returns to ensure that all the text will "flow" within the margins and remain visible, or provide the typist some convenient way to reposition the soft returns.

Word boundaries, hyphenation, and hard spaces

The soft returns are usually placed after the end of complete words, or after the punctuation that follows complete words. However, word wrap may also occur following a hyphen.

Word wrap following hyphens is sometimes not desired, and can be avoided by using a so-called non-breaking hyphen instead of a regular hyphen. On the other hand, when using word processors, invisible hyphens, called soft hyphens, can also be inserted inside words so that word wrap can occur following the soft hyphens.

Sometimes, word wrap is not desirable between words. In such cases, word wrap can usually be avoided by using a hard space or non-breaking space between the words, instead of regular spaces.

Word wrapping in text containing Chinese, Japanese, and Korean

In Chinese, Japanese, and Korean, each Han character is normally considered a word, and therefore word wrapping can usually occur before and after any Han character.

Under certain circumstances, however, word wrapping is not desired. For instance,

word wrapping might not be desired within personal names, and
word wrapping might not be desired within any compound words (when the text is flush left but only in some styles).

Most existing word processors and typesetting software cannot handle either of the above scenarios.

CJK punctuation may or may not follow rules similar to the above-mentioned special circumstances; such rules are usually referred to by the Japanese term kinsoku shori (禁則処理, literally “prohibition rule handling”).

A special case of kinsoku shori, however, always applies: line wrap must never occur inside the CJK dash and ellipsis. Even though each of these punctuation marks must be represented by two characters due to a limitation of all existing character encodings, each of these are intrinsically a single punctuation mark that is two ems wide, not two one-em-wide punctuation marks.

Algorithm

greedy algorithm

The naive way to solve the problem is to use a greedy algorithm that puts as many words on a line as possible, then moving on to the next line to do the same until there are no more words left to place. This method is used by many modern word processors, such as Microsoft Word and Open Office. The following pseudocode implements this algorithm:

SpaceLeft := LineWidth
for each Word in Text
    if Width(Word) > SpaceLeft
        insert line break before Word in Text
        SpaceLeft := LineWidth - Width(Word)
    else
        SpaceLeft := SpaceLeft - (Width(Word) + SpaceWidth)

Where LineWidth is the width of a line, SpaceLeft is the remaining width of space on the line to fill, SpaceWidth is the width of a single space character, Text is the input text to iterate over and Word is a word in this text.

optimal solution

TeX uses a different "breaking algorithm" that considers the entire paragraph as a whole, breaking it into lines in a way that is often considered "more esthetically pleasing" than the greedy algorithm. (TeX also uses a hyphenation algorithm to break words across lines).

While the greedy algorithm is often adequate, it doesn't give the optimal solution if you want the remaining space on the end of each line to be as small as possible. Consider the following text:

aaa bb cc ddddd

If the cost function of a line is defined by the remaining space squared, the greedy algorithm would yield a sub-optimal solution for the problem (for simplicity, consider a fixed-width font):

------    Line width: 6
aaa bb    Remaining space: 0 (cost = 0 squared = 0)
cc        Remaining space: 4 (cost = 4 squared = 16)
ddddd     Remaining space: 1 (cost = 1 squared = 1)

Summing to a total cost of 17, while the optimal solution would look like this:

------    Line width: 6
aaa       Remaining space: 3 (cost = 3 squared = 9)
bb cc     Remaining space: 1 (cost = 1 squared = 1)
ddddd     Remaining space: 1 (cost = 1 squared = 1)

The difference here is that the first line is broken before bb instead of after it, yielding a better right margin and a lower cost 11.

To solve the problem we need to define a cost function $c(i,j)$ that computes the cost of a line consisting of the words $Word[i]$ to $Word[j]$ from the text:

$c(i,j)=(LineWidth-SpaceWidth(j-i)-\sum _{k=i}^{j}Width(Word[k]))^{P}$

Where $P$ typically is $2$ or $3$ . There are some special cases to consider: If the result is negative (that is, the sequence of words cannot fit on a line) it needs to return $\infty$ .

The cost of the optimal solution can be defined as a recurrence:

$f(j)={\begin{cases}c(1,j),&{\mbox{if }}c(1,j)<\infty \\{\begin{matrix}\min \end{matrix}}_{1\leq k<j}\{f(k)+c(k+1,j)\},&{\mbox{if }}c(1,j)=\infty \end{cases}}$

Which can be greatly optimized using dynamic programming. In terms of implementation, it seems that the computation of c(i, j) is unnecessary when c(i, k) < 0 (where k < j) , it will be infinite anyway.

External links

Knuth's algorithm:

"Knuth & Plass line-breaking Revisited"
Knuth & Plass linebreaking revisited
"tex_wrap": "Implements TeX's algorithm for breaking paragraphs into lines." Reference: "Breaking Paragraphs into Lines", D.E. Knuth and M.F. Plass, chapter 3 of _Digital Typography_, CSLI Lecture Notes #78.
Text::Reflow - Perl module for reflowing text files using Knuth's paragraphing algorithm. "The reflow algorithm tries to keep the lines the same length but also tries to break at punctuation, and avoid breaking within a proper name or after certain connectives ("a", "the", etc.). The result is a file with a more "ragged" right margin than is produced by fmt or Text::Wrap but it is easier to read since fewer phrases are broken across line breaks."
Wrap.py : Break paragraphs into lines, attempting to avoid short lines. "We use the dynamic programming idea of Knuth-Plass to find the optimal set of breaks according to a penalty function that penalizes short lines quadratically"
adjusting the Knuth algorithm to recognize the "soft hyphen".
Knuth's breaking algorithm. "The detailed description of the model and the algorithm can be found on the paper "Breaking Paragraphs into Lines" by Donald E. Knuth, published in the book "Digital Typography" (Stanford, California: Center for the Study of Language and Information, 1999), (CSLI Lecture Notes, no. 78.)" ; part of Google Summer Of Code 2006
"Bridging the Algorithm Gap: A Linear-time Functional Program for Paragraph Formatting" by Oege de Moor, Jeremy Gibbons, 1999

@@ Line 24: / Line 24: @@
 ==Algorithm==
+=== greedy algorithm ===
 The naive way to solve the problem is to use a [[greedy algorithm]] that puts as many words on a line as possible, then moving on to the next line to do the same until there are no more words left to place. This method is used by many modern word processors, such as [[Microsoft Word]] and [[Open Office]]. The following pseudocode implements this algorithm:
@@ Line 36: / Line 37: @@
 Where <code>LineWidth</code> is the width of a line, <code>SpaceLeft</code> is the remaining width of space on the line to fill, <code>SpaceWidth</code> is the width of a single space character, <code>Text</code> is the input text to iterate over and <code>Word</code> is a word in this text.
+=== optimal solution ===
-While this solution might be satisfactory for some purposes, it doesn't give the optimal solution if you want the remaining space on the end of each line to be as small as possible. Consider the following text:
+[[TeX]] uses a different "breaking algorithm" that considers the entire paragraph as a whole,
+breaking it into lines in a way that is often considered "more esthetically pleasing" than the greedy algorithm. (TeX also uses a [[hyphenation algorithm]] to break words across lines).
+While the greedy algorithm is often adequate, it doesn't give the optimal solution if you want the remaining space on the end of each line to be as small as possible. Consider the following text:
  aaa bb cc ddddd
@@ Line 71: / Line 77: @@
 Which can be greatly optimized using [[dynamic programming]]. In terms of implementation, it seems that the computation of c(i, j) is unnecessary when c(i, k) < 0 (where k < j)
 , it will be infinite anyway.
+== External links ==
+Knuth's algorithm:
+* [http://defoe.sourceforge.net/folio/knuth-plass.html "Knuth & Plass line-breaking Revisited"]
+* [http://pbw.livejournal.com/ Knuth & Plass linebreaking revisited]
+* [http://oedipus.sourceforge.net/texlib/ "tex_wrap": "Implements TeX's algorithm for breaking paragraphs into lines."] Reference: "Breaking Paragraphs into Lines", D.E. Knuth and M.F. Plass, chapter 3 of _Digital Typography_, CSLI Lecture Notes #78.
+* [http://search.cpan.org/~mward/Text-Reflow-1.05/Reflow.pm Text::Reflow - Perl module for reflowing text files using Knuth's paragraphing algorithm.] "The reflow algorithm tries to keep the lines the same length but also tries to break at punctuation, and avoid breaking within a proper name or after certain connectives ("a", "the", etc.). The result is a file with a more "ragged" right margin than is produced by fmt or Text::Wrap but it is easier to read since fewer phrases are broken across line breaks."
+* [http://www.ics.uci.edu/~eppstein/PADS/Wrap.py Wrap.py : Break paragraphs into lines, attempting to avoid short lines.] "We use the dynamic programming idea of Knuth-Plass to find the optimal set of breaks according to a penalty function that penalizes short lines quadratically"
+* [http://www.nabble.com/Initial-soft-hyphen-support-t2970713.html adjusting the Knuth algorithm] to recognize the [[Hyphen#Hyphens_in_computing | "soft hyphen"]].
+* [http://wiki.apache.org/xmlgraphics-fop/KnuthsModel Knuth's breaking algorithm.] "The detailed description of the model and the algorithm can be found on the paper "Breaking Paragraphs into Lines" by Donald E. Knuth, published in the book "Digital Typography" (Stanford, California: Center for the Study of Language and Information, 1999), (CSLI Lecture Notes, no. 78.)" ; part of [http://wiki.apache.org/xmlgraphics-fop/GoogleSummerOfCode2006/FloatsImplementationProgress Google Summer Of Code 2006]
+* [http://citeseer.ist.psu.edu/23630.html "Bridging the Algorithm Gap: A Linear-time Functional Program for Paragraph Formatting"] by Oege de Moor, Jeremy Gibbons, 1999
+Other word-wrap links:
+* [http://www.efg2.com/Lab/Library/Graphics/CircleWordWrap.htm circular word wrap]
+* [http://www.codecomments.com/message230162.html the reverse problem -- picking columns just wide enough to fit (wrapped) text]
+* [http://api.kde.org/3.1-api/kdeui/html/classKWordWrap.html KWordWrap Class Reference] used in the KDE GUI
+* [http://www.leverkruid.eu/GKPLinebreaking/elements.html "Knuth linebreaking elements for Formatting Objects"] by Simon Pepping 2006. Extends the Knuth model to handle a few enhancements.
+* [http://wiki.apache.org/xmlgraphics-fop/PageLayout/ "Page breaking strategies"] Extends the Knuth model to handle a few enhancements.
+* [http://www.techwr-l.com/techwhirl/archives/0504/techwhirl-0504-00203.html "a Knuth-Plass-like linebreaking algorithm] ... The *really* interesting thing is how Adobe's algorithm differs from the Knuth-Plass algorithm. It must differ, since Adobe has managed to patent
+its algorithm (6,510,441)."[http://www.techwr-l.com/techwhirl/archives/0504/techwhirl-0504-00206.html ]
+* [http://blogs.msdn.com/murrays/archive/2006/11/15/lineservices.aspx "Murray Sargent: Math in Office"]
 [[Category:Text editor features]]