Sentence boundary disambiguation


Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences. However, identifying sentence boundaries is challenging because punctuation marks are often ambiguous. For example, a period may mark an abbreviation, a decimal point, an ellipsis, or part of an email address rather than the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations.[1] Question marks and exclamation marks may likewise appear in embedded quotations, emoticons, computer code, and slang.

Languages like Japanese and Chinese have unambiguous sentence-ending markers.

Strategies

The standard 'vanilla' approach to locating the end of a sentence applies three rules in order, with each later rule overriding the earlier ones:

(a) If it's a period, it ends a sentence.
(b) If the preceding token is in the hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.

This strategy gets about 95% of sentences correct.[2]
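The three rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abbreviation list is a tiny hypothetical stand-in for a hand-compiled one, tokens are produced by splitting on whitespace, and the rules are read as successive overrides, so (c) can undo (b).

```python
# Hypothetical abbreviation list; a real hand-compiled list would be far longer.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "etc", "e.g", "i.e"}


def is_sentence_end(tokens, i):
    """Apply rules (a), (b), (c) in order to the token at index i."""
    token = tokens[i]
    if not token.endswith("."):
        return False
    ends = True                                   # (a) a period ends a sentence
    if token[:-1].lower() in ABBREVIATIONS:
        ends = False                              # (b) unless it follows an abbreviation
    if i + 1 < len(tokens) and tokens[i + 1][:1].isupper():
        ends = True                               # (c) a capitalized next token ends it
    return ends


def vanilla_split(text):
    """Split text into sentences using whitespace tokens and the rules above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if is_sentence_end(tokens, i):
            sentences.append(" ".join(current))
            current = []
    if current:                                   # keep any trailing fragment
        sentences.append(" ".join(current))
    return sentences


print(vanilla_split("We bought apples, pears, etc. for the trip. Dr. Smith came along."))
# ['We bought apples, pears, etc. for the trip.', 'Dr.', 'Smith came along.']
```

Under this reading, "etc." is handled correctly, but "Dr." is split from "Smith" because rule (c) fires on the capitalized name. Errors of this kind are one reason a rule set like this falls short of perfect accuracy.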

Another approach is to learn a set of rules automatically from a set of documents in which the sentence breaks are already marked. Some solutions are based on a maximum entropy model.[3] The SATZ architecture uses a neural network to disambiguate sentence boundaries and reports 98.5% accuracy.
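As a rough illustration of how pre-marked documents become training data, the sketch below turns a list of already-segmented sentences into (features, label) pairs, one per candidate punctuation mark. The feature names are hypothetical choices made for this example; they are not the features used by the maximum entropy work cited above or by SATZ, and no classifier is trained here.

```python
def boundary_features(tokens, i):
    """Describe the candidate boundary after tokens[i] (a token ending in . ? or !)."""
    word = tokens[i].rstrip(".?!")
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "prev_word": word.lower(),
        "prev_is_short": len(word) <= 3,
        "prev_has_internal_period": "." in word,
        "next_is_capitalized": nxt[:1].isupper(),
        "is_last_token": nxt == "",
    }


def training_examples(sentences):
    """Build (features, is_boundary) pairs from sentences whose breaks are pre-marked."""
    tokens, ends = [], set()
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue                      # skip empty entries
        tokens.extend(words)
        ends.add(len(tokens) - 1)         # index of each sentence's final token
    examples = []
    for i, token in enumerate(tokens):
        if token[-1] in ".?!":            # every such token is a candidate boundary
            examples.append((boundary_features(tokens, i), i in ends))
    return examples


for features, label in training_examples(["Dr. Smith arrived.", "He was late."]):
    print(label, features["prev_word"])
# False dr
# True arrived
# True late
```

Pairs like these could then be passed to any supervised classifier, which would learn how much weight each feature deserves instead of relying on hand-written rules.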

Software

Perl Compatible Regular Expressions (PCRE)

  • ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
  • $sentences = preg_split('/(?<!\..)([?!.]+)\s(?!.\.)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);  // PHP
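To see how the first expression behaves, here is a small sketch that ports it to Python's re module. The port is an adaptation, not the original: the capturing groups are made non-capturing so that re.split returns only the sentences, and the \r\n alternative is tried before \s. The function name and sample text are made up for this example.

```python
import re

# Split at whitespace that follows [a-z0-9] plus . ? or ! (optionally with a
# closing double quote) and precedes an optional opening quote and a capital.
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))(?:\r\n|\s)(?="?[A-Z])'
)


def regex_split(text):
    """Return the pieces of text between matched sentence boundaries."""
    return BOUNDARY.split(text)


sample = 'He paid 3.15 dollars today. "It was cheap," she said. Mr. Smith agreed.'
print(regex_split(sample))
# ['He paid 3.15 dollars today.', '"It was cheap," she said.', 'Mr.', 'Smith agreed.']
```

The decimal point in "3.15" is left alone because no whitespace follows it, but "Mr." is still treated as a sentence end. A purely regular-expression approach has no abbreviation list to consult.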

Online use, libraries, and APIs

Toolkits that include sentence detection

See also

References

  1. E. Stamatatos, N. Fakotakis, and G. Kokkinakis. "Automatic Extraction of Rules for Sentence Boundary Disambiguation". University of Patras. Retrieved 2009-01-03.
  2. "Doing Things with Words, Part Two: Sentence Boundary Detection". Retrieved 2009-01-03.
  3. "A Maximum Entropy Approach to Identifying Sentence Boundaries". Retrieved 2009-01-03.

External links