Statistically Improbable Phrases

From Wikipedia, the free encyclopedia

Statistically Improbable Phrases (SIPs), also called Statimprophrases, are produced by a system developed by Amazon.com that compares all of the books indexed in its Search Inside! program and finds the phrases in each book that are least likely to appear in any other indexed book.[1] The system is used to identify the most nearly unique portions of a book for use as a summary or as keywords.


The Statistically Improbable Phrases of Darwin's On the Origin of Species are: temperate productions, genera descended, transitional gradations, unknown progenitor, fossiliferous formations, our domestic breeds, modified offspring, doubtful forms, closely allied forms, profitable variations, enormously remote, transitional grades, very distinct species and mongrel offspring.[2]
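
The comparison described above can be illustrated with a small sketch. The article does not state how the phrases are actually scored, so everything here is an assumption made for illustration: phrases are taken to be two-word sequences, the score is a smoothed ratio of a phrase's relative frequency in one book to its relative frequency in all other books, and the names `bigrams`, `improbable_phrases`, `top_n` and `min_count` are hypothetical.

```python
from collections import Counter
import re


def bigrams(text):
    """Return the adjacent lowercase word pairs found in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(books, title, top_n=5, min_count=2):
    """Rank the bigrams of one book by how over-represented they are
    compared with every other book in the collection.

    books: dict mapping a title to its full text (hypothetical input format).
    This is an illustrative scoring rule, not the method used by the real system.
    """
    target = Counter(bigrams(books[title]))
    rest = Counter()
    for other, text in books.items():
        if other != title:
            rest.update(bigrams(text))

    target_total = sum(target.values())
    rest_total = sum(rest.values())
    vocab = len(set(target) | set(rest))

    scores = {}
    for phrase, count in target.items():
        if count < min_count:
            continue  # ignore phrases too rare in the book to be characteristic
        # Add-one smoothing keeps the ratio finite for phrases absent elsewhere.
        p_target = (count + 1) / (target_total + vocab)
        p_rest = (rest[phrase] + 1) / (rest_total + vocab)
        scores[phrase] = p_target / p_rest

    # Highest ratio first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return [" ".join(p) for p in ranked[:top_n]]
```

Calling `improbable_phrases(books, "Some Title")` on a dictionary of texts returns the phrases that recur in that book but are rare or absent in the others. The `min_count` threshold is a design choice that filters out one-off word pairs, which would otherwise dominate the ranking simply because they appear nowhere else.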

See also

  • Googlewhack — a pair of words occurring on a single webpage, as indexed by Google
  • tf-idf — a statistic used in information retrieval and text mining.