Statistically Improbable Phrases
Statistically Improbable Phrases (SIPs), also called Statimprophrases, are produced by a system developed by Amazon.com that compares all of the books indexed in its Search Inside! program and finds, for each book, the phrases least likely to appear in any other indexed book. The most nearly unique phrases found this way serve as a summary of the book or as keywords for it.
The Statistically Improbable Phrases of Darwin's On the Origin of Species are: temperate productions, genera descended, transitional gradations, unknown progenitor, fossiliferous formations, our domestic breeds, modified offspring, doubtful forms, closely allied forms, profitable variations, enormously remote, transitional grades, very distinct species and mongrel offspring.
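Amazon has not published the details of its method, so the following is only a minimal sketch of the general idea, not the actual system. It assumes that phrases are limited to adjacent word pairs and that "improbable" means a phrase occurs at a much higher rate in one book than in all the other books combined. The function names `bigrams` and `improbable_phrases`, the `top_n` parameter, and the add-one smoothing are illustrative assumptions.

```python
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text and return its adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def improbable_phrases(book, other_books, top_n=5):
    """Rank a book's two-word phrases by how much more often they occur
    in it than in the other books (hypothetical scoring, not Amazon's)."""
    book_counts = Counter(bigrams(book))
    corpus_counts = Counter()
    for other in other_books:
        corpus_counts.update(bigrams(other))

    book_total = sum(book_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for phrase, count in book_counts.items():
        # Rate of the phrase within this book.
        book_rate = count / book_total
        # Add-one smoothed rate in the other books, so phrases that
        # never appear elsewhere do not cause a division by zero.
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + 1)
        scores[phrase] = book_rate / corpus_rate

    ranked = sorted(scores, key=scores.get, reverse=True)
    return [" ".join(phrase) for phrase in ranked[:top_n]]
```

Calling `improbable_phrases(book_text, list_of_other_texts)` returns the highest-scoring phrases as strings. A phrase that is frequent in one book but also common elsewhere scores low, which is the same intuition behind tf-idf weighting.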
- Googlewhack — a pair of words occurring on a single webpage, as indexed by Google
- tf-idf — a statistic used in information retrieval and text mining.