Statistically improbable phrase
A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus. Amazon.com uses this concept to determine keywords for a given book or chapter, since the keywords of a book or chapter are likely to appear disproportionately often within it. In his book Dataclysm, Christian Rudder applied the same concept to data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender. SIPs of two or three words, such as adjective-adjective-noun or adverb-adverb-verb combinations, often signal the author's attitude, premise or conclusions to the reader, or express an important idea.
In a document about computers, the most common word is likely to be "the". But because "the" is the most commonly used word in the English language, almost any document will use it very frequently, so its frequency says little about the document at hand. A phrase like "explicit Boolean algorithm", by contrast, might occur in the document at a much higher rate than its average rate in the English language. It is a phrase unlikely to occur in any given document, yet it did occur in this one. "Explicit Boolean algorithm" would therefore be a statistically improbable phrase.
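The comparison described above can be illustrated with a minimal sketch. It is not the method used by Amazon.com or any other service; it simply assumes that a phrase's "improbability" can be scored as the ratio of its relative frequency in the document to its relative frequency in a background corpus. The function names (ngrams, improbable_phrases), the tokenization rule and the add-one smoothing for phrases absent from the corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the lowercase word n-grams of `text` as tuples."""
    words = re.findall(r"[a-z']+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(document, corpus, n=3, top=5):
    """Rank n-grams by how much more often they occur in `document`
    than in the background `corpus`.

    score = (rate in document) / (smoothed rate in corpus)
    Add-one smoothing is an illustrative assumption so that phrases
    never seen in the corpus do not cause a division by zero.
    """
    doc_counts = Counter(ngrams(document, n))
    corpus_counts = Counter(ngrams(corpus, n))
    doc_total = sum(doc_counts.values())
    if doc_total == 0:
        return []
    corpus_total = sum(corpus_counts.values())
    vocab_size = len(set(doc_counts) | set(corpus_counts))

    scores = {}
    for phrase, count in doc_counts.items():
        doc_rate = count / doc_total
        corpus_rate = (corpus_counts[phrase] + 1) / (corpus_total + vocab_size)
        scores[phrase] = doc_rate / corpus_rate

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(" ".join(phrase), score) for phrase, score in ranked[:top]]


# Toy data, invented for this example.
document = ("explicit boolean algorithm design uses the explicit "
            "boolean algorithm on the machine")
corpus = "the cat sat on the mat and the dog sat on the log of the machine " * 3

for phrase, score in improbable_phrases(document, corpus, n=3, top=3):
    print(f"{phrase}: {score:.1f}")
```

With this toy data, "explicit boolean algorithm" ranks first among three-word phrases. Calling the function with n=1 gives "the" a much lower score than "explicit", because "the" is also frequent in the background corpus. A real system would need a far larger corpus and more careful tokenization and smoothing.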
Statistically improbable phrases of Darwin's On the Origin of Species could be: temperate productions, genera descended, transitional gradations, unknown progenitor, fossiliferous formations, our domestic breeds, modified offspring, doubtful forms, closely allied forms, profitable variations, enormously remote, transitional grades, very distinct species and mongrel offspring.
More examples of SIPs are circuitous pandemic, prevailing dominant basal paradigm, irrefutable scientific proof, cell signalling process, nuclear DNA transduction, instructive cooperative interaction, septarian concretions, dying economic system, Western block collapse, the free enterprise system, subsequent enabling legislation, totalitarian political ideology, pernicious financial racket, bottomless degeneracy, widespread censorship, moral high ground, shifty-eyed demagogue, arrogant entitled narcissist, scheming sociopath, predatory psychopath, personal moral failings, disinterested curiosity, countless disasters and unthinkable suffering, innumerable cosmoi, zero-based budgeting, gut-wrenchingly gruesome gangsters, and antibody dependent enhancement of pathogenic priming.
- Collocation – Any series of words that co-occur more often than would be expected by chance
- Googlewhack – A pair of words occurring on a single webpage, as indexed by Google
- tf-idf – A statistic used in information retrieval and text mining
- "SIPping Wikipedia" (PDF). Courses.cms.caltech.edu. Retrieved 2017-01-01.
- Jonathan Bailey (3 July 2012). "How Long Should a Statistically Improbably Phrase Be?". Plagiarism Today.
- Errami, Mounir; Sun, Zhaohui; George, Angela C.; Long, Tara C.; Skinner, Michael A.; Wren, Jonathan D.; Garner, Harold R. (1 June 2010). "Identifying duplicate content using statistically improbable phrases". Bioinformatics. 26 (11): 1453–1457. doi:10.1093/bioinformatics/btq146. PMC 2872002. PMID 20472545 – via bioinformatics.oxfordjournals.org.
- "What are Statistically Improbable Phrases?". Amazon.com. Retrieved 2007-12-18.
- Weeks, Linton (August 30, 2005). "Amazon's Vital Statistics Show How Books Stack Up". The Washington Post. Retrieved September 8, 2015.
- Rudder, Christian (2014). Dataclysm: Who We Are When We Think No One's Looking. New York: Crown Publishers. ISBN 978-0-385-34737-2.
- "Sociologically Improbable Phrases". Crooked Timber. April 2005.
- Dr. Simone Gold. March 2021. https://vimeo.com/513597654