Stylometry

Stylometry is the study of linguistic style, usually in written language, although it has also been applied successfully to music^[1] and to fine-art paintings.^[2]

Stylometry is often used to attribute authorship to anonymous or disputed documents. It has legal as well as academic and literary applications, ranging from the question of the authorship of Shakespeare's works to forensic linguistics.

History

Stylometry grew out of earlier techniques of analyzing texts for evidence of authenticity, authorial identity, and other questions. An early example is Lorenzo Valla's 1439 proof that the Donation of Constantine was a forgery, an argument based partly on a comparison of its Latin with that used in authentic 4th-century documents.

The modern practice of the discipline received major impetus from the study of authorship problems in English Renaissance drama. Researchers and readers observed that some playwrights of the era had distinctive patterns of language preferences, and attempted to use those patterns to identify authors in uncertain or collaborative works. Early efforts were not always successful: in 1901, one researcher attempted to use John Fletcher's preference for "'em", the contracted form of "them", as a marker to distinguish between Fletcher and Philip Massinger in their collaborations, but he mistakenly employed an edition of Massinger's works in which the editor had expanded all instances of "'em" to "them".^[3]

The basics of stylometry were set out by Polish philosopher Wincenty Lutosławski in Principes de stylométrie (1890). Lutosławski used this method to build a chronology of Plato's Dialogues.

The development of computers and their capacity for analyzing large quantities of data enhanced this type of effort by orders of magnitude. That capacity, however, did not guarantee quality output. In the early 1960s, Rev. A. Q. Morton produced a computer analysis of the fourteen Epistles of the New Testament attributed to St. Paul, which appeared to show that six different authors had written that body of work. A check of his method, applied to the works of James Joyce, gave the result that Ulysses was written by five separate individuals, none of whom had any part in A Portrait of the Artist as a Young Man.^[4]

In time, and with practice, researchers and scholars have refined their approaches and methods to yield better results. One notable early success was the resolution of disputed authorship in twelve of The Federalist Papers by Frederick Mosteller and David Wallace.^[5] While questions of initial assumptions and methodology still arise (and perhaps always will), few now dispute the basic premise that linguistic analysis of written texts can produce valuable information and insight. (Indeed, this was apparent even before the advent of computers: the successful application of a textual/linguistic approach to the Fletcher canon by Cyrus Hoy and others yielded clear results in the late 1950s and early 1960s.) An example of a modern study is the analysis of Ronald Reagan's radio commentaries of uncertain authorship.^[6] The stylometric analysis of the controversial, pseudonymously authored book Primary Colors, performed by Vassar professor Donald Foster^[7] in 1996, brought the field to the attention of a wider audience after correctly identifying the author as Joe Klein (the attribution was later confirmed by handwriting analysis; see Primary Colors).

In April 2015, researchers using stylometric techniques identified the play Double Falsehood as the work of William Shakespeare.^[8] They analyzed 54 plays by Shakespeare and John Fletcher, comparing average sentence length, studying the use of unusual words, and quantifying the complexity and psychological valence of the language.

Methods

Modern stylometry draws heavily on the aid of computers for statistical analysis, artificial intelligence and access to the growing corpus of texts available via the Internet. Software systems such as Signature^[9] (freeware produced by Dr Peter Millican of Oxford University), JGAAP^[10] (the Java Graphical Authorship Attribution Program—freeware produced by Dr Patrick Juola of Duquesne University), stylo^[11] (an open-source R package for a variety of stylometric analyses, including authorship attribution) and Stylene^[12] for Dutch (online freeware by Prof Walter Daelemans of University of Antwerp and Dr Véronique Hoste of University of Ghent) make its use increasingly practicable, even for the non-expert.

Whereas in the past, stylometry emphasized the rarest or most striking elements of a text, contemporary techniques can isolate identifying patterns even in common parts of speech.

Writer invariant

The primary stylometric method is the writer invariant: a property held in common by all texts written by a given author, or at least by all texts long enough for analysis to yield statistically significant results. An example of a writer invariant is the frequency of function words used by the writer.

In one such method, the text is analyzed to find its 50 most common words. The text is then broken into 5,000-word chunks, and each chunk is analyzed to find the frequency of those 50 words in it. This yields a 50-number profile for each chunk, which places the chunk at a point in a 50-dimensional space. This space is then flattened into a plane using principal component analysis (PCA), producing a display of points that corresponds to an author's style. If two literary works are plotted on the same plane, the resulting pattern may indicate whether both works were written by the same author or by different authors.
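The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any particular study: the function names (chunk_profiles, project_to_plane), the simple regular-expression tokenizer, and the use of power iteration to approximate the first two principal components are all assumptions made for the example. A statistics library would normally be used for the PCA step.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def chunk_profiles(text, n_words=50, chunk_size=5000):
    """Return the most common words and one frequency vector per full chunk."""
    tokens = tokenize(text)
    vocab = [word for word, _ in Counter(tokens).most_common(n_words)]
    profiles = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        counts = Counter(tokens[start:start + chunk_size])
        profiles.append([counts[word] / chunk_size for word in vocab])
    return vocab, profiles


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leading_direction(rows, iterations=200, seed=0):
    """Approximate the first principal direction of centred rows by power iteration."""
    rng = random.Random(seed)
    dim = len(rows[0])
    v = [rng.random() - 0.5 for _ in range(dim)]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]
        # Same as multiplying v by X^T X, the unscaled covariance matrix.
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # no variance left in the data
            return [0.0] * dim
        v = [x / norm for x in w]
    return v


def project_to_plane(profiles):
    """Project each profile onto its first two principal components."""
    dim = len(profiles[0])
    means = [sum(p[j] for p in profiles) / len(profiles) for j in range(dim)]
    centred = [[p[j] - means[j] for j in range(dim)] for p in profiles]
    pc1 = leading_direction(centred)
    # Remove the first component before searching for the second (deflation).
    deflated = [[x - dot(row, pc1) * c for x, c in zip(row, pc1)] for row in centred]
    pc2 = leading_direction(deflated)
    return [(dot(row, pc1), dot(row, pc2)) for row in centred]
```

To compare two works, their texts would be concatenated (or profiled against a shared word list) so that all chunks are projected onto the same plane; chunks by the same author would then be expected to cluster together.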

Neural networks

Neural networks have been used to analyze the authorship of texts. Texts of undisputed authorship are used to train the neural network through processes such as backpropagation, in which the training error is calculated and used to update the network's weights to increase accuracy. Through a process akin to non-linear regression, the network gains the ability to generalize its recognition ability to new texts to which it has not yet been exposed, classifying them to a stated degree of confidence. Such techniques were applied to the long-standing claims of collaboration of Shakespeare with his contemporaries Fletcher and Christopher Marlowe,^[13]^[14] and confirmed the view, based on more conventional scholarship, that such collaboration had indeed taken place.
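As a rough illustration of training by backpropagation, the following Python sketch fits a small network with one hidden layer to feature vectors from two authors. It is a minimal example under stated assumptions, not the architecture of the studies cited above: the feature values, layer size, learning rate, and function names (train_network, predict) are hypothetical.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def train_network(samples, labels, hidden=3, epochs=2000, rate=0.5, seed=0):
    """Train a one-hidden-layer network; labels are 0 or 1 (two authors)."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # The extra weight in each row multiplies a constant bias input of 1.0.
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            xb = list(x) + [1.0]
            h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w1]
            hb = h + [1.0]
            out = sigmoid(sum(w * v for w, v in zip(w2, hb)))
            # Backpropagation: output error first, then the hidden-layer errors.
            d_out = out - y  # gradient of the cross-entropy loss
            d_hid = [d_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden + 1):
                w2[j] -= rate * d_out * hb[j]
            for j in range(hidden):
                for i in range(n_in + 1):
                    w1[j][i] -= rate * d_hid[j] * xb[i]
    return w1, w2


def predict(network, x):
    """Return the network's estimate that x belongs to the author labelled 1."""
    w1, w2 = network
    xb = list(x) + [1.0]
    hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w1] + [1.0]
    return sigmoid(sum(w * v for w, v in zip(w2, hb)))
```

In practice each input vector would hold stylistic measurements of a text sample (for example, relative frequencies of selected words), and the output would be read as a degree of confidence that the sample belongs to one author rather than the other.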

A 1999 study showed that a neural network program reached 70% accuracy in determining authorship of poems it had not yet analyzed. This study from Vrije Universiteit examined identification of poems by three Dutch authors using only letter sequences such as "den".^[15]
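Letter-sequence features of this kind are simple to compute. The sketch below counts three-letter sequences inside words and converts the counts to relative frequencies; how the 1999 study actually segmented and normalized its text is not stated here, so the tokenization and the function name letter_trigram_frequencies are assumptions for illustration only.

```python
import re
from collections import Counter


def letter_trigram_frequencies(text):
    """Return the relative frequency of each three-letter sequence within words."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    counts = Counter()
    for word in words:
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in counts.items()}
```

The resulting frequencies for a fixed list of letter sequences could then serve as the input vector of a classifier such as a neural network.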

One problem with this method of analysis is that the network can become biased based on its training set, possibly selecting authors the network has more often analyzed.^[15]

Genetic algorithms

The genetic algorithm is another artificial intelligence technique used in stylometry. The method starts with a set of rules, for example: "If the word 'but' appears more than 1.7 times in every thousand words, then the text is by author X". The program is presented with texts and uses the rules to determine authorship. The rules are tested against a set of known texts, and each rule is given a fitness score. In a population of 100 rules, the 50 with the lowest scores are thrown out; the remaining 50 are given small changes, and 50 new rules are introduced. This is repeated until the evolved rules correctly attribute the texts.
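The loop described above can be sketched as follows. This is a simplified illustration under several assumptions: each rule is a hypothetical (word, threshold, author) triple, fitness is the fraction of known texts on which the rule's verdict is correct, the "small changes" are random nudges to the threshold, and the threshold range of 0 to 50 occurrences per thousand words is arbitrary.

```python
import random


def rate_per_thousand(tokens, word):
    """Occurrences of word per thousand tokens."""
    return 1000.0 * tokens.count(word) / len(tokens)


def fitness(rule, known_texts):
    """Fraction of known texts for which the rule gives the right verdict."""
    word, threshold, author = rule
    hits = 0
    for tokens, true_author in known_texts:
        fires = rate_per_thousand(tokens, word) > threshold
        hits += fires == (true_author == author)
    return hits / len(known_texts)


def random_rule(rng, words, authors):
    return (rng.choice(words), rng.uniform(0.0, 50.0), rng.choice(authors))


def mutate(rule, rng):
    """Apply a small random change to the rule's threshold."""
    word, threshold, author = rule
    return (word, max(0.0, threshold + rng.gauss(0.0, 1.0)), author)


def evolve(known_texts, words, authors, pop_size=100, max_generations=200, seed=0):
    """Evolve rules until one attributes every known text correctly."""
    rng = random.Random(seed)
    half = pop_size // 2
    population = [random_rule(rng, words, authors) for _ in range(pop_size)]
    for _ in range(max_generations):
        population.sort(key=lambda r: fitness(r, known_texts), reverse=True)
        if fitness(population[0], known_texts) == 1.0:
            break
        # Discard the weaker half, perturb the survivors, add fresh rules.
        survivors = [mutate(r, rng) for r in population[:half]]
        newcomers = [random_rule(rng, words, authors) for _ in range(pop_size - half)]
        population = survivors + newcomers
    return max(population, key=lambda r: fitness(r, known_texts))
```

A fuller implementation would typically evolve sets of rules covering several words and authors, and would evaluate the result on texts held out from training.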

Rare pairs

One method for identifying style is called "rare pairs", and relies upon individual habits of collocation. The use of certain words may, for a particular author, idiosyncratically entail the use of other, predictable words.
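One way to look for such collocation habits is sketched below. It is an assumption-laden illustration rather than the published method: it treats a "pair" as two distinct words occurring fewer than `window` positions apart, and it calls a pair "rare" when it occurs no more than `max_background` times in a reference text. The function names and parameters are hypothetical.

```python
import re
from collections import Counter


def word_pairs(text, window=5):
    """Count unordered pairs of distinct words fewer than `window` positions apart."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def shared_rare_pairs(text_a, text_b, background, window=5, max_background=0):
    """Pairs found in both texts that are rare in the background text."""
    pairs_a = word_pairs(text_a, window)
    pairs_b = word_pairs(text_b, window)
    common = word_pairs(background, window)
    shared = pairs_a.keys() & pairs_b.keys()
    return sorted(p for p in shared if common[p] <= max_background)
```

A large number of shared rare pairs between a disputed text and an author's known work would count as evidence of common authorship, subject to the usual caveats about text length and genre.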

Authorship attribution in instant messaging

The spread of the Internet has shifted the attention of authorship attribution towards online texts (web pages, blogs, etc.), electronic messages (e-mails, tweets, posts, etc.), and other types of written information that are far shorter than an average book, much more informal, and much richer in expressive elements such as colors, layout, fonts, graphics, and emoticons. Efforts to take such aspects into account at the level of both structure and syntax have been reported.^[16] In addition, content-specific and idiosyncratic cues (e.g., topic models and grammar checking tools) were introduced to unveil deliberate stylistic choices.^[17]

One of the most important current challenges in authorship attribution is the identification of people involved in chat (or chat-like) conversations.^[18]^[19] The task has become important now that social media have penetrated the everyday life of many people and made it possible to interact with persons who hide their identity behind nicknames or potentially fake profiles. Standard stylometric features have been employed to categorize the content of a chat^[20] or the behavior of the participants,^[21] but attempts to identify chat participants are still few and at an early stage. Furthermore, the similarity between spoken conversations and chat interactions has been neglected, even though it is a key difference between chat data and any other type of written information.

Notes

[1] Westcott, Richard (15 June 2006). "Making hit music into a science". BBC News.
[2] "Internet Archive Wayback Machine". Web.archive.org. 2006-06-30. Retrieved 2012-10-15.
[3] Samuel Schoenbaum, Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method, p. 171.
[4] Samuel Schoenbaum, Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method, p. 196.
[5] F. Mosteller and D. Wallace (1964). Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley.
[6] Edoardo M. Airoldi, Stephen E. Fienberg, Kiron K. Skinner (July 2007). "Whose Ideas? Whose Words? Authorship of Ronald Reagan's Radio Addresses" (PDF). PS: Political Science & Politics. 40 (3): 501–506. doi:10.1017/S1049096507070874.
[7] Gavin McNett (November 2, 2000). "Author Unknown". Salon.
[8] "Study finds a disputed Shakespeare play bears the master's mark". LATimes.com. 2015-04-10. Retrieved 2015-04-13.
[9] "The Signature Stylometric System". PhiloComp. Retrieved 2014-01-03.
[10] "JGAAP". JGAAP. 2012-09-04. Retrieved 2012-10-15.
[11] "stylo". CSG. 2014-10-24. Retrieved 2014-10-24.
[12] Daelemans, Walter and Hoste, Véronique (2013). STYLENE: an Environment for Stylometry and Readability Research for Dutch (Technical report). CLiPS Technical Report Series. ISSN 2033-3544.
[13] Matthews RAJ & Merriam TVN (1993). "Neural Computation in Stylometry I: An Application to the Works of Shakespeare and Fletcher". Lit Linguist Computing. 8 (4): 203–209. doi:10.1093/llc/8.4.203.
[14] Merriam TVN & Matthews RAJ (1994). "Neural Computation in Stylometry II: An Application to the Works of Shakespeare and Marlowe". Lit Linguist Computing. 9 (1): 1–6.
[15] JF Hoorn, SL Frank, W Kowalczyk and F van der Ham (2012-09-03). "Neural network identification of poets using letter sequences". Llc.oxfordjournals.org. Retrieved 2012-10-15.
[16] de Vel, O.; Anderson, A.; Corney, M.; Mohay, G. (2001-12-01). "Mining e-Mail Content for Author Identification Forensics". SIGMOD Rec. 30 (4): 55–64. doi:10.1145/604264.604272. ISSN 0163-5808.
[17] Argamon, Shlomo; Koppel, Moshe; Pennebaker, James W.; Schler, Jonathan (2009-02-01). "Automatically Profiling the Author of an Anonymous Text". Commun. ACM. 52 (2): 119–123. doi:10.1145/1461928.1461959. ISSN 0001-0782.
[18] Cristani, Marco; Roffo, Giorgio; Segalin, Cristina; Bazzani, Loris; Vinciarelli, Alessandro; Murino, Vittorio (2012-01-01). "Conversationally-inspired Stylometric Features for Authorship Attribution in Instant Messaging". Proceedings of the 20th ACM International Conference on Multimedia. MM '12. New York, NY, USA: ACM: 1121–1124. doi:10.1145/2393347.2396398. ISBN 9781450310895.
[19] Roffo, G.; Cristani, M.; Bazzani, L.; Minh, Ha Quang; Murino, V. (2013-12-01). "Trusting Skype: Learning the Way People Chat for Fast User Recognition and Verification". 2013 IEEE International Conference on Computer Vision Workshops (ICCVW): 748–754. doi:10.1109/ICCVW.2013.102.
[20] "Classification of Instant Messaging Communications for Forensics Analysis - TechRepublic". TechRepublic. Retrieved 2016-01-26.
[21] Zhou, L.; Zhang, Dongsong (2004-01-01). "Can online behavior unveil deceivers? - an exploratory investigation of deception in instant messaging". Proceedings of the 37th Annual Hawaii International Conference on System Sciences, 2004: 9 pp. doi:10.1109/HICSS.2004.1265079.

References

Brocardo, Marcelo Luiz; Issa Traore; Sherif Saad; Isaac Woungang (2013). Authorship Verification for Short Messages Using Stylometry. IEEE Intl. Conference on Computer, Information and Telecommunication Systems (CITS).
Can F, Patton JM (2004). "Change of writing style with time". Computers and the Humanities. 38 (1): 61–82. doi:10.1023/b:chum.0000009225.28847.77.
Brennan, Michael Robert; Greenstadt, Rachel. "Practical Attacks Against Authorship Recognition Techniques". Innovative Applications of Artificial Intelligence.
Hope, Jonathan (1994). The Authorship of Shakespeare's Plays. Cambridge: Cambridge University Press.
Hoy C (1956–62). "The Shares of Fletcher and His Collaborators in the Beaumont and Fletcher Canon". Studies in Bibliography. 7–15.
Juola, Patrick (2006). "Authorship Attribution" (PDF). Foundations and Trends in Information Retrieval. 1: 3. doi:10.1561/1500000005.
Kenny, Anthony (1982). The Computation of Style: An Introduction to Statistics for Students of Literature and Humanities. Oxford: Pergamon Press.
Romaine, Suzanne (1982). Socio-Historical Linguistics. Cambridge: Cambridge University Press.
Samuels, M. L. (1972). Linguistic Evolution: With Special Reference to English. Cambridge: Cambridge University Press.
Schoenbaum, Samuel (1966). Internal Evidence and Elizabethan Dramatic Authorship: An Essay in Literary History and Method. Evanston, IL, USA: Northwestern University Press.
