Enron Corpus

The Enron Corpus is a large database of over 600,000 emails generated by 158 employees[1] of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.[2] A copy of the database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.[3] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer analysis of language. The corpus is notable as one of the few publicly available mass collections of real emails; such collections are typically bound by privacy and legal restrictions that make them prohibitively difficult to access.[3] In 2010, EDRM.net published a revised version of the corpus, version 2.[4] This expanded corpus, containing over 1.7 million messages, is available on Amazon S3 for easy access by the research community. Earlier, in 2004, Jitesh Shetty and Jafar Adibi of the University of Southern California processed the original corpus, released a MySQL version of it,[5] and published link analysis results based on it.[6][7]


  1. ^ Klimt, Bryan; Yiming Yang. "The Enron Corpus: A New Dataset for Email Classification Research". CiteSeerX: 
  2. ^ "The Enron Email Corpus". Retrieved March 5, 2011.
  3. ^ a b Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". New York Times. March 5, 2011. p. A1.
  4. ^ Socha, George. "EDRM Enron Email Data Set v2 Now Available". www.edrm.net. 
  5. ^ "Enron processed database"
  6. ^ "Enron Link Analysis"
  7. ^ "[1]"

External links