Enron Corpus


The Enron Corpus is a large database of over 600,000 emails generated by 158 employees[1] of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.[2] A copy of the database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.[3] He released this copy to researchers, providing a trove of data that has been used for studies of social networking and computer analysis of language. The corpus is unusual in that it is one of the few large collections of real emails publicly available for study; such collections are typically bound by privacy and legal restrictions that make them prohibitively difficult to access.[3] In 2010, EDRM published a revised version 2 of the corpus.[4] This expanded corpus, containing over 1.7 million messages, is now available on Amazon S3 for access by the research community.

References

  1. Klimt, Bryan; Yang, Yiming. "The Enron Corpus: A New Dataset for Email Classification Research". CiteSeerX 10.1.1.61.1645.
  2. "The Enron Email Corpus". Retrieved March 5, 2011.
  3. Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". The New York Times, March 5, 2011, p. A1.
  4. Socha, George. "EDRM Enron Email Data Set v2 Now Available". www.edrm.net.

External links