Enron Corpus


The Enron Corpus is a large database of over 600,000 emails generated by 158 employees[1] of the Enron Corporation and acquired by the Federal Energy Regulatory Commission during its investigation after the company's collapse.[2]


The Enron data was originally collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by Joe Bartling,[3] a litigation support and data analysis contractor working for Aspen Systems (now Lockheed Martin), which the Federal Energy Regulatory Commission (FERC) had hired to preserve and collect the data in the wake of the Enron bankruptcy in December 2001. In addition to the employee emails, all of Enron's enterprise database systems,[4] hosted in Oracle databases on Sun Microsystems servers, were captured and preserved, including its online energy trading platform, EnronOnline.

Once collected, the Enron emails were processed and hosted on the litigation platform Concordance, and later iCONECT, for review by investigators from the Federal Energy Regulatory Commission, the Commodity Futures Trading Commission, and the Department of Justice. At the conclusion of the investigation, and upon the issuance of the FERC staff report,[5] the emails and information collected were deemed to be in the public domain, to be used for historical research and academic purposes. The email archive was made publicly available and searchable on the web using iCONECT 24/7, but its sheer volume, over 160 GB, made it impractical to use. Copies of the collected emails and databases were made available on hard drives.

A copy of the email database was subsequently purchased for $10,000 by Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.[6] He released this copy to researchers, providing a trove of data that has been used for studies on social networking and computer analysis of language.


The corpus is valuable as one of the few publicly available mass collections of real emails, as such collections are typically bound by numerous privacy and legal restrictions which render them prohibitively difficult to access.[6] Jitesh Shetty and Jafar Adibi from the University of Southern California processed the corpus in 2004, released a MySQL version[8] of it, and published link analysis results based on it.[9] In 2010, EDRM.net published a revised version 2 of the corpus.[7] This expanded corpus, containing over 1.7 million messages, was made available on Amazon S3 for easy access by the research community.


  1. Klimt, Bryan; Yang, Yiming (2004). "The Enron Corpus: A New Dataset for Email Classification Research". pp. 217–226.
  2. "The Enron Email Corpus". Archived 2011-03-08 at the Wayback Machine. Retrieved March 5, 2011.
  3. Bartling, Joe (September 3, 2015). "The Enron Data Set - Where Did It Come From?". Bartling Forensic and Advisory. Retrieved September 3, 2015.
  4. "FERC: Industries - Enron's Energy Trading Business Process and Databases". www.ferc.gov. Retrieved 2015-09-02.
  5. FERC Staff Report - Price Manipulation in Western Markets - Findings at a Glance (3-26-2003).
  6. Markoff, John. "Armies of Expensive Lawyers, Replaced by Cheaper Software". The New York Times. March 5, 2011. p. A1.
  7. Socha, George. "EDRM Enron Email Data Set v2 Now Available". www.edrm.net. Archived from the original on 2011-09-04. Retrieved 2012-09-03.
  8. "Enron processed database".
  9. Shetty, Jitesh; Adibi, Jafar (2005). "Discovering important nodes through graph entropy: the case of Enron email database". Proceedings of the 3rd international workshop on Link discovery - LinkKDD '05. pp. 74–81. doi:10.1145/1134271.1134282. ISBN 978-1595932150.

External links