BookCorpus

BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords.[1] It was the main corpus used to train OpenAI's initial GPT model,[2] and has served as training data for other early large language models, including Google's BERT.[3] The dataset comprises around 985 million words, and its books span a range of genres, including romance, science fiction, and fantasy.[3]

The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books".[4] The authors described it as consisting of "free books written by yet unpublished authors", but this description is inaccurate: the books were written by self-published ("indie") authors who had set their price to free, and they were downloaded without the consent or permission of Smashwords or its authors, in violation of the Smashwords Terms of Service.[5] The dataset was initially hosted on a University of Toronto webpage.[5] An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created.[1] Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.[1][5]

References

  1. Bandy, Jack; Vincent, Nicholas (2021). "Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus". NeurIPS.
  2. Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. Archived (PDF) from the original on January 26, 2021. Retrieved June 9, 2020.
  3. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (October 11, 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805v2 [cs.CL].
  4. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. (2015). "Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books". 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile. pp. 19–27.
  5. Lea, Richard (September 28, 2016). "Google swallows 11,000 novels to improve AI's conversation". The Guardian.