Text Creation Partnership
The Text Creation Partnership (TCP) is a not-for-profit organization that has been based in the library of the University of Michigan since 2000. Its purpose is to produce large-scale full-text electronic resources (especially in the humanities) on behalf of both member institutions (particularly academic libraries) and scholarly publishers, under an arrangement calculated to serve the needs of both. In so doing, it aims to demonstrate the value of a business model that sees corporate and non-profit information providers as potentially amicable collaborators rather than as antagonistic vendors and customers respectively.
TCP has sponsored four text-creation projects to date. The first and largest is "EEBO-TCP (Phase I)" (2001–2009), an effort to produce structurally marked-up full-text transcriptions of more than 25,000 of the roughly 125,000 books listed in the Pollard and Redgrave and Wing short-title catalogues of early English printed books or included among the Thomason Tracts; that is, from among nearly all books, pamphlets, and broadsides published in English or in England before 1700. The books were selected and transcribed from the digital scans produced by ProQuest Information and Learning and distributed by them as a web-based product under the name "Early English Books Online" (EEBO). Those scans were themselves made from the microfilm copies made over the years by ProQuest and its antecedent companies, including the original University Microfilms, Inc. EEBO-TCP Phase I concluded at the end of 2009, having transcribed about 25,300 titles. The TCP moved immediately into its second project, EEBO-TCP Phase II (2009–), a sequel dedicated to converting all the remaining unique English-language monographs (roughly 45,000 additional titles).
The third TCP project was Evans-TCP (2003–2007, with some ongoing work through 2010), an effort to transcribe 6,000 of the 36,000 pre-1800 titles listed in Charles Evans' American Bibliography. These titles are distributed, again as page images scanned from microfilm copies, by Readex, a division of NewsBank, under the name "Archive of Americana" ("Early American Imprints, series I: Evans, 1639–1800"). Evans-TCP has produced e-texts of nearly 5,000 books.
The final TCP project was ECCO-TCP (2005–2010, with some work ongoing), an effort to transcribe 10,000 eighteenth-century books from among the 136,000 titles available in Thomson-Gale's web-based resource, "Eighteenth-Century Collections Online" (ECCO). ECCO-TCP ran out of funding in 2010 after transcribing about 3,000 (and editing about 2,400) titles.
The TCP is overseen by a Board of Directors, drawn chiefly from senior library administrators at partner institutions, representatives of the corporate partners, and the Council on Library and Information Resources (CLIR). The Board is assisted in matters of selection and scholarship by an academic advisory group that includes faculty in the fields of early modern English and American studies.
The TCP has informal ties to a number of university-based scholarly text projects, chiefly by providing them with source texts with which to work. Institutions represented include Northwestern University (IL), Oxford University (UK), Washington University (St. Louis), the University of Sydney (Australia), the University of Toronto (ON), and the University of Victoria (BC). TCP has also worked with students by sponsoring an Undergraduate Essay Contest every year, convening task forces on the uses of TCP texts in pedagogy, and appealing to scholars and students for ideas on selection and use.
Text production is managed through the University of Michigan's Digital Library Production Service (DLPS), with its extensive experience in the production of SGML/XML-encoded electronic texts. DLPS is assisted by Oxford University's Bodleian Digital Libraries Systems & Services (BDLSS). Small part-time production operations have also been started within two other libraries: the Centre for Reformation and Renaissance Studies in Pratt Library (Victoria University in the University of Toronto), specializing in Latin books; and the National Library of Wales (Llyfrgell Genedlaethol Cymru) in Aberystwyth, specializing in Welsh books.
All four TCP text projects are very similar. In each case:
- The TCP produces text from commercial image files that have in turn been created from microfilm copies of early books.
- The commercial image providers receive what is in effect a full-text index to their image product for much less than it would cost to produce it themselves: value added to their product.
- The partner libraries actually own, rather than simply license, the resultant texts, and are free (subject to some conditions) to mount the texts themselves in whatever system they like, or use the texts internally as a tool of scholarship and teaching.
- The texts are created according to library-determined standards, uniform across multiple data-sets and potentially cross-searchable.
- Because they are created collaboratively, the texts are relatively inexpensive (on a per-book basis) and become more so with each library that joins the partnership.
- The texts will eventually be made freely accessible to the public at large.
- The selection of texts to convert, though differing from project to project, in each case follows similar principles: variety, significance, representative quality, avoidance of duplication; specific requests from faculty or scholarly initiatives at member institutions are also generally honored.
- TCP has hitherto been primarily interested in creating texts, not in creating a "product"; though texts from all four projects are or will be mounted on servers at the University of Michigan library, the Michigan site is not the official TCP site: any partner library with adequate resources and safeguards may do the same. EEBO-TCP texts, for example, are served by Michigan, ProQuest, the Oxford University Digital Library, and the University of Chicago.
All four TCP text projects are produced in the same way and to the same standards, which are documented, at least in part, on the TCP web site.
- Accuracy. The TCP strives to produce texts that are as accurately transcribed as possible, with a specified overall accuracy rate of 99.995% or better (i.e. one error or fewer per 20,000 characters).
- Keying. Given the nature of the material, the only method found to deliver such accuracy economically has been to have the books keyed by data conversion firms under contract.
- Quality control. Accuracy of transcription and aptness of markup are assessed in all cases by a group of library-based proofers and reviewers managed by the University of Michigan DLPS.
- Encoding. All resultant text files are marked up in valid SGML or XML (SGML is archived, XML is exported) conforming to a proprietary "Document Type Definition" (DTD) derived from the P3/P4 version of the Text Encoding Initiative (TEI) standard.
- Purposeful markup. Compared to the full TEI, the TCP DTD is very simple and intended to capture only the features most useful for intelligible display, intelligent navigation, and productive searching. The TCP practice is to capture, so far as feasible, the overall hierarchical structure of each book (parts, sections, chapters, etc.); the features that tend to mark the beginnings and ends of divisions (headings, explicits, salutations, valedictions, datelines, bylines, epigraphs, etc.); the most significant elements of discourse and organization (paragraphs in prose, lines and stanzas in verse, speeches, speakers, and stage directions in drama, notes, block quotes, sequential numerations of all kinds); and only the most essential aspects of physical formatting (page breaks, lists, tables, font changes).
- Fidelity to the original. In each case, the text is intended to represent the book as originally printed, so far as that is possible. Printer's errors are preserved, hand-written changes are ignored, duplicate scans are omitted, out-of-order images are keyed in the intended order, and most of the unusual characters of the original are preserved.
- Ease of reading and searching. At the same time, though the transcriptions are carried out character-by-character, TCP, on the theory that all transcription is a kind of translation from one symbolic system to another, tends to define characters in terms more of their meaning than of their form, and to map eccentric letter-forms to meaningful modern equivalents, generally in keeping with the Unicode definition of "character."
- Languages. Though most of the TCP texts are in English, many are not. Books and divisions of books not in English are tagged with an appropriate language code, but are not otherwise distinguished.
- Omitted material. The TCP produces Latin-alphabet text. Non-textual material such as musical notation, mathematical formulae, and illustrations (except for any text they may contain) is omitted and its location marked with a special tag. Extended text in non-Latin alphabets (Greek, Hebrew, Persian, etc.) is also omitted.
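The accuracy target stated above (99.995%, or one error or fewer per 20,000 characters) can be checked with a little arithmetic. The following Python sketch is illustrative only: the function names and the sample character counts are assumptions made for this example, not part of any TCP tooling or documented TCP procedure.

```python
import math
from fractions import Fraction

# Target taken from the text: 99.995% accuracy. Exact fractions are used
# so that the boundary case is not disturbed by floating-point rounding.
TARGET_ACCURACY = Fraction(99995, 100000)


def max_errors_allowed(char_count: int) -> int:
    """Largest number of character errors that still meets the target.

    Hypothetical helper for illustration only.
    """
    error_rate = 1 - TARGET_ACCURACY  # equals 1/20000
    return math.floor(char_count * error_rate)


def meets_target(char_count: int, errors: int) -> bool:
    """Return True if a transcription with `errors` wrong characters
    out of `char_count` reaches the target accuracy."""
    if char_count <= 0:
        raise ValueError("char_count must be positive")
    return Fraction(char_count - errors, char_count) >= TARGET_ACCURACY


if __name__ == "__main__":
    # Sample sizes are arbitrary, chosen only to show the threshold.
    for size in (19_999, 20_000, 1_000_000):
        print(size, max_errors_allowed(size))
```

Under these assumptions, a text of 1,000,000 characters could contain at most 50 errors and still meet the stated rate, while a text shorter than 20,000 characters would have to be error-free.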
Accomplishments and prospects
As of April 2011, the TCP had created about 40,000 searchable, navigable, full-text transcriptions of early books, a database of unmatched scope, scale, and utility to students in many fields. Whether it will be able to go on to produce the remaining 38,000 texts included in its ambitious recent plans (for EEBO-TCP Phase II) will depend on the validity of its original vision, arising from the theory that libraries could and should cooperate to become producers and standard-setters rather than consumers; and that universities and commercial firms, despite their very different life-cycles, constraints, and motives, could join in durable partnerships of benefit to all parties.
- For an overview see Blumenstyk, Goldie (August 10, 2001). "A Project Seeks to Digitize Thousands of Early English Texts". Chronicle of Higher Education: A47. Retrieved 2007-01-04.
- Beamish, Rita (July 29, 1999). "Online Archive Will Preserve Earliest English Books". New York Times. Retrieved 2007-01-04.