Text Encoding Initiative
Tenth Anniversary User Conference


DO DIGITAL LIBRARIES NEED THE TEI? A VIEW FROM THE TRENCHES.
LeeEllen Friedland (National Digital Library Program, Library of Congress)



As we mark the tenth anniversary of the Text Encoding Initiative (TEI), it seems especially appropriate to consider how digital libraries and electronic text projects have evolved in that time and whether that evolution has had an impact on the success of--or future prospects for--the application of the TEI in the humanities. Some relevant questions include: are there significant differences between digital library projects and scholarly electronic text projects and, if so, does the TEI serve the resulting range of needs adequately? Are the core principles of the TEI still appropriate and relevant? And finally, how easy is it to employ the TEI in a workaday text-conversion and encoding program? In this paper, these issues and related matters will be discussed from the perspective of the National Digital Library Program at the Library of Congress, which implemented a TEI-based DTD in 1993.

The National Digital Library Program at the Library of Congress is on an ambitious course to digitize millions of primary source items from a broad range of historical collections. The Library has been digitizing historical materials since 1990, when the American Memory pilot program developed collections, initially on CD-ROM, to explore the potential audiences, uses, and enthusiasm for digital resources on American history and culture. An end-user evaluation was conducted in 1992-93 at forty-four test sites around the United States, including school, college, university, state, and public libraries. Students, teachers, library staff, and the general public were surveyed about their experiences using the digitized materials, their interest in different types of content, and various issues relating to the delivery systems. Since the completion of the pilot program in 1995, the digitizing activity of LC's National Digital Library Program has expanded to include an ever broader mix of historical source materials, to provide on-line access via the World Wide Web, and to incorporate the digitizing process into the larger work and mission of the institution.

The nature of the digital library program at LC has been shaped, from the inception of the American Memory pilot, by the ideas and priorities of the Librarian of Congress, James H. Billington. It was he who established the focus on American materials (including both materials about the United States and American imprints), and this remains the general subject orientation of the program today. Though this is a broad subject, the Library collects materials in more than four hundred languages, and "American" materials certainly do not represent the institution's holdings comprehensively. However, the multiple formats of LC collection materials are well represented in the digital library program. Though the state of relevant technologies (and related factors) greatly influences the quantity and pace of digitizing work done with materials in different formats, the NDLP digitizes all types of printed matter, manuscript materials, prints, photographs, sound recordings, and motion pictures.

Another important characteristic of LC's program, also defined by the Librarian's priorities, is that the digitizing effort should facilitate greater access to the Library's treasures for a larger audience than has traditionally used the collections. Indeed, the Librarian has often used phrases like "getting the champagne out of the bottle" when describing program goals. As part of this philosophy of providing greater access, the Librarian has emphasized kindergarten through high school (K-12) students and educators, and the general public; post-secondary students and educators and the scholarly community have not been discounted, but neither have they been made priorities.

These issues have guided the general development of the American Memory pilot and current NDLP digital production. They have also had an impact on the design and implementation of the NDLP text-conversion program. From the beginning, certain priorities have been clear. Perhaps most fundamental was the idea that machine-readable texts were an important and new, or at least not commonly available, type of access to be provided for these historical materials. The ability to search the full text of a document, and to preserve its intellectual content and, often, its context (represented by the document structure, formatting, and other presentational features), was seen as highly desirable. The emphasis on historical materials, of course, guaranteed a wide range of text sources. But in the context of digitizing materials in multiple formats, including printed matter, typed and handwritten manuscripts, printed illustrations of many types, photographs, sound recordings, and motion pictures, printed matter was--and is--just another type of stuff. We were forced to seek certain economies, and not only the financial and work-effort kind. It was clear from the beginning that we needed conceptual economies that would allow us to make progress on all fronts and do the best job possible with each one, considering the current state of technology. Ours is a fully integrated digital library project, not just an electronic text project. And yet, being a library, no less a library with 525 miles of bookshelves, we took quite seriously our responsibility to represent text materials adequately.

In developing our SGML document type definition (DTD), we confronted some hard choices and made what might be characterized as some very library-like decisions. Our reasoning followed these lines: We are the custodians of these text materials and we want to provide improved access to them. We don't presume to know what all Library users might want to do with these texts, and, even if we could anticipate every user's needs, we couldn't possibly accommodate all of them.

We didn't want to have to force different document types into a single content model. Nor did we want to maintain a baker's dozen of DTDs and match every document with the best-suited DTD, or require that kind of sophisticated decision making from data-entry technicians who were unlikely to possess the appropriate training. We knew that we would provide digital images of the original pages of text materials and that we wanted the texts to faithfully retain original errors. We had neither the staffing nor the time (nor the mission) to aspire to the standards of a typical documentary editing project, and, therefore, had to be prepared to allow all texts to be converted and marked up without benefit of expert examination to, for example, interpret handwriting, identify subtle boundaries of text subsections, or distinguish between different types of names or dates.

In response to these issues, we sought to cultivate some creative and flexible middle ground. Little did we realize that we would find that middle ground in the TEI. There was an uncanny congruence between the encoding principles derived during the American Memory document analysis and the TEI Guidelines. This should not be so surprising, however, since both projects were firmly rooted in the careful analysis of a broad range of humanities texts. Though LC staff did not expect to become TEI converts, we knew what we wanted and what types of capabilities we had to have. It could be argued that the unexpected result--great compatibility and congruence between the American Memory DTD and the TEI--underscores the appropriateness of the TEI gestalt for use in the humanities. The descriptive flexibility afforded by the TEI is profoundly important and, this author would argue, developing a digital library of historical materials in the humanities would be impossible without it. One might say, in summary, that the American Memory DTD is the TEI writ small.
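
To give a concrete sense of that congruence, consider a purely hypothetical fragment of the kind of markup discussed here. The element names below follow the published TEI Guidelines, but the passage and its tagging are invented for illustration and are not drawn from the American Memory DTD or from any Library of Congress collection. The sketch shows a generic division structure in place of a document-type-specific content model, an original misspelling retained with <sic>, and a name and a date tagged without interpretive subtyping:

    <!-- Invented sample text; generic TEI-style tagging for illustration only -->
    <text>
     <body>
      <div1 type="letter">
       <p>Dear <name>Mr. Buel</name>: I have <sic>recieved</sic> your
          note of <date>March 4, 1887</date>, and will send the papers
          you request by the next post.</p>
      </div1>
     </body>
    </text>

Because the tagging stays at this generic level, texts can be converted and marked up consistently by data-entry technicians without the expert editorial judgments described above.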

After more than four years of reasonably successful implementation and continuing evolution of LC's text-conversion program, observations on a number of matters are timely. This paper will discuss in detail such issues as the contrasts between digital libraries and electronic text projects (using the NDLP as the main point of reference), including the former's goal of providing encoded texts to a broad, unspecialized audience; the compromises required by integrating text files into a complex systems architecture and multi-media Web presentations; and the everyday realities of balancing text encoding with the wide array of traditional library work necessary for preparing historical materials for digitization. In addition, discussion will focus on several of the founding principles articulated in the TEI, including the intentions to (1) "provide a standard format for data interchange in humanities research" (is there any such thing--outside tiny tribes of specialists?), (2) "suggest principles for encoding of texts in the same format" (encoding principles are invaluable, but, one might argue, the concept of a format is too canonical), and (3) "include a minimal set of conventions for encoding new texts in the format" (but is the minimal set simple enough for a broad program?).

