Text Encoding Initiative
Tenth Anniversary User Conference


An SGML/HTML Electronic Thesis and Dissertation Library

Janet Erickson
University of Michigan
Abstract for TEI10 Conference
20 October 1997

Introduction

The Electronic Thesis and Dissertation Project (ETD) was launched in 1987 at an Ann Arbor meeting arranged by UMI and attended by representatives of Virginia Tech, the University of Michigan, SoftQuad, and ArborText. Virginia Tech funded the development of a Document Type Definition (DTD) for dissertations and theses; SoftQuad's Yuri Rubinsky wrote the initial DTD. The project continued at VT (http://scholar.lib.vt.edu/theses/) with collaboration from the Coalition for Networked Information, the Council of Graduate Schools, and UMI, among others. Since 1994, many of VT's students have submitted their dissertations and theses in Adobe's Portable Document Format (PDF). As of January 1997, VT's students were required to submit their projects in electronic form rather than on paper. The long-term plan is to have theses and dissertations submitted in both PDF and SGML. VT is waiting for suitable software to be developed before requiring submissions in SGML format.

Virginia Tech's Electronic Thesis and Dissertation Project is part of the larger National Digital Library of Theses and Dissertations (http://www.ndltd.org/), funded by a grant from the U.S. Dept. of Education. VT has been joined by approximately 15 other universities in supporting this effort. The Southeastern Universities Research Association (SURA) has also provided funding for the ETD Project.

The aim of my project is to describe a potential online library of dissertations and theses at the University of Michigan. The focus is on the SGML markup of sample dissertations using the TEI DTD and an HTML-based user interface for searching and retrieval. For this project, I acquired four dissertations to show the breadth of types that would need to be covered by the selected DTD. The first is the doctoral dissertation of Rebecca Price-Wilkin on the architectural history of a French church (The Late Gothic Abbey Church of Saint-Riquier). Price-Wilkin's document is used as an example of an image-rich dissertation. The second is the doctoral dissertation of David Ruddy on the medieval travelogue Mandeville's Travels (Scribes, Printers, And Vernacular Authority: A Study in the Late-Medieval And Early-Modern Reception of Mandeville's Travels). Ruddy's will show how historic texts can be represented in this model.

A third dissertation is from Michele Tepper, and is titled "The Mind of His Own Country": Nation and the Embodiment of Culture in Modernist Literature. It demonstrates the diversity that can be found within a dissertation, as each chapter can be thought of as an independent unit. Fourth, I acquired the dissertation of William Wheeler on global warming and agriculture from the University of Pennsylvania's School of Agriculture Economics and Rural Sociology. This document, titled Three Essays on Discounting and the Evaluation of Global Warming Policies, contains many tables, graphs, and formulas, allowing me to speak to these important issues.

In addition to the selection of a DTD and markup of the sample dissertations, this paper will address e-thesis investigations and discussions, the TEI header and the ETD front matter, searching of dissertations, why e-theses should use SGML, and other important issues.

Selection of a DTD

The first challenge to the project was selection of a Document Type Definition (DTD), or the rules by which markup would be applied to each document. The ETD project at Virginia Tech had completed a DTD (the ETD DTD) for use on theses and dissertations. I initially elected to use this DTD for many reasons. It seemed rather simple, such that students with a passing knowledge of HTML markup could learn to use it with minor difficulty. Also, it had been designed with the material in mind, so there was some anticipation of suitability to the task. Lastly, I had made the faulty assumption that the DTD had been used and perfected such that it would be the most efficient way to mark up these dissertations.

As I investigated a beta version of the ETD Project's DTD, many limitations became apparent. This DTD provided few attributes, most of which were 'id's. This limits the complexity that is presented to a novice user, but also reduces the DTD's flexibility. A line break or <br> was not available and there was no additional containment for the different parts of a <head> such as a subtitle. Each division could also have only one <head> element. Footnotes were referenced with <link> elements that include a required IDREF; the corresponding <footnote> element was not included in any content model in the beta DTD.

Use of this DTD was also hampered by the complexity of the four dissertations selected for the project. One of the documents has several lists and an introduction that precede the chapters; the ETD DTD did not have structures for these, so these lists would have to go into the first <chapter> element. The <chapter> element did not include attributes other than 'id,' so a 'type' attribute could not be added for clarification. After the chapters, there is a conclusion, followed by the figures and illustrations. This same dissertation has 150 illustrations and figures. There was no clear indication of what structure within the DTD to use for the images, though the <mm> element (multimedia object) was a likely candidate. It was, however, not referenced in any content model other than its own, so applications of the ETD DTD could not use the <mm> element. In sum, the ETD DTD, as I found it in early March 1997, was insufficiently tested and had enough problems that it could not be used for this project.

The Electronic Theses and Dissertations DTD underwent some revision in early 1997, and version 1.0 was released in March of that year. Several problems with the beta were fixed, such as adding the footnote element as an inclusion to the root ETD element and the mm element to the content models for chapter and elements further down the tree. A natural break was added as element br. The new DTD retains the requirement that text tagged as a poetic verse must contain at least one stanza tag. Additional flexibility in the verse content model would be useful, such as adding line as an option in verse rather than only allowing it as a subelement of stanza. The basic form of an ETD is as follows (NB: all elements are required unless otherwise noted):

The root element is ETD with main elements front, body, and back

front includes:
title, author, submission, school, degree, major, approvals, date, city, state, keywords, copyright, abstract, grant (zero or one), dedication (zero or one), acknowledgments (zero or one)

body includes:
chapter (one or more) with structural subelements such as section and p

back includes:
bibliography, appendix (zero or more), vita
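The front-matter content model listed above can be sketched as a small ordering check. This is only an illustration in Python, not part of the ETD DTD or of any tool used in this project; it assumes the elements must appear in exactly the order listed, and the names FRONT_MODEL and check_front are hypothetical.

```python
# Front-matter elements in the order listed above; True marks a required
# element, False marks one that may appear zero or one time.
# (Assumption: the DTD enforces exactly this order.)
FRONT_MODEL = [
    ("title", True), ("author", True), ("submission", True),
    ("school", True), ("degree", True), ("major", True),
    ("approvals", True), ("date", True), ("city", True),
    ("state", True), ("keywords", True), ("copyright", True),
    ("abstract", True), ("grant", False), ("dedication", False),
    ("acknowledgments", False),
]


def check_front(elements):
    """Return True if the element names follow the front-matter model."""
    i = 0
    for name, required in FRONT_MODEL:
        if i < len(elements) and elements[i] == name:
            i += 1          # element present in its expected position
        elif required:
            return False    # a required element is missing or misplaced
    # Anything left over is unknown, repeated, or out of order.
    return i == len(elements)
```

A real SGML parser validating against the DTD would do this work; the sketch only shows how little room the front matter leaves for variation.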

While footnotes were added as an inclusion in ETD 1.0, the inclusion is at the root level, such that footnotes may be entered in any location. Moving this inclusion to the body element would prevent users from adding footnote text to the header, for example.

Many of the elements in ETD-ML have been borrowed from HTML. Examples of this include br for line breaks, tt for typed text, pre for preformatted text, a for anchors with the href attribute, and ordered and unordered lists. The similarities between ETD-ML and HTML make conversion between the two quite simple.

Because of the limitations of the beta ETD DTD, I selected an alternative DTD. I am most familiar with the Text Encoding Initiative (TEI) DTD; it has the flexibility to deal with the complexities of many documents, including dissertations. Markup of all four dissertations was done using the TEI Lite DTD. I discovered the new version of the ETD-ML DTD too late to change all of the dissertations to that DTD. Further electronic dissertation projects may well be able to use this DTD successfully. Virginia Tech lists only one ETD submitted in SGML; unfortunately the author has restricted it to on-campus use.

E-Thesis Investigations Outside UM

University of Waterloo Survey

The University of Waterloo, Canada, set up a team to "explore the governance issues and technical feasibility of submission, storage and distribution of ETDs at the University of Waterloo." This Electronic Theses Project Team (ETPT) began by investigating current projects for electronic submission of theses. To this end, ETPT sent out notices to email groups and mailing lists, requesting responses to its Web-based survey. The survey was completed by 29 organizations mostly in Canada and the United States. It addressed issues of governance, intellectual property, submission, access, storage, and social and philosophical issues. (http://library.uwaterloo.ca/~uw-etpt/)

Of these 29 institutions, only five currently accept electronic submission of theses and dissertations: Defense Technical Information Center, Simon Fraser University, Univ. of Manitoba, UMI Dissertations Publishing, and Virginia Tech. Of the remaining 24 groups, 17 are considering electronic submission projects. Adobe Acrobat, alone or alongside PostScript files, is the archival format used by four of the five institutions accepting electronic theses (Simon Fraser did not indicate a format). Only Virginia Tech accepts SGML files.

Discussion Lists on E-Theses

While much of the dialogue on electronic theses has taken place at the local and regional level, Internet discussion lists have more recently joined in. Virginia Tech hosts ETD mailing lists on general e-thesis topics, library discussions, technical issues, graduate school discussions, and project evaluation (http://www.ndltd.org/listserv/index.htm). The Digital Libraries Research mailing list (DIGLIB) and Electronic Text Centers (ETEXTCTR) list have also hosted conversations on the topic. More recently, the thread was brought over into TEI's general list. What follows are brief descriptions of the electronic thesis and dissertation-related threads on each list.

ETD General List

This is a very slow discussion list; the last message I received from it was Sept. 17, 1997. In this message Neil Kipp of the ETD effort at Virginia Tech provided data on VT's public collection of theses and dissertations:

N = 352 ETDs    mean = ~2.7 Mb    sd = 4987.93 kb
min = 87 kb     max = ~40 Mb      median = ~1 Mb

(N. Kipp, ETD-L Listserv, 17 Sept 1997)

DIGLIB List

In early 1997, the Digital Libraries Research mailing list briefly addressed the issue of electronic submission of dissertations. Postings were from organizations in Holland, Mexico, the United States, and Canada. The University of Waterloo posted a request for participants in its ETD survey, the results of which I've already addressed. (C. Jewell, DIGLIB mailing list, 12 Feb 1997) Other messages on the list detailed plans for implementing electronic dissertation submission at CINVESTAV-IPN, Mexico City, (F.M. De La Vega, DIGLIB mailing list, 23 Jan 1997) and current efforts at the US Naval Postgraduate School using PostScript files. (R. Norris, DIGLIB mailing list, 23 Jan 1997)

ETEXTCTR List

The Cornell-hosted ETEXTCTR discussion list addressed e-theses at length in August 1997. After questions of 'why don't we make theses available on the Web,' Perry Willett of Indiana University was first to ask the more complex question of what it means to have an e-dissertation. What guidelines, format, and requirements would there be? Would students be required to learn SGML or HTML, or would any electronic format do? What are the implications for archival storage -- something which many students and administrators don't consider? (P. Willett, ETEXTCTR, 4 Aug 1997) Hope Greenburg of the University of Vermont agreed that perhaps this was the time to reexamine the function and practice of theses and dissertations before building a model for them. "The work is designed to be printed on paper, bound, catalogued, read by a committee, and eventually by other interested scholars. To accommodate these readers and cataloguers the document has taken a particular form." (H. Greenburg, ETEXTCTR, 06 Aug 1997). Matthew Kirschenbaum from the University of Virginia joined in, describing his list of online projects as "theses and dissertations intended to be native to electronic media." (M. Kirschenbaum, ETEXTCTR, 12 Aug 1997)

ETEXTCTR on DTDs for E-Theses

A DTD has been under development at VT because they feel that "perhaps ETDs should actually be available in a variety of formats to serve a variety of purposes." (G. McMillan, ETEXTCTR, 4 Aug 1997) Willett's response noted the benefits of having everyone use the same DTD, such as the ETD, in terms of uniform documentation and long-term preservation and usability. He found Kirschenbaum's page of electronic projects to be somewhat alarming due to the varied formats and media in which the projects were published, "none of which sounded very permanent." (P. Willett, ETEXTCTR, 4 Aug 1997)

Some of my thoughts on the suitability of ETD's DTD were shared by Julia Flanders: "I looked at the ETD DTD and my impression was that at its strongest it reproduced the features of the TEI (basic structural elements, bibliographical information, some useful phrase-level elements) but that in places it takes a presentation rather than descriptive approach to text markup." Flanders went on to illustrate this point with ETD's definitions of the BR and Q elements, both of which provided a formatting-based description rather than a structural or functional one. (J. Flanders, ETEXTCTR, 6 Aug 1997) She later added that style sheets are the proper layer in which to handle presentational information. (J. Flanders, ETEXTCTR, 7 Aug 1997)

Kirschenbaum contributed a paper to the ACH-ALLC proceedings, quoted in an ETEXTCTR posting, in which he distinguished the electronic instantiation of a dissertation from its print counterpart. To him, the term ETD applies to

...any thesis or dissertation that is submitted, archived, and distributed solely or at least primarily in an electronic format. Such a dissertation might be written on any conceivable subject, and need avail itself of no method of presentation or organization that could not be duplicated on paper....a dissertation which is not only submitted, archived and accessible solely in an electronic format, but which is also self-conscious of its medium and which uses an electronic environment to support scholarship which could not be undertaken in print. (M. Kirschenbaum, ETEXTCTR, 12 Aug 1997)

I have already mentioned the alarm that Kirschenbaum's listing of e-theses caused due to the varied media and formats it encompassed. It is my impression that Kirschenbaum's idea of exclusively electronic projects is not what is currently proposed within the more mainstream ETD efforts. The need for assorted and potentially outmoded software to interpret these projects restricts rather than enhances their accessibility. I am aware of no responses in ETEXTCTR to Kirschenbaum's posting.

Both TEI and ISO 12083 were suggested on this list as alternatives to the DTD prepared at Virginia Tech. Jean-Claude Guédon first suggested use of the 12083 DTD and XML for Internet publishing. He felt that 12083, as it is designed for books and could be supplemented for flexibility, was a good base for further DTD development. (J. Guédon, ETEXTCTR, 1, 5, and 6 Aug 1997) Adherence to an existing DTD was not supported by John Lamp, University of Tasmania. Lamp wrote that "The book DTD in ISO12083 is an example of an all encompassing DTD designed by a committee for general purposes, but without encapsulating the essence of thesis structure." (J. Lamp, ETEXTCTR, 6 Aug 1997) Nick Finke, University of Cincinnati, agreed that ISO12083 is inappropriate for e-theses. He reasoned that this DTD's purpose is to "facilitate the publication of the data it contains in hardcopy codex form" rather than to assist academic research. (N. Finke, ETEXTCTR, 6 Aug 1997)

ETEXTCTR and TEI

Opinions were mixed as to the applicability of TEI to electronic dissertations. Alongside his critique of ISO12083, John Lamp described TEI as especially suitable for its purposes but "massive overkill" for theses and general research. (J. Lamp, ETEXTCTR, 6 Aug 1997) While TEI may not be the precise tool for all dissertations, Perry Willett defended its use for such projects, as the level of encoding is up to the individual. One successful element of TEI that could be used or copied for dissertations is the header, "which allows for the detailed recording of bibliographic information and metadata, crucial for the longevity of any electronic file." (P. Willett, ETEXTCTR, 6 Aug 1997) According to Gail McMillan, the ETD project had considered other DTDs and did not find a good match and therefore had to develop their own. They are "aware of the TEI and plan on the ETD DTD being compatible." (G. McMillan, ETEXTCTR, 5 Aug 1997) Her article "Electronic Theses and Dissertations: Merging Perspectives" compares elements from the TEI header, WATERS technical report description fields, and MARC record fields in order to derive the latter from the former sources. (http://scholar.lib.vt.edu/theses/GailsCCQarticle.html)

ETEXTCTR on SGML Tagging

Greg McGowan, Univ. of Cincinnati, noted that if students are expected to do the work of tagging their dissertation in SGML, the process of doing so would need to be simple and clear to avoid poor tagging and ensure consistency in markup within and between documents. Current dissertation writers must already comply with the presentational demands of their graduate school, with margins, spacing, etc. held to certain formats. Such requirements might disappear, McGowan asserted, as the e-thesis develops into a wholly electronic beast. (G. McGowan, ETEXTCTR discussion list, 6 Aug 1997) Julia Flanders described why she thinks the ETD has formatting definitions: it is "well-intended to help the thesis writer avoid dealing with complicated element definitions and presentational uncertainty," though at the same time it encouraged the view that markup is used for presentation rather than as an analytical tool. (J. Flanders, ETEXTCTR discussion list, 6 Aug 1997) In responding to the question of how to ensure tagging consistency among students, the University of Montreal's Jean-Claude Guédon stated that universities should do the tagging. Again, good, cheap conversion tools are necessary for this to work. (J. Guédon, ETEXTCTR discussion list, 6 Aug 1997)

TEI-L

As the posts from ETEXTCTR slowed, Julia Flanders brought the topic of e-theses to the TEI list. Flanders called for a discussion of using TEI for dissertations, addressing a group that is known to be familiar with TEI's function, grammar, and peculiarities. She felt that scholars who learn to use TEI for their research will be "smarter thinkers about text encoding and electronic texts generally, which is a Good Thing." Flanders pointed out that the main difficulties with using TEI for dissertations lie in its perceived size and unwieldiness and the steep learning curve for already busy students. (J. Flanders, TEI-L, 8 Aug 1997) Greg McGowan, a denizen of both the TEI and ETEXTCTR lists, pushed the discussion by asking why students should care about electronic texts and why the task of markup should be theirs to complete. (G. McGowan, TEI-L, 8 Aug 1997) He was answered variously. Flanders asserted that students will care about text encoding in the same way they care about providing measurements in standard units, as it is part of the "accepted mode of communication and makes life easier." (J. Flanders, TEI-L, 11 Aug 1997) Mavis Cournane, a PhD candidate at University College Cork, thought students might be convinced that SGML would prevent their work from becoming a victim of the technology they used to create it. (M. Cournane, TEI-L, 12 Aug 1997) Eastern New Mexico Univ. English instructor Jesse Swan suggested that "all intellectuals need to understand how their ideas are filtered through, perhaps even shaped by, textual representations and expectations." (J. Swan, TEI-L, 11 Aug 1997)

Swan and others on the list expressed the view that TEI or TEILite would in fact be well used in preparing electronic dissertations. Unlike ETEXTCTR's postings, messages to TEI-L spent little time on the meta issues of what is a thesis and why shouldn't TEI be used. Rather, the posts dealt with how to move along plans for helping students and universities use TEI for theses. Cournane provided some criticism of TEILite for theses, noting especially how cumbersome it is to encode a bibliography and to provide examples of SGML tagging within the text.

TEI Markup

The process used to mark up each dissertation was slightly different depending on its length and complexity. Two of the four began as Microsoft Word for Macintosh files; Price-Wilkin's dissertation was in Word for Windows NT/95 format and Wheeler's was in WordPerfect 7 for Windows 95. I began with Tepper's dissertation, as it was the shortest, had no images, and had a reasonable number of footnotes. For this document, I used the SGML tools bundled with WordPerfect 7 for Windows 95. It was a simple process of cutting and pasting from the Rich Text Format (RTF) that I had created from the MSWord for Macintosh version to the SGML document instance. The only difficulties lay in inserting lines of poetry, as WP's software does not have the split function that SoftQuad's Author/Editor does. The split function allows the user to surround a larger portion of text as a particular element then split that section into smaller versions of the same element. In WP, each line had to be separately tagged with an <L>, making for a tedious but effective process.

Ruddy's dissertation is both longer and has more extensive footnotes than does Tepper's. It also includes Middle English characters that need to be referred to by character entity reference, such as the thorn (þ). The manner in which I processed this dissertation was determined by the substantial number and size of the footnotes. I saved the document as RTF, uploaded the file to a UNIX machine, then used a short Perl program to automatically mark up the text. This processing relied on my ability to distinguish among the various RTF codes, each of which starts with a curly brace followed by the codes describing the text from that point to the ending curly brace (e.g., {\footnote Source cited above.}). Unlike SGML, the ending marker is generic, not indicating the element to which it refers. Because of this, some guesses had to be made on where the footnotes ended. Occasionally the Perl algorithm failed and notes had to be indicated by hand. Also, some special characters were missed in the processing or were transposed to another character in changing from one platform to another. Some of these transposed or missing characters were not fixed in the final version. Missing characters would have to be found in the original word-processed file and then placed in the SGML version. My final step in converting from RTF to SGML eliminated all RTF codes that had not been identified and converted in earlier steps, so the characters they were meant to represent were lost.
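The difficulty with the generic closing brace can be illustrated with a short sketch. It is written in Python rather than Perl and is not the program used for this project; it assumes that footnotes open with the RTF group {\footnote and that counting braces is enough to find the matching close. The function name convert_footnotes and the place="foot" attribute value are choices made for the example.

```python
def convert_footnotes(rtf):
    """Replace each {\\footnote ...} group with a <note> element."""
    marker = "{\\footnote"
    out = []
    pos = 0
    while True:
        start = rtf.find(marker, pos)
        if start == -1:
            out.append(rtf[pos:])
            break
        out.append(rtf[pos:start])
        # Count braces from the opening one to find the brace that
        # closes this particular group.
        depth = 0
        i = start
        while i < len(rtf):
            ch = rtf[i]
            if ch == "\\":
                i += 2      # skip a control character such as \{ or \}
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        if i >= len(rtf):
            # Unbalanced group: leave the rest for correction by hand.
            out.append(rtf[start:])
            break
        body = rtf[start + len(marker):i].strip()
        out.append('<note place="foot">' + body + "</note>")
        pos = i + 1
    return "".join(out)
```

Nested groups inside a note, such as italic runs, are carried through unchanged and would still need their own conversion step; an unbalanced group is left untouched so that it can be found and marked by hand.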

As I moved through the various dissertations, they became more complex, added more elements, and I learned more efficient ways of processing them. Wheeler's dissertation on agricultural economics had more than 50 complex equations. The document was created in WordPerfect 7 so the formulas were done in WP's equation editor. At this point, I could redo the equations using TeX, a mathematical representation language. I would need to learn TeX to do this and would also have to understand the formulas well enough that in redoing them I could make accurate reproductions. As I did not have the time or expertise in mathematics to do this, I chose instead to take advantage of another WP feature: automatic conversion to HTML. As part of this conversion, WP changed the equations to GIF images and the other formatting to HTML codes. These codes were regular and more easily identified than the RTF codes used in processing Ruddy's dissertation. Thus, with another Perl program, I was able to change the markup from the HTML DTD to the TEI Lite DTD. This again required some clean-up, though hand-processing was quite limited in comparison to previous efforts. ISO characters were successfully changed by WP from internal coding to character entity references. Final tweaking of the document was done using Author/Editor and PSGML through EMACS.

Price-Wilkin's dissertation presented the most complex challenges. It has several indices in the front matter and appendices in the back matter. Between these were an introduction, four chapters, and a conclusion, followed by a substantial number of figures, illustrations, and tables. Each figure and illustration was provided in three forms: a thumbnail, a 100 dpi image for on-screen viewing, and a 300 dpi image for printing. I chose to use the 100 dpi images in the SGML version to decrease the download time. TEI Lite has apparatus to enclose a thumbnail image in a reference to the larger image, but I chose not to do this to simplify the processing and re-conversion to HTML for normal browsers. The images provided with Price-Wilkin's dissertation were in JPEG format, which is not supported in older versions of the SGML browser Panorama. I used a UNIX shell script to convert these images to GIF format, which is understood by both the new and old versions of Panorama.

Again, I used WordPerfect as a first step to convert from RTF to HTML. As the HTML produced by WP is quite generic (e.g., what should be an <H1> in HTML is converted as <P><STRONG>, and everything seemed to be a <P>), it was difficult to identify such structures as bibliographies, lists, and quotes. Due to the length of this dissertation and time constraints, some of these structures remain as <P>s. Another useful enhancement of the markup would be links from the references to figures and illustrations within the text to the appropriate image in that section of the document.

TEI Header and ETD Front Matter

ETD-ML's bibliographic elements closely match those that appear on the title page of a typical dissertation. Thus ETD-ML uses a familiar and dissertation-specific terminology that makes the addition of bibliographic data a fill-in-the-blank operation. The metadata required in the ETD DTD is quite structured, with no flexibility in the ordering or use of all but three elements. The TEI header element is quite complex and flexible. It has four principal components: a file description for the full bibliographic description of an electronic file; an encoding description for noting the relationship between the electronic and source text; a profile description to describe non-bibliographic information about a text, such as language; and a revision description for the history of the file. Only the file description element is required in all TEI headers, while the others are optional. The file description is further broken down into a title statement, edition statement, type and extent of file, publication and distribution, series statement, notes statement, and source description. Several of the elements within the TEI header could be used instead of ETD-ML's meta information. The following table shows the elements that ETD-ML requires, in the order in which they appear. Alongside these are corresponding elements from the TEI header.

Mapping of ETD-ML Meta Information to TEI Header

What is encoded? | ETD-ML element | ETD-ML location | TEI element | TEI location
text's title | title | Front | title | fileDesc / titleStmt
text's author | author | Front | author | fileDesc / titleStmt
type of document | submission | Front | (none) | (none)
school to which the text was submitted | school | Front | (none) | (none)
degree for which the text was submitted, such as Master of Arts | degree | Front | (none) | (none)
the department to which the text was submitted | major | Front | (none) | (none)
names of people on the approval committee | approvals | Front | <respStmt> <resp>thesis approval</> <name>John Smith</></> | fileDesc / editionStmt
date of the text | date | Front | date | fileDesc / editionStmt / edition, or fileDesc / publicationStmt
city where the dissertation was defended | city | Front | pubPlace (city included with state) | fileDesc / publicationStmt
state where the dissertation was defended | state | Front | pubPlace (city included with state) | fileDesc / publicationStmt
query-oriented keywords or phrases to describe text | keywords | Front | keywords | profileDesc / textClass
copyright notice | copyright | Front | availability | fileDesc / publicationStmt
text's abstract | abstract | Front | <div type=abstract> | front
grant information required by some granting institutions | grant (optional) | Front | funder | fileDesc / titleStmt
author's dedication of the work | dedication (optional) | Front | note | fileDesc / notesStmt
author's acknowledgments | acknowledgments (optional) | Front | note | fileDesc / notesStmt

Type of document, school, degree, and major are pieces for which I have found no specific match within the TEI header. Our options include settling on a common usage for an existing tag or expansion of the TEI header with dissertation-specific elements. Extraction of bibliographic information from the TEI header is already being done, and use of this header should be preferred to reliance on ETD's front matter.
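The mapping above can also be held as a simple lookup table, which makes the unmatched elements easy to list. This is a minimal sketch in Python; the path notation with slashes, the dictionary name ETD_TO_TEI, and the function unmatched are all conventions invented for the example, and where the table gives two possible TEI locations only the first is recorded.

```python
# ETD-ML front-matter element -> location of the corresponding TEI element.
# None marks an element with no specific match in the TEI header.
ETD_TO_TEI = {
    "title": "fileDesc/titleStmt/title",
    "author": "fileDesc/titleStmt/author",
    "submission": None,
    "school": None,
    "degree": None,
    "major": None,
    "approvals": "fileDesc/editionStmt/respStmt",
    "date": "fileDesc/editionStmt/edition/date",
    "city": "fileDesc/publicationStmt/pubPlace",
    "state": "fileDesc/publicationStmt/pubPlace",
    "keywords": "profileDesc/textClass/keywords",
    "copyright": "fileDesc/publicationStmt/availability",
    "abstract": "front/div[type=abstract]",
    "grant": "fileDesc/titleStmt/funder",
    "dedication": "fileDesc/notesStmt/note",
    "acknowledgments": "fileDesc/notesStmt/note",
}


def unmatched(mapping):
    """Return the ETD-ML elements that have no TEI counterpart."""
    return [name for name, location in mapping.items() if location is None]
```

Calling unmatched(ETD_TO_TEI) returns the four elements named above: submission, school, degree, and major.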

Searching Dissertations

Dissertations submitted to the University of Michigan can currently be located through the MCAT database on the University's MIRLYN online catalog. Works from all U.S. universities can be searched through the Dissertation Abstracts International (DAI) CD-ROM from UMI and the print version of DAI. The Dialog commercial database also contains the DAI. Each of these sources provides varied access points to the document surrogates; the actual dissertation is not searchable. MCAT allows the user to search author, title, and abstract fields of UM dissertations. The print DAI includes keyword and author indexes for all dissertations done in the U.S. and selected ones from other countries. The CD-ROM allows searching on title, author, institution, year, UMI order number, keyword, and advisor's name for more recent dissertations (post-1980). Abstracts are not available on the CD-ROM for dissertations completed between 1861 and 1979.

The interface for this project builds on what is available on the CD-ROM. Therefore, users will be able to search on the following fields: title, author, institution, year, UMI order number, advisor's name, keyword in the title or abstract, and keyword in the text. Each of these fields should be combinable using Boolean operators, though this version does not implement that option. The results from a search will be presented in various ways, depending on the number of resulting hits and the fields searched.

Retrievals for searches on title, author, institution, year, UMI number, or advisor's name are all done in the same manner. If there is only one result, the user will see its title and abstract and the choice between viewing the dissertation in HTML or SGML. If HTML is selected, the user would see a list of the parts, such as chapters and their titles (if available) from which to select, and the choice of seeing the whole document in HTML. If SGML is selected, the whole document is sent to the user's SGML browser. Because of the size of some dissertations, delivery of only part of the SGML document to the user is desirable. The delivery of single chapters in DIVs is complicated by IDs and IDREFs that cross DIV boundaries, such as pointers that are linked to notes via ID/IDREF. The single result keyword query displays its result in the same manner as the other query types, except that a list of the hits within the dissertation replaces its abstract in the display. These hits are shown using keyword in context (KWIC), so that 20 characters of context surround each.
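The keyword-in-context display can be sketched in a few lines. The example below is in Python and is only an illustration, not the OpenText-based retrieval used for this project; it assumes the text has already had its markup removed, and the function name kwic and the square brackets around each hit are choices made for the example.

```python
import re


def kwic(text, keyword, width=20):
    """Return each hit on keyword with `width` characters of context."""
    hits = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        # Slice the context on either side, clipping at the text's start.
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append(left + "[" + m.group(0) + "]" + right)
    return hits
```

The default width of 20 matches the amount of context described above; a hit near the start or end of the text simply gets less context on that side.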

If a search results in more than one hit, the user will see a list of titles and abstracts in date order (most recent first) with the choice of HTML or SGML. As above, choosing SGML would download the whole dissertation to the user's desktop; selection of HTML would provide a table of contents, allowing the user the choice of seeing the whole document or only a portion of it. Table of Contents items will include the introduction, chapters, bibliography, and appendices; each is indicated by the 'type' attribute on the numbered DIV element. The ability to automatically generate this TOC list is dependent on consistent markup in the choice of attribute values.
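
As an illustration of why consistent attribute values matter, the table of contents could be generated along the following lines. This is a sketch only, written in Python: the 'type' values in TOC_TYPES, the sample markup, and the use of a regular expression instead of an SGML parser are all assumptions rather than the project's actual code.

```python
import re

# Hypothetical sample; headings are assumed to sit in <head> elements.
SAMPLE = """
<div1 type="introduction"><head>Introduction</head><p>...</p></div1>
<div1 type="chapter" n="1"><head>The Nave</head><p>...</p></div1>
<div1 type="chapter" n="2"><p>No heading here.</p></div1>
<div1 type="preface"><head>Acknowledgements</head></div1>
<div1 type="bibliography"><head>Works Cited</head></div1>
"""

# Assumed attribute values; a TOC can be built only if taggers use them consistently.
TOC_TYPES = {"introduction", "chapter", "bibliography", "appendix"}

DIV_RE = re.compile(r'<div1\b([^>]*)>(.*?)</div1>', re.S | re.I)
TYPE_RE = re.compile(r'\btype="([^"]+)"', re.I)
HEAD_RE = re.compile(r'<head>(.*?)</head>', re.S | re.I)


def build_toc(sgml):
    """Return (type, title) pairs for every DIV1 whose type belongs in the TOC."""
    toc = []
    for attrs, body in DIV_RE.findall(sgml):
        type_match = TYPE_RE.search(attrs)
        div_type = type_match.group(1).lower() if type_match else None
        if div_type not in TOC_TYPES:
            continue  # untyped or unexpected divisions are silently left out
        head = HEAD_RE.search(body)
        title = head.group(1).strip() if head else "(untitled)"
        toc.append((div_type, title))
    return toc
```

Note that a division tagged with an unexpected value (here 'preface') simply disappears from the list, which is exactly the failure that inconsistent markup would cause.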

The dissertations will be stored and searched as SGML. Results retrieved as SGML can be sent to the user's desktop without modification. When whole-document results are retrieved as HTML, an intervening Perl program will convert those SGML tags that have HTML equivalents into the corresponding HTML markup; all other SGML tags will be stripped from the results. If a user selects a part of the document to retrieve (e.g., Chapter 1 only) in a non-keyword search, the full DIV of the appropriate section will be converted to HTML and returned. This retrieval will utilize the region-generating functions of the OpenText indexing software.
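
The map-or-strip conversion described above can be sketched briefly. The project's converter is a Perl program; the version below is written in Python purely for illustration, and the entries in TAG_MAP are hypothetical pairings, not the mapping actually used.

```python
import re

# Hypothetical mapping from TEI element names to HTML element names.
TAG_MAP = {"p": "p", "head": "h2", "hi": "em", "list": "ul", "item": "li"}

# Matches start and end tags; assumes no '>' appears inside attribute values.
TAG_RE = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>')


def sgml_to_html(sgml):
    """Rewrite mapped tags as HTML and strip every other tag, keeping the text."""
    def convert(match):
        closing, name = match.group(1), match.group(2).lower()
        html_name = TAG_MAP.get(name)
        if html_name is None:
            return ""  # no HTML counterpart: drop the tag, keep its content
        return f"<{closing}{html_name}>"  # attributes are discarded in this sketch
    return TAG_RE.sub(convert, sgml)
```

For example, sgml_to_html('<div1 type="chapter"><head>One</head><p>See <name>Ruddy</name>.</p></div1>') yields '<h2>One</h2><p>See Ruddy.</p>': the DIV1 and NAME tags are stripped while their content survives.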

A keyword query with multiple results displays them in the same manner as the other query types, except that a list of the hits within each dissertation replaces the abstracts in the display. These hits are shown as keyword in context (KWIC), with each hit surrounded by 20 characters of context. On-the-fly conversion of the SGML to HTML means that the HTML does not have to be regenerated when changes are made to the SGML source and that only one copy of each dissertation needs to be stored on the system.
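
A KWIC display of this kind is straightforward to sketch. The Python function below is an illustration only: the name kwic and its parameters are my own, matching is a simple case-insensitive substring search over plain text with tags already removed, and it assumes that lowercasing does not change the length of the text.

```python
def kwic(text, keyword, width=20):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    if not keyword:
        return []
    hits = []
    lower_text, lower_key = text.lower(), keyword.lower()
    start = lower_text.find(lower_key)
    while start != -1:
        end = start + len(keyword)
        left = text[max(0, start - width):start]   # up to `width` characters before
        right = text[end:end + width]              # up to `width` characters after
        hits.append((left, text[start:end], right))
        start = lower_text.find(lower_key, end)    # continue after this hit
    return hits
```

Calling kwic(chapter_text, "abbey") with the default width gives each hit with up to 20 characters on either side; hits near the start or end of the text simply receive less context.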

For two of the dissertations (Price-Wilkin and Tepper), I chose to include both the footnotes and the text as divisions in an overall division with 'type' equal to 'chapter.' With this in mind, a keyword search that had hits within the text sub-division of the chapter would need to retrieve the larger chapter in order to include the notes with the text. The position of the notes does not complicate matters in Ruddy's dissertation, as the pointer to each note is immediately followed by the note itself. The Wheeler dissertation needs some revision in this regard, as the notes for all three chapters are set off in a DIV0 of their own at the end of the complete document. As there are only 20 footnotes in this dissertation, they could easily be moved to follow the appropriate pointers.

Users can also browse the collection by year or subject. The other available fields, such as author, can be searched to produce pertinent lists to browse. Browsing by year lets the user see what has been added to the collection recently. Subject browsing makes it easier to see the latest research in an area. In the ETD model, initial subject keywords are assigned by the author and placed in a <keyword> element in the document header. From these keywords, indexers within the library would assign controlled-vocabulary subject headings, such as Library of Congress Subject Headings. For ease of subject access and indexing, a simpler alternative is more appropriate for this project. Each dissertation will be listed under the department to which it was submitted, with inter-departmental dissertations listed in each relevant section. This provides a simple mechanism for determining where to put a dissertation, although it may not scale well to a larger collection. As the collection grows, indexing could be done with a combination of department and specialization within the department. In browsing the Year List or Subject List, the user could select either SGML or HTML. Retrieval of the whole or part of a dissertation proceeds as above.
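
The department-based listing can be sketched as a simple grouping step. The Python below is illustrative only: the record fields, department names, titles, and years are placeholders, and the function name browse_by_department is an assumption.

```python
from collections import defaultdict

# Placeholder records; an inter-departmental dissertation names two departments.
RECORDS = [
    {"title": "Dissertation A", "year": 1995, "departments": ["Department X"]},
    {"title": "Dissertation B", "year": 1996, "departments": ["Department X", "Department Y"]},
    {"title": "Dissertation C", "year": 1997, "departments": ["Department Y"]},
]


def browse_by_department(records):
    """Group titles by department, most recent first within each department."""
    listing = defaultdict(list)
    for rec in records:
        for dept in rec["departments"]:
            listing[dept].append(rec)   # listed once under each department
    return {
        dept: [r["title"] for r in sorted(recs, key=lambda r: r["year"], reverse=True)]
        for dept, recs in sorted(listing.items())
    }
```

An inter-departmental dissertation such as the placeholder "Dissertation B" appears under both of its departments, as described above.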

Why Use SGML?

The use of SGML markup on dissertations allows far more complex searching. For fully marked-up documents, searches can be made on bibliographic citations, or such citations could be extracted from each dissertation to create a citation database as a secondary product. Because the whole dissertation would be online, it could be searched and retrieved directly, rather than searching and retrieving only the limited document surrogate (title and abstract) and then waiting for delivery of the complete dissertation. Logical divisions within the text can be marked up; this structure can be utilized for retrieval of these smaller portions of a document to reduce download time. SGML is also platform-independent: a single document can be displayed on any number of computers without conversion. It lacks the proprietary coding that makes word-processed documents difficult to transfer between applications and platforms. As Web technology improves, the raw SGML will become even more useful, as it will translate to new systems equally well.

When the ETD project was first suggested in 1987, the World Wide Web and HTML did not exist. Teaching graduate students how to do SGML markup would have been quite a challenge at that point. Since then, the Web has introduced students to HTML. HTML is easy to learn and has few rules and only a few simple tags. Many word processors can already produce HTML without the user knowing anything about markup. One might be tempted to leave the dissertations in HTML format, as it is currently so much easier to produce than SGML. However, HTML markup is generic, with few ways to distinguish between the various parts of a document. It does not provide textual divisions that would allow pieces of a document to be sectioned off for retrieval. The automatic markup done by these word processors is simplistic, with all text blocks tagged as <P> and headings as <P><STRONG>. With this type of generic tagging, it is difficult to pinpoint variant structures for application of SGML markup. In addition, though a user may specify what each word-processing style maps to in HTML, the conversion does not always produce HTML markup that conforms to the user-set mappings.

Markup Variation

One problem with using SGML is the variety in markup that multiple users can produce, a variety that is only increased by the necessary flexibility of TEI. As the different ways I have presented notes within each dissertation show, a great deal of variation is available to the user: notes were placed immediately following the pointer, at the end of the chapter, and at the end of the document. Even with a document population of only four, there were significant variations in my markup due to the origins of the documents and the processes used to convert them to SGML. What is most striking is the fact that a single SGML tagger allowed this much variety in markup. One can only imagine the variations in markup among graduate students.

One way to avoid significant variation in applying markup is to have a central office for converting word-processed dissertations to SGML. Within the University, standardization is already required in the preparation and formatting of dissertations. Dissertation printouts submitted to Rackham are reviewed for compliance with these standards. With SGML, a stylesheet attached to the SGML document would impose these formatting rules. The SGML DTD would impose some restrictions on how markup could be applied to a dissertation, and this markup would be reviewed. As SGML conversion tools grow more sophisticated and simpler to use, it will be easier to rely on the markup output by these tools. Still, the who and how of this review process need to be clarified.

This solution does not address the causes of tagging variation, which can be ascribed to the purpose of TEI itself. TEI is flexible in order to encompass the variety found in the codex form and in all the other forms that it is able to reproduce in an electronic environment. There are simply many choices when one wants to tag a particular textual or structural characteristic. Perhaps a more restrictive tag set, a subset of even TEI Lite, would be appropriate for graduate student or graduate school use. Simplicity in markup makes texts easier to convert to other formats. If taggers could be counted on to encode a document characteristic in a certain manner, related items could be better co-located in text retrieval.

Other Important Decisions

Special characters

The Price-Wilkin and Ruddy dissertations include a number of non-ASCII characters. Price-Wilkin's cites many French texts; Ruddy's includes both modern and medieval Western European characters. The Western European characters should be available in ISO Latin-1 and therefore available via standard HTML character entity references. The Middle English characters, such as the yogh, will have to be dealt with in some other way (perhaps as GIFs) until Unicode is the norm. For this project, the Latin-1 characters were changed to character entity references in the conversion process; the remaining special characters will show up incorrectly in Ruddy's dissertation. As the dissertation project expands into other areas of the University, the issue of non-ASCII characters will become even more pressing.
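
The Latin-1 step of the conversion can be illustrated as follows. This is a sketch in Python, not the conversion actually used for the project; it relies on the standard library's html.entities.codepoint2name table, and the function name latin1_to_entities is an assumption.

```python
from html.entities import codepoint2name


def latin1_to_entities(text):
    """Replace non-ASCII Latin-1 characters with named HTML entity references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # plain ASCII passes through
        elif cp <= 255 and cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # e.g. e-acute becomes &eacute;
        else:
            out.append(ch)  # outside Latin-1 (e.g. yogh): left for separate handling
    return "".join(out)
```

Characters beyond Latin-1, such as the yogh (U+021D), pass through unchanged in this sketch, which mirrors the unresolved problem described above.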

Integrating Current Technology

Mathematicians and economists, among others, have developed mechanisms for presenting formulas and equations. The TeX formula description language is frequently used to typeset formulas. It is also the language referred to specifically in the ETD DTD as a way to process equations outside of the SGML document. As noted above, the equations used in Wheeler's dissertation were created with WordPerfect's equation editor and converted to GIFs. This is a temporary, insufficient solution to the problems of presenting mathematics online. The Graduate School will have to work with the departments to determine whether students will be required to submit their formulas as images, in TeX format, or in some other manner.

Another area where practice varies is in selecting image types. Price-Wilkin used JPEG images for the photographs and diagrams in her dissertation. A common version of Panorama (vers. 1.50), an SGML browser, is unable to display JPEG images without use of an external viewer. Consequently, the images were converted to GIFs for this project. Version 2.0 of Panorama is able to handle JPEG and many other image formats internally, so this should not be a long-term problem.

Conclusion

A library of dissertations in SGML is a feasible endeavor. TEI or TEI Lite can be used as the DTD for these documents, though their complexity and detail may be overwhelming for the new user. Some of the difficulties I encountered in using the ETD DTD version 0.9.5 (November 1996) have been settled with the release of ETD-ML 1.0. ETD-ML is structurally straightforward, and both its structure and its element names fall into a pattern familiar to most graduate students. With the popularity of HTML, the task of teaching students how to use SGML is simplified.

Conversion to SGML from a word-processed format would add a step to the dissertation submission process, and this additional hoop may be resisted by already-burdened students. Display of non-ISO Latin-1 characters and mathematical formulas still poses significant problems for implementation of this system. Even so, these problems are outweighed by the benefits of having the University's dissertations available online. The SGML structure provides multiple access points and comprehensive searching, while archiving the documents in a format that is platform-independent.

There are many other issues remaining to be resolved in creating an SGML-based digital library of theses and dissertations. There is always the question of who will do the actual markup of the texts. Also, it must be determined just how much markup a dissertation needs, based on an assessment of the uses to which the documents will be put and the extent to which context-sensitive searching is desired. TEI is certainly a powerful and flexible DTD, while the simpler ETD DTD may be sufficient for display and storage of dissertations.



