Text Encoding Initiative
Tenth Anniversary User Conference


Delivering Electronic Text Over the Web - The Current and Planned Practices of the Oxford Text Archive

Alan Morrison and Jacob Fix

1 Introduction

In the summer of 1996, the Oxford Text Archive was appointed as one of five Service Providers for the UK-based, national Arts and Humanities Data Service (AHDS). The AHDS has a broad remit, requiring it to work collaboratively on behalf of the academic community.

The Oxford Text Archive specializes in the area of electronic texts, and strongly advocates the use of TEI-conformant SGML. The majority of the Archive's collection is now stored as TEI Lite texts, and it is these materials that we seek to distribute as part of our contribution to the workings of the AHDS. However, we also need to operate within the framework of the AHDS, which will mean making our holdings accessible via the AHDS' integrated catalogue (which aims to integrate the collections held at all the Service Providers), and catering for the requirements of end users who may have little knowledge of either SGML or TEI.

The Archive is currently facing the dual challenge of how to make our texts accessible through the AHDS catalogue (which is likely to offer little or no support for TEI-conformant texts), and how to deliver our texts to end-users in a format they will find useful for their purposes. We believe that the issues faced by the Oxford Text Archive are in many ways more crucial than those confronting the other Service Providers, as many of those cater for dedicated, subject-oriented communities (e.g. archaeologists, historians, etc.), whereas electronic texts are frequently of interest to a broad range of humanities disciplines, and beyond.

2 Current practice - how texts are managed at the Oxford Text Archive

In the past, electronic texts have been deposited with the Oxford Text Archive on a largely ad hoc basis. However, the formation of the AHDS has meant that certain types of grant-holder (e.g. recipients of awards from the British Academy, Leverhulme Trust, etc.) are now actively encouraged, if not obliged, to consider offering any scholarly electronic texts produced as a direct or indirect result of funded research for deposit with the appropriate AHDS Service Provider (thus a report detailing an archaeological excavation will more properly be offered to the Archaeology Data Service than the Oxford Text Archive, even if it consists solely of electronic text). Moreover, the fact that the Archive has been operating for over twenty years has meant that even though many of our deposits are offered as TEI Lite-conformant texts, there is notable variation in the content (both markup and data) of the TEI Headers that have been used.

Over the past twelve months the Archive has been working to develop a consistent TEI Header structure, which we hope will be appropriate for all our current and future needs. To this end, the Archive has also organized a meeting of representatives from some of the major TEI-aware, scholarly electronic text creators and users (to take place in Oxford, in September 1997) to see if we can come to any sort of consensus regarding the application of TEI Headers. (A report on this meeting has been submitted for consideration for this conference.)

Rather than maintain a separate catalogue of the Archive's holdings, our intention is to use the information stored in the texts' TEI Headers to assist in the identification and retrieval of resources (an intention which also implies close control over the data content of each TEI Header through the use of conventional library authority files, controlled vocabulary lists etc.). Ultimately, we would like to use the Headers to store sufficient information to support a complete document management system, to assist with collections management, document revision, and so on.

At the time of writing, the only publicly accessible catalogue of our holdings is a collection of automatically generated HTML files, available through the Archive's home page, which are produced from the ageing database system we currently use to manage our collection. This online catalogue allows the identification of texts by language, author, and title, but in an extremely rudimentary fashion; it does not take full advantage of the facilities now available through most web browsers, nor does it allow end-users direct access to the wealth of metadata held in the texts' TEI Headers.

3 Making texts more accessible -- the AHDS integrated catalogue, and beyond

Each of the five AHDS Service Providers has adopted its own approach to cataloguing its collections. In part this is a reflection of the varied nature of the collections concerned (everything from digitized video and geospatial data, to digitized images and machine-readable population census data), but it also forms the basis of one of the objectives of the AHDS, namely to explore the practical issues involved in developing an integrated catalogue of diverse distributed resources. Whatever the local intentions of the Oxford Text Archive, it is important that any cataloguing-related activities that we undertake do not limit our full participation in the development of the AHDS' integrated catalogue.

Between April and June of this year, under the auspices of the AHDS and UKOLN (the UK Office for Library and Information Networking), each of the Service Providers organized a meeting of specialists and end-users with a particular interest in subject-specific metadata to assist initial resource discovery. In each case, discussions centred on the usefulness (or otherwise) of the Dublin Core element set to capture the basic metadata considered essential to enable an end-user to find and identify a particular electronic resource as being of possible relevance to his/her area of concern. The Oxford Text Archive was well-placed in these discussions, because the Dublin Core is reasonably well-suited to describing text-like resources (as opposed to, say, digitized sound recordings), and the mapping of information from a TEI Header to a Dublin Core record is a straightforward process (assuming that the required data has actually been recorded in the Header). Work is currently underway to review the findings of these metadata meetings, to identify the set of elements within the Dublin Core that can be supported across the domains of all the Service Providers, and which will thus form the basis of the minimal set of information that each Service Provider agrees to make available through the AHDS' distributed catalogue. (At the time of writing, it is planned that the catalogue will be based upon a network of Z39.50-compliant client/servers).
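The mapping from a TEI Header to a Dublin Core record described above can be illustrated with a minimal sketch. It is not the Archive's actual procedure: it assumes the header is available in a well-formed, XML-parsable form (with all end tags present), and both the sample header and the choice of Dublin Core elements and header paths in DC_PATHS are hypothetical. Elements that have not been recorded in the header are simply left out of the resulting record.

```python
import xml.etree.ElementTree as ET

# Hypothetical header, invented for illustration only.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample Title</title>
      <author>Doe, Jane</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Oxford Text Archive</publisher>
      <date>1997</date>
    </publicationStmt>
  </fileDesc>
  <profileDesc>
    <langUsage>
      <language id="eng">English</language>
    </langUsage>
  </profileDesc>
</teiHeader>
"""

# Assumed mapping: Dublin Core element name -> path inside the header.
DC_PATHS = {
    "Title": "fileDesc/titleStmt/title",
    "Creator": "fileDesc/titleStmt/author",
    "Publisher": "fileDesc/publicationStmt/publisher",
    "Date": "fileDesc/publicationStmt/date",
    "Language": "profileDesc/langUsage/language",
    "Subject": "profileDesc/textClass/keywords/term",
}


def tei_header_to_dc(header_xml):
    """Return a dict of Dublin Core element -> list of values found."""
    root = ET.fromstring(header_xml.strip())
    record = {}
    for dc_name, path in DC_PATHS.items():
        values = [
            el.text.strip()
            for el in root.findall(path)
            if el.text and el.text.strip()
        ]
        # Only keep elements for which the header actually holds data.
        if values:
            record[dc_name] = values
    return record


if __name__ == "__main__":
    for name, values in tei_header_to_dc(SAMPLE_HEADER).items():
        for value in values:
            print(f"{name}: {value}")
```

Because the sample header records no keywords, the resulting record has no Subject entry, which mirrors the caveat that the mapping is only as complete as the data recorded in the header.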

As well as offering our collection through the AHDS, the Archive is also working on the automatic generation of MARC records from TEI Headers. These records will be loaded into Oxford University's OPAC, so that library users will be made aware of the Archive's holdings alongside the conventional resources that are also available. Moreover, because Oxford's OPAC forms a crucial part of the CURL (Consortium of University Research Libraries) OPAC, information about the Archive's holdings will be readily available nation-wide (in addition to the online catalogue available via our web pages, mentioned above).
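As a rough illustration of deriving MARC-style fields from a TEI Header, the sketch below produces a simplified, human-readable listing rather than a loadable MARC record. It assumes a well-formed, XML-parsable header; the sample header, the mapping in MARC_MAP, and the indicator values are assumptions made for the example, not the Archive's actual conversion rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical header, invented for illustration only.
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample Title</title>
      <author>Doe, Jane</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Oxford Text Archive</publisher>
    </publicationStmt>
  </fileDesc>
</teiHeader>
"""

# Assumed mapping: (MARC tag, indicators, subfield code, header path).
# 100 = main entry (personal name), 245 = title statement,
# 260 = publication information. Indicator values are placeholders.
MARC_MAP = [
    ("100", "1 ", "a", "fileDesc/titleStmt/author"),
    ("245", "10", "a", "fileDesc/titleStmt/title"),
    ("260", "  ", "b", "fileDesc/publicationStmt/publisher"),
]


def tei_header_to_marc_lines(header_xml):
    """Return one display line per MARC field found in the header."""
    root = ET.fromstring(header_xml.strip())
    lines = []
    for tag, indicators, code, path in MARC_MAP:
        el = root.find(path)
        # Skip fields whose source data is missing from the header.
        if el is not None and el.text and el.text.strip():
            lines.append(f"{tag} {indicators} ${code} {el.text.strip()}")
    return lines


if __name__ == "__main__":
    for line in tei_header_to_marc_lines(SAMPLE_HEADER):
        print(line)
```

A production conversion would also need to write the records in the exchange format expected by the OPAC and to apply the library's own cataloguing rules, neither of which is attempted here.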

The two approaches outlined above effectively disregard much of the valuable metadata information that is contained within the TEI Headers of the Archive's collection. With this in mind, we have also been exploring the development of a PAT/web gateway, which will allow users to make full use of OpenText's powerful search and retrieval engine (PAT), via an easy-to-use web front-end. This has the potential to allow users to search across the collection for any information likely to be stored in a TEI Header (within the constraints of the documentation practices adopted by the Archive), and retrieve texts accordingly, which is clearly much more powerful than the search and retrieval facilities offered by a conventional library catalogue.

4 Delivering texts over the web - current and future practice

At present, the vast majority of the Archive's holdings are made available only via public or private ftp. A limited number of requests for materials to be supplied on disk, magnetic tape, or CD-ROM are met each year, but we intend to phase out this service in the very near future. The Archive's ftp service causes two distinct problems: first, the arrangements for distributing texts to which usage restrictions apply (i.e. those which are not freely available) are cumbersome and labour-intensive; second, many new internet users are unfamiliar with ftp, and are unable to configure their web browsers appropriately.

Once a user has identified a text which may be of interest, either via the AHDS' distributed catalogue, Oxford University's OPAC, or by directly searching our online resources, there are a variety of ways by which that resource may be delivered. Under existing practice, a user who retrieves a text from our public ftp site will be supplied with a copy of the raw ASCII version of the file, complete with TEI Lite-conformant SGML markup. In common with a number of other electronic text centres, we have also been experimenting with delivering accompanying catalog and style files, such that if an end-user's web browser is configured to launch a suitable SGML-aware application (such as Panorama or Multidoc Pro), a formatted, browsable version of the text can be made available.

The above scenario assumes that the end-user has installed an appropriate SGML-aware application, be it a browser or some other sort of application, and has configured his/her web browser appropriately. However, it is our belief that this is still outside the reach of many users of the Archive (particularly, we suspect, those who come through a less direct route, such as via the AHDS' catalogue), and so for the foreseeable future we are likely to offer a number of alternatives. Perhaps the most user-friendly of these is to support an on-the-fly conversion of TEI Lite to HTML, which will allow the end-user to browse a reasonably well-formatted version of the text, and as more web browsers offer support for such features as cascading stylesheets, we will have even greater control over the final appearance of the text. We have also experimented with conversion to other formats, such as RTF, on the grounds that this will at least provide end-users with something that they find more familiar/acceptable, even though this is at the high price of discarding all descriptive markup. An alternative approach is to focus less on the needs of users who want to read/browse texts, and to concentrate instead upon other things that they might wish to do with a text (which might in fact be the reason why they sought out an electronic text in the first place), such as some form of text analysis. To this end, we have been investigating offering a more sophisticated PAT/web gateway, which will allow users to perform elementary searches and analyses of texts, without having to download the entire text(s) to their own machine, install and configure analysis software, and so on.
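The on-the-fly conversion of TEI Lite to HTML mentioned above might, in its simplest form, look like the following sketch. It is illustrative only: it assumes the text has already been normalized into a well-formed, XML-parsable form, and the element mapping in TAG_MAP, the "tei-" class naming, and the sample fragment are all hypothetical. Each HTML element carries a class derived from the TEI element name, so that a cascading stylesheet could later control the final appearance.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumed mapping from a few TEI Lite elements to HTML elements.
TAG_MAP = {
    "body": "div",
    "div": "div",
    "head": "h2",
    "p": "p",
    "hi": "em",
}

# Hypothetical fragment, invented for illustration only.
SAMPLE_TEXT = (
    "<body><div><head>Chapter 1</head>"
    "<p>It was a <hi>dark</hi> &amp; stormy night.</p></div></body>"
)


def tei_to_html(element):
    """Recursively convert one parsed TEI element into an HTML string."""
    # Unmapped elements fall back to a generic span.
    html_tag = TAG_MAP.get(element.tag, "span")
    parts = [f'<{html_tag} class="tei-{element.tag}">']
    if element.text:
        parts.append(escape(element.text))
    for child in element:
        parts.append(tei_to_html(child))
        # Text following a child element belongs to the parent.
        if child.tail:
            parts.append(escape(child.tail))
    parts.append(f"</{html_tag}>")
    return "".join(parts)


if __name__ == "__main__":
    print(tei_to_html(ET.fromstring(SAMPLE_TEXT)))
```

Keeping the original element name in the class attribute means that some of the descriptive markup survives the conversion, in contrast to formats such as RTF where it is discarded.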

We are closely monitoring the development of XML, as we feel that this is directly relevant to the work of the Oxford Text Archive. For example, the widespread use of XML-aware web browsers would certainly encourage us to offer XML versions of our collection over the web (indeed, it is perhaps conceivable that if XML lives up to expectations, we may move to using SGML as purely an archival format, and make available on the web nothing but valid XML documents). Similarly, we are keen to explore the possibilities of using the Document Style Semantics and Specification Language (DSSSL) in the online delivery of our texts, although at the time of writing we have only carried out some initial work using a DSSSL formatting engine (James Clark's JADE), to convert files from TEI Lite into other formats.

However, in addition to all the possible web-based delivery mechanisms outlined above, the Oxford Text Archive will need to conform to AHDS practice with regard to rights management. Closely coupled with the AHDS' integrated catalogue will be a user registration and authentication mechanism, which will enable (and require) users to register to use the holdings of the various AHDS Service Providers. Although the exact nature and functionality of this service is, at present, slightly unclear, it is highly probable that this will have an impact upon the methods used to deliver electronic texts to the end-users of the Archive.

References

Sperberg-McQueen, C.M. and Burnard, Lou. 1994. Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago and Oxford: Text Encoding Initiative

Arts and Humanities Data Service (AHDS) URL http://ahds.ac.uk/

UK Office for Library and Information Networking (UKOLN) URL http://www.ukoln.ac.uk/

Dublin Core URL http://www.ukoln.ac.uk/metadata/resources/dc.html

ISO/IEC 10179:1996 Document Style Semantics and Specification Language (DSSSL)

