Text Encoding Initiative
P3 tells us that the Text Encoding Initiative (TEI) Writing System Declaration (WSD) "is an auxiliary document which provides information on the methods used to transcribe portions of text in a particular language and script."
The first of these goals, the documentation of transcription systems for the edification of anticipated human readers, is straightforward, and anyone conversant with WSD architecture can examine a document and its accompanying WSDs and work out the mapping employed to document different textual representations. The second goal, however, the use of WSDs in automated transducing, rendering, and other processing operations, is less accessible. Despite the indisputable documentation and processing value of a formal writing system description, the WSD appears to be one of the least-used features of the TEI guidelines. Thus, although a few WSDs are available as part of the electronic TEI distribution, these pertain primarily to straightforward natural writing systems, and they apparently exist exclusively for documentation purposes, for we know of no projects other than our own where WSDs of any sort are employed in automated document processing. That the WSD is not in wide use is clear from examining the WSD support files available on the TEI HTTP and FTP servers, where the teiwsd2.dtd file includes a declaration for a wdgis2.ent system identifier, which is, in fact, misnamed on the servers as teiwdgis2.dtd. That this error in the distribution could have gone unnoticed until the present suggests that these files have not been in significant demand by users.
One goal of the present paper is to illustrate how the authors have used WSDs to support not only the documentation, but also the transformation of TEI-conformant documents. Mavis Cournane will discuss the encoding of hellenized Hebrew and latinized Greek in a Latin context and David J. Birnbaum will discuss the encoding of the principal manuscript witnesses to the early East Slavic Rus' Primary Chronicle.
For my work with early Cyrillic writing, which is extremely variable
and which lacks adequate support in any international character, glyph, or
SGML entity set standards, I first developed an SDATA character entity
set. (For general background see especially
<!ENTITY aos SDATA "[aos ]" --Cyrillic small letter a, alternate (early)-->

In this example, I have chosen to use the character entity aos to represent a particular early Cyrillic letter, and I have defined its SDATA replacement text as the string "[aos ]". I then declare chsl.ent in the DTD subset of my main document with
<!ENTITY % chsl.ent PUBLIC "-//DJB//ENTITIES General Church Slavonic//EN"> %chsl.ent;

adding the appropriate PUBLIC entry to my CATALOG file. This combination of incantations enables me to use early Cyrillic entities in my document, satisfying my basic encoding need, but it is not sufficient to render the document in any meaningful way. For example, the SDATA entity declarations will instruct a parser to substitute for an aos entity the declared
"[aos ]" replacement string, and the resulting

[pos ][oos ][vos ][jatos ][sos ][tos ][fjeros ]

is hardly more legible than the original

&pos;&oos;&vos;&jatos;&sos;&tos;&fjeros;

Needless to say, I do not want this replacement string to appear in my final form documents.
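The contrast between the two substitutions can be sketched in modern Python (not part of the original OmniMark toolchain); the entity-to-code-point table below is an assumption for illustration, standing in for the character mappings that the WSD records:

```python
import re

# SDATA replacement text as declared in the entity set: legal, but illegible.
sdata = {"pos": "[pos ]", "oos": "[oos ]", "vos": "[vos ]",
         "jatos": "[jatos ]", "sos": "[sos ]", "tos": "[tos ]",
         "fjeros": "[fjeros ]"}

# Hypothetical entity-to-UCS mapping (the WSD records aos as U+0430, etc.);
# the code points below are illustrative assumptions.
ucs = {"pos": 0x043F, "oos": 0x043E, "vos": 0x0432,
       "jatos": 0x0463, "sos": 0x0441, "tos": 0x0442, "fjeros": 0x044C}

def expand(text, table, render=str):
    """Replace each &name; reference with the table's value for that name."""
    return re.sub(r"&([A-Za-z0-9]+);", lambda m: render(table[m.group(1)]), text)

word = "&pos;&oos;&vos;&jatos;&sos;&tos;&fjeros;"
print(expand(word, sdata))            # the bracketed SDATA strings
print(expand(word, ucs, render=chr))  # prints: повѣсть
```

The same driver function serves both purposes; only the table changes, which is exactly the substitution the display entity set (or a WSD-aware processor) performs.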
If one is operating without a WSD, the correct way to cope with this
situation is to create a new set of entity declarations, removing the
default replacement text (such as
[aos ] and substituting for
it something that will be rendered properly on a local system, such as a
system-specific numerical character reference. This is the "display
entity set" discussed in [Goldfarb, 504]. The wrong way to cope, which I mention
as a caution, is to feed the document to an SGML parser with the default
ISO (or ISO-like) entity set and then reparse the output to replace the default
replacement text with something locally appropriate. There is nothing sacrosanct
about the replacement text in the ISO registered entity sets, and the SGML
parser should be allowed to identify and replace character entities for local
use directly. The value of the ISO registered entity sets is not that they
standardize a replacement text for each entity, but that they standardize the
inventory and names of common constituents (primarily characters) of basic
writing systems. If, on the other hand, one uses a WSD to
generate final form output, the default character entity sets may be left
as they are, and the proper local replacement text may be specified in the
WSD. This strategy is the one adopted in the project described here.
A sample character entry in an early Cyrillic WSD looks like:
<character class=lexical>
<form string='' entityLoc='aos' ucs-4='0430' afiicode='10993'>
<desc>Cyrillic small letter a, alternate (early)</desc>
</form>
</character>

In this example, we declare that the entity aos corresponds to a particular standardized ISO 10646 (UCS-4, Unicode) character, that this character may be represented only by an entity (not by a string), and that it should be rendered with the standardized AFII (Association for Font Information Interchange) code 10993. Unlike the bare SDATA entity set, which specifies a single replacement string for each declared character entity, the WSD associates separate character and glyph information with each entity. This flexibility is valuable when dealing with writing systems that do not observe a strict one-to-one correspondence of characters (units of information) and glyphs (units of presentation), i.e., writing systems where the same underlying letter may be written in different ways, or where the same written mark may represent more than one underlying letter. This sort of many-to-many correspondence is precisely what one finds in early Cyrillic writing.
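A minimal Python sketch of how such form entries might be harvested into a combined character/glyph lookup table; the regular expression assumes the attribute order and quoting conventions of the sample entry above:

```python
import re

# Sample WSD character entry, as in the text above.
wsd_sample = """
<character class=lexical>
<form string='' entityLoc='aos' ucs-4='0430' afiicode='10993'>
<desc>Cyrillic small letter a, alternate (early)</desc>
</form>
</character>
"""

# Assumes attributes appear in this order with single quotes,
# matching the sample; a real processor would use an SGML parser.
FORM = re.compile(r"<form\s+string='[^']*'\s+entityLoc='([^']+)'"
                  r"\s+ucs-4='([^']+)'\s+afiicode='([^']+)'")

def harvest(wsd_text):
    """Build {entity: (UCS-4 code point, AFII glyph code)} from form entries."""
    return {ent: (int(ucs, 16), afii)
            for ent, ucs, afii in FORM.findall(wsd_text)}

table = harvest(wsd_sample)
print(table["aos"])  # character (U+0430) and glyph (AFII 10993) per entity
```

With such a table in hand, a downstream pass can choose the character-level value, the glyph-level value, or both, which is precisely the flexibility the plain SDATA entity set lacks.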
The early Cyrillic sample text I used for this project benefits from two simplifying assumptions. First, I assume that each manuscript observes a single, consistent WSD. In fact, early Cyrillic writing was so poorly standardized that different scribes within the same manuscript might use different inventories of early Cyrillic letters and observe different rules about which letters functioned as variants of which others. Second, more generally, I assume that in a critical edition of an early Cyrillic work, all manuscript witnesses share a WSD. In fact, early Cyrillic was a supranational writing system with many local varieties, and a writing system for early Bulgarian documents might differ in several places from one for early East Slavic documents. If one views document encoding as a way to make explicit one's analysis of written sources, and WSD encoding as a way to make explicit one's analysis of the writing system(s) underlying these sources, it would not be unusual for a single early Cyrillic document to require different WSDs for different scribes, however inconvenient this might prove.
I developed and published a WSD-type approach to encoding the
intricacies of early Cyrillic orthography in a long article last year in
My sample text is a fragment of the Rus' Primary Chronicle, the earliest east Slavic chronicle text, which purports to trace the history of the world from the creation through the establishment of the Rurik dynasty in Kiev and other cities. The sample chosen for the purpose of testing the WSD architecture is a single set of brief parallel readings from a critical edition currently under preparation under the general editorship of Donald Ostrowski, of the Harvard Ukrainian Research Institute, and encoded using the TEI parallel-segmentation architecture:
<p lang="chsl">
<app n="1,1">
<rdggrp type="manuscript">
<rdggrp type="northern">
<rdg wit='lav'>&Sos;&eos; &nos;&aos;&chos;&nos;&eos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;</rdg>
<rdg wit='tro'>&Sos;&eos; &nos;&aos;&chos;&nos;&eos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;</rdg>
<rdggrp type="rad-aka">
<rdg wit='rad'>&Sos;&eos; &nos;&aos;&chos;&nos;&eos;<hi rend="superscript">&mos;</hi> &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;.</rdg>
<rdg wit='aka'>&sos;&eos; &nos;&aos;<lb>&chos;&nos;&eos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idec3os;&juos;.</rdg>
</rdggrp>
</rdggrp>
<rdggrp type="southern">
<rdg wit='ipa'>&sos;&eos; &nos;&aos;&chos;&nos;&eos;&mos;&fjeros; &pos;&oos;<lb>|&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;.</rdg>
<rdg wit='xle'>&Sos;&eos; &nos;&aos;&chos;&nos;&eos;&mos;&bjeros; <lb>&pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idec3os;&juos;.</rdg>
</rdggrp>
</rdggrp>
<rdggrp type="edition">
<rdggrp type="published">
<rdg wit='byc'>&Sos;&eos; &nos;&aos;&chos;&nos;&eos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;</rdg>
<rdg wit='lix'>&Sos;&eos; &nos;&aos;&chos;&nos;&eos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;</rdg>
<rdg wit='sax'>&Sos;&eos; &nos;&aos;&chos;&fjeros;&nos;&jatos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;</rdg>
</rdggrp>
<rdg wit='ost'>&Sos;&eos; &nos;&aos;&chos;&fjeros;&nos;&eos;&mos;&bjeros; &pos;&oos;&vos;&jatos;&sos;&tos;&fjeros; &sos;&idecos;&juos;</rdg>
</rdggrp>
</app>
</p>
The mechanisms for encoding an early Cyrillic critical edition, and for developing an SDATA entity set and a WSD that document that edition, are not complicated, although they may prove somewhat cumbersome due to the absence of specialized tools. The preparation of the main SGML document, the SDATA entity set, and the WSD fulfills the first function of the WSD mentioned earlier: to document the transcription system in a way that will provide human readers with access to a structured description of this system. This type of encoding fully satisfies the mandate of the Text Encoding Initiative, in that it yields a text that has been encoded according to the TEI guidelines, but it does not provide a document that can be used in its raw SGML form by Slavists who are not also competent SGML engineers. In an attempt to render my SGML documents more accessible to Slavist colleagues, I undertook to process the WSD so as to provide different views of an orthographically complex SGML document.
Two general strategies presented themselves immediately:
Each of these strategies was applied to three types of problems:
The input files, OmniMark scripts, and output files used in this project are available on the World Wide Web at http://clover.slavic.pitt.edu/~djb/sgml/tei10/.
The second part of this presentation deals with the use of the WSD to document and map a correct character encoding for non-Latin characters in complex multilingual texts. The problem to be addressed is this: the TEI's encoding defaults are ASCII characters, but in the case of such texts as the 11th century Irish poem Adelphus Adelpha Mater, non-ASCII (including non-Latin) characters are also needed, and, furthermore, these characters occur outside their usual domains of application. This section of the presentation will look specifically at the encoding problems exemplified by the presence in the Adelphus text of a Latin base combined with individual words in transliterated hellenized Hebrew and latinized Greek. The use of Greek letters to render Hebrew text and Latin letters to render Greek text is clearly different from the monolingual and monoalphabetic Slavic Cyrillic material discussed above, but the fundamental encoding problem is comparable: researchers need access to both character-level and glyph-level information, where neither level is entirely dependent on the other.
In the case of Adelphus Adelpha Mater, it was decided to hard-code the original Hebrew and Greek words into the poem via the markup. [Note: I am particularly grateful to Professor Lewis M. Barth, Hebrew Union College, for his help in identifying the Hebrew characters and for suggested corrections to the Hebrew words. I am also grateful to Ms. Sinead O'Sullivan, St. Anne's College, Oxford, for identifying the Greek characters.] This encoding was achieved by modifying the TEI DTD to include a specification for the attribute 'reg' on the element <frn>, which is used to identify words in a foreign language.
<L N="19"><FRN LANG="he" reg="&vavhb;&reshhb;&vethb;&gimelhb;">Gibro</FRN> <FRN LANG="el" reg="&pgr;&rgr;&agr;&xgr;&ogr;&ngr; &agr;&ggr;&agr;&thgr;&ogr;&ngr;">praxon agathon</FRN> </L>
Here, character entities for Hebrew and Greek are contained in the 'reg' attributes attached to the <frn> element. The element <frn> uses the 'lang' attribute to identify the languages concerned, with the values of either "he", for Hebrew or "el" for Greek. These character entities are associated with a WSD in the TEIheader:
<profiledesc>
<langusage>
<language id="he" wsd="foo">Some of the words are in Hebrew.</language>
<language id="el" wsd="bar">Other words are in Greek.</language>
</langusage>
</profiledesc>
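The expansion of such reg attribute values can be sketched as follows; the entity-to-Unicode table here is a small assumed subset of the ISO Greek entity names, not a complete ISOgrk mapping:

```python
import re

# Assumed subset of ISO Greek entity names mapped to Unicode; a full
# table would be generated from the ISOgrk entity sets and the WSD.
GRK = {"agr": "α", "ggr": "γ", "thgr": "θ", "ogr": "ο", "ngr": "ν",
       "pgr": "π", "rgr": "ρ", "xgr": "ξ"}

def expand_reg(reg):
    """Expand &name; references in a reg attribute value to Unicode,
    leaving unknown entity references untouched."""
    return re.sub(r"&([A-Za-z]+);", lambda m: GRK.get(m.group(1), m.group(0)), reg)

print(expand_reg("&agr;&ggr;&agr;&thgr;&ogr;&ngr;"))  # prints: αγαθον
```

A rendering application could apply the same pass to every reg value whose element carries lang="el", selecting the table via the wsd attribute in the TEI header.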
As was noted above, one of the primary problems encountered in implementing
the WSD is that the documentation in the TEI Guidelines is more than a
little confusing. No practical examples of an implemented WSD are given;
nor are various practical objectives addressed, such as the need to enable mapping between characters in coded character sets (ISOheb and ISOgrk in this case) and glyphs in the fonts one wants to use, and to do so without sacrificing document portability. The fact that most applications exist in
a pre-Unicode universe makes this task all the more problematic. On the
most fundamental level, the general practical problem with the use of WSDs
is that there is no software currently available that can handle them
automatically. The proper display of Hebrew and Greek character sets in a
character-mode SGML editor such as Emacs (psgml-mode) is elusive because of
the editor's limited facilities for character replacement. Some graphical
SGML editors do permit the specification of a particular font for the
display of particular elements and attributes, and this should, in theory,
enable Hebrew and Greek entities such as
&xgr; to be rendered
with correct Hebrew and Greek fonts. Unfortunately, not all editors
provide this facility of entity replacement from attributes. [Note: SoftQuad's
graphical SGML editor, Author/Editor, does not support this type of
replacement.] SGML browsers such as Panorama Pro permit entity substitution
in the before and after replications of attribute values, which means that
in these browsing environments attribute values may invoke a specific font
that produces the desired substitution, but in these cases it is the
character entity set declared in the main SGML document, rather than the
more powerful Writing System Declaration, that governs that substitution.
And alongside these screen-rendering problems, attempts to print an SGML
file containing Hebrew and Greek characters are frustrated by the absence
of print-rendering software capable of processing the WSDs.
As was noted above, the WSD, unlike the SDATA entity sets specified in the
DTD for the principal SGML document, provides not only for the
documentation of the UCS (Unicode) value of a character, but also for other
mappings. In the case of the Hebrew character whose symbolic entity is &alephhb;, it provides for a formal UCS code 05D0 and an afiicode (in this case E140). The remainder of this paper describes the procedure undertaken to enable storage of the text in TEI form, with automated conversion (using OmniMark) to a form suitable for generating printed (TeX) output.
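That conversion step can be sketched as a simple entity-to-macro rewrite; the TeX macro name below is invented for illustration and is not a standard macro:

```python
import re

# Hypothetical mapping from symbolic entities to TeX control sequences,
# derivable from the WSD's per-entity information; \hebalef is an
# invented macro name, not part of any standard TeX distribution.
TEX = {"alephhb": r"\hebalef{}"}

def to_tex(text):
    """Rewrite known &name; references as TeX macros for print rendering,
    leaving unknown references untouched."""
    return re.sub(r"&([A-Za-z0-9]+);", lambda m: TEX.get(m.group(1), m.group(0)), text)

print(to_tex("&alephhb;"))  # prints: \hebalef{}
```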
OmniMark coped successfully, and even elegantly, with all of the tasks it was set, but the batch approach undertaken here is ultimately capable only of generating multiple static views, without real support for dynamic inquiry. For example, there are situations where a Slavist may wish to conflate variants of character <foo> during searching, while maintaining a distinction between glyphic variants of character <bar>, and the potential number of such hybrid views is for all practical purposes unlimited. The strategies discussed here provide the user with access to character-based, glyph-based, and mixed views of the input text, but they do not support access to ad hoc combinations of character-level and glyph-level information. The development of SGML tools capable of supporting dynamic WSD access will greatly enhance the utility of WSDs for scholars who work with orthographically complex writing systems.
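The kind of ad hoc view envisioned here might be sketched as a configurable search-key function; the entity names and variant classes below are hypothetical, not the project's actual inventory:

```python
# Hypothetical variant classes: each entity mapped to the base character
# of which it is a glyphic variant (invented names for illustration).
VARIANTS = {"oos": "o-base", "o2os": "o-base",
            "aos": "a-base", "a2os": "a-base"}

def search_key(entities, conflate):
    """Map each entity to its base character if that base is listed in
    `conflate`; otherwise preserve the glyph-level distinction."""
    return tuple(VARIANTS[e] if VARIANTS.get(e) in conflate else e
                 for e in entities)

# Conflate the variants of 'o' while keeping the two 'a' glyphs distinct:
print(search_key(["oos", "o2os", "aos", "a2os"], conflate={"o-base"}))
# prints: ('o-base', 'o-base', 'aos', 'a2os')
```

Because the conflate set is chosen at query time, every hybrid character/glyph view is available without regenerating the document, which is exactly what the static batch transformations cannot offer.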
Last Modified: Wednesday, 03-Nov-1999 14:08:29 EST