TEI Encoding and Syntactic Tagging of an Old French text


Dominique Estival & Nick Nicholas

Department of Linguistics & Applied Linguistics
The University of Melbourne
Parkville, Victoria 3052


October 1997.

1. Introduction

This paper reports on one of the concrete outcomes of a current research project undertaken at the University of Melbourne, on the Computational Modelling of Syntactic Change, whose main objective is to study diachronic syntax and ultimately model a flexible human processor capable of dealing with language change. This goal is still far away; in the short term we are collecting historical texts, and are tagging them syntactically. We have so far produced a TEI-conformant version of an Old French text, La Vie de Saint Louis, and we are in the process of adding syntactic tags to this text. Those syntactic tags are derived from the Penn-Helsinki coding scheme and have been translated into TEI.

Thus this paper addresses two issues: the development of a TEI encoding for the text, and the adaptation of the Penn-Helsinki syntactic coding scheme.

2. Brief description of the project

One of the fundamental goals of research in the dominant syntactic framework, the Principles & Parameters (P&P) paradigm (or Minimalist theory; Chomsky 1993), remains accounting for and explaining the "logical problem of language acquisition" (Hornstein & Lightfoot 1981) in terms of parameters of Universal Grammar which a child sets while learning a language. A number of attempts have been made at implementing principle-based parsers (e.g. Abney 1989, Stabler 1992), and researchers can thus investigate whether these implementations shed any light on how language is acquired by children during First Language Acquisition.

Our goal parallels this, but takes as a starting point another fundamental observation about language: all languages change. There has been increasing research in the past two decades in diachronic syntax and in the study of syntactic change (e.g. Li 1977, Lightfoot 1979). While several researchers have been studying the consequences of particular aspects of syntactic theories for a theory of syntactic change (e.g. Harris & Campbell 1995, Lightfoot 1991), no attempt has yet been made to map out the implications of those theoretical choices for the processing aspects of syntactic change.

More specifically: even if it is clear that, in a P&P approach, syntactic change is predicted to be a possible (but not necessary) consequence of a switch in the setting of a parameter, it is nevertheless not yet evident:

The goal of our research project is to elucidate these questions, and the main hypothesis underlying this research is that the human parsing mechanism (the HSPM of Crocker 1996) must be robust enough to allow processing of diverging grammars while still being constraining enough to reject unacceptable input.

As part of this project, we have obtained from Tony Kroch at the University of Pennsylvania access to the syntactically tagged Penn-Helsinki Parsed Corpus of Middle English (http://www.ling.upenn.edu/mideng/). This corpus uses the syntactic tagging system developed at the University of Pennsylvania, applied to the Middle English portion of the Helsinki Corpus of Historical English (Kytö & Rissanen 1988). We have also begun a collaboration with Barbara Vance at Indiana University, who works on Old French syntax, and are currently encoding and tagging a text obtained from her, having adapted the Penn-Helsinki coding scheme for that purpose.

3. SGML encoding of Joinville

3.1. Description of the text

The text we obtained from Vance is Jehan de Joinville's La Vie de Saint Louis, a biography of King Louis IX of France, written around 1305. This text is written in a mixture of standard (Île-de-France) and Champenois French, and is extremely important for historical and literary, as well as for linguistic reasons. [1]

Historically, Joinville's text is the major source of information on Louis IX, one of the preeminent rulers of medieval Western Europe. From a literary point of view, not only is this one of the first major prose texts in French (prose had started being used only a century earlier, in such works as Geoffroy de Villehardouin's La Conqueste de Constantinople), but it also marks the first instance in French of literature written from a personal vantage point. It is an at times chatty and anecdotal biography, written by a close personal friend of Louis IX.

Linguistically, Joinville's text is situated at the cross-over point between Old and Middle French; it is conventionally regarded in French linguistic histories (e.g. Pope 1952 [1934]) as the last of the Old French texts. As a result, it is transitional in many respects. For instance, the Old French declension system, which separates Old French from Middle French, is already severely curtailed in Joinville. Because this is a text written in a period of morphological and syntactic transition, we hope that Joinville's text will afford us insights into linguistic change and the flexibility of a human parser in the face of such change.

The Joinville text was supplied to us by Vance in Macintosh Microsoft Word 5 format. It had been scanned and proofread under her supervision. The version of the text used was Corbett (1977), the latest critical edition of the work. [2] In our tagging of the text, we ensured that it was tagged as Corbett's edition stands: the typographical conventions of Corbett's text, including his use of italics and paragraph and line breaks, are clearly noted as such, and distinguished from the information given about the original text, such as the page breaks in manuscript A of the Vie (see section 3.2).

The text of La Vie de Saint Louis survives in three manuscripts: A (Brussels manuscript), B (Reims manuscript), and L (Lucques manuscript). Manuscript A has long been regarded as the most authoritative source of the work, and Corbett concurs in this judgement. Furthermore, as Corbett wished to minimise the amount of guesswork involved in his critical edition, his edition stays fairly close to A; when he emends the text in A with reference to the BL text tradition, the emendations are given in square brackets. This distinction between A and BL readings has been retained in our formatted text, with use of the <add> and <corr> tags. [3] The reading in manuscript A which Corbett rejects in emendation is given through the <sic> attribute.
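By way of illustration, the following sketch (ours, in Python) shows how such an emendation can be encoded and how the rejected A reading can be recovered mechanically. The element and attribute names (<corr>, sic, resp, and the value CORB) are those described above, but the Old French words in the sample are invented.

```python
import re

# Hypothetical TEI fragment: a BL-based emendation encoded with <corr>,
# the rejected manuscript-A reading kept in the sic attribute, and the
# editorial responsibility attributed to Corbett (resp="CORB").
# The Old French words themselves are invented for illustration.
fragment = 'li roys <corr sic="estoit" resp="CORB">fu</corr> mout dolenz'

def rejected_readings(tei_text):
    """Return (A-reading, emended-reading) pairs from <corr> elements."""
    pattern = re.compile(r'<corr sic="([^"]*)"[^>]*>([^<]*)</corr>')
    return pattern.findall(tei_text)

print(rejected_readings(fragment))   # -> [('estoit', 'fu')]
```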

Corbett's critical apparatus appears as endnotes, rather than footnotes. The text of the endnotes was not included in the document we received from Vance, but space has been allocated for these endnotes in our document, through the <note> element. Since we also have access to Corbett's edition in printed form, and since the syntactic tagging is proceeding on the basis of the critical edition rather than the variant manuscript readings, we have not attempted to incorporate the critical apparatus into the text at this stage; however, this can readily be done at a later date.

3.2. TEI encoding

The TEI tagging of the text began in October 1996 with the conversion of the Microsoft Word text into HTML, exploiting the commonalities between HTML and SGML. Work proceeded in some fifteen distinct stages, occasionally requiring small computer programs, written in C, to automate formatting changes. The major stages included the following:

3.3. Differences with Helsinki encoding

One of our goals is to build a corpus of historical texts, encoded and tagged in similar ways to facilitate comparison between them, and available to historians and literary analysts as well as to linguists and philologists. We thus decided not to adopt the text mark-up system employed for the Helsinki corpus, because it does not adhere to a currently accepted standard, and because it is quite minimalist in the type of formatting it allows. The goal of the creators of the Helsinki corpus (which predates TEI and SGML, having been begun in the early 1980s and substantially completed by the end of that decade) was exclusively linguistic, so that much of the formatting information which other scholars might find of interest was simply not relevant to their aims.

We illustrate this point by presenting a comparison of the way the Joinville text would have been formatted in the Helsinki corpus scheme, the way it would appear in the Penn-Helsinki scheme and the way we have tagged it in TEI. [4]

Because of the flexibility and the standardisation of TEI Lite, we decided to employ TEI as our tagging system, thereby guaranteeing that the resulting text would be accessible to other researchers and easy to process. The Joinville text has now been fully tagged for typographical formatting in TEI Lite.

This work, including proofreading of the entire text for such matters as hyphenation and italics, continued from mid-October 1996 until the start of February 1997. The text is currently 750 K in size (from an original HTML file of 490 K). Its SGML was grammar-checked using the Macintosh implementation of the TCL YASP parser.

Work now underway involves tagging the text for syntactic information. Although this syntactic tagging is taking place within the framework of TEI (rather than PH) text-markup, we are using a version of the PH syntactic tags, which we have modified for Old French.

Since in the rest of the project we will employ the software developed at the University of Pennsylvania for use with the PH-tagged corpus, we are also in the process of writing a translator between our TEI-based and the PH syntactic tagging schemes.
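The core of such a translation can be sketched as follows. We assume, purely for illustration, that the constituent label is held in the type attribute of <seg>; under that assumption, each <seg type="X"> ... </seg> pair becomes a PH-style bracket [X ... ].

```python
import re

# Minimal sketch of a TEI-to-PH translator. It assumes the constituent
# label is held in the type attribute of <seg>, an illustrative
# assumption about the attribute layout, not a documented mapping.
def tei_to_ph(text):
    text = re.sub(r'<seg type="([^"]+)"[^>]*>', r'[\1 ', text)
    text = text.replace('</seg>', ' ]')
    return text

tei = '<seg type="v">halde</seg> <seg type="d">the outer</seg>'
print(tei_to_ph(tei))   # -> [v halde ] [d the outer ]
```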

4. Syntactic coding

As mentioned earlier, our research programme involves the computational analysis of texts as facilitated by the addition of syntactic tags. Syntactic tagging is still a young discipline, particularly in diachronic linguistics, and the major venture of this kind to date is the Penn-Helsinki Corpus. While this corpus has established a standard in tagged diachronic corpora in linguistics, there are several problems with the scheme as it stands. In particular, the PH tags were devised specifically for Old and Middle English. Although the syntactic ontology they delimit is not in itself anglo-centric, several of the distinctions nonetheless need to be modified before the coding scheme can be applied to Late Old French.

We first give a brief description of the PH tagging scheme and show how the tags have been translated into SGML in section 4.1., and then present the new tagset developed for Old French in section 4.2.

4.1. Translation of Penn-Helsinki (PH) tags into SGML

The PH tagset conveys six kinds of information:

The following provides a brief example of a text tagged in the PH tagset; it is an excerpt from the Middle English text The Ancrene Riwle:[5]

( [+ For ] [s uh <P I.44> an ] [at schal ] [v halde ] [d +tuttere ] [p
efter 1[B [c +tet ] [s heo ] [b best ] [at mei ] [p wi+d hire ] [v
seruin ] [d +teo inre ] . 
{For each one shall hold the outer
according that she best may serve with it the inner} )
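The bracket notation can be read off mechanically: each sequence [tag words ] opens a constituent labelled with a part-of-speech or phrase code. As a rough sketch (ours, not the Pennsylvania software), the word-level codes in such an excerpt can be extracted as follows; this simplified tokenizer ignores nesting, page-break codes such as <P I.44>, and punctuation.

```python
import re

# Simplified reader for word-level PH codes of the form "[tag word ... ]".
# This is our own sketch, not the University of Pennsylvania software;
# it ignores nesting, page-break codes, and punctuation.
def word_codes(ph_text):
    return re.findall(r'\[(\w+) ([^\[\]]+?) \]', ph_text)

sample = '[at schal ] [v halde ] [d +tuttere ]'
print(word_codes(sample))
# -> [('at', 'schal'), ('v', 'halde'), ('d', '+tuttere')]
```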

We decided to encode the entire Joinville document, including its syntactic tagging, in TEI, and then to translate the TEI text to the PH tagset, in order to manipulate the text using the software designed at the University of Pennsylvania. In this manner, the formatting information which is not included in the PH scheme is not lost.

The issue then arises of how to render syntactic tags in TEI. TEI has three tags which could be used to that end: <s>, <seg>, and <interp>. Of these, <s> is used to delimit orthographic sentences only, and does not allow for embedding of other <s> units within it; the only attribute it allows is the number of the current sentence. So <s> is obviously not appropriate as the general solution to the syntactic tagging problem. It is appropriate for sentence numbering, which is how PH requires the text to be organised; however, since the orthographic sentence is itself considered a syntactic unit in PH, it would be better to handle all constituents uniformly by means of the same tag. [6]

Both of the remaining tags (<seg> and <interp>) allow embedding and can take arbitrary type attribute values. Of the two, <seg> additionally requires that its constituents be properly nested (no branch-crossing), and that its type attribute have a single coded value. The <interp> tag does not have these restrictions: it allows a hierarchy of interpretation types to be set up in one part of the document, and referred to anaphorically from another. This is why <interp> is explicitly suggested in Burnard & Sperberg-McQueen (1995) as a means of encoding syntactic information. However, the actual encoding of constituents becomes unnecessarily unwieldy: since <interp> itself is an empty element, constituents would still have to be delimited using something akin to <seg>, with the <ana> attribute connecting the two. Furthermore, the strict nesting required by <seg> is not only consistent with standard syntactic constituency, but also useful in error-checking the tagging.
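That error-checking benefit can be made concrete: a simple stack suffices to verify that <seg> elements are properly nested, which is precisely the kind of consistency check an SGML parser performs on the tagged text. The following is a minimal sketch of ours, not part of any parser we used.

```python
import re

# Stack-based check that <seg> elements are properly nested: a sketch of
# the kind of constituency check that strict nesting makes possible.
def seg_properly_nested(text):
    depth = 0
    for tag in re.findall(r'<seg\b[^>]*>|</seg>', text):
        depth += 1 if tag.startswith('<seg') else -1
        if depth < 0:          # close without a matching open
            return False
    return depth == 0          # every open was closed

print(seg_properly_nested('<seg type="s"><seg type="np">li roys</seg></seg>'))  # -> True
print(seg_properly_nested('<seg type="s">li roys</seg></seg>'))                 # -> False
```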

We have thus elected to use the <seg> element to encode syntactic tagging. The information on the meaning of the various tags is not encoded anaphorically within the document, although this can easily be done if it proves necessary. The ontology of information encoded in the PH tagset (see next section) is simple enough that it can be rendered as attributes of the <seg> element, as follows:

We see that the ontology of syntactic tags required is just simple enough to be encompassed within the attributes of the <seg> element. Any additional complexity would necessitate the more complex hierarchies encodable by the <interp> element, although the <seg> elements would still be required around the constituents to index the <interp> definitions.

4.2. Adaptation of the PH coding scheme to Old French

Our guiding principle throughout the modification of the PH tagset has been not to modify the meanings of the existing tags, thereby ensuring that future comparisons between different texts which have been tagged remain meaningful. Thus, we have limited ourselves to making additions to the existing tagset where necessary. Some of these additions are motivated by differences between Middle English and Old French syntax. Other additions concern distinctions possible in Middle English but which were not considered important enough to record in the PH corpus, and which must be made in Old French (e.g. the use of different prepositions with infinitive complements, or the coindexing of discontinuous constituents).

We list below a few examples of the tags which had to be created for Old French.

5. Tools for tagging

As described above, the entire Joinville document, including its syntactic tagging, is encoded in TEI and then translated to the PH tagset, so that the text can be manipulated using the software designed at the University of Pennsylvania.

Both the typographical and the syntactic tagging of the text have been undertaken entirely on microcomputers. For formatting, TEI tags were entered either automatically (through translation of the original Microsoft Word document to HTML, and through small programs automating further tag translation) or manually. This task was completed relatively quickly.
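The automated portion of this tag translation amounted to mechanical substitutions of the following kind. The actual programs were written in C; the Python sketch below is ours, and the particular mappings shown (<i> to <hi rend="italic">, <b> to <hi rend="bold">) are illustrative assumptions, not a record of the actual translation tables.

```python
import re

# Illustrative HTML-to-TEI-Lite substitutions of the sort each conversion
# stage performed. The mappings below are examples, not the project's
# actual stage list.
SUBSTITUTIONS = [
    (r'<i>(.*?)</i>', r'<hi rend="italic">\1</hi>'),
    (r'<b>(.*?)</b>', r'<hi rend="bold">\1</hi>'),
]

def html_to_tei(text):
    for pattern, replacement in SUBSTITUTIONS:
        text = re.sub(pattern, replacement, text)
    return text

print(html_to_tei('la <i>Vie</i> de saint Looys'))
# -> la <hi rend="italic">Vie</hi> de saint Looys
```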

Syntactic tagging, on the other hand, is a long and tedious process which cannot be automated. It therefore proves desirable to make the task easier through a graphical interface. Such an interface usually involves highlighting text sections to be tagged, and specifying the appropriate tag either by pressing a button on screen or by keying in a short command. Owing to the limited availability to date of SGML editors on microcomputers, the syntactic coding in this project has been carried out with HTML editors we have customised for this task, namely PageSpinner on the Macintosh and HTMLedPro on the PC.

Similar functionalities had to be provided for both platforms. For example, in PageSpinner, a phrase is tagged as a direct object by highlighting it with the mouse and pressing control-D, or by selecting direct object from the Custom Tags menu. In HTMLedPro, the equivalent task is performed either by pressing alt-D, by selecting direct object from the Custom Tags menu, or by pressing the button on screen marked d.
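What such a keystroke does can be sketched as follows; the encoding shown (a type attribute on <seg>) is an assumption made for illustration.

```python
# Sketch of what a "tag selection as direct object" command does in a
# customised editor: wrap the highlighted span in an opening and closing
# tag. The seg/type encoding shown is an illustrative assumption.
def wrap_selection(text, start, end, label):
    return (text[:start]
            + f'<seg type="{label}">' + text[start:end] + '</seg>'
            + text[end:])

line = 'il vit le roy'
print(wrap_selection(line, 7, 13, 'd'))
# -> il vit <seg type="d">le roy</seg>
```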

A challenge in this project has been finding a microcomputer HTML editor allowing a sufficiently large number of custom tags, in order to deal with the Penn-Helsinki tagset. The two editors we chose are, to our knowledge, the only non-commercial instances for their respective platforms allowing such customisation to our satisfaction. Other editors either do not allow any customisation of the tagset, or if they do, severely restrict the number of custom tags (for example, to six or ten, whereas the number required for our task is closer to forty), or do not allow the custom tags to bear complex structure (for instance, attributes).

6. Conclusion

6.1. What can be learned?

Converting the Microsoft Word document we received into TEI was relatively straightforward conceptually; the prevalence of HTML formatting tools, and RTF-to-HTML converters in particular, was of great help. Similarly, preparing the software needed to automate significant portions of formatting, such as the numbering of sections, sentences, and pages, did not prove a major challenge. However, the process was structurally complex, consisting of many discrete stages.

In order to modularise the task, and to make sure that particular editorial choices could be undone, it was necessary to keep these stages separate, and to produce and save a new document version for each major formatting task. This approach proved crucial when, for example, the decision to remove all line-final hyphens automatically and then reinsert morphological hyphens by hand was rescinded, in favour of removing line-final hyphens manually. By that point formatting had already passed through several further stages; but because an earlier version of the document, with hyphenation in place, was intact, it was possible to write software to reinsert line-final hyphens automatically as appropriate.
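The reinsertion software can be sketched as follows, under the simplifying assumption (ours) that a line-final hyphen should be restored wherever the earlier, hyphenated version broke a word which the later version prints whole.

```python
import re

# Sketch of hyphen reinsertion by comparison of document versions: collect
# the word-break points recorded in the earlier, hyphenated version, and
# restore a hyphen at each of them in the later, dehyphenated text.
# Simplifying assumption: breaks are identified by the joined word alone.
def breaks_in(hyphenated):
    """Map each word broken as 'xxx-\\nyyy' to its break offset."""
    return {a + b: len(a) for a, b in re.findall(r'(\w+)-\n(\w+)', hyphenated)}

def reinsert(dehyphenated, breaks):
    def fix(match):
        word = match.group(0)
        cut = breaks.get(word)
        return word if cut is None else word[:cut] + '-\n' + word[cut:]
    return re.sub(r'\w+', fix, dehyphenated)

old = 'la roy-\nne de France'   # earlier version, hyphenated at a line break
new = 'la royne de France'      # later version, dehyphenated
# reinsert(new, breaks_in(old)) restores 'roy-\nne' in place of 'royne'
```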

An unusual aspect of our project is its reliance to date on microcomputers (although a workstation has more recently been obtained for use in the project). Clearly this kind of formatting task needs to be carried out using locally available computational resources and, particularly in humanities departments, workstation-level computers are not always readily available; even if they are, they are not necessarily accessible by the research assistants who are most likely to be doing the tagging. The wide availability and increasing computational power of microcomputers make them well-suited to the task, but since SGML has been primarily associated with UNIX-driven machines, several issues arise.

First, although SGML parsers are available for both Macintosh and PC microcomputers, the choice of software available is not as wide as for workstations. This paucity is even more acutely felt for SGML browsers.

Second, the computational power of microcomputers is still limited compared to workstations, particularly the microcomputers typically available in a humanities department. In parsing the formatting-tagged Joinville text, for example, the resources of the computer we were using were severely taxed, even after the capacity variables in the SGML prologue were increased to the maximum allowed by the parser (the maximum TOTALCAP value allowed was a million, although the value the parser actually reported for the text was closer to one and a half million). This was in fact one of the deciding factors in some of our formatting choices. For instance, explicit coindexing within SGML using <interp> and <ana> alongside <seg> in syntactic tagging, instead of the attributes of <seg> alone, would have caused a massive increase in the number of variables the parser would have had to deal with, with no perceptible gain towards the aims of our project. As it stands, parsing the 750 K SGML document on a Macintosh LC475 computer running TCL YASP takes about four minutes. A microcomputer is thus clearly inadequate for the type of syntactic work envisaged, and we intend to move up to a workstation for that phase of research.

The relative lack of SGML resources for microcomputers is even more acutely felt in syntactic tagging. The formatting-tagging part of the project is based on regular expression substitution, and can make use of small programs modifying the HTML source. By contrast, syntactic tagging cannot be automated, and necessitates a proper SGML editor. Budget constraints meant it was impractical to purchase commercial microcomputer SGML software; so we have customised HTML editors on both Macintosh and PC. The fact that only one editor on each platform could cope with our customisation demands indicates the problems researchers can run into when they cannot work within the commonly assumed world of workstation-based SGML.

6.2. Foreseen Results and Future tasks

In the long run, we hope that this project will lead to a better understanding of language variation and language change, and ultimately to the integration of language variability in the modelling of Natural Language Processing systems. This should then result in systems whose behaviour is closer to that of human beings in real language situations than the current rigid systems with no, or limited, ability to deal with variable input. Systems with built-in variability are more likely to be successful as components of intelligent computer applications which must interact with human users, or in software which must be trained to treat different types of texts.

A more immediate result is what we have already produced: a tagged electronic text which can be used by other researchers studying language change. Researchers in language change, particularly in diachronic syntax, need more electronic texts and, preferably, syntactically tagged texts.

We intend to extend our study further to texts from the Penn-Helsinki Middle English collection, and continue the collaboration with Vance. We also hope that other researchers who have been informally pooling and exchanging texts will, like Vance and Kroch, give us access to their texts with the prospect of getting in return a TEI-encoded and syntactically tagged version of those texts. This would lead to a more general extension of the PH tagset to other types of texts and to other languages.

We conclude by noting that, in addition to encoded and tagged texts, researchers in diachronic syntax need tools to be available so they can actually use those texts effectively.


We wish to thank the University of Melbourne for the two grants (a Special Interest Grant and a Small Australian Research Council Grant) awarded to the first author, which have allowed us to undertake this project; Tony Kroch and Barbara Vance for giving us access to their data; and Katia Margolis for her help with the syntactic coding of Old French.



[1] La Vie de Saint Louis is available in English translation as Shaw (1963).

[2] However, the best-known editions of the work remain those by Wailly in 1867, 1868, and 1874.

[3] The <resp> attribute was given as CORB, indexing the editorial decision as made by Corbett.

[4] Unlike the Penn-Helsinki (PH) tagging system, the Helsinki tagset does not include sentence numbering. On the other hand, the Helsinki tagset retains formatting information which is discarded in PH.

[5] The digraphs +t and +d denote thorn and eth.

[6] However, we can readily reinstate the <s>, as orthographic sentences are clearly distinguished from other constituents by attribute.

[7] Cf. the Latin accusativus cum infinitivo construction: dixit canem carnem edere "he said that the dog eats meat", with both canem "dog" and carnem "meat" in the accusative case.

10 Jan 1998