Text Encoding Initiative
Tenth Anniversary User Conference


Textual Variation and Version Control in the TEI

David Smith
Perseus Project
Tufts University
dasmith@perseus.tufts.edu

Version control and the treatment of textual variation arise from two separate traditions of practice. Version control, also known as revision control, originated in collaborative software development and document creation, whereas the treatment of textual variation comes from the much older world of manuscript and edition collation. Version control systems, which broadly speaking keep a record of authors' changes during a document's revision, are optimized for speed and automatic processing, whereas the treatment of textual variants must deal with the multiple complexities of manuscript traditions, editorial emendations, and ambiguous meanings. Unlike the user of a software version control system, the textual critic need not determine, though she may speculate about, the primacy of one version either in time or importance. [1.1] The Text Encoding Initiative Guidelines for encoding critical apparatus (Chapter 19) draw heavily on the text collation tradition and provide useful tools for basic text variation at the word and character level, but they fail to address the need for encoding variation in text structures other than, or larger than, the words and punctuation of a document. With software version control systems, the problem is often reversed: multiple variants within one line are represented as if they were one. The principles behind the design of software version control systems can nevertheless inform our work with tagging textual variants and lead to some solutions for tagging larger structural variation. These problems with version control and textual variation presented themselves in my work for the Perseus Project, and Perseus texts will illustrate the principal issues.

Determining the atomic unit of variation is a critical step in designing a variant system, and is to a large extent affected by the aims of the project. For the Perseus Project's variant work, whole words and single pieces of punctuation are atoms, so that words differing by one letter, or by the capitalization of one letter, are deemed to be as different as words that share no common letters. Allowing words to be split and encoding variations at the character level would perhaps capture some useful information, but characters are themselves atomic only in the world of electronic and printed texts; an encoding of writing strokes and their variations could prove useful for some purposes. In any event, a strict definition of the atomic units of variation allows for consistent interpretation of pre-electronic apparatus critici, which often combine several variants into one for compactness.
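The choice of atoms described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Perseus Project's actual tooling: the function names `atoms` and `differing_atoms` are hypothetical, the regular expression is one plausible reading of "whole words and single pieces of punctuation" (it would, for instance, split a word at an apostrophe), and the comparison assumes both versions contain the same number of atoms.

```python
import re

# Assumption: an atom is either a run of word characters or one
# non-space punctuation character.
ATOM = re.compile(r"\w+|[^\w\s]")

def atoms(text):
    """Split a passage into whole words and single punctuation marks."""
    return ATOM.findall(text)

def differing_atoms(a, b):
    """Pair up the atoms of two versions and return the pairs that differ.

    Hypothetical helper: it only handles versions with the same number of
    atoms, i.e. pure replacements, not insertions or deletions.
    """
    xs, ys = atoms(a), atoms(b)
    if len(xs) != len(ys):
        raise ValueError("versions must contain the same number of atoms")
    return [(x, y) for x, y in zip(xs, ys) if x != y]

octavo = "In thee (thou valiant man of Persea)"
modern = "In thee (thou valiant man of Persia)"
variants = differing_atoms(octavo, modern)   # [('Persea', 'Persia')]
```

Because comparison is by whole atom, "Persea" against "Persia" and "Persea" against "persea" are each reported as one complete replacement, with no record of how many letters the two words share.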

Having determined the atoms on which our variants will operate, we must define the operations. The TEI Guidelines provide structures for insertion, deletion, and replacement, the last of which may be seen as deletion followed by insertion. Because the textual critic, unlike the user of a software version control system, need not determine the primacy of one version either in time or importance, replacement becomes a useful basic operation in its own right. But the Guidelines do not address one frequently occurring issue in textual criticism: the transposition of text from one context to another. Like replacement, transposition may be seen as deletion and then insertion, but the context of deletion is now different from the context of insertion. Since existing version control systems have no facility for identifying the deleted text with the inserted text, the textual critic is left without the useful information that the two are the same. Before the advent of computer encoding, classical scholars recognized that the individual parts of the texts they edited should be uniquely identifiable even if the lines had moved from their original positions and created counterintuitive line numbering. Modern editions of Euripides' Iphigeneia at Aulis, for example, have moved a scene of the play from near the beginning to the very beginning. These editions, therefore, begin with line 49 and proceed to line 114 before returning to line 1, but references a century and a half old, from before this textual transposition, still point to the same text despite its relocation. Previous version control systems rejected a transposition primitive as too expensive in an environment of continuous revision, [1.2] but such a situation does not obtain for the normal textual critic.
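The role of stable identifiers in recognizing a transposition can be illustrated with a short Python sketch. It is an assumption-laden toy, not a description of any existing system: the traditional line numbers 1 to 114 stand in for unique identifiers, the function name `moved_blocks` is hypothetical, and Python's standard `difflib` module plays the part of a generic delete/insert differencing tool.

```python
import difflib

def moved_blocks(old_order, new_order):
    """Return blocks that a plain diff reports as deleted in one place and
    inserted, unchanged, in another (hypothetical helper)."""
    matcher = difflib.SequenceMatcher(a=old_order, b=new_order, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            deleted.append(old_order[i1:i2])
        if tag in ("insert", "replace"):
            inserted.append(new_order[j1:j2])
    # Only because each line carries a stable identifier can a deletion be
    # matched with an insertion and recognized as a move.
    return [block for block in deleted if block in inserted]

# Assumption: identifiers are the traditional line numbers of the opening.
traditional = list(range(1, 115))                     # lines 1..114 in order
modern = list(range(49, 115)) + list(range(1, 49))    # 49..114, then 1..48

moves = moved_blocks(traditional, modern)
```

Here the diff alone reports one deletion and one insertion; pairing them through the shared identifiers recovers the single transposition of lines 1 to 48.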

The fundamental insight of text markup, that a document's content ought to be separate from its structure and appearance, has proven to be quite powerful and flexible. The Standard Generalized Markup Language (SGML) places this further constraint on text structures: that each element in the structure be fully contained by its parents and fully contain its children. Such a markup scheme has problems with texts that exhibit multiple overlapping structures; nevertheless, SGML is an attractive model for text markup since single-hierarchy SGML texts are much easier to process, and generally acceptable workarounds exist for most cases of overlap. Encoding variant versions of a document, however, presents problems not only with overlapping data hierarchies, but also with overlapping structures. The SGML model's separation of content ("text") and structure ("tags") prohibits the use of tags to mark up other tags.

Consider two examples from the works of Christopher Marlowe that illustrate this problem of structural, as opposed to data, variation.[1.3] Proper names are normally italicized in Renaissance texts, but in 1 Tamburlaine 1.2.166, all four octavos read:

<L>In thee (thou valiant man of <name>Persea</name>)</L>

Modern editions have made this line consistent with other occurrences of "Persea":

<L>In thee (thou valiant man of <name rend="ital">Persea</name>)</L>

If we include a hypothetical edition with modernized spelling, we might encode the passage thus:

<L>In thee (thou valiant man of <app>
<lem><name rend="ital">Persea</name></lem>
<rdg wit="O1 O2 O3 O4"><name>Persea</name></rdg>
<rdg wit="Mod"><name>Persia</name></rdg>
</app>)</L>

It is unclear how to indicate the difference in typeface in the first two readings while showing that the text itself is identical.

The textual variants cross more than one structural boundary in the next example. In Dido 3.1.173, the majority of witnesses read:

<sp who="Dido"><speaker>Dido</speaker>
...
<L>I shall betray myselfe:—<name>Æneas</name> speake,</L>
<L>We two will goe a hunting...</L>

but McKerrow's edition reads:

<sp who="Dido"><speaker>Dido</speaker>
...
<L>I shall betray myselfe:—<name>Æneas</name>—</L></sp>
<sp who="Aene"><speaker>Æn.</speaker>
<L part="F">Speake!</L></sp>
<sp who="Dido"><speaker>Dido</speaker>
<L>We two will goe a hunting...</L>

One word has been broken off the end of Dido's line and assigned to Aeneas. Even more common are the phenomena of an entire speech's being assigned to another speaker, or scenes breaking at varying places in the text. It is obviously ungrammatical in SGML to combine these two variant versions thus:

<sp who="Dido"><speaker>Dido</speaker>
...
<L>I shall betray myselfe:—<name>Æneas</name><app>
  <lem> speake,</lem>
  <rdg wit="McK">
    —</sp>
    <sp who="Aene"><speaker>Æn.</speaker>
    <L part="F">Speake!</L></sp>
    <sp who="Dido"><speaker>Dido</speaker>
  </rdg>
</app>
<L>We two will goe a hunting...</L>

The SGML system simply has no way of knowing that the <app>, <lem>, and <rdg> tags are on a meta-level above and commenting on the <sp> and <L> tags.

SGML does not prohibit all meta-tagging, and one possible solution to the problem would be to use marked sections. The above example with the overlapping speaker assignments could be encoded with the parameter entity TR standing for the "textus receptus" and McK for McKerrow's edition.

<![ %TR; [
<sp who="Dido"><speaker>Dido</speaker>
]] >...<![ %TR; [
<L>I shall betray myselfe:—<name>Æneas</name>]] ><app>
  <lem><![ %TR; [speake,]] ></lem>
  <rdg wit="McK"><![ %McK; [
    —</sp>
    <sp who="Aene"><speaker>Æn.</speaker>
    <L part="F">Speake!</L></sp>
    <sp who="Dido"><speaker>Dido</speaker>
  ]] ></rdg>
</app><![ %TR; [
<L>We two will goe a hunting...</L>
]] >

The SGML application would then set the parameter for the desired readings to INCLUDE and for the undesired readings to IGNORE. There are several problems with this use of marked sections. The parameter entities lack the elegance of the witness lists in the <lem> and <rdg> tags since they are not additive. If, for example, one piece of text existed in both the first and second octavos, and the user wanted to see the first octavo text, the application could not simply define the O1 parameter as INCLUDE; the O2 parameter's IGNORE would block it. The encoder would be reduced either to defining parameters for every combination of agreeing witnesses, or to tagging each witness's readings separately. More seriously, using marked sections would require two passes over the text: once to select the appropriate set of readings and again to parse those readings. The application that dealt with an individual text would not necessarily have any information about the variant versions. In the example above, the application parsing the textus receptus would not know that there was a variant reading in McKerrow.
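The difference between non-additive parameter entities and additive witness lists can be made concrete with a small Python model. It is a sketch only: the function names are hypothetical, the marked-section rule is reduced to the single point made above (an IGNORE on any keyword blocks the section), and the sigla O1 to O4 are those of the four octavos.

```python
from itertools import combinations

def marked_section_included(keywords, params):
    """A marked section carrying several keywords is kept only if none of
    them is IGNORE, since IGNORE outranks INCLUDE."""
    return all(params[k] == "INCLUDE" for k in keywords)

def reading_selected(wit, wanted):
    """A wit="..." list is additive: naming the witness once is enough."""
    return wanted in wit.split()

def agreement_groups(witnesses):
    """Every non-empty combination of agreeing witnesses, i.e. the parameter
    entities an encoder would need to cover all possible agreements."""
    return [set(group)
            for size in range(1, len(witnesses) + 1)
            for group in combinations(witnesses, size)]

# The user asks for the first octavo only.
params = {"O1": "INCLUDE", "O2": "IGNORE"}
blocked = not marked_section_included(["O1", "O2"], params)   # True
selected = reading_selected("O1 O2", "O1")                    # True
needed = len(agreement_groups(["O1", "O2", "O3", "O4"]))      # 15
```

With only four witnesses the encoder would already need up to fifteen parameter entities to cover every possible agreement, whereas the witness list needs nothing beyond the four sigla.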

Instead of clumsily separating structure and data with SGML marked sections, we could use an existing version control system. We could efficiently encode differences from a base text and quickly retrieve the desired set of readings. The version control system would be able to produce a requested version on demand, and would have no problems dealing with the tags, since it would be at a level above the tags. But while version control systems are widely available for software projects, they are unsatisfactory for scholarly purposes for three reasons. First, and most basically, source code version systems take the line of text as the atomic unit of variation. The widely used and adapted Revision Control System (RCS) produces its diffs so that "if a single character in a line is changed, the edit scripts consider the entire line changed." The encoder of textual variants ought to be able to specify variation with arbitrary accuracy, not just the accuracy of one line of text. Second, since most version systems operate in an environment of continuous revision, they are expected to serialize the changes to a document, resolve conflicting modifications without user intervention, and do all of this as quickly as possible. Speed, in fact, is the principal reason for RCS's use of the line as the atomic unit of variation. Third, and most obviously to an SGML user, source code version systems encode their information in customized formats, so one cannot, for example, use generic SGML query tools to extract data about different versions of a document.
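The first objection, the coarseness of line-based differencing, can be illustrated with Python's standard `difflib` module. This is a stand-in for illustration and not RCS itself: the helper name `changes` is hypothetical, and the word-and-punctuation tokenizer is an assumed approximation of the atoms discussed earlier.

```python
import difflib
import re

def changes(a, b):
    """Return the non-equal edit operations between two sequences of units."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def atoms(text):
    """Assumed tokenizer: whole words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

octavo = "In thee (thou valiant man of Persea)"
modern = "In thee (thou valiant man of Persia)"

# With the line as the unit, the entire line counts as changed.
by_line = changes([octavo], [modern])

# With words and punctuation as units, only the one variant is reported.
by_word = changes(atoms(octavo), atoms(modern))
# [('replace', ['Persea'], ['Persia'])]
```

The same differencing algorithm gives either result; what changes is only the unit handed to it, which is why the choice of atom matters more than the choice of tool.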

The TEI Guidelines were not designed with any one software package or protocol in mind and, unlike commercial version software, have not been optimized for speed of retrieval. The encoder using the Guidelines may record variation down to the level of individual characters; variant readings may be specified with arbitrary fineness, may overlap, may be prioritized and sub-categorized at the discretion of the encoder, and may be queried using SGML tools. In most texts, the percentage of variants that have to handle the meta-tagging issue is small; the great preponderance of variants are of the Persea/Persia kind, not of the variant speech assignment kind. It seems best, therefore, not to introduce a separate version control meta-level above the SGML tagging level. Any version system must, nevertheless, be able to reproduce a complete single version as well as provide information about the differences between versions. To meet these requirements, I have extended the TEI with two new tags, <tagStart> and <tagEnd>, that belong to the fragmentary class in the textual critical subsystem of the TEI. Their declarations are:

<!ELEMENT tagStart  - O     EMPTY                          >
<!ATTLIST tagStart  gi       CDATA   #REQUIRED
                    atts     CDATA   #IMPLIED              >

<!ELEMENT tagEnd    - O      EMPTY                         >
<!ATTLIST tagEnd    gi       CDATA   #REQUIRED             >

These tags belong to the fragmentary element class and are thus allowed inside only <lem> and <rdg> elements. The <tagStart> element stands in for the actual start tag that is varying, and the <tagEnd> for the end tag. The tag being replaced should have its generic identifier in the gi attribute and its own attributes encoded in <tagStart>'s atts attribute. The example above from Dido would thus read:

<sp who="Dido"><speaker>Dido</speaker>
...
<L>I shall betray myselfe:—<name>Æneas</name><app>
  <lem> speake,</lem>
  <rdg wit="McK">
    —<tagEnd gi="L"><tagEnd gi="sp">
    <tagStart gi="sp" atts="who='Aene'">
    <tagStart gi="speaker">Æn.<tagEnd gi="speaker">
    <tagStart gi="L" atts="part='F'">Speake!<tagEnd gi="L"><tagEnd gi="sp">
    <tagStart gi="sp" atts="who='Dido'">
    <tagStart gi="speaker">Dido<tagEnd gi="speaker">
  </rdg>
</app>
<L>We two will goe a hunting...</L>

An individual version (say, McKerrow's readings) could be produced from this combined version by a pre-order traversal of the SGML parse tree that ignores data not in readings assigned to McKerrow, expands <tagStart> and <tagEnd> elements to the indicated generic identifiers, and echoes all other elements, CDATA, and SDATA.
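A rough Python sketch of such an extraction follows. It is not the implementation used for the Perseus texts, and it makes several loudly stated assumptions: a regular-expression tokenizer over the tag stream stands in for a real SGML parser and its parse tree, <app> elements are not nested, attribute values are double-quoted, and a witness with no matching <rdg> falls back to the <lem>. The names `extract_version`, `expand`, and `attributes` are hypothetical.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")          # a tag, or a run of text
ATTR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')   # name="value" pairs only

def attributes(tag):
    return dict(ATTR.findall(tag))

def expand(token):
    """Turn <tagStart>/<tagEnd> into the tags they stand for."""
    if token.startswith("<tagStart"):
        a = attributes(token)
        extra = a.get("atts", "")
        return "<" + a["gi"] + (" " + extra if extra else "") + ">"
    if token.startswith("<tagEnd"):
        return "</" + attributes(token)["gi"] + ">"
    return token

def extract_version(sgml, witness):
    out, readings, current = [], None, None
    for token in TOKEN.findall(sgml):
        if token == "<app>":
            readings = []                      # start collecting readings
        elif token == "</app>":
            chosen = next((t for w, t in readings if w and witness in w), None)
            if chosen is None:                 # assumption: fall back to <lem>
                chosen = next((t for w, t in readings if w is None), [])
            out.extend(chosen)
            readings = None
        elif readings is not None and token.startswith(("<lem", "<rdg")):
            is_lem = token.startswith("<lem")
            wits = None if is_lem else set(attributes(token).get("wit", "").split())
            current = []
            readings.append((wits, current))
        elif token in ("</lem>", "</rdg>"):
            current = None
        elif readings is not None:
            if current is not None:            # drop whitespace between readings
                current.append(expand(token))
        else:
            out.append(token)                  # echo everything outside <app>
    return "".join(out)

SAMPLE = (
    '<L>I shall betray myselfe:\u2014<name>Æneas</name><app>'
    '<lem> speake,</lem>'
    '<rdg wit="McK">\u2014<tagEnd gi="L"><tagEnd gi="sp">'
    '<tagStart gi="sp" atts="who=\'Aene\'">'
    '<tagStart gi="speaker">Æn.<tagEnd gi="speaker">'
    '<tagStart gi="L" atts="part=\'F\'">Speake!<tagEnd gi="L"><tagEnd gi="sp">'
    '<tagStart gi="sp" atts="who=\'Dido\'">'
    '<tagStart gi="speaker">Dido<tagEnd gi="speaker">'
    '</rdg></app>'
    '<L>We two will goe a hunting...</L>'
)

mckerrow = extract_version(SAMPLE, "McK")
received = extract_version(SAMPLE, "TR")
```

For the witness "McK" the sketch yields Dido's shortened line, Aeneas's one-word speech, and the reopened Dido speech as ordinary tags; for any other witness it yields the <lem> text with no trace of the apparatus.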

One of the hardest problems in encoding variant texts is transposition. A passage may be identical in two witnesses, for example, except that two lines of poetry are reversed in one witness. The TEI Guidelines provide a convenient structure for copying elements with the copyOf attribute. The variant readings can then be encoded as copies (but in different places) of the original data. In Marlowe's Edward II 5.6.55, Chappell's edition reverses some words.

<L><app>
<lem><seg id=v1>For my sake</seg> <seg id=v2>sweete sonne</seg></lem>
<rdg><seg copyOf=v2></seg>, <seg copyOf=v1></seg></rdg>
</app> pittie <name>Mortimer</name>.</L>

The tagging indicates that the text has not changed, but merely moved.
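How a processor might resolve these copies can be sketched in Python. The sketch assumes the unquoted attribute syntax of the example above and uses regular expressions rather than an SGML parser; the function name `resolve_copies` is hypothetical.

```python
import re

SEG_DEF = re.compile(r"<seg id=(\w+)>(.*?)</seg>")     # a segment with an id
SEG_COPY = re.compile(r"<seg copyOf=(\w+)></seg>")     # a pointer to a segment

def resolve_copies(lem, rdg):
    """Replace each copyOf pointer in the reading with the text of the
    segment it points to in the lemma."""
    segments = dict(SEG_DEF.findall(lem))
    return SEG_COPY.sub(lambda m: segments[m.group(1)], rdg)

lem = "<seg id=v1>For my sake</seg> <seg id=v2>sweete sonne</seg>"
rdg = "<seg copyOf=v2></seg>, <seg copyOf=v1></seg>"

transposed = resolve_copies(lem, rdg)   # 'sweete sonne, For my sake'
```

The reading stores no words of its own apart from the comma, so the identity of the moved text with the original is preserved by construction.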

The ease with which we can represent this sort of inter-variant communication makes SGML and the TEI Guidelines a good basis on which to build a textual variant system that meets the needs of the editors of variant literary texts more closely than available version control systems do. With some extensions, the TEI can be made to encode more sophisticated variant structures and to satisfy the requirements, though not the efficiency, of a full-fledged version control system.

Notes:

1.1. The textual critic does not need to know, when encoding the text, whether one reading is earlier than another, or more reliable. At a minimum, she knows that the readings are different.

1.2. David Durand's Palimpsest system has a copying primitive that can be used to represent data movement generally, but it is not a "leading brand" version control system. Even so, it is designed for text editing, rather than encoding historical variations.

1.3. The Perseus Project has prepared an electronic edition of Marlowe with variants; to view these texts online, see http://www.perseus.tufts.edu/Texts/Marlowe.html.

References

David G. Durand, "Palimpsest, a Data Model for Revision Control". Proceedings of the CSCW '94 Workshop on Collaborative Hypermedia Systems, Chapel Hill, North Carolina, USA. GMD Studien Nr. 239. Gesellschaft für Mathematik und Datenverarbeitung MBH 1994.

Available via anonymous FTP at ftp.darmstadt.gmd.de within the file /pub/wibas/CSCW94/workshop-proc.ps.Z

C. M. Sperberg-McQueen and Lou Burnard, "Guidelines for Electronic Text Encoding and Interchange". Text Encoding Initiative, Chicago and Oxford, April 8, 1994.

Walter F. Tichy, "RCS: A System for Version Control". Department of Computer Sciences, Purdue University, West Lafayette, Indiana 47907. 1995/06/01.

