Text Encoding Initiative
Tenth Anniversary User Conference


TEI and the Encoding of the Physical Structure of Books

Syd Bauman
Women Writers Project
Brown University
syd_bauman@brown.edu

Terry Catapano
Library Studies
Rutgers University
thc@eden.rutgers.edu

The Text Encoding Initiative Guidelines "do not address the encoding of physical description of textual witnesses: the materials of the carrier, the medium of the inscribing implement, . . . the organisation of the carrier materials themselves (as quiring, collation, etc.), authorial instructions or scribal markup, etc." (P3, section 18.4). It might therefore be assumed that one cannot use TEI P3 for such purposes. The issue, however, is not that such features cannot be encoded using TEI, only that specific guidelines are currently unavailable. We discuss why one might wish to encode this information, demonstrate two TEI-conformant methods for the encoding of the physical structure of a codex, and discuss possible advantages and disadvantages of each.

The arrangement of the text by bound page sequence allows a user to reconstruct the experience of reading the text as it appears in a particular document. However, if one is interested in reconstructing the process by which the book was printed, the page order encoding offers little help. Re-arranging the pages of text as they were imposed for printing may make apparent places where the text was affected by typographical exigencies. It is also useful in electronic bibliographic analysis.

The relationships between and among gatherings, sheets, formes, leaves, and pages can be somewhat confusing if not considered carefully, as each page has a number of relationships to encode. In a folio in fours, the book is made up of sheets folded once, then sewn into two-sheet gatherings. Each sheet has two adjacent pages of printed text on each side. See figures 1 and 2.

Figure 1: First Gathering of a Folio-in-Fours, Top View
Figure 2: Page Arrangement on the First Sheet of a Folio-in-Fours

A gathering of a folio in fours has the following order and relationships:

	GATHERING A
	  LEAF A1       [on sheet A1, conjugate to LEAF A4]
	    PAGE A1r    [adjacent to PAGE A4v on outer forme of sheet A1]
	    PAGE A1v    [adjacent to PAGE A4r on inner forme of sheet A1]
	  LEAF A2       [on sheet A2, conjugate to LEAF A3]
	    PAGE A2r    [adjacent to PAGE A3v on outer forme of sheet A2]
	    PAGE A2v    [adjacent to PAGE A3r on inner forme of sheet A2]
	  LEAF A3       [on sheet A2, conjugate to LEAF A2]
	    PAGE A3r    [adjacent to PAGE A2v on inner forme of sheet A2]
	    PAGE A3v    [adjacent to PAGE A2r on outer forme of sheet A2]
	  LEAF A4       [on sheet A1, conjugate to LEAF A1]
	    PAGE A4r    [adjacent to PAGE A1v on inner forme of sheet A1]
	    PAGE A4v    [adjacent to PAGE A1r on outer forme of sheet A1]

The same pages ordered by sheet/forme would have the following order:

	GATHERING A
	  SHEET A1
	    OUTER FORME
	      PAGE A1r
	      PAGE A4v
	    INNER FORME
	      PAGE A1v
	      PAGE A4r
	  SHEET A2
	    OUTER FORME
	      PAGE A2r
	      PAGE A3v
	    INNER FORME
	      PAGE A2v
	      PAGE A3r
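
The two orderings above follow from a simple rule, which can be sketched in Python. This is a minimal illustration, not part of any TEI tool: the function names `bound_order` and `forme_order` are hypothetical, and the sketch assumes a regular folio gathering with no cancels, in which leaf k is conjugate to leaf (2n + 1 - k) of an n-sheet gathering.

```python
def bound_order(signature, sheets=2):
    """List pages in bound order as (page, adjacent page, forme, sheet).

    Assumes a regular folio gathering of `sheets` sheets (2 for a
    folio in fours), with sheet 1 outermost.
    """
    leaves = 2 * sheets
    rows = []
    for leaf in range(1, leaves + 1):
        conjugate = leaves + 1 - leaf      # leaf sharing the same sheet
        sheet = min(leaf, conjugate)       # sheet 1 is the outermost
        first_half = leaf <= sheets
        for side in "rv":
            other = "v" if side == "r" else "r"
            # In the first half of the gathering the recto lies on the
            # outer forme; in the second half the verso does.
            forme = "outer" if (side == "r") == first_half else "inner"
            rows.append((f"{signature}{leaf}{side}",
                         f"{signature}{conjugate}{other}",
                         forme,
                         f"{signature}{sheet}"))
    return rows


def forme_order(signature, sheets=2):
    """Group the same pages by sheet and forme."""
    leaves = 2 * sheets
    order = {}
    for sheet in range(1, sheets + 1):
        conjugate = leaves + 1 - sheet
        order[f"{signature}{sheet}"] = {
            "outer": [f"{signature}{sheet}r", f"{signature}{conjugate}v"],
            "inner": [f"{signature}{sheet}v", f"{signature}{conjugate}r"],
        }
    return order


for row in bound_order("A"):
    print(row)
print(forme_order("A"))
```

Called with the default of two sheets, the sketch reproduces both listings for gathering A; passing `sheets=3` would give the corresponding layout for a folio in sixes under the same assumptions.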

For most audiences, the logical division of a work into acts, chapters, poems, etc., is the most important cognitive structure (although division into pages is usually the most important navigational tool). Thus, TEI texts usually allocate the basic <div> structure to these logical divisions using <div> (or <div0>-<div7>) elements. Given the importance of the physical structure to certain audiences (analytic bibliographers jump to mind), it makes sense to use it as the source for the <div> hierarchy of a TEI documentary transcription intended for use by these audiences. For the purposes of our examples, we will assume that the physical structure is the only structure being encoded by the <div> structure of the TEI file.

Because there is no one correct arrangement for pages as printed, but there is one correct order for pages as bound, it makes sense to retain the bound order of pages in a TEI encoding. The <div> hierarchy is thus used to nest pages as parts of leaves. The forme and sheet <div>s, however, do not directly nest their constituent parts. They rather rely on next and prev attributes to indicate their components.

Reading a page of SGML in which every tag is a div tag with a type attribute whose value is structurally important can be quite tiring (for humans). In this example, we have used <gathering>, <sheet>, <leaf>, <formeOuter>, <formeInner>, and <page> as "syntactic sugar" for <div>s of the corresponding types in order to make it more readable. All of the elements used (except for the <seg> elements, which are merely being used as placeholders to indicate where page contents go) are really stand-ins for <div>. The TEI file for one gathering of a folio in fours would have the following basic structure:

	<gathering id="G6A">
	  <sheet part="Y" id="S6A1" next="S6A4">
	    <leaf id="L6A1">
	      <formeOuter part="I" id="F6A1R" next="F6A4V">
	        <page id="P6A1R"><seg>page 1 data</seg></page>
	      </formeOuter>
	      <formeInner part="I" id="F6A1V" next="F6A4R">
	        <page id="P6A1V"><seg>page 2 data</seg></page>
	      </formeInner>
	    </leaf>
	  </sheet>
	  <sheet part="Y" id="S6A2" next="S6A3">
	    <leaf id="L6A2">
	      <formeOuter part="I" id="F6A2R" next="F6A3V">
	        <page id="P6A2R"><seg>page 3 data</seg></page>
	      </formeOuter>
	      <formeInner part="I" id="F6A2V" next="F6A3R">
	        <page id="P6A2V"><seg>page 4 data</seg></page>
	      </formeInner>
	    </leaf>
	  </sheet>
	  <sheet part="Y" id="S6A3" prev="S6A2">
	    <leaf id="L6A3">
	      <formeInner part="F" id="F6A3R" prev="F6A2V">
	        <page id="P6A3R"><seg>page 5 data</seg></page>
	      </formeInner>
	      <formeOuter part="F" id="F6A3V" prev="F6A2R">
	        <page id="P6A3V"><seg>page 6 data</seg></page>
	      </formeOuter>
	    </leaf>
	  </sheet>
	  <sheet part="Y" id="S6A4" prev="S6A1">
	    <leaf id="L6A4">
	      <formeInner part="F" id="F6A4R" prev="F6A1V">
	        <page id="P6A4R"><seg>page 7 data</seg></page>
	      </formeInner>
	      <formeOuter part="F" id="F6A4V" prev="F6A1R">
	        <page id="P6A4V"><seg>page 8 data</seg></page>
	      </formeOuter>
	    </leaf>
	  </sheet>
	</gathering>

In order to extract leaves or pages, a processor merely has to select the correct <div> (or syntactic variant). However, in order to extract a sheet or forme, a processor must combine the appropriate partial elements into a single aggregate element by chaining them using the id/idref mechanism made available via the next and prev attributes.
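
The chaining step can be illustrated with a short Python sketch. It is an assumption-laden illustration rather than a description of existing software: the fragment is a reduced, well-formed XML rendering of the structure above (the sheet wrappers and part attributes are omitted), the helper name `aggregate` is hypothetical, and only the next attribute is followed, starting from the initial part of a forme.

```python
import xml.etree.ElementTree as ET

# Reduced fragment: the two leaves of sheet A1, written as well-formed XML.
FRAGMENT = """
<gathering id="G6A">
  <leaf id="L6A1">
    <formeOuter id="F6A1R" next="F6A4V"><page id="P6A1R">page 1 data</page></formeOuter>
    <formeInner id="F6A1V" next="F6A4R"><page id="P6A1V">page 2 data</page></formeInner>
  </leaf>
  <leaf id="L6A4">
    <formeInner id="F6A4R" prev="F6A1V"><page id="P6A4R">page 7 data</page></formeInner>
    <formeOuter id="F6A4V" prev="F6A1R"><page id="P6A4V">page 8 data</page></formeOuter>
  </leaf>
</gathering>
"""


def aggregate(root, start_id):
    """Follow `next` pointers from start_id, collecting the page ids."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    pages, seen, current = [], set(), start_id
    while current is not None:
        if current in seen:                 # guard against circular chains
            raise ValueError(f"circular chain at {current}")
        seen.add(current)
        part = by_id[current]
        pages.extend(p.get("id") for p in part.iter("page"))
        current = part.get("next")          # None ends the chain
    return pages


root = ET.fromstring(FRAGMENT)
print(aggregate(root, "F6A1R"))  # outer forme: ['P6A1R', 'P6A4V']
print(aggregate(root, "F6A1V"))  # inner forme: ['P6A1V', 'P6A4R']
```

The same loop would serve for sheets, since they are chained by the same attributes; a fuller processor would presumably also check that each prev pointer agrees with the corresponding next.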

Although, in some sense, encoding the physical structure of the pages of a book as the <div> structure appears to be the most appropriate encoding, it is without doubt cumbersome for humans. Humans have trouble following all that deep nesting, and performing the "hand-pointing" needed to aggregate the various <div>s. Another possibility is to encode only the structure of the pages themselves using <div>, and then create virtual aggregations of the various other <div>s needed using <join>. For example:

	<div type="page" id="P6A1R"><seg>page 1 data</seg></div>
	<div type="page" id="P6A1V"><seg>page 2 data</seg></div>
	<div type="page" id="P6A2R"><seg>page 3 data</seg></div>
	<div type="page" id="P6A2V"><seg>page 4 data</seg></div>
	<div type="page" id="P6A3R"><seg>page 5 data</seg></div>
	<div type="page" id="P6A3V"><seg>page 6 data</seg></div>
	<div type="page" id="P6A4R"><seg>page 7 data</seg></div>
	<div type="page" id="P6A4V"><seg>page 8 data</seg></div>
	<!-- ... -->
	<joingrp targtype="div" targorder="y" type="leaf" result="div"
	 desc="each JOIN joins two pages into a leaf">
	  <join id="L6A1" targets="P6A1R P6A1V">
	  <join id="L6A2" targets="P6A2R P6A2V">
	  <join id="L6A3" targets="P6A3R P6A3V">
	  <join id="L6A4" targets="P6A4R P6A4V">
	</joingrp>
	<joingrp targtype="div" targorder="y" type="forme" result="div"
	 desc="each JOIN joins two pages into a forme">
	  <join id="O6A1" type="outer" targets="P6A4V P6A1R">
	  <join id="I6A1" type="inner" targets="P6A1V P6A4R">
	  <join id="O6A2" type="outer" targets="P6A3V P6A2R">
	  <join id="I6A2" type="inner" targets="P6A2V P6A3R">
	</joingrp>

If desired, formes or leaves can be <join>ed into sheets, and sheets can be <join>ed into gatherings.
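
As a rough illustration of how a processor might build these virtual aggregations, the following Python sketch resolves each <join> into the ordered contents of its targets. It rests on several assumptions: the markup is rewritten as well-formed XML (empty <join/> elements, no <seg> wrappers), only two leaves' worth of pages is included, and the function name `resolve_joins` is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<body>
  <div type="page" id="P6A1R">page 1 data</div>
  <div type="page" id="P6A1V">page 2 data</div>
  <div type="page" id="P6A4R">page 7 data</div>
  <div type="page" id="P6A4V">page 8 data</div>
  <joingrp type="leaf" result="div">
    <join id="L6A1" targets="P6A1R P6A1V"/>
    <join id="L6A4" targets="P6A4R P6A4V"/>
  </joingrp>
  <joingrp type="forme" result="div">
    <join id="O6A1" type="outer" targets="P6A4V P6A1R"/>
    <join id="I6A1" type="inner" targets="P6A1V P6A4R"/>
  </joingrp>
</body>
"""


def resolve_joins(root, group_type):
    """Map each join id in groups of group_type to its targets' text."""
    by_id = {el.get("id"): el for el in root.iter("div")}
    resolved = {}
    for group in root.iter("joingrp"):
        if group.get("type") != group_type:
            continue
        for join in group.iter("join"):
            # Target order is kept, as targorder="y" would require.
            targets = join.get("targets").split()
            resolved[join.get("id")] = [by_id[t].text for t in targets]
    return resolved


root = ET.fromstring(DOC)
print(resolve_joins(root, "forme"))
print(resolve_joins(root, "leaf"))
```

Because the joins live apart from the page <div>s, the same page can take part in a leaf and a forme without any change to the transcription itself.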

Two confusing points in the Guidelines are worth pointing out. First, the Guidelines state that a <join> element needs to be in "a position where the element indicated by its result attribute would be contextually legal" (P3, page 443). This is potentially problematic, as a <div> is not valid inside a <joinGrp>. However, since "a <joinGrp> may appear only where the elements represented by its contents are legal" (P3, page 445), we may conclude that the individual <join> elements are relieved of their "valid position per result" restriction by virtue of being in a <joinGrp> that is itself so restricted.

Second, it is not clear whether the targType of a <join> whose evaluate attribute has the value one or all should be the GI of the element to which the <join> points or that of the element which is the final result of the pointing process. However, it is clear that if targOrder is N, then the GI of the elements pointed to by targets may be either of the two specified by targType (or presumably any one of the multiple GIs specified on targType). See the top of page 401.

The authors are unable to express a strong preference for one encoding methodology over the other. The <div> method at first glance seems more intuitive and, in the simplest cases, easier to follow. However, as soon as multiple chaining processes are required, the encoding becomes difficult to follow and maintain. The "syntactic sugar" variant may allow the human reader to feel less overwhelmed on initial examination of the text, but more importantly would allow a capture DTD that could make creating the complicated structures a bit easier.

The <join> method, on the other hand, although not much easier to create (and perhaps harder to create than a "syntactic sugar" version), is dramatically easier to proofread - it is easier to examine all of the formes at one time, then proceed to the leaves, etc.

We are reluctant to admit it, but in the end we would be strongly inclined to use whichever method had stronger software support. As far as we know, no matter which of the TEI methods is used for aggregating partial elements, there currently exists no software that will process them in the order indicated by their attributes (rather than sequentially).

Our discussion deals with the relatively simple case of a folio-in-fours. In smaller book formats, the structure and relationships among the various bibliographical elements become more complicated. In quartos and octavos (4 and 8 pages to the forme), there are more than two pages imposed per forme, some imposed upside down; furthermore, some folds are cut to enable the turning of pages. Nor is the structure of a book always so consistent. One common complication is the presence of "cancels". However, we believe the scheme discussed above is extensible enough to serve as the basis for the encoding of smaller formats or cancels.
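
The arithmetic behind these formats can be sketched briefly in Python, using only the fold counts given in the glossary. The names `FOLDS` and `sheet_counts` are hypothetical, and the sketch assumes that each fold doubles the number of leaves and that each side of a sheet carries one page per leaf.

```python
# Folds per sheet, as given in the glossary definitions of each format.
FOLDS = {"folio": 1, "quarto": 2, "octavo": 3}


def sheet_counts(fmt):
    """Return leaves and pages per sheet, and pages per forme."""
    leaves = 2 ** FOLDS[fmt]         # each fold doubles the leaf count
    return {
        "leaves": leaves,
        "pages": 2 * leaves,         # a recto and a verso per leaf
        "pages_per_forme": leaves,   # one forme prints one side of the sheet
    }


for fmt in FOLDS:
    print(fmt, sheet_counts(fmt))
```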

Glossary

Analytic Bibliography
The study of books as physical objects. A major concern of analytic bibliography is the examination of physical evidence to determine the extent to which the process of book production affects the form of a text as it appears in print. The results of analytic bibliographic study can often enable editors and readers to determine the authority and source of variant or anomalous readings.
Cancel
A leaf substituted for an original leaf, and printed on a different sheet than the others in its gathering.
Compositor
The print house worker assigned the task of setting type into type-pages from printed or manuscript copy.
Conjugate Leaves
Two leaves which are joined along the fold of a printed sheet.
Folio
The largest of the printed book formats. A folio book is made of printed sheets which have been folded once perpendicularly to the long edge.
Format
The size or structure of a book based on the number of times its constituent printed sheets have been folded.
Forme
The type-pages as imposed to print one side of a sheet. The outer forme refers to the convex side of a printed sheet. The inner forme is on the concave side. The forme is an important unit of bibliographical analysis because any single piece of type can only occur once in a forme. In a folio, each forme consists of two type-pages imposed adjacently.
Gathering
The group of leaves formed in the proper order for binding from a printed sheet; whatever is held into a book by a single act of sewing. In order to avoid thickening of the spine, folios often had gatherings made up of two or more sheets. A folio with two-sheet gatherings is said to be a "folio-in-fours", i.e., a folio made up with four-leaf gatherings. A folio made up of three-sheet gatherings is termed a "folio-in-sixes". The terms quire and signature are often used for gathering.
Imposition
The arrangement of type-pages on a forme. The type-pages are arranged so that when the printed sheet is folded correctly, the pages appear in the proper reading sequence.
Leaf
The piece of paper produced by the folding of a printed sheet. Each leaf comprises two pages, one on the recto (obverse), and one on the verso (reverse) side.
Octavo
The book format made up of sheets folded three times to produce eight leaves per sheet.
Quarto
The book format made up of sheets folded twice to produce four leaves per sheet.
Recto
The obverse side of a leaf; the right hand page of an open book. Designated in bibliographical notation by a superscript lower-case r. The complement of verso.
Sheet
The piece of paper which receives the impression from the inked type-pages on a printing press. The sheet is the basic printing component of a book. Once printed, the sheets that will make up a book are folded, gathered, sewn, and bound. In a folio, each sheet has two leaves and four pages. Each sheet is also made up of two formes, one on each side, each with two adjacent pages.
Signature
The mark placed by the printer, usually at the bottom of the first page of a gathering, to indicate the proper sequence in which to bind the printed sheets. Signatures commonly run from A-Z, omitting J and U, with letters repeated if the alphabet runs out, e.g., AA-ZZ, AAA-ZZZ, AAAA-ZZZZ, etc. Signatures with repeated letters are often referred to in bibliographical notation by the letter used preceded by the number of times it occurs, e.g., AAA is noted as 3A. Gatherings are named by the signature assigned to them, and leaves by their place within a gathering. For example, leaf 3A4 refers to the fourth leaf of gathering AAA.
Type-Page
The block of type set by the compositor which will produce a page of printed text.
Verso
The reverse side of a leaf. The left hand page of an open book. Designated in bibliographical notation by a superscript lower-case v. The complement of recto.
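
The signature notation defined in the glossary can be made concrete with a small Python sketch. It is illustrative only: the function names are hypothetical, and the alphabet follows the glossary entry for Signature in omitting J and U, although printers' practice varied.

```python
import re

# Signature alphabet as described in the glossary: A-Z without J and U.
ALPHABET = "ABCDEFGHIKLMNOPQRSTVWXYZ"


def signature_for(index):
    """Return the signature of the gathering at zero-based index."""
    repeats = index // len(ALPHABET) + 1     # A..Z, then AA..ZZ, and so on
    return ALPHABET[index % len(ALPHABET)] * repeats


def parse_leaf_reference(ref):
    """Split a reference such as '3A4' into (signature, leaf number)."""
    match = re.fullmatch(r"(\d*)([A-Z])(\d+)", ref)
    if match is None:
        raise ValueError(f"not a leaf reference: {ref!r}")
    count, letter, leaf = match.groups()
    return letter * int(count or 1), int(leaf)


print(signature_for(0))              # A
print(signature_for(24))             # AA
print(parse_leaf_reference("3A4"))   # ('AAA', 4)
```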
