
3.3 Input Formats

Within a particular electronic version of the Bible, we have observed that data formats are fairly consistent. Once low-level character set issues are dealt with -- some pertaining to non-Latin character sets, and some arising from the transition from a PC to a Unix platform -- the input formats vary along a reasonably small set of dimensions. These include:

  1. Line breaks: whether verses are implicitly delimited by appearing one per line, or are broken across lines and delimited in some other fashion.
  2. Labels: whether book labels appear explicitly in a file or are implicit in the file name; whether verses are explicitly numbered and, if so, whether those labels include the chapter as well as the verse (e.g. ``1'' vs. ``1:1'').
  3. Header information: whether files contain information regarding the edition, translation, etc.
  4. Formatting codes: whether the documents are essentially in plain-text format or contain embedded formatting.
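
As a minimal sketch of how these dimensions might be recorded per version, the following Python fragment describes a format as a small record and classifies a verse label as ``1'' or ``1:1'' style. The names InputFormat and classify_verse_label are hypothetical and introduced only for illustration; they are not part of the annotation software described here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class InputFormat:
    """Hypothetical record of the four dimensions listed above."""
    one_verse_per_line: bool      # dimension 1: line breaks
    book_label_in_file: bool      # dimension 2: book label explicit in the file?
    verse_label: str              # dimension 2: "none", "verse", or "chapter:verse"
    has_header: bool              # dimension 3: edition/translation header present?
    has_formatting_codes: bool    # dimension 4: embedded formatting vs. plain text


def classify_verse_label(line):
    """Guess the label style from the start of a verse line."""
    if re.match(r"\s*\d+:\d+\b", line):
        return "chapter:verse"    # e.g. "1:1 ..."
    if re.match(r"\s*\d+\b", line):
        return "verse"            # e.g. "1 ..." or "1. ..."
    return "none"                 # no explicit numbering found
```

A record of this kind would let one parser be selected per version rather than per file, which fits the observation that formats are consistent within a version.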

An on-line Swahili version of the New Testament, for example, illustrates embedded formatting, with separate marking for chapters and verses (Matthew 2:1-2):

  \c 2
  \s Wageni kutoka mashariki
  \v 1 Yesu alizaliwa mjini Bethlehemu, mkoani Yudea, wakati Herode
  alipokuwa mfalme. Punde tu baada ya kuzaliwa kwake, wataalamu wa nyota
  kutoka mashariki walifika Yerusalemu,
  \v 2 wakauliza, <<Yuko wapi mtoto, Mfalme wa Wayahudi, aliyezaliwa?
  Tumeiona nyota yake ilipotokea mashariki, tukaja kumwabudu.>>
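
A format of this kind can be read with a few lines of code. The sketch below is illustrative only and assumes, from the sample alone, that \c introduces a chapter number, \s a section heading, and \v a verse number followed by text that may continue on later lines. The function name parse_marked_text is hypothetical.

```python
import re


def parse_marked_text(text):
    """Return a list of (chapter, verse, text) tuples from marker-coded text."""
    verses = []       # finished (chapter, verse, text) tuples
    chapter = None    # most recent chapter number seen
    current = None    # (verse_number, [text fragments]) for the open verse

    def flush():
        # Close the verse being collected, if any.
        nonlocal current
        if current is not None:
            number, parts = current
            verses.append((chapter, number, " ".join(p for p in parts if p)))
            current = None

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        chapter_match = re.fullmatch(r"\\c\s+(\d+)", line)
        verse_match = re.match(r"\\v\s+(\d+)\s*(.*)", line)
        if chapter_match:
            flush()
            chapter = int(chapter_match.group(1))
        elif line.startswith("\\s "):
            flush()               # section headings are skipped in this sketch
        elif verse_match:
            flush()
            current = (int(verse_match.group(1)), [verse_match.group(2)])
        elif current is not None:
            current[1].append(line)   # continuation of the open verse
    flush()
    return verses
```

Applied to the sample above, this yields two entries, for chapter 2 verses 1 and 2, with the continuation lines joined to their verse by single spaces.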

A French version illustrates plain text with one verse per line, as well as the name of the book being repeated with each chapter heading (Matthew 2:1-2):

Matthieu   2
 1. J\'esus \'etant n\'e \`a Bethl\'ehem en Jud\'ee, au temps du \
 roi H\'erode, voici des mages d'Orient arriv\`erent \`a            \
 2 et dirent: O\'u est le roi des Juifs qui vient de na\^itre? car   \
 nous avons vu son \'etoile en Orient, et nous sommes venus pour      \
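
The plain-text case can be handled in a similar way. The following sketch rests on several assumptions drawn only from the sample as displayed: that sequences such as \'e and \`a are TeX-style accent escapes, that a trailing backslash marks a wrapped line, and that a chapter heading consists of a book name followed by a number. The names decode_accents and parse_plain_text are hypothetical.

```python
import re
import unicodedata

# Assumed TeX-style accent escapes mapped to Unicode combining marks.
_COMBINING = {"'": "\u0301", "`": "\u0300", "^": "\u0302"}


def decode_accents(text):
    """Turn sequences such as \\'e into a single accented character."""
    def compose(match):
        accent, letter = match.groups()
        return unicodedata.normalize("NFC", letter + _COMBINING[accent])
    return re.sub(r"\\(['`^])([A-Za-z])", compose, text)


def parse_plain_text(text):
    """Return a list of (book, chapter, verse, text) tuples."""
    verses = []
    book = chapter = None
    for raw in text.splitlines():
        # Drop padding and the trailing backslash that marks a wrapped line.
        line = raw.rstrip().rstrip("\\").strip()
        if not line:
            continue
        line = decode_accents(line)
        # The optional period covers both "1." and "2" label styles.
        verse_match = re.match(r"(\d+)\.?\s+(.*)", line)
        heading_match = re.fullmatch(r"(\D+?)\s+(\d+)", line)
        if verse_match:
            verses.append([book, chapter,
                           int(verse_match.group(1)), verse_match.group(2)])
        elif heading_match:
            # Book names that begin with a digit would need extra handling.
            book, chapter = heading_match.group(1), int(heading_match.group(2))
        elif verses:
            verses[-1][3] += " " + line   # wrapped continuation of a verse
    return [tuple(v) for v in verses]
```

Because the verse pattern accepts an optional period after the number, both label styles seen in the sample are parsed by the same rule.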

The simple, uniform structure of the source text appears to greatly reduce the variation in document encoding among the on-line source documents. Minor variation within a version does occur (for example, verse numbers are sometimes followed by a period and sometimes not), but such cases are easily handled. By organizing the annotated versions book by book, we eliminate potential problems with book ordering -- for example, the book of Hebrews is the 58th book in the English Bible and the 63rd in the German Bible, although the relative order of every other book is identical.
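
To illustrate why book-by-book organization sidesteps ordering differences, the sketch below pairs verses from two versions by their (book, chapter, verse) reference rather than by position. It is a hypothetical illustration, not the alignment procedure used for the corpus; the function name align_by_reference and the idea of short book codes as keys are assumptions.

```python
def align_by_reference(version_a, version_b):
    """Pair verses found in both versions, keyed by (book, chapter, verse).

    Each version is a dict mapping a (book, chapter, verse) tuple to verse
    text, so the order in which books appear in either version is irrelevant.
    """
    shared = sorted(version_a.keys() & version_b.keys())
    return [(ref, version_a[ref], version_b[ref]) for ref in shared]
```

With references as keys, a book that sits at a different position in one version is still matched to its counterpart, and verses present in only one version are simply left out of the result.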


Philip Resnik
Tue Oct 21 19:23:13 EDT 1997