Text Encoding Initiative
Tenth Anniversary User Conference

TEI Extensions for Legal Text [1]

Nicholas D. Finke [2]

This paper outlines an extension of the Text Encoding Initiative encoding scheme for use by those who wish to study, interpret or otherwise use legal text. This last series of verbs is deliberately vague. It brings under one roof the academic study of legal texts, by lawyers, linguists, historians and others, as well as the use of text by lawyers in their daily work of determining the law for clients or courts. The purpose of these extensions is to provide these researchers with tools that will enable the understanding and use of a text in specifically legal ways by the encoding of those textual features that may be of particular interest in legal study. Although it deals with aspects of text that differentiate it precisely as legal text, this paper takes as a working assumption that the basic reading of text by lawyers is sufficiently like at least some of the reading practiced by humanities scholars to make the extension for lawyers of a humanities encoding system worthwhile.

What is Legal Text?

Before we consider the encoding of legal text, we must ask what "legal text" is. The term is easily applied to what lawyers call "primary sources of law", i.e., codes, statutes, regulations or court cases, the work products of many different types of sovereigns, either presently in force or of historical value. Looked at from a different perspective, the term "legal text" refers to instruments created to determine rights in private transactions, such as deeds or wills. "Legal text" can also refer to a scholarly writing on some law-related subject, such as an article in a law journal or a treatise or the Restatements of the American Law Institute. This paper will use the term "legal text" to mean any text which a legal professional could use in the course of her work, i.e., in the development of a rule of law for some particular situation (Vining 1990). This use of "legal text" includes the lawyer's "primary sources" as well as the various instruments which are created in the context of these primary sources to determine legal rights. One caveat here: the author's legal experience has been primarily in common-law jurisdictions. Consideration has been given to the fact that there are types of legal regimes that differ greatly from the one(s) with which the author is more familiar. It is intended that the extension of the TEI scheme, as generally proposed, will be able to accommodate the particular features of any legal regime.

Other Legal SGML Projects

A number of projects have encoded one or another type of legal text in some dialect of SGML. The largest projects have been undertaken by legal publishers who use SGML to enable the publication of legal information in traditional hardcopy systems. Although most such projects are not publicized, an exception is the use of SGML by Thomson Legal Publishing (Kinney 1996). The European Union has developed DTDs to enable synoptic publishing of its Official Journal as well as to capture versions of statutes as they are developed over time (Doggen 1996; De Mets 1996). The Centre de Recherche en Droit Public at the University of Montreal has developed DTDs to encode, among other things, the opinions of the Supreme Court of Canada (Poulin 1996). These projects have in common the desire to publish primary legal sources either electronically or on paper. The DTDs have been built to facilitate publishing of this information in traditional ways, and so the elements in their DTDs usually aim to provide the items of information found in a traditional printed legal text.

The Corpus Legis project, a joint project of the Law and Informatics Research Institute (IRI) at the Faculty of Law and the Department of Computational Linguistics at Stockholm University, has a more academic bent. The aim of the project is "to create a permanent, computerised legal text resource ... for legal-linguistic studies" (Haider et al. 1996). The DTD constructed by the Corpus Legis project includes "structure elements" such as "Preamble" and "Main Provisions" as well as "legal elements" such as "Legal Definitions". These "legal elements" are included in the DTD to enable the project to deal with the way that texts from different jurisdictions handle similar legal subject matters. The Corpus Legis DTD provides for explicit representation of legally necessary features such as amendment of statutes over time. It does not, however, go far enough.

TEI Extensions for Law

The TEI extensions outlined in this paper are aimed at capturing information contained in legal text. This is to be done in three ways:

  1. by the addition of several global attributes that carry basic information necessary to allow a text to be considered in its legal aspect,

  2. by the creation of a new <lawStmt> element in the <teiHeader> to carry several new elements that contain various types of legal metadata,

  3. by providing for new legal-specific elements that reflect the specific structures found in legal text.

Global Attributes

There are several basic pieces of data that should be able to be carried as attributes on every element in a legal text. These are:

  1. juris - the jurisdiction in connection with which the text carried by this element may be used as the source of a legal rule, e.g., the State of Maine, the Isle of Man, the Republic of Zaire, the Archdiocese of Milan;
  2. auth - the authority that issued the text, e.g., the Congress of the United States, King Louis the Eleventh of France, the Water Board of the City of Tuscaloosa, Alabama;
  3. effdate - the date as of which the text became useable as the source of a legal rule; and
  4. repdate - the date as of which the text ceased to be useable as the source of a legal rule.

Knowing these four pieces of data enables a legal professional or other researcher looking at a text to determine the circumstances under which this text or portion thereof may be or might have been used in the construction of a rule.
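
The use of these four attributes can be illustrated with a minimal sketch in Python. Only the attribute names juris, auth, effdate and repdate come from the proposal; the sample fragment, the attribute values, the ISO date form and the function name usable_on are assumptions made for illustration. The sketch also assumes that a missing effdate or repdate means "no limit on that side".

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical fragment: the element content and attribute values are
# invented; only the four attribute names come from the proposal.
SAMPLE = """
<div type="section" juris="ME" auth="legislature"
     effdate="1990-01-01" repdate="1995-07-01">
  <p>No person shall keep more than three goats within town limits.</p>
</div>
"""

def usable_on(elem, when):
    """Return True if `when` falls on or after effdate and before repdate.

    Assumption: an absent effdate or repdate places no limit on that side.
    """
    eff = elem.get("effdate")
    rep = elem.get("repdate")
    if eff is not None and when < date.fromisoformat(eff):
        return False
    if rep is not None and when >= date.fromisoformat(rep):
        return False
    return True

section = ET.fromstring(SAMPLE.strip())
print(section.get("juris"), section.get("auth"))
print(usable_on(section, date(1992, 6, 1)))  # inside the window
print(usable_on(section, date(1996, 1, 1)))  # after the repeal date
```

A researcher could apply the same test to every element of a document to select only those portions that were in force on a given day.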

The values for the juris attribute should normally be taken from standard lists. Thus, in modern texts they should be the standard abbreviations (if any) used to identify various legal jurisdictions; for United States jurisdictions, for example, these would be the abbreviations given in the Guide to Uniform Citation. Similarly, the lawmaker(s) identified in the auth attribute should be referred to in a standard way. The <lawStmt> proposed to be added to the <teiHeader> will contain a place for the definition of the values used for these attributes in a particular text or series of texts or for the reference to a standard definition system.

The values of the effdate and repdate attributes should be in the usual form prescribed for dates by the Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard 1994). In many, if not most, cases there will, of course, be no repdate value because the text has never been repealed. Depending on the type of text that is being encoded, there may well be a need for many more kinds of date attributes than the two given here. These two are only the most basic and necessary forms.

In most cases the values of these global attributes are not themselves legal conclusions, that is, they can be straightforwardly determined from inspection of the text rather than being the content of a legal opinion. In the event that the specification of the value of one or more of these global attributes is a legal conclusion, this fact can be recorded in the <lawStmt> proposed below. When appropriate or convenient, the id/idref mechanism presently used to specify values such as those of the lang attribute, and to link each value of a lang attribute to its appropriate Writing System Declaration, will be employed to assure standardization of the values of these attributes and to link them to larger reference schemes.
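
The id/idref linkage described above might work along the following lines. This is a sketch only: the <jurisDecl> element and its placement inside <lawStmt> are hypothetical names invented here, since the proposal does not yet fix the elements that will hold these definitions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: <jurisDecl> is an invented element name; only
# <teiHeader>, <lawStmt> and the juris attribute come from the proposal.
DOC = """
<TEI.2>
  <teiHeader>
    <lawStmt>
      <jurisDecl id="ME">State of Maine</jurisDecl>
      <jurisDecl id="IM">Isle of Man</jurisDecl>
    </lawStmt>
  </teiHeader>
  <text>
    <div juris="ME"><p>First provision.</p></div>
    <div juris="XX"><p>Second provision.</p></div>
  </text>
</TEI.2>
"""

root = ET.fromstring(DOC.strip())

# Map each declared identifier to its definition in the header.
declared = {d.get("id"): d.text for d in root.iter("jurisDecl")}

def resolve(elem):
    """Return the declared definition for this element's juris value, if any."""
    return declared.get(elem.get("juris"))

# Collect juris values used in the text that have no declaration.
dangling = [
    e.get("juris")
    for e in root.iter()
    if e.get("juris") is not None and e.get("juris") not in declared
]

for div in root.iter("div"):
    print(div.get("juris"), "->", resolve(div))
print("undeclared:", dangling)
```

An SGML parser would enforce such references itself when the attribute is declared as an IDREF; the explicit check here only makes the linkage visible.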

The <lawStmt>

If one attempts to encode a document such as a published opinion from an appellate court in the United States using the standard TEI prose tagset, the first page of the document quickly becomes littered with elements to carry various facts set forth at the start of the opinion such as: the date of the argument of the appeal, the venue of the original proceeding, the way in which the case was brought to the higher court, the date of the decision, the judges who formed the court that heard the appeal, the attorneys who argued the case for each side, the points of law that the court felt it was addressing, and so on. These points of information may or may not be of any interest to a reader of the opinion. Lawyers quickly learn to pass over them without a glance. Statutes and regulations have similar clouds of metadata that buzz like gnats around the real core of the text. Because this data is usually felt by users of the text to be descriptive of the circumstances of the text rather than a part of it, this paper proposes that a new element, the <lawStmt>, be added to the <teiHeader> to carry this sort of data about a legal text. The particular elements contained in a <lawStmt> for a piece of text will be determined primarily by the type of text involved, and they will usually be already present in the standard ways that legal text is presented.

In many cases the items of metadata presented in the <lawStmt> may not be considered to be facts but, instead, to be legal conclusions with which another reader might differ. If this is the case, the <lawStmt> contains elements that allow this fact to be recorded.
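
As a rough illustration of what a <lawStmt> for an appellate opinion might carry, the following Python sketch separates plain facts from items flagged as legal conclusions. Every child element name (<argued>, <decided>, <venue>, <judge>, <pointOfLaw>), the conclusion attribute and all of the values are hypothetical, invented for this example; the proposal names only the <lawStmt> container itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical header: all child elements, the "conclusion" attribute and
# the values are invented; only <teiHeader> and <lawStmt> are from the text.
HEADER = """
<teiHeader>
  <lawStmt>
    <argued>1996-03-12</argued>
    <decided>1996-09-30</decided>
    <venue>trial court of first instance</venue>
    <judge>First Judge</judge>
    <judge>Second Judge</judge>
    <pointOfLaw conclusion="yes">adverse possession</pointOfLaw>
  </lawStmt>
</teiHeader>
"""

header = ET.fromstring(HEADER.strip())
law_stmt = header.find("lawStmt")

# Items not flagged as conclusions are treated as plain facts.
facts = [(item.tag, item.text) for item in law_stmt
         if item.get("conclusion") != "yes"]
# Flagged items are legal conclusions with which another reader might differ.
conclusions = [(item.tag, item.text) for item in law_stmt
               if item.get("conclusion") == "yes"]

print("facts:", facts)
print("conclusions:", conclusions)
```

Keeping this material in the header leaves the body of the opinion free of the metadata that readers habitually skip.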

Legal-specific Structures

One of the features that is very characteristic of much legal text is its highly organized structure. Statutes and regulations are very frequently easily represented structurally as sets of nested ordered lists in which Parts are divided into Sections and the Sections into Paragraphs and the Paragraphs into Subparts and so on (it sometimes seems) forever. In many cases there are also standard structures into which a legal document must be fit in a somewhat Procrustean fashion, if necessary. Examples of this are the formats for legislative acts used in many code jurisdictions which require a preamble, recitals of pre-existing circumstances and then an enacting clause or clauses. The third part of the proposed TEI extensions for legal text will be the provision for standard sets of elements that make up these structures. In many cases the same result can be achieved by nesting <div> or <list> elements and providing them with appropriate type values. Lawyers being the tradition-bound creatures that they are, it proves much easier to provide for these structures to be specifically represented by purpose-built elements.
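
The relationship between the two encodings can be sketched in Python. The sketch takes a statute marked up with nested <div> elements carrying type values and renames each one to a purpose-built element. The element names part, section and paragraph are assumptions chosen for illustration, not names fixed by the proposal.

```python
import xml.etree.ElementTree as ET

# Generic TEI-style encoding: nested <div> elements distinguished by type.
# The type values and numbering are invented for illustration.
GENERIC = """
<div type="part" n="1">
  <div type="section" n="1.1">
    <div type="paragraph" n="1.1.a"><p>Text of the paragraph.</p></div>
  </div>
</div>
"""

def to_purpose_built(elem):
    """Rename each typed <div> in place to an element named after its type."""
    # Materialise the list first so that renaming cannot disturb iteration.
    for node in list(elem.iter("div")):
        kind = node.attrib.pop("type", None)
        if kind:
            node.tag = kind
    return elem

tree = to_purpose_built(ET.fromstring(GENERIC.strip()))
print(ET.tostring(tree, encoding="unicode"))
```

Because the conversion is mechanical, a text encoded with the purpose-built elements loses nothing relative to the generic form and could be mapped back to it for use with standard TEI tools.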

Another and much more important feature of this set of elements is that it allows the representation of the variation of a text over time. In some jurisdictions statutes and regulations are amended by striking out words or phrases or sections and/or adding new words, etc. This paper proposes that these changes be reflected in the text markup in a way that allows one to see the change(s) that have occurred in the text over time. This feature not only has practical value in helping a lawyer to determine the appropriate text for a particular point in time but can also be a rich source of data for analysis.
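
One way such markup could be used is sketched below in Python. The sketch assumes that struck and added wording is wrapped in <seg> elements carrying the effdate and repdate attributes proposed earlier. The provision, its dates and the function names are invented for illustration, and the proposal may well settle on different elements for this purpose.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical provision amended twice: "ten" replaced by "twenty" in 1995,
# and " per year" added in 1997. All dates and wording are invented.
PROVISION = (
    '<p>The fee is <seg repdate="1995-01-01">ten</seg>'
    '<seg effdate="1995-01-01">twenty</seg> dollars'
    '<seg effdate="1997-06-01"> per year</seg>.</p>'
)

def in_force(elem, when):
    """True if `when` is on or after effdate and before repdate (if present)."""
    eff, rep = elem.get("effdate"), elem.get("repdate")
    if eff is not None and when < date.fromisoformat(eff):
        return False
    if rep is not None and when >= date.fromisoformat(rep):
        return False
    return True

def text_as_of(elem, when):
    """Rebuild the wording of `elem` as it stood on the date `when`."""
    if not in_force(elem, when):
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(text_as_of(child, when))
        # Tail text belongs to the parent, so it is kept regardless.
        parts.append(child.tail or "")
    return "".join(parts)

provision = ET.fromstring(PROVISION)
for day in (date(1994, 1, 1), date(1996, 1, 1), date(1998, 1, 1)):
    print(day.isoformat(), text_as_of(provision, day))
```

The same encoded text thus yields each historical version on demand, and the dated segments themselves become data for the study of how a provision has changed.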

Looking Forward

This paper describes a rather basic set of extensions that will allow better representation of legal text within the TEI encoding scheme. Completing this set of extensions and using the TEI encoding scheme as so modified to mark up a significant quantity of legal text will be an endeavor that will prove both interesting and useful. In addition to this effort, however, much work is currently being done on the possible uses of artificial intelligence techniques in a legal setting (Brouwer 1994; Gardner 1987). In particular, sophisticated knowledge representation techniques are the subject of current work that seems to hold considerable promise (Bankowski et al. 1995; Valente and Breuker 1994). As these attempts at legal knowledge representation become more accurate, the construction of an appropriate feature system may well prove a concrete way to link the representation of legal reasoning to the text that contains it.

References

Bankowski, Zenon, Ian White, and Ulrike Hahn, eds. Informatics and the Foundations of Legal Reasoning. Edited by Alan Mabe and Aulis Aarnio. Vol. 21, Law and Philosophy Library. Dordrecht: Kluwer Academic Publishers, 1995.

Brouwer, P.W. "Legal Knowledge Representation in the Perspective of Legal Theory." In Legal Knowledge Based Systems: Jurix '94, edited by R.G.F. Winkels, H. Prakken, A.J. Muntjewerff and A. Soeteman, 9-18. Lelystad: Koninklijke Vermande BV, 1994.

De Mets, Guido. "Consleg Interleaf: SGML Applied in Legislation." Paper presented at SGML '96: Celebrating a Decade of SGML, Boston, 1996.

Doggen, Jack. "FORMEX V3: Tagging the Laws: SGML Used for Complex Multilingual Documents." Paper presented at SGML '96: Celebrating a Decade of SGML, Boston, 1996.

Gardner, Anne von der Lieth. An Artificial Intelligence Approach to Legal Reasoning. Edited by L. Thorne McCarty and Edwina L. Rissland, Artificial Intelligence and Legal Reasoning. Cambridge, Mass.: The MIT Press, 1987.

Haider, Georg, Cecilia Magnusson Sjöberg, Gerald Quirchmayer, and Verena Sebald. "The Comparative Part of the Corpus Legis Project - Using SGML for Intelligent Information Retrieval of Legal Documents." In EXPERSYS-96, Artificial Intelligence Applications, edited by J. Zarka, E. Mercier-Laurent, D.L. Crabtree and M. Narasipuram, 181-186. Oxford: Pergamon Press, 1996.

Kinney, Diane. "Reengineering SGML Implementation: Second-Generation SGML Systems." Paper presented at SGML '96: Celebrating a Decade of SGML, Boston, 1996.

Poulin, Daniel. "Le SGML et son intérêt pour la gestion des documents juridiques." Cybernews II, no. III (1996).

Sperberg-McQueen, C. M., and Lou Burnard. Guidelines for Electronic Text Encoding and Interchange. 2 vols. Chicago: Text Encoding Initiative, 1994.

Valente, André, and Joost Breuker. "Ontologies: the Missing Link Between Legal Theory and AI and the Law." In Legal Knowledge Based Systems: Jurix '94, edited by R.G.F. Winkels, H. Prakken, A.J. Muntjewerff and A. Soeteman, 139-149. Lelystad: Koninklijke Vermande BV, 1994.

Vining, Joseph. "Generalization in Interpretive Theory." Representations 30 (Spring 1990): 1.


[1] (c) Copyright 1997 Nicholas D. Finke. This paper is an abstract of a paper to be delivered at the Text Encoding Initiative Tenth Anniversary User Conference in November, 1997. Please do not copy or further distribute this paper without the author's express permission.

[2] Director, Center for Electronic Text in the Law, Robert S. Marx Law Library, University of Cincinnati College of Law.
