

October 10, 2000

Award from The National Institute of Standards and Technology (NIST)

"Managing Semantic Heterogeneity in OEB eBooks and other XML Documents"

The National Institute of Standards and Technology, US Department of Commerce, has awarded the Scholarly Technology Group a one-year grant to continue STG's ongoing research and development in the area of Open Electronic Book standards.

The specific purpose of this project is to support STG's investigation into the problem of "semantic heterogeneity" across Extensible Markup Language (XML) Document Type Definitions (DTDs) and XML schemas, focusing particularly on schemas for "extended" OEB electronic books.

Elements and attributes in different XML document schemas typically have related -- but also possibly different -- "meanings". At present, however, there is no way to represent these relationships in a formal, machine-processible way. That is, there is no way to say, for instance, that two element types from different schemas are exact equivalents (e.g., <xx:h1> and <yy:heading1>), or that a combination of two or more elements from one schema is a more detailed treatment of a feature represented by one element in another schema (e.g., <xx:firstname> and <xx:lastname> vs. <yy:name>), or that an element from one schema is more specific than an element in another (e.g., <xx:person_name> vs. <yy:name>). Even more challenging, we have no way to say that two elements have some sort of partial equivalence, i.e., that they are similar in certain respects although neither identical nor related in any of the specific ways described above (compare <xx:chapter> and <yy:division>, for instance).
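To make the four kinds of relationship concrete, here is a minimal Python sketch of how they might be recorded as data. It is an illustration only, not a notation proposed by this project: the `Relation` names, the `Mapping` record, and the `relations_for` helper are all hypothetical, and the element names are simply the examples used above.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """Hypothetical labels for the relationship kinds discussed above."""
    EQUIVALENT = "equivalent"  # exact equivalents
    DECOMPOSES = "decomposes"  # several source elements jointly detail one target
    NARROWER = "narrower"      # source is more specific than target
    PARTIAL = "partial"        # similar in some respects only


@dataclass(frozen=True)
class Mapping:
    """One asserted relationship between elements of two schemas."""
    source: tuple[str, ...]    # one or more qualified element names
    relation: Relation
    target: str


# The examples from the text, written as machine-readable assertions.
MAPPINGS = [
    Mapping(("xx:h1",), Relation.EQUIVALENT, "yy:heading1"),
    Mapping(("xx:firstname", "xx:lastname"), Relation.DECOMPOSES, "yy:name"),
    Mapping(("xx:person_name",), Relation.NARROWER, "yy:name"),
    Mapping(("xx:chapter",), Relation.PARTIAL, "yy:division"),
]


def relations_for(element: str) -> list[Mapping]:
    """Return every mapping that mentions the element on either side."""
    return [m for m in MAPPINGS if element in m.source or element == m.target]
```

A flat table like this can name the relationships, but it does not say in which respects two partially equivalent elements agree; that is the part for which no formal account yet exists.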

These problems currently present serious practical obstacles to high-performance, interoperable electronic publishing in general, and eBook publishing in particular. Tools and techniques for information retrieval, navigation, viewing, and rendering all require some common representation of data if they are to be used effectively. Without a common representation across diverse XML schemas, software developers are forced to choose between low functionality on the one hand and low interoperability on the other. This is particularly troubling given the increasing importance of diverse specialized and domain-specific XML schemas used in extended OEB electronic books.

As noted above, there is currently no principled way to compare and relate XML schemas and their component elements and attributes, and therefore no way to express their semantic relationships in a formal, machine-processible language. Ultimately the issues here are very deep: we have no good theory of "markup semantics", and even our empirical knowledge of actual practices is slight. Nevertheless, some useful practical measures should be within reach.

This grant funds the addition of data collection features to STG's XHub project. It also allows STG to begin working with our industry partners to study systematically the problems of semantic heterogeneity raised by XML documents, and to develop and apply technology for addressing these problems. XHub provides a publicly available infrastructure for systematically converting documents among diverse tag-sets. Instrumentation for collecting and analyzing data about the structural nature of the syntactic transformations involved in these conversions will be a valuable source of practical insights into perceived semantic relationships and will, in addition, help create a testbed for evaluating emerging methodologies for managing semantic interoperability.
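As a rough illustration of what such instrumentation could collect, the Python sketch below tallies how often a source element is converted to a given target element. It is an assumption-laden sketch, not a description of XHub: the `ConversionLog` class and its methods are hypothetical, and real conversions would involve structural changes well beyond one-to-one renaming.

```python
from collections import Counter


class ConversionLog:
    """Hypothetical tally of element-to-element conversions."""

    def __init__(self) -> None:
        # Keys are (source_element, target_element) pairs.
        self._counts = Counter()

    def record(self, source_element: str, target_element: str) -> None:
        """Note that one source element was converted to one target element."""
        self._counts[(source_element, target_element)] += 1

    def targets_for(self, source_element: str) -> list[tuple[str, int]]:
        """List observed targets for a source element, most frequent first."""
        pairs = [
            (target, count)
            for (source, target), count in self._counts.items()
            if source == source_element
        ]
        # Sort by descending count, then by name for a stable order.
        return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
```

Counts of this kind would show only which elements converters treat as corresponding in practice; interpreting those correspondences as semantic relationships would remain a matter for analysis.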

Lead investigators are Allen Renear and Steve DeRose. For more information contact Allen Renear (Allen_Renear@Brown.Edu or 401 863-7312).