
Dog Ears and SGML

by Urs APP


Abstract

This article (which is also published in the Electronic Bodhidharma No. 4) presents a very simple introduction to SGML and the tagging of texts by discussing marks (such as bent page corners in books, or "dog-ears") that are used to "mark" printed materials. Electronic earmarking or tagging of documents adds value to electronic text by making its structure and content explicit. If the dog-ears and pencil markups in your books make their use easier and more focused, electronic markup harbors an enormous potential. With good markup, the results of one's efforts are not forced to remain in one's closet; rather, they can become part of the text itself, and such "value-added text" may in turn enrich others and become part of our common human heritage.


After I delivered the thoroughly edited, proofread, and corrected text of my book on Zen Master Yunmen to Kodansha America on a floppy disk, Kodansha was told by its typesetter that it would be cheaper to retype the whole manuscript than to use my floppy. Why retype a perfectly corrected manuscript and run the inevitable risk of introducing a whole range of new mistakes that necessitate additional proofreading passes?

The answer lies in the markup. The typesetting company in question uses a somewhat antiquated but functional system of marking up texts; for example, when it wants the title "Record of Yunmen" to appear in 15-point italics centered on the page, it may have to insert tags like {CEN} {IT} {PT 15} before "Record of Yunmen." My word-processing program also marks up the text, but in a different way. Since the typesetting company was unable to translate my markup system into theirs, it decided to retype my whole manuscript.

This kind of incompatibility between markup codes was the main motive for the development of a document representation standard issued by the International Organization for Standardization (ISO) in 1986. The standard is called SGML: Standard Generalized Markup Language. It is designed to allow text markup that is independent of any specific word- or text-processing setup. Had I been able to save my computer file in SGML format, and the typesetter to read this format, weeks of work would have been saved.

But marking up a text is not limited to matters of layout and appearance; rather, markup refers to any activity that makes explicit much of the information inherent in a text. Thus one can, for example, highlight words or phrases in order to generate a table of contents or an index. Or one can insert markup tags that indicate positions in other texts or editions, allowing the reader to consult these texts with ease and precision -- thus linking, for example, an English translation to its Chinese original, or a specific koan to a variety of koan collections. Needless to say, such tags can also permit automatic comparison of various versions of the same text, for example in the Korean and Chinese Buddhist canons. Or, if one adds longitude-latitude tags to names of towns, one can create the possibility of immediate consultation of electronic maps. In turn, lineage tags would link a master's name to precise positions within lineage tables, citation tags to locations within source texts . . . one could go on and on with this list; depending on one's interests, one could cook up any number of useful tags -- and end up with a tag salad that would baffle even its creator.
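As a minimal sketch of the index idea mentioned above, the following Python fragment collects every phrase wrapped in a hypothetical <term> tag and records the paragraphs in which it occurs. The tag name, the sample sentences, and the function name build_index are assumptions made for illustration only; they are not taken from any actual DTD.

```python
import re
from collections import defaultdict

# Hypothetical marked-up paragraphs; <term> is an invented tag name.
PARAGRAPHS = [
    "Master <term>Yunmen</term> often answered with a single word.",
    "The <term>koan</term> collections quote <term>Yunmen</term> frequently.",
]

# Non-greedy match so that two terms in one paragraph stay separate.
TERM = re.compile(r"<term>(.*?)</term>")


def build_index(paragraphs):
    """Map each tagged term to the sorted 1-based paragraph numbers where it occurs."""
    index = defaultdict(set)
    for number, text in enumerate(paragraphs, start=1):
        for term in TERM.findall(text):
            index[term].add(number)
    return {term: sorted(numbers) for term, numbers in sorted(index.items())}


if __name__ == "__main__":
    for term, numbers in build_index(PARAGRAPHS).items():
        print(term, numbers)
```

Because the tag states explicitly which phrases are index terms, the program needs no knowledge of Zen vocabulary; a real SGML system would use a proper parser rather than a regular expression.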

We all know how unorganized and inconsistent markup can be: just look at your books with their dog-ears and jotted remarks in the margins. Very likely, no one but you can take advantage of that kind of markup, and it will probably be lost with your demise, because no one else will know what it means. This kind of markup lacks precision; it is usually like a kind of primitive language consisting of words like "ooh!" and "aha" and "I agree" and "rubbish" or "I need to remember this!" A dog-ear might signify "there is something of interest on this page" or "this is where I stopped reading," "this page is dynamite for my upcoming book review" or "I need to locate an article referred to on this page." You will know what else a dog-ear can signify for you -- and only you will ever know.

This is exactly where the SGML markup standard kicks in. It is a standardized (ensuring that markup information is interchangeable and consistent) and generalized (applicable to all sorts of documents) markup language, a kind of meta-language that clearly defines many things in a document, from the various categories of dog-ears to the different kinds of commas. To ensure that the terms are less ambiguous than the symbols in your book margins, SGML-conforming documents begin with a list that states the basic ingredients of your marked-up text: what character set you are using (for example ASCII or Unicode) and how you distinguish markup symbols from ordinary text. This list is called the SGML declaration. In terms of the book analogy: you would here declare that what is black on the page is text and what is white is the background; that text consists of alphabetic letters and Chinese characters; and that the markup symbols have the form of dog-ears and grey pencil marks.

It is also likely that, depending on the type of document you are dealing with, a dog-ear might signify various different things. In a draft document given to a secretary, it may mean "watch out!", in a software manual "look here for what's missing in the index," in a newspaper "there's an interesting ad on this page," or in a film script "this scene must be shot again." In SGML-conforming documents, the Document Type Definition (DTD) defines the characteristics of such different types of document and states the rules that are applied in marking them up. These rules serve without exception a single purpose: to help us make implicit information explicit. For example: if you want to find abbreviations in an English document, you have to search for words ending with periods. However, a period is a very ambiguous symbol: it can end abbreviations, sentences, or a combination of these two; additionally, it can represent a space in numeric sequences (such as T8.274.232a10) or serve as a decimal point. Depending on the document type one chooses or defines, one or more of these meanings may have to be made explicit; thus the document type definition may provide one tag for an abbreviation stop, another for the end of a sentence, a third for both of the above, a fourth for a numeric space character, and a fifth for a decimal point. The document type definition (DTD) is usually designed by a specialist; at our institute, my research team is now engaged in developing some DTD models for Zen texts.
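The period example above can be sketched in a few lines of Python. The sample sentence and the five tag names (abbr, sent, abbrsent, numsp, dec) are invented for this illustration and do not come from any existing DTD; the point is only to show, under these assumptions, why a search on the bare period over-reports while a search on explicit tags does not.

```python
import re

# Plain text: every period looks the same to a search program.
PLAIN = "See T8.274.232a10 etc. The ratio is 1.5. Ask Dr. Lee."

# The same text with hypothetical tags that name the role of each period.
TAGGED = (
    "See T8<numsp>.</numsp>274<numsp>.</numsp>232a10 "
    "etc<abbrsent>.</abbrsent> "
    "The ratio is 1<dec>.</dec>5<sent>.</sent> "
    "Ask Dr<abbr>.</abbr> Lee<sent>.</sent>"
)


def naive_abbreviations(text):
    """Guess abbreviations as 'alphabetic words followed by a period'."""
    return re.findall(r"\b([A-Za-z]+)\.", text)


def tagged_abbreviations(text):
    """Return only words whose period is explicitly tagged as an abbreviation stop."""
    return re.findall(r"(\w+)<(?:abbr|abbrsent)>\.", text)


def strip_tags(text):
    """Remove all tags, leaving the original plain text."""
    return re.sub(r"</?\w+>", "", text)


if __name__ == "__main__":
    print(naive_abbreviations(PLAIN))    # includes "Lee", which merely ends a sentence
    print(tagged_abbreviations(TAGGED))  # only the genuine abbreviations
```

The naive search cannot tell "Dr." from "Lee." because both are words followed by a period; once the role of each period is tagged, the ambiguity disappears, and stripping the tags still recovers the plain text unchanged.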

The third element of an SGML-conforming document is the marked-up text itself. These three elements

  1. the SGML declaration
  2. the document type definition (DTD), and
  3. the marked-up text
together constitute an SGML-conforming electronic document. Its structure and content are defined so generically that the document is largely hardware- and software-independent. Its clear separation of structure from layout and its unambiguous format permit easy storage, intelligent searching, output in a variety of forms, and broad interchange.
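To illustrate what the separation of structure from layout buys, here is a small Python sketch that renders one and the same marked-up title in two different output forms. The <title> tag, the function names, and the plain-text layout rule are assumptions made for this example; the typesetter codes simply reuse the {CEN} {IT} {PT 15} pattern mentioned earlier in this article.

```python
import re

# Hypothetical descriptive markup: the tag says what the text is, not how it looks.
DOCUMENT = "<title>Record of Yunmen</title>"


def typesetter_style(text):
    """Layout for a typesetting system: centered, italic, 15 point."""
    return "{CEN} {IT} {PT 15} " + text


def plain_text_style(text, width=40):
    """Invented layout for a plain-text screen: upper case, centered in a fixed width."""
    return text.upper().center(width)


def render(document, title_style):
    """Replace each <title> element with the output of the chosen style function."""
    return re.sub(
        r"<title>(.*?)</title>",
        lambda match: title_style(match.group(1)),
        document,
    )


if __name__ == "__main__":
    print(render(DOCUMENT, typesetter_style))
    print(render(DOCUMENT, plain_text_style))
```

The marked-up text is never changed; only the style function is swapped. Had the manuscript and the typesetter shared such a descriptive format, translating one layout convention into another would have been a mechanical step rather than a reason to retype the book.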

Such electronic "earmarking" of documents adds value to electronic text; just like the dog-ears and pencil markups in your books, it makes their use easier and more focused -- but in a vastly more fruitful and comprehensive manner. Thus the results of one's efforts are not forced to remain in one's closet; rather, they can become part of the text itself, and such "value-added text" may in turn enrich others and become part of the common human heritage.


Author: Urs APP
Last updated: 95/05/03