Although the added usability of extensively tagged texts more than offsets the time spent tagging them, it would be even better if the tagging could be accomplished with less effort. It also makes more sense to build on previous work than to do the same thing all over again, since there is too much other work waiting to be finished! When selecting an edition for the input of the Essentials of the Five Histories of the Transmission of the Lamp (Wudeng huiyuan), compiled by Pu Ji in 1253, the Zen KnowledgeBase team thus decided to use the edition prepared under the name of Su Yuantian and published by Zhonghua Shuju. Though the overall quality of that edition is not beyond doubt, we decided to go with it because it contains printed markers (such as lines next to names of persons and distinct opening and closing quotes) that would come in handy.
I first sat down to prepare a set of symbols that could later be used as the basis for automated tagging. In the printed edition, this information is contained, for example, in
After receiving the full data set, I proceeded to convert the keyboarding symbols into TEI-conformant SGML tags. Most of this could be done by program, though not with total consistency. At this first stage, all straight lines of the printed version (marking various kinds of names) were marked up with <name>, without further specification. Book titles, indicated by a wavy line in modern Chinese typesetting conventions, were easily mapped to the <title> tag, and comments printed in smaller type to the <note> tag with the additional attribute type="inline". The footnotes (mostly notes on textual variants among the three collated versions) were also marked with the <note> tag; later, they were manually changed to the appropriate tag sequence for textual variants, consisting of <app>, <lem> and <rdg> elements.
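The program-driven part of this conversion can be pictured with a minimal Python sketch. The keyboarding symbols used here ({n|...}, {t|...}, {c|...}) are purely hypothetical stand-ins, since the actual input codes are not listed in this report; only the target tags <name>, <title> and <note type="inline"> come from the description above.

```python
import re

# Hypothetical keyboarding symbols (NOT the project's actual input codes):
#   {n|...}  straight line beside a name
#   {t|...}  wavy line beside a book title
#   {c|...}  comment printed in smaller type
RULES = [
    (re.compile(r"\{n\|([^{}]*)\}"), r"<name>\1</name>"),
    (re.compile(r"\{t\|([^{}]*)\}"), r"<title>\1</title>"),
    (re.compile(r"\{c\|([^{}]*)\}"), r'<note type="inline">\1</note>'),
]


def convert(text: str) -> str:
    """Replace each keyboarding symbol pair with the corresponding tag pair."""
    previous = None
    # Repeat until nothing changes, so that nested symbols (for example a
    # name inside a comment) are converted from the innermost outwards.
    while previous != text:
        previous = text
        for pattern, replacement in RULES:
            text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(convert("{c|see {n|Bodhidharma}}"))
```

A pass of this kind leaves anything it cannot match untouched, which fits the observation that the automatic conversion was not totally consistent and needed manual follow-up.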
In this first conversion step, the fact that SGML-based markup can be validated proved to be of crucial importance in discovering numerous errors in the input. As the amount of markup was increased incrementally, its validity was checked at each step to make sure that no errors crept into the file.
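Full validation requires an SGML parser and the DTD, which cannot be reproduced here. As a much weaker stand-in, the following Python sketch only checks that tags are balanced and properly nested. It assumes that every element has an explicit end tag, which SGML itself does not require, and the function name is hypothetical.

```python
import re

# Matches a start tag or an end tag and captures the element name.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*>")


def nesting_errors(text):
    """Return a list of messages describing unbalanced or misnested tags.

    This is not DTD validation; it only checks that every start tag is
    closed by a matching end tag in the right order.
    """
    errors = []
    stack = []  # open elements as (name, offset) pairs
    for match in TAG.finditer(text):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append((name, match.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            expected = stack[-1][0] if stack else "nothing open"
            errors.append(
                f"offset {match.start()}: </{name}> found, innermost open element: {expected}"
            )
    errors.extend(f"offset {pos}: <{name}> never closed" for name, pos in stack)
    return errors


if __name__ == "__main__":
    for message in nesting_errors("<note><name>Pu Ji</note>"):
        print(message)
```

Running such a check after each round of added markup mirrors the incremental procedure described above, although a real validator would also catch elements used in places the DTD does not allow.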
In the second stage of conversion, the typeset opening and closing quote symbols were replaced by markup that differentiates between quotes used for speech and quotes used for citations, so that any quotation can be easily identified. Every biography, which is itself already tagged as a structural division, was made addressable by assigning ID codes that were then also used for any occurrence of "the master said" within that biography. This will make it possible to locate utterances of specific masters even when their names are not explicitly mentioned. Of course, these ID codes can be linked to other texts and databases we are developing, thus building the basis on which a complex network of references can be developed.
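The ID assignment can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the division tag <div>, the speech tag <q> with a who attribute, the ID pattern "bio0001", and the English phrase "the master said:" standing in for the wording of the Chinese text. The sketch also assumes that biography divisions are not nested.

```python
import re

# Hypothetical pattern: the introductory phrase followed by an untagged
# speech element. The real text uses Chinese wording.
SPEECH = re.compile(r"(the master said:\s*)<q>")


def add_ids(text, prefix="bio"):
    """Number each <div> and stamp its ID on the speech it contains."""
    parts = text.split("<div>")
    out = [parts[0]]  # anything before the first biography is kept as-is
    for number, body in enumerate(parts[1:], start=1):
        div_id = f"{prefix}{number:04d}"
        # Attribute every "the master said" utterance to this biography.
        body = SPEECH.sub(rf'\1<q who="{div_id}">', body)
        out.append(f'<div id="{div_id}">{body}')
    return "".join(out)


if __name__ == "__main__":
    sample = "<div>Biography A. the master said: <q>...</q></div>"
    print(add_ids(sample))
```

Because the speaker reference repeats the ID of the enclosing biography, a later search for one master's utterances only has to look for that ID rather than for the master's name.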
In one more step, which has not yet been completed, all the characters tagged as "name" will also be assigned a corresponding ID, so that a person can be identified and references located even if the text uses a different name than usual (for example "the founder" for Bodhidharma). This can partly be achieved by program but will certainly involve a considerable amount of revision.
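The automated part of this step might look like the Python sketch below. The lookup table, the person IDs and the use of a key attribute on <name> are all assumptions for illustration, not the project's actual scheme. Names that are not in the table are returned separately so that they can be handled in the manual revision mentioned above.

```python
import re

# Hypothetical authority table: every known form of a name maps to one
# person ID. A real table would be far larger and context-dependent.
PERSON_IDS = {
    "Bodhidharma": "p0001",
    "the founder": "p0001",
}

NAME = re.compile(r"<name>([^<]*)</name>")


def link_names(text, table=PERSON_IDS):
    """Add a key attribute to known names and collect the unknown ones."""
    unresolved = []

    def replace(match):
        form = match.group(1)
        if form in table:
            return f'<name key="{table[form]}">{form}</name>'
        unresolved.append(form)  # left untouched for manual revision
        return match.group(0)

    return NAME.sub(replace, text), unresolved


if __name__ == "__main__":
    tagged, todo = link_names("<name>the founder</name> met <name>Huike</name>")
    print(tagged)
    print(todo)
```

A plain table lookup cannot tell which person an epithet such as "the founder" refers to in a given passage, which is one reason why a considerable amount of revision remains necessary.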
The current version of the TEI-conformant tagged Wudeng huiyuan is included on the ZenBase CD1, although the markup is not yet completed. It could serve as an example to inspire other projects to envisage and incorporate a basic markup process from the initial stages of an input project. Using a modern punctuated version as the base for the input may be an exception, but there is always some information beyond characters in the structure and layout of a text. Even the capture of paragraph divisions at the input level will later greatly help in the further processing of the text and the transformation of data for different purposes.
Author: Christian Wittern