
Markup at the Input Level

A case study by Christian Wittern


Abstract

This article (which is also published in the Electronic Bodhidharma No. 4) presents a case study of the manual input and program-driven tagging of a large Chinese Zen text, the Wudeng huiyuan (Jap. Gotô egen 五燈會元).


Choosing the edition

Although the added usability of extensively tagged texts more than offsets the time spent tagging them, it would be even better if the tagging could be accomplished with less effort. It also makes more sense to build on previous work than to do the same thing all over again, since there is too much other work waiting to be finished. When selecting an edition for the input of the Essentials of the Five Histories of the Transmission of the Lamp, compiled by Pu Ji in 1253, the Zen KnowledgeBase team therefore decided to use the edition prepared under the name of Su Yuantian and published by Zhonghua Shuju. Though the overall quality of that edition is not beyond doubt, we chose it because it contains printed markers (such as lines next to names of persons and distinct opening and closing quotes) that would come in handy for tagging.

Preparation for input

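I first sat down to prepare a set of symbols that could later be used as the basis for automated tagging. In the printed edition, this information is contained, for example, in the straight lines printed next to names, the wavy lines next to book titles, the distinct opening and closing quotes, and the comments set in smaller type.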

I then typed in a couple of pages of the text using a set of easy-to-remember symbols to show their usage, printed them out, copied them beside the corresponding printed pages, and wrote detailed instructions for our input crew in Shanghai. In the letter enclosed with this example, I requested a sample of the input so that we could avoid misunderstandings and test the automated tagging. That is easier than having all three volumes input and only then finding out that there were problems. Since the crew had never done anything like this and was accustomed to straight text input, I was quite surprised when the sample arrived and proved to be exactly as I had wanted it.

First tagging stage

After receiving the full data set, I proceeded to convert the keyboarding symbols into TEI-conformant SGML tags. Most of this could be done by program, though not with total consistency. At this first stage, all straight lines of the printed version (marking various kinds of names) were marked up with <name>, without any further specification. Book titles, indicated by a wavy line in modern Chinese typesetting conventions, were easily mapped to the tag <title>, and comments printed in smaller type to the <note> tag with the additional attribute type="inline". The footnotes (which were mostly notes on textual variations among the three collated versions) were also marked with the <note> tag; later, they were manually changed to the appropriate tag sequence for textual variants, consisting of <app>, <lem> and <rdg>.
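The symbol-to-tag conversion described above can be sketched as a small set of substitution rules. This is only an illustration: the article does not list the actual keyboarding symbols, so the square brackets, braces and parentheses used here, as well as the function name tag_line, are assumptions made for the example.

```python
import re

# Hypothetical keyboarding symbols; the article does not say which were used.
RULES = [
    # [ ... ]  stands for a straight line in print: some kind of name
    (re.compile(r"\[([^\[\]]+)\]"), r"<name>\1</name>"),
    # { ... }  stands for a wavy line in print: a book title
    (re.compile(r"\{([^{}]+)\}"), r"<title>\1</title>"),
    # ( ... )  stands for a comment printed in smaller type
    (re.compile(r"\(([^()]+)\)"), r'<note type="inline">\1</note>'),
]


def tag_line(line):
    """Replace each keyboarding symbol pair with the matching SGML tag pair."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


print(tag_line("[Bodhidharma] is quoted in {Record} (a later gloss)."))
```

Rules of this kind only handle symbol pairs that are properly closed and not nested, which is one reason a conversion like this cannot be expected to be fully consistent without manual checking.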

In this first conversion step, the fact that SGML-based markup can be validated proved to be of crucial importance in discovering numerous errors in the input. As markup was added incrementally, the file was validated after each increment to make sure that no new errors had crept in.
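The kind of check that catches such input errors can be illustrated with a minimal tag-balance test. It is a much weaker stand-in for validating against the TEI DTD with an SGML parser: it assumes that every end tag is written out (no SGML tag minimization), and the set of allowed tags is simply the ones mentioned in this article. The function name check_markup is hypothetical.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.]*)[^<>]*>")

# Assumption: only the tags named in this article are allowed.
ALLOWED = {"name", "title", "note", "app", "lem", "rdg"}


def check_markup(text):
    """Return a list of problems: unknown, mismatched or unclosed tags."""
    problems = []
    stack = []  # tags that have been opened but not yet closed
    for match in TAG.finditer(text):
        closing = match.group(1) == "/"
        tag = match.group(2)
        if tag not in ALLOWED:
            problems.append(f"unknown tag <{tag}> at offset {match.start()}")
        elif not closing:
            stack.append(tag)
        elif stack and stack[-1] == tag:
            stack.pop()
        else:
            problems.append(f"unexpected </{tag}> at offset {match.start()}")
    problems.extend(f"unclosed <{tag}>" for tag in stack)
    return problems


print(check_markup('<note type="inline">see <title>Record</note>'))
```

Running a check like this after each batch of substitutions keeps an error close to the step that introduced it, which is the point of validating incrementally.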



Second tagging stage

In the second stage of conversion, the typeset opening and closing quote symbols were replaced by markup that differentiates between quotes used for speech and quotes used for citations, so that any quotation can be easily identified. Every biography, which is already tagged as a structural division, was made addressable by assigning it an ID code; the same code was then also attached to every occurrence of "the master said" within that biography. This will make it possible to locate utterances of specific masters even where their names are not explicitly mentioned. Of course, these ID codes can be linked to other texts and databases we are developing, thus building the basis on which a complex network of references can be developed.
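The ID assignment can be sketched as follows. The element and attribute names (div, seg, who), the ID scheme and the English phrase standing in for the wording of the source text are all placeholders chosen for the example, not the project's actual markup.

```python
# English stand-in for the phrase as it appears in the source text.
SPEAKER_PHRASE = "the master said"


def tag_biographies(biographies):
    """Give each biography an ID and stamp that ID on every speaker phrase."""
    tagged = []
    for number, body in enumerate(biographies, start=1):
        bio_id = f"B{number:04d}"  # hypothetical ID scheme: B0001, B0002, ...
        body = body.replace(
            SPEAKER_PHRASE,
            f'<seg who="{bio_id}">{SPEAKER_PHRASE}</seg>',
        )
        tagged.append(f'<div id="{bio_id}">{body}</div>')
    return tagged


for division in tag_biographies(["A monk asked. the master said: Go.", "Another life."]):
    print(division)
```

Because the speaker phrase carries the ID of the enclosing biography, a search for one ID finds the utterances of that master even in passages where the name itself never appears.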



Third tagging stage

In one more step, which has not yet been completed, all the characters tagged as "name" will also be assigned a corresponding ID, so that a person can be identified and references located even if the text uses a different name than usual (for example "the founder" for Bodhidharma). This can partly be achieved by program but will certainly involve a considerable amount of revision.
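The part of this step that can be done by program amounts to a lookup in a table of known name forms. The sketch below assumes such a table exists; its entries, the key attribute and the function name resolve_names are invented for the example. Names that are not in the table are left untouched and collected for manual revision.

```python
import re

# Hypothetical authority table: every known form of a name maps to one person ID.
PERSON_IDS = {
    "Bodhidharma": "P0001",
    "the founder": "P0001",
}

NAME = re.compile(r"<name>([^<]+)</name>")


def resolve_names(text):
    """Add a key attribute to known names; collect the rest for manual revision."""
    unresolved = []

    def add_key(match):
        form = match.group(1)
        person_id = PERSON_IDS.get(form)
        if person_id is None:
            unresolved.append(form)
            return match.group(0)  # leave the tag as it is
        return f'<name key="{person_id}">{form}</name>'

    return NAME.sub(add_key, text), unresolved


tagged, todo = resolve_names("<name>the founder</name> met <name>Wu</name>")
print(tagged)
print(todo)
```

A fixed table cannot tell whom an epithet such as "the founder" refers to in a given context, so the output of a lookup like this would still need the revision mentioned above.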

The current version of the TEI-conformant tagged Wudeng huiyuan is included on the ZenBase CD1, although the markup is not yet complete. It may serve as an example that inspires other projects to plan for and incorporate a basic markup process from the initial stages of an input project. Using a modern punctuated version as the base for the input may be an exception, but there is always some information beyond the characters themselves in the structure and layout of a text. Even capturing paragraph divisions at the input level will later greatly help in the further processing of the text and the transformation of the data for different purposes.

Author: Christian Wittern
Last updated: 95/04/23