The Importance of Markup

A case study of the Foguangshan project

by Urs App


Abstract

This article (which is also published in the Electronic Bodhidharma No. 4) presents the Foguangshan project, a large input project of Chinese Buddhist texts. When in 1987 its first collection of books was printed, no electronic text was used. Similarly, the 8-volume Foguangshan dictionary of Buddhism published in 1989 was not input but produced in the old manner. For the 51-volume Chan text collection, published in December of 1994 (see detailed description below), computers were extensively used. However, there was no plan yet for a digital edition. Only close to the completion of work on the printed edition did this new aim come into play.

The problems (and opportunities) I observed during my December 1994 visit make the project a case study for the production of such large text corpora. It soon became clear that failing to mark up the electronic text appropriately for both a printed and an electronic edition results in enormous amounts of additional work, work that could well have been done at an earlier stage.


1. The Foguangshan project

Founded in 1977, the Foguangshan project had the initial aim of producing a large dictionary of Buddhism (8 vols., printed in 1989) and a new printed edition of large collections of Chinese Buddhist texts. These comprise the Agama collection (17 vols., printed in 1987), the Prajna literature collection (now being prepared), the Chan text collection (51 vols., printed in December of 1994), and as many as thirteen more collections covering the whole spectrum of Chinese Buddhist scriptures, including Huayan, Lotus, Mere-Representation, vinaya, Theravada, and Vajrayana collections, among others.

The project staff consists of several dozen people, mostly nuns. The editorial offices are located in the Foguangshan monastery near Kaohsiung in Southern Taiwan. They are spacious and well equipped with Chinese and Japanese reference materials, text collections, etc.

In the beginning, only printed editions were planned; but after 1990, computers were used to prepare the basis for the printed editions. It then suddenly became clear to the editorial staff that computers are more than convenient typewriters. The following is a case study of a change of perspective that is still ongoing.


2. The Foguangshan Chan text collection

This 51-volume collection of some of the most important Chan texts and related materials, published at the end of 1994, was entirely input on computer. It consists of four parts plus some appendices:
  1. Historical texts and compendia (23 texts)
    [Chinese titles not recoverable from this copy]
  2. Chan records (29 texts)
    [Chinese titles not recoverable from this copy]
  3. Chan treatises, poems, and various other materials (23 texts)
    [Chinese titles not recoverable from this copy]
  4. Miscellaneous materials (6 texts)
    [Chinese titles not recoverable from this copy]
  5. Appendices: reference materials; and indexes of important terms, names of persons, places, and temples.
This long list indicates that many people spent years inputting and correcting these text data until they could be printed. I will now outline the editorial process that led to the 51 printed volumes of the Chan text collection.


3. The editorial process

During work on the Chan text corpus, the editorial process was refined to a considerable degree. The following outline is based on a chart for internal use at the editing office and other materials that I received during my December 1994 visit. I have inserted indented remarks on what, in my opinion, should be done at these various stages. These remarks are less a critique of past procedures than suggestions for improvement that take electronic text into account.
  1. Decision on which printed text edition is to be used for input
  2. Copying of that text for input purposes
  3. Punctuation (two passes; handwritten comments on correction sheets)
    Remark: All such correction sheets should be safeguarded for later perusal. It also is appropriate to train personnel to use specific formats for their comments; such comments may even be input in the form of markup; this may make finding critical spots for analysis by experts much easier at a later correction stage.
  4. Manual computer input (using the PE2 editor under the ETEN system). As many characters are missing from the Big5 character set, it was decided to make certain very frequently encountered characters conform to the Big5 shapes by replacing variant forms in the text with their standard Big5 counterparts. Additionally, 1,714 characters were produced for use as "gaiji" (i.e. non-standard, self-made characters). Each member of the input staff got a list of characters to be normalized; I did not see such a list, but it is fair to assume that many decisions were made not at the beginning but as input went along. In the final printing stage, the total number of non-Big5 characters rose to 2,135.
    Remark: For electronic editions, it is important to keep as much information as possible about the original texts at the first stages. Electronic texts will survive much longer than the Big5 code, and in twenty years nobody will care what code problems we have now. Thus it would be a better strategy to normalize little or not at all at this stage; this can very well be done later (and tailored to the specific need, such as printing, concordance, Unicode display on screen, etc.). In view of included texts such as the Zutangji (Sodoshu) with lots of variant character forms, a total of little over 2,000 self-made characters implies a great degree of normalization. Such normalization is often arbitrary and, worse, not reversible. Furthermore, with electronic texts it can also be handled at the search engine level (fuzzy searches, normalizing filters, etc.).
    I would thus suggest that at all early stages one should record the original as faithfully as possible (in the form of electronic markup symbols and lots of self-made characters, if necessary), down to tagging different character sizes, line breaks, page breaks, etc. (See Christian Wittern's case study of such markup at the input level and its usefulness.) As a general rule: the more information one keeps and the better one organizes it at an early stage, the less work and the more possibilities one will have later on. It is appropriate to look beyond present contingencies and think in terms of centuries of lifespan rather than decades.
  5. Printout for first data correction pass
    Remark: In my experience, a printout whose format (vertical or horizontal) and character size are similar to the original improves proofreading efficiency and causes fewer headaches.
  6. Correction of the printout (two passes)
    Remark: Correctors should be instructed to fill out sheets on which they note all variant character forms that are not reproduced, along with any other problems that strike their eye. With large-volume input, such correction sheets come in very handy when, as invariably happens, one suddenly finds some problem at a later project stage or decides to change one's approach.
  7. Joint text verification (three passes). It did not become quite clear to me what this phase consisted of; it may have been group study of texts.
  8. Data correction of computer files
  9. First collation. During this phase, already input data containing similar stories was compared, and when needed the text was revised. The similar passages were referred to in pencil in the margins of the correction sheets. From this stage onward, the Foguangshan text may or may not be identical with the source text. None of these revisions were inserted as markup tags; they survive only, if at all, in the form of notes in the margins of correction sheets.
    Remark: Whenever one makes such corrections, one ought to input them in the form of a markup tag. In this way, one will later be able to generate two different versions of the text (the source reading and the revised reading), together with an automatic list of all corrections. In the absence of such markup, textual decisions are either lost in heaps of paper or have to be verified from scratch.
    Furthermore, one should keep original line breaks and formatting information in the form of tags; these can come in very handy in an electronic edition where one can display the original page and line information even if the text was heavily edited and is in a completely different format.
    Cross-references, too, are extremely useful for electronic editions; just push a button and you see related passages from various texts. As the Foguangshan people get ready for an electronic edition of their Chan corpus, they will have to input all such penciled cross-references -- if they still have all those correction sheets. Naturally, the thoroughness of such cross-referencing depends very much on the scholarship of the editors. In general, the little I saw at Foguangshan made me think that they did a pretty thorough job, though without realizing how useful this could be at a later stage. They simply wanted to produce a "good" text and peel out the "true" story.
  10. General arrangement of data. In this phase, the printed arrangement of the Foguangshan edition was prepared. In this edition, each question and answer goes on a separate line, and much modern punctuation is added to facilitate reading. Chapter headings were inserted, stories separated by spaces, etc. In the course of this work, all kinds of pencil marks were made on paper, for example to mark beginnings and endings of stories and separate units of text from others.
    Remark: If one keeps in mind that it is best to use the same data set both for printing and for electronic text publication, it is self-defeating to format the data just with the printer in mind. Rather, all such formatting should be done in the form of markup tags (best: SGML tags that conform to TEI). Some of these tags can be geared towards printing, others toward use of electronic data. For example, chapter headings of various levels should be consistently tagged; thus one will be able to simply order level 1 titles to be 18 point, level 2 titles 14 point, or whatever else is needed. It is very easy to remove certain kinds of tags for certain purposes or tell a program to ignore them. Formatting such data as one does in ordinary word processing and making the layout program-specific is a bad strategy that effectively prevents uses other than printing. It is far better to make the extra effort of inserting markup tags; this will save months or years of additional work if one decides to port the data to another medium (such as a CD-ROM).
  11. Three more proofreading passes, resulting in the basic data set
  12. Correction on computer of the basic data set
  13. Second collation phase and two additional proofreading passes
  14. Elimination of unneeded information and revision of the manuscript, followed by another two proofreading passes
  15. Printout of the entire data set
  16. Final collation and correction.
  17. Final proofs. I presume that it was at this stage that information used for the production of the indexes was marked up on paper: personal names, place names, etc. However, the proof printouts I saw contained much more handwritten markup than made its way into the printed edition.
    Remark: Usually, proofreading is done by two distinct groups of people. The first simply looks for differing characters and compares the printed output more or less mechanically with the original. The second group, however, reads for content and is thus also able to spot mistakes or questionable things in the original. Moreover, the second group can readily identify much of the inherent information of a text: names, places, temples, stories, references to time, quotes from texts, conversations, etc. At Foguangshan, some of this information was used for the printed product, but much remains just as pencil marks on proofing sheets. It is sensible to include as much of this information as possible in the form of markup tags; this will not only help automate the layout process but also provide the "hooks" for hypertext functions and automatically generated indexes. Once again, a shortsighted focus only on printing or some primitive database use leaves much of the valuable work already achieved unused.
  18. Final cleanup stage, insertion of illustrations
  19. Last proofreading pass
  20. Printing
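
The remarks on steps 9 and 17 can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the tag names `corr` and `name`, their attributes, and the sample sentence (with an invented misreading "Wey") are hypothetical and do not reproduce Foguangshan's data or the exact TEI syntax. The point is only that, once an editorial correction or a proper name is recorded as a tag rather than as a pencil mark, the source reading, the revised reading, the list of corrections, and a raw index can all be generated from the same file.

```python
import re

# Hypothetical tag conventions (assumptions, not Foguangshan's or TEI's syntax):
#   <corr orig="X">Y</corr>       editor replaced the source reading X with Y
#   <name type="person">N</name>  a tagged proper name
CORR = re.compile(r'<corr orig="([^"]*)">(.*?)</corr>')
NAME = re.compile(r'<name type="(\w+)">(.*?)</name>')
ANY_TAG = re.compile(r'<[^>]+>')


def source_text(tagged):
    """Return the text as it stands in the source edition (corrections undone)."""
    undone = CORR.sub(lambda m: m.group(1), tagged)
    return ANY_TAG.sub('', undone)


def revised_text(tagged):
    """Return the text with all editorial corrections applied."""
    applied = CORR.sub(lambda m: m.group(2), tagged)
    return ANY_TAG.sub('', applied)


def corrections(tagged):
    """List every correction as a (source reading, revised reading) pair."""
    return [(m.group(1), m.group(2)) for m in CORR.finditer(tagged)]


def name_index(tagged):
    """Group tagged names by type, as raw material for an index."""
    index = {}
    for m in NAME.finditer(tagged):
        index.setdefault(m.group(1), []).append(m.group(2))
    return index


# Invented sample line; "Wey" stands in for a misprint in the source edition.
sample = ('<name type="person">Zhaozhou</name> asked '
          '<name type="person">Nanquan</name>: "What is the '
          '<corr orig="Wey">Way</corr>?"')

print(source_text(sample))
print(revised_text(sample))
print(corrections(sample))
print(name_index(sample))
```

A real project would use an SGML parser and a documented tag set rather than regular expressions; the sketch only shows that nothing is lost and every editorial decision stays reversible.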


Conclusion

While I may have missed some of the details of the process, it seemed to me that the whole is rather typical of this transition period from printed to electronic text. It was amazing to see how much work had been achieved over the years by scholars -- and how much of it remained in handwriting, too. I felt that much of that effort could have been put to better use if it had been made part of the electronic text in the form of tags. For the electronic edition of the Foguangshan Chan text collection, much of what these scholars pencilled on paper will have to be input anyway -- if it still exists.
In December of 1994, Foguangshan employed a computer specialist to be in charge of the electronic edition. It was quite apparent to me, however, that the core work to be done for a good electronic edition must be achieved by scholars. This work deals with content, and while computer experts may be able to point out "how" things can be done, they are often unable to grasp "what" ought to be done, and "why." I was astonished to see how much of the needed work had in fact already been achieved, albeit unsystematically and in the "analog" form of pencil marks on proofing sheets.
I am looking forward to seeing the appropriate transfer of much of this to the digital medium, and I am sure that this very laborious exercise will change the working process at Foguangshan along some of the lines pointed out in my remarks. Tagging is the word. Projects of this gigantic scale make thorough planning even more desirable; and the production of both printed and electronic text from a single data matrix by using markup tags is not only needed but absolutely inevitable. The richer that data matrix, the longer it will live and the more forms it can take.
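
What producing print and electronic text from a single data matrix might look like can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any existing Foguangshan tool: the `head` tag with its `level` attribute, the 10-point body size, and the sample lines are hypothetical; only the 18-point and 14-point heading sizes echo the example given in the remark on step 10.

```python
import re

# Hypothetical heading tag: <head level="N">title</head>
HEAD = re.compile(r'<head level="(\d)">(.*?)</head>')

# Hypothetical print style sheet: heading level -> point size.
PRINT_SIZES = {1: 18, 2: 14}


def for_print(tagged, sizes=PRINT_SIZES, body_size=10):
    """Return (point size, text) pairs that a layout program could consume."""
    out = []
    for line in tagged.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        m = HEAD.fullmatch(line)
        if m:
            out.append((sizes.get(int(m.group(1)), body_size), m.group(2)))
        else:
            out.append((body_size, line))
    return out


def for_search(tagged):
    """Return plain text with all tags removed, e.g. for full-text search."""
    return re.sub(r'<[^>]+>', '', tagged)


# Invented sample data.
sample = ('<head level="1">Chan records</head>\n'
          '<head level="2">Record of Linji</head>\n'
          'The master ascended the hall.')

print(for_print(sample))
print(for_search(sample))
```

Changing the printed layout then means editing `PRINT_SIZES`, not the text itself, and the same tagged file continues to serve the electronic edition.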
Just as 78 rpm and long-play records faded away, audio cassettes and CDs will rather soon be surpassed, and a good record company knows that well: its master tapes will be used for many different media and generations. Similarly, a rich electronic text matrix will long survive Big5, JIS, and even Unicode; only "master data" with much tagged information will not be obsolete in ten or twenty years. Thus the only sensible long-term strategy is to avoid the narrow focus on the printing press and woefully inadequate character code sets. The small effort of using tags to preserve information (and make implicit information explicit) will soon be richly rewarded: much less work for the transfer to other media and uses, and above all greater scholarly value, flexibility of use, and thus longevity.
Author: Urs App
Last updated: 95/05/13